What AI Crawlers Actually Extract From Your Site
Your page returns 200. It renders fine in the browser. AI tools still ignore it. We see this all the time — and it's almost never the reason teams think it is.
The Real Problem
A typical failing page looks like this:
```
HTML size: 6 KB
Visible text: <100 words
Script tags: 20+
Root container: <div id="root"></div> (empty)
```

That page is effectively invisible to AI crawlers. It's not malformed. It's not blocked. It just contains no content the moment it leaves the server.
This breaks in production whenever teams assume "rendered UI = crawlable content." That assumption is wrong: the browser will eventually render the page, but ChatGPT, Claude, Perplexity, and Google's AI Overviews won't.
The hard rule:
If your content isn't in the initial HTML response, it doesn't exist to AI systems. There is no "later." There is no retry queue. The extraction happens once, on the bytes that come back from your server.
What AI Crawlers Actually Extract
AI crawlers do not behave like browsers. They do not wait for hydration. They do not reconstruct your component tree. They fetch HTML and extract a small set of structural elements:
- Headings (`h1`–`h3`)
- Paragraphs and inline text
- Lists (`ul`, `ol`)
- Links (anchor text + destination)
- Basic metadata (`title`, meta description, JSON-LD)
Everything else is ignored. If your HTML is <div id="root"></div> plus a script bundle, the crawler sees:
```
Headings: 0
Paragraphs: 0
Lists: 0
Structure: none
→ Failed extraction.
```

You can verify this in seconds — see the Quick Test below or jump to the HTTP Bot Comparison tool.
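As a rough illustration, this kind of HTML-only extraction can be sketched with Python's standard-library `html.parser`. This is an assumption about crawler behavior for demonstration purposes, not any vendor's actual pipeline:

```python
from html.parser import HTMLParser

class CrawlerView(HTMLParser):
    """Counts roughly what an HTML-only extractor can see (no JS execution)."""
    EXTRACTED = {"h1", "h2", "h3", "p", "li", "title"}

    def __init__(self):
        super().__init__()
        self.counts = {tag: 0 for tag in self.EXTRACTED}
        self.words = 0
        self._skip = False  # True while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in self.counts:
            self.counts[tag] += 1
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip:
            self.words += len(data.split())

# A typical script-shell page:
shell = ('<html><head><title>App</title></head>'
         '<body><div id="root"></div>'
         '<script src="/bundle.js"></script></body></html>')
view = CrawlerView()
view.feed(shell)
print(view.counts["h1"], view.counts["p"], view.counts["li"], view.words)
# → 0 0 0 1  (the only extractable word is the <title> text)
```

Run this against your own raw HTML and the failure is no longer abstract: zero headings, zero paragraphs, zero structure.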
What Most Guides Get Wrong
Most SEO guidance still assumes:
- JavaScript rendering will complete in time
- Crawlers will retry or wait for content
- "Eventually" the content appears and gets indexed
That is false in real systems. AI crawlers do not wait for client-side rendering. They do not retry broken pages. They do not execute complex app logic. They take the response, extract what's there, and move on.
"But Lighthouse passes"
Lighthouse runs a full headless browser with execution time. AI crawlers don't. A green Lighthouse score tells you nothing about what GPTBot or ClaudeBot actually receives. We've seen 100/100 Lighthouse scores on pages that ship 0 words to bots.
What We See in Production
Five failure modes show up in nearly every audit we run. Most sites have at least two of them.
1. Script Shell Pages
```
HTML size: 5 KB
Text content: 40 words
Scripts: React + vendor bundles (180 KB)
Result: Crawler extracts nothing. Page is ignored.
```

This is the default failure mode of every Vite/CRA SPA. Read more in Script Shell Pages.
2. Deep Link Failures
```
/pricing → initial response: 404 (or empty index.html)
→ SPA rewrites to correct route after JS loads
Result: Crawler records a 404 with zero content.
```

This happens constantly with Vite + client-side routers that lack proper rewrite rules. The bot never sees the rewritten page.
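A common mitigation is a server-side SPA fallback so deep links stop returning 404. A hypothetical nginx sketch (stacks vary; Netlify `_redirects` or `vercel.json` rewrites are the equivalents elsewhere):

```nginx
# SPA fallback: serve index.html for any path that isn't a real file.
# This fixes only the 404 status code; the bot still receives the
# empty shell, so the missing-content problem remains.
location / {
    try_files $uri $uri/ /index.html;
}
```

In other words, rewrite rules turn a "404 with zero content" into a "200 with zero content" — necessary, but not sufficient.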
3. Hydration Crashes
```
Page loads HTML shell
JS throws TypeError during hydration
UI never renders, error boundary swallows it
Result: Crawler extracts empty HTML.
```

Silent and lethal. We covered the full pattern in Hydration Crashes: The Silent Killer.
4. JS-Only Content
```
Pricing table: fetched via API on mount
Docs content: loaded dynamically from CMS
HTML contains: <Header />, <Footer />, empty <main />
Result: AI extraction misses your entire product.
```

The header and footer are extractable. The thing you sell isn't.
5. High Script-to-Content Ratio
```
HTML size: 120 KB
Scripts: ~90 KB
Text: ~5 KB
Ratio: 18:1 scripts to content
Result: Extraction quality drops. AI systems deprioritize the page.
```

Even when content exists, drowning it in script tags reduces extraction confidence. See also: Your HTML Is Only 4KB.
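The script-to-content ratio can be approximated from raw HTML. A minimal sketch using Python's standard library — it counts inline `<script>` bytes only, which is an assumption of this sketch (external bundles would need separate fetches):

```python
from html.parser import HTMLParser

class ScriptTextRatio(HTMLParser):
    """Bytes of inline <script> content vs visible text bytes."""
    def __init__(self):
        super().__init__()
        self.script_bytes = 0
        self.text_bytes = 0
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.script_bytes += len(data.encode())
        else:
            self.text_bytes += len(data.strip().encode())

# Contrived page: one real paragraph buried under a large inline script.
page = ('<body><p>Ten words of real content live in this paragraph here.</p>'
        '<script>var x=1;' + '0' * 500 + ';</script></body>')
r = ScriptTextRatio()
r.feed(page)
print(f"{r.script_bytes / max(r.text_bytes, 1):.1f}:1")  # → 9.4:1
```

A result above roughly 10:1 puts you in the failure band of the thresholds table below.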
Solutions: Prerender vs SSR vs Edge
Three real options. Each has tradeoffs. Pick based on what actually breaks in your stack — not on what's trendy.
Prerender
Static HTML snapshots generated ahead of time.
Works for:
Simple marketing pages with rare changes.
Fails when:
- Content changes frequently
- Cache invalidation breaks
- Dynamic routes exist
SSR
HTML generated per request on your server.
Works for:
Sites that can keep SSR stable in production.
Fails when:
- Routing mismatches
- Partial rendering
- Performance limits cut execution
Edge Proxy (DataJelly)
Fully rendered HTML snapshots served to bots, kept in sync with the live app.
Why it's different:
- Waits for router completion
- Waits for DOM mutation settling
- Waits for hydration completion
- Doesn't rely on app execution at crawl time
Deeper comparison: Prerender vs SSR vs Edge Rendering and how Edge works.
Thresholds That Matter
These are the numbers we use in production audits. They're not theory — they're the cutoffs where extraction starts failing.
| Signal | Healthy | Warning | Failure |
|---|---|---|---|
| Raw HTML size | > 20 KB | 10–20 KB | < 10 KB |
| Visible text | > 300 words | 200–300 | < 200 words |
| Headings | h1 + 3+ h2/h3 | h1 only | none |
| Script-to-text ratio | < 3:1 | 3:1–10:1 | > 10:1 |
| Bot vs browser word count | within 20% | 20–50% gap | > 50% gap |
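These bands can be encoded directly in an audit script. A sketch based on the table (the handling of exact boundary values like 20 KB, 300 words, and 3:1 is our assumption):

```python
def classify(html_kb: float, words: int, script_ratio: float) -> str:
    """Band a page by its worst signal: any single failing signal
    puts the whole page in the failure band."""
    if html_kb < 10 or words < 200 or script_ratio > 10:
        return "failure"
    if html_kb <= 20 or words <= 300 or script_ratio >= 3:
        return "warning"
    return "healthy"

print(classify(html_kb=6, words=90, script_ratio=18))       # → failure
print(classify(html_kb=120, words=2400, script_ratio=1.5))  # → healthy
```

Taking the worst signal rather than an average matters: a 120 KB page with 90 words of text is still a failing page.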
Practical Checklist
Run all five. If any fail, your site is leaking AI visibility right now.
- **Validate HTML quality.** HTML > 20 KB, visible text > 300 words, real headings present.
- **Detect script shell pages.** Empty root container + high script ratio + minimal text nodes = shell.
- **Test deep links.** `curl` /pricing, /docs, /blog/anything. Each must return 200 with real content.
- **Check rendering completeness.** Compare page source vs browser. Different content = extraction failure.
- **Monitor changes over time.** An HTML size drop > 50% or a text drop > 40% means a deploy broke rendering.
The Page Validator and HTTP Bot Comparison automate most of this. Guard tracks these as production signals so a bad deploy doesn't go unnoticed for two weeks.
Quick Test: What Do Bots Actually See?
Most people guess. Don't.
Run this test and look at the actual response your site returns to bots.
Fetch your page as Googlebot
Use your terminal:
```
curl -A "Googlebot" https://yourdomain.com
```

Look for:

- Real visible text (not just `<div id="root">`)
- Meaningful content in the HTML
- Page size (it should not be tiny)
Compare bot vs browser
Now test what a real browser gets:
```
curl -A "Mozilla/5.0" https://yourdomain.com
```

If these responses are different, Google is indexing a different page than your users see.
Stop guessing — measure it.
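The same comparison can be scripted. A sketch using Python's standard library — `yourdomain.com` is a placeholder, and the user-agent strings are simplified stand-ins for the real bot strings:

```python
import urllib.request

def fetch(url: str, user_agent: str) -> str:
    """Fetch raw HTML with a given User-Agent (no JS runs, as a crawler sees it)."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def word_gap(bot_html: str, browser_html: str) -> float:
    """Fractional gap in whitespace-separated word counts between two responses."""
    b, u = len(bot_html.split()), len(browser_html.split())
    return abs(b - u) / max(b, u, 1)

# Usage (yourdomain.com is a placeholder):
# gap = word_gap(fetch("https://yourdomain.com", "Googlebot"),
#                fetch("https://yourdomain.com", "Mozilla/5.0"))
# A gap over 0.5 lands in the failure band of the thresholds table.
```

Word-splitting raw HTML is a crude signal, but it is exactly the kind of crude signal that catches a 253-words-vs-13,547 gap instantly.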
Real example: 253 words vs 13,547
We see this constantly. Here's a real example from production: Googlebot saw 253 words and 2 KB of HTML. A browser saw 13,547 words and 77.5 KB. Same URL — completely different content.

If your HTML doesn't contain the content, Google doesn't either.
Compare Googlebot vs browser on your site → HTTP Debug Tool

Check for common failure signals
We see this all the time in production:
- HTML under ~1 KB → usually empty shell
- Visible text under ~200 characters → thin or missing content
- Missing <title> or <h1> → weak or broken page
- Large difference between bot vs browser HTML → rendering issue
Use the DataJelly Visibility Test (Recommended)
You can run this without touching curl. It shows you:
- Raw HTML returned to bots (Googlebot, Bing, GPTBot, etc.)
- Fully rendered browser version
- Side-by-side differences in word count, HTML size, links, and content
What this test tells you (no guessing)
After running this, you'll know:
- Whether your HTML is actually indexable
- Whether bots are seeing partial content
- Whether rendering is breaking in production
This is the difference between "I think SEO is set up" and "I know what Google is indexing."
If you don't understand why this happens, read: Why Google Can't See Your SPA
If this test fails
You have three real options:
- **SSR:** works if you can keep it stable in production.
- **Prerendering:** breaks with dynamic content and scale.
- **Edge Rendering:** reflects real production output without app changes.
If you do nothing, you will not rank consistently. Learn how Edge Rendering works →
This issue doesn't show up in Lighthouse. It shows up in rankings.
AI crawlers do not interpret your app. They extract text from HTML.
If your content is not present in the initial HTML response: it will not be extracted, it will not be understood, and it will not be cited. You either serve structured content immediately, or you don't exist to AI systems. This is not an SEO nuance. It's a hard failure mode.
How DataJelly Fixes This at the Visibility Layer
DataJelly Edge serves complete HTML snapshots to bots — generated only after router completion, DOM settling, and hydration finish. AI Markdown gives crawlers a clean, structured extraction format with no UI noise. Works with React, Vite, and Lovable SPAs. No app rewrite required.
Instead of hoping crawlers execute your app, you control exactly what they extract.