What AI Crawlers Actually Extract From Your Site
Your page returns 200. It renders fine in the browser. AI tools still ignore it. We see this all the time — and it's almost never the reason teams think it is.
The Real Problem
A typical failing page looks like this:
```
HTML size: 6 KB
Visible text: <100 words
Script tags: 20+
Root container: <div id="root"></div> (empty)
```

That page is effectively invisible to AI crawlers. It's not malformed. It's not blocked. It just contains no content the moment it leaves the server.
This breaks in production whenever teams assume "rendered UI = crawlable content." That assumption is wrong: the browser will eventually render the page, but ChatGPT, Claude, Perplexity, and Google's AI Overviews won't.
The hard rule:
If your content isn't in the initial HTML response, it doesn't exist to AI systems. There is no "later." There is no retry queue. The extraction happens once, on the bytes that come back from your server.
What AI Crawlers Actually Extract
AI crawlers do not behave like browsers. They do not wait for hydration. They do not reconstruct your component tree. They fetch HTML and extract a small set of structural elements:
- Headings (`h1`–`h3`)
- Paragraphs and inline text
- Lists (`ul`, `ol`)
- Links (anchor text + destination)
- Basic metadata (`title`, meta description, JSON-LD)
Everything else is ignored. If your HTML is <div id="root"></div> plus a script bundle, the crawler sees:
```
Headings: 0
Paragraphs: 0
Lists: 0
Structure: none
→ Failed extraction.
```

You can verify this in seconds — see the Quick Test below or jump to the HTTP Bot Comparison tool.
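As a rough illustration, this kind of HTML-only extraction can be sketched with Python's standard-library `html.parser`. This is an assumption about crawler behavior for demonstration purposes, not any vendor's actual pipeline:

```python
from html.parser import HTMLParser

class CrawlerView(HTMLParser):
    """Counts roughly what an HTML-only extractor can see (no JS execution)."""
    EXTRACTED = {"h1", "h2", "h3", "p", "li", "title"}

    def __init__(self):
        super().__init__()
        self.counts = {tag: 0 for tag in self.EXTRACTED}
        self.words = 0
        self._skip = False  # True while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in self.counts:
            self.counts[tag] += 1
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip:
            self.words += len(data.split())

# A typical script-shell page:
shell = ('<html><head><title>App</title></head>'
         '<body><div id="root"></div>'
         '<script src="/bundle.js"></script></body></html>')
view = CrawlerView()
view.feed(shell)
print(view.counts["h1"], view.counts["p"], view.counts["li"], view.words)
# → 0 0 0 1  (the only extractable word is the <title> text)
```

Run this against your own raw HTML and the failure is no longer abstract: zero headings, zero paragraphs, zero structure.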
What Most Guides Get Wrong
Most SEO guidance still assumes:
- JavaScript rendering will complete in time
- Crawlers will retry or wait for content
- "Eventually" the content appears and gets indexed
That is false in real systems. AI crawlers do not wait for client-side rendering. They do not retry broken pages. They do not execute complex app logic. They take the response, extract what's there, and move on.
"But Lighthouse passes"
Lighthouse runs a full headless browser with execution time. AI crawlers don't. A green Lighthouse score tells you nothing about what GPTBot or ClaudeBot actually receives. We've seen 100/100 Lighthouse scores on pages that ship 0 words to bots.
What We See in Production
Five failure modes show up in nearly every audit we run. Most sites have at least two of them.
1. Script Shell Pages
```
HTML size: 5 KB
Text content: 40 words
Scripts: React + vendor bundles (180 KB)
Result: Crawler extracts nothing. Page is ignored.
```

This is the default failure mode of every Vite/CRA SPA. Read more in Script Shell Pages.
2. Deep Link Failures
```
/pricing → initial response: 404 (or empty index.html)
→ SPA rewrites to correct route after JS loads
Result: Crawler records a 404 with zero content.
```

This happens constantly with Vite + client-side routers that lack proper rewrite rules. The bot never sees the rewritten page.
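A common mitigation is a server-side SPA fallback so deep links stop returning 404. A hypothetical nginx sketch (stacks vary; Netlify `_redirects` or `vercel.json` rewrites are the equivalents elsewhere):

```nginx
# SPA fallback: serve index.html for any path that isn't a real file.
# This fixes only the 404 status code; the bot still receives the
# empty shell, so the missing-content problem remains.
location / {
    try_files $uri $uri/ /index.html;
}
```

In other words, rewrite rules turn a "404 with zero content" into a "200 with zero content" — necessary, but not sufficient.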
3. Hydration Crashes
```
Page loads HTML shell
JS throws TypeError during hydration
UI never renders, error boundary swallows it
Result: Crawler extracts empty HTML.
```

Silent and lethal. We covered the full pattern in Hydration Crashes: The Silent Killer.
4. JS-Only Content
```
Pricing table: fetched via API on mount
Docs content: loaded dynamically from CMS
HTML contains: <Header />, <Footer />, empty <main />
Result: AI extraction misses your entire product.
```

The header and footer are extractable. The thing you sell isn't.
5. High Script-to-Content Ratio
```
HTML size: 120 KB
Scripts: ~90 KB
Text: ~5 KB
Ratio: 18:1 scripts to content
Result: Extraction quality drops. AI systems deprioritize the page.
```

Even when content exists, drowning it in script tags reduces extraction confidence. See also: Your HTML Is Only 4KB.
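The script-to-content ratio can be approximated from raw HTML. A minimal sketch using Python's standard library — it counts inline `<script>` bytes only, which is an assumption of this sketch (external bundles would need separate fetches):

```python
from html.parser import HTMLParser

class ScriptTextRatio(HTMLParser):
    """Bytes of inline <script> content vs visible text bytes."""
    def __init__(self):
        super().__init__()
        self.script_bytes = 0
        self.text_bytes = 0
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.script_bytes += len(data.encode())
        else:
            self.text_bytes += len(data.strip().encode())

# Contrived page: one real paragraph buried under a large inline script.
page = ('<body><p>Ten words of real content live in this paragraph here.</p>'
        '<script>var x=1;' + '0' * 500 + ';</script></body>')
r = ScriptTextRatio()
r.feed(page)
print(f"{r.script_bytes / max(r.text_bytes, 1):.1f}:1")  # → 9.4:1
```

A result above roughly 10:1 puts you in the failure band of the thresholds table below.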
Solutions: Prerender vs SSR vs Edge
Three real options. Each has tradeoffs. Pick based on what actually breaks in your stack — not on what's trendy.
Prerender
Static HTML snapshots generated ahead of time.
Works for:
Simple marketing pages with rare changes.
Fails when:
- Content changes frequently
- Cache invalidation breaks
- Dynamic routes exist
SSR
HTML generated per request on your server.
Works for:
Sites that can keep SSR stable in production.
Fails when:
- Routing mismatches
- Partial rendering
- Performance limits cut execution
Edge Proxy (DataJelly)
Fully rendered HTML snapshots served to bots, kept in sync with the live app.
Why it's different:
- Waits for router completion
- Waits for DOM mutation settling
- Waits for hydration completion
- Doesn't rely on app execution at crawl time
Deeper comparison: Prerender vs SSR vs Edge Rendering and how Edge works.
Thresholds That Matter
These are the numbers we use in production audits. They're not theory — they're the cutoffs where extraction starts failing.
| Signal | Healthy | Warning | Failure |
|---|---|---|---|
| Raw HTML size | > 20 KB | 10–20 KB | < 10 KB |
| Visible text | > 300 words | 200–300 | < 200 words |
| Headings | h1 + 3+ h2/h3 | h1 only | none |
| Script-to-text ratio | < 3:1 | 3:1–10:1 | > 10:1 |
| Bot vs browser word count | within 20% | 20–50% gap | > 50% gap |
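These bands can be encoded directly in an audit script. A sketch based on the table (the handling of exact boundary values like 20 KB, 300 words, and 3:1 is our assumption):

```python
def classify(html_kb: float, words: int, script_ratio: float) -> str:
    """Band a page by its worst signal: any single failing signal
    puts the whole page in the failure band."""
    if html_kb < 10 or words < 200 or script_ratio > 10:
        return "failure"
    if html_kb <= 20 or words <= 300 or script_ratio >= 3:
        return "warning"
    return "healthy"

print(classify(html_kb=6, words=90, script_ratio=18))       # → failure
print(classify(html_kb=120, words=2400, script_ratio=1.5))  # → healthy
```

Taking the worst signal rather than an average matters: a 120 KB page with 90 words of text is still a failing page.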
Practical Checklist
Run all five. If any fail, your site is leaking AI visibility right now.
- **Validate HTML quality.** HTML > 20 KB, visible text > 300 words, real headings present.
- **Detect script shell pages.** Empty root container + high script ratio + minimal text nodes = shell.
- **Test deep links.** `curl` /pricing, /docs, /blog/anything. Each must return 200 with real content.
- **Check rendering completeness.** Compare page source vs browser. Different content = extraction failure.
- **Monitor changes over time.** An HTML size drop > 50% or a text drop > 40% means a deploy broke rendering.
The Page Validator and HTTP Bot Comparison automate most of this. Guard tracks these as production signals so a bad deploy doesn't go unnoticed for two weeks.
Quick Test: What Do Bots Actually See?
Most people guess. Don't.
Run this test and look at the actual response your site returns to bots.
Fetch your page as Googlebot
Use your terminal:
```
curl -A "Googlebot" https://yourdomain.com
```

Look for:

- Real visible text (not just `<div id="root">`)
- Meaningful content in the HTML
- Page size (it should not be tiny)
Compare bot vs browser
Now test what a real browser gets:
```
curl -A "Mozilla/5.0" https://yourdomain.com
```

If these responses are different, Google is indexing a different page than your users see.
Stop guessing — measure it.
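The same comparison can be scripted. A sketch using Python's standard library — `yourdomain.com` is a placeholder, and the user-agent strings are simplified stand-ins for the real bot strings:

```python
import urllib.request

def fetch(url: str, user_agent: str) -> str:
    """Fetch raw HTML with a given User-Agent (no JS runs, as a crawler sees it)."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def word_gap(bot_html: str, browser_html: str) -> float:
    """Fractional gap in whitespace-separated word counts between two responses."""
    b, u = len(bot_html.split()), len(browser_html.split())
    return abs(b - u) / max(b, u, 1)

# Usage (yourdomain.com is a placeholder):
# gap = word_gap(fetch("https://yourdomain.com", "Googlebot"),
#                fetch("https://yourdomain.com", "Mozilla/5.0"))
# A gap over 0.5 lands in the failure band of the thresholds table.
```

Word-splitting raw HTML is a crude signal, but it is exactly the kind of crude signal that catches a 253-words-vs-13,547 gap instantly.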
Real example: 253 words vs 13,547
We see this constantly. Here's a real example from production: Googlebot saw 253 words and 2 KB of HTML. A browser saw 13,547 words and 77.5 KB. Same URL — completely different content.

If your HTML doesn't contain the content, Google doesn't either.
Compare Googlebot vs browser on your site → HTTP Debug Tool

Check for common failure signals
We see this all the time in production:
- HTML under ~1 KB → usually empty shell
- Visible text under ~200 characters → thin or missing content
- Missing <title> or <h1> → weak or broken page
- Large difference between bot vs browser HTML → rendering issue
Use the DataJelly Visibility Test (Recommended)
You can run this without touching curl. It shows you:
- Raw HTML returned to bots (Googlebot, Bing, GPTBot, etc.)
- Fully rendered browser version
- Side-by-side differences in word count, HTML size, links, and content
What this test tells you (no guessing)
After running this, you'll know:
- Whether your HTML is actually indexable
- Whether bots are seeing partial content
- Whether rendering is breaking in production
This is the difference between "I think SEO is set up" and "I know what Google is indexing."
If you don't understand why this happens, read: Why Google Can't See Your SPA
If this test fails
You have three real options:
- **SSR:** works if you can keep it stable in production.
- **Prerendering:** breaks with dynamic content and scale.
- **Edge Rendering:** reflects real production output without app changes.
If you do nothing, you will not rank consistently. Learn how Edge Rendering works →
This issue doesn't show up in Lighthouse. It shows up in rankings.
AI crawlers do not interpret your app. They extract text from HTML.
If your content is not present in the initial HTML response: it will not be extracted, it will not be understood, and it will not be cited. You either serve structured content immediately, or you don't exist to AI systems. This is not an SEO nuance. It's a hard failure mode.
How DataJelly Fixes This at the Visibility Layer
DataJelly Edge serves complete HTML snapshots to bots — generated only after router completion, DOM settling, and hydration finish. AI Markdown gives crawlers a clean, structured extraction format with no UI noise. Works with React, Vite, and Lovable SPAs. No app rewrite required.
Instead of hoping crawlers execute your app, you control exactly what they extract.