How AI Crawlers Actually Read Your Website

The Real Problem

Browser view: 3,000–8,000 words, full UI
Raw HTML response: 4–12 KB, mostly <script> tags
AI crawler view: almost no usable content

We see this constantly on React, Vite, and Lovable builds. The browser shows a fully interactive page with thousands of words of content. The raw HTML response — what AI crawlers actually receive — contains almost nothing.

This is not a minor SEO issue. This is missing HTML.

If your <body> doesn't contain real text in the first response, AI systems ignore it. Not "eventually index it." Not "figure it out later." Ignore it.

What's Actually Happening

AI crawlers behave like fast HTTP clients, not browsers. They don't open Chrome. They don't wait for your React app to hydrate. They don't call your APIs.

Here's what they actually do:

1. Fetch HTML: a single HTTP GET request
2. Extract visible text + structure: headings, paragraphs, lists, links
3. Convert to internal format: often Markdown-like for downstream processing
4. Store embeddings: for retrieval and citation in AI responses

Nothing in that pipeline executes JavaScript, so anything your app renders after the first response never enters it.
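Steps 2 and 3 can be approximated in a few lines of Python. This is a minimal sketch using only the standard library, not the extraction code of any particular AI crawler; the class name `MarkdownishExtractor` and the function `extract` are made up for illustration, and step 4 (embeddings) is left out.

```python
from html.parser import HTMLParser


class MarkdownishExtractor(HTMLParser):
    """Collects headings, paragraphs, list items and links from static HTML."""

    # Hypothetical mapping from block tags to Markdown-like prefixes.
    PREFIX = {f"h{n}": "#" * n + " " for n in range(1, 7)}
    PREFIX.update({"p": "", "li": "- "})
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.lines = []   # finished Markdown-like lines
        self.links = []   # href values found in <a> tags
        self._tag = None  # block tag currently being collected
        self._buf = []    # text fragments of the current block
        self._skip = 0    # nesting depth inside skipped elements

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip += 1
        elif tag in self.PREFIX:
            self._tag, self._buf = tag, []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip:
            self._skip -= 1
        elif tag == self._tag:
            text = " ".join("".join(self._buf).split())
            if text:
                self.lines.append(self.PREFIX[tag] + text)
            self._tag = None

    def handle_data(self, data):
        # Only text inside a tracked block, outside script/style, is kept.
        if self._tag and not self._skip:
            self._buf.append(data)


def extract(html: str):
    """Return (markdown_like_text, links) for one raw HTML response."""
    parser = MarkdownishExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines), parser.links
```

Fed a typical SPA shell (an empty `<div id="root">` plus script tags), `extract` returns an empty string and no links, which is the "functionally empty" page described throughout this article.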

Concrete Example

Request your page with curl:

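curl -s https://yourdomain.com | wc -c
# prints the response size in bytes, e.g. about 7,000 (7 KB)

# Drop <script>/<style> blocks, strip the remaining tags, count the words left
curl -s https://yourdomain.com \
  | perl -0777 -pe 's/<(script|style)\b.*?<\/\1>//gis; s/<[^>]+>/ /g' \
  | wc -w
# rough count, e.g. ~20 words on a script-shell page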

That's the entire page to an AI crawler. After hydration, your browser shows 5,000+ words. The crawler never sees any of it.
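If you would rather not count words with regular expressions, the same measurement can be sketched with Python's built-in HTML parser. This is an approximation under stated assumptions: "visible" here means text outside `<script>`, `<style>`, `<noscript>`, `<template>` and `<title>`, and the names `VisibleText` and `visible_word_count` are illustrative, not part of any tool mentioned in this article.

```python
from html.parser import HTMLParser

# Elements whose text a reader never sees in the page body (an assumption).
HIDDEN = {"script", "style", "noscript", "template", "title"}


class VisibleText(HTMLParser):
    """Collects text that sits outside hidden elements."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._hidden_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN and self._hidden_depth:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if not self._hidden_depth:
            self.chunks.append(data)


def visible_word_count(html: str) -> int:
    """Count whitespace-separated words in the visible text of raw HTML."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks).split())
```

Save the output of `curl -s` to a file, read it, and pass the string to `visible_word_count`. A result near zero on a page that shows thousands of words in the browser is the gap this section describes.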

💡 This is the same fundamental gap we cover in Why Google Can't See Your SPA — but AI crawlers are even less forgiving because they almost never attempt JavaScript execution.

What Most Guides Get Wrong

Most SEO content still assumes things that are flatly wrong for AI crawlers:

Bots execute JavaScript → AI crawlers almost never do

Rendering eventually happens → AI crawlers have near-zero delay tolerance

Content gets picked up later → if it's not in the first HTML response, it doesn't exist

If you're reading a guide that says "Google renders JavaScript" and assumes the same applies to ChatGPT, Claude, or Perplexity — it's wrong. These systems optimize for fast extraction, not full rendering.

What Breaks in Production

These are not rare edge cases. We see these failures on production sites every week. They're standard failure patterns for JavaScript apps.

Script Shell Pages

HTML: 5–15 KB — almost entirely <script> tags
Visible text: under 50 words
This is the exact pattern Guard flags as script_shell_only

The AI crawler receives a page that is functionally empty. Zero indexable content.

Partial Hydration

Header renders server-side → visible
Main content injected via JS → invisible to crawlers
Page looks "fine" to humans — <h1> present but body text missing

The crawler captures an incomplete page. Your heading says "Pricing" but there's no pricing content.

Broken Deep Links

/pricing, /features, /docs all return the same shell HTML
Content loaded via client-side router — never present in initial response
Crawler sees: no pricing content, no product info, no links

JS Bundle Failure

One script fails (network timeout or CDN issue)
Browser retries → user sees the page eventually
Crawler gets broken render → zero content

Guard flags this as critical_bundle_failure. The page is effectively dead.

CDN / Bot Blocking

Cloudflare or other CDN returns 403 for non-browser user agents
Crawler never reaches your origin server
Result: zero crawlable content, zero visibility

This is surprisingly common. Your CDN's bot protection is actively blocking the systems you want to be visible to.

The result in every case: HTML under 10 KB, visible text under 100 words, zero internal links. That page is effectively dead to AI.

How AI Crawlers Differ from Search Engines

Behavior	Googlebot	AI Crawlers
JS execution	Sometimes (queued)	Almost never
Render delay tolerance	Seconds to minutes	Near zero
HTML dependency	Medium	Absolute
Output	Search index	Structured summaries & embeddings
Retry behavior	Will revisit	Usually one-shot

The key difference: AI crawlers optimize for fast extraction, not full rendering. If your content isn't in the initial HTML, it doesn't exist in their pipeline. For a deeper look at how different bots behave, see our Bots Guide.

What Content Formats Actually Work

AI systems consistently extract content from these formats. Everything else degrades.

1. Real HTML Text

500–1,000+ words in <body>
Semantic tags: <h1>, <p>, <ul>
Content present in HTML — not injected via JavaScript

2. Clean Structure

Headings properly nested (H1 → H2 → H3)
Lists instead of div soup
Links visible in HTML (not generated by JS event handlers)
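The structure rules above can be checked mechanically. The sketch below is one possible audit, assuming that a "skipped" heading is any jump of more than one level (for example H1 straight to H3) and that links with `href="#"` or `javascript:` targets count as JS-driven; `StructureAudit` and `audit_structure` are hypothetical names.

```python
from html.parser import HTMLParser


class StructureAudit(HTMLParser):
    """Records heading levels and real link targets in document order."""

    def __init__(self):
        super().__init__()
        self.levels = []  # e.g. [1, 2, 3] for h1, h2, h3
        self.hrefs = []   # link targets a crawler can follow without JS

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.append(int(tag[1]))
        elif tag == "a":
            href = dict(attrs).get("href")
            if href and not href.startswith(("#", "javascript:")):
                self.hrefs.append(href)


def audit_structure(html: str) -> dict:
    """Summarize heading nesting and crawlable links for one HTML document."""
    parser = StructureAudit()
    parser.feed(html)
    parser.close()
    pairs = zip(parser.levels, parser.levels[1:])
    skipped = [(prev, cur) for prev, cur in pairs if cur > prev + 1]
    return {
        "h1_count": parser.levels.count(1),
        "skipped_levels": skipped,
        "links": parser.hrefs,
    }
```

An empty `links` list on a page with visible navigation usually means the navigation is wired up with click handlers rather than `<a href>` elements.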

3. Markdown-Friendly Content

Internally, many AI pipelines convert HTML → Markdown before processing. If your content only appears after JavaScript runs, or your HTML lacks structure, it degrades heavily during this conversion.

This is exactly why DataJelly generates AI Markdown snapshots — clean, structured Markdown served directly to AI crawlers, reducing token usage by up to 91% while preserving content hierarchy.

Solutions Compared

Prerendering

Works if:

• Under 100 routes
• Content rarely changes

Breaks when:

• Dynamic pages
• Stale builds
• Route explosion

SSR

Works if:

• Server always returns full HTML
• Hydration doesn't break

Breaks when:

• Slow backend
• Partial renders
• Caching inconsistencies

Edge Rendering

What actually works:

• Fully rendered HTML at request time
• Structured Markdown for AI
• Zero app changes required
• No hydration dependency

This is exactly what DataJelly's edge proxy + snapshot system does:

HTML snapshots for search bots — fully rendered, real content
AI Markdown for AI crawlers — structured, token-efficient, citation-ready
Zero reliance on client-side rendering

For a deeper comparison, read Prerender vs SSR vs Edge Rendering.

Practical Checklist

Run these against your site. If any fail, AI crawlers are seeing a broken page.

1. Raw HTML size

curl your page and check total size

HTML > 20 KB → good

HTML < 10 KB → problem (likely empty shell)

2. Text density

Check word count in <body>

1,000+ words → safe

< 200 words → likely invisible to AI

3. Script ratio

Check what percentage of HTML is <script> tags

Content dominates HTML → good

70%+ <script> → broken for AI crawlers
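One way to put a number on this check is to measure how much of the response sits inside `<script>` elements. The sketch below counts characters rather than bytes and uses a regular expression instead of a full parser, so treat the result as an estimate; `script_ratio` and `is_script_heavy` are illustrative names, and the 0.70 default mirrors the 70% threshold above.

```python
import re

# Matches a whole <script ...>...</script> element, including inline code.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)


def script_ratio(html: str) -> float:
    """Fraction of the document's characters that belong to <script> elements."""
    if not html:
        return 0.0
    script_chars = sum(len(match.group(0)) for match in SCRIPT_RE.finditer(html))
    return script_chars / len(html)


def is_script_heavy(html: str, threshold: float = 0.70) -> bool:
    """True when script markup reaches the given share of the page."""
    return script_ratio(html) >= threshold
```

Externally loaded bundles only contribute their short `<script src=...>` tag, so a shell page can score below the threshold while still containing no content. Read this number together with the text density check.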

4. Deep link test

Test /pricing, /features, /docs individually

Each returns full HTML with real content → good

All return same root shell → client-side routing issue
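A simple way to spot the shared-shell case is to fetch each route and group responses that are identical. The following is a sketch under assumptions: `fetch_html` and `group_identical` are hypothetical helpers, the user agent string is a placeholder, and pages are compared after collapsing whitespace. Shells that embed a per-request nonce or timestamp will not hash identically, so an empty result does not prove the routes are fine.

```python
import hashlib
from collections import defaultdict
from urllib.request import Request, urlopen


def fetch_html(url: str, user_agent: str = "visibility-check/0.1", timeout: int = 10) -> str:
    """Fetch one URL with a single GET and return the decoded body."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def group_identical(pages: dict) -> list:
    """Return groups of routes whose HTML is identical after whitespace collapse."""
    groups = defaultdict(list)
    for route, html in pages.items():
        normalized = " ".join(html.split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[digest].append(route)
    return [sorted(routes) for routes in groups.values() if len(routes) > 1]
```

To use it, build the `pages` dictionary with something like `{route: fetch_html("https://yourdomain.com" + route) for route in ["/pricing", "/features", "/docs"]}`. Any group in the result is a set of routes that returned the same document.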

5. Bot simulation

Remove browser headers and request your page

Same content regardless of headers → good

Different response → you have bot blocking or cloaking
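The comparison can be scripted so that status codes and response sizes sit side by side. This sketch makes several assumptions: the user agent values are simplified tokens rather than the full strings real crawlers send, the 20% size tolerance is arbitrary, and `fetch`, `compare_user_agents` and `differing` are illustrative names.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Simplified user agent tokens (assumption: real crawlers send longer strings).
USER_AGENTS = {
    "browser": "Mozilla/5.0",
    "googlebot": "Googlebot",
    "gptbot": "GPTBot",
}


def fetch(url, user_agent, timeout=10):
    """Return (status, body_bytes); 4xx/5xx responses are reported, not raised."""
    request = Request(url, headers={"User-Agent": user_agent})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status, response.read()
    except HTTPError as error:
        return error.code, error.read()


def compare_user_agents(url, fetch_fn=fetch):
    """Request the same URL once per user agent and record status and size."""
    results = {}
    for name, user_agent in USER_AGENTS.items():
        status, body = fetch_fn(url, user_agent)
        results[name] = {"status": status, "bytes": len(body)}
    return results


def differing(results, baseline="browser", tolerance=0.2):
    """Names whose status or size differs from the baseline beyond the tolerance."""
    base = results[baseline]
    flagged = []
    for name, result in results.items():
        if name == baseline:
            continue
        size_gap = abs(result["bytes"] - base["bytes"]) > tolerance * max(base["bytes"], 1)
        if result["status"] != base["status"] or size_gap:
            flagged.append(name)
    return flagged
```

A 403 for a bot token next to a 200 for the browser token points at CDN or firewall rules. A 200 with a much smaller body points at user-agent-based content differences.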

Want to automate this? The HTTP Debug Tool runs these checks for you.

Quick Test: What Do Bots Actually See? (~30 seconds)

Most people guess. Don't.

Run this test and look at the actual response your site returns to bots.

Fetch your page as Googlebot

Use your terminal:

curl -A "Googlebot" https://yourdomain.com

Look for:

Real visible text (not just <div id="root">)
Meaningful content in the HTML
Page size (should not be tiny)

Compare bot vs browser

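Now request the same URL with a browser user agent:

curl -A "Mozilla/5.0" https://yourdomain.com

curl does not run JavaScript, so this shows what your server sends to a browser before anything renders. If the two responses differ, Google is indexing a different page than the one your server sends to users. If both are equally thin, your content only exists after JavaScript runs, and bots never see it.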

Stop guessing — measure it.

Real example: 253 words vs 13,547

We see this constantly. Here's a real example from production: Googlebot saw 253 words and 2 KB of HTML. A browser saw 13,547 words and 77.5 KB. Same URL — completely different content.

[Figure: bot vs browser comparison on the same URL, 253 words for Googlebot vs 13,547 words for a rendered browser]

If your HTML doesn't contain the content, Google doesn't either.

Compare Googlebot vs browser on your site → HTTP Debug Tool

Check for common failure signals

We see this all the time in production:

HTML under ~1 KB → usually empty shell
Visible text under ~200 characters → thin or missing content
Missing <title> or <h1> → weak or broken page
Large difference between bot vs browser HTML → rendering issue
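These signals are easy to compute once you have the raw HTML. The sketch below applies the thresholds from the list above; the signal names are made up for illustration (they are not Guard's identifiers), and "large difference" is assumed to mean the browser HTML is more than five times the size of the bot HTML.

```python
from html.parser import HTMLParser

HIDDEN = {"script", "style", "noscript", "title"}


class _Scan(HTMLParser):
    """Collects visible text and notes whether <title> and <h1> contain text."""

    def __init__(self):
        super().__init__()
        self.visible = []      # text outside hidden elements
        self.nonempty = set()  # "title" / "h1" once they contain text
        self._open = []        # title/h1 tags currently open
        self._hidden = 0

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN:
            self._hidden += 1
        if tag in ("title", "h1"):
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in HIDDEN and self._hidden:
            self._hidden -= 1
        if tag in self._open:
            self._open.remove(tag)

    def handle_data(self, data):
        if data.strip():
            self.nonempty.update(self._open)
        if not self._hidden:
            self.visible.append(data)


def failure_signals(bot_html, browser_html=None):
    """Return the names of the failure signals present in the bot response."""
    scan = _Scan()
    scan.feed(bot_html)
    scan.close()
    text = " ".join(" ".join(scan.visible).split())
    signals = []
    if len(bot_html.encode("utf-8")) < 1024:
        signals.append("html_under_1kb")
    if len(text) < 200:
        signals.append("visible_text_under_200_chars")
    for tag in ("title", "h1"):
        if tag not in scan.nonempty:
            signals.append(f"missing_{tag}")
    # Assumed meaning of "large difference": browser HTML over 5x the bot HTML.
    if browser_html is not None and len(browser_html) > 5 * max(len(bot_html), 1):
        signals.append("bot_vs_browser_gap")
    return signals
```

An empty list means none of the four signals fired. It does not guarantee the page is healthy, only that it clears these minimum thresholds.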

Use the DataJelly Visibility Test (Recommended)

You can run this without touching curl. It shows you:

Raw HTML returned to bots (Googlebot, Bing, GPTBot, etc.)
Fully rendered browser version
Side-by-side differences in word count, HTML size, links, and content


What this test tells you (no guessing)

After running this, you'll know:

Whether your HTML is actually indexable
Whether bots are seeing partial content
Whether rendering is breaking in production

This is the difference between "I think SEO is set up" and "I know what Google is indexing."

If you don't understand why this happens, read: Why Google Can't See Your SPA

If this test fails

You have three real options:

SSR: works if you can keep it stable in production
Prerendering: breaks with dynamic content and scale
Edge Rendering: reflects real production output without app changes

If you do nothing, you will not rank consistently. Learn how Edge Rendering works →

This issue doesn't show up in Lighthouse. It shows up in rankings.


The Bottom Line

AI crawlers don't "figure it out later." They read exactly what you return in the first HTML response.

If your page is under 10 KB, under 100 words, and script-heavy — it does not exist to AI.

The fix is not tweaking SEO metadata. The fix is: return real HTML, return structured content, and stop depending on client-side rendering to do the heavy lifting.


FAQ


Do AI crawlers execute JavaScript?

Why does my site work in the browser but not for AI?

How can I verify what AI crawlers actually see?

Is SSR enough to fix AI visibility?

What is AI Markdown?

Why do SPAs fail for AI crawlers?

What's the fastest fix for AI visibility?
