April 2026

What AI Crawlers Actually Extract From Your Site

Your page returns 200. It renders fine in the browser. AI tools still ignore it. We see this all the time — and it's almost never the reason teams think it is.


The Real Problem

A typical failing page looks like this:

HTML size:        6 KB
Visible text:     <100 words
Script tags:      20+
Root container:   <div id="root"></div>  (empty)

That page is effectively invisible to AI crawlers. It's not malformed. It's not blocked. It just contains no content the moment it leaves the server.

This breaks in production when teams assume that rendered UI equals crawlable content. It doesn't. The browser will eventually render the page. ChatGPT, Claude, Perplexity, and Google's AI Overviews won't.

The hard rule:

If your content isn't in the initial HTML response, it doesn't exist to AI systems. There is no "later." There is no retry queue. The extraction happens once, on the bytes that come back from your server.

What AI Crawlers Actually Extract

AI crawlers do not behave like browsers. They do not wait for hydration. They do not reconstruct your component tree. They fetch HTML and extract a small set of structural elements:

  • Headings (h1–h3)
  • Paragraphs and inline text
  • Lists (ul, ol)
  • Links (anchor text + destination)
  • Basic metadata (title, meta description, JSON-LD)

Everything else is ignored. If your HTML is <div id="root"></div> plus a script bundle, the crawler sees:

Headings:    0
Paragraphs:  0
Lists:       0
Structure:   none

→ Failed extraction.

You can verify this in seconds — see the Quick Test below or jump to the HTTP Bot Comparison tool.
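The extraction pass above can be simulated locally. Here is a minimal sketch; the sample HTML and the grep patterns are illustrative stand-ins, not any crawler's actual parser:

```shell
# Stand-in for a raw SPA shell response; in practice, fetch yours with
#   curl -A "GPTBot" https://yourdomain.com
html='<html><head><title>App</title></head><body><div id="root"></div><script src="/bundle.js"></script></body></html>'

# Count the structural elements a crawler can extract.
headings=$(printf '%s' "$html" | grep -o '<h[1-3][ >]' | wc -l | tr -d ' ')
paragraphs=$(printf '%s' "$html" | grep -o '<p[ >]' | wc -l | tr -d ' ')
lists=$(printf '%s' "$html" | grep -o '<[uo]l[ >]' | wc -l | tr -d ' ')

echo "headings=$headings paragraphs=$paragraphs lists=$lists"
# A shell page scores 0 / 0 / 0: failed extraction.
```

Run the same pipeline against your own homepage; anything close to all zeros means there is nothing for an AI system to work with.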

What Most Guides Get Wrong

Most SEO guidance still assumes:

  • JavaScript rendering will complete in time
  • Crawlers will retry or wait for content
  • "Eventually" the content appears and gets indexed

That is false in real systems. AI crawlers do not wait for client-side rendering. They do not retry broken pages. They do not execute complex app logic. They take the response, extract what's there, and move on.

"But Lighthouse passes"

Lighthouse runs a full headless browser with execution time. AI crawlers don't. A green Lighthouse score tells you nothing about what GPTBot or ClaudeBot actually receives. We've seen 100/100 Lighthouse scores on pages that ship 0 words to bots.

What We See in Production

Five failure modes show up in nearly every audit we run. Most sites have at least two of them.

1. Script Shell Pages

HTML size:     5 KB
Text content:  40 words
Scripts:       React + vendor bundles (180 KB)
Result:        Crawler extracts nothing. Page is ignored.

The default failure mode of every Vite/CRA SPA. Read more in Script Shell Pages.

2. Deep Link Failures

/pricing    →  initial response: 404 (or empty index.html)
            →  SPA rewrites to correct route after JS loads
Result:        Crawler records a 404 with zero content.

This happens constantly on Vite + client-side routers without proper rewrite rules. The bot never sees the rewritten page.
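What the bot records reduces to a simple rule: HTTP status plus extractable words, with no second chance for the client-side rewrite. A sketch of that rule (the function name and the 200-word cutoff are mine, chosen to match the failure threshold used later in this post):

```shell
# classify_route STATUS HTML -> what a crawler effectively records for a URL.
classify_route() {
  status=$1
  # Strip tags, count remaining words.
  words=$(printf '%s' "$2" | sed 's/<[^>]*>//g' | wc -w | tr -d ' ')
  if [ "$status" = "200" ] && [ "$words" -ge 200 ]; then
    echo "indexable"
  else
    echo "failed ($status, $words words)"
  fi
}

# /pricing before the SPA rewrite runs: a 404 wrapping an empty shell.
classify_route 404 '<div id="root"></div>'
# -> failed (404, 0 words)
```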

3. Hydration Crashes

Page loads HTML shell
JS throws TypeError during hydration
UI never renders, error boundary swallows it
Result:        Crawler extracts empty HTML.

Silent and lethal. We covered the full pattern in Hydration Crashes: The Silent Killer.

4. JS-Only Content

Pricing table:   fetched via API on mount
Docs content:    loaded dynamically from CMS
HTML contains:   <Header />, <Footer />, empty <main />
Result:          AI extraction misses your entire product.

The header and footer are extractable. The thing you sell isn't.

5. High Script-to-Content Ratio

HTML size:   120 KB
Scripts:     ~90 KB
Text:        ~5 KB
Ratio:       18:1 scripts to content
Result:      Extraction quality drops. AI systems deprioritize the page.

Even when content exists, drowning it in script tags reduces extraction confidence. See also: Your HTML Is Only 4KB.
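The ratio is cheap to measure. A rough sketch (the sample file is synthetic, and the sed passes are a crude approximation, not a real HTML parser; replace the heredoc with `curl -sA "GPTBot" https://yourdomain.com -o page.html` for a live check):

```shell
# Synthetic sample page with a script block drowning a thin text node.
cat > page.html <<'EOF'
<html><body>
<script>
var bundle = "pretend this is 90 KB of minified vendor code";
</script>
<p>Thin text.</p>
</body></html>
EOF

# Bytes inside <script> blocks vs bytes of visible text.
script_bytes=$(sed -n '/<script/,/<\/script>/p' page.html | wc -c | tr -d ' ')
text_bytes=$(sed '/<script/,/<\/script>/d' page.html | sed 's/<[^>]*>//g' | tr -d '[:space:]' | wc -c | tr -d ' ')

echo "script:text ratio: $(( script_bytes / text_bytes )):1"
```

Anything past roughly 10:1 lands in the failure band described in the thresholds section of this post.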

Solutions: Prerender vs SSR vs Edge

Three real options. Each has tradeoffs. Pick based on what actually breaks in your stack — not on what's trendy.

Prerender

Static HTML snapshots generated ahead of time.

Works for:

Simple marketing pages with rare changes.

Fails when:

  • Content changes frequently
  • Cache invalidation breaks
  • Dynamic routes exist

SSR

HTML generated per request on your server.

Works for:

Sites that can keep SSR stable in production.

Fails when:

  • Routing mismatches
  • Partial rendering
  • Performance limits cut execution

Edge Proxy (DataJelly)

Fully rendered HTML snapshots served to bots, kept in sync with the live app.

Why it's different:

  • Waits for router completion
  • Waits for DOM mutation settling
  • Waits for hydration completion
  • Doesn't rely on app execution at crawl time

Deeper comparison: Prerender vs SSR vs Edge Rendering and how Edge works.

Thresholds That Matter

These are the numbers we use in production audits. They're not theory — they're the cutoffs where extraction starts failing.

Signal                      Healthy          Warning       Failure
Raw HTML size               > 20 KB          10–20 KB      < 10 KB
Visible text                > 300 words      200–300       < 200 words
Headings                    h1 + 3+ h2/h3    h1 only       none
Script-to-text ratio        < 3:1            3:1–10:1      > 10:1
Bot vs browser word count   within 20%       20–50% gap    > 50% gap
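These cutoffs translate directly into a banding check. A sketch for the first row (the function name and byte arithmetic are mine, not a DataJelly API):

```shell
# html_size_band BYTES -> healthy | warning | failure, per the thresholds above.
html_size_band() {
  kb=$(( $1 / 1024 ))
  if [ "$kb" -gt 20 ]; then echo healthy
  elif [ "$kb" -ge 10 ]; then echo warning
  else echo failure
  fi
}

html_size_band 6144   # a 6 KB shell page lands in the failure band
# -> failure
```

The other rows follow the same shape: measure one number, bucket it, and treat any failure band as a blocking issue.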

Practical Checklist

Run all five. If any fail, your site is leaking AI visibility right now.

1. Validate HTML quality: HTML > 20 KB, visible text > 300 words, real headings present.

2. Detect script shell pages: empty root container + high script ratio + minimal text nodes = shell.

3. Test deep links: curl /pricing, /docs, /blog/anything. Must return 200 with real content.

4. Check rendering completeness: compare page source vs browser. Different content = extraction failure.

5. Monitor changes over time: HTML size drops > 50% or text drops > 40% = a deploy broke rendering.
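Check 5 reduces to a diff between two saved snapshots. A cron-job-style sketch (the helper, filenames, and sample contents are illustrative, not how Guard is implemented):

```shell
# size_drop_pct OLD NEW -> percent drop in bytes between two HTML snapshots.
size_drop_pct() {
  old=$(wc -c < "$1" | tr -d ' ')
  new=$(wc -c < "$2" | tr -d ' ')
  echo $(( (old - new) * 100 / old ))
}

# Simulate a deploy that replaced real content with an empty shell.
printf '%s' '<h1>Pricing</h1><p>Three plans, billed monthly, no surprises.</p>' > snap.prev.html
printf '%s' '<div id="root"></div>' > snap.curr.html

drop=$(size_drop_pct snap.prev.html snap.curr.html)
[ "$drop" -gt 50 ] && echo "ALERT: HTML shrank ${drop}% - rendering likely broke"
```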

The Page Validator and HTTP Bot Comparison automate most of this. Guard tracks these as production signals so a bad deploy doesn't go unnoticed for two weeks.

Quick Test: What Do Bots Actually See?

~30 seconds

Most people guess. Don't.

Run this test and look at the actual response your site returns to bots.

1. Fetch your page as Googlebot

Use your terminal:

curl -A "Googlebot" https://yourdomain.com

Look for:

  • Real visible text (not just <div id="root">)
  • Meaningful content in the HTML
  • Page size (should not be tiny)

2. Compare bot vs browser

Now test what a real browser gets:

curl -A "Mozilla/5.0" https://yourdomain.com

If these responses are different, Google is indexing a different page than your users see.

Stop guessing — measure it.
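The two curl responses above reduce to one number each. A sketch of the comparison (the `visible_words` helper and the sample strings are illustrative; in practice, pipe each real curl response through the same function):

```shell
# visible_words: count the words left after stripping tags from stdin.
visible_words() { sed 's/<[^>]*>//g' | wc -w | tr -d ' '; }

# Stand-ins for the two responses; live versions would be
#   curl -sA "Googlebot" https://yourdomain.com | visible_words
#   curl -sA "Mozilla/5.0" https://yourdomain.com | visible_words
bot=$(printf '%s' '<div id="root"></div>' | visible_words)
browser=$(printf '%s' '<h1>Pricing</h1><p>Three plans, billed monthly, no surprises.</p>' | visible_words)

echo "bot=$bot browser=$browser"
# A gap over 50% means bots index a different page than your users see.
```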

Real example: 253 words vs 13,547

We see this constantly. Here's a real example from production: Googlebot saw 253 words and 2 KB of HTML. A browser saw 13,547 words and 77.5 KB. Same URL — completely different content.

[Image: bot vs browser comparison, 253 words for Googlebot vs 13,547 words for a rendered browser on the same URL]

If your HTML doesn't contain the content, Google doesn't either.

Compare Googlebot vs browser on your site → HTTP Debug Tool

3. Check for common failure signals

We see this all the time in production:

  • HTML under ~1 KB → usually an empty shell
  • Visible text under ~200 characters → thin or missing content
  • Missing <title> or <h1> → weak or broken page
  • Large difference between bot and browser HTML → rendering issue

Use the DataJelly Visibility Test (Recommended)

You can run this without touching curl. It shows you:

  • Raw HTML returned to bots (Googlebot, Bing, GPTBot, etc.)
  • Fully rendered browser version
  • Side-by-side differences in word count, HTML size, links, and content

What this test tells you (no guessing)

After running this, you'll know:

  • Whether your HTML is actually indexable
  • Whether bots are seeing partial content
  • Whether rendering is breaking in production

This is the difference between "I think SEO is set up" and "I know what Google is indexing."

If you don't understand why this happens, read: Why Google Can't See Your SPA

If this test fails

You have three real options:

SSR

Works if you can keep it stable in production

Prerendering

Breaks with dynamic content and scale

Edge Rendering

Reflects real production output without app changes

If you do nothing, you will not rank consistently. Learn how Edge Rendering works →

This issue doesn't show up in Lighthouse. It shows up in rankings.


AI crawlers do not interpret your app. They extract text from HTML.

If your content is not present in the initial HTML response: it will not be extracted, it will not be understood, and it will not be cited. You either serve structured content immediately, or you don't exist to AI systems. This is not an SEO nuance. It's a hard failure mode.

How DataJelly Fixes This at the Visibility Layer

DataJelly Edge serves complete HTML snapshots to bots — generated only after router completion, DOM settling, and hydration finish. AI Markdown gives crawlers a clean, structured extraction format with no UI noise. Works with React, Vite, and Lovable SPAs. No app rewrite required.

Instead of hoping crawlers execute your app, you control exactly what they extract.


Related Diagnostic Tools

Visibility Test

Compare bot vs browser HTML side-by-side

Page Validator

Check bot-readiness and content extraction

HTTP Bot Comparison

Compare Googlebot vs browser responses

Site Crawler

Audit content quality across your site

Related Reading

Your HTML Is Only 4KB

Why empty HTML shells are the root failure mode for AI visibility.

Script Shell Pages

The default failure mode of every Vite/CRA SPA — and how to detect it.

Your Site Loads — But Google Sees Nothing

200 OK with empty HTML. The silent rendering failure no monitoring catches.

How to Debug SEO Issues in a React App

Curl commands, real thresholds, and the three fixes that actually work.

Hydration Crashes: The Silent Killer

Why your site returns 200 but every button is dead — and bots see nothing.

Prerender vs SSR vs Edge Rendering

What actually works for SEO — with real production data.
