DataJelly Guard • JavaScript SEO Pillar Guide
How to Test What Google Actually Sees
Chrome showing a page is not enough. Lighthouse passing is not enough. HTTP 200 is not enough. User-visible pages can still be thin, canonicalized to the wrong URL, noindexed, missing crawlable links, or unreadable to crawlers.
Do not ask “does the page load?” Ask “what did the crawler receive?”
Need page-level monitoring after deploy? See Guard.
What “what Google sees” actually means
- Final resolved URL and redirect path
- HTTP status
- Raw HTML before JavaScript
- Rendered DOM after JavaScript
- Title, meta description, canonical, robots directives
- Visible text and internal links
- Structured data
- Resources required to render
- Search Console selected canonical and indexing status
The real problem
A route can pass shallow checks and still fail indexing
Passes:
- HTTP status: 200
- Browser screenshot: looks fine
- Deploy checks: passed
- Lighthouse: acceptable
- Backend logs: quiet
Googlebot raw fetch shows:
- HTML size: 4–8 KB
- Visible text: under 100 characters
- Empty root div
- No H1
- No internal links
- Missing product copy
- Canonical missing or points elsewhere
Result: Crawled — currently not indexed, duplicate/canonical confusion, poor AI crawler extraction, or thin content classification.
This is not an indexing mystery. It is a crawler-output problem.
The fastest way to test what Google sees
- Confirm final URL and redirects.
- Fetch raw HTML with a normal user agent.
- Fetch raw HTML with Googlebot user agent.
- Compare raw HTML to rendered DOM.
- Check title, H1, canonical, noindex, and internal links.
- Check failed JS/CSS/API requests.
- Check Search Console URL Inspection.
- Check AI crawler-readable output if AI visibility matters.
Testing mental model: four layers
A. Transport layer
Proves: URL resolution and status behavior. Does not prove: indexable content quality. Common failures: mixed host variants, redirect chains, duplicate 200 URLs.
B. Raw HTML layer
Proves: pre-JS crawler-visible text and directives. Does not prove: post-render UX. Common failures: empty app shell, missing H1, weak links, missing canonical.
C. Rendered DOM layer
Proves: what users see after JavaScript executes. Does not prove: the crawler got the same signals. Common failures: content only appears after client hydration.
D. Crawler interpretation layer
Proves: Google-selected canonical/indexing outcome. Does not prove: deploy-by-deploy stability. Common failures: Crawled — currently not indexed, duplicate clusters, delayed recovery.
Step 1: Check final URL and redirects
If URL resolution is noisy, every downstream SEO signal becomes noisy.
# macOS/Linux
curl -sIL https://example.com/page
curl -sIL http://example.com/page
curl -sIL https://www.example.com/page
curl -sIL "https://example.com/page?utm_source=test"
# Windows (single-line)
curl.exe -s -I -L https://example.com/page
Failure examples: http and https both 200, www and non-www both 200, trailing-slash variants both indexable, UTM URLs indexable, sitemap URLs redirecting, internal links using mixed variants.
Healthy state: one canonical host, one final URL, clean 301/308 redirects, sitemap points to final URLs.
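For a compact sweep, a hedged loop like the one below prints each variant's final URL, status, and redirect count; the write-out variables are standard curl features, and the URLs are placeholders.
# Sweep host/protocol variants; print final URL, status, redirect count
for u in \
  https://example.com/page \
  http://example.com/page \
  https://www.example.com/page \
  "https://example.com/page?utm_source=test"; do
  curl -s -o /dev/null -L -w "%{url_effective} %{http_code} redirects:%{num_redirects}\n" "$u"
done
Healthy output shows every variant converging on one final URL with at most one redirect hop.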
Step 2: Fetch raw HTML like a crawler
Raw HTML is what the crawler receives before JavaScript rendering. This is critical for SPAs, React/Vite/Lovable routes, link discovery, title/canonical/noindex extraction, and thin-content detection.
curl -s https://example.com/page -o raw.html
curl -s -A "Googlebot/2.1 (+http://www.google.com/bot.html)" https://example.com/page -o googlebot.html
curl -sI https://example.com/page
wc -c raw.html googlebot.html
Debug thresholds: raw HTML under 10 KB can be fine for minimal SSR pages, but risky for JavaScript-heavy content pages with little text. Visible text under 200 characters is a serious warning. Empty root div + script bundle dependency is a crawler visibility risk.
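A quick parity smoke test between the two fetches can reveal servers that return different markup to Googlebot, whether intentional dynamic rendering or accidental cloaking. This sketch assumes the raw.html and googlebot.html files produced above.
# Byte-level parity check between normal and Googlebot-UA fetches
cmp -s raw.html googlebot.html && echo "identical responses" || echo "responses differ; diff before concluding cloaking"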
Step 3: Compare raw HTML vs rendered DOM
| Version | What it represents | Healthy | Risk |
|---|---|---|---|
| Raw HTML | Crawler pre-JS response | Meaningful text, title, canonical, links | Empty app shell |
| Rendered DOM | Browser after JS | Full page copy and CTA | User-only content gap |
| Googlebot fetch | Crawler-style request | Same critical signals as users | Thin or missing content |
Compare: raw and rendered visible text length, word count, H1 presence, internal link count, title/canonical/noindex, and key CTA selector presence.
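One hedged way to capture the rendered DOM from the command line is headless Chrome's --dump-dom; the binary name varies by install (google-chrome, chromium, chrome.exe), and lynx is assumed for the rough tag-stripped text comparison.
# Capture the post-JS DOM with headless Chrome (binary name varies by install)
google-chrome --headless --dump-dom https://example.com/page > rendered.html
# Rough visible-text comparison (lynx strips tags; crude but fast)
wc -c raw.html rendered.html
lynx -dump -nolist raw.html | wc -m
lynx -dump -nolist rendered.html | wc -m
A large rendered-vs-raw gap is the signature of an empty app shell.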
Step 4: Check visible text and word thresholds
These are debugging thresholds, not ranking laws; a measurement sketch follows the lists below.
Healthy
- Visible text over 1,000 chars
- Word count over 300
- Clear H1 and body sections
- Internal links present
Risk
- 200–1,000 visible chars
- Mostly nav/footer text
- Low internal links
- Thin product copy
Broken
- Under 200 visible chars
- Empty root shell
- Loading state only
- No H1
- No useful text
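A minimal measurement sketch against these thresholds, assuming raw.html from Step 2 and lynx for tag stripping:
# Count visible characters and words in the pre-JS response
text=$(lynx -dump -nolist raw.html)
chars=$(printf '%s' "$text" | tr -d '[:space:]' | wc -c)
words=$(printf '%s' "$text" | wc -w)
echo "visible chars: $chars, words: $words"
[ "$chars" -lt 200 ] && echo "BROKEN: under 200 visible chars"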
Step 5: Check title, meta description, H1, and body copy
Why it matters: These are primary visibility signals for relevance and extraction.
Healthy: title/H1 match intent, unique description, clear body sections.
Failure signals: placeholder title, missing H1, generic or duplicate body copy.
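To pull these signals straight from the pre-JS response, simple greps are usually enough; the patterns below are naive and assume single-line tags.
# Extract title, meta description, and H1 from raw HTML (naive single-line patterns)
grep -io '<title>[^<]*</title>' raw.html
grep -io '<meta[^>]*name="description"[^>]*>' raw.html
grep -io '<h1[^>]*>[^<]*' raw.html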
Step 6: Check canonical and noindex
Why it matters: Canonical and robots directives decide index eligibility and consolidation.
Healthy: self-canonical on indexable pages, intentional robots directives.
Failure signals: canonical points elsewhere, accidental noindex, conflicting robots tags.
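Canonical and robots state can live in both markup and HTTP headers, so check both; the X-Robots-Tag header is the one most often forgotten.
# Canonical link and robots meta in the HTML
grep -io '<link[^>]*rel="canonical"[^>]*>' raw.html
grep -io '<meta[^>]*name="robots"[^>]*>' raw.html
# Robots directives can also arrive as an HTTP header
curl -sI https://example.com/page | grep -i '^x-robots-tag'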
Step 7: Check internal links and sitemap consistency
Why it matters: Crawl paths and sitemap alignment help discovery and canonical stability.
Healthy: crawlable internal links and sitemap URLs that resolve directly.
Failure signals: orphan pages, JS-only links, sitemap entries that redirect.
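A rough link count and sitemap spot check, assuming the sitemap lives at the conventional /sitemap.xml path:
# Count crawlable anchors in the pre-JS HTML (JS-only links will not appear here)
grep -o '<a [^>]*href="[^"]*"' raw.html | wc -l
# Confirm sitemap URLs resolve directly, with no redirect hop
curl -s https://example.com/sitemap.xml | grep -o '<loc>[^<]*</loc>' | \
  sed -e 's/<loc>//' -e 's|</loc>||' | \
  while read -r u; do curl -s -o /dev/null -w "%{http_code} $u\n" "$u"; done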
Step 8: Check structured data
Why it matters: Structured data improves machine interpretation and rich result eligibility.
Healthy: valid schema matching visible content.
Failure signals: broken JSON-LD, schema not present in crawler-visible output.
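A fast presence check: structured data should ship in the crawler-visible response, not only appear post-render. This assumes the raw.html and rendered.html files from earlier steps.
# JSON-LD block count in pre-JS vs rendered output
grep -c 'application/ld+json' raw.html
grep -c 'application/ld+json' rendered.html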
Step 9: Check JS/CSS/API resource failures
Why it matters: Render integrity depends on resource health.
Healthy: critical assets load, low console/resource error rate.
Failure signals: blocked JS, failed APIs, hydration crashes, missing CSS.
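For a crude resource-health pass, extract script URLs from the raw HTML and probe them; this sketch only handles absolute URLs, so relative paths need a base-URL prefix first.
# Probe absolute script URLs referenced by the page
grep -o 'src="https\?://[^"]*"' raw.html | sed -e 's/^src="//' -e 's/"$//' | \
  while read -r u; do curl -s -o /dev/null -w "%{http_code} $u\n" "$u"; done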
Step 10: Check Search Console URL Inspection
Why it matters: Search Console confirms Google’s chosen outcome for the inspected URL.
Healthy: selected canonical matches target URL and page is indexed.
Failure signals: Crawled — currently not indexed, Google-selected canonical mismatch.
Step 11: Check AI crawler-readable output
Why it matters: If AI visibility matters, extraction must be reliable beyond browser rendering.
Healthy: meaningful crawler-readable content and stable AI Markdown where available.
Failure signals: AI crawlers extract little text, miss key claims, or lose citation context.
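A hedged spot check: fetch with an AI-crawler-style user agent and confirm robots.txt does not block it. The GPTBot token below is illustrative; real AI crawlers vary in their user-agent strings and rendering behavior.
# Fetch with an AI-crawler-style user agent (illustrative UA string)
curl -s -A "GPTBot/1.1 (+https://openai.com/gptbot)" https://example.com/page -o aibot.html
lynx -dump -nolist aibot.html | wc -m
# Check robots.txt rules targeting AI crawlers
curl -s https://example.com/robots.txt | grep -i -B1 -A3 'gptbot'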
What browser screenshots miss
- Screenshots do not prove raw HTML quality.
- Screenshots do not prove Googlebot received the same response.
- Screenshots do not clearly expose canonical/noindex mistakes.
- Screenshots do not prove internal links are crawlable.
- Screenshots do not prove Search Console selected the right canonical.
- Screenshots do not prove AI crawlers can extract the page.
What Search Console can and cannot tell you
Search Console can show
- Inspected URL status
- Crawled/indexed state
- Selected canonical
- Crawl timing
- Rendered screenshot in URL Inspection
- Indexing exclusions
Search Console cannot reliably show
- Every deploy preserved content quality
- Every key page still has enough visible text
- CTAs/forms still function
- AI crawlers can read the page
- The exact deploy where regression started
How Guard helps
Guard gives page-level monitoring for production visibility signals and output regressions across scans:
- Page snapshot history and rendered output changes
- Raw HTML/rendered output signals where available
- Visible text length, word count, and HTML bytes
- Title/H1/canonical/noindex drift
- Resource errors, console errors, and DOM/content changes
- Core Web Vitals regressions
Guard does not replace Search Console. It catches page-output changes before they turn into indexing and visibility problems.
Common mistakes that create false confidence
- Relying only on Chrome.
- Relying only on Lighthouse.
- Assuming Google always renders JavaScript fully.
- Testing only the homepage.
- Ignoring raw HTML.
- Ignoring canonical/noindex directives.
- Ignoring internal links.
- Ignoring failed JS/API requests.
- Checking GSC too late.
- Treating Crawled — currently not indexed as random.
- Assuming AI crawlers behave like Googlebot.
- Using sitemap submission as a fix for thin output.
Full test checklist
URL / redirects
- Single canonical host
- 301/308 normalization
- No duplicate indexable variants
- No noisy parameters indexable
Raw HTML
- HTML byte size sanity
- Visible text presence
- Title/H1 in source
- Canonical/noindex in source
Rendered DOM
- Main copy present
- Key CTA present
- H1/sections stable
- Rendered output not loader-only
SEO signals
- Title and meta quality
- Canonical intent
- Robots directives
- Structured data validity
Links / crawl paths
- Internal links crawlable
- No orphan key pages
- Sitemap contains final URLs
- Anchor text relevance
Resources / JavaScript
- Critical JS/CSS loads
- API dependencies succeed
- Console errors monitored
- Hydration failures absent
Search Console
- URL Inspection state
- Google-selected canonical
- Indexing exclusion reason
- Recrawl after fixes
AI crawler output
- Main claims extractable
- Citation context present
- AI Markdown available where used
- No JS-only critical text
Performance / Core Web Vitals
- TTFB stable
- LCP regression watch
- Layout stability
- Error spikes after deploy
Conversion path
- Primary CTA works
- Form submit works
- Important buttons visible
- No blocker modal/errors
Command snippets for quick checks
# Redirect chain and final URL
curl -sIL https://example.com/page
# Raw HTML and Googlebot-style fetch
curl -s https://example.com/page -o raw.html
curl -s -A "Googlebot/2.1 (+http://www.google.com/bot.html)" https://example.com/page -o gbot.html
# Headers only
curl -sI https://example.com/page
# Compare byte size
wc -c raw.html gbot.html
# Quick signal extraction
rg -in '<title|rel="canonical"|name="robots"|<h1' raw.html gbot.html
# Windows one-line fetch
curl.exe -s -A "Googlebot/2.1 (+http://www.google.com/bot.html)" https://example.com/page -o gbot.html
FAQ
How do I see what Googlebot sees?
Check final URL resolution, fetch raw HTML, fetch with a Googlebot user agent, compare with rendered output, then confirm URL Inspection in Search Console.
Is curl enough to test Googlebot?
No. curl covers the raw-response layer. You still need rendered output checks and Search Console canonical/indexing state.
Why does my page show in Chrome but not index?
Chrome proves user rendering. It does not prove crawler-visible text, links, canonical integrity, or noindex state.
Does Google render JavaScript?
Yes, but not as a guarantee for every route and every deploy. Strong raw HTML still reduces crawler risk on JavaScript-heavy sites.
What does Crawled — currently not indexed mean?
Google fetched the page but did not add it to the index, often due to thin output, duplication, weak canonical signals, or low-value content.
How do I compare raw HTML and rendered DOM?
Compare visible text length, word count, H1, internal links, canonical/noindex, and key CTA selectors in both versions.
What visible text length is risky?
For important content pages, under 200 visible characters is a serious warning. 200–1,000 is risk. Over 1,000 is usually healthier.
Should I test with JavaScript disabled?
Yes. It is a fast stress test for whether critical copy, links, and directives exist before rendering.
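One hedged way to run that stress test from the command line; the Blink flag below is a Chromium switch and support can vary across versions.
# Dump the DOM with JavaScript execution disabled (Chromium flag; support may vary)
google-chrome --headless --blink-settings=scriptEnabled=false --dump-dom https://example.com/page > nojs.html
wc -c nojs.html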
Why do AI crawlers miss my content?
Many AI crawlers use lightweight extraction and may not execute complex client rendering reliably, so thin raw HTML reduces AI citation quality.
What should I check after every deploy?
Final URL, raw HTML bytes, visible text length, title/H1/canonical/noindex, resource failures, and Search Console state on key routes.
Does Guard replace Search Console?
No. Guard provides page-level monitoring and catches production output regressions before they become Search Console problems.
What pages should I test first?
Start with revenue pages, top landing pages, pages with recent ranking drops, and high-intent routes that depend on JavaScript rendering.