DataJelly Guard • JavaScript SEO Pillar Guide
How to Test What Google Actually Sees
Chrome showing a page is not enough. Lighthouse passing is not enough. HTTP 200 is not enough. User-visible pages can still be thin, canonicalized to the wrong URL, noindexed, missing crawlable links, or unreadable to crawlers.
Do not ask “does the page load?” Ask “what did the crawler receive?”
Need page-level monitoring after deploy? See Guard.
What “what Google sees” actually means
- Final resolved URL and redirect path
- HTTP status
- Raw HTML before JavaScript
- Rendered DOM after JavaScript
- Title, meta description, canonical, robots directives
- Visible text and internal links
- Structured data
- Resources required to render
- Search Console selected canonical and indexing status
The real problem
A route can pass shallow checks and still fail indexing
Passes:
- HTTP status: 200
- Browser screenshot: looks fine
- Deploy checks: passed
- Lighthouse: acceptable
- Backend logs: quiet
Googlebot raw fetch shows:
- HTML size: 4–8 KB
- Visible text: under 100 characters
- Empty root div
- No H1
- No internal links
- Missing product copy
- Canonical missing or points elsewhere
Result: Crawled — currently not indexed, duplicate/canonical confusion, poor AI crawler extraction, or thin content classification.
This is not an indexing mystery. It is a crawler-output problem.
The fastest way to test what Google sees
- Confirm final URL and redirects.
- Fetch raw HTML with a normal user agent.
- Fetch raw HTML with Googlebot user agent.
- Compare raw HTML to rendered DOM.
- Check title, H1, canonical, noindex, and internal links.
- Check failed JS/CSS/API requests.
- Check Search Console URL Inspection.
- Check AI crawler-readable output if AI visibility matters.
Testing mental model: four layers
A. Transport layer
Proves: URL resolution and status behavior. Does not prove: indexable content quality. Common failures: mixed host variants, redirect chains, duplicate 200 URLs.
B. Raw HTML layer
Proves: pre-JS crawler-visible text and directives. Does not prove: post-render UX. Common failures: empty app shell, missing H1, weak links, missing canonical.
C. Rendered DOM layer
Proves: what users see after JavaScript executes. Does not prove: the crawler got the same signals. Common failures: content only appears after client hydration.
D. Crawler interpretation layer
Proves: Google-selected canonical/indexing outcome. Does not prove: deploy-by-deploy stability. Common failures: Crawled — currently not indexed, duplicate clusters, delayed recovery.
Step 1: Check final URL and redirects
If URL resolution is noisy, every downstream SEO signal becomes noisy.
# macOS/Linux
curl -sIL https://example.com/page
curl -sIL http://example.com/page
curl -sIL https://www.example.com/page
curl -sIL "https://example.com/page?utm_source=test"
# Windows (single-line)
curl.exe -s -I -L https://example.com/page
Failure examples: http and https both 200, www and non-www both 200, trailing-slash variants both indexable, UTM URLs indexable, sitemap URLs redirecting, internal links using mixed variants.
Healthy state: one canonical host, one final URL, clean 301/308 redirects, sitemap points to final URLs.
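For a compact sweep, a hedged loop like the one below prints each variant's final URL, status, and redirect count; the write-out variables are standard curl features, and the URLs are placeholders.
# Sweep host/protocol variants; print final URL, status, redirect count
for u in \
  https://example.com/page \
  http://example.com/page \
  https://www.example.com/page \
  "https://example.com/page?utm_source=test"; do
  curl -s -o /dev/null -L -w "%{url_effective} %{http_code} redirects:%{num_redirects}\n" "$u"
done
Healthy output shows every variant converging on one final URL with at most one redirect hop.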
Step 2: Fetch raw HTML like a crawler
Raw HTML is what the crawler receives before JavaScript rendering. This is critical for SPAs, React/Vite/Lovable routes, link discovery, title/canonical/noindex extraction, and thin-content detection.
curl -s https://example.com/page -o raw.html
curl -s -A "Googlebot/2.1 (+http://www.google.com/bot.html)" https://example.com/page -o googlebot.html
curl -sI https://example.com/page
wc -c raw.html googlebot.html
Debug thresholds: raw HTML under 10 KB can be fine for minimal SSR pages, but risky for JavaScript-heavy content pages with little text. Visible text under 200 characters is a serious warning. Empty root div + script bundle dependency is a crawler visibility risk.
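A quick parity smoke test between the two fetches can reveal servers that return different markup to Googlebot, whether intentional dynamic rendering or accidental cloaking. This sketch assumes the raw.html and googlebot.html files produced above.
# Byte-level parity check between normal and Googlebot-UA fetches
cmp -s raw.html googlebot.html && echo "identical responses" || echo "responses differ; diff before concluding cloaking"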
Step 3: Compare raw HTML vs rendered DOM
| Version | What it represents | Healthy | Risk |
|---|---|---|---|
| Raw HTML | Crawler pre-JS response | Meaningful text, title, canonical, links | Empty app shell |
| Rendered DOM | Browser after JS | Full page copy and CTA | User-only content gap |
| Googlebot fetch | Crawler-style request | Same critical signals as users | Thin or missing content |
Compare: raw and rendered visible text length, word count, H1 presence, internal link count, title/canonical/noindex, and key CTA selector presence.
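One hedged way to capture the rendered DOM from the command line is headless Chrome's --dump-dom; the binary name varies by install (google-chrome, chromium, chrome.exe), and lynx is assumed for the rough tag-stripped text comparison.
# Capture the post-JS DOM with headless Chrome (binary name varies by install)
google-chrome --headless --dump-dom https://example.com/page > rendered.html
# Rough visible-text comparison (lynx strips tags; crude but fast)
wc -c raw.html rendered.html
lynx -dump -nolist raw.html | wc -m
lynx -dump -nolist rendered.html | wc -m
A large rendered-vs-raw gap is the signature of an empty app shell.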
Step 4: Check visible text and word thresholds
These are debugging thresholds, not ranking laws; a measurement sketch follows the lists below.
Healthy
- Visible text over 1,000 chars
- Word count over 300
- Clear H1 and body sections
- Internal links present
Risk
- 200–1,000 visible chars
- Mostly nav/footer text
- Low internal links
- Thin product copy
Broken
- Under 200 visible chars
- Empty root shell
- Loading state only
- No H1
- No useful text
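A minimal measurement sketch against these thresholds, assuming raw.html from Step 2 and lynx for tag stripping:
# Count visible characters and words in the pre-JS response
text=$(lynx -dump -nolist raw.html)
chars=$(printf '%s' "$text" | tr -d '[:space:]' | wc -c)
words=$(printf '%s' "$text" | wc -w)
echo "visible chars: $chars, words: $words"
[ "$chars" -lt 200 ] && echo "BROKEN: under 200 visible chars"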
Step 5: Check title, meta description, H1, and body copy
Why it matters: These are primary visibility signals for relevance and extraction.
Healthy: title/H1 match intent, unique description, clear body sections.
Failure signals: placeholder title, missing H1, generic or duplicate body copy.
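To pull these signals straight from the pre-JS response, simple greps are usually enough; the patterns below are naive and assume single-line tags.
# Extract title, meta description, and H1 from raw HTML (naive single-line patterns)
grep -io '<title>[^<]*</title>' raw.html
grep -io '<meta[^>]*name="description"[^>]*>' raw.html
grep -io '<h1[^>]*>[^<]*' raw.html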
Step 6: Check canonical and noindex
Why it matters: Canonical and robots directives decide index eligibility and consolidation.
Healthy: self-canonical on indexable pages, intentional robots directives.
Failure signals: canonical points elsewhere, accidental noindex, conflicting robots tags.
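Canonical and robots state can live in both markup and HTTP headers, so check both; the X-Robots-Tag header is the one most often forgotten.
# Canonical link and robots meta in the HTML
grep -io '<link[^>]*rel="canonical"[^>]*>' raw.html
grep -io '<meta[^>]*name="robots"[^>]*>' raw.html
# Robots directives can also arrive as an HTTP header
curl -sI https://example.com/page | grep -i '^x-robots-tag'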
Step 7: Check internal links and sitemap consistency
Why it matters: Crawl paths and sitemap alignment help discovery and canonical stability.
Healthy: crawlable internal links and sitemap URLs that resolve directly.
Failure signals: orphan pages, JS-only links, sitemap entries that redirect.
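A rough link count and sitemap spot check, assuming the sitemap lives at the conventional /sitemap.xml path:
# Count crawlable anchors in the pre-JS HTML (JS-only links will not appear here)
grep -o '<a [^>]*href="[^"]*"' raw.html | wc -l
# Confirm sitemap URLs resolve directly, with no redirect hop
curl -s https://example.com/sitemap.xml | grep -o '<loc>[^<]*</loc>' | \
  sed -e 's/<loc>//' -e 's|</loc>||' | \
  while read -r u; do curl -s -o /dev/null -w "%{http_code} $u\n" "$u"; done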
Step 8: Check structured data
Why it matters: Structured data improves machine interpretation and rich result eligibility.
Healthy: valid schema matching visible content.
Failure signals: broken JSON-LD, schema not present in crawler-visible output.
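A fast presence check: structured data should ship in the crawler-visible response, not only appear post-render. This assumes the raw.html and rendered.html files from earlier steps.
# JSON-LD block count in pre-JS vs rendered output
grep -c 'application/ld+json' raw.html
grep -c 'application/ld+json' rendered.html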
Step 9: Check JS/CSS/API resource failures
Why it matters: Render integrity depends on resource health.
Healthy: critical assets load, low console/resource error rate.
Failure signals: blocked JS, failed APIs, hydration crashes, missing CSS.
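For a crude resource-health pass, extract script URLs from the raw HTML and probe them; this sketch only handles absolute URLs, so relative paths need a base-URL prefix first.
# Probe absolute script URLs referenced by the page
grep -o 'src="https\?://[^"]*"' raw.html | sed -e 's/^src="//' -e 's/"$//' | \
  while read -r u; do curl -s -o /dev/null -w "%{http_code} $u\n" "$u"; done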
Step 10: Check Search Console URL Inspection
Why it matters: Search Console confirms Google’s chosen outcome for the inspected URL.
Healthy: selected canonical matches target URL and page is indexed.
Failure signals: Crawled — currently not indexed, Google-selected canonical mismatch.
Step 11: Check AI crawler-readable output
Why it matters: If AI visibility matters, extraction must be reliable beyond browser rendering.
Healthy: meaningful crawler-readable content and stable AI Markdown where available.
Failure signals: AI crawlers extract little text, miss key claims, or lose citation context.
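A hedged spot check: fetch with an AI-crawler-style user agent and confirm robots.txt does not block it. The GPTBot token below is illustrative; real AI crawlers vary in their user-agent strings and rendering behavior.
# Fetch with an AI-crawler-style user agent (illustrative UA string)
curl -s -A "GPTBot/1.1 (+https://openai.com/gptbot)" https://example.com/page -o aibot.html
lynx -dump -nolist aibot.html | wc -m
# Check robots.txt rules targeting AI crawlers
curl -s https://example.com/robots.txt | grep -i -B1 -A3 'gptbot'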
What browser screenshots miss
- Screenshots do not prove raw HTML quality.
- Screenshots do not prove Googlebot received the same response.
- Screenshots do not clearly expose canonical/noindex mistakes.
- Screenshots do not prove internal links are crawlable.
- Screenshots do not prove Search Console selected the right canonical.
- Screenshots do not prove AI crawlers can extract the page.
What Search Console can and cannot tell you
Search Console can show
- Inspected URL status
- Crawled/indexed state
- Selected canonical
- Crawl timing
- Rendered screenshot in URL Inspection
- Indexing exclusions
Search Console cannot reliably show
- Every deploy preserved content quality
- Every key page still has enough visible text
- CTAs/forms still function
- AI crawlers can read the page
- The exact deploy where regression started
How Guard helps
Guard gives page-level monitoring for production visibility signals and output regressions across scans:
- Page snapshot history and rendered output changes
- Raw HTML/rendered output signals where available
- Visible text length, word count, and HTML bytes
- Title/H1/canonical/noindex drift
- Resource errors, console errors, and DOM/content changes
- Core Web Vitals regressions
Guard does not replace Search Console. It catches page-output changes before they turn into indexing and visibility problems.
Common mistakes that create false confidence
- Relying only on Chrome.
- Relying only on Lighthouse.
- Assuming Google always renders JavaScript fully.
- Testing only the homepage.
- Ignoring raw HTML.
- Ignoring canonical/noindex directives.
- Ignoring internal links.
- Ignoring failed JS/API requests.
- Checking GSC too late.
- Treating Crawled — currently not indexed as random.
- Assuming AI crawlers behave like Googlebot.
- Using sitemap submission as a fix for thin output.
Full test checklist
URL / redirects
- Single canonical host
- 301/308 normalization
- No duplicate indexable variants
- No noisy parameters indexable
Raw HTML
- HTML byte size sanity
- Visible text presence
- Title/H1 in source
- Canonical/noindex in source
Rendered DOM
- Main copy present
- Key CTA present
- H1/sections stable
- Rendered output not loader-only
SEO signals
- Title and meta quality
- Canonical intent
- Robots directives
- Structured data validity
Links / crawl paths
- Internal links crawlable
- No orphan key pages
- Sitemap contains final URLs
- Anchor text relevance
Resources / JavaScript
- Critical JS/CSS loads
- API dependencies succeed
- Console errors monitored
- Hydration failures absent
Search Console
- URL Inspection state
- Google-selected canonical
- Indexing exclusion reason
- Recrawl after fixes
AI crawler output
- Main claims extractable
- Citation context present
- AI Markdown available where used
- No JS-only critical text
Performance / Core Web Vitals
- TTFB stable
- LCP regression watch
- Layout stability
- Error spikes after deploy
Conversion path
- Primary CTA works
- Form submit works
- Important buttons visible
- No blocker modal/errors
Command snippets for quick checks
# Redirect chain and final URL
curl -sIL https://example.com/page
# Raw HTML and Googlebot-style fetch
curl -s https://example.com/page -o raw.html
curl -s -A "Googlebot/2.1 (+http://www.google.com/bot.html)" https://example.com/page -o gbot.html
# Headers only
curl -sI https://example.com/page
# Compare byte size
wc -c raw.html gbot.html
# Quick signal extraction
rg -in '<title|rel="canonical"|name="robots"|<h1' raw.html gbot.html
# Windows one-line fetch
curl.exe -s -A "Googlebot/2.1 (+http://www.google.com/bot.html)" https://example.com/page -o gbot.html
FAQ
How do I see what Googlebot sees?
Check final URL resolution, fetch raw HTML, fetch with a Googlebot user agent, compare with rendered output, then confirm URL Inspection in Search Console.
Is curl enough to test Googlebot?
No. curl covers the raw-response layer. You still need rendered output checks and Search Console canonical/indexing state.
Why does my page show in Chrome but not index?
Chrome proves user rendering. It does not prove crawler-visible text, links, canonical integrity, or noindex state.
Does Google render JavaScript?
Yes, but not as a guarantee for every route and every deploy. Strong raw HTML still reduces crawler risk on JavaScript-heavy sites.
What does Crawled — currently not indexed mean?
Google fetched the page but did not add it to the index, often due to thin output, duplication, weak canonical signals, or low-value content.
How do I compare raw HTML and rendered DOM?
Compare visible text length, word count, H1, internal links, canonical/noindex, and key CTA selectors in both versions.
What visible text length is risky?
For important content pages, under 200 visible characters is a serious warning. 200–1,000 is risk. Over 1,000 is usually healthier.
Should I test with JavaScript disabled?
Yes. It is a fast stress test for whether critical copy, links, and directives exist before rendering.
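One hedged way to run that stress test from the command line; the Blink flag below is a Chromium switch and support can vary across versions.
# Dump the DOM with JavaScript execution disabled (Chromium flag; support may vary)
google-chrome --headless --blink-settings=scriptEnabled=false --dump-dom https://example.com/page > nojs.html
wc -c nojs.html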
Why do AI crawlers miss my content?
Many AI crawlers use lightweight extraction and may not execute complex client rendering reliably, so thin raw HTML reduces AI citation quality.
What should I check after every deploy?
Final URL, raw HTML bytes, visible text length, title/H1/canonical/noindex, resource failures, and Search Console state on key routes.
Does Guard replace Search Console?
No. Guard provides page-level monitoring and catches production output regressions before they become Search Console problems.
What pages should I test first?
Start with revenue pages, top landing pages, pages with recent ranking drops, and high-intent routes that depend on JavaScript rendering.