Most GEO advice starts with content. Write authoritatively. Use structured data. Build topical depth.
That's not wrong. But there's a layer underneath all of it that most guides skip entirely: whether the bot can read your site at all.
When we were building CiteVista — our GEO & AEO analytics platform for tracking brand visibility across LLMs — we audited our own site against the same criteria we use to evaluate clients. We found that several technical decisions, made early and often without much thought, directly determined whether LLMs could parse our content, understand our entity, and eventually cite us.
This is what we fixed, why it matters, and what you should check before spending time on content optimization.
1. Server-Side Rendering: If the Content Lives in JavaScript, Bots Don't See It
This is the most common and most damaging oversight we see.
Many modern websites render content client-side — meaning the HTML that arrives at the browser is mostly empty, and JavaScript fills it in after the page loads. This works fine for users. It fails for crawlers.
LLM training crawlers, search engine bots, and AI retrieval agents typically do not execute JavaScript. They read the raw HTML response. If your content — your product descriptions, your blog posts, your FAQ answers — only appears after JavaScript runs, those bots see a blank page.
Google has explicitly stated in its developer documentation that while Googlebot can render JavaScript, crawling and rendering are separate processes — and rendering happens later, with no guarantee of timing. For LLM crawlers like GPTBot and ClaudeBot, there is no rendering step at all.
CiteVista is built on Next.js with server-side rendering enabled. Every page delivers its full content in the initial HTML response. When GPTBot or ClaudeBot hits our URL, the content is already there — no JavaScript execution required.
If you're using a client-side framework like React, Vue, or Angular without SSR configured, this is the first thing to fix. It doesn't matter how well-structured your schema is if the content it describes isn't in the HTML.
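To make the difference concrete, here is a minimal sketch (our own illustration, not CiteVista's code) of what a non-JavaScript-executing crawler extracts from a client-rendered shell versus a server-rendered page. The extraction function is a deliberately naive stand-in for what a bot's text parser does:

```typescript
// Naive text extraction, roughly what a non-JS crawler ends up with:
// strip <script> bodies, then all tags, then collapse whitespace.
function extractVisibleText(html: string): string {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Client-side rendered shell: content only appears after app.js runs.
const csrShell =
  `<html><body><div id="root"></div><script src="/app.js"></script></body></html>`;

// Server-side rendered page: content is in the initial HTML response.
const ssrPage =
  `<html><body><h1>What Is GEO?</h1><p>Generative Engine Optimization explained.</p></body></html>`;

console.log(extractVisibleText(csrShell)); // → "" (the bot sees nothing)
console.log(extractVisibleText(ssrPage));  // → "What Is GEO? Generative Engine Optimization explained."
```

The empty string is the whole point: to a crawler that never runs your JavaScript, a client-rendered page and a blank page are the same thing.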
2. Schema.org Markup: Telling Bots What Your Content Means
Bots can read your content once it's in the HTML. But reading and understanding are different things. Schema.org markup bridges that gap — it tells crawlers not just what your page contains, but what it means.
JSON-LD is the only format worth using in 2026.
There are three ways to implement schema: JSON-LD, Microdata, and RDFa. Microdata and RDFa embed structured data directly into HTML attributes, which means your content and your schema are entangled. Change one, you risk breaking the other.
Google's structured data documentation explicitly recommends JSON-LD as the preferred format. It lives in a separate <script type="application/ld+json"> block, independent of your HTML structure, easy to update, and reliably parsed by all major crawlers.
What CiteVista has implemented:
- Organization schema — site-wide, in layout.tsx. Defines CiteVista as an entity: name, URL, description, founders with LinkedIn URLs. This runs on every page.
- SoftwareApplication schema — also site-wide. Categorizes CiteVista as a BusinessApplication, includes pricing tiers. Helps LLMs understand what type of product we are.
- FAQPage schema — on the homepage only. Maps to the GEO, AEO, and Semantic Authority definitions already visible on the page. Enables Google rich results and gives LLMs structured Q&A content to cite.
- Article schema — dynamically generated on each insights post. Includes headline, description, publish date, authors with LinkedIn URLs, publisher, and canonical URL.
The sameAs property is underused and important.
Within your Organization schema, sameAs tells crawlers that your entity also exists at other URLs — your LinkedIn company page, your Product Hunt listing, your Crunchbase profile. We added our LinkedIn company page to CiteVista's Organization schema as soon as we created it.
Without sameAs, a bot sees your website as an isolated entity. With it, the entity becomes a node in a larger graph — connected to verified, indexable profiles on authoritative platforms. That's a meaningfully stronger citation signal.
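An Organization schema with sameAs can look like the sketch below (illustrative, not our production code; the LinkedIn URL is a hypothetical placeholder, not CiteVista's actual listing). Defining the object in TypeScript and serializing it keeps the JSON-LD block decoupled from the page markup:

```typescript
// Illustrative Organization schema with sameAs links.
// The sameAs URL below is a placeholder, not a real listing.
const organizationSchema = {
  "@context": "https://schema.org",
  "@type": "Organization",
  name: "CiteVista",
  url: "https://www.citevista.com",
  description:
    "GEO & AEO analytics platform for tracking brand visibility across LLMs.",
  sameAs: [
    "https://www.linkedin.com/company/citevista", // hypothetical URL
  ],
};

// Serialized into the page head as a JSON-LD script block,
// fully independent of the HTML structure around it.
const jsonLdTag =
  `<script type="application/ld+json">${JSON.stringify(organizationSchema)}</script>`;
```

Because the schema lives in one object, adding a new profile to the entity graph is a one-line change to the sameAs array.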
3. LLM-Specific Crawlers in robots.txt
Most robots.txt files are written for Googlebot. That made sense five years ago. It doesn't anymore.
LLM providers run their own crawlers, and they respect robots.txt directives. OpenAI introduced GPTBot in 2023 and published its user agent string specifically so site owners could control access. Anthropic's ClaudeBot and Perplexity's PerplexityBot followed the same pattern.
If you haven't explicitly addressed them in your robots.txt, you may be blocking them accidentally through overly broad Disallow rules — or simply missing the opportunity to signal that your content is open for LLM indexing.
CiteVista's robots.txt explicitly allows all major LLM crawlers:
```
User-agent: *
Allow: /

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GoogleOther
Allow: /

User-agent: Applebot-Extended
Allow: /

Sitemap: https://www.citevista.com/sitemap.xml
```
If you want to block specific sections — say, a private app subdirectory — you can add targeted Disallow rules per bot. But for most marketing and content pages, explicit Allow is the right call.
4. HTML Hierarchy: H1, H2, H3 as a Topic Signal
Clean heading structure isn't just good UX — it's how crawlers parse the topic architecture of your page.
An LLM crawler reading your page uses heading tags to understand what the page is about (H1), what subtopics it covers (H2), and what detail lives under each subtopic (H3). If your headings are inconsistent, decorative, or missing entirely, the crawler has to infer topic structure from paragraph text — which is lower confidence.
The rules we follow for CiteVista:
- One H1 per page, matching or closely reflecting the page title
- H2 for major sections — these become the primary topic signals
- H3 for subsections within each H2 topic
- No skipping levels (H1 → H3 with no H2 in between)
- No using heading tags purely for visual styling
This matters especially for insights articles, where the heading structure tells the crawler what the article covers before it reads a single body paragraph.
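These rules are mechanical enough to check automatically. Here's a small validator sketch (our own illustration) that takes a page's heading levels in document order and flags the two failure modes above:

```typescript
// Check a page's heading sequence against two rules:
// exactly one H1, and no skipped levels going deeper.
function validateHeadings(levels: number[]): string[] {
  const errors: string[] = [];
  if (levels.filter((l) => l === 1).length !== 1) {
    errors.push("page must have exactly one H1");
  }
  for (let i = 1; i < levels.length; i++) {
    if (levels[i] > levels[i - 1] + 1) {
      errors.push(`skipped level: H${levels[i - 1]} -> H${levels[i]}`);
    }
  }
  return errors;
}

console.log(validateHeadings([1, 2, 3, 3, 2, 3])); // → [] (clean structure)
console.log(validateHeadings([1, 3]));             // flags the H1 -> H3 skip
```

Note that moving back up (H3 to H2) is always fine; only downward jumps that skip a level break the topic tree.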
5. URL Structure: Readable by Humans and Bots
URLs carry semantic signal. A URL like citevista.com/insights/prompt-is-not-the-query tells a crawler what the page is about before it reads a single character of content. A URL like citevista.com/post?id=4821 tells it nothing.
Our URL pattern for insights: /insights/[descriptive-slug]. Short, lowercase, hyphens instead of underscores, no query parameters. This is consistent with Google's URL structure best practices and applies equally to LLM crawlers that use URL patterns as a lightweight relevance signal.
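A slug generator along these lines (a sketch, not our production code) enforces those rules consistently:

```typescript
// Turn an article title into a lowercase, hyphenated slug:
// no underscores, no query parameters, no stray punctuation.
function slugify(title: string): string {
  return title
    .toLowerCase()
    .normalize("NFD")                 // split accented characters
    .replace(/[\u0300-\u036f]/g, "")  // drop diacritic marks
    .replace(/[^a-z0-9\s-]/g, "")     // drop punctuation
    .trim()
    .replace(/\s+/g, "-")             // spaces → hyphens
    .replace(/-+/g, "-");             // collapse repeated hyphens
}

console.log(`/insights/${slugify("The Prompt Is Not the Query")}`);
// → /insights/the-prompt-is-not-the-query
```

Generating slugs from titles at publish time, rather than writing them by hand, keeps the pattern uniform across every article.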
6. Content-to-Code Ratio: Don't Bury Your Content
Every HTML page is a mix of content and code — markup, inline styles, script tags, tracking pixels. Crawlers work with a limited processing budget per page. If most of the page is code rather than content, crawlers spend that budget parsing structure instead of reading your actual text.
This is especially relevant for heavily JavaScript-dependent pages, pages with large inline SVGs, or pages where tracking and analytics scripts dominate the <head>.
The fix isn't to remove functionality — it's to move non-content elements where they belong: external stylesheets, deferred scripts, and server-side rendered markup that separates content from presentation cleanly.
CiteVista's pages load full content in the initial HTML with minimal inline code. External assets are loaded asynchronously after content is delivered.
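As a rough diagnostic (our own sketch; real audit tools count differently), you can estimate the share of a page that is readable text versus markup and scripts:

```typescript
// Rough content-to-code ratio: visible text characters
// divided by total HTML characters. Higher is better.
function contentRatio(html: string): number {
  const text = html
    .replace(/<(script|style)[\s\S]*?<\/\1>/gi, "") // drop script/style bodies
    .replace(/<[^>]+>/g, "")                        // drop tags
    .replace(/\s+/g, " ")
    .trim();
  return html.length === 0 ? 0 : text.length / html.length;
}

const lean =
  "<html><body><p>Plenty of readable content here for the crawler.</p></body></html>";
const heavy =
  `<html><head><script>${"x = 1;".repeat(200)}</script></head><body><p>Hi</p></body></html>`;

console.log(contentRatio(lean) > contentRatio(heavy)); // → true
```

There is no magic threshold, but if the ratio on a content page is close to zero, the bot is mostly parsing plumbing.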
7. Canonical URLs: Telling Crawlers Which Page Is the Authoritative One
If your content is accessible at multiple URLs — http vs https, www vs non-www, trailing slash vs no trailing slash — crawlers may treat these as separate pages with duplicate content. That splits crawl budget and can confuse entity attribution.
The canonical tag (<link rel="canonical" href="...">) explicitly declares which URL is the authoritative version. Every page on CiteVista has a canonical tag pointing to the www version with https. This is a one-time implementation with permanent value.
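The normalization the canonical tag encodes can be expressed as a function. This is an illustrative sketch of the mapping (in practice, CiteVista emits a static canonical tag per page rather than computing one):

```typescript
// Normalize any variant URL to the single canonical form:
// https, www subdomain, no trailing slash, no query string or fragment.
function canonicalize(rawUrl: string): string {
  const u = new URL(rawUrl);
  if (!u.hostname.startsWith("www.")) u.hostname = `www.${u.hostname}`;
  const path = u.pathname.replace(/\/+$/, ""); // strip trailing slashes
  return `https://${u.hostname}${path}`;
}

console.log(canonicalize("http://citevista.com/insights/?ref=x"));
// → https://www.citevista.com/insights
```

Every variant — http, non-www, trailing slash, query-string-decorated — collapses to one authoritative URL, which is exactly what the tag declares to crawlers.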
8. Meta Descriptions as Citation Snippets
Meta descriptions don't directly affect rankings. But they serve a specific function in the LLM context: they're the most likely candidate for what an LLM uses as a page summary when generating a citation.
When Perplexity or ChatGPT cites a page, it often pulls a summary of that page's content. The meta description is a structured, author-controlled signal for what that summary should say. If you don't write one, the LLM generates its own — which may or may not represent your content accurately.
Every CiteVista page has a manually written meta description that includes our entity name, the topic of the page, and a clear statement of what the reader will find. It reads like the first sentence of a citation, because that's what it may become.
9. Internal Linking: Topic Cluster Signals
LLMs don't evaluate pages in isolation. They evaluate sites — and internal link structure is one of the signals that tells them what a site is authoritative about.
If your homepage links to your articles, and your articles link to related articles, you're building a topic graph. A crawler following those links sees a coherent cluster of content on related topics and assigns higher topical authority to the domain. The absence of internal linking means every page is evaluated independently, with no amplifying signal from the rest of the site.
On CiteVista, each insights article links back to the insights index, and future articles will cross-link where topics overlap. This is a long-term compounding investment.
10. Open Graph Tags: Social and Crawler Signal
Open Graph meta tags — og:title, og:description, og:image, og:url — were designed for social media previews. But when your content is shared on platforms that LLMs heavily index — Reddit, LinkedIn, Twitter/X — the Open Graph data determines how that shared link is represented.
A well-written og:description on a Reddit post showing a preview of your article is a small but meaningful signal in the broader entity graph. Every CiteVista page includes complete Open Graph tags that match the page's canonical metadata.
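Deriving the Open Graph fields from one metadata object, rather than writing them twice, is the simplest way to keep them matched. A sketch (titles and descriptions here are illustrative; in a Next.js app the framework's metadata API renders the equivalent tags):

```typescript
// Page metadata defined once; Open Graph fields derived from it
// so social previews always match the canonical title/description.
const page = {
  title: "The Prompt Is Not the Query",
  description:
    "Why prompts and search queries are different retrieval signals.", // illustrative copy
  url: "https://www.citevista.com/insights/prompt-is-not-the-query",
};

const openGraphTags = {
  "og:title": page.title,
  "og:description": page.description,
  "og:url": page.url,
  "og:type": "article",
};
```

When the source object changes, the preview data changes with it — no drift between what the page says and what a shared link shows.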
The Order of Operations
None of the content, schema, or authority-building work matters if the foundation is broken. Based on what we've built and measured at CiteVista, the sequence is:
1. Make content server-side rendered and accessible in raw HTML
2. Implement clean URL structure and canonical tags
3. Add JSON-LD schema — Organization first, then page-specific types
4. Configure robots.txt explicitly for LLM crawlers
5. Clean up HTML hierarchy
6. Write meta descriptions as citation-ready summaries
7. Build internal linking structure
8. Add Open Graph tags
Get these right, and your content strategy has a foundation to stand on. Skip them, and even the best content may never reach the citation layer.
If you're working on GEO seriously and want to move from assumption to observation, explore what CiteVista tracks.

Berkay Can and Orhan Karcı are the co-founders of CiteVista, a GEO & AEO analytics platform that tracks and measures brand visibility across large language models.
