Content is user-generated and unverified.

Structured Data's Critical Role in AI Search: A Comprehensive Research Report

Schema markup has officially crossed from SEO enhancement to AI requirement. In March 2025, Microsoft's Fabrice Canel confirmed at SMX Munich that "schema markup helps Microsoft's LLMs understand content"—the first explicit confirmation from a major search engine that structured data directly feeds AI systems. This report compiles research, official statements, and case studies demonstrating that structured data now serves as the semantic layer powering AI-generated answers across platforms.

The evidence is substantial: a Data.world benchmark reports that knowledge graph-grounded LLMs achieve 300% higher accuracy than approaches relying solely on unstructured data; Google's Knowledge Graph (containing 500 billion facts) relies heavily on schema.org markup; and in one controlled experiment, the page with well-implemented schema was the only one of three test pages to appear in an AI Overview. While Google officially states that no "special" schema is required for AI features, its representatives consistently emphasize that structured data helps systems "understand pages better" and "indirectly leads to better ranks."


Microsoft officially confirms schema markup feeds their LLMs

Microsoft has provided the clearest confirmation that structured data affects AI search systems. At SMX Munich in March 2025, Fabrice Canel, Principal Program Manager at Microsoft Bing, stated:

"Schema markup helps Microsoft's LLMs understand content." — Fabrice Canel, SMX Munich, March 2025

This statement, reported by Search Engine Land and confirmed on LinkedIn, represents the clearest public acknowledgment from any major search engine that structured data feeds directly into AI language models, including Bing Chat and Microsoft Copilot.

Microsoft's technical infrastructure

Microsoft's Prometheus model combines the Bing index with OpenAI's GPT models through a technique called "grounding." According to official documentation:

"Selecting the relevant internal queries and leveraging the respective Bing search results is a critical component of Prometheus since it provides relevant and fresh information to the model, enabling it to answer recent questions and reduce inaccuracies—this method is called grounding." — Microsoft Bing Blog, February 2023

Microsoft Research has also published extensively on GraphRAG (Graph Retrieval-Augmented Generation), which uses LLM-generated knowledge graphs to enhance answer quality. Their February 2024 blog post introducing the technique states:

"GraphRAG uses LLM-generated knowledge graphs to provide substantial improvements in question-and-answer performance when conducting document analysis of complex information." — Jonathan Larson & Steven Truitt, Microsoft Research, February 2024

The official GraphRAG GitHub repository (released July 2024) documents this approach: https://github.com/microsoft/graphrag
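
To make the idea of graph-based retrieval concrete, here is a minimal sketch in Python. It is not Microsoft's GraphRAG implementation: the triples are hand-written from facts stated in this report (in GraphRAG they would be extracted by an LLM), and the function names `build_graph` and `retrieve_facts` are hypothetical. It only shows how following edges from a seed entity collects related facts that could then be passed to a model as grounding context.

```python
from collections import defaultdict

# Hand-written (subject, predicate, object) triples; an assumption for illustration.
TRIPLES = [
    ("Schema.org", "foundedBy", "Google"),
    ("Schema.org", "foundedBy", "Bing"),
    ("Microsoft", "operates", "Bing"),
    ("Bing", "supports", "IndexNow"),
]


def build_graph(triples):
    """Index every triple under both its subject and its object."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((subject, predicate, obj))
        graph[obj].append((subject, predicate, obj))
    return graph


def retrieve_facts(graph, seed, hops=2):
    """Collect the triples reachable within `hops` edges of the seed entity."""
    seen_entities = {seed}
    frontier = {seed}
    facts = []
    for _ in range(hops):
        next_frontier = set()
        for entity in frontier:
            for triple in graph.get(entity, []):
                if triple not in facts:
                    facts.append(triple)
                for neighbor in (triple[0], triple[2]):
                    if neighbor not in seen_entities:
                        seen_entities.add(neighbor)
                        next_frontier.add(neighbor)
        frontier = next_frontier
    return facts


graph = build_graph(TRIPLES)
facts = retrieve_facts(graph, "IndexNow", hops=2)
for subject, predicate, obj in facts:
    print(subject, predicate, obj)
```

Starting from "IndexNow", two hops reach every fact that mentions Bing, but not the Schema.org/Google triple, which is three edges away. A plain keyword search for "IndexNow" would have matched only the first of those facts.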

Bing's official structured data position

Bing's webmaster documentation explicitly states: "Bing works hard to understand the content of a page and one of the clues that Bing uses is structured data." Bing was a founding member of Schema.org in 2011 alongside Google, Yahoo, and Yandex.

Canel has also emphasized the combination of structured data with IndexNow for AI optimization:

"Gen AIs value fresh content in particular, partly as a reference check of their LLM training data. Use the API at indexnow.org to push that information as it's published or updated." — Fabrice Canel, SMX Munich 2025

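As a rough illustration of the push model Canel describes, the sketch below builds an IndexNow submission in Python without sending it. The endpoint and JSON field names follow the public IndexNow protocol as generally documented; the host, key, and URL are hypothetical placeholders, and the function name `build_indexnow_request` is an assumption made for this example.

```python
import json
import urllib.request

INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"


def build_indexnow_request(host, key, urls):
    """Build (but do not send) an IndexNow POST request for changed URLs."""
    payload = {
        "host": host,
        "key": key,
        # The key file is assumed to be hosted at the site root as <key>.txt.
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": list(urls),
    }
    return urllib.request.Request(
        INDEXNOW_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


request = build_indexnow_request(
    host="www.example.com",             # hypothetical site
    key="hypothetical-key-123",         # hypothetical key
    urls=["https://www.example.com/new-article"],
)
# Sending is left to the caller, e.g. urllib.request.urlopen(request).
```

Keeping construction separate from sending makes the payload easy to inspect or log before anything goes over the network.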

Google confirms structured data helps AI understanding—but says no special markup required

Google's official position is more nuanced than Microsoft's. According to their December 2025 AI Features documentation:

"You don't need to create new machine readable files, AI text files, or markup to appear in these features. There's also no special schema.org structured data that you need to add." — Google Search Central, December 2025

However, Google representatives consistently emphasize that structured data helps their systems understand content, which logically extends to AI features.

John Mueller's statements on schema and LLMs

When asked directly whether schema markup helps LLMs understand entities, Mueller responded (via Reddit):

"This question will stick with us for the next year and longer, and the short answer is yes, no, and it depends." — John Mueller, Google Search Advocate

Mueller clarified that some features depend heavily on structured data (such as Shopping results, which use it for pricing, shipping, and availability), while in other cases it enriches results. He also confirmed that Google uses RAG and grounding for AI Overviews.

At Google Search Central Live Madrid (2025), Mueller explained the process: the user enters a question → Search finds relevant information → that information "grounds" the LLM → the LLM creates an answer with supporting links.
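
The four steps Mueller describes can be sketched as a toy pipeline. This is not Google's implementation: the word-overlap scoring is a stand-in for a real search index, the documents and URLs are invented, and the final LLM call is omitted, so the sketch stops at the grounded prompt that would be handed to a model.

```python
# Hypothetical mini "index"; a real system would query a search engine.
INDEX = [
    {"url": "https://example.com/a", "text": "JSON-LD is a format for structured data"},
    {"url": "https://example.com/b", "text": "Bread recipes need flour and water"},
]


def retrieve(query, index, top_k=2):
    """Rank documents by how many query words they share (toy retrieval)."""
    query_words = set(query.lower().split())
    scored = []
    for doc in index:
        overlap = len(query_words & set(doc["text"].lower().split()))
        if overlap:
            scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def build_grounded_prompt(query, documents):
    """Put the retrieved text in front of the question so the model can cite it."""
    sources = "\n".join(
        f"[{i}] {doc['text']} ({doc['url']})" for i, doc in enumerate(documents, 1)
    )
    return (
        "Answer using only these sources and cite them by number.\n"
        f"{sources}\n"
        f"Question: {query}"
    )


query = "what is JSON-LD structured data"
results = retrieve(query, INDEX)
prompt = build_grounded_prompt(query, results)
print(prompt)
```

Only the document that shares words with the question is retrieved, and its URL travels with the text into the prompt, which is what lets the generated answer carry a supporting link.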

On schema's future, Mueller stated clearly: "Google is not killing schema." He noted that Google regularly retires redundant markup types while keeping essential ones, emphasizing "evergreen structured data that communicates meaning."

Source: https://www.seroundtable.com/mueller-schema-helps-llms-google-40693.html

Gary Illyes' frequently-cited quote

The most viral quote about structured data's value comes from Gary Illyes at Pubcon 2017:

"[Schema markup] will help us understand your pages better, and indirectly, it leads to better ranks in some sense, because we can rank easier... Add structured data to your pages because during indexing, we will be able to better understand what your site is about." — Gary Illyes, Google Analyst, Pubcon 2017

Illyes also encouraged broad adoption: "Don't just think about the structured data that we documented on developers.google.com. Think about any schema.org schema that you could use on your pages."

Source: http://www.thesempost.com/adding-structured-data-helps-google-understand-rank-webpages-better/

Danny Sullivan's balanced perspective

The former Google Search Liaison offered measured guidance:

"It's not 'structured data and you win AI.' It simply supports how systems understand and present content, just as it already does across Search features." — Danny Sullivan, Search Off the Record Podcast, December 2025

Sullivan also coined "Good SEO is good GEO" (Generative Engine Optimization), suggesting AI optimization isn't fundamentally different from traditional SEO.

Source: https://searchengineland.com/google-danny-sullivan-seo-for-ai-is-still-seo-466368

Ryan Levering on computational efficiency

Perhaps most revealing, Google's Software Engineer for Structured Data stated:

"A lot of our systems run much better with structured data... it's computationally cheaper than extracting it." — Ryan Levering, Google Search Central Live New York, March 2025

This suggests that Google's systems, AI features included, prefer structured data when it is available because it is cheaper to process than content that must be extracted.

Source: https://www.searchenginejournal.com/factors-to-consider-when-implementing-schema-markup-at-scale/543935/

The Knowledge Graph connection

Google's Knowledge Graph, launched in 2012, contained 500 billion facts on 5 billion entities by May 2020. According to industry analysis, AI Overviews rely on this Knowledge Graph, which is "heavily populated by structured data pulled in from public websites."

Google Research has published extensively on retrieval-augmented approaches, including the foundational REALM paper (2020): https://arxiv.org/abs/2002.08909


OpenAI and Perplexity lack official documentation on structured data

Unlike Microsoft and Google, neither OpenAI nor Perplexity provides official documentation about how their systems process structured data from websites. This represents a significant gap in industry transparency.

What we know about OpenAI's crawlers

OpenAI operates three crawlers:

  • GPTBot — Training data collection
  • OAI-SearchBot — ChatGPT search results
  • ChatGPT-User — User-initiated browsing
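
Because the three crawlers announce distinct user agents, a site can treat them differently in robots.txt. The sketch below uses Python's standard `urllib.robotparser` to check a hypothetical policy that blocks training collection but permits search and user-initiated fetches. The robots.txt content and URL are invented for illustration, and a robots.txt rule only expresses a preference: whether a given crawler honors it is up to the crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block training collection, allow search and user browsing.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

access = {
    agent: parser.can_fetch(agent, "https://www.example.com/article")
    for agent in ("GPTBot", "OAI-SearchBot", "ChatGPT-User")
}
for agent, allowed in access.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```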

Critical technical limitation: According to Vercel and MERJ research cited by Daydream, OpenAI's crawlers do NOT execute JavaScript: "Their joint analysis tracked over half a billion GPTBot fetches and found zero evidence of JavaScript execution."

If that finding holds, OpenAI's crawlers can read only the JSON-LD that is already present in the server-rendered HTML; markup injected client-side by JavaScript never reaches them.
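
To illustrate what a fetcher that does not run JavaScript can see, the following sketch parses raw HTML with Python's standard `html.parser` and collects only the JSON-LD present in the markup as delivered. The sample page and the class name `JsonLdExtractor` are assumptions for this example; it is not a description of how any specific crawler is built.

```python
import json
from html.parser import HTMLParser

STATIC_HTML = """
<html><head>
<script type="application/ld+json">{"@context": "https://schema.org", "@type": "Article", "headline": "Example"}</script>
<script>
  // Client-side injection: never executed by a fetcher that only reads HTML.
  const s = document.createElement("script");
  s.type = "application/ld+json";
  s.textContent = JSON.stringify({"@type": "Product", "name": "Injected"});
  document.head.appendChild(s);
</script>
</head><body><p>Example</p></body></html>
"""


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.items.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


extractor = JsonLdExtractor()
extractor.feed(STATIC_HTML)
print(extractor.items)
```

Only the Article object is found. The Product object exists solely as JavaScript source text, so it would appear only after a browser executed the script.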

Community testing on the OpenAI Developer Forum (June 2025) shows mixed results. One user reported: "My own tests show that when a page includes schema markup, ChatGPT's answers include details (emails, trial lengths, certifications) that are only present in the JSON-LD—not visible in plain HTML."

However, official OpenAI documentation provides no confirmation.

Source: https://community.openai.com/t/does-chatgpt-s-browsing-tool-extract-json-ld-schema-along-with-visible-html/1281878

Perplexity's crawler controversy

Cloudflare's August 2025 investigation revealed concerning practices:

"We observed that Perplexity uses not only their declared user-agent, but also a generic browser intended to impersonate Google Chrome on macOS when their declared crawler was blocked... Both their declared and undeclared crawlers were attempting to access the content for scraping contrary to the web crawling norms." — Cloudflare Research, August 2025

Traffic observed: 20-25M daily requests from the declared crawler and 3-6M daily requests from the stealth crawler.

Perplexity's official documentation states PerplexityBot "is not used to crawl content for AI foundation models," but provides zero guidance on structured data usage.

Source: https://blog.cloudflare.com/perplexity-is-using-stealth-undeclared-crawlers-to-evade-website-no-crawl-directives/

Practical implications

Given the documentation gaps, industry best practices suggest implementing server-side rendered JSON-LD (since the AI crawlers studied so far do not appear to execute JavaScript) and focusing on schema types that research shows correlate with AI citations: Article, FAQPage, HowTo, Product, and Organization.
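
A minimal sketch of server-side rendering for one of these types, FAQPage, follows. The function name `render_jsonld_script` and the sample question are hypothetical; the point is only that the markup is serialized into the HTML response itself rather than added later by a script in the browser.

```python
import json


def render_jsonld_script(data):
    """Serialize a schema.org object into a <script> tag for the HTML response."""
    body = json.dumps(data, ensure_ascii=False, indent=2)
    # Escape "</" so text inside the JSON cannot close the script element early.
    body = body.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


faq = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [
        {
            "@type": "Question",
            "name": "What is JSON-LD?",
            "acceptedAnswer": {
                "@type": "Answer",
                "text": "A JSON-based format for linked data.",
            },
        }
    ],
}

html_snippet = render_jsonld_script(faq)
print(html_snippet)
```

The returned string can be placed in the page template's head, so the markup is part of the first response a crawler receives.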


Academic research establishes the structured data–AI connection

The relationship between structured data, knowledge graphs, and AI systems is well-documented in academic literature spanning more than two decades.

Foundational papers

Tim Berners-Lee's Semantic Web Vision (2001)

"The Semantic Web is really data that is processable by machine... data with well defined meaning is exchanged, and computers and people work side by side in cooperation." — Tim Berners-Lee, James Hendler, Ora Lassila, Scientific American, May 2001

The Schema.org Paper (2016)

R.V. Guha, Dan Brickley, and Steve Macbeth published the definitive reference, "Schema.org: Evolution of Structured Data on the Web," in Communications of the ACM. Key finding:

"Annotations in Schema.org are used as a data source for the Knowledge Graph, providing background information about well-known entities." — Guha et al., 2016, DOI: 10.1145/2844544

As of 2024, more than 45 million web domains mark up their pages with Schema.org, accounting for more than 450 billion Schema.org objects.

Retrieval-Augmented Generation research

The foundational RAG paper by Patrick Lewis et al. (Facebook AI Research, now Meta AI; NeurIPS 2020) introduced the framework that "combines the generative capabilities of LLMs with external knowledge retrieved from a separate database."

GraphRAG research has advanced significantly. The ACM Transactions survey (2025) by Boci Peng et al. explains:

"GraphRAG leverages structural information across entities to enable more precise and comprehensive retrieval, capturing relational knowledge that traditional RAG fails to represent." — DOI: 10.1145/3777378

Knowledge-enhanced language models

Several influential papers demonstrate LLMs benefit from structured knowledge:

  • K-BERT (2019): Injects knowledge graph triples into sentences as domain knowledge
  • KG-BERT (2019): Achieves state-of-the-art on knowledge graph completion using entity descriptions
  • ERNIE 3.0 (Baidu, 2021): 10B parameter model trained on "4TB corpus consisting of plain texts and a large-scale knowledge graph," surpassing human performance on SuperGLUE (+0.8%)

Recent research from December 2024 ("Thinking with Knowledge Graphs: Enhancing LLM Reasoning Through Structured Data", arXiv:2412.10654) confirms:

"By grounding the reasoning processes of LLMs with KGs, we can enhance the factual accuracy of the generated text and reduce hallucinations."

The critical statistic

A benchmark study from Data.world (cited across industry sources) found:

"LLMs grounded in knowledge graphs achieve 300% higher accuracy compared to those relying solely on unstructured data."


Industry leaders provide compelling social proof

Mike King (iPullRank, Search Marketer of the Year 2020/2025)

"Structured data does come into play here. It's not that it's being trained on the structured data, but the structured data can be ingested during the RAG pipeline." — Mike King, SEO Week 2025

King also articulated the strategic shift: "SEO in the AI Mode world is no longer about chasing blue links. It's about building robust, retrievable, and reusable content artifacts that serve as input for machine synthesis."

Bill Slawski (Patent Analysis Expert)

"Schema markup allows for the entire web to be treated as a scattered database—with algorithms mining data from all over the web... to return the best answers to any query through the construction of relationship ontologies." — Bill Slawski, SMX Advanced

Lily Ray (Amsive Digital)

"Proper use of structured data can help with E-A-T for a number of reasons. For one, structured data helps establish and solidify the relationship between entities, particularly among the various places they are mentioned online." — Lily Ray, Search Engine Journal

Aleyda Solis (Orainti)

"Use author, organization structured data for brand and entity salience that reinforces citation metadata." — Aleyda Solis, AI Search Content Optimization Checklist, June 2025

Martha van Berkel (Schema App)

"Search engines leverage your Schema Markup and knowledge graph as data sources to train their machines and infer new knowledge. By developing your organization's knowledge graph, you can prime your organization's web data to be 'AI-ready'." — Schema App


Case studies show measurable AI visibility improvements

Search Engine Land controlled experiment (September 2025)

Methodology: Three identical single-page sites with (1) well-implemented schema, (2) poorly implemented schema, (3) no schema—tested on keywords matched for difficulty (KD:3) and volume (60/month).

Results:

  • Well-implemented schema: Only page to appear in AI Overview; achieved Position 3 (highest)
  • Poorly implemented schema: Ranked for 10 keywords, peaked at Position 8; NO AI Overview appearances
  • No schema: NOT indexed despite being crawled; zero rankings

Source: https://searchengineland.com/schema-ai-overviews-structured-data-visibility-462353

SMA Marketing Wikidata experiment (February 2025)

Methodology: Added Wikidata references to article schema; ran paired t-tests and chi-square tests.

Results:

  • AI Overview rankings: 18 → 30 (a 67% increase), statistically significant (p < 0.05)
  • CTR and clicks: Statistically significant increase
  • Perplexity & Copilot traffic: Significant increase

Source: https://www.smamarketing.net/blog/structured-data-ai-search-seo
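
The experiment's methodology does not specify which properties carried the Wikidata references, so the sketch below assumes one common pattern: an `about` entity whose `sameAs` points at a Wikidata item. The article, author, and publisher are invented; Q42 is the Wikidata item for Douglas Adams and is used purely as a recognizable example. The helper `wikidata_ids` is hypothetical and simply lists which QIDs a block of markup references.

```python
import json

WIKIDATA_PREFIX = "https://www.wikidata.org/wiki/"

article = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "A hypothetical article about Douglas Adams",
    "author": {"@type": "Person", "name": "Jane Example"},
    "publisher": {"@type": "Organization", "name": "Example Media"},
    "about": {
        "@type": "Person",
        "name": "Douglas Adams",
        "sameAs": WIKIDATA_PREFIX + "Q42",
    },
}


def wikidata_ids(node):
    """Recursively collect Wikidata QIDs referenced through sameAs links."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "sameAs":
                links = value if isinstance(value, list) else [value]
                found.extend(
                    link[len(WIKIDATA_PREFIX):]
                    for link in links
                    if isinstance(link, str) and link.startswith(WIKIDATA_PREFIX)
                )
            else:
                found.extend(wikidata_ids(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(wikidata_ids(item))
    return found


print(json.dumps(article, indent=2))
print(wikidata_ids(article))
```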

AISO/Content Marketing Institute experiment

Methodology: Tested identical content on structured vs. unstructured pages; asked ChatGPT same questions.

Result: ChatGPT responses using structured pages scored 30% higher for accuracy, completeness, and presentation quality.

Source: https://contentmarketinginstitute.com/seo-for-content/structured-data-ai-engines

Additional documented results

Metric                                   | Result                     | Source
-----------------------------------------|----------------------------|-------------------------
E-commerce traffic increase              | 35% post-schema            | Schema App
Rakuten recipe traffic                   | 2.7x increase              | Google case study
Rich results CTR                         | 20-30% higher vs. standard | Industry studies
AI response visibility                   | 8% → 67% in 60 days        | Radiant Elephant/HubSpot
AI Overview citations from beyond top-10 | 83.3%                      | BrightEdge, Sept 2025

Emerging AI search tools leverage structured data

Brave AI Search

Brave's independent index (30+ billion pages) includes "Schema Enriched Web Results"—structured data about webpages optimized for AI parsing. Their API provides "rich metadata ready for AI parsing" with a 94.1% F1-score on SimpleQA benchmark.

You.com

You.com offers a Web Search API optimized for LLMs with "structured, context-rich data" and "citation-backed results." It claims 95%+ accuracy on SimpleQA, and DuckDuckGo uses its API for breaking news.

Voice assistants

Google Assistant, Alexa, and Siri all rely heavily on structured data:

  • Speakable schema identifies sections for text-to-speech
  • LocalBusiness schema powers "near me" queries
  • FAQPage schema enables direct answer retrieval
  • Voice responses typically need ~29 words or fewer
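
As a small illustration of the first and last points, the sketch below pairs a `speakable` declaration with a word-count check against the roughly 29-word guideline. The page name, CSS selector, and summary text are hypothetical, and the 29-word limit is taken from the list above rather than from any assistant's published specification.

```python
import json

MAX_SPOKEN_WORDS = 29  # guideline from the list above, not an official limit

# Text of the element the selector below is assumed to point at.
summary = "JSON-LD is a JSON-based format for embedding structured data in a web page."

page = {
    "@context": "https://schema.org",
    "@type": "WebPage",
    "name": "What is JSON-LD?",
    "speakable": {
        "@type": "SpeakableSpecification",
        "cssSelector": ["#summary"],  # hypothetical element id
    },
}


def word_count(text):
    """Count whitespace-separated words."""
    return len(text.split())


fits = word_count(summary) <= MAX_SPOKEN_WORDS
print(json.dumps(page, indent=2))
print(f"{word_count(summary)} words, fits guideline: {fits}")
```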

Enterprise AI search

Glean, Microsoft 365 + Graphwise, and Altair RapidMiner all build knowledge graphs from structured data. Glean builds "unique knowledge graphs for each customer" using triplet structures (subject, predicate, object) to power generative AI responses.
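
The (subject, predicate, object) structure maps naturally onto schema markup. The sketch below flattens a small JSON-LD-like object into triples. It is a simplified illustration, not a conforming JSON-LD processor (it ignores `@context` expansion) and not a description of how Glean or any other vendor builds its graph; the organization, identifiers, and the function name `jsonld_to_triples` are hypothetical.

```python
def jsonld_to_triples(node):
    """Flatten a nested JSON-LD-like dict into (subject, predicate, object) triples."""
    subject = node.get("@id", "_:blank")
    triples = []
    for key, value in node.items():
        if key in ("@context", "@id"):
            continue
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, dict):
                # Nested entity: link to it, then emit its own triples.
                triples.append((subject, key, item.get("@id", "_:blank")))
                triples.extend(jsonld_to_triples(item))
            else:
                triples.append((subject, key, item))
    return triples


ORG_ID = "https://www.example.com/#org"
JANE_ID = "https://www.example.com/#jane"

organization = {
    "@context": "https://schema.org",
    "@id": ORG_ID,
    "@type": "Organization",
    "name": "Example Corp",
    "founder": {"@id": JANE_ID, "@type": "Person", "name": "Jane Example"},
}

triples = jsonld_to_triples(organization)
for triple in triples:
    print(triple)
```

Each nested entity becomes both the object of one triple and the subject of its own, which is what lets separate pages that share an `@id` contribute facts to the same node.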


Conclusion: Structured data has become foundational for AI visibility

The evidence points to a clear conclusion: structured data has evolved from an SEO enhancement to a foundational requirement for AI discoverability. While Google maintains that no "special" schema is needed for AI features, their own representatives confirm structured data helps systems understand content—and Microsoft has explicitly stated schema feeds their LLMs.

The most compelling findings include:

  • Microsoft's official confirmation that schema markup helps their LLMs understand content (Fabrice Canel, March 2025)
  • A reported 300% accuracy improvement for LLMs grounded in knowledge graphs vs. unstructured data (Data.world benchmark)
  • A controlled experiment in which only the page with well-implemented schema appeared in an AI Overview
  • Ryan Levering's admission that Google's systems "run much better with structured data" because it's "computationally cheaper"

For organizations seeking AI visibility, the strategic imperative is clear: implement comprehensive, server-side rendered JSON-LD schema markup—particularly Article, FAQPage, HowTo, Product, Organization, and Person types—while ensuring consistency between markup and visible content. The semantic layer powered by structured data has become the bridge between web content and AI understanding.
