Why does Google use different data sets for its Search AI and the Gemini app?

According to Google representatives, the search features (AI Overviews and AI Mode) run on a custom Gemini model that may have been trained differently from the model behind the consumer Gemini app. Both still rely on Google Search for real-time grounding, so the difference lies in training data rather than in access to fresh information.

Google’s AI Training Data: Key Differences Between Search AI and the Gemini App Revealed

by Morgan H

August 13, 2025

Google Uses Different Training Data for Search AI vs. Gemini App, Company Reveals

August 13, 2025 – Google representatives have disclosed how the company handles content for training different AI systems, revealing that AI Overviews and AI Mode use separate training approaches from the consumer Gemini app.

Separate Training Pipelines Confirmed

In a candid discussion at an industry event, Google confirmed that blocking Google-Extended affects only the Gemini app’s training data, not the AI systems powering search features like AI Overviews and AI Mode. Google-Extended is the robots.txt product token publishers use to opt out of AI training; it is a control token rather than a separate crawler.

“The model that we use for AI overviews and for AI mode is a custom Gemini model and that might mean that it was trained differently,” explained a Google representative. This revelation suggests publishers may have less control over their content’s use in search AI features than previously understood.
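The opt-out discussed above is set in a site's robots.txt file, where the token is spelled `Google-Extended`. The following is a minimal sketch, assuming a hypothetical robots.txt and a placeholder example.com URL, that uses Python's standard `urllib.robotparser` to show that a `Google-Extended` rule leaves ordinary Googlebot crawling untouched. Python's parser is only a stand-in here; Google's own rule matching may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site: opt out of Gemini training
# via the Google-Extended token while leaving ordinary crawling open.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def crawl_permissions(robots_txt, url, agents):
    """Return {agent: allowed?} for one URL under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


permissions = crawl_permissions(
    ROBOTS_TXT,
    "https://example.com/articles/some-post",  # placeholder URL
    ["Google-Extended", "Googlebot"],
)
for agent, allowed in permissions.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Under these rules the check reports Google-Extended as blocked and Googlebot as allowed. Per the statements quoted above, such a rule would limit use of the content for Gemini app training only; it would not opt the content out of the custom model behind AI Overviews and AI Mode.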

All Google AI Systems Rely on Search Index

Despite different training approaches, all of Google’s AI systems – including Gemini, AI Overviews, and AI Mode – use Google Search for real-time information grounding rather than fetching content directly from websites. This means regular Googlebot crawling remains crucial for content visibility in AI responses.

Growing Concerns Over AI Training Loops

Google’s Gary Illyes acknowledged emerging challenges with AI-generated content entering training datasets, potentially creating problematic feedback loops. “Model training definitely needs to figure out how to exclude content that was generated by AI otherwise you end up in a training loop,” Illyes explained during the session.

However, the company indicated quality matters more than origin. “As long as the content quality is high which typically nowadays requires that the human reviews the generated content it is fine for model training,” Illyes noted, shifting the focus from “human created” to “human curated” content. He clarified: “I think the word human created is wrong. Basically it should be human curated.”

Publishers Split on AI Crawler Blocking

The discussion revealed a divide in publisher strategies, with some blocking all AI crawlers while others adopt a wait-and-see approach. Illyes suggested the latter strategy might be wiser, noting that Generation Z users increasingly prefer AI interfaces.

“There’s two schools of thought – one is let’s block all the AI crawlers, second is let’s see where this is going. I tend to fall in the second group personally,” Illyes stated. He emphasized potential future opportunities: “If they [Gen Z] are going to be the next user base large user base then perhaps we need to figure out how to get value out of these results for publishers as well. But if you are not in the results then how are you going to get out that revenue?”

New Standards in Development

Google is participating in an Internet Engineering Task Force (IETF) working group developing “AI preferences” standards that would give publishers more granular control over how their content is used for AI training. However, no timeline was provided for implementation.

Technical Clarifications on SEO Impact

The discussion also clarified several technical points with direct implications for SEO practitioners:

404 Pages and Crawl Budget: Illyes confirmed that “404 pages don’t consume crawl budget,” which prevents competitors from manipulating a site’s crawling resources. However, he warned that serving those pages still costs server resources: “My recommendation for those cases is that you basically try the simplest 404 page that you can afford to have and go with that instead of expensive computationally expensive pages.”

AI-Generated Images: When asked about penalties for sites using AI-generated images, Illyes was clear: “Nope… AI generated image doesn’t impact the SEO.” He noted potential benefits: “If anything you might get some traffic out of image search for them.”

Social Media Signals: Addressing a common SEO question, Illyes stated definitively: “The answer is no and for the future is also likely no” regarding social media engagement as ranking signals. He explained: “We need to be able to control our own signals and if we are looking at external signals… that’s not in our control.”
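The 404 recommendation above can be illustrated with a small WSGI application. This is a sketch only, assuming a hypothetical in-memory route table (`PAGES`) and a pre-rendered error body (`NOT_FOUND_BODY`); neither name comes from Google or from the discussion. The point is that a missing URL is answered from a constant, with no template rendering or database work.

```python
# Pre-rendered once at import time: no templates, database lookups, or
# "related content" widgets run when a missing URL is requested.
NOT_FOUND_BODY = (
    b"<!doctype html><title>404 Not Found</title><h1>Page not found</h1>"
)

# Hypothetical route table, for illustration only.
PAGES = {
    "/": b"<!doctype html><title>Home</title><h1>Home</h1>",
}


def app(environ, start_response):
    """WSGI app that serves known pages and a cheap static 404 otherwise."""
    body = PAGES.get(environ.get("PATH_INFO", "/"))
    status = "200 OK"
    if body is None:
        status, body = "404 Not Found", NOT_FOUND_BODY
    start_response(
        status,
        [
            ("Content-Type", "text/html; charset=utf-8"),
            ("Content-Length", str(len(body))),
        ],
    )
    return [body]
```

To try it locally, the app can be passed to `wsgiref.simple_server.make_server("127.0.0.1", 8000, app)` from the standard library. On a production site the same idea usually means configuring the web server or framework to return a static error document.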

Personal Skepticism About Generative AI

In a candid moment, Illyes expressed personal reservations about generative AI: “I don’t like generative AI. I think predictive AI is incredibly valuable.” He praised specific applications like summarization but noted concerns about image generation and hallucinations: “As soon as you get to a topic that the AI is not familiar with, then it will start making really weird hallucinations.”

Revenue Model Uncertainty

When asked about potential advertising in AI Mode, Illyes acknowledged uncertainty about Google’s monetization strategy if AI interfaces replace traditional search: “I don’t know but it’s not my problem. We have people to figure these things out… My job is to answer these kind of questions and to the best of my ability.”

Key SEO Takeaways for Publishers

Based on Illyes’ insights, here are actionable recommendations for SEO practitioners:

Content Strategy

Focus on human curation: Ensure AI-generated content receives proper editorial oversight and fact-checking
Maintain content quality: High-quality, accurate content matters more than its origin (human vs. AI)
Consider blocking implications: Weigh the risks of blocking Google-Extended against potential future AI-driven revenue opportunities

Technical Optimization

Simplify 404 pages: Use lightweight 404 pages to limit server load when crawlers request missing URLs
Leverage AI-generated images: No SEO penalty exists; these can potentially drive image search traffic
Don’t rely on social signals: Social media engagement won’t boost search rankings

Future Planning

Monitor Gen Z preferences: Prepare for a user base that increasingly prefers AI interfaces
Stay informed on standards: Watch for IETF AI preferences standards that may provide more content control options
Maintain crawlability: Ensure Googlebot can access your content for AI grounding purposes

Final Takeaways

The discussion reveals a complex and evolving landscape where traditional SEO strategies must adapt to AI-driven search experiences. Key insights include:

Control is limited: Publishers have less control over AI training than many assumed, with different Google AI systems using separate approaches
Quality trumps origin: Google prioritizes content accuracy and value over whether humans or AI created it
Future opportunity exists: Early blocking of AI crawlers may limit access to emerging revenue streams as user behavior shifts
Technical fundamentals remain: Core SEO principles around crawlability, server efficiency, and content quality continue to matter
Standards are coming: Industry-wide solutions for granular AI content control are in development but timing remains uncertain

As Illyes noted about the broader uncertainty: “There’s lots of considerations to be taken into account and we don’t yet know where we are going to go with it, but we are definitely working on it.”

For now, publishers should focus on creating high-quality, human-curated content while maintaining technical best practices and monitoring developments in AI content standards.
