Xtrusio AEO/GEO Audit

Gemini is blind to lakeFS.

ChatGPT and Claude aren’t.

A 20-query buyer-intent audit across ChatGPT, Claude, and Gemini. lakeFS is cited in 50 of 60 responses (83.3%) and ranks #1 in 36 of those citations. Only one competitor takes a first-place position anywhere: DVC, on a single Gemini query (Q9). But Gemini omits lakeFS entirely on the 4 queries (Q1, Q2, Q3, Q12) that map to its core homepage use cases.

This report was generated using Xtrusio, an AI visibility and demand intelligence platform that analyzes how companies appear across modern AI systems such as ChatGPT, Gemini, Claude, Perplexity, and other generative engines.

The insights in this page are generated using Xtrusio’s proprietary research and content intelligence framework.

June 2026
20 Queries • 3 Platforms
lakeFS citation rate by platform:
Claude: 95% (19 of 20 queries), 15× #1 rankings
ChatGPT: 90% (18 of 20 queries), 13× #1 rankings
Gemini: 65% (13 of 20 queries), ⚠ 30-point gap vs Claude
The Core Problem

lakeFS dominates ChatGPT and Claude — but Gemini defaults to AWS-native answers for the exact queries lakeFS was built to own.

When enterprise data engineers ask Gemini “how do I add version control to my S3 data lake without migrating?” (Q2) or “how do I give engineers isolated dev environments without copying terabytes?” (Q1), Gemini returns Lake Formation, IAM policies, and S3 Bucket Versioning, not lakeFS. These are the first two questions in this audit and among the first any buyer asks. lakeFS is invisible on both, on the platform that appears to lean most heavily on Google’s own documentation and AWS guides.

Composite citation rate: 83.3%
#1 rankings: 36 (of 50 citations)
Claude vs Gemini gap: 30 pp
Section 2

Platform Scorecard

lakeFS citation rate across AI platforms

lakeFS Citation Rate by Platform
Claude: 95%
ChatGPT: 90%
Gemini: 65%
Competitor Comparison: Combined Citation Rates (all 3 platforms)
lakeFS: 83%
Iceberg / Nessie: 43%
Delta Lake: 33%
MLflow: 22%
DVC: 17%
Claude & ChatGPT: Category Owned
On Claude (95%) and ChatGPT (90%), lakeFS holds an unchallenged #1 position across core use cases. No competitor takes first place on either platform in any question. The DVC acquisition narrative, agentic AI sandboxing, and heterogeneous lake versioning all return lakeFS as the primary recommendation.
Gemini: AWS-Native Defaults Winning
Gemini cites lakeFS on only 65% of queries and misses entirely on Q1, Q2, Q3, and Q12 — the foundational S3 versioning and data CI/CD use cases. Gemini defaults to Lake Formation, IAM policies, S3 Bucket Versioning, and Great Expectations instead. These are the queries enterprise S3 buyers ask first.
Section 3

AI Visibility Leaderboard

Who owns the AI conversation — total citations across all platforms

Platform-by-Platform Breakdown
ChatGPT: lakeFS cited on 18/20 queries
Claude: lakeFS cited on 19/20 queries
Gemini: lakeFS cited on 13/20 queries
Citation Leaderboard
lakeFS: 50 citations (83% of 60 responses)
Iceberg/Nessie: 26 citations (43% of 60 responses)
Delta Lake: 20 citations (33% of 60 responses)
Citation Intensity Heatmap: lakeFS vs Competitors
Tool | ChatGPT | Claude | Gemini | Total
lakeFS | 18 | 19 | 13 | 50
Iceberg / Nessie | 12 | 8 | 6 | 26
Delta Lake | 10 | 7 | 3 | 20
MLflow | 5 | 6 | 2 | 13
DVC | 2 | 4 | 4 | 10
lakeFS Leads by 2×
lakeFS (50 total citations) has nearly twice the AI visibility of its closest competitor Iceberg/Nessie (26). No competitor holds a single #1 ranking across any platform except DVC on one Gemini question (Q9). The category is owned.
Gemini Gives Iceberg More First-Place Slots
On 4 Gemini queries where lakeFS is absent (Q1, Q2, Q3, Q12), Iceberg/Nessie, Lake Formation, and data quality tools (Great Expectations, Soda) fill the gap. These are lakeFS’s core homepage use cases — the most costly blind spot in this audit.
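
The rates and the lead over Iceberg/Nessie quoted in this section follow directly from the per-platform citation counts. As a minimal sketch, assuming 20 queries per platform and the counts reported in the heatmap above, the arithmetic can be reproduced in Python (the variable and function names are illustrative, not part of Xtrusio's tooling):

```python
# Citation counts per platform, copied from the heatmap in this section.
# Tuple order: ChatGPT, Claude, Gemini. Each platform answered 20 queries.
QUERIES_PER_PLATFORM = 20
PLATFORMS = ("ChatGPT", "Claude", "Gemini")

citations = {
    "lakeFS": (18, 19, 13),
    "Iceberg / Nessie": (12, 8, 6),
    "Delta Lake": (10, 7, 3),
    "MLflow": (5, 6, 2),
    "DVC": (2, 4, 4),
}

total_responses = QUERIES_PER_PLATFORM * len(PLATFORMS)  # 60 responses


def combined_rate(counts):
    """Percentage of all 60 responses that cite the tool."""
    return 100 * sum(counts) / total_responses


def platform_rates(counts):
    """Per-platform citation rate as a percentage."""
    return {p: 100 * c / QUERIES_PER_PLATFORM for p, c in zip(PLATFORMS, counts)}


for tool, counts in citations.items():
    print(f"{tool:<18}{sum(counts):>3} citations  {combined_rate(counts):5.1f}%")

lakefs = platform_rates(citations["lakeFS"])
gap_pp = lakefs["Claude"] - lakefs["Gemini"]
lead = sum(citations["lakeFS"]) / sum(citations["Iceberg / Nessie"])
print(f"Claude vs Gemini gap: {gap_pp:.0f} pp")
print(f"Lead over Iceberg/Nessie: {lead:.2f}x")
```

Keeping the raw counts as the single input means the headline percentages, the 30-point gap, and the "nearly 2×" lead can all be rechecked from one table.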
Section 4

AI Positioning Audit

20 buyer-intent queries, each shown with the exact question asked

Each query was written from the perspective of a real decision-maker researching data versioning and lake governance solutions. These personas represent the buyers whose AI search results determine whether lakeFS gets discovered during early-stage evaluation.

Target Buyer Sector: Sr. Director and Director-level engineering leaders at financial services, pharmaceutical, and large e-commerce companies managing petabyte-scale S3 data lakes for ML and analytics workloads
AJ
Sr. Director, Software & Data Engineering
Equifax • Financial Services • United States
8 queries
Pain Points
Manages 105-person data engineering org on petabyte-scale S3. Needs isolated dev environments for engineers without duplicating sensitive credit data, plus safe promotion workflows and fast rollback after bad ingestion.
“S3 data lake version control”, “data pipeline rollback without backup”
Qs 1, 2, 3, 4, 5, 6, 7, 10
JR
Director, Data & Analytics Engineering
Pfizer • Pharmaceutical • Connecticut, US
7 queries
Pain Points
Leads R&D data engineering for drug discovery pipelines that must satisfy FDA audit requirements. Every data change must be traceable, timestamped, and reversible. Evaluating multi-cloud governance at petabyte scale.
“federal audit data infrastructure”, “MLOps reproducibility regulated”
Qs 14, 15, 16, 17, 18, 19, 20
UG
Director of Engineering, Data Platforms
Coupang • E-Commerce • Bothell, WA
5 queries
Pain Points
Builds data platforms at Coupang processing millions of daily orders. Needs parallel ML experiment isolation on massive sensor & telemetry datasets, and auditable autonomous agent writes before they affect production pipelines.
“ML experiment data isolation”, “DVC alternative at lake scale”
Qs 8, 9, 11, 12, 13
# | Query Topic | Product Line | ChatGPT | Claude | Gemini
1 Isolated S3 dev environments without copying data Data Branching
Exact question asked across all AI platforms:

“How do I give each data engineer their own isolated dev environment on our production S3 data lake without duplicating hundreds of terabytes of data?”

2 Version control existing S3 lake without format change Data Branching ✓ #1✓ #1
Exact question asked across all AI platforms:

“What’s the best way to add version control to an existing S3-based data lake without changing the data format or migrating to a new storage system?”

3 Data CI/CD quality gates — PR workflow for data CI/CD & Quality ✓ #1
Exact question asked across all AI platforms:

“What tools let me run automated data quality checks before pipeline changes reach production — like a pull request workflow but for data?”

4 Rollback production lake after bad ingestion Data Branching ✓ #2✓ #1✓ #2
Exact question asked across all AI platforms:

“If a bad data ingestion job corrupts our production data lake, what’s the fastest way to roll back to a clean state without restoring from a full backup?”

5 Atomically promote mixed assets to production Data Branching ✓ #1✓ #1✓ #1
Exact question asked across all AI platforms:

“Our data lake has multiple related datasets — Parquet tables, JSON files, and model artifacts — that need to be promoted to production together as one atomic unit. What tools support this?”

6 Query Delta table 30 days ago (Databricks) Enterprise Gov.
Exact question asked across all AI platforms:

“We’re on Databricks and want to query a previous version of a Delta table from 30 days ago. What’s the easiest way to do this without extra tooling?”

7 Reproduce ML model + training data 6 months later ML/AI Mgmt ✓ #3✓ #1✓ #2
Exact question asked across all AI platforms:

“How do I ensure that an ML model trained six months ago can be exactly reproduced, including the training data as it existed at that specific point in time?”

8 Isolate parallel ML experiments on same sensor data ML/AI Mgmt ✓ #1✓ #1✓ #1
Exact question asked across all AI platforms:

“We run parallel ML experiments on the same sensor dataset. How do I isolate each experiment’s data so changes by one team don’t affect another team’s training run?”

9 Tie trained model to exact training data version ML/AI Mgmt ✓ #2✓ #2
Exact question asked across all AI platforms:

“What’s the standard approach for tying a trained ML model to the exact version of training data used, so results are fully reproducible months later?”

10 Version control heterogeneous data lake (images, logs, Parquet) Data Branching ✓ #1✓ #1✓ #1
Exact question asked across all AI platforms:

“Most data versioning tools focus on structured tables. How do I version control a data lake that contains a mix of images, sensor logs, raw binary files, and Parquet tables?”

11 When does DVC stop scaling — what to migrate to? ML/AI Mgmt ✓ #1✓ #1✓ #1
Exact question asked across all AI platforms:

“We use DVC for versioning our ML datasets and model artifacts. At what point does DVC stop scaling and what should we move to?”

12 Quality gates before ML training branch merge CI/CD & Quality ✓ #1✓ #1
Exact question asked across all AI platforms:

“What tooling enforces data quality gates before any dataset gets merged into our main ML training branch — so bad data never reaches model training?”

13 Isolate and audit autonomous AI agent writes Agentic AI ✓ #1✓ #1✓ #1
Exact question asked across all AI platforms:

“Our AI agents are starting to write transformed features back to our data lake autonomously. How do I make sure an agent’s writes are isolated and auditable before they affect other downstream pipelines?”

14 Automatic data lineage without manual instrumentation Enterprise Gov. ✓ #4✓ #3
Exact question asked across all AI platforms:

“What’s the best way to get automatic data lineage — knowing exactly which datasets fed into which model version — without instrumenting every pipeline manually?”

15 Federal audit trail — traceable, reversible at PB scale Enterprise Gov. ✓ #1✓ #1✓ #2
Exact question asked across all AI platforms:

“Our data operations need to satisfy federal audit requirements — every data change must be traceable, timestamped, and reversible. What data infrastructure supports this at petabyte scale?”

16 Centralized governance across AWS and Azure Enterprise Gov. ✓ #1✓ #1✓ #1
Exact question asked across all AI platforms:

“We run AI workloads across AWS and Azure. What tools give us centralized version control and governance over data lakes that span multiple clouds?”

17 Unity Catalog gaps for AI governance (Databricks) Enterprise Gov.
Exact question asked across all AI platforms:

“We’re evaluating Databricks Unity Catalog for data governance across our AI platform. What does Unity Catalog handle well, and where are its gaps?”

18 Zero-copy dev/test on petabytes of telemetry Data Branching ✓ #1✓ #1✓ #2
Exact question asked across all AI platforms:

“Our ML teams need to test pipeline changes on full production data — petabytes of sensor and telemetry data — without copying it. What’s the right approach?”

19 Enterprise MLOps platform must-have components ML/AI Mgmt ✓ #1✓ #1✓ #1
Exact question asked across all AI platforms:

“We’re building an enterprise MLOps platform that needs to support both structured and unstructured data, multiple teams, and full reproducibility. What are the must-have components?”

20 FDA/DoD regulated AI — data versioning traceability Agentic AI ✓ #1✓ #1✓ #1
Exact question asked across all AI platforms:

“We’re deploying AI models in a regulated environment where the FDA or DoD equivalent requires traceability between training data versions and model outputs. What data versioning tools meet this requirement?”

TOTAL: ChatGPT 18/20 (90%) | Claude 19/20 (95%) | Gemini 13/20 (65%)
Section 5

The Gemini Gap

Where lakeFS loses 30 percentage points vs Claude — and why

Claude cites lakeFS on 95% of queries. ChatGPT on 90%. Gemini on only 65%. The gap is not random — it follows a clear pattern: Gemini defaults to AWS-native tooling (Lake Formation, S3 Bucket Versioning, IAM policies) and generic data quality frameworks (Great Expectations, dbt, Soda) for the exact questions that represent lakeFS’s core homepage use cases. Any enterprise buyer using Gemini as their research assistant is missing lakeFS at the top of the funnel.

“How do I give each data engineer their own isolated dev environment on our production S3 data lake without duplicating hundreds of terabytes of data?”

— Claude answers: lakeFS (#1). ChatGPT: lakeFS (#2). Gemini: S3 Access Points, Lake Formation, IAM policies — no lakeFS.

“What’s the best way to add version control to an existing S3-based data lake without changing the data format or migrating to a new storage system?”

— Claude answers: lakeFS (#1). ChatGPT: lakeFS (#1). Gemini: S3 Bucket Versioning + Apache Iceberg in-place upgrade — no lakeFS.

“What tooling enforces data quality gates before any dataset gets merged into our main ML training branch — so bad data never reaches model training?”

— Claude answers: lakeFS pre-merge hooks (#1). ChatGPT: lakeFS + Great Expectations (#1). Gemini: Soda, Great Expectations, Datafold — no lakeFS hooks mentioned.
4 Queries Missed (Q1, Q2, Q3, Q12)
All four are foundational lakeFS use cases: zero-copy S3 branching (Q1), in-place version control (Q2), CI/CD data gates (Q3), and pre-merge quality gates (Q12). These are the first questions any S3 data lake buyer asks, and lakeFS is invisible on Gemini for all four.
Pattern: Gemini Defaults to AWS-Native Answers
Gemini appears to weight AWS official documentation and Google Cloud guides heavily. For S3-specific questions, it surfaces Lake Formation, IAM, and native S3 features before third-party tools. lakeFS needs editorial placement on the AWS-adjacent content sources that Gemini draws on.
Same Question. Different Platforms. Different Winners.

lakeFS’s content exists. Claude and ChatGPT know it. But Gemini doesn’t. The 30-point gap isn’t a product or positioning problem — it’s a content distribution problem. Gemini indexes different sources for S3 data infrastructure questions. The fix is targeted: publish “lakeFS vs S3 Bucket Versioning” and “lakeFS vs Lake Formation” comparison pages as standalone indexed articles, and submit guest posts to AWS-adjacent publications and Google Cloud partner blogs. These are the exact source types Gemini weights for this category.

Section 6

AI Topic Authority Map

Product Line × Platform heatmap — which revenue lines AI knows vs which are invisible

Product Line | AI Leader | lakeFS Status
Data Branching & Isolation | lakeFS | DOMINANT: 15/18 cited (83%)
ML/AI Data Management | lakeFS | PERFECT: 15/15 cited (100%)
Data CI/CD & Quality | lakeFS | PARTIAL: 3/6 cited (50%), Gemini gap
Enterprise Governance | lakeFS | SOLID: 11/15 cited (73%), lineage is weakest
Agentic AI & Governance | lakeFS | UNANIMOUS #1: 6/6 cited (100%) on all 3 platforms
Product Line | Queries | ChatGPT | Claude | Gemini
Data Branching & Isolation | 6 | 83% | 100% | 67%
ML/AI Data Management | 5 | 100% | 100% | 100%
Data CI/CD & Quality | 2 | 50% | 100% | 0%
Enterprise Governance | 5 | 80% | 80% | 60%
Agentic AI & Governance | 2 | 100% | 100% | 100%

► Data CI/CD & Quality is the only product line with zero Gemini visibility. Both queries (Q3 and Q12) are lakeFS core homepage use cases — the most fixable gap in this report.

2 Product Lines at 100% on All Platforms
ML/AI Data Management (5 queries) and Agentic AI & Governance (2 queries) return lakeFS as a citation on every platform, every time. The DVC migration narrative and branch-per-agent pattern are fully embedded in all three AI engines.
CI/CD & Quality: 0% on Gemini
The Data CI/CD product line is invisible on Gemini. Both queries (Q3: data PR workflow, Q12: ML branch quality gates) returned zero lakeFS citations on Gemini. Great Expectations, Soda, and dbt own this conversation on Gemini without lakeFS pre-merge hooks ever appearing.
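
The combined status figures for each product line (15/18, 15/15, 3/6, 11/15, 6/6) can be recovered from the per-platform percentages in the heatmap. The sketch below is one way to check that, assuming each percentage is a rounded share of that line's query count; the `heatmap` dictionary and `cited_counts` helper are illustrative names:

```python
# (queries, ChatGPT %, Claude %, Gemini %) per product line, from the heatmap.
heatmap = {
    "Data Branching & Isolation": (6, 83, 100, 67),
    "ML/AI Data Management": (5, 100, 100, 100),
    "Data CI/CD & Quality": (2, 50, 100, 0),
    "Enterprise Governance": (5, 80, 80, 60),
    "Agentic AI & Governance": (2, 100, 100, 100),
}


def cited_counts(queries, *percentages):
    """Turn rounded percentages back into whole cited-query counts."""
    return [round(pct * queries / 100) for pct in percentages]


summary = {}
for line, (queries, *percentages) in heatmap.items():
    cited = sum(cited_counts(queries, *percentages))
    responses = queries * len(percentages)  # queries x 3 platforms
    summary[line] = (cited, responses, round(100 * cited / responses))
    print(f"{line:<28}{cited:>3}/{responses:<3}({summary[line][2]}%)")
```

Working from counts rather than percentages makes small product lines easier to read: with only 2 queries, a single missed answer moves the rate by 50 points.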
Section 7

Methodology

How we conducted this Xtrusio AEO/GEO Audit

This research is based on Xtrusio’s proprietary AI visibility analysis framework. All citation data comes from standard conversational sessions — no Custom Gems, reference guides, or pre-seeded context windows.

Company & Competitor Research
Deep dive into lakefs.io, G2 reviews, Reddit (r/dataengineering, r/mlops), Hacker News discussions, and customer case studies. Competitor lane mapping across DVC, Apache Iceberg/Nessie, Delta Lake, MLflow, and Unity Catalog to identify where lakeFS uniquely owns territory vs shared ground.
20-Query Buyer-Intent Testing
Tested 20 decision-maker queries across ChatGPT, Gemini, and Claude in standard conversational sessions. Questions map to 3 types: USP questions (where lakeFS should win), shared territory (anyone could win), and competitor strength (where lakeFS might lose). Q6 and Q17 are deliberate competitor-platform questions — appropriate non-citations are scored as such.
Competitor Scope
Apache Iceberg + Project Nessie (open-source table versioning), Delta Lake / Databricks (tabular lakehouse), DVC (ML dataset versioning, now acquired by lakeFS), MLflow (experiment tracking), Great Expectations / Soda (data quality), Unity Catalog (Databricks governance). All compete for mindshare during enterprise buyer discovery.
Section 8

Recommendations

Prioritized actions to close the Gemini gap and cement AI-first authority

Phase 1 — 0–30 Days
Fix the Gemini Blackout on Core S3 Use Cases
  • Publish “lakeFS vs S3 Bucket Versioning” and “lakeFS vs AWS Lake Formation” as standalone comparison pages — these are the exact alternatives Gemini defaulted to on Q1 and Q2
  • Create a dedicated tutorial: “Data CI/CD with lakeFS pre-merge hooks” — explicitly framed as a GitHub Actions equivalent for data pipelines, covering Q3 and Q12 (both completely invisible on Gemini)
  • Submit guest posts to AWS blog and AWS-adjacent publications (The New Stack, InfoQ, Towards Data Science) — Gemini heavily indexes these sources for S3 infrastructure questions
Phase 2 — 30–90 Days
Strengthen ChatGPT Position & Fill Lineage Gap
  • Target Q3 on ChatGPT (only genuine non-deliberate miss): publish “lakeFS as the versioning layer in data CI/CD — working alongside Great Expectations and Datafold” — ChatGPT went to dbt/GX/Datafold without mentioning lakeFS hooks
  • Publish a data lineage architecture guide pairing lakeFS commit hashes + OpenLineage + MLflow — lineage (Q14) is the weakest citation across all platforms (#4 on ChatGPT, #3 on Claude, missed entirely on Gemini)
  • Build case study content from Arm, Bosch, and Lockheed Martin with specific technical framing around regulated data environments — directly supports Q15, Q16, Q20 (federal/DoD traceability)
Phase 3 — 90+ Days
Own the Category in All Three Engines
  • Launch a “State of Data Versioning” annual report — analyst-style research gets cited as a primary source by AI engines, embedding lakeFS as the category definer
  • Amplify the lakeFS for Agentic AI launch with sustained content — already #1 on all 3 platforms 2 weeks after launch; a content drumbeat with real customer use cases will lock this for 12–24 months
  • Quarterly Xtrusio re-audits on the same 20 questions to track whether Gemini gaps are closing, ChatGPT Q3 is fixed, and the agentic AI position is holding
Continuous AI Visibility Tracking
AI citation patterns shift as models update training data and new content is indexed. Brands can improve their AI discovery using generative engine optimization tools like Xtrusio. A one-time audit captures a moment in time; quarterly re-audits on the same questions show whether the Gemini gap is closing.
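
Tracking a re-audit comes down to comparing two snapshots of the same questions. The following is a minimal sketch of such a comparison, not Xtrusio's implementation: the `Snapshot` type and `diff_snapshots` function are hypothetical, the baseline uses four Gemini results reported in this audit, and the follow-up values are invented purely to exercise the code.

```python
from typing import Dict, List, Optional, Tuple

# (query number, platform) -> lakeFS rank in the answer, or None if not cited.
Snapshot = Dict[Tuple[int, str], Optional[int]]


def diff_snapshots(before: Snapshot, after: Snapshot) -> Dict[str, List[Tuple[int, str]]]:
    """Classify each (query, platform) pair that appears in both snapshots."""
    changes: Dict[str, List[Tuple[int, str]]] = {
        "gained": [], "lost": [], "improved": [], "declined": [],
    }
    for key in sorted(before.keys() & after.keys()):
        old, new = before[key], after[key]
        if old is None and new is not None:
            changes["gained"].append(key)      # newly cited
        elif old is not None and new is None:
            changes["lost"].append(key)        # citation dropped
        elif old is not None and new is not None and new < old:
            changes["improved"].append(key)    # moved up the ranking
        elif old is not None and new is not None and new > old:
            changes["declined"].append(key)    # moved down the ranking
    return changes


# Baseline: Gemini results for four queries as reported in this audit.
baseline: Snapshot = {
    (1, "Gemini"): None, (2, "Gemini"): None,
    (4, "Gemini"): 2, (5, "Gemini"): 1,
}
# HYPOTHETICAL follow-up values, not measured results.
follow_up: Snapshot = {
    (1, "Gemini"): 3, (2, "Gemini"): None,
    (4, "Gemini"): 1, (5, "Gemini"): 1,
}

print(diff_snapshots(baseline, follow_up))
```

Keying results by query number and platform keeps each re-audit comparable to the last, provided the 20 questions stay unchanged between runs.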


This research report was generated using the Xtrusio Company Intelligence Module.