Xtrusio AEO/GEO Audit

Gemini is blind to lakeFS.

ChatGPT and Claude aren’t.

20-query buyer-intent audit across ChatGPT, Claude & Gemini. lakeFS is cited in 50 of 60 responses (83.3%) and ranks #1 in 36 of those citations. The only competitor first-place ranking is DVC on one Gemini query (Q9). But Gemini misses lakeFS entirely on the 4 queries that are its core homepage use cases.

This report was generated using Xtrusio, an AI visibility and demand intelligence platform that analyzes how companies appear across modern AI systems such as ChatGPT, Gemini, Claude, Perplexity, and other generative engines.

The insights on this page are generated using Xtrusio’s proprietary research and content intelligence framework.

June 2026

20 Queries • 3 Platforms

Claude: 95% (19 of 20 queries), 15× #1 rankings
ChatGPT: 90% (18 of 20 queries), 13× #1 rankings
Gemini: 65% (13 of 20 queries), ⚠ 30-point gap

The Core Problem

lakeFS dominates ChatGPT and Claude — but Gemini defaults to AWS-native answers for the exact queries lakeFS was built to own.

When enterprise data engineers ask Gemini “how do I add version control to my S3 data lake without migrating?” or “how do I give engineers isolated dev environments without copying terabytes?” — Gemini returns Lake Formation, IAM policies, and S3 Bucket Versioning. Not lakeFS. These are Q1 and Q2 of this audit. They are the first two questions any buyer asks. And lakeFS is invisible on both — on the platform that indexes Google’s own documentation and AWS guides most heavily.

Composite Citation Rate: 83.3%
#1 Rankings (of 50 cited): 36
Claude vs Gemini Gap: 30pp

Section 2

Platform Scorecard

lakeFS citation rate across AI platforms

lakeFS Citation Rate by Platform
Claude: 95%
ChatGPT: 90%
Gemini: 65%

Competitor Comparison: Combined Citation Rates (all 3 platforms)
lakeFS: 83%
Iceberg / Nessie: 43%
Delta Lake: 33%
MLflow: 22%
DVC: 17%

Claude & ChatGPT: Category Owned

On Claude (95%) and ChatGPT (90%), lakeFS holds an unchallenged #1 position across core use cases. No competitor takes first place on either platform in any question. The DVC acquisition narrative, agentic AI sandboxing, and heterogeneous lake versioning all return lakeFS as the primary recommendation.

Gemini: AWS-Native Defaults Winning

Gemini cites lakeFS on only 65% of queries and misses entirely on Q1, Q2, Q3, and Q12 — the foundational S3 versioning and data CI/CD use cases. Gemini defaults to Lake Formation, IAM policies, S3 Bucket Versioning, and Great Expectations instead. These are the queries enterprise S3 buyers ask first.
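The platform figures above follow from simple counts. As a minimal sketch (the variable and function names are illustrative and not part of Xtrusio's tooling), the per-platform rates, the composite rate, and the Claude vs Gemini gap can be reproduced from the reported citation counts:

```python
# Queries (out of 20) on which each platform cited lakeFS, as reported
# in this audit. Names below are illustrative only.
QUERIES_PER_PLATFORM = 20
cited = {"Claude": 19, "ChatGPT": 18, "Gemini": 13}


def citation_rate(cited_count: int, total: int) -> float:
    """Return the citation rate as a percentage."""
    return 100 * cited_count / total


# Per-platform rate: cited queries divided by queries asked.
rates = {p: citation_rate(n, QUERIES_PER_PLATFORM) for p, n in cited.items()}

# Composite rate: all citations divided by all responses (20 x 3 = 60).
composite = citation_rate(sum(cited.values()), QUERIES_PER_PLATFORM * len(cited))

# The gap is a difference of percentages, so it is in percentage points.
gap_pp = rates["Claude"] - rates["Gemini"]

for platform, rate in rates.items():
    print(f"{platform}: {rate:.0f}%")
print(f"Composite: {composite:.1f}%")
print(f"Claude vs Gemini gap: {gap_pp:.0f} percentage points")
```

The gap is expressed in percentage points because it subtracts one rate from another. It is not a relative change.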

Section 3

AI Visibility Leaderboard

Who owns the AI conversation — total citations across all platforms

Platform-by-Platform Breakdown

ChatGPT: lakeFS cited on 18/20
Claude: lakeFS cited on 19/20
Gemini: lakeFS cited on 13/20


Citation Leaderboard (total citations)
lakeFS: 50 (83%)
Iceberg / Nessie: 26
Delta Lake: 20

[Figure: Citation Intensity Heatmap, lakeFS vs Competitors. Rows: lakeFS, Iceberg / Nessie, Delta Lake, MLflow, DVC. Columns: ChatGPT, Claude, Gemini, Total.]

lakeFS Leads by 2×

lakeFS (50 total citations) has nearly twice the AI visibility of its closest competitor, Iceberg/Nessie (26). The only competitor #1 ranking on any platform is DVC on one Gemini question (Q9). The category is owned.

Gemini Gives Iceberg More First-Place Slots

On 4 Gemini queries where lakeFS is absent (Q1, Q2, Q3, Q12), Iceberg/Nessie, Lake Formation, and data quality tools (Great Expectations, Soda) fill the gap. These are lakeFS’s core homepage use cases — the most costly blind spot in this audit.

Section 4

AI Positioning Audit

20 buyer-intent queries, each listed with the exact question asked

Each query was written from the perspective of a real decision-maker researching data versioning and lake governance solutions. These personas represent the buyers whose AI search results determine whether lakeFS gets discovered during early-stage evaluation.

Target Buyer Sector: Sr. Director and Director-level engineering leaders at financial services, pharmaceutical, and large e-commerce companies managing petabyte-scale S3 data lakes for ML and analytics workloads

Anurag Jain

Sr. Director, Software & Data Engineering

Equifax • Financial Services • United States

8 queries

Pain Points

Manages 105-person data engineering org on petabyte-scale S3. Needs isolated dev environments for engineers without duplicating sensitive credit data, plus safe promotion workflows and fast rollback after bad ingestion.

“S3 data lake version control” • “data pipeline rollback without backup”

Qs 1, 2, 3, 4, 5, 6, 7, 10

Jennifer Rola

Director, Data & Analytics Engineering

Pfizer • Pharmaceutical • Connecticut, US

7 queries

Pain Points

Leads R&D data engineering for drug discovery pipelines that must satisfy FDA audit requirements. Every data change must be traceable, timestamped, and reversible. Evaluating multi-cloud governance at petabyte scale.

“federal audit data infrastructure” • “MLOps reproducibility regulated”

Qs 14, 15, 16, 17, 18, 19, 20

Uttam Garg

Director of Engineering, Data Platforms

Coupang • E-Commerce • Bothell, WA

5 queries

Pain Points

Builds data platforms at Coupang processing millions of daily orders. Needs parallel ML experiment isolation on massive sensor & telemetry datasets, and auditable autonomous agent writes before they affect production pipelines.

“ML experiment data isolation” • “DVC alternative at lake scale”

Qs 8, 9, 11, 12, 13

#	Query Topic	Product Line	ChatGPT	Claude	Gemini
1	Isolated S3 dev environments without copying data	Data Branching	✓	✓	✗
Exact question asked across all AI platforms: “How do I give each data engineer their own isolated dev environment on our production S3 data lake without duplicating hundreds of terabytes of data?”
2	Version control existing S3 lake without format change	Data Branching	✓ #1	✓ #1	✗
Exact question asked across all AI platforms: “What’s the best way to add version control to an existing S3-based data lake without changing the data format or migrating to a new storage system?”
3	Data CI/CD quality gates — PR workflow for data	CI/CD & Quality	✗	✓ #1	✗
Exact question asked across all AI platforms: “What tools let me run automated data quality checks before pipeline changes reach production — like a pull request workflow but for data?”
4	Rollback production lake after bad ingestion	Data Branching	✓ #2	✓ #1	✓ #2
Exact question asked across all AI platforms: “If a bad data ingestion job corrupts our production data lake, what’s the fastest way to roll back to a clean state without restoring from a full backup?”
5	Atomically promote mixed assets to production	Data Branching	✓ #1	✓ #1	✓ #1
Exact question asked across all AI platforms: “Our data lake has multiple related datasets — Parquet tables, JSON files, and model artifacts — that need to be promoted to production together as one atomic unit. What tools support this?”
6	Query Delta table 30 days ago (Databricks)	Enterprise Gov.	✗	✗	✗
Exact question asked across all AI platforms: “We’re on Databricks and want to query a previous version of a Delta table from 30 days ago. What’s the easiest way to do this without extra tooling?”
7	Reproduce ML model + training data 6 months later	ML/AI Mgmt	✓ #3	✓ #1	✓ #2
Exact question asked across all AI platforms: “How do I ensure that an ML model trained six months ago can be exactly reproduced, including the training data as it existed at that specific point in time?”
8	Isolate parallel ML experiments on same sensor data	ML/AI Mgmt	✓ #1	✓ #1	✓ #1
Exact question asked across all AI platforms: “We run parallel ML experiments on the same sensor dataset. How do I isolate each experiment’s data so changes by one team don’t affect another team’s training run?”
9	Tie trained model to exact training data version	ML/AI Mgmt	✓ #2	✓	✓ #2
Exact question asked across all AI platforms: “What’s the standard approach for tying a trained ML model to the exact version of training data used, so results are fully reproducible months later?”
10	Version control heterogeneous data lake (images, logs, Parquet)	Data Branching	✓ #1	✓ #1	✓ #1
Exact question asked across all AI platforms: “Most data versioning tools focus on structured tables. How do I version control a data lake that contains a mix of images, sensor logs, raw binary files, and Parquet tables?”
11	When does DVC stop scaling — what to migrate to?	ML/AI Mgmt	✓ #1	✓ #1	✓ #1
Exact question asked across all AI platforms: “We use DVC for versioning our ML datasets and model artifacts. At what point does DVC stop scaling and what should we move to?”
12	Quality gates before ML training branch merge	CI/CD & Quality	✓ #1	✓ #1	✗
Exact question asked across all AI platforms: “What tooling enforces data quality gates before any dataset gets merged into our main ML training branch — so bad data never reaches model training?”
13	Isolate and audit autonomous AI agent writes	Agentic AI	✓ #1	✓ #1	✓ #1
Exact question asked across all AI platforms: “Our AI agents are starting to write transformed features back to our data lake autonomously. How do I make sure an agent’s writes are isolated and auditable before they affect other downstream pipelines?”
14	Automatic data lineage without manual instrumentation	Enterprise Gov.	✓ #4	✓ #3	✗
Exact question asked across all AI platforms: “What’s the best way to get automatic data lineage — knowing exactly which datasets fed into which model version — without instrumenting every pipeline manually?”
15	Federal audit trail — traceable, reversible at PB scale	Enterprise Gov.	✓ #1	✓ #1	✓ #2
Exact question asked across all AI platforms: “Our data operations need to satisfy federal audit requirements — every data change must be traceable, timestamped, and reversible. What data infrastructure supports this at petabyte scale?”
16	Centralized governance across AWS and Azure	Enterprise Gov.	✓ #1	✓ #1	✓ #1
Exact question asked across all AI platforms: “We run AI workloads across AWS and Azure. What tools give us centralized version control and governance over data lakes that span multiple clouds?”
17	Unity Catalog gaps for AI governance (Databricks)	Enterprise Gov.	✓	✓	✓
Exact question asked across all AI platforms: “We’re evaluating Databricks Unity Catalog for data governance across our AI platform. What does Unity Catalog handle well, and where are its gaps?”
18	Zero-copy dev/test on petabytes of telemetry	Data Branching	✓ #1	✓ #1	✓ #2
Exact question asked across all AI platforms: “Our ML teams need to test pipeline changes on full production data — petabytes of sensor and telemetry data — without copying it. What’s the right approach?”
19	Enterprise MLOps platform must-have components	ML/AI Mgmt	✓ #1	✓ #1	✓ #1
Exact question asked across all AI platforms: “We’re building an enterprise MLOps platform that needs to support both structured and unstructured data, multiple teams, and full reproducibility. What are the must-have components?”
20	FDA/DoD regulated AI — data versioning traceability	Agentic AI	✓ #1	✓ #1	✓ #1
Exact question asked across all AI platforms: “We’re deploying AI models in a regulated environment where the FDA or DoD equivalent requires traceability between training data versions and model outputs. What data versioning tools meet this requirement?”
	TOTAL		18/20 (90%)	19/20 (95%)	13/20 (65%)

Section 5

The Gemini Gap

Where lakeFS loses 30 percentage points vs Claude — and why

Claude cites lakeFS on 95% of queries. ChatGPT on 90%. Gemini on only 65%. The gap is not random — it follows a clear pattern: Gemini defaults to AWS-native tooling (Lake Formation, S3 Bucket Versioning, IAM policies) and generic data quality frameworks (Great Expectations, dbt, Soda) for the exact questions that represent lakeFS’s core homepage use cases. Any enterprise buyer using Gemini as their research assistant is missing lakeFS at the top of the funnel.

“How do I give each data engineer their own isolated dev environment on our production S3 data lake without duplicating hundreds of terabytes of data?”

— Claude answers: lakeFS (#1). ChatGPT: lakeFS (#2). Gemini: S3 Access Points, Lake Formation, IAM policies — no lakeFS.

“What’s the best way to add version control to an existing S3-based data lake without changing the data format or migrating to a new storage system?”

— Claude answers: lakeFS (#1). ChatGPT: lakeFS (#1). Gemini: S3 Bucket Versioning + Apache Iceberg in-place upgrade — no lakeFS.

“What tooling enforces data quality gates before any dataset gets merged into our main ML training branch — so bad data never reaches model training?”

— Claude answers: lakeFS pre-merge hooks (#1). ChatGPT: lakeFS + Great Expectations (#1). Gemini: Soda, Great Expectations, Datafold — no lakeFS hooks mentioned.

4 Queries Missed (Q1, Q2, Q3, Q12)

All four are foundational lakeFS use cases: zero-copy S3 branching (Q1), in-place version control (Q2), CI/CD data gates (Q3), and pre-merge quality gates (Q12). These are the first questions any S3 data lake buyer asks. lakeFS is invisible on Gemini for all four.

Pattern: Gemini Defaults to AWS-Native Answers
Gemini appears to weight AWS official documentation and Google Cloud guides heavily. For S3-specific questions, it surfaces Lake Formation, IAM, and native S3 features before third-party tools. lakeFS needs editorial placement on the AWS-adjacent content sources that Gemini draws on.

Same Question. Different Platforms. Different Winners.

lakeFS’s content exists. Claude and ChatGPT know it. But Gemini doesn’t. The 30-point gap isn’t a product or positioning problem — it’s a content distribution problem. Gemini indexes different sources for S3 data infrastructure questions. The fix is targeted: publish “lakeFS vs S3 Bucket Versioning” and “lakeFS vs Lake Formation” comparison pages as standalone indexed articles, and submit guest posts to AWS-adjacent publications and Google Cloud partner blogs. These are the exact source types Gemini weights for this category.

Section 6

AI Topic Authority Map

Product Line × Platform heatmap — which revenue lines AI knows vs which are invisible

Product Line	AI Leader	lakeFS Status
Data Branching & Isolation	lakeFS	DOMINANT — 15/18 cited (83%)
ML/AI Data Management	lakeFS	PERFECT — 15/15 cited (100%)
Data CI/CD & Quality	lakeFS	PARTIAL — 3/6 cited (50%) — Gemini gap
Enterprise Governance	lakeFS	SOLID — 11/15 cited (73%) — lineage is weakest
Agentic AI & Governance	lakeFS	UNANIMOUS #1 — 6/6 cited (100%) on all 3 platforms
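As a rough illustration of how the status column rolls up, the sketch below recomputes each product line's rate from the cited/possible counts stated in the table and orders the lines weakest first. The dictionary layout and helper name are assumptions made for this example. Only the counts come from the table.

```python
# (cited, possible) per product line, as stated in the table above.
# "possible" is the number of queries in the line times 3 platforms.
product_lines = {
    "Data Branching & Isolation": (15, 18),
    "ML/AI Data Management": (15, 15),
    "Data CI/CD & Quality": (3, 6),
    "Enterprise Governance": (11, 15),
    "Agentic AI & Governance": (6, 6),
}


def rate_pct(cited: int, possible: int) -> int:
    """Citation rate rounded to a whole percent."""
    return round(100 * cited / possible)


# Order product lines from weakest to strongest citation rate.
ranked = sorted(product_lines.items(), key=lambda kv: kv[1][0] / kv[1][1])

for name, (cited, possible) in ranked:
    print(f"{name}: {cited}/{possible} ({rate_pct(cited, possible)}%)")

# The per-line counts should add up to the audit-wide totals.
total_cited = sum(c for c, _ in product_lines.values())
total_possible = sum(p for _, p in product_lines.values())
print(f"Total: {total_cited}/{total_possible}")
```

Sorting weakest first puts Data CI/CD & Quality at the top of the list, which matches the gap this section highlights.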


► Data CI/CD & Quality is the only product line with zero Gemini visibility. Both queries (Q3 and Q12) are lakeFS core homepage use cases — the most fixable gap in this report.

Product Line	Queries	ChatGPT	Claude	Gemini
Data Branching & Isolation	6	83%	100%	67%
ML/AI Data Management	5	100%	100%	100%
Data CI/CD & Quality	2	50%	100%	0%
Enterprise Governance	5	80%	80%	60%
Agentic AI & Governance	2	100%	100%	100%

2 Product Lines at 100% on All Platforms

ML/AI Data Management (5 queries) and Agentic AI & Governance (2 queries) return lakeFS as a citation on every platform, every time. The DVC migration narrative and branch-per-agent pattern are fully embedded in all three AI engines.

CI/CD & Quality: 0% on Gemini

The Data CI/CD product line is invisible on Gemini. Both queries (Q3: data PR workflow, Q12: ML branch quality gates) returned zero lakeFS citations on Gemini. Great Expectations, Soda, and dbt own this conversation on Gemini without lakeFS pre-merge hooks ever appearing.

Section 7

Methodology

How we conducted this Xtrusio AEO/GEO Audit

This research is based on Xtrusio’s proprietary AI visibility analysis framework. All citation data comes from standard conversational sessions — no Custom Gems, reference guides, or pre-seeded context windows.

Company & Competitor Research

Deep dive into lakefs.io, G2 reviews, Reddit (r/dataengineering, r/mlops), Hacker News discussions, and customer case studies. Competitor lane mapping across DVC, Apache Iceberg/Nessie, Delta Lake, MLflow, and Unity Catalog to identify where lakeFS uniquely owns territory vs shared ground.

20-Query Buyer-Intent Testing

Tested 20 decision-maker queries across ChatGPT, Gemini, and Claude in standard conversational sessions. Questions map to 3 types: USP questions (where lakeFS should win), shared territory (anyone could win), and competitor strength (where lakeFS might lose). Q6 and Q17 are deliberate competitor-platform questions — appropriate non-citations are scored as such.
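The scoring rules themselves are not published in this report, so the following is only a hypothetical sketch of how one response might be scored. It assumes each AI answer has already been reduced to an ordered list of recommended tools. The `Score` type, the `score_response` function, and the `competitor_question` flag are invented for illustration, and the sample responses are made up and are not audit data.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Score:
    cited: bool            # brand appears anywhere in the response
    rank: Optional[int]    # 1-based position among tools, None if absent
    expected_miss: bool    # non-citation on a deliberate competitor question


def score_response(
    tools: Sequence[str], brand: str, competitor_question: bool = False
) -> Score:
    """Score one response for a brand (hypothetical scoring rule)."""
    names = [t.strip().lower() for t in tools]
    target = brand.lower()
    if target in names:
        return Score(cited=True, rank=names.index(target) + 1, expected_miss=False)
    # A miss only counts as "expected" on a competitor-platform question.
    return Score(cited=False, rank=None, expected_miss=competitor_question)


# Made-up responses for illustration only.
print(score_response(["Great Expectations", "lakeFS", "Soda"], "lakeFS"))
print(score_response(["Delta Lake time travel"], "lakeFS", competitor_question=True))
```

Keeping `expected_miss` separate from `cited` lets deliberate competitor-platform questions such as Q6 and Q17 be reported without counting an appropriate non-citation as a visibility failure.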

Competitor Scope

Apache Iceberg + Project Nessie (open-source table versioning), Delta Lake / Databricks (tabular lakehouse), DVC (ML dataset versioning, now acquired by lakeFS), MLflow (experiment tracking), Great Expectations / Soda (data quality), Unity Catalog (Databricks governance). All compete for mindshare during enterprise buyer discovery.

Section 8

Recommendations

Prioritized actions to close the Gemini gap and cement AI-first authority

Phase 1 — 0–30 Days

Fix the Gemini Blackout on Core S3 Use Cases

Publish “lakeFS vs S3 Bucket Versioning” and “lakeFS vs AWS Lake Formation” as standalone comparison pages — these are the exact alternatives Gemini defaulted to on Q1 and Q2
Create a dedicated tutorial: “Data CI/CD with lakeFS pre-merge hooks” — explicitly framed as a GitHub Actions equivalent for data pipelines, covering Q3 and Q12 (both completely invisible on Gemini)
Submit guest posts to AWS blog and AWS-adjacent publications (The New Stack, InfoQ, Towards Data Science) — Gemini heavily indexes these sources for S3 infrastructure questions

Phase 2 — 30–90 Days

Strengthen ChatGPT Position & Fill Lineage Gap

Target Q3 on ChatGPT (only genuine non-deliberate miss): publish “lakeFS as the versioning layer in data CI/CD — working alongside Great Expectations and Datafold” — ChatGPT went to dbt/GX/Datafold without mentioning lakeFS hooks
Publish a data lineage architecture guide pairing lakeFS commit hashes + OpenLineage + MLflow — lineage (Q14) is the weakest citation across all platforms (#4 on ChatGPT, #3 on Claude, missed entirely on Gemini)
Build case study content from Arm, Bosch, and Lockheed Martin with specific technical framing around regulated data environments — directly supports Q15, Q16, Q20 (federal/DoD traceability)

Phase 3 — 90+ Days

Own the Category in All Three Engines

Launch a “State of Data Versioning” annual report — analyst-style research gets cited as a primary source by AI engines, embedding lakeFS as the category definer
Amplify the lakeFS for Agentic AI launch with sustained content — already #1 on all 3 platforms 2 weeks after launch; a content drumbeat with real customer use cases will lock this for 12–24 months
Quarterly Xtrusio re‑audits on the same 20 questions to track whether Gemini gaps are closing, ChatGPT Q3 is fixed, and the agentic AI position is holding

Continuous AI Visibility Tracking

AI citation patterns shift as models update training data and new content is indexed. Brands can improve their AI discovery using generative engine optimization tools like Xtrusio. A one-time audit captures a moment in time; regular re-audits show whether the Gemini gap is closing.

Ready to close the Gemini gap?

Let’s map a content plan that puts lakeFS in front of every buyer — on every AI engine


This research report was generated using the Xtrusio Company Intelligence Module.
