How LLMs Rank Brands: A Statistical Study of AI Visibility Across Claude, GPT-4o, and Gemini
How do AI models like ChatGPT, Claude, and Gemini decide which brands to recommend? We built an open-source framework to measure exactly that, testing 48 supplement brands across 3 LLMs with 144,000 data points.
The results reveal massive differences between models, and actionable insights for brands looking to improve their AI visibility.
The Problem: AI Search is a Black Box
When a user asks ChatGPT "Which supplement brands are good in Germany?", the model returns a list of recommendations. But unlike traditional SEO, where you can see rankings in Google Search Console, there is no transparency into how LLMs rank brands.
We built the LLM Brand Visibility and Ranking Framework to solve this. It is a reproducible, statistical methodology for measuring brand visibility across multiple AI models.
Study Design
The study uses a fixed, repeatable design:
- 3 LLM Models: Claude (Anthropic), GPT-4o (OpenAI), Gemini 2.5 Flash (Google)
- 48 Supplement Brands from the Visibly AI Brand Radar (German market)
- 100 Generic Prompts in 2 clusters (Informational, Commercial), no brand names in any prompt
- 10 Runs per Prompt per model (temperature=0.7) to measure variance
- 144,000 Total Evaluations (100 prompts x 3 models x 10 runs x 48 brands)
- Buyer Persona: 25-year-old gym-goer, nutrition-conscious, asking in first person (German)
All prompts explicitly ask for brand recommendations ("Welche Marken...") but never mention any specific brand. Brands are only detected in the output using 3-layer matching: exact match, domain match, and fuzzy match.
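The three detection layers can be sketched roughly as follows. This is a minimal illustration, not the framework's actual code: the brand registry, the `.example` domains, and the 0.85 threshold are placeholders, and the standard library's `difflib` stands in for the thefuzz library so the snippet runs without extra dependencies.

```python
import re
from difflib import SequenceMatcher

# Hypothetical registry: domains are placeholders, not the brands' real URLs.
BRANDS = {
    "ESN": {"aliases": ["ESN"], "domain": "esn.example"},
    "Sunday Natural": {"aliases": ["Sunday Natural"], "domain": "sunday.example"},
    "Foodspring": {"aliases": ["Foodspring"], "domain": "foodspring.example"},
}


def detect_brands(text, brands=BRANDS, fuzzy_threshold=0.85):
    """Return {brand: layer} for each brand found, using the first layer that hits."""
    found = {}
    lowered = text.lower()
    tokens = re.findall(r"\w+", lowered)
    for brand, info in brands.items():
        # Layer 1: exact match with word boundaries, case-insensitive.
        if any(
            re.search(rf"\b{re.escape(alias)}\b", text, re.IGNORECASE)
            for alias in info["aliases"]
        ):
            found[brand] = "exact"
            continue
        # Layer 2: the brand's domain appears anywhere in the response.
        if info["domain"].lower() in lowered:
            found[brand] = "domain"
            continue
        # Layer 3: fuzzy match against token windows as long as the brand name.
        target = brand.lower()
        width = len(target.split())
        windows = (
            " ".join(tokens[i:i + width])
            for i in range(len(tokens) - width + 1)
        )
        if any(
            SequenceMatcher(None, target, window).ratio() >= fuzzy_threshold
            for window in windows
        ):
            found[brand] = "fuzzy"
    return found


response = "Ich empfehle ESN und Foodsprng, siehe auch sunday.example für Vitamine."
print(detect_brands(response))
```

Checking the layers in order of strictness keeps the fuzzy layer as a fallback. That matters for short names like ESN, where loose matching would otherwise produce false positives.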
Key Finding 1: ESN and AG1 Dominate AI Recommendations
The top brands by mention rate across all three models:
| Rank | Brand | Mention Rate | Top-3 Rate |
|---|---|---|---|
| 1 | ESN | 28.3% | 9.1% |
| 2 | AG1 | 26.0% | 3.4% |
| 3 | Sunday Natural | 13.6% | 2.2% |
| 4 | Foodspring | 13.1% | 3.2% |
| 5 | Ritual | 4.6% | 0.4% |
ESN (Elite Sports Nutrients) leads with a 28.3% mention rate, meaning it appears in more than one in four AI responses about supplements. AG1 (Athletic Greens) follows closely at 26.0%, but its mention rate varies far more from model to model.
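A mention rate like 28.3% is a point estimate, so it helps to see how much it could move under resampling. Below is a minimal sketch of a percentile bootstrap with 5,000 resamples. The per-response data is reconstructed for illustration only: it assumes 3,000 responses per brand (100 prompts x 3 models x 10 runs) and 849 mentions, which reproduces the reported rate. The function name and seed are our own choices, not part of the framework.

```python
import random


def bootstrap_ci(flags, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of 0/1 flags."""
    rng = random.Random(seed)
    n = len(flags)
    means = []
    for _ in range(n_resamples):
        # Resample the responses with replacement and recompute the rate.
        sample = rng.choices(flags, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Illustrative reconstruction, not the study's raw data.
flags = [1] * 849 + [0] * 2151
rate = sum(flags) / len(flags)
low, high = bootstrap_ci(flags)
print(f"mention rate {rate:.1%}, 95% CI [{low:.1%}, {high:.1%}]")
```

This version treats every response as independent. Since the ten runs of one prompt are likely correlated, resampling whole prompts instead of single responses may give more honest intervals.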
Key Finding 2: Models Disagree Massively on Brand Recommendations
The three models show fundamentally different behavior:
- Gemini is the most brand-heavy (4.7% mention rate, averaged across all 48 brands), naming specific brands nearly 5x as often as GPT-4o
- Claude sits in the middle (2.2%), balanced between brand mentions and generic advice
- GPT-4o is the most conservative (1.0%), preferring generic supplement advice over specific brand recommendations
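Key Finding 3: AG1 Has a 57-Point Spread Between Models
The most striking finding is brand volatility: how differently each model treats the same brand. Spread is the gap between a brand's highest and lowest per-model mention rate, in percentage points.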
| Brand | Claude | Gemini | GPT | Spread |
|---|---|---|---|---|
| AG1 | 11.2% | 62.0% | 4.9% | 57.1% |
| Sunday Natural | 16.6% | 21.7% | 2.5% | 19.2% |
| Foodspring | 20.5% | 14.6% | 4.3% | 16.2% |
| ESN | 28.0% | 36.1% | 20.8% | 15.3% |
AG1 appears in 62% of all Gemini responses but only 5% of GPT responses. That is a 12x difference for the same brand on the same prompts. For brands, this means GEO (Generative Engine Optimization) cannot be a one-size-fits-all strategy. Each model requires a different approach.
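The spread column and the 12x figure follow directly from the per-model rates. A short sketch using the values from the table above (the helper name `spread` is ours):

```python
# Per-model mention rates in percent, copied from the table above.
rates = {
    "AG1": {"Claude": 11.2, "Gemini": 62.0, "GPT": 4.9},
    "Sunday Natural": {"Claude": 16.6, "Gemini": 21.7, "GPT": 2.5},
    "Foodspring": {"Claude": 20.5, "Gemini": 14.6, "GPT": 4.3},
    "ESN": {"Claude": 28.0, "Gemini": 36.1, "GPT": 20.8},
}


def spread(per_model):
    """Gap between the highest and lowest mention rate, in percentage points."""
    values = per_model.values()
    return round(max(values) - min(values), 1)


spreads = {brand: spread(per_model) for brand, per_model in rates.items()}
ag1_ratio = rates["AG1"]["Gemini"] / rates["AG1"]["GPT"]

for brand, value in sorted(spreads.items(), key=lambda item: -item[1]):
    print(f"{brand}: {value} pp")
print(f"AG1 Gemini/GPT ratio: {ag1_ratio:.1f}x")
```

Spread is an absolute gap, while the ratio is relative. A brand with low rates everywhere can have a small spread and still show a large ratio, so it is worth reporting both.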
Methodology: How We Measured This
The framework combines the following components:
- Generic Prompts Only - No brand names ever appear in any prompt. Brands are only detected in the LLM output.
- Buyer Persona - All prompts simulate a real user: 25 years old, gym-goer, nutrition-conscious, asking in first-person German.
- 3-Layer Brand Detection - Exact match (regex with word boundaries), domain match (brand URL in response), fuzzy match (thefuzz library for typos and compound words).
- Statistical Tests - Fisher exact test for mention rates, Mann-Whitney U for rankings, Benjamini-Hochberg FDR correction for multiple testing.
- Bootstrap Confidence Intervals - 5,000 resamples for all metrics.
- Power Analysis - Monte Carlo simulation of the power to detect differences of 5+ percentage points. In the table below, power exceeds 90% at 20 runs x 100 prompts (6,000 observations across 3 models).
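To make the mention-rate testing step concrete, here is a minimal sketch of a two-sided Fisher exact test followed by Benjamini-Hochberg adjustment, written with the standard library only. It is not the framework's code. The mention counts are reconstructed from the Gemini and GPT percentages above, assuming 1,000 responses per model (100 prompts x 10 runs), and the function names are our own.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x mentions in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are at most as likely as the observed one.
    total = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )
    return min(1.0, total)


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Reconstructed counts (mentions out of 1,000 responses): Gemini vs GPT.
N = 1000
counts = {"AG1": (620, 49), "Sunday Natural": (217, 25),
          "Foodspring": (146, 43), "ESN": (361, 208)}

raw = [fisher_exact_two_sided(g, N - g, o, N - o) for g, o in counts.values()]
adjusted = benjamini_hochberg(raw)
for brand, p_adj in zip(counts, adjusted):
    print(f"{brand}: adjusted p = {p_adj:.3g}")
```

With 48 brands and three model pairs there are many comparisons, which is why an FDR correction matters: without it, a few "significant" differences would be expected by chance alone.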
Power Analysis: How Many Runs Do You Need?
A common question in LLM research: how many repeated runs are needed for statistically reliable results? We ran Monte Carlo simulations to find out.
| Configuration | Observations (all 3 models) | Power (5pp effect) | Cost (3 models) |
|---|---|---|---|
| 10 runs x 50 prompts | 1,500 | 34% (too low) | ~$9 |
| 30 runs x 50 prompts | 4,500 | 80% (minimum) | ~$28 |
| 20 runs x 100 prompts | 6,000 | 93% (recommended) | ~$37 |
| 30 runs x 200 prompts | 18,000 | 100% (gold standard) | ~$111 |
Rule of thumb: aim for at least 1,500 observations per model (e.g. 30 runs x 50 prompts, or 4,500 across three models) for statistically reliable results. Detecting smaller differences (under 5 percentage points) requires significantly more. More prompts can compensate for fewer runs.
Power >= 80% means you have a high probability of detecting a real difference if one exists. Below 60%, the risk of missing real effects is too high.
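The idea behind a Monte Carlo power estimate can be sketched in a few lines: simulate many studies with a known true difference and count how often the test detects it. The sketch below is deliberately simplified and will not reproduce the table's numbers. It assumes a 10% baseline mention rate, independent observations, and a pooled two-proportion z-test as a stand-in for the Fisher exact test. All of these are our assumptions, not the study's settings.

```python
import random
from math import erf, sqrt


def two_proportion_p(x1, n1, x2, n2):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = abs(x1 / n1 - x2 / n2) / se
    # Standard normal CDF expressed through the error function.
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))


def simulate_power(n_per_group, p_base=0.10, effect=0.05,
                   alpha=0.05, n_sims=500, seed=0):
    """Share of simulated studies in which the difference is detected."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Draw mention counts for two models with a known true difference.
        x1 = sum(rng.random() < p_base for _ in range(n_per_group))
        x2 = sum(rng.random() < p_base + effect for _ in range(n_per_group))
        if two_proportion_p(x1, n_per_group, x2, n_per_group) < alpha:
            hits += 1
    return hits / n_sims


for n in (500, 1500, 2000):
    print(f"{n} observations per model: power ~ {simulate_power(n):.2f}")
```

Real LLM responses to the same prompt are correlated, so a simulation that models prompt-level clustering will typically report lower power than this independent-draws version.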
What This Means for Brands (GEO Implications)
- Measure first, optimize second. You cannot improve what you do not measure. This framework gives you a baseline.
- Model-specific strategies are essential. A brand visible on Gemini may be invisible on GPT. Optimize for each model separately.
- Prompt type matters. Commercial prompts ("best brand for X") trigger different brands than informational prompts.
- Consistency is key. ESN scores 21-36% across all models. AG1 scores 5-62%. ESN has more stable AI visibility.
- Track over time. LLM training data changes. Monthly monitoring is recommended.
Surprising: More Nutrition Only Ranks #6
One particularly interesting result: More Nutrition lands at only 4.5% mention rate (rank #6), despite being one of the highest-revenue supplement brands in Germany. In our More Nutrition Revenue and SEO Analysis, we showed that the brand generates over 823,000 monthly brand searches and an estimated 800M EUR annual revenue.
Why is More Nutrition so far behind in AI visibility? One possible explanation is the target audience: our buyer persona is a 25-year-old gym-goer asking generic supplement questions. More Nutrition positions itself heavily through influencer marketing (particularly through founder Christian Wolf), which may be less represented in LLM training data than the traditional SEO presence of ESN or AG1.
This suggests that high revenue and brand awareness do not automatically translate to high AI visibility. GEO requires different signals than traditional SEO or social media marketing.
Interactive Report
The full interactive report contains all charts, tables, and raw data.
All raw data (CSV) is also available as a ZIP download.
Open Source: Use It for Your Industry
The entire framework is open source and adaptable to any industry:
GitHub: github.com/AntonioBlago/llm-visibility-framework
- 200 prompts (expandable), 48 brands (configurable), 3 models
- Parallel API calls (ThreadPoolExecutor), auto-generates reports
- Interactive HTML report with Plotly charts
- Statistical analysis engine with power analysis
- MIT License, free to use and modify
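For a sense of how the parallel run grid might look, here is a minimal sketch using `ThreadPoolExecutor`. It is not taken from the repository: `query_model` is a hypothetical stand-in that returns a canned string instead of calling an API, and the model labels are plain names rather than exact API model IDs.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Labels only; real API model identifiers would go here.
MODELS = ["claude", "gpt-4o", "gemini-2.5-flash"]


def query_model(model, prompt, run):
    """Hypothetical stand-in for a real API call."""
    return f"[{model} run {run}] response to: {prompt}"


def run_grid(prompts, models=MODELS, runs=10, max_workers=8):
    """Query every (model, prompt, run) combination in parallel."""
    jobs = list(product(models, prompts, range(runs)))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves job order, so results line up with jobs.
        responses = pool.map(lambda job: query_model(*job), jobs)
        return [
            {"model": m, "prompt": p, "run": r, "response": text}
            for (m, p, r), text in zip(jobs, responses)
        ]


results = run_grid(["Welche Marken empfiehlst du für Proteinpulver?"], runs=2)
print(len(results), "responses collected")
```

Threads suit this workload because API calls spend most of their time waiting on the network. In practice, `max_workers` would be tuned to each provider's rate limits.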
To track and improve your AI visibility continuously, check out Visibly AI, our SEO agent system with AI Brand Monitoring, Competitor Radar, and GEO optimization tools.