How to Actually Evaluate AI Visibility Vendors: A CMO...

Most AI visibility vendor dashboards are reporting noise as signal. When fewer than 1% of identical prompts across ChatGPT, Claude, and Google AI produce matching brand lists, a single "visibility score" on a vendor dashboard is selling confidence you cannot verify. Here is the decision framework I use to separate real measurement from expensive guesswork.

Why AI Visibility Scores Are Structurally Unreliable

The core problem is not that vendors are lying. It is that the thing they are measuring does not hold still long enough to produce a stable number.

Research from Rand Fishkin and Patrick O'Donnell (2026) tested 2,961 prompts across ChatGPT, Claude, and Google AI Overviews using hundreds of volunteers. Fewer than 1 in 100 repeated prompts generated identical brand recommendation lists. The brand lists shifted on nearly every run.

Position Digital's analysis found only 9.2% citation overlap between AI Mode sessions for the exact same query. Each AI Mode answer contains an average of 12.6 citation links, but which 12.6 links appear changes constantly.

As Lora Kratchounova, CEO of Scratch Marketing + Media, wrote in Demand Gen Report: large language models are "probabilistic systems, not deterministic ones," meaning "the same query can yield different results across runs, models, and contexts." Point-in-time snapshots — the foundation of most vendor dashboards — are inherently unreliable.

There is also a prompt bias problem. Research from Atwell and Alikhani (2025) showed that naming a brand in the measurement prompt inflates that brand's apparent recommendation rate by 20 or more percentage points compared to category-only prompts. If a vendor is measuring "How visible is [Your Brand] in AI search?" instead of "What brands do AI engines recommend for [your category]?", the numbers are structurally inflated.

The 5 Vendor Questions That Separate Measurement from Marketing

Before I evaluate any AI visibility vendor's output, I run these five questions. Any vendor that cannot answer all five clearly is not ready for a procurement conversation.

1. What is your sample size, and is your methodology documented? Defensible AI visibility measurement requires tens of thousands of model calls per measurement period. If a vendor cannot state their exact sample size or share versioned methodology documentation, you are buying a black box.

2. Can your scores reproduce within 5% on identical re-runs? This is the replication test. If you run the same prompts twice and get materially different scores, the vendor is measuring variance, not visibility. Ask for replication evidence before the demo.

3. Do you show per-engine breakdowns or just composite scores? A brand that appears highly visible in Perplexity may be nearly invisible in ChatGPT. A single composite score hides the most actionable signal — where you are winning and where you are losing. Demand per-engine views as standard.

4. Do your prompts include brand names or use category-only queries? This is the sycophancy test. Branded prompts produce inflated scores. Unbranded category prompts produce honest ones. If a vendor will not disclose their prompt structure, assume the worst.

5. Do you provide confidence intervals or just single-point scores? Given the inherent variability in LLM outputs, any single-integer score without confidence bounds overstates precision. A vendor reporting "your visibility score is 72" without a range is presenting measurement theater.

What the Budget Numbers Say About the Readiness Gap

The Gartner 2026 CMO Spend Survey — surveying 401 CMOs across North America and Europe from January through March 2026 — found that CMOs allocate an average of 15.3% of marketing budgets to AI initiatives. AI-ready organizations push that to 21.3%.

But only 30% of CMOs report mature AI readiness capabilities. Meanwhile, 56% say they lack the budget to deliver their 2026 strategy, and marketing budgets remain flat at 7.8% of company revenue.

The implication: CMOs are spending significantly on AI but most lack the infrastructure to evaluate what they are buying. When only 22% of marketers actively track AI visibility and just 38% have allocated budget specifically to AI search optimization, the vendor market is selling to buyers who cannot yet verify the product.

This is why the evaluation framework matters more than the tool. The Opollo 2026 AI Search Benchmark Report — analyzing GA4 referral data and CRM attribution from 312 B2B technology firms — found AI-referred visitors converting at 14.2% versus Google organic's 2.8%, roughly five times higher. The channel is real. But measuring it requires measurement science, not dashboard marketing.

Vendor Evaluation Comparison: What to Demand Before You Sign

Evaluation Criteria	What Good Looks Like	Red Flag
Methodology disclosure	Published, versioned documentation with sample sizes	"Our algorithm is proprietary"
Replication testing	Scores reproduce within 5% on re-runs; evidence provided	No replication data available
Per-engine breakdowns	Separate scores for ChatGPT, Perplexity, Gemini, Claude	Composite-only scoring
Prompt design	Unbranded category queries; prompt library disclosed	Brand-named prompts or undisclosed structure
Confidence intervals	Score ranges provided with every measurement	Single-point integer scores
Cross-model tracking	Tracks visibility across 4+ AI engines	Single-engine or Google-only focus
Data portability	Raw data exportable in CSV/JSON	Data locked behind dashboard
Methodology changelog	Version-controlled updates distinguishing brand movement from methodology shifts	No changelog or undisclosed changes
Customer evidence	Case studies with baseline metrics, interventions, and measured results	Testimonials without numbers

Why Tracking Tools Alone Will Not Build AI Visibility

Measurement is necessary. It is not sufficient.

Semrush data shows 93% of Google AI Mode sessions result in zero clicks to external websites. You cannot buy your way into AI citations the way you buy Google Ads clicks. AI engines cite sources they have already decided to trust — and that trust is built from entity authority, not keyword optimization.

Ahrefs research across 75,000 brands found that branded web mentions had the strongest correlation (0.664) with appearances in Google AI Overviews — far exceeding backlinks (0.218). The top three AI visibility correlates are all off-site factors: web mentions, brand anchors (0.527), and brand search volume (0.392).

This is where the Machine Relations framework matters. Most AI visibility vendors optimize one layer — tracking. But earning AI citations requires a full system: building entity authority that AI engines recognize, structuring content so models can extract clean claims, earning citations across multiple engines, distributing through channels AI agents actually crawl, and then measuring the result. Tracking without the upstream work is like installing a speedometer on a parked car.

If I am evaluating whether to spend budget on a visibility vendor, the first question is not "what tool?" — it is "have we built the thing the tool would measure?" A visibility audit answers that question faster than any vendor demo.

FAQ

How much should a CMO budget for AI visibility measurement tools?

The Gartner 2026 survey shows CMOs spending 15.3% of marketing budgets on AI initiatives broadly. Dedicated AI visibility measurement is a subset — typically 5-15% of the AI allocation, or roughly 1-2% of total marketing budget. The more important question is whether your organization has the entity authority and content structure to produce results worth measuring.

Can you reliably track brand visibility across all AI search engines?

Not with a single composite score. Each AI engine — ChatGPT, Perplexity, Gemini, Claude, Google AI Mode — uses different training data and weighting, producing different outputs for the same prompt. Position Digital found only 9.2% citation overlap across identical queries. Reliable tracking requires per-engine measurement, large sample sizes, unbranded prompts, and confidence intervals. Demand all four from any vendor.

What is Machine Relations and how does it relate to AI visibility?

Machine Relations is the discipline of earning AI citations and recommendations for a brand by making it legible, retrievable, and credible inside AI-driven discovery systems. AI visibility measurement is one layer of the full MR stack — the measurement layer. Without the upstream work of building entity authority, structuring extractable content, and earning cross-engine citations, there is nothing meaningful to measure.

How to Actually Evaluate AI Visibility Vendors: A CMO Decision Framework