MCP Quality Benchmark

MCP responses were reaching 30–40k tokens per query, inflating cost, slowing generation, and degrading retrieval accuracy. I ran an 8-configuration MCP benchmark to determine the optimal metadata format and chunking strategy. The winning configuration achieved 92% retrieval accuracy and reduced token consumption by 35%, informing the new MCP ingestion standard.

Role: Lead Designer & AI Systems Architect

Project Type: Design System AI Infrastructure

Team: Led end-to-end with 1 engineer + 1 DT partner

Contribution: Benchmarking strategy, implementation, analysis, and final recommendation

Tools: Cursor, Claude, MCP, React, TypeScript, Anthropic API, GitLab

Duration: 2 months

[Figure: MCP token burning]

The Problem

The MCP generated 4,000+ prototypes, but its responses ballooned to 30–40k tokens for trivial queries, creating cost bloat, latency, and reduced retrieval accuracy. No one knew whether the root issue was the metadata format, the chunking strategy, or the infrastructure. Without that clarity, we could not standardize AI input or scale reliable prototype automation.

This MCP performance issue directly causes (see the token-count sketch after this list):

  • High token consumption
  • Increased cost
  • Context window pressure
  • Degraded signal-to-noise ratio
  • Increased latency
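
For context on how the 30–40k figure was sized: below is a minimal sketch of per-response token measurement using the Anthropic SDK's token-counting endpoint. The model ID and the shape of the MCP response are illustrative assumptions, not the production setup.

```typescript
import Anthropic from "@anthropic-ai/sdk";

// Minimal sketch: count how many input tokens a single MCP response would
// consume if fed to the model. Model ID and payload are illustrative.
const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function measureResponseTokens(mcpResponse: string): Promise<number> {
  const result = await client.messages.countTokens({
    model: "claude-sonnet-4-5",
    messages: [{ role: "user", content: mcpResponse }],
  });
  return result.input_tokens; // responses here were landing at 30-40k
}
```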

Decision and Outcome

Which MCP configuration produces the highest-quality input?

Experiment A and the Semantic Chunker both performed well, but the Semantic Chunker offers the best value: it achieved the highest retrieval accuracy (92%) while reducing token consumption by 88% compared with the in-production baseline.

At 4,000 prototypes per month (~240,000 queries annually), this translates to approximately $12,000 in annual savings while improving retrieval accuracy by 3 points. Semantic Chunking JSON is now the standard for MCP ingestion.
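
Back-of-the-envelope on those numbers: ~240,000 queries per year at $0.05 saved per query is 240,000 × 0.05 ≈ $12,000 annually (the query volume implies roughly five retrieval queries per prototype at 4,000 prototypes per month).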

Rank | Configuration | Strategy | Token Savings | Cost Savings | Retrieval
1 | Semantic Chunker | Chunking at ingestion | 88% | $0.05/query | 92%
2 | In-Production Baseline | Monolithic JSON | Baseline | Baseline | 89%
3 | Experiment A | Pre-chunked JSON | 87% | $0.05/query | 86%

The Method

1. Cloned the in-production MCP repository
2. Generated metadata for each MCP experiment
3. Ingested each MCP with the new data locally
4. Verified all 8 MCP configurations were running
I created 8 MCPs with distinct metadata pipelines, ran each under identical prompts, and executed them on isolated local ports to remove cross-contamination. Learn more about the AIMS pipeline here.
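
The harness itself can be small. Here is a minimal sketch of the isolated-port setup, assuming each configuration exposes a hypothetical local /query endpoint; the names, ports, and response shape are assumptions, not the production AIMS pipeline:

```typescript
// Hypothetical benchmark harness: each MCP configuration runs on its own
// local port so results can't cross-contaminate. Endpoint and fields are
// assumptions for illustration.
type McpConfig = { name: string; port: number };

const configs: McpConfig[] = [
  { name: "in-production-baseline", port: 7001 },
  { name: "semantic-chunker", port: 7002 },
  { name: "experiment-a-prechunked-json", port: 7003 },
  // ...experiments B-F on ports 7004-7008
];

const prompts: string[] = [
  "Which Button variants support icons?",
  // ...the same prompt set is sent to every configuration
];

async function runBenchmark(): Promise<void> {
  for (const { name, port } of configs) {
    for (const prompt of prompts) {
      const started = Date.now();
      const res = await fetch(`http://localhost:${port}/query`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ prompt }),
      });
      const body = (await res.json()) as { answer: string; tokenCount: number };
      console.log(`${name}\t${Date.now() - started}ms\t${body.tokenCount} tokens`);
      // body.answer is scored separately against the retrieval rubric
    }
  }
}

runBenchmark();
```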

8 MCP Configurations, Same Knowledge (the monolithic and semantically chunked formats are contrasted in the sketch after this list):

  1. In-Production MCP - Monolithic and verbose JSON
  2. Semantic Chunker - Optimized MCP ingester (infra level)
  3. Experiment A - Pre-chunked and monolithic JSON
  4. Experiment B - Original human-oriented MDX documentation
  5. Experiment C - Markdown only
  6. Experiment D - Hybrid Markdown + JSON
  7. Experiment E - Domain-separated JSON
  8. Experiment F - TOON (Token-Oriented Object Notation)
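
To make the format differences concrete, here is a hedged sketch of the same Button knowledge in the two extremes, monolithic (configuration 1) versus semantically chunked (configuration 2). The field names are invented for illustration, not the actual metadata schema:

```typescript
// Monolithic, verbose JSON (configuration 1): everything about every
// component in one giant document, so every query pays for all of it.
const monolithic = {
  components: {
    Button: {
      usage: "...",
      props: ["variant", "size"],
      a11y: "...",
      examples: ["..."],
    },
    // ...hundreds more components in the same blob
  },
};

// Semantically chunked (configuration 2): one focused record per concept,
// so retrieval returns only the chunk that answers the question.
type Chunk = { id: string; component: string; topic: string; body: string };

const chunks: Chunk[] = [
  { id: "button-props", component: "Button", topic: "props", body: "variant, size, ..." },
  { id: "button-a11y", component: "Button", topic: "accessibility", body: "..." },
];
```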

The Framework


Why this matters: benchmarks are meaningless without a quality definition. I defined three dimensions because "quality" isn't one thing; it's three things in balance (sketched as a scorecard after this list):

  1. MCP Input Quality — Is the MCP payload optimized for LLM reasoning?
  2. LLM Output Quality (Retrieval) — Can the LLM find the correct information?
  3. LLM Output Quality (Prototype) — Can the LLM produce valid Indeed code?
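
A minimal sketch of how one scorecard row per configuration might be typed along these dimensions; the field names are illustrative, not the framework's actual schema:

```typescript
// Hedged sketch: one row of the quality framework per MCP configuration.
interface QualityScorecard {
  configuration: string;     // e.g. "semantic-chunker"
  inputQuality: number;      // 1. is the payload optimized for LLM reasoning? (0-100)
  retrievalAccuracy: number; // 2. did the LLM find the correct information? (0-100)
  prototypeQuality: number;  // 3. does the generated code meet Indeed standards? (0-100)
}
```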

Benchmark Evidence

Structured. Chunked. Smart. Coherent.

[Figure: benchmark dashboard]
✓ Structured - JSON > Markdown

JSON outperforms Markdown for LLM retrieval. Machine-parseable beats human-readable.
  • Finding: The top performers (Semantic Chunker at 92%, Experiment A at 86%) both use structured JSON formats.

✓ Chunked - Chunked > Monolithic

Breaking data into focused chunks improves retrieval. Monolithic dumps bury the signal in noise.
  • Finding: Both top performers chunk their JSON, whether semantically or pre-chunked at ingestion. Chunked > Monolithic.

✓ Smart - Semantic > Pre-chunked

Not just whether you chunk, but how you chunk: concept-aligned boundaries preserve meaning more effectively than arbitrary divisions (illustrated in the sketch after this list).
  • Finding: Semantic chunking achieved 92% coverage vs. 86% for pre-chunked JSON: same format, different chunking strategy.

✓ Coherent - One format > Hybrid

Keep related information together in one file per component, AND use consistent formats. Silos fragment context; mixed formats exhaust processing.
  • Finding: Domain-separated JSON (Experiment E) achieved only 20% retrieval coverage; the information was there, but siloed. The hybrid format (Experiment D) hit only 27% completion; the LLM couldn't even finish.
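
The "Smart" finding comes down to where chunk boundaries fall. Here is a hedged sketch of the difference, assuming documentation sections open with "## " headings; this is illustrative, not the ingester's actual code:

```typescript
// Arbitrary fixed-size chunking: can split a prop table or code example
// mid-definition, leaving neither half retrievable on its own.
function fixedChunks(text: string, size = 1000): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size));
  }
  return chunks;
}

// Concept-aligned (semantic) chunking: split at section boundaries so each
// chunk is a self-contained unit (usage, props, accessibility, ...).
function semanticChunks(doc: string): string[] {
  return doc
    .split(/\n(?=## )/)
    .map((section) => section.trim())
    .filter((section) => section.length > 0);
}
```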

Next Steps

This research extends beyond a single benchmark. The findings inform:

1. Prototype Benchmark: Complete full quality framework evaluation before production rollout.

2. Mobile Design System Documentation for MCP: Implement semantic chunking for the Mobile DS MCP.

3. Design System AI Infrastructure: Work with AI leadership to integrate MCP ingestion standards into the long-term AI platform strategy.

Prompt → Prototype

The prototype benchmark is the missing piece. Over the next few weeks, I will complete the full framework. Stay tuned!

[Figures: design-system-mcp-sample, design-system-mcp-prototype-viewer]

I design systems that both humans and AIs can understand, trust, and build on.

All copyright © reserved by Diana Wolosin 2026