MCP Quality Benchmark

MCP responses were reaching 30–40k tokens per query, inflating cost, slowing generation, and degrading retrieval accuracy. I ran an 8-configuration MCP benchmark to determine the optimal metadata format and chunking strategy. The winning configuration achieved 92% retrieval accuracy and reduced token consumption by 35%, informing the new MCP ingestion standard.

Role: Lead Designer & AI Systems Architect

Project Type: Design System AI Infrastructure

Team: Led end-to-end with 1 engineer + 1 DT partner

Contribution: Benchmarking strategy, implementation, analysis, and final recommendation

Tools: Cursor, Claude, MCP, React, TypeScript, Anthropic API, GitLab

Duration: 2 months

[Figure: MCP token burning]

The Problem

The MCP generated 4,000+ prototypes, but its responses ballooned to 30–40k tokens for trivial queries, creating cost bloat, latency, and reduced retrieval accuracy. No one knew whether the root issue was the metadata format, the chunking strategy, or the infrastructure. Without that clarity, we could not standardize AI input or scale reliable prototype automation.

This MCP performance issue directly causes (see the token-count sketch after this list):

  • High token consumption
  • Increased cost
  • Context window pressure
  • Degraded signal-to-noise ratio
  • Increased latency
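
For context on how the 30–40k figure was sized: below is a minimal sketch of per-response token measurement using the Anthropic SDK's token-counting endpoint. The model ID and the shape of the MCP response are illustrative assumptions, not the production setup.

```typescript
import Anthropic from "@anthropic-ai/sdk";

// Minimal sketch: count how many input tokens a single MCP response would
// consume if fed to the model. Model ID and payload are illustrative.
const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function measureResponseTokens(mcpResponse: string): Promise<number> {
  const result = await client.messages.countTokens({
    model: "claude-sonnet-4-5",
    messages: [{ role: "user", content: mcpResponse }],
  });
  return result.input_tokens; // responses here were landing at 30-40k
}
```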

Decision and Outcome

Which MCP configuration produces the highest-quality input?

Experiment A and the Semantic Chunker both performed well, but the Semantic Chunker offers the best value: it achieved the highest retrieval accuracy (92%) while reducing token consumption by 88% compared with the in-production baseline.

At 4,000 prototypes per month (~240,000 queries annually), this translates to approximately $12,000 in annual savings while improving retrieval accuracy by 3 points. Semantic Chunking JSON is now the standard for MCP ingestion.
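
Back-of-the-envelope on those numbers: ~240,000 queries per year at $0.05 saved per query is 240,000 × 0.05 ≈ $12,000 annually (the query volume implies roughly five retrieval queries per prototype at 4,000 prototypes per month).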

Rank | Configuration | Strategy | Token Savings | Cost Savings | Retrieval
1 | Semantic Chunker | Chunking at ingestion | 88% | $0.05/query | 92%
2 | In-Production Baseline | Monolithic JSON | Baseline | Baseline | 89%
3 | Experiment A | Pre-chunked JSON | 87% | $0.05/query | 86%

The Method

1. Cloned the in-production MCP repository
2. Generated metadata for each MCP experiment
3. Ingested each MCP with the new data locally
4. Verified all 8 MCP configurations were running
I created 8 MCPs with distinct metadata pipelines, ran each under identical prompts, and executed them on isolated local ports to remove cross-contamination. Learn more about the AIMS pipeline here.
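
The harness itself can be small. Here is a minimal sketch of the isolated-port setup, assuming each configuration exposes a hypothetical local /query endpoint; the names, ports, and response shape are assumptions, not the production AIMS pipeline:

```typescript
// Hypothetical benchmark harness: each MCP configuration runs on its own
// local port so results can't cross-contaminate. Endpoint and fields are
// assumptions for illustration.
type McpConfig = { name: string; port: number };

const configs: McpConfig[] = [
  { name: "in-production-baseline", port: 7001 },
  { name: "semantic-chunker", port: 7002 },
  { name: "experiment-a-prechunked-json", port: 7003 },
  // ...experiments B-F on ports 7004-7008
];

const prompts: string[] = [
  "Which Button variants support icons?",
  // ...the same prompt set is sent to every configuration
];

async function runBenchmark(): Promise<void> {
  for (const { name, port } of configs) {
    for (const prompt of prompts) {
      const started = Date.now();
      const res = await fetch(`http://localhost:${port}/query`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ prompt }),
      });
      const body = (await res.json()) as { answer: string; tokenCount: number };
      console.log(`${name}\t${Date.now() - started}ms\t${body.tokenCount} tokens`);
      // body.answer is scored separately against the retrieval rubric
    }
  }
}

runBenchmark();
```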

8 MCP Configurations, Same Knowledge (the monolithic and semantically chunked formats are contrasted in the sketch after this list):

  1. In-Production MCP - Monolithic and verbose JSON
  2. Semantic Chunker - Optimized MCP ingester (infra level)
  3. Experiment A - Pre-chunked and monolithic JSON
  4. Experiment B - Original human-oriented MDX documentation
  5. Experiment C - Markdown only
  6. Experiment D - Hybrid Markdown + JSON
  7. Experiment E - Domain-separated JSON
  8. Experiment F - TOON (Token-Oriented Object Notation)
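
To make the format differences concrete, here is a hedged sketch of the same Button knowledge in the two extremes, monolithic (configuration 1) versus semantically chunked (configuration 2). The field names are invented for illustration, not the actual metadata schema:

```typescript
// Monolithic, verbose JSON (configuration 1): everything about every
// component in one giant document, so every query pays for all of it.
const monolithic = {
  components: {
    Button: {
      usage: "...",
      props: ["variant", "size"],
      a11y: "...",
      examples: ["..."],
    },
    // ...hundreds more components in the same blob
  },
};

// Semantically chunked (configuration 2): one focused record per concept,
// so retrieval returns only the chunk that answers the question.
type Chunk = { id: string; component: string; topic: string; body: string };

const chunks: Chunk[] = [
  { id: "button-props", component: "Button", topic: "props", body: "variant, size, ..." },
  { id: "button-a11y", component: "Button", topic: "accessibility", body: "..." },
];
```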

The Framework


Why this matters: benchmarks are meaningless without a quality definition. I defined three dimensions because "quality" isn't one thing; it's three things in balance (sketched as a scorecard after this list):

  1. MCP Input Quality — Is the MCP payload optimized for LLM reasoning?
  2. LLM Output Quality (Retrieval) — Can the LLM find the correct information?
  3. LLM Output Quality (Prototype) — Can the LLM produce valid Indeed code?
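
A minimal sketch of how one scorecard row per configuration might be typed along these dimensions; the field names are illustrative, not the framework's actual schema:

```typescript
// Hedged sketch: one row of the quality framework per MCP configuration.
interface QualityScorecard {
  configuration: string;     // e.g. "semantic-chunker"
  inputQuality: number;      // 1. is the payload optimized for LLM reasoning? (0-100)
  retrievalAccuracy: number; // 2. did the LLM find the correct information? (0-100)
  prototypeQuality: number;  // 3. does the generated code meet Indeed standards? (0-100)
}
```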

Benchmark Evidence

Structured. Chunked. Smart. Coherent.

[Figure: benchmark dashboard]
✓ Structured - JSON > Markdown

JSON outperforms Markdown for LLM retrieval. Machine-parseable beats human-readable.
  • Finding: The top performers (Semantic Chunker at 92%, Experiment A at 86%) both use structured JSON formats.

✓ Chunked - Chunked > Monolithic

Breaking data into focused chunks improves retrieval. Monolithic dumps bury the signal in noise.
  • Finding: Both top performers chunk their JSON, whether semantically or pre-chunked at ingestion. Chunked > Monolithic.

✓ Smart - Semantic > Pre-chunked

Not just whether you chunk, but how you chunk: concept-aligned boundaries preserve meaning more effectively than arbitrary divisions (illustrated in the sketch after this list).
  • Finding: Semantic chunking achieved 92% coverage vs. 86% for pre-chunked JSON: same format, different chunking strategy.

✓ Coherent - One format > Hybrid

Keep related information together in one file per component, AND use consistent formats. Silos fragment context; mixed formats exhaust processing.
  • Finding: Domain-separated JSON (Experiment E) achieved only 20% retrieval coverage; the information was there, but siloed. The hybrid format (Experiment D) hit only 27% completion; the LLM couldn't even finish.
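
The "Smart" finding comes down to where chunk boundaries fall. Here is a hedged sketch of the difference, assuming documentation sections open with "## " headings; this is illustrative, not the ingester's actual code:

```typescript
// Arbitrary fixed-size chunking: can split a prop table or code example
// mid-definition, leaving neither half retrievable on its own.
function fixedChunks(text: string, size = 1000): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size));
  }
  return chunks;
}

// Concept-aligned (semantic) chunking: split at section boundaries so each
// chunk is a self-contained unit (usage, props, accessibility, ...).
function semanticChunks(doc: string): string[] {
  return doc
    .split(/\n(?=## )/)
    .map((section) => section.trim())
    .filter((section) => section.length > 0);
}
```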

Next Steps

This research extends beyond a single benchmark. The findings inform:

1. Prototype Benchmark: Complete full quality framework evaluation before production rollout.

2. Mobile Design System Documentation for MCP: Implement semantic chunking for the Mobile DS MCP.

3. Design System AI Infrastructure: Work with AI leadership to integrate MCP ingestion standards into the long-term AI platform strategy.

Prompt → Prototype

The prototype benchmark is the missing piece. Over the next few weeks, I will complete the full framework. Stay tuned!

[Figures: design-system-mcp-sample, design-system-mcp-prototype-viewer]

I design systems that both humans and AIs can understand, trust, and build on.

All copyright © reserved by Diana Wolosin 2026