AI & Technology

Why Your MCP Servers Are Overcharging Your LLM (and How to Fix It)

MCP servers can consume 15,000+ tokens just loading definitions before doing any work. Discover the two sources of Context Bloat—static definitions and dynamic results—and learn design principles for building token-efficient MCP servers that respect the context window.

MCP · LLM · AI · Token Optimization · Model Context Protocol · Software Development

As an advocate for Model Context Protocol (MCP) servers, I've watched this ecosystem evolve with both excitement and frustration. Servers like Context7 have unlocked new potential for LLMs, turning static assistants into capable agents. Yet lately, the whispers have grown into a roar: "MCP is burning through my tokens."

Here's the scale of the problem: A typical MCP server with 20 documentation search tools can consume 15,000 tokens just loading definitions—before doing any actual work. Add a single verbose tool result (say, a full code file or API response), and you've blown another 10,000 tokens. For context, that's roughly £0.03 per request with Claude Sonnet 4.5, and it compounds fast.

My recent experience updating my own tools in the Magic MCP suite confirmed this. There's a critical flaw in many early MCP implementations: Context Bloat. The problem isn't the protocol; it's poor implementation that has turned our powerful tools into token-hungry monsters.

This token consumption comes from two main sources: Static Context Bloat (the cost of definitions) and Dynamic Context Bloat (the cost of results). Here's the breakdown and the design principles driving the next generation of efficient MCP servers.

1. Eliminating Static Context Bloat: The Art of Condensation

The single biggest cost isn't the work an agent does—it's the context it must carry. This is the Static Context Bloat caused by eagerly loading every tool definition into the LLM's context window.

Schema Condensation: Every Word Is a Token

MCP schemas are instruction manuals for the LLM. In early designs, they were verbose and redundant. The solution is aggressive schema trimming:

Concise Language: Use minimal parameter names and descriptions. Replace lengthy prose like "A value between 1 and 200, with a default of 50" with structured JSON Schema properties: minimum: 1, maximum: 200, default: 50. The LLM infers intent from structure.
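
As a rough illustration, here's a hypothetical "limit" parameter expressed both ways, written as Python dicts in the shape the JSON Schema would take (the names are examples, not taken from any particular server):

# Before: the constraint lives in prose the LLM must read word by word.
limit_verbose = {
    "type": "integer",
    "description": "A value between 1 and 200, with a default of 50. "
                   "Controls the maximum number of results returned.",
}

# After: the same constraint expressed structurally, with a minimal description.
limit_condensed = {
    "type": "integer",
    "description": "Max results.",
    "minimum": 1,
    "maximum": 200,
    "default": 50,
}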

This principle was key in updating our Magic MCP tools, where we turned verbose oneOf descriptions into clean enum lists combined with a single, punchy description.

Data Consolidation: Fewer Tools, More Power

Stop flooding the agent with dozens of similar tool definitions. Tool Group Abstraction is the key:

Before (bloated):

  • search_context7_docs(query: str)
  • search_github_docs(query: str)
  • search_stackoverflow_docs(query: str)
  • search_mdn_docs(query: str)
  • ... (16 more similar tools)

After (consolidated):

  • search_documentation(query: str, source: str)
    # source: enum ["context7", "github", "stackoverflow", "mdn", ...]

In this example, consolidating 20 definitions into one cuts the static token footprint by roughly 90% and simplifies the agent's decision-making.
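
Here's a minimal sketch of the consolidated shape, assuming the FastMCP helper from the official MCP Python SDK; the tool name, source list, and stubbed dispatch are illustrative rather than any specific server's implementation:

from typing import Literal

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("docs-search")

@mcp.tool()
def search_documentation(
    query: str,
    source: Literal["context7", "github", "stackoverflow", "mdn"] = "context7",
    limit: int = 10,
) -> str:
    """Search the chosen documentation source and return the top matches."""
    # A real server would dispatch to the chosen backend's search API here;
    # this stub just echoes the routing decision.
    return f"[{source}] top {limit} results for: {query}"

if __name__ == "__main__":
    mcp.run()

The Literal hint should surface as a compact enum in the generated schema, so one definition carries the routing knowledge that previously cost twenty near-identical schemas.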

GraphQL/Query API Pattern

Move away from the 1:1 mapping of API endpoint to MCP tool. Instead of:

  • getUser(user_id: str)
  • getUserPermissions(user_id: str)
  • getUserOrders(user_id: str, limit: int)
  • getUserProfile(user_id: str)

Consolidate into:

  • query(query_string: str)
    # Accepts GraphQL-style queries like:
    # "user(id: '123') { name, permissions, orders(limit: 5) { id, total } }"

One definition describes all data access. Massive static token savings.
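
As a sketch of how the single tool can work end-to-end, here's a toy version using the graphql-core library, with a miniature schema and in-memory data standing in for the real backend (all names and records are illustrative):

from graphql import build_schema, graphql_sync

# A miniature schema covering what the four separate tools used to expose.
schema = build_schema("""
  type Order { id: ID!  total: Float! }
  type User {
    name: String!
    permissions: [String!]!
    orders(limit: Int = 10): [Order!]!
  }
  type Query { user(id: ID!): User }
""")

# Toy data; a real server would resolve these fields against its backing API.
USERS = {
    "123": {
        "name": "John Smith",
        "permissions": ["read", "write"],
        # Callable field: graphql-core's default resolver passes the field
        # arguments through, so orders(limit: 1) is honoured here.
        "orders": lambda info, limit=10: [
            {"id": "ORD-001", "total": 42.0},
            {"id": "ORD-002", "total": 13.5},
        ][:limit],
    }
}

def query(query_string: str) -> dict:
    """The single MCP tool: execute the query, return only what was asked for."""
    result = graphql_sync(schema, query_string,
                          root_value={"user": lambda info, id: USERS.get(id)})
    return result.data

print(query('{ user(id: "123") { name orders(limit: 1) { id } } }'))
# -> {'user': {'name': 'John Smith', 'orders': [{'id': 'ORD-001'}]}}

The same tool also gives you field selection for free, which is exactly the lever used in the next section to tame result size.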

2. Taming Dynamic Context Bloat: Controlling the Tsunami

Even if the static definitions are lean, Dynamic Context Bloat from massive tool results can instantly blow up the token count. This is the biggest source of overcharging: a single tool call that returns a full code file or a verbose API response.

Result Control at the Source

For tools returning large datasets, implement capping and slicing:

Pagination & Filtering: All list and search operations must implement explicit limits and filtering parameters (limit, offset, since_date). The agent retrieves a small, manageable slice rather than the full dataset, requesting more pages only if necessary.

GraphQL Field Selection

Let the agent request only the specific fields it needs. Instead of returning:

{
  "user": {
    "id": "123",
    "name": "John Smith",
    "email": "john@example.com",
    "address": { "street": "...", "city": "...", ... },
    "order_history": [ /* 50 orders */ ],
    "preferences": { /* nested config */ }
  }
}

The agent queries for:

user(id: "123") { name, orders(limit: 1) { id } }

And receives only:

{
  "user": {
    "name": "John Smith",
    "orders": [{ "id": "ORD-001" }]
  }
}

This prevents verbose, nested JSON from entering the context window.

Result Summarisation: The Server's Second Job

For unstructured data retrieval—reading a large code file or transcribing a meeting—the server has a crucial second job: summarisation.

A sophisticated MCP server won't return a 50,000-token document. It uses a server-side, cheaper LLM (or specialised utility) to return a structured, concise summary: 5 bullet points of code changes, key meeting decisions, or document structure. This prevents intermediate tool result bloat, allowing the main agent to see only relevant information without the raw data's token cost.
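
One possible shape for that second job, assuming the Anthropic Python SDK and a cheap Haiku-class model; the tool name, threshold, and model alias are assumptions, not a prescribed implementation:

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def read_file_summarised(path: str, max_summary_tokens: int = 500) -> str:
    """Hypothetical tool body: return a condensed summary, not the raw file."""
    raw = open(path, encoding="utf-8", errors="replace").read()

    # Small files aren't worth a summarisation round-trip; pass them through.
    if len(raw) < 4_000:
        return raw

    # Delegate condensation to a cheaper model server-side, so only the
    # summary ever reaches the main agent's context window.
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # assumed alias for a cheap model
        max_tokens=max_summary_tokens,
        system="Summarise this file in at most 5 bullet points: its purpose, "
               "key functions or sections, and anything unusual.",
        messages=[{"role": "user", "content": raw[:200_000]}],
    )
    return response.content[0].text

The main agent still gets what it needs to decide its next step; the 50,000-token original never enters its context.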

The Trade-Off: Complexity vs Efficiency

There's a legitimate concern here: consolidated tools mean the LLM must construct more complex queries. Does this actually work reliably?

In practice, yes—modern LLMs like Claude handle structured query construction remarkably well, especially when tool definitions include clear examples. The token savings (often 80-90% reduction in static bloat) far outweigh the occasional need for query refinement. More importantly, the cognitive load on the LLM actually decreases when it sees 5 well-designed tools instead of 50 redundant ones.

The Path Forward: Build Efficient or Get Left Behind

The token crisis is a growing pain, not a fatal flaw of the Model Context Protocol. Current concerns stem from early, un-optimised implementations. The future of MCP is efficiency-first design.

For MCP Server Developers: Audit your tools now. Count the tokens in your schema definitions. Test with real-world scenarios. If your server consumes more than 5,000 tokens at rest, you're overcharging your users.
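
A quick way to run that audit, assuming you can dump your server's tool definitions to JSON (for example, the result of a tools/list call); tiktoken's cl100k_base encoding is only a rough proxy, since exact counts are model-specific:

import json

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # approximate, model-agnostic count

def audit_static_footprint(tool_schemas: list[dict]) -> None:
    """Print a per-tool token estimate for a list of MCP tool definitions."""
    total = 0
    for tool in sorted(tool_schemas, key=lambda t: len(json.dumps(t)), reverse=True):
        n = len(enc.encode(json.dumps(tool)))
        total += n
        print(f"{tool.get('name', '?'):40s} ~{n:6d} tokens")
    print(f"{'TOTAL (static, at rest)':40s} ~{total:6d} tokens")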

For MCP Users: Demand better. Ask server maintainers about token efficiency. Vote with your usage—switch to servers that respect the context window.

As developers, we must treat the LLM's context window as a precious, expensive resource. By adopting the principles of consolidation, condensation, and control, we build the token-efficient MCP servers that truly unleash the power of the AI agent ecosystem—keeping both the conversation and the costs manageable.

The next wave of MCP innovation won't come from adding more tools—it'll come from teaching our servers to think lean. The question isn't how powerful your model is, but how respectful your context window can be.