AI & Technology

Why Your MCP Servers Are Overcharging Your LLM (and How to Fix It)

MCP servers can consume 15,000+ tokens just loading definitions before doing any work. Discover the two sources of Context Bloat—static definitions and dynamic results—and learn design principles for building token-efficient MCP servers that respect the context window.

MCP · LLM · AI · Token Optimization · Model Context Protocol · Software Development

As an advocate for Model Context Protocol (MCP) servers, I've watched this ecosystem evolve with both excitement and frustration. Servers like Context7 have unlocked new potential for LLMs, turning static assistants into capable agents. Yet lately, the whispers have grown into a roar: "MCP is burning through my tokens."

Here's the scale of the problem: A typical MCP server with 20 documentation search tools can consume 15,000 tokens just loading definitions—before doing any actual work. Add a single verbose tool result (say, a full code file or API response), and you've blown another 10,000 tokens. For context, that's roughly £0.03 per request with Claude Sonnet 4.5, and it compounds fast.

My recent experience updating my own tools in the Magic MCP suite confirmed this. There's a critical flaw in many early MCP implementations: Context Bloat. The problem isn't the protocol; it's poor implementation that has turned our powerful tools into token-hungry monsters.

This token consumption comes from two main sources: Static Context Bloat (the cost of definitions) and Dynamic Context Bloat (the cost of results). Here's the breakdown and the design principles driving the next generation of efficient MCP servers.

1. Eliminating Static Context Bloat: The Art of Condensation

The single biggest cost isn't the work an agent does—it's the context it must carry. This is the Static Context Bloat caused by eagerly loading every tool definition into the LLM's context window.

Schema Condensation: Every Word Is a Token

MCP schemas are instruction manuals for the LLM. In early designs, they were verbose and redundant. The solution is aggressive schema trimming:

Concise Language: Use minimal parameter names and descriptions. Replace lengthy prose like "A value between 1 and 200, with a default of 50" with structured JSON Schema properties: minimum: 1, maximum: 200, default: 50. The LLM infers intent from structure.
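
As a rough illustration, here's a hypothetical "limit" parameter expressed both ways, written as Python dicts in the shape the JSON Schema would take (the names are examples, not taken from any particular server):

# Before: the constraint lives in prose the LLM must read word by word.
limit_verbose = {
    "type": "integer",
    "description": "A value between 1 and 200, with a default of 50. "
                   "Controls the maximum number of results returned.",
}

# After: the same constraint expressed structurally, with a minimal description.
limit_condensed = {
    "type": "integer",
    "description": "Max results.",
    "minimum": 1,
    "maximum": 200,
    "default": 50,
}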

This principle was key in updating our Magic MCP tools, where we turned verbose oneOf descriptions into clean enum lists combined with a single, punchy description.

Data Consolidation: Fewer Tools, More Power

Stop flooding the agent with dozens of similar tool definitions. Tool Group Abstraction is the key:

Before (bloated):

  • search_context7_docs(query: str)
  • search_github_docs(query: str)
  • search_stackoverflow_docs(query: str)
  • search_mdn_docs(query: str)
  • ... (16 more similar tools)

After (consolidated):

  • search_documentation(query: str, source: str)
    # source: enum ["context7", "github", "stackoverflow", "mdn", ...]

In this example, consolidating 20 definitions into one cuts the static token footprint by roughly 90% and simplifies the agent's decision-making.
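
Here's a minimal sketch of the consolidated shape, assuming the FastMCP helper from the official MCP Python SDK; the tool name, source list, and stubbed dispatch are illustrative rather than any specific server's implementation:

from typing import Literal

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("docs-search")

@mcp.tool()
def search_documentation(
    query: str,
    source: Literal["context7", "github", "stackoverflow", "mdn"] = "context7",
    limit: int = 10,
) -> str:
    """Search the chosen documentation source and return the top matches."""
    # A real server would dispatch to the chosen backend's search API here;
    # this stub just echoes the routing decision.
    return f"[{source}] top {limit} results for: {query}"

if __name__ == "__main__":
    mcp.run()

The Literal hint should surface as a compact enum in the generated schema, so one definition carries the routing knowledge that previously cost twenty near-identical schemas.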

GraphQL/Query API Pattern

Move away from the 1:1 mapping of API endpoint to MCP tool. Instead of:

  • getUser(user_id: str)
  • getUserPermissions(user_id: str)
  • getUserOrders(user_id: str, limit: int)
  • getUserProfile(user_id: str)

Consolidate into:

  • query(query_string: str)
    # Accepts GraphQL-style queries like:
    # "user(id: '123') { name, permissions, orders(limit: 5) { id, total } }"

One definition describes all data access. Massive static token savings.
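
As a sketch of how the single tool can work end-to-end, here's a toy version using the graphql-core library, with a miniature schema and in-memory data standing in for the real backend (all names and records are illustrative):

from graphql import build_schema, graphql_sync

# A miniature schema covering what the four separate tools used to expose.
schema = build_schema("""
  type Order { id: ID!  total: Float! }
  type User {
    name: String!
    permissions: [String!]!
    orders(limit: Int = 10): [Order!]!
  }
  type Query { user(id: ID!): User }
""")

# Toy data; a real server would resolve these fields against its backing API.
USERS = {
    "123": {
        "name": "John Smith",
        "permissions": ["read", "write"],
        # Callable field: graphql-core's default resolver passes the field
        # arguments through, so orders(limit: 1) is honoured here.
        "orders": lambda info, limit=10: [
            {"id": "ORD-001", "total": 42.0},
            {"id": "ORD-002", "total": 13.5},
        ][:limit],
    }
}

def query(query_string: str) -> dict:
    """The single MCP tool: execute the query, return only what was asked for."""
    result = graphql_sync(schema, query_string,
                          root_value={"user": lambda info, id: USERS.get(id)})
    return result.data

print(query('{ user(id: "123") { name orders(limit: 1) { id } } }'))
# -> {'user': {'name': 'John Smith', 'orders': [{'id': 'ORD-001'}]}}

The same tool also gives you field selection for free, which is exactly the lever used in the next section to tame result size.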

2. Taming Dynamic Context Bloat: Controlling the Tsunami

Even if the static definitions are lean, Dynamic Context Bloat from massive tool results can instantly blow up the token count. This is the biggest source of overcharging: a single tool call that returns a full code file or a verbose API response.

Result Control at the Source

For tools returning large datasets, implement capping and slicing:

Pagination & Filtering: All list and search operations must implement explicit limits and filtering parameters (limit, offset, since_date). The agent retrieves a small, manageable slice rather than the full dataset, requesting more pages only if necessary.

GraphQL Field Selection

Let the agent request only the specific fields it needs. Instead of returning:

{
  "user": {
    "id": "123",
    "name": "John Smith",
    "email": "john@example.com",
    "address": { "street": "...", "city": "...", ... },
    "order_history": [ /* 50 orders */ ],
    "preferences": { /* nested config */ }
  }
}

The agent queries for:

user(id: "123") { name, orders(limit: 1) { id } }

And receives only:

{
  "user": {
    "name": "John Smith",
    "orders": [{ "id": "ORD-001" }]
  }
}

This prevents verbose, nested JSON from entering the context window.

Result Summarisation: The Server's Second Job

For unstructured data retrieval—reading a large code file or transcribing a meeting—the server has a crucial second job: summarisation.

A sophisticated MCP server won't return a 50,000-token document. It uses a server-side, cheaper LLM (or specialised utility) to return a structured, concise summary: 5 bullet points of code changes, key meeting decisions, or document structure. This prevents intermediate tool result bloat, allowing the main agent to see only relevant information without the raw data's token cost.
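
One possible shape for that second job, assuming the Anthropic Python SDK and a cheap Haiku-class model; the tool name, threshold, and model alias are assumptions, not a prescribed implementation:

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def read_file_summarised(path: str, max_summary_tokens: int = 500) -> str:
    """Hypothetical tool body: return a condensed summary, not the raw file."""
    raw = open(path, encoding="utf-8", errors="replace").read()

    # Small files aren't worth a summarisation round-trip; pass them through.
    if len(raw) < 4_000:
        return raw

    # Delegate condensation to a cheaper model server-side, so only the
    # summary ever reaches the main agent's context window.
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # assumed alias for a cheap model
        max_tokens=max_summary_tokens,
        system="Summarise this file in at most 5 bullet points: its purpose, "
               "key functions or sections, and anything unusual.",
        messages=[{"role": "user", "content": raw[:200_000]}],
    )
    return response.content[0].text

The main agent still gets what it needs to decide its next step; the 50,000-token original never enters its context.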

The Trade-Off: Complexity vs Efficiency

There's a legitimate concern here: consolidated tools mean the LLM must construct more complex queries. Does this actually work reliably?

In practice, yes—modern LLMs like Claude handle structured query construction remarkably well, especially when tool definitions include clear examples. The token savings (often 80-90% reduction in static bloat) far outweigh the occasional need for query refinement. More importantly, the cognitive load on the LLM actually decreases when it sees 5 well-designed tools instead of 50 redundant ones.

The Path Forward: Build Efficient or Get Left Behind

The token crisis is a growing pain, not a fatal flaw of the Model Context Protocol. Current concerns stem from early, un-optimised implementations. The future of MCP is efficiency-first design.

For MCP Server Developers: Audit your tools now. Count the tokens in your schema definitions. Test with real-world scenarios. If your server consumes more than 5,000 tokens at rest, you're overcharging your users.
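
A quick way to run that audit, assuming you can dump your server's tool definitions to JSON (for example, the result of a tools/list call); tiktoken's cl100k_base encoding is only a rough proxy, since exact counts are model-specific:

import json

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # approximate, model-agnostic count

def audit_static_footprint(tool_schemas: list[dict]) -> None:
    """Print a per-tool token estimate for a list of MCP tool definitions."""
    total = 0
    for tool in sorted(tool_schemas, key=lambda t: len(json.dumps(t)), reverse=True):
        n = len(enc.encode(json.dumps(tool)))
        total += n
        print(f"{tool.get('name', '?'):40s} ~{n:6d} tokens")
    print(f"{'TOTAL (static, at rest)':40s} ~{total:6d} tokens")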

For MCP Users: Demand better. Ask server maintainers about token efficiency. Vote with your usage—switch to servers that respect the context window.

As developers, we must treat the LLM's context window as a precious, expensive resource. By adopting the principles of consolidation, condensation, and control, we build the token-efficient MCP servers that truly unleash the power of the AI agent ecosystem—keeping both the conversation and the costs manageable.

The next wave of MCP innovation won't come from adding more tools—it'll come from teaching our servers to think lean. The question isn't how powerful your model is, but how respectful your context window can be.