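Bad MCP design makes your agent burn 5x the tokens
I recently tested two MCP servers with identical functionality, and one of them performed far worse than the other. Here are the MCP design patterns that caused the gap.
It started when I wrote an MCP server (MCP-A) for a to-do list app. Later, the app's vendor released its own official MCP server (MCP-B). Both servers expose the same functionality and call the same backend API.
The experiment was set up as follows:
- Both MCP servers connect to the same to-do account, which is reset after each test.
- 40 test prompts simulate typical use cases for these servers.
- Every run uses the same model, system prompt, and agent framework.
Here are the results: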
| Metric                  | MCP-A       | MCP-B       | Gap (B/A) |
| ----------------------- | ----------- | ----------- | --------- |
| Tool description length | 11,464      | 3,682       | n/a       |
| Pass rate               | 36/40 (90%) | 36/40 (90%) | Same      |
| Total input tokens      | 637,244     | 3,174,329   | 4.98×     |
| Total output tokens     | 17,301      | 23,238      | 1.34×     |
| Total agent steps       | 122         | 157         | 1.29×     |
| Total time              | 597 s       | 676 s       | 1.13×     |
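MCP-B needed 35 more ReAct loops than MCP-A to complete the same 40 test cases (157 vs. 122 steps, about 29% more), and it produced about 34% more output tokens. I examined the logs and found that the root cause is poor query tool design.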
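As a quick check, the gap column can be recomputed from the raw totals. This is only arithmetic on the numbers in the table above; the dictionary keys and the `gap` helper are names I made up for the sketch.

```python
# Totals copied from the results table: (MCP-A, MCP-B).
results = {
    "input_tokens": (637_244, 3_174_329),
    "output_tokens": (17_301, 23_238),
    "agent_steps": (122, 157),
    "time_seconds": (597, 676),
}


def gap(a: float, b: float) -> float:
    """Return how many times larger MCP-B's total is than MCP-A's."""
    return round(b / a, 2)


gaps = {name: gap(a, b) for name, (a, b) in results.items()}
extra_steps = results["agent_steps"][1] - results["agent_steps"][0]

for name, value in gaps.items():
    print(f"{name}: {value}x")
print(f"extra agent steps: {extra_steps}")
```

The step count grows by roughly 1.29× while input tokens grow by 4.98×, so the extra round-trips alone cannot account for the input token gap.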
Take `search_tool` as an example. Its job is to find a to-do item in the list. In MCP-B, it returns this:
{
  "id": "6a1916b48f08cb3a4c857ed0",
  "title": "buy some groceries",
  "url": "https://todo.example.com/tasks/6a1916b48f08cb3a4c857ed0"
}
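But the other CRUD operations require `project_id`, which `search_tool` does not return, so the agent has to make a second call to `get_task_by_id`. MCP-A's `query_tasks`, by contrast, returns everything needed for the next action in a single call: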
Task 1:
ID: 6a19143e8f084a8c8101612f
Title: buy some groceries
Project ID: 6a1914378f084a8c810160a9
Start Date: 2025-07-19 10:00:00
Priority: Medium
Status: Active
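Here is a minimal sketch of how a query tool could render results in that shape. It is not MCP-A's actual code. The function names are hypothetical, and apart from `id`, `title`, and `projectId`, the backend field names (`startDate`, `priority`, `status`) are assumptions.

```python
from typing import Any


def format_task(index: int, task: dict[str, Any]) -> str:
    """Render one task with the IDs a follow-up CRUD call would need."""
    lines = [
        f"Task {index}:",
        f"ID: {task['id']}",
        f"Title: {task['title']}",
        f"Project ID: {task['projectId']}",
    ]
    # Optional fields (assumed names) are shown only when present.
    optional = (
        ("Start Date", "startDate"),
        ("Priority", "priority"),
        ("Status", "status"),
    )
    for label, key in optional:
        if task.get(key) is not None:
            lines.append(f"{label}: {task[key]}")
    return "\n".join(lines)


def format_search_results(tasks: list[dict[str, Any]]) -> str:
    """Render every match, or a short message when nothing matched."""
    if not tasks:
        return "No matching tasks found."
    return "\n\n".join(format_task(i, t) for i, t in enumerate(tasks, start=1))
```

The point is that the project ID travels with every search hit, so the agent can update or delete the task without a lookup call in between.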
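Unfiltered API data dumped into the context window
If an MCP server passes raw API results into the agent's context unprocessed, the context window fills up very quickly. In a typical agent loop, every tool result stays in the conversation history and is sent again as input on each later step.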
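To see why result size matters so much, here is a rough model of a ReAct-style loop in which each model call re-sends the whole history. All numbers below are hypothetical, and the model ignores the agent's own output and any prompt caching, so treat it as an illustration rather than a reconstruction of the benchmark.

```python
def total_input_tokens(steps: int, base_tokens: int, result_tokens: int) -> int:
    """Sum the input tokens billed over a whole agent run.

    base_tokens: system prompt plus tool descriptions, sent on every step.
    result_tokens: size of the tool result appended after each step.
    Closed form: steps * base_tokens + result_tokens * steps * (steps - 1) / 2.
    """
    total = 0
    history = 0  # tool results accumulated from earlier steps
    for _ in range(steps):
        total += base_tokens + history
        history += result_tokens
    return total


# Hypothetical sizes: a filtered result vs. a raw API dump.
lean = total_input_tokens(steps=4, base_tokens=1000, result_tokens=50)
raw = total_input_tokens(steps=4, base_tokens=1000, result_tokens=200)
print(lean, raw)
```

Under this model the cost of a tool result grows with the number of steps that follow it, so a verbose response early in a run is paid for repeatedly.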
Take MCP-B's `create_task` tool as an example. Its job is to create a to-do item, and this is what it returns:
{
"id": "6a180de78f086bdead0608be",
"projectId": "inbox125587327",
.....
"createdTime": "2026-05-28T09:41:59+0000",
"modifiedTime": "2026-05-28T09:41:59+0000",
"focusSummaries": null
}
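These 600+ characters are irrelevant to the agent's task, yet they are still dumped into its context. MCP-A's `create_tasks`, by contrast, filters and formats the response first. This small change makes a large difference in input token usage.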
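A filtering layer like that can be very small. The sketch below is my own illustration, not MCP-A's implementation: the whitelist `KEEP_FIELDS`, the helper name, and the `title` field in the sample are assumptions, while the other sample values come from the response shown above.

```python
from typing import Any

# Hypothetical whitelist: only what the agent needs for its next action.
KEEP_FIELDS = ("id", "projectId", "title")


def summarize_created_task(raw: dict[str, Any]) -> str:
    """Reduce a raw create-task API response to a one-line confirmation."""
    kept = {key: raw[key] for key in KEEP_FIELDS if raw.get(key) is not None}
    details = ", ".join(f"{key}={value}" for key, value in kept.items())
    return f"Task created: {details}"


# Illustrative response; the real one contains many more fields.
raw_response = {
    "id": "6a180de78f086bdead0608be",
    "projectId": "inbox125587327",
    "title": "buy some groceries",  # assumed field, not shown in the excerpt
    "createdTime": "2026-05-28T09:41:59+0000",
    "modifiedTime": "2026-05-28T09:41:59+0000",
    "focusSummaries": None,
}
print(summarize_created_task(raw_response))
```

A whitelist is the safer default here: new backend fields stay out of the context until someone decides the agent needs them.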
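Another issue is tool count. More tools mean a larger candidate set for the model to choose from, which makes tool selection harder. In MCP-A, 47 tools were consolidated into 14 that cover the same functionality.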
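One common way to shrink a tool list is to merge several single-field tools into one tool with optional parameters. The sketch below assumes three hypothetical tools (`rename_task`, `set_priority`, `set_due_date`) folded into a single update tool. The post does not say which tools MCP-A merged, so the names and payload shape are purely illustrative.

```python
from typing import Any, Optional


def build_update_payload(
    task_id: str,
    project_id: str,
    *,
    title: Optional[str] = None,
    priority: Optional[str] = None,
    due_date: Optional[str] = None,
) -> dict[str, Any]:
    """Build one update request that covers what three separate tools did."""
    candidates = {"title": title, "priority": priority, "dueDate": due_date}
    # Send only the fields the agent actually wants to change.
    changes = {key: value for key, value in candidates.items() if value is not None}
    if not changes:
        raise ValueError("at least one field to change is required")
    return {"id": task_id, "projectId": project_id, **changes}


print(build_update_payload("t1", "p1", priority="high"))
```

The model then picks among fewer tools and expresses the variation through arguments instead.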
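So here are my takeaways on good MCP tool design:
- When designing a tool, think about what the agent will need next, not just what it is asking for right now. Return enough context in the result that the agent can take the next action without another round-trip.
- Too many tools increase the model's decision burden, so keep the number of tools in an MCP server small and make sure their functionality does not overlap.
- When your MCP server returns data to the LLM, keep it LLM-friendly, meaning readable. Filter unnecessary fields out of the API response and format the data rather than passing raw JSON through.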
All the tests above were run with MCP-Eval, an MCP server benchmarking tool. If you want to check your own MCP server's performance, feel free to try it:
https://github.com/Code-MonkeyZhang/mcp-eval