Bad MCP design makes your agent burn 5x the tokens.

2 | Author: JohnnyZhang48325 days ago | Original post
I recently tested two MCP servers with identical functionality, and one of them turned out to perform much worse than the other. I want to share the MCP design patterns that caused this.

It all started when I wrote an MCP server (MCP-A) for a to-do list app. Later, the app officially released its own MCP server (MCP-B). Both MCPs offer the same functionality and hit the same backend API.

The experiment was set up as follows:

- Both MCP servers connect to the same to-do list account, which is reset after each test.
- 40 test prompts simulate typical use cases for these MCPs.
- All tests used the same model, system prompt, and agent framework.

Here are the results:

| Metric              | MCP-A       | MCP-B       | Gap   |
| ------------------- | ----------- | ----------- | ----- |
| Tool desc length    | 11,464      | 3,682       | -     |
| Pass rate           | 36/40 (90%) | 36/40 (90%) | Same  |
| Total input tokens  | 637,244     | 3,174,329   | 4.98× |
| Total output tokens | 17,301      | 23,238      | 1.34× |
| Total agent steps   | 122         | 157         | 1.29× |
| Total time          | 597s        | 676s        | 1.13× |

---

The results show that MCP-B took 35 more ReAct loops than MCP-A (157 vs. 122 agent steps) to complete the 40 test cases, which also means about 34% more output tokens. I examined the logs and found that the root cause is poor query tool design.

Take `search_tool`, for example. Its job is to find a to-do item in the to-do list. In MCP-B, this tool returns:

    {
      "id": "6a1916b48f08cb3a4c857ed0",
      "title": "buy some groceries",
      "url": "https://todo.example.com/tasks/6a1916b48f08cb3a4c857ed0"
    }

But other CRUD operations require `project_id`, and `search_tool` doesn't return it, so the agent has to call another tool, `get_task_by_id`, just to look it up. MCP-A's `query_tasks`, on the other hand, returns everything needed for the next action in a single call:

    Task 1:
    ID: 6a19143e8f084a8c8101612f
    Title: buy some groceries
    Project ID: 6a1914378f084a8c810160a9
    Start Date: 2025-07-19 10:00:00
    Priority: Medium
    Status: Active

Unfiltered API data was dumped into the context window

If an MCP returns raw API results to the agent unprocessed, the agent's context window fills up very fast.

Take MCP-B's `create_task` tool, for example. Its job is to create a to-do item. This is what it returns:

    {
      "id": "6a180de78f086bdead0608be",
      "projectId": "inbox125587327",
      .....
      "createdTime": "2026-05-28T09:41:59+0000",
      "modifiedTime": "2026-05-28T09:41:59+0000",
      "focusSummaries": null
    }

These 600+ characters are irrelevant to the agent's task, yet they are still dumped into its context. MCP-A's `create_tasks`, on the other hand, filters and formats the response first. This small tweak makes a huge difference in input token usage.

Another issue is tool count. More tools mean a larger candidate set for the model to choose from, which directly makes each decision harder. In MCP-A, 47 tools were compressed down to 14, covering the same functionality with fewer tools.

---

So here are my takeaways on good MCP tool design:

- When designing a tool, think about what the agent will need next, not just what it's asking for right now. Return enough context in the result so the agent can take the next action without another round trip.
- Too many tools increase the model's decision burden. Keep the number of tools in an MCP as small as you can, and make sure their functionality doesn't overlap.
- When your MCP returns data to the LLM, keep it LLM-friendly, which means readable. Filter out unnecessary fields from the API response and format the data, rather than passing raw JSON through.

---

All the tests above were run with MCP-Eval, an MCP server benchmarking tool. If you want to check your own MCP's performance, feel free to try it.

https://github.com/Code-MonkeyZhang/mcp-eval
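
For anyone who wants to double-check the "Gap" column in the results table, the ratios can be recomputed directly from the totals. This is just arithmetic on the numbers reported above; the variable names are mine.

```python
# Totals copied from the results table, as (MCP-A, MCP-B).
results = {
    "input tokens": (637_244, 3_174_329),
    "output tokens": (17_301, 23_238),
    "agent steps": (122, 157),
    "time (s)": (597, 676),
}

# Gap = MCP-B total divided by MCP-A total.
gaps = {name: b / a for name, (a, b) in results.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap:.2f}x")

# Extra ReAct loops that MCP-B needed over the 40 test cases.
extra_steps = results["agent steps"][1] - results["agent steps"][0]
print("extra agent steps:", extra_steps)
```

The step count grows by roughly 29%, while input tokens grow by almost 5x, which is consistent with the point that each extra round trip also re-sends an increasingly bloated context.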
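
To make the "filter and format" takeaway concrete, here is a minimal sketch of trimming a raw API response before handing it to the agent. It is not the actual MCP-A code. The helper `format_task`, the `KEEP_FIELDS` mapping, and the raw field names `title`, `startDate`, `priority` and `status` are assumptions for illustration; only `id`, `projectId`, `createdTime`, `modifiedTime` and `focusSummaries` appear in the response shown above.

```python
import json

# Fields the agent needs for follow-up calls, mapped to readable labels.
# The raw key names are assumptions, not the real API schema.
KEEP_FIELDS = {
    "id": "ID",
    "title": "Title",
    "projectId": "Project ID",
    "startDate": "Start Date",
    "priority": "Priority",
    "status": "Status",
}


def format_task(raw: dict, index: int = 1) -> str:
    """Return a compact, readable summary of one task, dropping unused fields."""
    lines = [f"Task {index}:"]
    for key, label in KEEP_FIELDS.items():
        value = raw.get(key)
        if value is not None:  # skip missing or null fields
            lines.append(f"  {label}: {value}")
    return "\n".join(lines)


# Hypothetical raw API response, loosely modelled on the examples in this post.
raw = {
    "id": "6a19143e8f084a8c8101612f",
    "title": "buy some groceries",
    "projectId": "6a1914378f084a8c810160a9",
    "startDate": "2025-07-19 10:00:00",
    "priority": "Medium",
    "status": "Active",
    "createdTime": "2026-05-28T09:41:59+0000",
    "modifiedTime": "2026-05-28T09:41:59+0000",
    "focusSummaries": None,
}

summary = format_task(raw)
print(summary)
print(len(json.dumps(raw)), "chars raw vs", len(summary), "chars formatted")
```

The key design choice is an allow-list rather than a deny-list: new fields added to the backend API later stay out of the agent's context unless you explicitly opt them in, and the project ID the agent needs for its next call is always included.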
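
On the tool-count takeaway, one way to shrink the tool list is to merge several narrow lookup tools into a single query tool with optional filters. The sketch below shows the idea in plain Python with an in-memory stand-in for the backend. The parameter names, the sample data, and the narrow tool names mentioned in the docstring are all hypothetical; the post only confirms that MCP-A has a tool called `query_tasks`, not how it is implemented.

```python
from typing import Optional

# In-memory stand-in for the backend API (hypothetical data).
TASKS = [
    {"id": "t1", "title": "buy some groceries", "project_id": "p1", "status": "active"},
    {"id": "t2", "title": "pay rent", "project_id": "p1", "status": "completed"},
    {"id": "t3", "title": "buy stamps", "project_id": "p2", "status": "active"},
]


def query_tasks(
    keyword: Optional[str] = None,
    project_id: Optional[str] = None,
    status: Optional[str] = None,
) -> list[dict]:
    """One tool with optional filters, instead of separate tools such as
    search_by_keyword, list_by_project or list_completed (hypothetical names).
    Every result carries project_id, so no follow-up lookup is needed."""
    matches = TASKS
    if keyword is not None:
        matches = [t for t in matches if keyword.lower() in t["title"].lower()]
    if project_id is not None:
        matches = [t for t in matches if t["project_id"] == project_id]
    if status is not None:
        matches = [t for t in matches if t["status"] == status]
    return list(matches)


# Filters combine, so one call can replace a chain of narrower tools.
for task in query_tasks(keyword="buy", status="active"):
    print(task["id"], task["title"], task["project_id"])
```

Fewer tools with composable parameters give the model a smaller candidate set to pick from, at the cost of a slightly longer description per tool.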