MCPMark: A Benchmark That Finally Shows Real AI Agent Capabilities
What's Wrong with Existing Benchmarks?
All current MCP benchmarks are toys. They check simple tasks like "read some data" or run through a couple of interaction steps. In real life, agents need to handle complex multi-step operations, plan ahead, work with long context, and not fall apart halfway through. Existing benchmarks don't cover any of this.
What is MCPMark?
MCPMark is a suite of 127 real-world tasks that require full-fledged work across five different systems:
- Notion
- GitHub
- File system
- PostgreSQL
- Playwright
Tasks include all CRUD operations (Create, Read, Update, Delete) and are verified automatically through programmatic scripts. No subjective evaluation — either it works or it doesn't.
Each task starts with a prepared state (for example, a database with data or a GitHub repository with commit history), making it as close to real-world conditions as possible.
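To make the "programmatic verification" idea concrete, here is a hedged sketch of what such a check could look like for a PostgreSQL task. The task, table names, and connection string are made up for illustration; the real MCPMark verifiers live in the benchmark itself.

```python
import sys
import psycopg  # psycopg 3; the actual harness may use a different driver

def verify(conninfo: str) -> bool:
    """Hypothetical check for a task like 'archive all orders created before 2023'."""
    with psycopg.connect(conninfo) as conn:
        # Nothing stale should remain in the live table...
        stale = conn.execute(
            "SELECT count(*) FROM orders "
            "WHERE created_at < '2023-01-01' AND archived = false"
        ).fetchone()[0]
        # ...and the archive table should have actually received rows.
        archived = conn.execute("SELECT count(*) FROM orders_archive").fetchone()[0]
    return stale == 0 and archived > 0

if __name__ == "__main__":
    ok = verify("dbname=mcpmark_task user=postgres")
    print("PASS" if ok else "FAIL")
    sys.exit(0 if ok else 1)
```

Either the final state matches the expectation and the script exits 0, or it doesn't. No rubric, no LLM-as-judge.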
How is Quality Measured?
The authors use a minimalist framework called MCPMark-Agent that runs every model in the same standard tool-calling loop, which keeps the comparison fair.
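For intuition, a standard tool-calling loop fits in a few lines. This is a sketch, not the actual MCPMark-Agent code: `model` and `tools` below are hypothetical stand-ins for an LLM chat client and a set of MCP tool wrappers.

```python
from typing import Any, Callable

def run_agent(
    model: Callable[[list[dict]], dict],    # stand-in for an LLM chat client
    tools: dict[str, Callable[..., Any]],   # stand-in for MCP tool wrappers
    task_prompt: str,
    max_steps: int = 50,
) -> tuple[list[dict], int]:
    """Drive the model in a plain observe -> decide -> act loop."""
    messages = [{"role": "user", "content": task_prompt}]
    for step in range(1, max_steps + 1):
        reply = model(messages)             # model returns a tool call or a final answer
        messages.append(reply)
        call = reply.get("tool_call")
        if call is None:                    # no tool call means the agent thinks it's done
            return messages, step
        result = tools[call["name"]](**call["arguments"])   # execute the MCP tool
        messages.append({"role": "tool", "name": call["name"], "content": str(result)})
    return messages, max_steps              # step budget exhausted
```

Every model gets the same loop, the same tools, and the same step budget; only the model changes between runs.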
Three key metrics (the two pass metrics are sketched in code right after this list):
- pass@1 — percentage of tasks solved on the first try
- pass^4 — percentage of tasks the model solves in all 4 independent runs (this is hardcore)
- Average number of steps and tool calls — shows efficiency
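To see how the two pass metrics pull apart, here is a small sketch assuming a results matrix of per-task, per-run booleans (4 runs per task). This is one straightforward way to compute them, not the paper's exact evaluation code.

```python
def pass_at_1(results: list[list[bool]]) -> float:
    """Average single-run success rate across all runs of all tasks."""
    runs = [r for task in results for r in task]
    return sum(runs) / len(runs)

def pass_pow_k(results: list[list[bool]]) -> float:
    """Fraction of tasks solved in every one of the k runs."""
    return sum(all(task) for task in results) / len(results)

# Example: 3 tasks, 4 runs each.
results = [
    [True, True, True, True],      # solved every time
    [True, False, True, False],    # flaky
    [False, False, False, False],  # never solved
]
print(pass_at_1(results))   # 0.5
print(pass_pow_k(results))  # 0.333... — flakiness is punished hard
```

A flaky agent can look fine on pass@1 and still collapse on pass^4, which is exactly the gap the results below show.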
Results (spoiler: it's bad)
Even top models are struggling:
- GPT-5-medium: 52.56% pass@1, 33.86% pass^4
- Claude Sonnet 4: <30% pass@1, <15% pass^4
- o3: <30% pass@1, <15% pass^4
On average, one task takes 16.2 steps and 17.4 tool calls. This is genuinely difficult.
The huge gap between pass@1 and pass^4 reveals the main problem: agents are unpredictable and unstable.
Why Does This Matter for Us?
Atomic measurements instead of E2E: just as in RAG systems, where we measure retrieval, reranking, and answer generation separately, we can now measure the quality of each tool-using step an agent takes.
Realistic evaluation: Finally, a benchmark that shows real model capabilities, not demo scenarios.
Problem identification: The results clearly show that agents have massive issues with multi-step tasks and stability.
MCPMark is what the industry needs. Hopefully, we'll see more atomic benchmarks like this that help understand where exactly agents fail and how to fix it.
More MCP Servers: FastMCP