
The Agentic Intelligence Report
Benchmark Board
Expanded interactive benchmark board with more models, more metrics, lab filters, tier filters, and sortable columns.
Evaluation Surface
Scores, lab movement, and evaluation pressure translated into one clean observatory instead of a stack of disconnected leaderboards.
Benchmark Digest
Subscribe for a daily digest of movement across models, labs, and practical operator signals instead of checking leaderboards manually.
Use the board to get a quick read on each model's strength profile, lab momentum, and category leadership. It is a good way to narrow the field before you spend time on your own workflow tests.
A model can win a benchmark and still be the wrong production choice if it is too slow, too expensive, too brittle with tools, or too hard to supervise. Leaderboards are directional, not dispositive.
A benchmark result matters more when it lines up with repeatable workflow gains, strong tool use, and practical reliability. One isolated score spike matters less than a broad pattern across hard tasks.
Take the top two candidates from the board, run a small task set that mirrors your real use case, and keep whichever model delivers the best blend of quality, speed, and review burden.
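A minimal sketch of that bake-off, assuming hypothetical `run_model` and `grade` helpers that you would replace with your real provider call and your own rubric; the blend weights and placeholder scores are illustrative only:

```python
import time

# Hypothetical helper: replace the body with your actual provider SDK call.
def run_model(model: str, prompt: str) -> str:
    return f"[{model}] placeholder answer for: {prompt}"

# Hypothetical rubric: replace with human grading or scripted checks.
def grade(output: str) -> tuple[float, float]:
    quality, review_burden = 7.0, 2.0  # placeholder 0-10 scores
    return quality, review_burden

TASKS = ["task 1 mirroring your real workflow", "task 2", "task 3"]
CANDIDATES = ["model-a", "model-b"]  # your top two picks from the board

def score(model: str) -> float:
    """Average a blend of quality, speed, and review burden across tasks."""
    total = 0.0
    for prompt in TASKS:
        start = time.perf_counter()
        output = run_model(model, prompt)
        latency = time.perf_counter() - start
        quality, review_burden = grade(output)
        # Illustrative blend: reward quality, penalize slowness and review load.
        total += quality - 0.1 * latency - 0.5 * review_burden
    return total / len(TASKS)

best = max(CANDIDATES, key=score)
print(f"keep: {best}")
```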
Last benchmark refresh: 2026-04-16 09:14 UTC
| Model | Overall | Reasoning | Coding | Agentic | Multimodal | Math | Science | Long Context | Cost Efficiency |
|---|---|---|---|---|---|---|---|---|---|
| GPT-5 | 98 | 99 | 98 | 98 | 99 | 97 | 96 | 96 | 82 |
| Claude Opus 4.1 | 97 | 98 | 95 | 97 | 95 | 94 | 95 | 95 | 80 |
| Gemini 2.5 Pro | 96 | 97 | 94 | 95 | 94 | 95 | 95 | 98 | 84 |
| o3 | 96 | 99 | 95 | 93 | 92 | 98 | 95 | 90 | 76 |
| GPT-4.1 | 95 | 96 | 96 | 94 | 96 | 93 | 92 | 94 | 84 |
| Grok 4 | 94 | 95 | 91 | 93 | 91 | 91 | 90 | 92 | 82 |
| Claude Sonnet 4 | 93 | 94 | 93 | 93 | 92 | 90 | 90 | 94 | 86 |
| Llama 4 Maverick | 91 | 91 | 88 | 88 | 87 | 89 | 88 | 93 | 94 |
| o4-mini | 91 | 94 | 90 | 89 | 90 | 92 | 89 | 88 | 93 |
| Claude Sonnet 3.7 | 90 | 92 | 91 | 90 | 90 | 88 | 88 | 92 | 87 |
| DeepSeek-R1 | 90 | 96 | 87 | 83 | 82 | 94 | 89 | 86 | 91 |
| Gemini 2.5 Flash | 90 | 89 | 87 | 87 | 88 | 86 | 86 | 93 | 94 |
| Qwen3 72B | 90 | 90 | 87 | 86 | 86 | 88 | 87 | 92 | 95 |
| DeepSeek-V3 | 89 | 90 | 88 | 84 | 83 | 90 | 86 | 89 | 96 |
| Mistral Large 2 | 89 | 89 | 86 | 85 | 84 | 86 | 85 | 90 | 92 |
| Gemini 2.0 Flash | 88 | 86 | 84 | 84 | 85 | 84 | 84 | 91 | 96 |
| Qwen2.5 72B | 88 | 88 | 85 | 83 | 84 | 87 | 85 | 90 | 95 |
| Nova Pro | 87 | 86 | 84 | 83 | 85 | 84 | 84 | 89 | 91 |
| Command R+ | 86 | 85 | 81 | 83 | 86 | 82 | 81 | 93 | 90 |
| Llama 3.3 70B | 86 | 85 | 82 | 81 | 80 | 84 | 83 | 88 | 95 |
| Command A | 85 | 84 | 80 | 82 | 85 | 80 | 80 | 91 | 92 |
| Mixtral 8x22B | 84 | 83 | 80 | 79 | 79 | 81 | 80 | 88 | 94 |
| Phi-4 | 84 | 84 | 80 | 79 | 80 | 84 | 79 | 82 | 97 |
| Yi-Lightning | 83 | 82 | 79 | 77 | 78 | 81 | 79 | 86 | 96 |
GPT-5 currently leads overall (98/100). Use this as a signal for broad capability, not a guarantee for your exact workflow.
Reasoning lead: GPT-5 and o3 (tied at 99). Coding lead: GPT-5 (98). Agentic lead: GPT-5 (98).
Llama 4 Maverick shows the strongest cost-to-capability balance right now (94/100 cost efficiency, 91/100 overall).
OpenAI has the highest average overall score (95.0) across 4 tracked models.
Gemini 2.5 Pro (98), GPT-5 (96), Claude Opus 4.1 (95) are strongest for long context. Choose these when your tasks depend on long documents or multi-step memory.
For non-technical selection: pick one top model for quality, one value model for cost, run your own 3-task test set, and keep whichever is more reliable on your real prompts.
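To make the cost-to-capability comparison above concrete, here is one way to rank a few rows from the board by a blended score; the 0.6/0.4 weights are an assumption for illustration, not the board's formula:

```python
# (model, overall, cost efficiency) rows copied from the board above.
BOARD = [
    ("GPT-5", 98, 82),
    ("Llama 4 Maverick", 91, 94),
    ("o4-mini", 91, 93),
    ("Qwen3 72B", 90, 95),
    ("DeepSeek-V3", 89, 96),
]

W_CAPABILITY, W_COST = 0.6, 0.4  # assumed weights; tune to your budget

def blended(row: tuple[str, int, int]) -> float:
    _, overall, cost = row
    return W_CAPABILITY * overall + W_COST * cost

for row in sorted(BOARD, key=blended, reverse=True):
    name, overall, cost = row
    print(f"{name:<18} overall={overall} cost={cost} blend={blended(row):.1f}")
# Under these weights Llama 4 Maverick ranks first (92.2), with Qwen3 72B
# close behind (92.0); shift the weights and the ordering changes.
```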
Average scores by lab to make ecosystem movement easier to read at a glance.
| Lab | Models | Avg Overall | Avg Reasoning | Avg Coding | Avg Agentic | Avg Cost Efficiency |
|---|---|---|---|---|---|---|
| OpenAI | 4 | 95.0 | 97.0 | 94.8 | 93.5 | 83.8 |
| xAI | 1 | 94.0 | 95.0 | 91.0 | 93.0 | 82.0 |
| Anthropic | 3 | 93.3 | 94.7 | 93.0 | 93.3 | 84.3 |
| Google | 3 | 91.3 | 90.7 | 88.3 | 88.7 | 91.3 |
| DeepSeek | 2 | 89.5 | 93.0 | 87.5 | 83.5 | 93.5 |
| Alibaba | 2 | 89.0 | 89.0 | 86.0 | 84.5 | 95.0 |
| Meta | 2 | 88.5 | 88.0 | 85.0 | 84.5 | 94.5 |
| Amazon | 1 | 87.0 | 86.0 | 84.0 | 83.0 | 91.0 |
| Mistral | 2 | 86.5 | 86.0 | 83.0 | 82.0 | 93.0 |
| Cohere | 2 | 85.5 | 84.5 | 80.5 | 82.5 | 91.0 |
| Microsoft | 1 | 84.0 | 84.0 | 80.0 | 79.0 | 97.0 |
| 01.AI | 1 | 83.0 | 82.0 | 79.0 | 77.0 | 96.0 |
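As a check on how the lab rows are derived, a short sketch that recomputes two lab averages from the per-model overall scores above; the grouping is a plain arithmetic mean, which matches the published figures:

```python
from statistics import mean

# Per-model overall scores copied from the board above, grouped by lab.
OVERALL_BY_LAB = {
    "OpenAI": {"GPT-5": 98, "o3": 96, "GPT-4.1": 95, "o4-mini": 91},
    "Google": {"Gemini 2.5 Pro": 96, "Gemini 2.5 Flash": 90, "Gemini 2.0 Flash": 88},
}

for lab, scores in OVERALL_BY_LAB.items():
    print(f"{lab}: {len(scores)} models, avg overall {mean(scores.values()):.1f}")
# OpenAI: 4 models, avg overall 95.0
# Google: 3 models, avg overall 91.3
```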
Expanded benchmark coverage (reasoning, coding, agentic use, multimodal, and scientific rigor).
| Benchmark | Category | Why It Matters |
|---|---|---|
| Aider Polyglot: Aider-style multi-language edit and repair workloads. | Agentic Coding | Useful for judging code-editing agents in real repo workflows. |
| AIME: competition-style math problems requiring multi-step symbolic reasoning. | Math | Tests high-end quantitative reasoning beyond template completion. |
| BigCodeBench: large coding benchmark focusing on executable correctness. | Coding | Highlights robust program synthesis beyond short snippets. |
| BrowseComp: compositional web-browsing question answering benchmark. | Agentic Retrieval | Tests retrieval strategy quality and source-grounded answers. |
| GPQA Diamond: graduate-level physics/biology/chemistry questions curated for high rigor. | Science | Shows scientific precision and resistance to shallow pattern-matching. |
| Humanity's Last Exam: cross-domain expert-level questions designed to stress deep reasoning. | Reasoning | Signals frontier-level reasoning under hard, unfamiliar prompts. |
| LiveCodeBench: time-split coding benchmark that reduces contamination. | Coding | Tracks real coding generalization against recent unseen tasks. |
| MATH-500: 500 high-difficulty mathematics problems for chain-of-thought models. | Math | Tracks consistency on advanced symbolic and proof-style reasoning. |
| MMLU-Pro: hardened MMLU variant with stronger distractors and broader topic spread. | General Intelligence | Good broad proxy for multi-domain factual and conceptual strength. |
| MMMU: massive multi-discipline multimodal benchmark. | Multimodal | Measures image+text reasoning quality at scale. |
| SWE-bench Verified: real GitHub issue resolution across production repositories. | Coding | Measures practical software engineering usefulness under constraints. |
| WebArena: long-horizon web tasks in realistic browser environments. | Agentic Browser Use | Evaluates planning and tool use in messy UI-heavy contexts. |
Guide Library
The board tells you who is leading. These guides tell you when those leads matter, when they do not, and how to translate scores into real decisions.
Guide: A practical guide to reading AI benchmarks without confusing leaderboard performance for real-world workflow value.
Guide: A practical framework for judging when a benchmark win signals real progress and when it is mostly narrative theater.
Research: A recurring research surface for the shifts that matter in agent workflows, covering orchestration, evaluation, coding agents, tool use, and where real operator behavior is moving.