Agentic Benchmarks
Explore definitive, multi-step trajectory benchmarks. Track how frontier models perform on complex, real-world workflows across specialized domains.
Highlighted benchmarks
Objective metrics across 5 specialized domains
SWE-bench
The industry standard for multi-step agentic coding. Models navigate codebases to resolve real GitHub issues, and each fix is verified by running the repository's unit tests.
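As a rough illustration of test-based verification, the Python sketch below counts an issue as resolved only when the tests that failed before the fix now pass and the tests that already passed still pass. The FAIL_TO_PASS and PASS_TO_PASS groupings follow SWE-bench's published terminology; the function name, status strings, and test IDs are assumptions made for illustration and are not the official harness.

```python
from typing import Dict, Iterable


def is_resolved(
    statuses: Dict[str, str],
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
) -> bool:
    """Return True only if every required test is reported as passed.

    statuses maps a test ID to a status string (hypothetical format).
    A test missing from statuses is treated as not passed.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(statuses.get(test_id) == "passed" for test_id in required)


# Hypothetical test report after applying a model-generated patch.
report = {
    "test_bug_is_fixed": "passed",     # was failing before the patch
    "test_existing_feature": "passed",  # must not regress
}

resolved = is_resolved(
    report,
    fail_to_pass=["test_bug_is_fixed"],
    pass_to_pass=["test_existing_feature"],
)
```

A single regression or a single still-failing target test is enough to mark the issue unresolved, which is why partial fixes earn no credit under this kind of check.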
Legal Agent Benchmark (LAB)
Tests agents on open-ended assignments like M&A data room reviews. Agents must read, cross-reference, and draft documents over long horizons without hallucinating, and they are scored under a strict all-pass metric.
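To make the all-pass idea concrete, here is a minimal Python sketch. It assumes each assignment is graded against a list of boolean checks and earns credit only if every check passes; the function names and the sample data are hypothetical and are not taken from LAB itself.

```python
from typing import Dict, List


def all_pass(checks: List[bool]) -> bool:
    """An assignment counts only if it has checks and every one passed."""
    return len(checks) > 0 and all(checks)


def all_pass_rate(results: Dict[str, List[bool]]) -> float:
    """Fraction of assignments in which every check passed."""
    if not results:
        return 0.0
    passed = sum(all_pass(checks) for checks in results.values())
    return passed / len(results)


# Hypothetical grading output for three assignments.
sample = {
    "data_room_review": [True, True, True],
    "contract_summary": [True, False, True],  # one miss fails the whole task
    "disclosure_draft": [True, True],
}

rate = all_pass_rate(sample)
```

Under this assumed scoring, an agent that gets most checks right on every assignment can still score zero, so the metric rewards consistency over partial credit.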
DABstep (Data Agent Bench)
Tests multi-step reasoning by asking agents to query structured databases and read unstructured financial manuals to solve payments, fraud, and financial-analysis use cases.
WebArena
Drops an agent into a sandboxed web browser to perform complex, multi-page STEM and IT workflows. Success is verified objectively by checking the final application state.
MedAgentBench v2
Evaluates multi-step, clinically driven workflows inside a FHIR-compliant Electronic Health Record (EHR) sandbox, requiring tool use and deep diagnostic reasoning.