Benchmarks

Explore the popular benchmarks used to evaluate AI model capabilities across different domains.

Mathematical Reasoning

AIME (2024)

Measures mathematical problem-solving ability at the American Invitational Mathematics Examination level

Stats:15 Qs in AIME I, 15 Qs in AIME II
Examples:Algebra, geometry, number theory, combinatorics

AIME (2025)

Measures mathematical problem-solving ability at the American Invitational Mathematics Examination level

Stats:15 Qs in AIME I, 15 Qs in AIME II
Examples:Algebra, geometry, number theory, combinatorics
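
Because every AIME answer is an integer from 0 to 999, model responses on these sets are usually graded by exact match on the extracted final answer. A minimal sketch of that grading, assuming the model states its final answer as a plain integer (real harnesses often require a specific format such as \boxed{}):

```python
import re

def extract_final_integer(response: str) -> int | None:
    """Return the last 1-3 digit integer mentioned in the response.
    Assumption: the model states its final answer as a plain integer."""
    matches = re.findall(r"\b\d{1,3}\b", response)
    return int(matches[-1]) if matches else None

def grade_aime(response: str, gold: int) -> bool:
    """Exact-match grading; AIME answers are integers in [0, 999]."""
    pred = extract_final_integer(response)
    return pred is not None and pred == gold

print(grade_aime("Summing the three cases gives a total of 204.", 204))  # True
```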

HMMT (Feb 2025)

High school math competition - Harvard-MIT Mathematics Tournament

Stats:30 Qs
Examples:Algebra & Number Theory, Combinatorics, Geometry

MATH

General mathematical reasoning capabilities across various domains

Stats:12,500 competition problems across 5 difficulty levels (basic to advanced)
Examples:Includes arithmetic, algebra, calculus, and mathematical proofs

MATH-500

A 500-problem subset of the MATH benchmark, created by OpenAI for their "Let's Verify Step by Step" paper.

Stats:500 problems
Examples:Includes arithmetic, algebra, calculus, and mathematical proofs

Programming & Software Engineering

Codeforces

Algorithmic and competitive programming skills assessment

Stats:Problems range from 800 to 3500 rating
Examples:Data structures, algorithms, optimization problems

HumanEval

Evaluates code generation capabilities on practical programming tasks

Stats:164 hand-written programming problems
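
HumanEval results are typically reported as pass@k: draw n completions per problem, count the c that pass the unit tests, and apply the unbiased estimator from the original Codex paper. A small sketch of that estimator:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that at
    least one of k samples drawn from n generations passes, given c passed."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 200 samples for one problem, 23 of which pass the unit tests
print(round(pass_at_k(n=200, c=23, k=1), 3))   # 0.115
print(round(pass_at_k(n=200, c=23, k=10), 3))  # much higher
```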

SWE Bench Verified

Real-world software engineering tasks from GitHub issues on Python repositories

Stats:500 samples from GitHub issues
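
Each SWE-bench Verified instance pairs a GitHub issue with a repository snapshot; a model-generated patch counts as resolving the issue only if the instance's designated failing tests (FAIL_TO_PASS) pass after the patch is applied. A simplified sketch of that check, assuming the repo is already checked out locally (the official harness runs each instance in a Docker container):

```python
import subprocess

def evaluate_instance(repo_dir: str, model_patch: str, fail_to_pass: list[str]) -> bool:
    """Apply the model-generated diff, then re-run the tests that the gold
    patch is known to fix. Simplified sketch, not the official harness."""
    apply = subprocess.run(
        ["git", "apply", "-"], input=model_patch, text=True, cwd=repo_dir
    )
    if apply.returncode != 0:
        return False
    # The instance is "resolved" only if every FAIL_TO_PASS test now passes
    result = subprocess.run(["python", "-m", "pytest", *fail_to_pass], cwd=repo_dir)
    return result.returncode == 0
```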

LiveCodeBench v5

Holistic and contamination-free evaluation of code generation, built from problems continuously collected from recent contests on LeetCode, AtCoder, and Codeforces

Stats:v5 has 166 problems for Code Generation

Aider Polyglot Diff

Code editing and debugging tasks across C++, Go, Java, JavaScript, Python, and Rust.

Stats:225 challenging Exercism coding problems

Language & Knowledge

GPQA Diamond

Challenging MCQs in physics, chemistry, and biology, written by domain experts

Stats:198 Qs

Humanity's Last Exam

Expert-level academic knowledge assessment

Stats:2,500 tough, subject-diverse, multi-modal questions

MMLU

Multiple-choice questions across various domains, ranging from elementary to professional difficulty

Stats:Covers 57 subjects
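
MMLU and its harder variants below are multiple-choice benchmarks, usually scored by exact match on the chosen option letter. A minimal sketch of the common A/B/C/D prompt formatting and accuracy computation (ask_model is a placeholder for whatever model call is being evaluated):

```python
from typing import Callable

def format_mcq(question: str, choices: list[str]) -> str:
    """Common A/B/C/D prompt format used by most MMLU-style harnesses."""
    letters = "ABCD"
    options = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(choices))
    return f"{question}\n{options}\nAnswer:"

def accuracy(items: list[tuple[str, list[str], str]], ask_model: Callable[[str], str]) -> float:
    """items: (question, choices, gold_letter) triples; ask_model is a
    placeholder callable returning the model's raw text answer."""
    correct = 0
    for question, choices, gold in items:
        pred = ask_model(format_mcq(question, choices)).strip().upper()[:1]
        correct += pred == gold
    return correct / len(items)
```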

MMLU Pro

A more challenging, reasoning-focused version of MMLU

Stats:12,000 questions across 14 domains

SimpleQA

Challenging factuality benchmark

Stats:4,326 short, fact-seeking questions

Multimodal Reasoning

MMMU

Expert-level multimodal understanding and reasoning assessment

Stats:11.5k multimodal questions across 30 subjects
Examples:Qs on Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering.

MMMU Pro

A more challenging, reasoning-focused version of MMMU

Stats:3,460 Qs

MMBench v1.1

Evaluates visual reasoning and perception

Stats:3,217 MCQs covering 20 ability dimensions

MathVista

Visual math reasoning benchmark

Stats:6,141 examples

BLINK

Multi-image perception benchmark

Stats:3.8K questions across 7.3K images, 14 visual perception tasks
Examples:Visual correspondence, relative depth, art style, counting, jigsaw, visual similarity, IQ test, etc.

Document, Chart, and OCR Understanding

DocVQA

Document understanding benchmark for visual question answering

Stats:50,000 Qs, 12,767 document images
Examples:Qs on textual, non-textual, style, layout-based information
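
DocVQA is scored with ANLS (Average Normalized Levenshtein Similarity): a prediction earns credit proportional to its similarity with the closest ground-truth answer, and zero once the normalized edit distance exceeds 0.5. A sketch of the per-question score:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def anls(prediction: str, ground_truths: list[str], tau: float = 0.5) -> float:
    """Score for one question: best similarity to any ground truth,
    zeroed out if the normalized edit distance exceeds tau (0.5 in DocVQA).
    The dataset-level ANLS is the mean of this score over all questions."""
    best = 0.0
    for gt in ground_truths:
        p, g = prediction.strip().lower(), gt.strip().lower()
        nl = levenshtein(p, g) / max(len(p), len(g), 1)
        best = max(best, (1.0 - nl) if nl < tau else 0.0)
    return best
```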

ChartQA

Evaluates understanding of charts and graphs

Stats:20,882 charts, 9.6K human-written Qs, 23.1K machine generated Qs
Examples:Questions on data interpretation, trends, and comparisons

OCRBench

Evaluates OCR performance

Stats:29 datasets, 1000 QA pairs
Examples:Text recognition, Scene Text-Centric VQA, Document-Oriented VQA, KIE (Key Information Extraction), and HMER (Handwritten Mathematical Expression Recognition)

OmniDocBench

Evaluates document parsing and content extraction

Stats:1,355 pages across 9 distinct document types
Examples:Academic Papers, Slides, Books, Textbooks, Exam Papers, Notes, Magazines, Financial Reports, Newspapers

Video Understanding & Reasoning

LongVideoBench

Long context video understanding with referred reasoning questions (dependent on long frame inputs)

Stats:3,763 videos, Avg length: 473 sec, 17 categories of referred reasoning Qs
Examples:Video themes - Life, Movie, Knowledge, News

Video MME

Comprehensive multimodal evaluation benchmark for video analysis

Stats:900 videos, total 254 hrs, 2700 QA pairs
Examples:6 visual domains - Knowledge, Film & TV, Sports, Artistic Performance, Life Record, Multilingual. 30 fine-grained categories, e.g., astronomy, technology, documentary, news report, esports, magic show, and fashion.

Multimodal Agent Interaction + Agent Grounding

OSWorld

Evaluates multimodal agents on open-ended tasks in real computer environments

Stats:369 computer tasks
Examples:Tasks involving real web and desktop apps in open domains, OS file I/O, and workflows spanning multiple applications

ScreenSpot Pro

Evaluates GUI grounding for computer-use agents in high-resolution professional settings

Stats:23 applications across five industries and three operating systems
Examples:Dev & Programming, Creative SW, CAD, Scientific & Analytical, Office SW, and OS

Tau Bench (airline, retail)

Agentic Tool Use in domain-specific tasks

Stats:500 users - Retail: 50 products, 1000 orders, 115 tasks. Airline: 300 flights, 2000 reservations, 50 tasks
Examples:Retail: Cancel or modify pending orders, return or exchange delivered orders, modify user addresses, or provide information. Airline: Book, modify, or cancel flight reservations, or provide refunds.

Tau2 Bench (airline, retail, telecom)

Agentic Tool Use in domain-specific tasks

Stats:500 users - Retail: 50 products, 1000 orders, 115 tasks. Airline: 300 flights, 2000 reservations, 50 tasks. Telecom: 5 plans, 9 lines, 4 customers, 114 tasks.
Examples:Retail and Airline tasks similar to Tau Bench; Telecom: technical support, overdue bill payment, line suspension, plan options
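
Both Tau benchmarks place the model in a multi-turn loop with a simulated user, where it must call domain tools (order lookups, flight changes, plan updates) while following policy. A rough sketch of the generic tool-calling loop such benchmarks exercise; the tool names and the agent_step callable are illustrative placeholders, not the benchmark's actual API:

```python
import json
from typing import Callable

# Illustrative domain tools; the real benchmarks expose richer order/flight/plan tools
TOOLS: dict[str, Callable[..., dict]] = {
    "get_order": lambda order_id: {"order_id": order_id, "status": "pending"},
    "cancel_order": lambda order_id: {"order_id": order_id, "status": "cancelled"},
}

def run_episode(agent_step: Callable[[list[dict]], dict], max_turns: int = 10) -> list[dict]:
    """agent_step is a placeholder: given the transcript so far, it returns
    either {"tool": name, "args": {...}} or {"respond": text}."""
    transcript = [{"role": "user", "content": "Please cancel order #42."}]
    for _ in range(max_turns):
        action = agent_step(transcript)
        if "respond" in action:
            transcript.append({"role": "assistant", "content": action["respond"]})
            break
        result = TOOLS[action["tool"]](**action["args"])
        transcript.append({"role": "tool", "content": json.dumps(result)})
    return transcript
```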

BrowseComp

Evaluates browsing agents on locating hard-to-find information on the web

Stats:1,266 Problems
Examples:TV shows, movies, Science & Technology, Art, History, Sports, Music, Video games, Geography, Politics

Terminal-Bench 1.0

End-to-end tasks in a real terminal environment; an agent scaffold must be paired with the model

Stats:80 tasks
Examples:Software Eng, System administration, Security, Debugging, File operations, Data science, Model Training, Games, Scientific computing

Other

ARC-AGI-2

Tasks easy for humans but hard for AI

Stats:Semi-Private Eval dataset has 120 problems
Examples:Symbolic Interpretation, Compositional Reasoning, Contextual Rule Application
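
ARC tasks are distributed as JSON files containing a few "train" input/output grid pairs and one or more "test" pairs; a prediction scores only if every cell of the predicted output grid matches the reference exactly. A small loading-and-scoring sketch, with a placeholder task path:

```python
import json

def score_arc_task(task_path: str, predictions: list[list[list[int]]]) -> float:
    """Exact-match scoring over one task's test pairs. task_path is a
    placeholder path to an ARC-format JSON file with "train" and "test"
    lists of {"input": grid, "output": grid} pairs."""
    with open(task_path) as f:
        task = json.load(f)
    tests = task["test"]
    correct = sum(pred == pair["output"] for pred, pair in zip(predictions, tests))
    return correct / len(tests)
```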