Benchmarks
Explore the popular benchmarks used to evaluate AI model capabilities across different domains.
Mathematical Reasoning (5)
Programming & Software Engineering (5)
Language & Knowledge (5)
Multimodal Reasoning (5)
Document, Chart, and OCR Understanding (4)
Video Understanding & Reasoning (2)
Multimodal Evaluation for video analysis
Stats: 900 videos, 254 hours in total, 2,700 QA pairs
Examples: 6 visual domains (Knowledge, Film & TV, Sports, Artistic Performance, Life Record, Multilingual) and 30 fine-grained categories, e.g., astronomy, technology, documentary, news report, esports, magic show, and fashion.
Multimodal Agent Interaction + Agent Grounding (6)
Agentic Tool Use in domain-specific tasks
Stats: 500 users. Retail: 50 products, 1,000 orders, 115 tasks. Airline: 300 flights, 2,000 reservations, 50 tasks.
Examples: Retail: cancel or modify pending orders, return or exchange delivered orders, modify user addresses, or provide information. Airline: book, modify, or cancel flight reservations, or provide refunds.
Agentic Tool Use in domain-specific tasks
Stats: 500 users. Retail: 50 products, 1,000 orders, 115 tasks. Airline: 300 flights, 2,000 reservations, 50 tasks. Telecom: 5 plans, 9 lines, 4 customers, 114 tasks.
Examples: Retail and Airline: similar to Tau Bench. Telecom: technical support, overdue bill payment, line suspension, plan options.