Active Benchmarks by Domain

Browse verified, actively-maintained benchmarks by problem domain. These are the recommended datasets for evaluating your models.

Finding the right benchmark

1. Pick your domain: What type of problem? (vision, language, audio, etc.)
2. Find your task: What specific problem are you solving?
3. Choose dataset: Which benchmark fits your use case?
4. Compare results: How does your model stack up?

Browse by problem domain

Natural Language Processing

Processing and understanding text? Evaluate your models on language understanding, generation, translation, and information extraction benchmarks.

19 tasks, 23 datasets, 5982 results
Polish LLM General, Polish Cultural Competency, Polish Text Understanding, Polish Conversation Quality, +15 more

Computer Vision

Building systems that understand images and video? Find benchmarks for recognition, detection, segmentation, and document analysis tasks.

27 tasks, 199 datasets, 1778 results
Optical Character Recognition, Scene Text Detection, Scene Text Recognition, Document Layout Analysis, +23 more

Reasoning

Testing if your model can think logically? Benchmark math problem solving, commonsense understanding, and multi-step reasoning capabilities.

5 tasks, 19 datasets, 158 results
Mathematical Reasoning, Commonsense Reasoning, Multi-step Reasoning, Logical Reasoning, +1 more

Computer Code

Developing AI coding assistants? Test code generation, completion, translation, bug detection, and repair capabilities.

6 tasks, 14 datasets, 139 results
Code Generation, Code Translation, Bug Detection, Code Completion, +2 more

Time Series

Predicting future trends or detecting anomalies? Benchmark forecasting accuracy and pattern recognition in sequential data.

4 tasks, 9 datasets, 82 results
Time Series Forecasting, Tabular Classification, Tabular Regression, Time Series Classification

Medical

Building healthcare AI? Find benchmarks for medical imaging, disease diagnosis, clinical text processing, and drug discovery.

4 tasks, 15 datasets, 71 results
Disease Classification, Medical Image Segmentation, Clinical NLP, Drug Discovery

Agentic AI

Measuring autonomous AI capabilities? METR benchmarks track time horizon, multi-step reasoning, and sustained task performance, all key metrics for AGI progress.

6 tasks, 7 datasets, 45 results
SWE-bench, Web & Desktop Agents, HCAST, Time Horizon, +2 more

Speech

Working with voice and audio? Evaluate speech-to-text accuracy, voice synthesis quality, and speaker identification performance.

5 tasks, 9 datasets, 40 results
Speech Recognition, Text-to-Speech, Speech Translation, Speaker Verification, +1 more

Mobile Development

Generating code for mobile platforms? These benchmarks evaluate AI code generation for React Native, Flutter, Swift, and Kotlin, testing real-world patterns: navigation, animation, state management, and platform APIs.

1 task, 1 dataset, 40 results
React Native Code Generation

Multimodal

Combining vision and language? Evaluate image captioning, visual QA, text-to-image generation, and cross-modal retrieval models.

10 tasks, 23 datasets, 37 results
Visual Question Answering, Image Captioning, Any-to-Any, Audio-Text-to-Text, +6 more

Industrial Inspection

Building quality control systems? Benchmark anomaly detection, defect classification, and automated visual inspection for manufacturing.

4 tasks, 10 datasets, 27 results
Anomaly Detection, Steel Defect Detection, Surface Defect Detection, Weld Inspection

Reinforcement Learning

Training agents to make decisions? Benchmark your policies on game playing, continuous control, and offline learning tasks.

3 tasks, 3 datasets, 18 results
Atari Games, Continuous Control, Offline RL

Graphs

Working with network data? Test graph learning models on node classification, link prediction, and molecular property tasks.

4 tasks, 5 datasets, 12 results
Node Classification, Link Prediction, Molecular Property Prediction, Graph Classification

Knowledge Base

Building knowledge systems? Evaluate graph completion, relation extraction, and entity linking performance.

3 tasks, 3 datasets, 9 results
Relation Extraction, Entity Linking, Knowledge Graph Completion

Audio

Processing general audio signals? Test your models on sound classification, event detection, music analysis, and source separation.

7 tasks, 10 datasets, 9 results
Music Generation, Audio Captioning, Sound Event Detection, Audio Classification, +3 more

Methodology

Improving learning efficiency? Test self-supervised, few-shot, transfer, and continual learning approaches.

4 tasks, 4 datasets, 0 results
Continual Learning, Few-Shot Learning, Self-Supervised Learning, Transfer Learning

Adversarial

Need to test model robustness? Benchmark resilience against adversarial attacks and evaluate defense mechanisms.

2 tasks, 2 datasets, 0 results
Adversarial Robustness, Adversarial Attacks

Robots

Building robotic systems? Find benchmarks for manipulation, navigation, and simulation-to-reality transfer.

4 tasks, 5 datasets, 0 results
Robot Manipulation, Robot Navigation, Robotics, Sim-to-Real Transfer

Totals: 18 research areas, 118 tasks, 361 datasets, 8447 benchmark results

Browse Ontology | Vote on Benchmarks (188) | Saturated & Legacy | PWC Archive