Active Benchmarks by Domain

Browse verified, actively maintained benchmarks by problem domain. These are the recommended datasets for evaluating your models.

Finding the right benchmark

Pick your domain

What type of problem? (vision, language, audio, etc.)

Find your task

What specific problem are you solving?

Choose dataset

Which benchmark fits your use case?

Compare results

How does your model stack up?

Browse by problem domain

Natural Language Processing

Processing and understanding text? Evaluate your models on language understanding, generation, translation, and information extraction benchmarks.

16 tasks, 27 datasets, 7436 results

Polish LLM General, Polish Cultural Competency, Polish Text Understanding, Polish Conversation Quality, +12 more

Computer Vision

Building systems that understand images and video? Find benchmarks for recognition, detection, segmentation, and document analysis tasks.

35 tasks, 286 datasets, 2328 results

Optical Character Recognition, Scene Text Detection, Document Parsing, Document Layout Analysis, +31 more

Speech

Working with voice and audio? Evaluate speech-to-text accuracy, voice synthesis quality, and speaker identification performance.

4 tasks, 13 datasets, 532 results

Speech Recognition, Speech Translation, Speaker Verification, Speech Enhancement

Reasoning

Testing if your model can think logically? Benchmark math problem solving, commonsense understanding, and multi-step reasoning capabilities.

5 tasks, 20 datasets, 415 results

Multi-step Reasoning, Mathematical Reasoning, Commonsense Reasoning, Logical Reasoning, +1 more

Computer Code

Developing AI coding assistants? Test code generation, completion, translation, bug detection, and repair capabilities.

6 tasks, 15 datasets, 297 results

Code Generation, Code Translation, Code Completion, Bug Detection, +2 more

Multimodal

Combining vision and language? Evaluate image captioning, visual QA, text-to-image generation, and cross-modal retrieval models.

10 tasks, 26 datasets, 267 results

Visual Question Answering, Image-Text-to-Text, Video Understanding, Text-to-Image Generation, +6 more

Agentic AI

Benchmarks for autonomous agents, software engineering agents, web agents, desktop agents, and terminal-based task execution.

10 tasks, 21 datasets, 225 results

SWE-bench, Task agents, Web & Desktop Agents, Autonomous Coding, +6 more

Medical

Building healthcare AI? Find benchmarks for medical imaging, disease diagnosis, clinical text processing, and drug discovery.

4 tasks, 15 datasets, 83 results

Disease Classification, Medical Image Segmentation, Drug Discovery, Clinical NLP

Time-series

2 tasks, 7 datasets, 75 results

Time-series forecasting, Time-series classification

Mobile Development

Benchmarks evaluating AI code generation for mobile platforms (React Native, Flutter, Swift, Kotlin). They test real-world patterns: navigation, animation, state management, and platform APIs.

1 task, 1 dataset, 40 results

React Native Code Generation

Natural Language Processing

The field of AI concerned with the interaction between computers and human language, encompassing text understanding, generation, translation, sentiment analysis, and question answering.

3 tasks, 66 datasets, 32 results

Language Modeling, Text classification, Machine Translation

Industrial Inspection

Building quality control systems? Benchmark anomaly detection, defect classification, and automated visual inspection for manufacturing.

4 tasks, 10 datasets, 27 results

Anomaly Detection, Steel Defect Detection, Surface Defect Detection, Weld Inspection

Reinforcement Learning

Training agents to make decisions? Benchmark your policies on game playing, continuous control, and offline learning tasks.

3 tasks, 3 datasets, 21 results

Atari Games, Continuous Control, Offline RL

Audio

Research on processing, understanding, and generating audio signals, including speech recognition, music generation, sound classification, and audio synthesis.

5 tasks, 52 datasets, 19 results

Text-to-speech, Audio Classification, Voice cloning, Automatic Speech Recognition, +1 more

Audio

Processing general audio signals? Test your models on sound classification, event detection, music analysis, and source separation.

6 tasks, 8 datasets, 13 results

Audio Captioning, Sound Event Detection, Music Generation, Text-to-Audio, +2 more

Graphs

Working with network data? Test graph learning models on node classification, link prediction, and molecular property tasks.

4 tasks, 5 datasets, 12 results

Node Classification, Link Prediction, Molecular Property Prediction, Graph Classification

Knowledge Base

Building knowledge systems? Evaluate graph completion, relation extraction, and entity linking performance.

3 tasks, 3 datasets, 9 results

Entity Linking, Knowledge Graph Completion, Relation Extraction

General

A broad category encompassing machine learning research and tasks that don't fit specifically into vision or language domains, including general ML methods, optimization, and cross-domain approaches.

11 tasks, 87 datasets, 8 results

Coding Agents, Video-Language Models, Reasoning, Reinforcement Learning, +7 more

Time Series

Predicting future trends or detecting anomalies? Benchmark forecasting accuracy and pattern recognition in sequential data.

3 tasks, 2 datasets, 7 results

Tabular Classification, Tabular Regression, Tabular Machine Learning

Robots

Building robotic systems? Find benchmarks for manipulation, navigation, and simulation-to-reality transfer.

3 tasks, 3 datasets, 5 results

Robot Manipulation, Robot Navigation, Sim-to-Real Transfer

Methodology

Improving learning efficiency? Test self-supervised, few-shot, transfer, and continual learning approaches.

4 tasks, 4 datasets, 0 results

Self-Supervised Learning, Few-Shot Learning, Continual Learning, Transfer Learning

Adversarial

Need to test model robustness? Benchmark resilience against adversarial attacks and evaluate defense mechanisms.

2 tasks, 2 datasets, 0 results

Adversarial Robustness, Adversarial Attacks

Other

2 tasks, 9 datasets, 0 results

Robotics, Other

Computer Vision

Research focused on enabling computers to interpret and understand visual information from images and videos, including tasks such as image classification, object detection, segmentation, and visual recognition.

3 tasks, 97 datasets, 0 results

3D generation, Few-Shot Image Classification, Video generation

Research Areas

149 Tasks

782 Datasets

11851 Benchmark Results
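The site-wide totals appear to be plain sums of the per-domain counts on the cards above. The sketch below checks that arithmetic for this snapshot of the page. The names `DOMAIN_COUNTS` and `site_totals` are hypothetical and introduced only for illustration; the numbers are copied from the cards in page order, and nothing here reflects how the site itself computes its figures.

```python
# Per-domain (name, tasks, datasets, results) rows copied from the cards above.
# Some domain names repeat on the page, so rows live in a list rather than a dict.
DOMAIN_COUNTS = [
    ("Natural Language Processing", 16, 27, 7436),
    ("Computer Vision", 35, 286, 2328),
    ("Speech", 4, 13, 532),
    ("Reasoning", 5, 20, 415),
    ("Computer Code", 6, 15, 297),
    ("Multimodal", 10, 26, 267),
    ("Agentic AI", 10, 21, 225),
    ("Medical", 4, 15, 83),
    ("Time-series", 2, 7, 75),
    ("Mobile Development", 1, 1, 40),
    ("Natural Language Processing", 3, 66, 32),
    ("Industrial Inspection", 4, 10, 27),
    ("Reinforcement Learning", 3, 3, 21),
    ("Audio", 5, 52, 19),
    ("Audio", 6, 8, 13),
    ("Graphs", 4, 5, 12),
    ("Knowledge Base", 3, 3, 9),
    ("General", 11, 87, 8),
    ("Time Series", 3, 2, 7),
    ("Robots", 3, 3, 5),
    ("Methodology", 4, 4, 0),
    ("Adversarial", 2, 2, 0),
    ("Other", 2, 9, 0),
    ("Computer Vision", 3, 97, 0),
]


def site_totals(rows):
    """Sum the tasks, datasets, and results columns across all domain cards."""
    tasks = sum(row[1] for row in rows)
    datasets = sum(row[2] for row in rows)
    results = sum(row[3] for row in rows)
    return tasks, datasets, results


if __name__ == "__main__":
    print(site_totals(DOMAIN_COUNTS))  # prints (149, 782, 11851)
```

The three sums match the 149 tasks, 782 datasets, and 11851 benchmark results reported in the totals, which suggests each task and dataset is counted once under a single domain card.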
