Natural Language Inference
Determining entailment relationships between sentences (SNLI, MNLI).
Natural language inference (NLI) determines whether a hypothesis is entailed by, contradicts, or is neutral to a premise. It's both a standalone task and a critical building block for zero-shot classification, fact verification, and textual reasoning. DeBERTa-v3 holds the top spot on MNLI, while LLMs handle NLI implicitly in their broader reasoning.
History
SNLI (Bowman et al., 2015) provides 570K human-labeled premise-hypothesis pairs — the first large-scale NLI dataset
MultiNLI (Williams et al., 2018) extends NLI to 10 genres with 433K pairs; becomes a core GLUE task
BERT achieves 86.7% on MNLI, establishing transformers as the NLI paradigm
RoBERTa pushes MNLI to 90.2% with improved pretraining; adversarial NLI (ANLI) exposes remaining weaknesses
DeBERTa introduces disentangled attention, surpassing human performance on MNLI at 91.1%
NLI is repurposed for zero-shot classification (Yin et al., 2019) — entailment probability as label confidence
ANLI remains unsolved at ~60% for GPT-3-scale models, showing adversarial robustness is still lacking
GPT-4o achieves ~92% on MNLI zero-shot; DeBERTa-v3-large remains the fine-tuned SOTA at 91.9%
How Natural Language Inference Works
Input formatting
Premise and hypothesis are concatenated with a [SEP] token: '[CLS] premise [SEP] hypothesis [SEP]'
Joint encoding
The transformer processes both texts jointly, allowing cross-attention between premise and hypothesis tokens
Classification
The [CLS] representation is fed to a 3-way classifier: entailment, contradiction, or neutral
Probability calibration
Softmax outputs are calibrated to produce reliable confidence scores for downstream use (e.g., zero-shot classification)
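The four steps above can be sketched end to end. The snippet below is a minimal illustration, not a real system: the logits are toy values standing in for the output of a fine-tuned encoder (e.g., a DeBERTa MNLI checkpoint), and the label order is an assumption — always check the actual model's label mapping.

```python
import math

# Assumed 3-way label order; real checkpoints vary, so verify
# against the model's config before trusting indices.
LABELS = ["contradiction", "neutral", "entailment"]

def format_pair(premise: str, hypothesis: str) -> str:
    # Step 1: premise and hypothesis packed into one sequence
    # so the encoder can cross-attend between them (step 2).
    return f"[CLS] {premise} [SEP] {hypothesis} [SEP]"

def softmax(logits, temperature=1.0):
    # Step 4: temperature > 1 softens overconfident probabilities —
    # a simple calibration method, with T tuned on held-out data.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits, temperature=1.0):
    # Step 3: the [CLS] representation's logits become a
    # 3-way decision plus a calibrated confidence score.
    probs = softmax(logits, temperature)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[idx], probs[idx]

# Toy logits standing in for a real model's classifier head output.
text = format_pair("A man is playing a guitar.", "A person plays music.")
label, confidence = classify([-2.1, 0.3, 3.4])  # -> "entailment"
```

Temperature scaling is shown because downstream uses (zero-shot classification, claim filtering) depend on the entailment probability being a trustworthy confidence, not just an argmax.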
Current Landscape
NLI in 2025 is a mature benchmark task where MNLI is effectively solved (>91% accuracy, above human baseline). The real impact of NLI research is downstream: NLI-trained models power zero-shot classification (BART-MNLI), fact verification, and textual entailment checks in RAG pipelines. ANLI remains the hard benchmark, exposing that models still lack robust logical reasoning. The field has shifted focus from standalone NLI accuracy to using NLI as a reasoning primitive within larger systems.
Key Challenges
Annotation artifacts: models exploit spurious correlations (e.g., 'not' signals contradiction) without genuine reasoning
Adversarial robustness: ANLI shows that human-written adversarial examples defeat most models
Fine-grained entailment: soft entailment ('mostly true') and graded similarity aren't captured by 3-class labels
Domain transfer: NLI models trained on general text degrade on scientific, legal, and medical premise-hypothesis pairs
Compositionality: multi-sentence premises with complex logical structure remain challenging
Quick Recommendations
Best fine-tuned NLI
DeBERTa-v3-large fine-tuned on MNLI + SNLI
91.9% on the MNLI matched split; the strongest encoder model for NLI and zero-shot classification
Zero-shot NLI
GPT-4o or Claude 3.5 Sonnet
~92% on MNLI without fine-tuning; handles complex multi-sentence reasoning
Fact verification
DeBERTa + FEVER-trained classifier
NLI models fine-tuned on fact verification data detect unsupported claims
Lightweight NLI
MiniLM-L12 fine-tuned on MNLI
33M params with 87%+ accuracy; fast enough for real-time applications
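The fact-verification recommendation above boils down to a decision rule over NLI outputs: check a claim against each retrieved evidence sentence and aggregate. The sketch below assumes per-evidence (entailment, contradiction) probabilities from some NLI model; the threshold value and the FEVER-style verdict strings are illustrative assumptions.

```python
def verdict(probs_per_evidence, tau=0.5):
    # probs_per_evidence: one (p_entail, p_contradict) pair per
    # retrieved evidence sentence, produced by an NLI model run on
    # (evidence, claim). tau is an assumed confidence threshold.
    if any(p_e >= tau for p_e, _ in probs_per_evidence):
        return "SUPPORTED"       # some evidence entails the claim
    if any(p_c >= tau for _, p_c in probs_per_evidence):
        return "REFUTED"         # some evidence contradicts it
    return "NOT ENOUGH INFO"     # claim is unverified by retrieval
```

The same rule serves as a hallucination check in RAG pipelines: any generated claim that lands in "NOT ENOUGH INFO" or "REFUTED" against the retrieved context can be flagged or dropped.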
What's Next
Expect NLI to be absorbed into general reasoning evaluation rather than tracked as a standalone task. The technique of using entailment as a building block for zero-shot classification, fact-checking, and claim verification will persist and deepen. Adversarial NLI (and harder versions like those in BIG-Bench) will continue to test whether models genuinely reason or merely pattern-match.
Benchmarks & SOTA
Related Tasks
Reading Comprehension
Understanding and answering questions about passages.
Text Ranking
Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by ColBERT (2020) introducing late interaction, then by instruction-tuned embedding models like E5-Mistral and GTE-Qwen that turned general LLMs into retrieval engines. MS MARCO and BEIR remain the standard battlegrounds, but the real test is zero-shot transfer — can a model trained on web search generalize to legal documents, scientific papers, and code? The gap between supervised and zero-shot performance has shrunk from 15+ points to under 3 in two years.
Polish Text Understanding
Evaluating language models on understanding Polish text: sentiment, implicatures, phraseology, tricky questions, and hallucination resistance.
Polish Conversation Quality
Evaluating language models on multi-turn conversation quality in Polish across coding, extraction, humanities, math, reasoning, roleplay, STEM, and writing.