Natural Language Processing

Natural Language Inference

Determining entailment relationships between sentences (SNLI, MNLI).


Natural language inference (NLI) determines whether a hypothesis is entailed by, contradicts, or is neutral to a premise. It's both a standalone task and a critical building block for zero-shot classification, fact verification, and textual reasoning. DeBERTa-v3 holds the top spot on MNLI, while LLMs handle NLI implicitly in their broader reasoning.

History

2015

SNLI (Bowman et al.) provides 570K human-labeled premise-hypothesis pairs — the first large-scale NLI dataset

2017

MultiNLI (Williams et al.) extends NLI to 10 genres with 433K pairs; becomes a core GLUE task

2018

BERT achieves 86.7% on MNLI, establishing transformers as the NLI paradigm

2019

RoBERTa pushes MNLI to 90.2% with improved pretraining; adversarial NLI (ANLI) exposes remaining weaknesses

2020

DeBERTa introduces disentangled attention, surpassing human performance on MNLI at 91.1%

2021

NLI is repurposed for zero-shot classification (Yin et al.) — entailment probability as label confidence

2022

ANLI remains unsolved: GPT-3-scale models score only ~60%, showing that adversarial robustness is still lacking

2024

GPT-4o achieves ~92% on MNLI zero-shot; DeBERTa-v3-large remains the fine-tuned SOTA at 91.9%

How Natural Language Inference Works

Natural Language Inference Pipeline
1

Input formatting

Premise and hypothesis are concatenated with a [SEP] token: '[CLS] premise [SEP] hypothesis [SEP]'

2

Joint encoding

The transformer processes both texts jointly, allowing cross-attention between premise and hypothesis tokens

3

Classification

The [CLS] representation is fed to a 3-way classifier: entailment, contradiction, or neutral

4

Probability calibration

Softmax outputs are calibrated to produce reliable confidence scores for downstream use (e.g., zero-shot classification)
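The four steps above can be sketched in a few lines of pure Python. This is a minimal illustration, not a real model: the logits are hypothetical stand-ins for what a fine-tuned encoder's classifier head would produce from the [CLS] representation, and calibration is shown as simple temperature scaling (one common approach).

```python
import math

def format_input(premise: str, hypothesis: str) -> str:
    """Step 1: concatenate premise and hypothesis BERT-style."""
    return f"[CLS] {premise} [SEP] {hypothesis} [SEP]"

def softmax(logits, temperature=1.0):
    """Steps 3-4: turn classifier logits into probabilities.
    A temperature > 1 softens overconfident outputs (temperature scaling)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

LABELS = ["entailment", "neutral", "contradiction"]

# Hypothetical [CLS]-head logits from a fine-tuned encoder (step 2 is
# the transformer's joint encoding, elided here).
logits = [4.2, 0.7, -1.5]
probs = softmax(logits, temperature=1.5)
prediction = LABELS[probs.index(max(probs))]
```

In a real system, step 2 (joint encoding) is what makes pair-encoders like DeBERTa strong: every hypothesis token can attend to every premise token before classification.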

Current Landscape

NLI in 2025 is a mature benchmark task where MNLI is effectively solved (>91% accuracy, above human baseline). The real impact of NLI research is downstream: NLI-trained models power zero-shot classification (BART-MNLI), fact verification, and textual entailment checks in RAG pipelines. ANLI remains the hard benchmark, exposing that models still lack robust logical reasoning. The field has shifted focus from standalone NLI accuracy to using NLI as a reasoning primitive within larger systems.
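The zero-shot-classification trick mentioned above can be sketched as follows. The `entailment_prob` function here is a toy word-overlap stand-in for a real NLI model (e.g. an MNLI-fine-tuned BART or DeBERTa checkpoint); the template string is an assumption, following the common "This text is about {label}." pattern.

```python
def entailment_prob(premise: str, hypothesis: str) -> float:
    """Stand-in for an NLI model's entailment probability.
    Toy word-overlap score, purely for illustration."""
    tokenize = lambda s: {w.strip(".,") for w in s.lower().split()}
    p, h = tokenize(premise), tokenize(hypothesis)
    return len(p & h) / max(len(h), 1)

def zero_shot_classify(text: str, labels: list[str]) -> str:
    """Treat each candidate label as a hypothesis; the best-entailed
    hypothesis gives the predicted label."""
    template = "This text is about {}."
    scores = {lbl: entailment_prob(text, template.format(lbl)) for lbl in labels}
    return max(scores, key=scores.get)

label = zero_shot_classify("A new sports stadium opened downtown.",
                           ["sports", "politics"])
```

With a real NLI backbone, the same loop generalizes to any label set without retraining, which is why entailment probability works as label confidence.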

Key Challenges

Annotation artifacts: models exploit spurious correlations (e.g., 'not' signals contradiction) without genuine reasoning

Adversarial robustness: ANLI shows that human-written adversarial examples defeat most models

Fine-grained entailment: soft entailment ('mostly true') and graded similarity aren't captured by 3-class labels

Domain transfer: NLI models trained on general text degrade on scientific, legal, and medical premise-hypothesis pairs

Compositionality: multi-sentence premises with complex logical structure remain challenging
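The annotation-artifact problem is easiest to see with a hypothesis-only baseline: a premise-blind rule that still beats chance on SNLI. The word lists below are illustrative examples of artifacts reported in the literature (negation cues for contradiction; generic words like "outdoors" for entailment), not an exhaustive inventory.

```python
# Toy hypothesis-only baseline: predicts a label without ever reading
# the premise. That such rules beat chance on SNLI is exactly the
# annotation-artifact problem.
NEGATION = {"not", "no", "never", "nobody", "nothing"}
GENERIC = {"outdoors", "outside", "animal", "person", "someone"}

def hypothesis_only(hypothesis: str) -> str:
    tokens = {w.strip(".,") for w in hypothesis.lower().split()}
    if tokens & NEGATION:
        return "contradiction"  # negation words correlate with contradiction
    if tokens & GENERIC:
        return "entailment"     # generic rewordings correlate with entailment
    return "neutral"
```

Artifact-aware evaluations (e.g. hypothesis-only baselines, ANLI's adversarial collection) exist precisely to measure how much of a model's accuracy comes from shortcuts like these.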

Quick Recommendations

Best fine-tuned NLI

DeBERTa-v3-large fine-tuned on MNLI + SNLI

91.9% on MNLI matched; best encoder model for NLI and zero-shot classification

Zero-shot NLI

GPT-4o or Claude 3.5 Sonnet

~92% on MNLI without fine-tuning; handles complex multi-sentence reasoning

Fact verification

DeBERTa + FEVER-trained classifier

NLI models fine-tuned on fact verification data detect unsupported claims

Lightweight NLI

MiniLM-L12 fine-tuned on MNLI

33M params with 87%+ accuracy; fast enough for real-time applications

What's Next

Expect NLI to be absorbed into general reasoning evaluation rather than tracked as a standalone task. The technique of using entailment as a building block for zero-shot classification, fact-checking, and claim verification will persist and deepen. Adversarial NLI (and harder versions like those in BIG-Bench) will continue to test whether models genuinely reason or merely pattern-match.
