Codesota · Papers · 2026-02-11
Paper

Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters

arXiv ↗ · Code ↗
§ 01 · Benchmark results

22 results reproduced from this paper.

Results: 20 · SOTA rows: 1 · Models: 3 · Datasets: 0

| # | Model | Benchmark | Value | SOTA |
|---|-------|-----------|-------|------|
| 01 | Step-3.5-Flash PaCoRe | AIME 2025 | 99.9% | #1 |
| 02 | Step-3.5-Flash | AIME 2025 | 97.3% | |
| 03 | Step-3.5-Flash Base | HellaSwag | 90.2% | |
| 04 | Step-3.5-Flash Base | BIG-Bench Hard | 88.2% | |
| 05 | Step-3.5-Flash Base | GSM8K | 88.2% | |
| 06 | Step-3.5-Flash | Tau2-Bench | 88.2% | |
| 07 | Step-3.5-Flash Base | MMLU | 85.8% | |
| 08 | Step-3.5-Flash PaCoRe | GPQA Diamond | 85.0% | |
| 09 | Step-3.5-Flash PaCoRe | MMLU-Pro | 84.8% | |
| 10 | Step-3.5-Flash | MMLU-Pro | 84.4% | |
| 11 | Step-3.5-Flash | GPQA Diamond | 83.5% | |
| 12 | Step-3.5-Flash Base | WinoGrande | 79.1% | |
| 13 | Step-3.5-Flash | SWE-Bench Verified | 74.4% | |
| 14 | Step-3.5-Flash Base | MATH | 66.8% | |
| 15 | Step-3.5-Flash Base | MMLU-Pro | 62.3% | |
| 16 | Step-3.5-Flash | BrowseComp | 51.6% | |
| 17 | Step-3.5-Flash Base | GPQA Diamond | 41.7% | |
| 18 | Step-3.5-Flash Base | SimpleQA | 31.6% | |
| 19 | Step-3.5-Flash PaCoRe | HLE | 27.9% | |
| 20 | Step-3.5-Flash | HLE | 23.1% | |

§ 02 · Models

3 models from this paper.

Evaluates: Step-3.5-Flash Base · Step-3.5-Flash PaCoRe · Step-3.5-Flash