Head-to-head · two frontier labs · April 2026

Claude Code vs OpenAI Codex CLI.

Both are terminal agents. Both can run unattended for half an hour. Both ship regularly. One is wrapped around Claude Opus 4.5/4.7; the other around GPT-5.2/5.3-Codex. Here is where they actually diverge.

§ 01 · Side-by-side

At a glance, row by row.

| Attribute | Claude Code | OpenAI Codex CLI |
| --- | --- | --- |
| Vendor | Anthropic | OpenAI |
| Primary model | Opus 4.5 / 4.6 / 4.7, Sonnet 4.5, Haiku 4.5 | GPT-5.2 / 5.3, Codex variants, Mini, Nano |
| Reasoning control | Soft (prompted) | Explicit: minimal / medium / high / xhigh |
| SWE-Bench Verified peak | 87.6% (Opus 4.7) | 85.0% (GPT-5.3-Codex xhigh) |
| Patch format | str_replace (surgical) | apply_patch (unified diff) |
| Sandboxing | Host shell + optional --safe | Docker sandbox by default |
| Extension mechanism | MCP servers (Linear, DB, etc.), sketched below | custom_tools JSON schema, sketched below |
| Context window | 200k / 1M (1M on Opus 4.7) | 200k (GPT-5) / 400k (Codex long) |
| Pricing signal | Pro $20/mo + API usage | Pro $20/mo + API usage |
| Best for | Multi-file refactors, long autonomy | Reasoning-heavy bugs, cheap bulk runs |
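
The extension rows are where day-to-day workflows diverge most. As a sketch of each side: a minimal MCP tool server in Python, assuming the official `mcp` SDK and its FastMCP helper (the decorator usage follows the SDK's docs, not something either CLI ships), followed by the JSON-schema shape a custom_tools entry takes (field names there are illustrative assumptions, not Codex CLI's exact config keys).

```python
# Minimal MCP tool server, the Claude Code side of the extension row.
# Assumes the official `mcp` Python SDK (pip install mcp) and its FastMCP
# helper; the decorator usage follows the SDK's docs, not either CLI.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("internal-db")

@mcp.tool()
def lookup_ticket(ticket_id: str) -> str:
    """Return the status line for a ticket (stub for a real DB query)."""
    return f"{ticket_id}: open"

if __name__ == "__main__":
    mcp.run()  # stdio transport by default; register in the CLI's MCP config

# The Codex side is declarative: a JSON-schema tool definition. Field
# names here are illustrative assumptions, not Codex CLI's exact keys.
CUSTOM_TOOL = {
    "name": "lookup_ticket",
    "description": "Return the status line for a ticket",
    "parameters": {
        "type": "object",
        "properties": {"ticket_id": {"type": "string"}},
        "required": ["ticket_id"],
    },
}
```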

April 2026, pass@1

SWE-Bench Verified — best reported by each CLI

  • Claude Code + Opus 4.7: 87.6% (Anthropic internal)
  • Codex CLI + GPT-5.3-Codex xhigh: 85.0% (OpenAI release notes)
  • Claude Code + Opus 4.5: 80.9%
  • Codex CLI + GPT-5.2-Codex: 79.0%
  • Claude Code + Sonnet 4.5: 77.2%
  • Codex CLI + GPT-5.2 Mini: 62.0%

Architecture — the two loops side-by-side

Both are plan-act-reflect agents, but the tool verbs and reasoning modes differ.
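
Stripped of vendor labels, the two loops share one skeleton. A minimal runnable sketch (every helper below is a stub standing in for a model call or tool execution; neither CLI exposes a Python API like this):

```python
# Editorial sketch of the shared plan-act-reflect skeleton. The four
# helpers are stubs standing in for model calls and tool executions.

def plan_step(task: str) -> str:
    return f"plan for: {task}"        # long-horizon plan / reasoner pass

def act(plan: str) -> str:
    return "edit-0001"                # str_replace or apply_patch

def run_tests() -> str:
    return "pass"                     # bash pytest / sandboxed exec

def reflect(plan: str, result: str) -> str:
    return plan + " (revised)"        # Claude's reflect loop / Codex's self-critique

def agent_loop(task: str, max_steps: int = 30) -> list[str]:
    transcript: list[str] = []
    plan = plan_step(task)
    for _ in range(max_steps):
        edit = act(plan)
        result = run_tests()
        transcript.append(f"{edit} -> {result}")
        if result == "pass":          # tests green: commit / emit PR-style output
            break
        plan = reflect(plan, result)  # otherwise iterate / replan
    return transcript

print(agent_loop("fix failing test in parser.py"))
```

The diagrams below fill in the vendor-specific verbs.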

Architecture · Claude Code

str_replace + bash + MCP

User prompt → Plan (long-horizon) → Read + Grep → str_replace (surgical diff) → Bash (pytest, lint) → Reflect (iterate) → MCP (custom servers) → Commit
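
The str_replace verb is the distinctive step in that loop: instead of rewriting a file, the model quotes an exact snippet and its replacement. The payload looks roughly like this (field names follow Anthropic's published text-editor tool; treat the exact shape as an assumption, and the file path is made up):

```python
# Rough shape of a surgical str_replace tool call. Field names follow
# Anthropic's published text-editor tool; exact keys are an assumption.
str_replace_call = {
    "command": "str_replace",
    "path": "src/parser.py",  # hypothetical file
    "old_str": "if token is None:\n        return None",
    "new_str": "if token is None:\n        raise ParseError('unexpected EOF')",
}
# old_str must match the file exactly and uniquely, which is what makes
# the edit surgical: no full-file rewrite, no line-number bookkeeping.
```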

Architecture · OpenAI Codex CLI

apply_patch + sandboxed shell

User prompt → Reasoner (GPT-5.3-Codex) → Sandbox shell (Docker) → apply_patch (unified diff) → Exec tests (pytest / npm test) → Self-critique (reasoning scratch, replan) → PR-style output
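
apply_patch makes the opposite bet: the model emits a diff envelope and the CLI applies it inside the sandbox. Roughly (the *** markers follow OpenAI's published apply_patch format; the details and file path are illustrative):

```python
# Rough shape of an apply_patch payload. The *** envelope markers follow
# OpenAI's published apply_patch format; treat details as assumptions.
APPLY_PATCH_INPUT = """\
*** Begin Patch
*** Update File: src/parser.py
@@ def parse(token):
-    return None
+    raise ParseError("unexpected EOF")
*** End Patch
"""
# One payload can touch many files (*** Add File / *** Delete File), which
# suits PR-style output; the trade-off is coarser edits than str_replace.
```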

Cost vs performance

Eval-run $ for a full SWE-Bench Verified pass vs resolve rate. Mini/Haiku tiers are competitive on price-per-score.

The money visual

Claude Code vs Codex CLI — price of a full Verified run

X: $ per resolved issue (log scale). Y: Verified %. Pink line = Pareto frontier.

Plotted configurations: Claude Code with Opus 4.7, Opus 4.5, Sonnet 4.5, Haiku 4.5; Codex with GPT-5.3-Codex xhigh, GPT-5.2-Codex, GPT-5.2 Mini, GPT-5.2 Nano.
§ 02 · Choose

When to pick which.

Choose Claude Code when
  • You want the single highest SWE-Bench Verified score available
  • Task spans many files or repos; long-horizon planning matters
  • You have custom tools / internal DBs — MCP is a killer feature
  • You prefer surgical str_replace edits over full-file rewrites
Choose Codex CLI when
  • Your team is on OpenAI API contracts
  • You want explicit reasoning knobs per task (see the sketch after this list)
  • You run many cheap bulk tasks — GPT-5.2 Nano is ridiculously cheap
  • You need sandboxing by default
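
The reasoning-knob point maps onto the API's reasoning-effort parameter. A minimal sketch with the OpenAI Python SDK's Responses API (the model name is this article's hypothetical; "xhigh" is taken from the comparison table, not the standard effort list):

```python
# Sketch of per-task reasoning control via the OpenAI Python SDK's
# Responses API (pip install openai). The model name is this article's
# hypothetical; "xhigh" is the table's top tier, assumed to slot in
# alongside the standard effort values.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.responses.create(
    model="gpt-5.3-codex",         # hypothetical model from this comparison
    reasoning={"effort": "high"},  # minimal / medium / high / xhigh per the table
    input="Why does test_parse_eof fail? Propose a minimal patch.",
)
print(resp.output_text)
```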
§ 03 · Method

How the numbers were sourced.

SWE-Bench Verified scores are taken from Anthropic's leaderboard runs (Claude Code) and OpenAI's release notes for the Codex variants of GPT-5.2/5.3. See our SWE-Bench hub for the full leaderboard.

Cost figures reflect a full Verified run at published API rates as of April 2026. Codex's Nano and Mini tiers are priced for bulk; the xhigh reasoning tier carries the same per-token rate but consumes more reasoning tokens.
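
To make the x-axis concrete: SWE-Bench Verified has 500 instances, so cost per resolved issue is total run cost divided by the number of issues resolved. A worked sketch with made-up run costs (the dollar figures below are placeholders, not published pricing):

```python
# Worked example of the scatter plot's x-axis. SWE-Bench Verified has 500
# instances; the run costs below are placeholders, not published pricing.
VERIFIED_INSTANCES = 500

def cost_per_resolved(total_run_usd: float, resolve_rate: float) -> float:
    resolved = VERIFIED_INSTANCES * resolve_rate
    return total_run_usd / resolved

# Hypothetical: a $2,000 full run at 87.6% vs a $150 run at 62.0%.
print(f"${cost_per_resolved(2000, 0.876):.2f} per resolved issue")  # ~$4.57
print(f"${cost_per_resolved(150, 0.620):.2f} per resolved issue")   # ~$0.48
```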

The pipeline diagrams are editorial — they capture the dominant tool verbs each CLI exposes, not every internal step.

§ 04 · Related

Adjacent comparisons.

  • Claude Code vs Cursor Composer · Terminal autonomy vs IDE velocity.
  • Devin vs Claude Code · Autonomous vs interactive.
  • Aider vs Claude Code · Open source vs closed.
  • Best agent for SWE-Bench · Current leaderboard winner.
  • SWE-Bench explained · What the scores actually mean.
  • SWE-Bench hub · Full leaderboard and methodology.
  • Coding lineage · How coding evals evolved.
  • Terminal-Bench · Bench for shell-native agents.