Code · two markets buyers conflate

Code Generation & AI Coding Assistants.

On one side, IDE-integrated developer products (Copilot, Cursor, Cody, Tabnine, Codeium, Augment) priced per seat and optimised for inline completion, chat, and multi-file edits. On the other, raw code-capable LLMs (Claude Opus 4.7, GPT-5, Gemini 3 Pro, DeepSeek, Qwen3-Coder) priced per token — the intelligence layer that powers the first category, and that agentic CLIs like Aider and Codex call directly.

Below: 13 products and models, compared on the axes that actually decide it.

§ 01 · The matrix

13 coding tools & LLMs, side by side.

IDE products (per seat / month) · frontier LLM APIs (per 1M tokens) · open-weights LLMs. Pricing units differ by tier — read the Cost column accordingly.

| Provider | Products | Tier | License | Cost | IDE integrations | Context | Agent (SWE-bench Verified) |
|---|---|---|---|---|---|---|---|
| GitHub Copilot | Copilot · Copilot Chat · Copilot Workspace | IDE | Proprietary product | $10–19 / seat / mo | VSCode · JetBrains · Vim/Neovim · Visual Studio · Xcode | Repo-aware (workspace indexing) | ~55% (Workspace agent) |
| Cursor | Cursor Editor · Composer · Agent | IDE | Proprietary product | $20–40 / seat / mo | Cursor (VSCode fork) only | Repo-wide embedding index · multi-file edits | ~50–60% (Agent mode) |
| Sourcegraph Cody | Cody · Cody Enterprise | IDE | Proprietary product | $9–19 / seat / mo | VSCode · JetBrains · Neovim · Web | Graph-based code intelligence across the monorepo | Not self-reported |
| Tabnine | Tabnine Pro · Enterprise | IDE | Proprietary product | $9–39 / seat / mo | VSCode · JetBrains · Vim/Neovim · Eclipse · many more | Local + repo-aware | Not self-reported |
| Codeium | Codeium · Windsurf Editor | IDE | Proprietary product | Free · $15 / seat / mo | VSCode · JetBrains · Vim · 40+ editors · Windsurf (own IDE) | Repo-wide | Not self-reported |
| Augment Code | Augment Code · Remote Agents | IDE | Proprietary product | $30 / seat / mo | VSCode · JetBrains · Vim · CLI · Remote Agents | 200K+ token context engine across large monorepos | ~65% (Remote Agent, self-reported) |
| Continue | Continue.dev | IDE | Open source | Free · BYO model | VSCode · JetBrains | Configurable: local, repo, or custom retrievers | Harness-dependent |
| Aider | Aider CLI | IDE | Open source | Free · BYO model | CLI (works alongside any editor) | Repo map + git-aware edits | Aider Polyglot leaderboard |
| Anthropic | Claude Opus 4.7 · Sonnet 4.6 | Frontier LLM | Proprietary API | $3 / $15 to $15 / $75 per 1M | Via Copilot · Cursor · Cody · Continue · Aider · Claude Code CLI | 200K–1M tokens | SOTA on SWE-bench Verified (~70%+ harnessed) |
| OpenAI | GPT-5 · GPT-5 Codex · o-series | Frontier LLM | Proprietary API | ~$10 / $30 per 1M (GPT-5 typical) | Via Copilot · Cursor · Cody · Continue · Aider · Codex CLI | 200K–400K tokens typical | ~65–70% (Codex-style harnesses) |
| Google | Gemini 3 Pro · Gemini 3 Ultra | Frontier LLM | Proprietary API | $1.25 / $5 per 1M (Pro) | Via Copilot · Cursor · Cody · Continue · Aider · Gemini CLI | 1M–2M tokens | Top of LiveCodeBench (~91.7%) |
| DeepSeek | DeepSeek V3.2 · V3.1 · Coder V2 | Open LLM | Open weights | $0.27 / $1.10 per 1M (hosted) | Via Continue · Aider · any OpenAI-compatible client | 128K tokens | ~55–60% (best open result) |
| Alibaba / Qwen | Qwen3-Coder · Qwen3-Coder-Plus | Open LLM | Open weights | Self-host · ~$1–3 per 1M (hosted) | Via Continue · Aider · vLLM / SGLang / Ollama | 256K–1M tokens (YaRN) | ~50–55% (best Apache-licensed result) |

Pricing as of 2026-04. IDE products are priced per seat / month; LLMs are priced per 1M tokens (input / output). SWE-bench Verified scores are harness-dependent: the same model can swing 20 points across harnesses, which is why we report approximate ranges and note the harness where relevant.

§ 02 · Decision shortcuts

Which should I use?

The buyer question for an IDE tool (“Cursor or Copilot?”) is about ergonomics and multi-file context. The buyer question for a raw LLM (“Claude Opus 4.7 or GPT-5?”) is about SWE-bench numbers and cost per task. They’re not the same decision.

Solo developer · best bang-for-buck

GitHub Copilot · Cursor Pro

$10–20/mo. Copilot if you live in VSCode or JetBrains; Cursor if you're willing to switch editors for a better multi-file agent loop.

Autonomous / agentic coding (CLI)

Claude Code · Aider + Claude Opus 4.7 · Codex CLI

Terminal-driven loops that read errors, run tests, and commit their own edits. Claude Opus 4.7 is the default model; GPT-5 and Gemini 3 Pro are credible alternatives.

Cheapest credible tokens

Gemini 3 Pro · DeepSeek V3.2

Gemini 3 Pro at $1.25/$5 per 1M is the cheapest frontier model. DeepSeek V3.2 at $0.27/$1.10 delivers roughly 80% of the quality at about one-fifth of Gemini's price.

Large monorepo · repo-scale context

Sourcegraph Cody · Augment Code · Gemini 3 Pro

Cody indexes the graph; Augment's context engine is tuned for 200K+ token codebases; Gemini's 2M window lets you just paste the whole repo.

Air-gapped / on-prem / regulated

Tabnine Enterprise · Continue + self-hosted Qwen3-Coder

No code leaves your VPC. Tabnine is the productised path; Continue + Qwen3-Coder is the build-it-yourself path on Apache-licensed weights.

Open-source-only stack

Continue · Aider · Qwen3-Coder · DeepSeek V3.2

Apache / OSI-approved from editor to weights. Pair Continue or Aider with Qwen3-Coder locally via vLLM for a fully reproducible setup.

Raw LLM for an agent framework

Claude Opus 4.7 · GPT-5 · Gemini 3 Pro

Building your own agent loop? Opus 4.7 currently tops SWE-bench Verified with the right harness; Gemini 3 Pro tops LiveCodeBench on fresh problems.
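The per-token prices quoted under "Cheapest credible tokens" only become comparable once they are converted into a cost per task. The sketch below does that arithmetic for the two models named there. The prices are copied from the matrix; the task size (200K input tokens, 20K output tokens) is a hypothetical workload chosen for illustration, so substitute your own measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrice:
    """USD per 1M tokens (input / output), as listed in the matrix."""
    input_per_m: float
    output_per_m: float

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """Return the USD cost of one task with the given token counts."""
        return (input_tokens * self.input_per_m
                + output_tokens * self.output_per_m) / 1_000_000


# Prices copied from the matrix (2026-04).
PRICES = {
    "Gemini 3 Pro": TokenPrice(1.25, 5.00),
    "DeepSeek V3.2": TokenPrice(0.27, 1.10),
}

# Hypothetical task size, not a measured figure.
TASK_INPUT, TASK_OUTPUT = 200_000, 20_000

costs = {name: p.cost(TASK_INPUT, TASK_OUTPUT) for name, p in PRICES.items()}
for name, usd in costs.items():
    print(f"{name}: ${usd:.3f} per task")
```

Cost per task also depends on how many tokens each model needs to finish the job, so a cheaper model that takes more turns can erase part of its price advantage.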

§ 03 · Methodology

What to actually test (vendor demos lie).

HumanEval-style single-function completion is a solved problem: every model on this page clears it. The interesting failure modes are the ones that surface when you point a tool at a real codebase.

Run the same tasks through 2–3 candidates blind and score on finished PRs, not tokens generated. A tool that writes confident code that doesn’t compile is worse than one that asks a clarifying question. Build your own 6-task eval covering these:

Multi-file refactor

“Rename this interface across 5+ files, update callers, update tests.” Single-function completion is solved; multi-file edits are where Cursor / Augment / Claude Code pull ahead of Copilot-style autocomplete.

Long-context recall

Does it remember the convention you established 400 lines earlier? Or does it reinvent a parallel pattern? This is the difference between a repo-aware tool and a glorified autocompleter.

Library version awareness

Ask it to write code against a library that had a major API change in the last 6 months. Most models hallucinate the old API — the good ones either ask or use your lockfile.

Agentic execution loop

Can it run the tests, read the failures, edit the code, and re-run? This is the qualitative gap between ‘fancy autocomplete’ and ‘junior engineer you can dispatch a ticket to.’

Code review (not generation)

Paste a PR with a subtle bug. Can the tool catch it? Writing new code is easier than finding bugs in existing code — and review is where coding assistants earn their keep in teams.

Domain-specific languages

SQL, Terraform, GLSL, Solidity, Zig. Benchmarks are Python + TypeScript heavy. If your production code is 60% DSL, that's where the real evaluation happens.
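One way to keep the comparison blind is to hide tool names behind labels until scoring is finished. The following is a minimal sketch of that bookkeeping, not a prescribed harness: the function names (`blind_labels`, `tally`, `reveal`) and the "reviewer would merge the PR" pass criterion are assumptions made for illustration.

```python
import random
from collections import defaultdict

# The six task types described above.
TASKS = [
    "multi-file refactor",
    "long-context recall",
    "library version awareness",
    "agentic execution loop",
    "code review",
    "domain-specific language",
]


def blind_labels(candidates, seed=None):
    """Map anonymous labels (A, B, C, ...) to candidate names in random order."""
    shuffled = list(candidates)
    random.Random(seed).shuffle(shuffled)
    return {chr(ord("A") + i): name for i, name in enumerate(shuffled)}


def tally(results):
    """Count accepted PRs per label.

    results: iterable of (label, task, merged) tuples, where `merged` is True
    if the reviewer would accept the finished PR.
    """
    merged = defaultdict(int)
    for label, _task, ok in results:
        merged[label] += int(ok)
    return dict(merged)


def reveal(scores, key):
    """Swap labels back to real names once scoring is finished."""
    return {key[label]: n for label, n in scores.items()}
```

Reviewers see only the labels while scoring; whoever holds the `key` mapping calls `reveal` at the end.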

§ 04 · Metrics

Why HumanEval / MBPP scores stopped being meaningful.

HumanEval (2021) and MBPP (2021) are the original code-gen benchmarks — 164 and 974 short Python problems respectively. They’re both saturated and both contamination-prone: the problems are on the public internet, which means every model trained after ~2022 has seen them. Reporting 95% on HumanEval tells you the model was trained; it doesn’t tell you the model is good.

A 2-point delta on HumanEval is entirely noise. Worse, it’s often training-data leakage dressed as capability.
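The noise claim can be checked with simple arithmetic. The sketch below treats a HumanEval score as a binomial proportion over 164 independent problems; that independence is a simplifying assumption, and the 95% score is the example figure from the text above.

```python
import math

N_PROBLEMS = 164   # HumanEval size
score = 0.95       # example pass rate

# Standard error of a pass rate treated as a binomial proportion.
std_err = math.sqrt(score * (1 - score) / N_PROBLEMS)

# Approximate 95% interval half-width (normal approximation).
half_width = 1.96 * std_err

# How many problems a 2-point delta corresponds to.
problems_in_two_points = 0.02 * N_PROBLEMS

print(f"standard error: {std_err * 100:.1f} points")
print(f"95% interval:   +/- {half_width * 100:.1f} points")
print(f"2-point delta:  {problems_in_two_points:.1f} problems")
```

Under these assumptions the standard error is about 1.7 points, so a 2-point gap (roughly three problems) sits well inside the sampling noise of a single run.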

The evals that still discriminate in 2026 are SWE-bench Verified (real GitHub issues, agentic), LiveCodeBench (time-stamped competition problems, contamination-resistant by construction), Aider Polyglot (6 languages, edit-based), and BigCodeBench (function-level with real library calls).

Even these have a harness problem — the same base model can post wildly different SWE-bench numbers depending on the scaffold. Agent harness > model: a great model inside a bad harness loses to a worse model with a good one.
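To make "harness" concrete, here is a minimal sketch of the feedback loop a scaffold adds around a model: run the tests, hand the failure log back, retry. It is an illustration under stated assumptions, not any vendor's implementation. `run_tests` and `propose_fix` are hypothetical callables you would supply (for example, a wrapper around your test runner and a call that asks a model for an edit and applies it).

```python
from typing import Callable, Optional, Tuple


def run_agent_loop(
    run_tests: Callable[[], Tuple[bool, str]],
    propose_fix: Callable[[str], None],
    max_attempts: int = 3,
) -> Optional[int]:
    """Run tests, feed failures back to the model, and retry.

    run_tests   -> (passed, failure_log); hypothetical test-runner wrapper.
    propose_fix -> hypothetical call that sends the log to a model and
                   applies the edit it returns.
    Returns the attempt number on which tests passed, or None if they
    never passed within max_attempts.
    """
    for attempt in range(1, max_attempts + 1):
        passed, log = run_tests()
        if passed:
            return attempt
        if attempt < max_attempts:
            # Error feedback: the model sees the real failure, not a guess.
            propose_fix(log)
    return None
```

A scaffold without this loop gives the model one shot with no evidence of what broke, which is one reason the same model can score so differently across harnesses.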

§ 05 · Reference benchmarks

The boards that matter.

Six datasets that show up on every coding leaderboard. The first two are saturated legacy; the next three are the ones you should actually read when comparing models in 2026; the last is the emerging function-level standard.

HumanEval

164 problems · hand-written Python · 2021

The pioneer. OpenAI's 2021 release that defined the category. Every model on this page clears 90%+. Saturated; contamination-prone. Reported for historical comparability only.

MBPP

974 basic Python problems · 2021

“Mostly Basic Python Programming.” Pairs naturally with HumanEval as the entry-level benchmark suite. Same saturation story — useful only for checking that a model can code at all.

LiveCodeBench

Rolling · competition-style · time-stamped · 2024

Contest problems from LeetCode, AtCoder, and Codeforces, time-stamped to filter out contamination. Updates continuously. Gemini 3 Pro leads at ~91.7% on recent slices; still the cleanest signal on raw algorithmic ability.

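The time-stamp idea behind LiveCodeBench can be illustrated in a few lines: score a model only on problems published after its training cutoff. This is a sketch of the principle, not LiveCodeBench's own code; the `Problem` type, the problem titles, and the cutoff date are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    title: str
    released: date  # date the contest problem was first published


def uncontaminated(problems, training_cutoff):
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


# Hypothetical problems and cutoff, for illustration only.
problems = [
    Problem("two-pointer warmup", date(2024, 3, 1)),
    Problem("interval scheduling", date(2025, 1, 15)),
    Problem("tree rerooting", date(2025, 9, 30)),
]
cutoff = date(2025, 1, 15)

fresh = uncontaminated(problems, cutoff)
print([p.title for p in fresh])
```

Because the eligible slice moves forward with each model's cutoff, scores on different slices are not directly comparable unless the date window is reported alongside them.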

SWE-bench Verified

500 real GitHub issues · Python · 2024

Human-verified subset of the original SWE-bench. Each task is a real bug-fix PR from a popular OSS project. Requires editing multiple files, running tests, and iterating — the reference agentic benchmark. Harness-dependent.

Aider Polyglot

225 problems · 6 languages · 2024

Edit-based eval across Python, JavaScript, Go, Rust, C++, Java. Run through Aider's CLI harness (the same harness you'd use in production), which makes the numbers honest and reproducible.

BigCodeBench

1,140 problems · real library calls · 2024

Function-level generation that forces the model to use real libraries (requests, numpy, pandas, etc). Discriminates between models that memorised docs and models that can actually compose APIs.

§ 06 · Practical tips

Five rules for picking a coding stack in 2026.

Stop looking at HumanEval. It’s contamination-prone and saturated — every frontier model clears 90%+ and the ranking tells you more about training data than capability. Use LiveCodeBench (for algorithmic problems) or SWE-bench Verified (for agentic, real-codebase work) instead.

IDE tools and raw APIs are different purchases. IDE products matter for ergonomics: completion latency, keybindings, diff UX, multi-file context in the editor. Raw LLM APIs matter for autonomous workflows: an agent dispatched to a backlog ticket doesn’t care how pretty the diff view is. Don’t pick one and shoehorn it into the other’s job.

Open-weights have closed most of the gap. Qwen3-Coder and DeepSeek V3.2 land ~80% of frontier quality at a fraction of the cost per token: at the list prices in the matrix, DeepSeek V3.2 is about 1/5th the price of Gemini 3 Pro and roughly 1/30th the price of GPT-5. If your workload is high-volume or your procurement team rejects anything that can’t run on-prem, this is the pragmatic path, especially paired with an open agent harness like Continue or Aider.

Agent harness beats model. A great model inside a bad scaffold (no test execution, no error feedback loop, no retry on failure) loses to a weaker model with a well-engineered harness. SWE-bench numbers are mostly a harness story — Anthropic, Aider, and Cognition all post different numbers for the same underlying Claude. If you’re picking a tool, the harness matters more than the model brand.

Cache your prompts on long codebases. Anthropic and OpenAI both offer prompt caching with 50–90% discount on cached input. If you’re running a coding agent that re-sends the same system prompt + repo context on every turn, caching is the highest-leverage cost lever available — easily 50–80% off the monthly bill.
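A rough way to see the size of the caching effect is to model the input-side bill of an agent that re-sends the same context every turn. The sketch below is a simplified estimate, not a vendor billing formula: it assumes the first turn pays full price, every later turn hits the cache, and cache-write surcharges and cache expiry are ignored. The workload (40 turns, 100K context tokens, 2K new tokens per turn) is hypothetical; the $3 per 1M input price is the low end of the Anthropic range in the matrix.

```python
def agent_input_cost(turns, context_tokens, new_tokens_per_turn,
                     price_per_m, cached_discount=0.0):
    """Input-side cost in USD for an agent that re-sends the same context each turn.

    cached_discount is the fraction taken off cached input tokens
    (0.5 to 0.9 according to the text). Simplifying assumptions: the first
    turn pays full price, later turns always hit the cache, and cache-write
    surcharges and expiry are ignored.
    """
    first_turn = context_tokens + new_tokens_per_turn
    later_turn = context_tokens * (1 - cached_discount) + new_tokens_per_turn
    billed_tokens = first_turn + (turns - 1) * later_turn
    return billed_tokens * price_per_m / 1_000_000


# Hypothetical workload, for illustration only.
base = agent_input_cost(40, 100_000, 2_000, 3.0)
cached = agent_input_cost(40, 100_000, 2_000, 3.0, cached_discount=0.9)
saving = 1 - cached / base

print(f"uncached ${base:.2f}, cached ${cached:.2f}, saving {saving:.0%}")
```

With these numbers the input cost drops from $12.24 to $1.71 per session, about 86%. Output tokens are not discounted, so the saving on the total bill is smaller than the saving on input alone.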
