AI Implementation Methodology

We design the
failure modes first.

Most AI projects fail not because of bad models, but because of wrong outcomes, ignored risks, and no adoption plan. We fix that with a process built from benchmarking 40+ models under real production conditions.

40+

Models benchmarked

codesota.com/ocr

71%

CER (character error rate) reduction

RysOCR, Polish docs

6

Phase framework

outcome → risk → ship

The Problem

Why AI projects keep failing the same way

The pattern is consistent across industries. We have seen it, measured it, and built a process that does not repeat it.

01

No defined outcomes

Projects kick off with "discovery" before anyone agrees what success looks like in business terms.

02

Risk as afterthought

Teams rush to PoC before mapping failure modes. The expensive risks surface late.

03

Vendor claims, no data

Vendors benchmark on their own test sets. We run independent evaluations on standardised datasets; a sketch of the core OCR metric we report follows this list.

04

No explainability

Confidence design happens after the model is built. Users don't trust the system and route around it.

05

Deployment = done

Projects "complete" at launch. No adoption tracking, no feedback loops, no measurement of impact.

06

No kill criteria

Without pre-defined stopping conditions, failing projects keep burning budget.
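What independent evaluation means in practice: every candidate model is scored with the same metric on the same held-out documents, never on a vendor's own test set. Below is a minimal sketch of the headline OCR metric, character error rate (CER), computed as edit distance over reference length; the benchmark helper and the data it takes are illustrative, not our production harness.

```python
# Minimal CER sketch: edit distance between a model's transcription and the
# ground truth, divided by reference length. Everything here is illustrative.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edits needed per reference character."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)

def benchmark(pairs: list[tuple[str, str]]) -> float:
    """Mean CER over (reference, hypothesis) pairs from a standardised set."""
    return sum(cer(ref, hyp) for ref, hyp in pairs) / len(pairs)
```

The same pairs, the same metric, every model: that is the whole discipline behind the "40+ models benchmarked" number.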

The Framework

Outcome-first AI development

A continuous loop that starts with outcomes, selects models using benchmark data, and never stops measuring.

Phase 00

Outcome Definition

Before any discovery, any model selection, any code — we define what success looks like in business terms. And what would make us stop.

North Star Metric

One number that moves

What single metric proves this worked? Not "model accuracy" — actual business impact. Invoice processing time. Screening throughput. Error rate in production.

Decision Rights

What does AI decide?

Recommend? Automate? Escalate? The boundary between AI and human decision is designed explicitly — not discovered later when something goes wrong.

Cost of Wrong

Failure taxonomy

A false positive in PEP screening costs differently than one in document OCR. We map the asymmetry of errors before touching a dataset.

Kill Criteria

Pre-defined stopping conditions

What would make us recommend stopping? Defined upfront, in writing, signed by sponsor. Not a post-hoc rationalisation when budget is spent.

Deliverable: Outcome Charter, 1-2 pages, signed by the project sponsor
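For illustration only, here is roughly what an Outcome Charter pins down, written as a small data structure. Every field name and example value below is hypothetical, not a template we ship and not a client's charter.

```python
# Illustrative sketch of the decisions an Outcome Charter records before any
# build work starts. Names and example values are hypothetical.
from dataclasses import dataclass, field
from enum import Enum

class DecisionRight(Enum):
    RECOMMEND = "AI suggests, human decides"
    AUTOMATE = "AI decides, human audits"
    ESCALATE = "AI routes edge cases to a human"

@dataclass
class OutcomeCharter:
    north_star_metric: str           # one business number that must move
    baseline: float                  # where that number stands today
    target: float                    # what "worked" means, agreed upfront
    decision_right: DecisionRight    # explicit AI/human boundary
    cost_of_false_positive: str      # asymmetry of errors, in business terms
    cost_of_false_negative: str
    kill_criteria: list[str] = field(default_factory=list)  # pre-agreed stop conditions
    sponsor: str = ""                # who signs, and who can pull the plug

charter = OutcomeCharter(
    north_star_metric="median invoice processing time (minutes)",
    baseline=22.0,
    target=8.0,
    decision_right=DecisionRight.RECOMMEND,
    cost_of_false_positive="an extra manual review, a few minutes of analyst time",
    cost_of_false_negative="a mis-booked invoice and a month-end correction",
    kill_criteria=[
        "no model beats the baseline on the standardised set by the agreed margin",
        "projected unit cost per document exceeds the manual cost",
    ],
    sponsor="COO",
)
```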

What Changes

The difference in practice

Dimension | Typical agency | CodeSOTA approach
Model selection | Vendor demos, blog posts, familiarity | Independent benchmarks on standardised datasets
Risk detection | During or after build | Phase 01: mapped before a single line of code
Adoption strategy | Training session at launch | Explicit parallel track from Phase 05 onwards
Trust architecture | Post-build, if at all | Designed in Phase 04, before engineering begins
Kill decision | After sunk costs, political | Pre-defined criteria in the Outcome Charter (Phase 00)
Measurement | Project closes at deployment | Phase ∞: continuous loop, business metrics first

Selected Outcomes

What this looks like in production

Anonymised. Numbers are real.

Industrial inspection — NDT / Energy

~85%

Detection accuracy

Real-time

Processing speed

Computer vision for automated defect detection in industrial inspection. Replaced manual visual review. Key decision from Risk Architecture: manual override is always available and every override is logged as a training signal.
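For illustration, one way that override rule can look in code: the inspector's decision always wins, and every disagreement is written out as a labelled example for the next retraining cycle. The record fields and the JSONL log below are assumptions made for the sketch, not the client's actual pipeline.

```python
# Illustrative sketch: a manual override is never blocked, and every override
# is persisted as a labelled example for the next training cycle.
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class OverrideRecord:
    image_id: str
    model_verdict: str        # e.g. "defect" / "no_defect"
    model_confidence: float
    inspector_verdict: str    # what the human decided instead
    inspector_id: str
    timestamp: float

def log_override(record: OverrideRecord, path: str = "overrides.jsonl") -> None:
    """Append the disagreement to a file that feeds the next retraining run."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")

def final_verdict(model_verdict: str, model_confidence: float,
                  inspector_verdict: str | None, image_id: str,
                  inspector_id: str) -> str:
    """The inspector always wins; a disagreement becomes training data."""
    if inspector_verdict is not None and inspector_verdict != model_verdict:
        log_override(OverrideRecord(image_id, model_verdict, model_confidence,
                                    inspector_verdict, inspector_id, time.time()))
        return inspector_verdict
    return inspector_verdict or model_verdict
```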

Compliance / AML screening — Fintech

3x

Screening throughput

-60%

False positive rate

LLM-powered adverse media and PEP screening. The Cost of Wrong analysis (Phase 00) determined that false negatives carried regulatory risk — so the model was tuned conservatively, with explainability designed for compliance officer review.
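"Tuned conservatively" has a concrete meaning: the alert threshold is chosen by pricing a missed match far above an extra alert, then picking the cheapest operating point. The sketch below shows the idea; the cost weights and the scoring interface are illustrative assumptions, not the client's figures.

```python
# Illustrative sketch of cost-weighted threshold selection for screening.
# A missed match (false negative) is priced far above an extra alert, so the
# chosen threshold errs on the side of flagging. Costs and scores are made up.
COST_FALSE_NEGATIVE = 50.0   # regulatory exposure, assumed weight
COST_FALSE_POSITIVE = 1.0    # one extra manual review, assumed weight

def expected_cost(threshold: float, scored: list[tuple[float, bool]]) -> float:
    """Total cost at a threshold, over (model_score, is_true_match) pairs."""
    cost = 0.0
    for score, is_match in scored:
        flagged = score >= threshold
        if is_match and not flagged:
            cost += COST_FALSE_NEGATIVE   # missed hit
        elif flagged and not is_match:
            cost += COST_FALSE_POSITIVE   # analyst reviews and dismisses
    return cost

def pick_threshold(scored: list[tuple[float, bool]]) -> float:
    """Scan candidate thresholds and keep the cheapest operating point."""
    candidates = [i / 100 for i in range(1, 100)]
    return min(candidates, key=lambda t: expected_cost(t, scored))
```

With a weighting like 50:1, the optimum naturally shifts toward lower thresholds, trading more analyst reviews for fewer missed hits, which is what the explainability layer then has to make reviewable.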

AI Readiness Assessment

90 minutes.
No pitch.
A real answer.

Book Assessment →

hello@codesota.com

01

Outcome Definition session

We work through your North Star Metric, decision rights, and what "done" actually means.

02

Initial Risk Architecture

We map your top 5 riskiest assumptions. The ones that would kill the project if you discovered them in month four.

03

Benchmark model shortlist

If relevant, we pull current benchmark data from CodeSOTA and give you a ranked shortlist — before you have spent anything.

04

Written report

A 2-page document you keep regardless of what happens next. Useful whether you work with us or not.