Codesota · Consulting · AI implementation methodology · 8 phases · outcome-first

Editorial · How we work

We design the failure modes first.

Most AI projects fail not because of bad models, but because of poorly defined outcomes, ignored risks, and no adoption plan.

A process built from benchmarking 40+ models across real production conditions, feeding directly into Codesota's open registry.

§ 01 · The pattern

Why AI projects keep failing the same way.

The pattern is consistent across industries. We have seen it, measured it, and built a process that does not repeat it.

No defined outcomes

Projects kick off with "discovery" before anyone agrees what success looks like in business terms.

Risk as afterthought

Teams rush to PoC before mapping failure modes. The expensive risks surface late.

Vendor claims, no data

Vendors benchmark on their own test sets. We run independent evaluations on standardised datasets.

No explainability

Confidence design happens after the model is built. Users don't trust the system and route around it.

Deployment = done

Projects "complete" at launch. No adoption tracking, no feedback loops, no measurement of impact.

No kill criteria

Without pre-defined stopping conditions, failing projects keep burning budget.

§ 02 · The framework

Outcome-first AI development.

Eight phases, end-to-end. Every phase has explicit cards, a named deliverable, and a hand-off into the next. Skim the index, then drop into any phase below.

Phase 00

Kill criteria

Outcome Definition

Before any discovery, any model selection, any code — we define what success looks like in business terms. And what would make us stop.

North Star Metric

One number that moves

What single metric proves this worked? Not "model accuracy" — actual business impact. Invoice processing time. Screening throughput. Error rate in production.

Decision Rights

What does AI decide?

Recommend? Automate? Escalate? The boundary between AI and human decision is designed explicitly — not discovered later when something goes wrong.

Cost of Wrong

Failure taxonomy

A false positive in PEP screening carries a very different cost from a false positive in document OCR, and a false negative differs again. We map the asymmetry of errors before touching a dataset.
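The asymmetry can be put into numbers before any data work starts. A minimal sketch, assuming hypothetical per-error costs, error rates, and volumes (every name and figure below is illustrative, not taken from an engagement):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorCosts:
    false_positive: float  # cost of one false alarm, e.g. analyst review time
    false_negative: float  # cost of one missed case, e.g. regulatory exposure


def expected_error_cost(volume: int, positive_share: float,
                        fp_rate: float, fn_rate: float,
                        costs: ErrorCosts) -> float:
    """Expected cost of errors over `volume` items."""
    positives = volume * positive_share
    negatives = volume - positives
    fp_cost = negatives * fp_rate * costs.false_positive
    fn_cost = positives * fn_rate * costs.false_negative
    return fp_cost + fn_cost


# Hypothetical cost profiles: identical error rates, very different bills.
screening = ErrorCosts(false_positive=5.0, false_negative=50_000.0)
ocr = ErrorCosts(false_positive=0.5, false_negative=2.0)

screening_cost = expected_error_cost(10_000, 0.01, 0.05, 0.02, screening)
ocr_cost = expected_error_cost(10_000, 0.01, 0.05, 0.02, ocr)
```

With these assumed figures the same model behaviour costs several hundred times more in screening than in OCR, which is why the error taxonomy comes before model selection.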

Kill Criteria

Pre-defined stopping conditions

What would make us recommend stopping? Defined upfront, in writing, signed by sponsor. Not a post-hoc rationalisation when budget is spent.
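Stopping conditions are easier to honour when they are written as data rather than prose. A minimal sketch of one possible encoding; the criterion names, metric names, and thresholds are hypothetical, and treating an unmeasured metric as a breach is our assumption for this example:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KillCriterion:
    name: str
    metric: str
    threshold: float
    stop_if: str  # "below" or "above"

    def is_breached(self, value: float) -> bool:
        if self.stop_if == "below":
            return value < self.threshold
        if self.stop_if == "above":
            return value > self.threshold
        raise ValueError(f"unknown direction: {self.stop_if!r}")


def breached_criteria(criteria, observed):
    """Names of criteria that are breached.

    A metric missing from `observed` counts as breached (assumption):
    an unmeasured criterion should not keep a project alive.
    """
    return [
        c.name for c in criteria
        if c.metric not in observed or c.is_breached(observed[c.metric])
    ]


# Hypothetical charter entries.
charter = [
    KillCriterion("accuracy below bar", "field_accuracy", 0.90, "below"),
    KillCriterion("unit cost too high", "cost_per_page_eur", 0.05, "above"),
]
observed = {"field_accuracy": 0.87, "cost_per_page_eur": 0.03}
stop_reasons = breached_criteria(charter, observed)
```

A non-empty `stop_reasons` list is the trigger for the stop conversation with the sponsor, not an automatic shutdown.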

Deliverable: Outcome Charter (1-2 pages, signed by project sponsor)

Phase 01

Before any code

Risk Architecture

Map all the ways this can fail before building anything. We design failure modes first — then build systems that avoid them.

Data Risk

Garbage in, garbage out

Missing labels, distribution shift, adversarial inputs, PII exposure. We audit before training, not after the model underperforms.

Model Risk

Wrong approach, wrong metric

Accuracy on a benchmark vs. accuracy on your documents. We map the gap before committing to an architecture.

Integration Risk

Legacy systems don't move

API limitations, latency constraints, data format mismatches — these kill production deployments. We surface them in Phase 1, not Phase 5.

Adoption Risk

Users route around systems they don't trust

Workarounds, shadow processes, "I just do it manually" — adoption failure is as terminal as technical failure. We design for it explicitly.

Deliverable: Risk Matrix + Ranked Riskiest Assumptions + Go/No-Go Checkpoint

Phase 02

Deep excavation

Archaeology

Not discovery. Forensic understanding of how work actually happens — the tribal knowledge, hidden rules, and workarounds that determine real success.

Process Archaeology

How they actually do it

Not how the process doc says they do it. We map the actual workflow — the manual checks, the copy-paste, the Excel file nobody admits exists.

Data Archaeology

What exists, what's lying

Schema vs. reality. Labels created by different people in different contexts. Historical data that reflects policy decisions that no longer apply.

Tribal Knowledge

Rules that live in people's heads

"Sarah always double-checks the Polish names." "These suppliers always format amounts wrong." If it's not in the system, it needs to be in the model.

Workaround Mapping

Where the current system fails

Every workaround is a failure mode waiting to become a model bug. We document them before they become our problem.

Deliverable: Current State Map + Data Audit + Hidden Rules Document

Phase 03

Codesota data

Benchmark Selection

We don't choose models based on vendor claims or blog posts. We select from independently verified benchmark data — the same infrastructure that powers Codesota.

Independent Benchmarks

Not the vendor's test set

Every model we recommend has been evaluated on standardised datasets: ICDAR, SROIE, OmniDocBench, and domain-specific Polish document corpora.

Task-Specific Selection

Best for invoices != best for handwriting

We slice benchmarks by document type, language, and production conditions — not top-line accuracy scores that hide tail failures.
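To illustrate how a top-line score hides a weak slice, here is a minimal sketch. The record layout (`doc_type`, `correct`) and the sample results are hypothetical:

```python
from collections import defaultdict


def accuracy_by_slice(results, key):
    """Accuracy per value of `key`, e.g. per document type."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [correct, total]
    for row in results:
        bucket = totals[row[key]]
        bucket[0] += int(row["correct"])
        bucket[1] += 1
    return {name: correct / total for name, (correct, total) in totals.items()}


# Hypothetical evaluation results: 10 invoices, 2 handwritten pages.
results = (
    [{"doc_type": "invoice", "correct": True}] * 9
    + [{"doc_type": "invoice", "correct": False}]
    + [{"doc_type": "handwriting", "correct": True}]
    + [{"doc_type": "handwriting", "correct": False}]
)

overall = sum(r["correct"] for r in results) / len(results)
by_type = accuracy_by_slice(results, "doc_type")
worst_slice = min(by_type, key=by_type.get)
```

In this made-up sample the overall score looks acceptable while the handwriting slice is at coin-flip level, which is the pattern slicing is meant to expose.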

Cost Modelling

API cost per 1K pages matters

The fastest model and the cheapest model are rarely the same. We model cost, latency, and accuracy together against your actual volume.
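A minimal sketch of modelling the three dimensions together, assuming hypothetical candidates, prices, and constraints. The selection rule (cheapest candidate that meets the accuracy and latency bars) is one reasonable choice, not the only one:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    price_per_1k_pages: float  # assumed list price in one currency
    p95_latency_s: float
    accuracy: float


def monthly_cost(candidate: Candidate, pages_per_month: int) -> float:
    return pages_per_month / 1000 * candidate.price_per_1k_pages


def cheapest_viable(candidates, pages_per_month, min_accuracy, max_latency_s):
    """Cheapest candidate meeting both bars, or None if none qualifies."""
    viable = [
        c for c in candidates
        if c.accuracy >= min_accuracy and c.p95_latency_s <= max_latency_s
    ]
    if not viable:
        return None
    return min(viable, key=lambda c: monthly_cost(c, pages_per_month))


# Hypothetical shortlist.
shortlist = [
    Candidate("model_a", 1.50, 0.8, 0.97),
    Candidate("model_b", 0.40, 3.5, 0.95),
    Candidate("model_c", 0.25, 1.2, 0.89),
]

strict = cheapest_viable(shortlist, 200_000, min_accuracy=0.94, max_latency_s=2.0)
relaxed = cheapest_viable(shortlist, 200_000, min_accuracy=0.94, max_latency_s=5.0)
```

Relaxing the latency bar changes the answer in this example, which shows why the constraints have to come from the actual workflow rather than from a leaderboard.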

Fine-tune vs. API

When to own the model

Sometimes fine-tuning beats any API at 1/10th the cost. We've done it: RysOCR achieved a 71% reduction in character error rate (CER) on Polish documents. We know when it's worth it.
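For readers unfamiliar with the metric: character error rate is commonly computed as the edit distance between the OCR output and the reference text, divided by the reference length. A minimal sketch of that standard definition follows; the sample strings and the baseline/improved figures are illustrative and are not the RysOCR evaluation data:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; assumes both strings use the same Unicode normalisation."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def relative_reduction(baseline: float, improved: float) -> float:
    return (baseline - improved) / baseline


# Dropped diacritics are a typical failure on Polish text.
example_cer = cer("Łódź", "Lodz")
# Illustrative figures only: a CER of 0.20 falling to 0.05.
example_reduction = relative_reduction(0.20, 0.05)
```

A relative reduction says nothing about the absolute level, so both the baseline and the improved CER belong in any report.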

Codesota

40+ models tracked across 9 datasets. When we recommend a model for your use case, we can show you exactly where it ranks, where it fails, and what the next-best alternative is. See the benchmarks →

Deliverable: Model Selection Report (ranked candidates with benchmark evidence)

Phase 04

Trust first

Confidence Design

How humans will trust — or appropriately distrust — this system. Designed before the model is built, not bolted on after deployment.

Uncertainty Communication

How does it say "I don't know"?

A model that outputs a confident wrong answer is worse than one that flags uncertainty. We design the confidence interface as a first-class deliverable.
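One simple way to let a system say "I don't know" is to route each prediction by its confidence. A minimal sketch, assuming the confidence score is reasonably calibrated; the thresholds 0.95 and 0.60 are placeholders that would be set from the Cost of Wrong analysis:

```python
from enum import Enum


class Route(Enum):
    AUTO_ACCEPT = "auto_accept"
    HUMAN_REVIEW = "human_review"
    ABSTAIN = "abstain"


def route(confidence: float, accept_at: float = 0.95,
          review_at: float = 0.60) -> Route:
    """Map a confidence score in [0, 1] to a handling route."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if review_at > accept_at:
        raise ValueError("review_at must not exceed accept_at")
    if confidence >= accept_at:
        return Route.AUTO_ACCEPT
    if confidence >= review_at:
        return Route.HUMAN_REVIEW
    return Route.ABSTAIN  # the system declines to answer


decisions = [route(c) for c in (0.99, 0.80, 0.30)]
```

The thresholds only mean something if a score of 0.95 is right about 95% of the time, so calibration has to be checked on held-out data first.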

Explainability Architecture

Right level, right audience

The compliance officer needs different explainability than the data engineer. We design both — not a generic "show the attention map" non-answer.

Human Override Design

Override is a feature, not a failure

When can humans override? What's logged? How does that feed back into the model? We treat every override as a training signal.
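A minimal sketch of what an override record might contain and how it could become a candidate training example. The field names, identifiers, and sample values are hypothetical:

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class OverrideEvent:
    item_id: str
    model_output: str
    model_confidence: float
    human_output: str
    user_id: str
    reason: str
    timestamp: str  # ISO 8601, UTC


def record_override(item_id, model_output, model_confidence,
                    human_output, user_id, reason) -> OverrideEvent:
    return OverrideEvent(
        item_id, model_output, model_confidence, human_output, user_id,
        reason, timestamp=datetime.now(timezone.utc).isoformat(),
    )


def to_training_example(event: OverrideEvent) -> dict:
    """Candidate example: the human value as label, the model value as the rejected output."""
    return {
        "input_id": event.item_id,
        "label": event.human_output,
        "rejected": event.model_output,
    }


event = record_override("inv-001", "1 200,00", 0.91, "12 000,00",
                        "reviewer-7", "amount misread")
log_line = json.dumps(asdict(event), ensure_ascii=False)  # one line per override
example = to_training_example(event)
```

Humans make mistakes too, so in this sketch an override is a candidate label that still goes through review before it reaches a training set.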

Trust Calibration

Prevent over-trust and under-trust

Both failure modes are real. A user who blindly accepts all outputs is as dangerous as one who ignores them. We design for the middle.

Deliverable: Trust Architecture Document + Confidence UI Specifications

Phase 05

Go/No-go

Learn Fast

Test the riskiest assumptions with minimum investment. Prove business value, not just technical feasibility. This phase exists to give you a real Go/No-Go.

Assumption Testing

Start with the riskiest thing

The assumption most likely to kill the project gets tested first — not saved for Phase 5 discovery when half the budget is gone.

Data Proof

Can we get the data we need?

Volume, quality, labelling cost, legal constraints. Many projects die here. Better to know in week two than month six.

Model Proof

Can any model hit required accuracy?

Against the threshold defined in Phase 0. Not "it looks promising" — a real number against a real bar.
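A measured accuracy on a small test set is itself uncertain, so "a real number against a real bar" can be read as: the lower end of a confidence interval must clear the bar. A minimal sketch using the Wilson score interval; the choice of interval, the 95% level, and the sample counts are our assumptions for illustration:

```python
import math


def wilson_lower_bound(successes: int, trials: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    centre = (p + z2 / (2 * trials)) / denom
    margin = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials ** 2)) / denom
    return centre - margin


def clears_bar(successes: int, trials: int, required: float) -> bool:
    return wilson_lower_bound(successes, trials) >= required


# Hypothetical pilot: 470 correct out of 500 documents (94% observed).
lower = wilson_lower_bound(470, 500)
passes_90 = clears_bar(470, 500, required=0.90)
passes_93 = clears_bar(470, 500, required=0.93)
```

In this example an observed 94% supports a 90% bar but not a 93% one, which is the difference between "it looks promising" and a defensible Go.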

Value Proof

Does it change behaviour?

Test with real users. A model that hits 94% accuracy but gets routed around by every operator has failed. Behaviour change is the real metric.

Deliverable: Validated Learning Report + Go / No-Go Recommendation

Phase 06

Production

Engineer & Embed

Building the production system and securing organisational adoption run in parallel. A model that works but nobody uses is a failure.

Engineering Track

Production-grade MLOps

Training pipelines, monitoring, testing, documentation, rollback procedures. Not a notebook — an engineering system that runs when you're not watching.

Embedding Track

Adoption as explicit work

Workflow integration, training, champions network, change communication. Runs in parallel with engineering — not as a post-launch afterthought.

Handover Design

You own it after we leave

Full documentation, runbooks, on-call procedures. We design for the scenario where the person who built it isn't available at 2am.

Monitoring Setup

Instrumented from day one

Business metrics, model metrics, adoption metrics — all in one dashboard. The foundation for the continuous loop.

Deliverable: Production System + Adoption Dashboard + Runbook

Phase ∞

Permanent

Measure & Evolve

This is not a phase — it's an operating mode. Continuous monitoring feeds back into the loop. The model that shipped is not the model that runs in 12 months.

Business Outcomes

Is the North Star moving?

The metric defined in Phase 0. Tracked against baseline, reported monthly, tied to business impact — not model accuracy.

Model Health

Drift is silent and lethal

Accuracy, latency, error distribution, data drift against production baseline. Automated alerts when the model starts behaving differently.
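One widely used drift signal is the Population Stability Index (PSI), which compares how a feature is distributed across bins in production against the baseline. A minimal sketch; the binning, the sample counts, and the alert threshold are assumptions to be tuned per feature:

```python
import math


def psi(baseline_counts, current_counts, eps: float = 1e-6) -> float:
    """Population Stability Index over pre-binned counts.

    `eps` floors empty bins so the logarithm stays defined.
    """
    if len(baseline_counts) != len(current_counts):
        raise ValueError("both inputs must use the same bins")
    b_total = sum(baseline_counts)
    c_total = sum(current_counts)
    score = 0.0
    for b, c in zip(baseline_counts, current_counts):
        b_share = max(b / b_total, eps)
        c_share = max(c / c_total, eps)
        score += (c_share - b_share) * math.log(c_share / b_share)
    return score


# Hypothetical counts per bin, e.g. document length quartiles.
baseline = [100, 100, 100, 100]
this_week = [50, 100, 100, 150]

stable = psi(baseline, baseline)
shifted = psi(baseline, this_week)
needs_attention = shifted > 0.1  # assumed threshold, a common rule of thumb
```

PSI flags that inputs have moved, not that accuracy has dropped, so it works as a trigger for re-evaluation rather than as a verdict.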

Adoption Health

Usage, override rates, workarounds

If override rates climb, something changed — in the model, the workflow, or the organisation. We detect it before it becomes invisible.
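A minimal sketch of an override-rate alert, comparing a recent window against an agreed baseline. The 50% relative-increase tolerance and the sample counts are placeholders:

```python
def override_rate(overrides: int, decisions: int) -> float:
    if decisions <= 0:
        raise ValueError("decisions must be positive")
    return overrides / decisions


def override_alert(baseline_rate: float, recent_overrides: int,
                   recent_decisions: int,
                   max_relative_increase: float = 0.5) -> bool:
    """True when the recent override rate exceeds the baseline by more than the tolerance."""
    recent = override_rate(recent_overrides, recent_decisions)
    return recent > baseline_rate * (1 + max_relative_increase)


# Hypothetical: baseline 8% overrides; two recent weeks of 200 decisions each.
quiet_week = override_alert(0.08, 18, 200)  # 9% override rate
noisy_week = override_alert(0.08, 30, 200)  # 15% override rate
```

An alert only says that something changed; deciding whether the cause sits in the model, the workflow, or the organisation still takes a conversation with the users.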

Improvement Backlog

Every signal feeds the loop

User overrides become training data. Business metric gaps become model targets. The loop closes continuously — not at the next contract renewal.

Deliverable: Monthly Health Report + Improvement Backlog + Next Iteration Plan

§ 03 · What changes

The difference in practice.

Dimension	Typical agency	Codesota approach
Model selection	Vendor demos, blog posts, familiarity	Independent benchmarks on standardised datasets
Risk detection	During or after build	Phase 01 — mapped before a single line of code
Adoption strategy	Training session at launch	Explicit parallel track from Phase 05 onwards
Trust architecture	Post-build, if at all	Designed in Phase 04 before engineering begins
Kill decision	After sunk costs, political	Pre-defined criteria in Outcome Charter, Phase 00
Measurement	Project closes at deployment	Phase ∞ — continuous loop, business metrics first

§ 04 · Selected outcomes

What this looks like in production.

Anonymised. Numbers are real.

Industrial inspection · NDT / Energy

~85%

Detection accuracy

Real-time

Processing speed

Computer vision for automated defect detection in industrial inspection. Replaced manual visual review. Key decision from Risk Architecture: manual override is always available and every override is logged as a training signal.

Compliance · AML screening · Fintech

3×

Screening throughput

−60%

False positive rate

LLM-powered adverse media and PEP screening. The Cost of Wrong analysis (Phase 00) determined that false negatives carried regulatory risk — so the model was tuned conservatively, with explainability designed for compliance officer review.

AI readiness assessment · Free · 90 min

90 minutes.
No pitch.
A real answer.

Book assessment →

k.wikiel@gmail.com

Outcome definition session

We work through your North Star Metric, decision rights, and what "done" actually means.

Initial risk architecture

We map your top 5 riskiest assumptions. The ones that would kill the project if you discovered them in month four.

Benchmark model shortlist

If relevant, we pull current benchmark data from Codesota and give you a ranked shortlist — before you have spent anything.

Written report

A 2-page document you keep regardless of what happens next. Useful whether you work with us or not.
