Knuth-Style Bounties · 42 Tasks · $280 Total

AI Research Challenges

42 research tasks that produce real value for the ML community. Each task earns a symbolic reward — like Knuth's checks, a proof of work you frame, not cash. AI agents can help, but they can't carry you.

Every deliverable gets published on CodeSOTA with full attribution.

Reward Structure

Rewards are symbolic — like Donald Knuth's famous checks for finding errors in his books. Most recipients frame them. The real reward is published research with your name on it.

Tier       | Tasks | Reward
Easy       | 1–8   | $1 each
Medium     | 9–16  | $2 each
Hard       | 17–24 | $4 each
Extra Hard | 25–32 | $8 each
Legendary  | 33–42 | $16 each
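For anyone double-checking the math, the tiers above reconcile with the $280 total. A throwaway Python sketch (the tier layout is taken straight from the table):

```python
# Tasks per tier and the reward for each task in that tier.
tiers = {
    "Easy":       (range(1, 9),   1),   # tasks 1-8,   $1 each
    "Medium":     (range(9, 17),  2),   # tasks 9-16,  $2 each
    "Hard":       (range(17, 25), 4),   # tasks 17-24, $4 each
    "Extra Hard": (range(25, 33), 8),   # tasks 25-32, $8 each
    "Legendary":  (range(33, 43), 16),  # tasks 33-42, $16 each
}

total = sum(len(tasks) * reward for tasks, reward in tiers.values())
num_tasks = sum(len(tasks) for tasks, _ in tiers.values())
# num_tasks == 42, total == 280
```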

Rules

Submission

  1. Complete the task and prepare your deliverable
  2. Submit via the form below with your work and source links
  3. We review within 72 hours
  4. Approved work gets published with your name, and the reward is paid

Quality Bar

  • Every claim must cite a primary source
  • AI tools are encouraged for research assistance
  • But raw AI output without verification = rejection
  • We spot-check sources. If one is wrong, the submission fails
  • First valid submission per task earns the check

How It Works

01

Claim a Task

Click "I'm working on this" to claim a challenge. We'll send you tips and resources.

02

Do the Research

Use AI tools to assist, but verify everything yourself. We check sources.

03

Submit

Submit your deliverable with a repo link. Community members peer review your work.

04

Get Published

Accepted work gets your name on codesota.com and a collectible Knuth-style check to frame.

Example Submission

Here is what a completed Challenge #1 (Benchmark Archaeology) looks like — this is the quality bar.

Challenge #1 · Benchmark Archaeology · ADE20K Semantic Segmentation (mIoU) · $1
ade20k-archaeology.json · 5 papers · structured data
{
  "benchmark": "ADE20K Semantic Segmentation",
  "metric": "mIoU (mean Intersection over Union)",
  "split": "val (20,210 images, 150 classes)",
  "papers": [
    {
      "model": "InternImage-H",
      "score": 62.9,
      "metric": "mIoU",
      "split": "val",
      "evaluation": "single-scale, UperNet head, 896×896 crop",
      "arxiv": "2211.05778",
      "year": 2024,
      "code": "https://github.com/OpenGVLab/InternImage",
      "caveats": "ImageNet-22k + Object365 pretraining.
        Score is single-scale; with TTA authors report
        64.2 but this is not standard protocol."    },
    {
      "model": "SwinV2-G",
      "score": 61.4,
      ...
      "caveats": "3B params. Multi-scale testing inflates
        score ~1.5 mIoU vs single-scale. Training requires
        40+ A100s, not practically reproducible."    }
    // ... 3 more entries
  ],
  "comparability_flags": [
    "crop_size_varies: 512 vs 640 vs 896 — affects score",
    "test_time_augmentation: single vs multi-scale (+1-2 mIoU)",
    "pretraining_data: ImageNet-1k to billion-scale proprietary",
    "decoder_head: UperNet vs Mask2Former not comparable"
  ]
}
summary.md · 2 paragraphs

What's being measured: ADE20K evaluates pixel-level semantic understanding across 150 categories. The metric mIoU weights rare classes (chandelier, escalator) equally with common ones (wall, floor) — so long-tail performance matters more than most leaderboards suggest. A model scoring 60 mIoU can still catastrophically fail on 30+ rare categories.

Why results aren't comparable: The 5 papers use three crop sizes, two decoder heads, and pretraining data ranging from ImageNet-1k to billion-scale proprietary sets. Multi-scale test-time augmentation (used in 2/5 papers) inflates scores by 1–2 mIoU but isn't flagged. InternImage-H's 62.9 is single-scale with UperNet; SwinV2-G's 61.4 uses multi-scale — normalized to the same protocol, the gap widens. Any leaderboard mixing these without methodology flags is misleading.
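The equal-weighting point above can be made concrete. Here is a minimal sketch of the mIoU computation on a two-class toy image — the function name and data are illustrative, not from any of the cited papers:

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Per-class intersection-over-union, averaged with equal weight per class."""
    ious = []
    for c in range(num_classes):
        pred_c, target_c = (pred == c), (target == c)
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both prediction and ground truth
        ious.append(np.logical_and(pred_c, target_c).sum() / union)
    return float(np.mean(ious))

# Toy 10-pixel image: 9 pixels of a common class (0), 1 pixel of a rare class (1).
target = np.array([0] * 9 + [1])
pred = np.zeros(10, dtype=int)  # model predicts the common class everywhere

score = mean_iou(pred, target, num_classes=2)
# IoU(class 0) = 9/10 = 0.9, IoU(class 1) = 0/1 = 0.0, so mIoU = 0.45
```

A model that ignores the rare class entirely keeps 90% pixel accuracy here but loses half its mIoU — which is exactly why long-tail failures on 30+ rare categories show up in this metric.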

Why These Tasks Exist

ML benchmarking is broken. Papers report scores without methodology details. Leaderboards mix apples and oranges. Datasets rot. Human baselines were collected once in 2018 and never updated.

CodeSOTA tracks 231 benchmarks. 188 of them need research. That's not a backlog — it's an opportunity for anyone willing to do the work.

These challenges are designed so that AI agents get you ~30–40% of the way there. The remaining 60–70% — verifying sources, making judgment calls, running real experiments, working out what the data actually means — that's where the value lives. And that's what we pay for.

FAQ

Can I use AI tools?

Yes — encouraged! But AI output without human verification gets rejected. We spot-check every source link and data point. The challenge is to produce work that's correct, not just plausible.

Can I do multiple tasks?

Absolutely. No limit. Some legendary tasks build on easier ones, so starting with a few easy tasks is a good strategy.

What if someone else is working on the same task?

First valid submission wins. If two submissions arrive close together, we may accept both if they cover different benchmarks/domains within the task description.

What format should deliverables be in?

JSON for data, Markdown for reports, HTML for interactive tools. Include all source links. When in doubt, email us and ask before starting.

Are the rewards real money?

They're symbolic, like Knuth's checks — a badge of honor for rigorous research. You receive a collectible check you can frame. The real value is having your work published on CodeSOTA with full attribution, plus the skills and portfolio piece.

Do I get credited?

Yes. Every published deliverable includes your name (or handle) as contributor. You can also link your profile.

Ready to start?

Pick a task, do the research, ship the deliverable. Your work helps the entire ML community.

Contact Us