What this is, in plain words.
The short version
We built three AI systems that predict football matches. During the 2026 World Cup, they forecast every single game, in public, before kickoff. They compete against the betting market, a statistical rating, and one human. Every prediction, every miss, and every dollar spent is published.
Why football, when we are an operations company
Football is the test track, not the point. A World Cup gives us 104 decisions in five weeks, each with a clear right answer and a hard deadline. That is exactly what a business decision looks like. The question we actually care about: when an AI tells you it is 70 percent sure, can you trust that number? Because companies are starting to run real decisions on numbers like that one.
The three AI teams
Solo is one model doing everything alone in a single pass. Pipeline is a chain of specialists where each step builds on the last, and it keeps a memory of its own past mistakes. Council is three different AI models that each give an opinion, plus a fourth that reconciles them. Same data, same matches, same rules. The architecture is the only difference. That is the experiment.
The human in the race
Mo, the founder, predicts every match too. On instinct alone. No statistics, no odds, no data, just a pick and a confidence number before kickoff. Over one match, instinct can beat anything. Over 104 matches, does instinct survive against data discipline? That answer gets published whichever way it lands.
What we expect to find
Honestly: we expect nobody to beat the betting market on raw accuracy. That is not the prize. We measure whether each forecaster knows what it knows. Saying 70 percent and being right 70 percent of the time is called calibration, and for anyone putting AI into real operations, calibration is the number that decides whether you can delegate a decision or not.
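
To make calibration concrete, here is a minimal sketch of how such a check could be run. It groups picks by the confidence that was stated and compares that with how often those picks were right. The function name, the 10-point bins, and the sample data are illustrative assumptions; this page does not say which calibration measure the benchmark actually scores with.

```python
from collections import defaultdict

def calibration_table(forecasts, bin_pct=10):
    """Compare stated confidence with the actual hit rate, bin by bin.

    forecasts: iterable of (confidence, was_correct) pairs,
               with confidence between 0 and 1.
    Returns {bin_start_pct: (count, hit_rate)}.
    """
    last_bin = 100 // bin_pct - 1
    bins = defaultdict(list)
    for confidence, was_correct in forecasts:
        pct = round(confidence * 100)           # whole percent avoids float edge cases
        index = min(pct // bin_pct, last_bin)   # 100 percent joins the top bin
        bins[index * bin_pct].append(bool(was_correct))
    return {
        start: (len(hits), sum(hits) / len(hits))
        for start, hits in sorted(bins.items())
    }

# Made-up forecasts for illustration only, not benchmark results.
example = (
    [(0.70, True)] * 7 + [(0.70, False)] * 3
    + [(0.90, True)] * 3 + [(0.90, False)] * 2
)
for start, (count, hit_rate) in calibration_table(example).items():
    print(f"said {start}-{start + 10}%: {count} picks, right {hit_rate:.0%} of the time")
```

In this made-up sample, the 70 percent picks land 7 times out of 10, which is what calibration looks like. The 90 percent picks land only 3 times out of 5, which is overconfidence.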
Why you can trust the numbers
The rules were locked and published ahead of scoring, so we cannot move the goalposts after the fact. The models are frozen for the whole tournament. Misses are published with the same prominence as hits. Costs are shown per prediction, to the cent. And the full dataset ships with the final paper on August 1.
Who is behind it
inocta.io, an operations boutique from Toronto and Montréal that puts AI into real businesses. This benchmark is our method shown in public: measure before you trust, understand before you automate. You can't automate what you don't understand.
The eight, in one table
| Character | Real name | In plain words | What we learn from it |
|---|---|---|---|
| The Soloist | Solo | One AI reads the full match dossier and makes the call alone, in a single pass. | Whether one strong model with good data is all you need. |
| The Purist | Solo-Zero | The same AI with the dossier taken away. It predicts from memory alone. | The gap between Solo and Solo-Zero shows what the match data is actually worth. |
| The Assembly Line | Pipeline | A chain of specialists: one reads the numbers, one reads everything else, one writes the final call. It keeps notes on its own past misses. | Whether splitting the work into steps, plus learning from mistakes, beats one model working alone. |
| The Creature of Habit | Pipeline-Static | The exact same chain with the memory of past misses switched off. | The gap between Pipeline and Pipeline-Static shows whether the learning is real or just a story. |
| The Council | Council | Three different AIs each give an opinion, sometimes disagreeing sharply, and a fourth merges them into one call. | Whether a debate between different AIs beats any single one of them, and whether it is worth the extra cost. |
| The Statistician | ELO | A pure math rating built from four years of results, like a chess ranking, plus a home-field bump. No AI. | If the AI teams cannot beat simple math, the AI is not adding anything. |
| The Market | Market | The betting odds at kickoff, turned into percentages. The AIs never see them. | The toughest score in sports. How close anyone gets to the market is the real measure. |
| The Human | Mo | The founder picks every match on pure gut feel, no data, before kickoff. | Whether human instinct survives 104 matches against machines and math. |
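
For readers curious how betting odds become percentages, as in The Market row above, here is a minimal sketch. It assumes decimal odds and removes the bookmaker's margin by simple proportional scaling. The page does not say which method the benchmark uses, so treat the function name, the method, and the sample odds as illustrative assumptions.

```python
def implied_probabilities(decimal_odds):
    """Turn decimal odds into probabilities that sum to 1.

    decimal_odds: mapping of outcome name -> decimal odds (each above 1).
    """
    # The inverse of a decimal price is the raw implied probability.
    raw = {outcome: 1.0 / odds for outcome, odds in decimal_odds.items()}
    # The raw values add up to more than 1; the excess is the bookmaker's margin.
    total = sum(raw.values())
    # Assumption: scale every outcome down by the same factor to remove the margin.
    return {outcome: p / total for outcome, p in raw.items()}

# Hypothetical odds, not taken from any real match.
odds = {"home": 2.00, "draw": 3.50, "away": 4.00}
for outcome, p in implied_probabilities(odds).items():
    print(f"{outcome}: {p:.1%}")
```

Proportional scaling is only one way to strip the margin. It is used here because it is the easiest to follow, not because it is known to be the benchmark's choice.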