Can you trust AI with a mission-critical freight decision?
29 Freight Loads · 290 Runs · 4 AI Systems
We tested Euclid against the leading frontier models on real freight decisions: pricing, references, compliance, the rules behind every load. The best of them was right 92% of the time.
The stakes
Eight points doesn't sound like much. On a fleet moving a million loads a month, it's the difference between right and wrong tens of thousands of times over. Every pricing error, missed reference, and bad rule is a claim, a fine, or a lost load.
80,000
Wrong decisions / month
8% × 1,000,000 loads
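The headline number is simple arithmetic. A minimal sketch, assuming the 92% best-case accuracy and the illustrative volume of one million loads a month quoted above (the function name is hypothetical):

```python
def wrong_decisions_per_month(accuracy_pct: int, loads_per_month: int) -> int:
    """Expected number of incorrect decisions per month at a given accuracy."""
    error_pct = 100 - accuracy_pct  # 92% accurate leaves 8% wrong
    return loads_per_month * error_pct // 100


print(wrong_decisions_per_month(92, 1_000_000))  # 80000
```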
The result
100% isn't a goal. It's the architecture.
n = 29 cases × 10 runs
100% accuracy · 1.00 consistency · on this benchmark
Euclid. The certainty layer for freight operations.
How we tested
- Cases: 29
- Runs: 10 / case
- Mode: post-extract
- Date: Apr 2026
29 real freight cases. 10 runs each. Each system was scored not on reading a document, but on correctly applying the business logic behind it: order identity, pricing, mode and revenue, commodity and temperature rules, stop sequencing, EDI references, and ordering constraints. Euclid was compared against gemini-2.5-flash, gpt-4.1-mini, and claude-sonnet-4-6. Benchmarked April 2026.
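One way to picture per-category scoring is sketched below. The exact-match rule, the data layout, and the function name are assumptions made for illustration; the source does not describe the actual scoring harness, and the toy values are invented, not benchmark results.

```python
from collections import defaultdict


def accuracy_by_category(expected, runs):
    """
    expected: {case_id: {category: value}}
    runs:     list of (case_id, {category: value}), one entry per run
    Returns {category: fraction of runs whose value exactly matched}.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for case_id, output in runs:
        for category, want in expected[case_id].items():
            totals[category] += 1
            if output.get(category) == want:
                hits[category] += 1
    return {c: hits[c] / totals[c] for c in totals}


# Toy data, invented for illustration only.
expected = {"case-1": {"Pricing": "1250.00", "Temperature": "34F"}}
runs = [
    ("case-1", {"Pricing": "1250.00", "Temperature": "34F"}),
    ("case-1", {"Pricing": "1250.00", "Temperature": "38F"}),
]
print(accuracy_by_category(expected, runs))
# {'Pricing': 1.0, 'Temperature': 0.5}
```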
Where general-purpose AI fails
The gap widens exactly where the stakes are highest. On EDI references, frontier models were wrong up to 28% of the time. The weakest was wrong on Comments up to 22% of the time, and on temperature constraints up to 27%. These aren't edge cases. They're the fields that turn into claims, fines, and rejected loads.
[Chart: results by field category: Order Identity, Mode & Revenue, Pricing, Commodity, Temperature, Stops, References, Comments]
The variance problem
Accuracy isn't enough if the answer changes between runs. Frontier models are non-deterministic. Even the strongest occasionally returns a different answer to an identical load. Euclid returns 1.00 consistency: the same correct output, every run.
[Chart: run-to-run consistency by field category: Order Identity, Mode & Revenue, Pricing, Commodity, Temperature, Stops, References, Comments]
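The source does not define how the consistency score is computed. As a hedged sketch, one plausible definition is the share of runs that agree with the most common answer for a case, averaged over cases. The function name and the toy data are hypothetical.

```python
from collections import Counter


def consistency(outputs_per_case):
    """
    outputs_per_case: {case_id: [answer from run 1, answer from run 2, ...]}
    Assumed definition: per case, the share of runs that agree with the most
    common answer; the score is the mean of that share across cases.
    """
    shares = []
    for answers in outputs_per_case.values():
        most_common_count = Counter(answers).most_common(1)[0][1]
        shares.append(most_common_count / len(answers))
    return sum(shares) / len(shares)


# Toy data, invented for illustration only: 10 runs per case.
stable = {"case-1": ["A"] * 10, "case-2": ["B"] * 10}
drifting = {"case-1": ["A"] * 9 + ["C"], "case-2": ["B"] * 10}
print(f"{consistency(stable):.2f}")    # 1.00
print(f"{consistency(drifting):.2f}")  # 0.95
```

Under this definition a score of 1.00 means every run of every case returned the same answer. It says nothing about whether that answer is correct, which is why consistency is reported alongside accuracy.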
Speed
Euclid resolves a case in ~1–2 milliseconds. The frontier models take 3 to 17 seconds, thousands of times slower (roughly 1,500× to 17,000×), on work that runs at the scale of your entire operation.
~1–2 ms
Euclid · per case
3–17 s
Frontier models · per case
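The slowdown factor follows directly from the two latency ranges quoted above. A small sketch of the arithmetic (the function name is hypothetical):

```python
def slowdown_range(fast_ms, slow_s):
    """Return (min, max) slowdown factor from two (low, high) latency ranges."""
    fast_lo, fast_hi = fast_ms
    slow_lo, slow_hi = (s * 1000 for s in slow_s)  # seconds -> milliseconds
    # Smallest factor: fastest slow system vs. slowest fast case, and vice versa.
    return slow_lo / fast_hi, slow_hi / fast_lo


print(slowdown_range((1, 2), (3, 17)))  # (1500.0, 17000.0)
```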