Benchmark Study

Can you trust AI with a mission-critical freight decision?

29 Freight Loads · 290 Runs · 4 AI Systems

We tested Euclid against the leading frontier models on real freight decisions: pricing, references, compliance, the rules behind every load. The best of them was right 92% of the time.

The stakes

Eight points doesn't sound like much. On a fleet moving a million loads a month, it's the difference between right and wrong tens of thousands of times over. Every pricing error, missed reference, and bad rule is a claim, a fine, or a lost load.

80,000

Wrong decisions / month

8% × 1,000,000 loads
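
The arithmetic behind that figure can be sketched in a few lines of Python. The function name and the assumption of exactly one decision per load are illustrative simplifications, not part of the benchmark itself.

```python
def wrong_decisions_per_month(accuracy: float, loads_per_month: int) -> int:
    """Expected wrong decisions per month.

    Assumes one decision per load (an illustrative simplification).
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    # Error rate times volume, rounded to a whole number of decisions.
    return round((1.0 - accuracy) * loads_per_month)


# 92% accuracy on a million loads a month.
print(wrong_decisions_per_month(0.92, 1_000_000))  # 80000
```

At 100% accuracy the same function returns 0, which is the gap the headline number describes.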

The result

100% isn't a goal. It's the architecture.

Euclid: 100%
Claude: 92.1%
OpenAI: 88.1%
Gemini: 85.7%

n = 29 cases × 10 runs

100% accuracy · 1.00 consistency · on this benchmark

Euclid. The certainty layer for freight operations.

How we tested

Cases: 29
Runs: 10 / case
Mode: post-extract
Date: Apr 2026

Models: gemini-2.5-flash, gpt-4.1-mini, claude-sonnet-4-6

29 real freight cases. 10 runs each. Each system was scored not on reading a document, but on correctly applying the business logic behind it: order identity, pricing, mode and revenue, commodity and temperature rules, stop sequencing, EDI references, and ordering constraints. Compared against gemini-2.5-flash, gpt-4.1-mini, and claude-sonnet-4-6. Benchmarked April 2026.

Where general-purpose AI fails

The gap widens exactly where the stakes are highest. On EDI references, the weakest frontier model missed 27.6% of the time. On temperature constraints the worst miss rate was also 27.6%, and on Comments it was 22.5%. These aren't edge cases. They're the fields that turn into claims, fines, and rejected loads.

Field            Euclid   Gemini          OpenAI          Claude
Order Identity   100%     89.7% (−10.3)   90.2% (−9.8)    90.8% (−9.2)
Mode & Revenue   100%     92.2% (−7.8)    97.1% (−2.9)    98.5% (−1.5)
Pricing          100%     96.6% (−3.4)    99.9% (−0.1)    99.4% (−0.6)
Commodity        100%     82.8% (−17.2)   81.4% (−18.6)   82.8% (−17.2)
Temperature      100%     72.4% (−27.6)   88.3% (−11.7)   96.6% (−3.4)
Stops            100%     91.4% (−8.6)    97.7% (−2.3)    96.6% (−3.4)
References       100%     79.7% (−20.3)   72.4% (−27.6)   83.2% (−16.8)
Comments         100%     80.7% (−19.3)   77.5% (−22.5)   88.5% (−11.5)

The variance problem

Accuracy isn't enough if the answer changes between runs. Frontier models are non-deterministic. Even the strongest occasionally returns a different answer to an identical load. Euclid scores 1.00 on consistency: the same correct output, every run.
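
The study does not spell out how the consistency score is computed. One plausible reading, sketched here purely as an assumption, is the share of runs that agree with the most common answer for a case, averaged over all cases. The function names and sample data are hypothetical.

```python
from collections import Counter
from statistics import mean


def case_consistency(outputs: list[str]) -> float:
    """Share of runs that returned the most common answer for one case."""
    if not outputs:
        raise ValueError("need at least one run")
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / len(outputs)


def benchmark_consistency(runs_by_case: dict[str, list[str]]) -> float:
    """Mean per-case consistency across all cases (assumed definition)."""
    return mean(case_consistency(outputs) for outputs in runs_by_case.values())


# Hypothetical data: two cases, 10 runs each.
runs = {
    "case-1": ["REF-123"] * 10,                # identical every run
    "case-2": ["REF-456"] * 9 + ["REF-465"],   # one run disagrees
}
print(round(benchmark_consistency(runs), 2))  # 0.95
```

Under this definition a score of 1.00 means every run of every case produced the same output. The score says nothing about whether that output is correct, which is why it is reported alongside accuracy.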

Field            Euclid   Gemini   OpenAI   Claude
Order Identity   1.00     0.97     0.99     1.00
Mode & Revenue   1.00     0.97     0.99     1.00
Pricing          1.00     0.97     1.00     0.99
Commodity        1.00     0.97     0.97     1.00
Temperature      1.00     0.97     0.92     1.00
Stops            1.00     0.97     0.99     1.00
References       1.00     0.97     0.88     0.98
Comments         1.00     0.97     0.87     0.99

Speed

Euclid resolves a case in ~1–2 milliseconds. The frontier models take 3 to 17 seconds, roughly 1,500 to 17,000 times slower, on work that runs at the scale of your entire operation.

~1–2 ms

Euclid · per case

3–17 s

Frontier models · per case
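
The slowdown factor follows directly from those two latency ranges. A minimal sketch, with a hypothetical helper name, that compares the endpoints of each range:

```python
def slowdown_range(
    fast_ms: tuple[float, float], slow_s: tuple[float, float]
) -> tuple[float, float]:
    """Smallest and largest slowdown factor between two latency ranges.

    fast_ms is (min, max) in milliseconds; slow_s is (min, max) in seconds.
    """
    fast_lo, fast_hi = fast_ms
    slow_lo, slow_hi = (s * 1000.0 for s in slow_s)  # seconds to milliseconds
    # Best case: fastest slow run vs. slowest fast run; worst case: the reverse.
    return slow_lo / fast_hi, slow_hi / fast_lo


# ~1–2 ms per case vs. 3–17 s per case.
print(slowdown_range((1.0, 2.0), (3.0, 17.0)))  # (1500.0, 17000.0)
```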

See it on your own freight.

We'll run Euclid against your real workflows.
