Roo Code tests each frontier model against a suite of hundreds of exercises of varying difficulty across five programming languages. These results can help you find the right price-to-intelligence ratio for your use case.
Want to see the results for a model we haven't tested yet? Ping us on Discord.

| Model | Context Window | Price In / Out (per 1M tokens) | Duration | Tokens In / Out | Cost (USD) | Lang 1 | Lang 2 | Lang 3 | Lang 4 | Lang 5 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude 3.7 Sonnet | 200K | $3.00 / $15.00 | 5h 20m 28s | 35M / 853K | $40.39 | 97% | 96% | 100% | 97% | 93% | 97% |
| Gemini 2.5 Pro Preview | 1M | $1.25 / $10.00 | 5h 9m 21s | 26M / 1M | $45.49 | 89% | 96% | 92% | 94% | 90% | 92% |
| GPT 4.1 | 1M | $2.00 / $8.00 | 4h 18m 24s | 37M / 583K | $41.52 | 92% | 89% | 92% | 91% | 90% | 91% |
| Claude 3.5 Sonnet | 200K | $3.00 / $15.00 | 4h 53m 17s | 33M / 615K | $34.07 | 86% | 98% | 90% | 85% | 87% | 90% |
| Grok 3 (Beta) | 131K | $3.00 / $15.00 | 6h 24m 1s | 31M / 736K | $122.95 | 81% | 87% | 96% | 76% | 77% | 85% |
| Gemini 2.5 Flash (Thinking) | 1M | $0.15 / $3.50 | 5h 15m 36s | 62M / 2M | $15.59 | 86% | 80% | 88% | 82% | 80% | 84% |
| GPT 4.1 Mini | 1M | $0.40 / $1.60 | 4h 54m 41s | 54M / 774K | $9.42 | 81% | 82% | 88% | 82% | 70% | 81% |
| o3 Mini (High) | 200K | $1.10 / $4.40 | 8h 55m | 13M / 3M | $24.55 | 89% | 87% | 80% | 79% | 70% | 81% |
| Gemini 2.5 Flash | 1M | $0.15 / $0.60 | 5h 34m 32s | 84M / 2M | $13.69 | 75% | 84% | 84% | 88% | 70% | 81% |
| o4 Mini (High) | 200K | $1.10 / $4.40 | 12h 35m 49s | 11M / 2M | $12.06 | 83% | 78% | 69% | 85% | 60% | 75% |
| DeepSeek V3 | 64K | $0.27 / $1.10 | 9h 40m 45s | 21M / 421K | $12.20 | 75% | 73% | 69% | 85% | 63% | 73% |
| o3 | 200K | $10.00 / $40.00 | 5h 53m 50s | 9M / 1M | $188.40 | 67% | 62% | 71% | 62% | 60% | 65% |
| Gemini 2.0 Flash | 1M | $0.10 / $0.40 | 7h 35m 44s | 282M / 2M | $33.62 | 58% | 60% | 67% | 53% | 57% | 60% |
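
One rough way to read a price-to-intelligence ratio out of the table above is to divide each model's total score by what the full benchmark run cost. The sketch below does that in Python. The metric (`points_per_dollar`) and the helper `rank_by_value` are illustrative assumptions, not a definition Roo Code publishes; the cost and total figures are copied from the table.

```python
# (model, benchmark run cost in USD, total score in %), copied from the table.
RESULTS = [
    ("Claude 3.7 Sonnet", 40.39, 97),
    ("Gemini 2.5 Pro Preview", 45.49, 92),
    ("GPT 4.1", 41.52, 91),
    ("Claude 3.5 Sonnet", 34.07, 90),
    ("Grok 3 (Beta)", 122.95, 85),
    ("Gemini 2.5 Flash (Thinking)", 15.59, 84),
    ("GPT 4.1 Mini", 9.42, 81),
    ("o3 Mini (High)", 24.55, 81),
    ("Gemini 2.5 Flash", 13.69, 81),
    ("o4 Mini (High)", 12.06, 75),
    ("DeepSeek V3", 12.20, 73),
    ("o3", 188.40, 65),
    ("Gemini 2.0 Flash", 33.62, 60),
]


def points_per_dollar(cost_usd, total_pct):
    """Hypothetical value metric: total-score points per dollar of run cost."""
    return total_pct / cost_usd


def rank_by_value(results, min_total=0):
    """Keep models scoring at least min_total, best value first."""
    eligible = [r for r in results if r[2] >= min_total]
    return sorted(
        eligible,
        key=lambda r: points_per_dollar(r[1], r[2]),
        reverse=True,
    )


if __name__ == "__main__":
    # Only consider models with a total score of 80% or better.
    for name, cost, total in rank_by_value(RESULTS, min_total=80):
        value = points_per_dollar(cost, total)
        print(f"{name:<28} {total:>3}%  ${cost:>7.2f}  {value:5.2f} pts/$")
```

The `min_total` floor matters because the cheapest run is not useful if the model fails too many exercises; raising it to 90, for example, restricts the ranking to the four highest-scoring models. Run cost also reflects how many tokens each model consumed during the benchmark, so treat the ranking as a starting point rather than a prediction of your own bill.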