Roo Code tests each frontier model against a suite of hundreds of exercises of varying difficulty across five programming languages. These results can help you find the right price-to-intelligence ratio for your use case.
Want to see the results for a model we haven't tested yet? Ping us in Discord.
Model | Context Window | Price In / Out | Duration | Tokens In / Out | Cost (USD) | Score 1 | Score 2 | Score 3 | Score 4 | Score 5 | Total Score |
---|---|---|---|---|---|---|---|---|---|---|---|
Gemini 2.5 Pro Preview 05-06 | 0 | $0.00 / $0.00 | 6h 39m 30s | 22M / 2M | $35.59 | 94% | 98% | 98% | 97% | 93% | 96% |
Gemini 2.5 Pro Preview 06-05 | 0 | $0.00 / $0.00 | 4h 22m 47s | 28M / 1M | $34.84 | 92% | 93% | 98% | 100% | 93% | 95% |
Claude Sonnet 4 | 0 | $0.00 / $0.00 | 4h 33m | 35M / 630K | $37.18 | 94% | 91% | 96% | 97% | 97% | 95% |
Claude 3.7 Sonnet | 0 | $0.00 / $0.00 | 4h 52m 36s | 19M / 603K | $27.16 | 92% | 93% | 98% | 97% | 87% | 94% |
GPT 4.1 | 0 | $0.00 / $0.00 | 4h 39m 51s | 37M / 624K | $38.64 | 92% | 91% | 90% | 94% | 90% | 91% |
Claude 3.5 Sonnet | 0 | $0.00 / $0.00 | 3h 37m 58s | 19M / 323K | $24.98 | 94% | 91% | 92% | 88% | 80% | 90% |
Grok 3 | 0 | $0.00 / $0.00 | 5h 14m 20s | 40M / 890K | $74.40 | 97% | 89% | 90% | 91% | 77% | 89% |
Gemini 2.5 Flash Preview 05-20 (Thinking) | 0 | $0.00 / $0.00 | 5h 29m 16s | 47M / 2M | $11.33 | 83% | 87% | 94% | 85% | 73% | 86% |
GPT 4.1 Mini | 0 | $0.00 / $0.00 | 5h 17m 57s | 47M / 715K | $8.81 | 81% | 84% | 94% | 76% | 70% | 83% |
o4 Mini (High) | 0 | $0.00 / $0.00 | 14h 44m 26s | 13M / 3M | $25.70 | 75% | 82% | 86% | 79% | 67% | 79% |
DeepSeek V3 | 0 | $0.00 / $0.00 | 7h 12m 41s | 30M / 524K | $12.82 | 83% | 76% | 82% | 76% | 67% | 77% |
o3 Mini (High) | 0 | $0.00 / $0.00 | 13h 1m 13s | 12M / 2M | $20.36 | 67% | 78% | 72% | 88% | 73% | 75% |
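
One way to read the table in terms of a price-to-intelligence ratio is to divide each model's total score by the cost of its evaluation run. The Python sketch below does that with the run cost and total score values copied from the table. The metric itself (`points_per_dollar`), the helper names, and the `min_total` cutoff are assumptions made for illustration; they are not a definition published by Roo Code.

```python
# (model, run cost in USD, total score in %) copied from the table above.
RESULTS = [
    ("Gemini 2.5 Pro Preview 05-06", 35.59, 96),
    ("Gemini 2.5 Pro Preview 06-05", 34.84, 95),
    ("Claude Sonnet 4", 37.18, 95),
    ("Claude 3.7 Sonnet", 27.16, 94),
    ("GPT 4.1", 38.64, 91),
    ("Claude 3.5 Sonnet", 24.98, 90),
    ("Grok 3", 74.40, 89),
    ("Gemini 2.5 Flash Preview 05-20 (Thinking)", 11.33, 86),
    ("GPT 4.1 Mini", 8.81, 83),
    ("o4 Mini (High)", 25.70, 79),
    ("DeepSeek V3", 12.82, 77),
    ("o3 Mini (High)", 20.36, 75),
]


def points_per_dollar(cost_usd: float, total_pct: float) -> float:
    """Hypothetical value metric: total score points per dollar of run cost."""
    return total_pct / cost_usd


def rank_by_value(results, min_total=0):
    """Drop models scoring below min_total, then sort best value first."""
    eligible = [row for row in results if row[2] >= min_total]
    return sorted(
        eligible,
        key=lambda row: points_per_dollar(row[1], row[2]),
        reverse=True,
    )


# Example: only consider models with a total score of at least 90%.
for name, cost, total in rank_by_value(RESULTS, min_total=90):
    value = points_per_dollar(cost, total)
    print(f"{name}: {total}% for ${cost:.2f} ({value:.2f} points per dollar)")
```

Raising `min_total` keeps very cheap but weaker models from dominating the ranking, which matters when a task needs a minimum level of accuracy. Run duration could be folded in the same way if latency is part of the trade-off.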