Roo Code tests each frontier model against a suite of hundreds of exercises of varying difficulty across five programming languages. These results can help you find the right price-to-intelligence ratio for your use case.
Want to see the results for a model we haven't tested yet? Ping us in Discord.
| Model | Context Window | Price (In / Out) | Duration | Tokens (In / Out) | Cost (USD) | Scores (per language) | Total |
|---|---|---|---|---|---|---|---|
| Claude 3.7 Sonnet | 200K | $3.00 / $15.00 | 5h 20m 28s | 35M / 853K | $40.39 | 97% / 96% / 100% / 97% / 93% | 97% |
| Gemini 2.5 Pro Preview | 1M | $1.25 / $10.00 | 5h 9m 21s | 26M / 1M | $45.49 | 89% / 96% / 92% / 94% / 90% | 92% |
| GPT 4.1 | 1M | $2.00 / $8.00 | 4h 18m 24s | 37M / 583K | $41.52 | 92% / 89% / 92% / 91% / 90% | 91% |
| Claude 3.5 Sonnet | 200K | $3.00 / $15.00 | 4h 53m 17s | 33M / 615K | $34.07 | 86% / 98% / 90% / 85% / 87% | 90% |
| Grok 3 (Beta) | 131K | $3.00 / $15.00 | 6h 24m 1s | 31M / 736K | $122.95 | 81% / 87% / 96% / 76% / 77% | 85% |
| Gemini 2.5 Flash (Thinking) | 1M | $0.15 / $3.50 | 5h 15m 36s | 62M / 2M | $15.59 | 86% / 80% / 88% / 82% / 80% | 84% |
| GPT 4.1 Mini | 1M | $0.40 / $1.60 | 4h 54m 41s | 54M / 774K | $9.42 | 81% / 82% / 88% / 82% / 70% | 81% |
| o3 Mini (High) | 200K | $1.10 / $4.40 | 8h 55m | 13M / 3M | $24.55 | 89% / 87% / 80% / 79% / 70% | 81% |
| Gemini 2.5 Flash | 1M | $0.15 / $0.60 | 5h 34m 32s | 84M / 2M | $13.69 | 75% / 84% / 84% / 88% / 70% | 81% |
| o4 Mini (High) | 200K | $1.10 / $4.40 | 12h 35m 49s | 11M / 2M | $12.06 | 83% / 78% / 69% / 85% / 60% | 75% |
| DeepSeek V3 | 64K | $0.27 / $1.10 | 9h 40m 45s | 21M / 421K | $12.20 | 75% / 73% / 69% / 85% / 63% | 73% |
| o3 | 200K | $10.00 / $40.00 | 5h 53m 50s | 9M / 1M | $188.40 | 67% / 62% / 71% / 62% / 60% | 65% |
| Gemini 2.0 Flash | 1M | $0.10 / $0.40 | 7h 35m 44s | 282M / 2M | $33.62 | 58% / 60% / 67% / 53% / 57% | 60% |
Cost Versus Score
(Note: Very expensive models are excluded from the scatter plot.)
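
One rough way to read the cost-versus-score trade-off is to divide each model's total run cost by its total score. The sketch below does that for a handful of rows copied from the results above. The "cost per point" metric, the `min_score` cutoff, and the function names are illustrative assumptions, not a measure Roo Code itself reports.

```python
# (model, total cost in USD, total score in %) copied from the results above.
RESULTS = [
    ("Claude 3.7 Sonnet", 40.39, 97),
    ("Gemini 2.5 Pro Preview", 45.49, 92),
    ("GPT 4.1", 41.52, 91),
    ("Gemini 2.5 Flash (Thinking)", 15.59, 84),
    ("GPT 4.1 Mini", 9.42, 81),
    ("o3", 188.40, 65),
]


def cost_per_point(cost_usd: float, score_pct: float) -> float:
    """Dollars spent on the whole run per percentage point of total score.

    Hypothetical metric for illustration only; lower is better.
    """
    return cost_usd / score_pct


def rank_by_value(results, min_score=0):
    """Keep models scoring at least min_score, cheapest per point first."""
    eligible = [row for row in results if row[2] >= min_score]
    return sorted(eligible, key=lambda row: cost_per_point(row[1], row[2]))


if __name__ == "__main__":
    # Only consider models with a total score of 80% or higher.
    for name, cost, score in rank_by_value(RESULTS, min_score=80):
        ratio = cost_per_point(cost, score)
        print(f"{name:30s} {score:3d}%  ${cost:7.2f}  ${ratio:.3f} per point")
```

Raising `min_score` mirrors a typical selection process: first fix the minimum quality you can accept, then pick the cheapest model that clears it. A single ratio hides per-language differences, so it is best treated as a first filter rather than a final answer.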