Roo Code Evals

Roo Code tests each frontier model against a suite of hundreds of exercises of varying difficulty across 5 programming languages. These results can help you find the right price-to-intelligence ratio for your use case.

Want to see the results for a model we haven't tested yet? Ping us in Discord.

Cost Versus Score
(Note: Very expensive models are excluded from the scatter plot.)
Model		Metrics				Scores
Name	Context Window	Price In / Out	Duration	Tokens In / Out	Cost (USD)						Total
Gemini 2.5 Pro Preview 05-06	0	$0.00 / $0.00	6h 39m 30s	22M / 2M	$35.59	94%	98%	98%	97%	93%	96%
Gemini 2.5 Pro Preview 06-05	0	$0.00 / $0.00	4h 22m 47s	28M / 1M	$34.84	92%	93%	98%	100%	93%	95%
Claude Sonnet 4	0	$0.00 / $0.00	4h 33m	35M / 630K	$37.18	94%	91%	96%	97%	97%	95%
Claude 3.7 Sonnet	0	$0.00 / $0.00	4h 52m 36s	19M / 603K	$27.16	92%	93%	98%	97%	87%	94%
GPT 4.1	0	$0.00 / $0.00	4h 39m 51s	37M / 624K	$38.64	92%	91%	90%	94%	90%	91%
Claude 3.5 Sonnet	0	$0.00 / $0.00	3h 37m 58s	19M / 323K	$24.98	94%	91%	92%	88%	80%	90%
Grok 3	0	$0.00 / $0.00	5h 14m 20s	40M / 890K	$74.40	97%	89%	90%	91%	77%	89%
Gemini 2.5 Flash Preview 05-20 (Thinking)	0	$0.00 / $0.00	5h 29m 16s	47M / 2M	$11.33	83%	87%	94%	85%	73%	86%
GPT 4.1 Mini	0	$0.00 / $0.00	5h 17m 57s	47M / 715K	$8.81	81%	84%	94%	76%	70%	83%
o4 Mini (High)	0	$0.00 / $0.00	14h 44m 26s	13M / 3M	$25.70	75%	82%	86%	79%	67%	79%
DeepSeek V3	0	$0.00 / $0.00	7h 12m 41s	30M / 524K	$12.82	83%	76%	82%	76%	67%	77%
o3 Mini (High)	0	$0.00 / $0.00	13h 1m 13s	12M / 2M	$20.36	67%	78%	72%	88%	73%	75%
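
One rough way to read a price-to-intelligence ratio out of the table is to divide each model's Total score by the cost of its run. The sketch below assumes that definition, which Roo Code does not state, so treat it as illustrative only. The names `RESULTS`, `score_per_dollar`, and `rank_by_value` are hypothetical; the figures are copied from the Cost USD and Total columns above.

```python
# (model name, run cost in USD, total score in percent), copied from the table.
RESULTS = [
    ("Gemini 2.5 Pro Preview 05-06", 35.59, 96),
    ("Gemini 2.5 Pro Preview 06-05", 34.84, 95),
    ("Claude Sonnet 4", 37.18, 95),
    ("Claude 3.7 Sonnet", 27.16, 94),
    ("GPT 4.1", 38.64, 91),
    ("Claude 3.5 Sonnet", 24.98, 90),
    ("Grok 3", 74.40, 89),
    ("Gemini 2.5 Flash Preview 05-20 (Thinking)", 11.33, 86),
    ("GPT 4.1 Mini", 8.81, 83),
    ("o4 Mini (High)", 25.70, 79),
    ("DeepSeek V3", 12.82, 77),
    ("o3 Mini (High)", 20.36, 75),
]


def score_per_dollar(cost_usd, total_pct):
    """Assumed value metric: percentage points of Total score per dollar spent."""
    return total_pct / cost_usd


def rank_by_value(results, min_total=0):
    """Return rows scoring at least min_total, best score-per-dollar first."""
    eligible = [row for row in results if row[2] >= min_total]
    return sorted(
        eligible,
        key=lambda row: score_per_dollar(row[1], row[2]),
        reverse=True,
    )


if __name__ == "__main__":
    for name, cost, total in rank_by_value(RESULTS):
        ratio = score_per_dollar(cost, total)
        print(f"{name:<45} {total:>3}%  ${cost:>6.2f}  {ratio:5.2f} pts/$")
```

Passing a floor such as `rank_by_value(RESULTS, min_total=90)` restricts the ranking to models that clear a minimum score, which is often more useful than the raw ratio because the cheapest models also score lowest.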