MMLU: 88%
Multitask academic knowledge across 57 subjects.
GPQA Diamond: 59%
Graduate-level science questions, "Google-proof".
MATH: 90%
High-school competition math problems.
HumanEval: 89%
Python function synthesis from docstrings.
SWE-bench Verified: 42%
Real GitHub issues solved end-to-end.
LMArena Elo: 1318
Crowd-sourced head-to-head preference Elo rating.
Hand-curated from each provider's published reports and public leaderboards. Methodology varies across sources, so treat these scores as directional rather than authoritative.