Scores are inherited from Llama 3.2 11B Vision: this is a hosted variant of the same open-weights model, so the underlying benchmark scores are identical.
MMLU: 73%
Multitask academic knowledge across 57 subjects.
GPQA Diamond: 33%
Graduate-level science questions designed to be "Google-proof".
MATH: 51%
High-school competition math problems.
HumanEval: 64%
Python function synthesis from docstrings.
Scores are hand-curated from each provider's published reports and public leaderboards. Methodology varies across sources, so treat them as directional rather than authoritative.