LLM Performance Leaderboard 2025
Welcome to the LLM Performance Leaderboard, which tracks and compares leading AI language models across eight benchmarks covering general knowledge, commonsense reasoning, reading comprehension, and medical question answering. Models are split into a large-scale and a small-scale tier below, and each benchmark is described at the end of the page.
Large-Scale LLMs Leaderboard
Model | Overall | MMLU | ARC | WG | PIQA | CSQA | RACE | MedMCQA | OBQA |
---|---|---|---|---|---|---|---|---|---|
GPT-4o-2024-05-13 | 70.15 | 79.09 | 86.31 | 72.22 | 60.34 | 70.28 | 67.87 | 57.85 | 67.21 |
GPT-4-1106-preview | 65.93 | 74.77 | 82.68 | 66.22 | 61.64 | 62.96 | 67.05 | 51.81 | 60.29 |
Claude-3 Opus | 62.53 | 70.23 | 75.47 | 63.54 | 59.05 | 63.66 | 66.22 | 49.14 | 52.95 |
Mistral Large | 60.84 | 68.76 | 72.32 | 56.83 | 61.21 | 55.35 | 70.17 | 43.44 | 58.66 |
GPT-3.5 | 60.32 | 65.38 | 78.42 | 64.56 | 54.89 | 67.89 | 60.11 | 41.42 | 49.90 |
Gemini 1.0 Pro | 54.06 | 56.04 | 72.35 | 56.35 | 47.70 | 50.56 | 61.02 | 35.89 | 52.55 |
Llama3-70b-Instruct | 52.92 | 59.67 | 67.09 | 57.14 | 43.10 | 55.49 | 58.21 | 41.67 | 40.94 |
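One pattern worth noting: in both tables, the Overall column appears to be the unweighted mean of the eight benchmark scores. A minimal Python sketch (values copied from the GPT-4o-2024-05-13 row above) to check this:

```python
# Sketch only, not part of the leaderboard tooling: check whether Overall
# is the unweighted mean of the eight benchmark scores.
# Values copied from the GPT-4o-2024-05-13 row above.
scores = {
    "MMLU": 79.09, "ARC": 86.31, "WG": 72.22, "PIQA": 60.34,
    "CSQA": 70.28, "RACE": 67.87, "MedMCQA": 57.85, "OBQA": 67.21,
}

overall = sum(scores.values()) / len(scores)
print(f"Computed overall: {overall:.2f}")  # prints 70.15, matching the table
```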
Small-Scale LLMs Leaderboard
Model | Overall | MMLU | ARC | WG | PIQA | CSQA | RACE | MedMCQA | OBQA |
---|---|---|---|---|---|---|---|---|---|
Qwen1.5 (1.8B) | 21.68 | 9.99 | 15.84 | 40.96 | 15.52 | 31.13 | 34.91 | 4.70 | 20.37 |
Gemma (2B) | 16.66 | 17.52 | 23.93 | 16.10 | 15.09 | 27.46 | 14.32 | 4.57 | 14.26 |
SlimPajama-DC (1.3B) | 9.60 | 9.22 | 14.95 | 14.76 | 5.32 | 9.01 | 16.19 | 1.68 | 5.70 |
RedPajama (1.3B) | 9.00 | 9.21 | 13.50 | 16.97 | 0.86 | 11.41 | 14.35 | 1.86 | 3.87 |
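To work with the rows programmatically, the pipe-separated lines can be parsed into dicts. The helper below is a hypothetical sketch (not taken from the leaderboard repository) that compares the top model of each tier benchmark by benchmark:

```python
# Hypothetical helper: parse one 'Model | score | ...' row from the tables
# above into a {column: value} dict, then compare the two tiers.
HEADER = ["Model", "Overall", "MMLU", "ARC", "WG", "PIQA", "CSQA",
          "RACE", "MedMCQA", "OBQA"]

def parse_row(line: str) -> dict:
    """Split a pipe-separated leaderboard row into named, numeric fields."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    return {"Model": cells[0],
            **{k: float(v) for k, v in zip(HEADER[1:], cells[1:])}}

best_large = parse_row("GPT-4o-2024-05-13 | 70.15 | 79.09 | 86.31 | 72.22 | "
                       "60.34 | 70.28 | 67.87 | 57.85 | 67.21 |")
best_small = parse_row("Qwen1.5 (1.8B) | 21.68 | 9.99 | 15.84 | 40.96 | "
                       "15.52 | 31.13 | 34.91 | 4.70 | 20.37 |")

# Per-benchmark gap between the top large-scale and top small-scale model.
for col in HEADER[1:]:
    print(f"{col:>8}: {best_large[col] - best_small[col]:+.2f}")
```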
Understanding the Benchmarks
MMLU (Massive Multitask Language Understanding)
Tests knowledge across 57 subjects including science, humanities, engineering, and more.
ARC (AI2 Reasoning Challenge)
Evaluates grade-school level scientific reasoning and knowledge.
WinoGrande (WG)
Tests common sense reasoning through pronoun resolution tasks.
PIQA (Physical Interaction QA)
Evaluates physical commonsense knowledge.
CommonsenseQA (CSQA)
Tests common sense reasoning about everyday situations.
RACE (Reading Comprehension from Examinations)
Tests reading comprehension with questions drawn from English exams for Chinese middle and high school students.
MedMCQA
Tests medical domain knowledge and reasoning with questions drawn from medical entrance exams.
OpenBookQA (OBQA)
Tests application of basic science facts to novel situations.
Data is sourced from the VILA-Lab Open-LLM-Leaderboard, which scores models using the OSQ-bench (Open-Style Question Evaluation) methodology. For more details, see their paper: Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation.
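For intuition only (the question and wording below are made up, not taken from OSQ-bench), the shift from multiple-choice to open-style evaluation amounts to dropping the answer options and grading the model's free-form response rather than a letter choice:

```python
# Hypothetical illustration of the multi-choice -> open-style shift described
# in the OSQ-bench paper; this question is invented, not from the benchmark.
question = "Which gas do plants primarily absorb during photosynthesis?"

multi_choice_prompt = (
    f"{question}\n"
    "A) Oxygen\nB) Carbon dioxide\nC) Nitrogen\nD) Hydrogen\n"
    "Answer with the letter of the correct option."
)

open_style_prompt = f"{question}\nAnswer in your own words."

# Multiple-choice scoring is an exact letter match; open-style scoring must
# judge the free-form answer for correctness (e.g., with an evaluator model).
```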