LLM Performance Leaderboard 2025

Welcome to the comprehensive LLM Performance Leaderboard, tracking and comparing the capabilities of leading AI language models across multiple benchmarks. This leaderboard provides detailed insights into how different models perform on various tasks, from general knowledge to specialized domains.

Large-Scale LLMs Leaderboard

| Model | Overall | MMLU | ARC | WG | PIQA | CSQA | RACE | MedMCQA | OBQA |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4o-2024-05-13 | 70.15 | 79.09 | 86.31 | 72.22 | 60.34 | 70.28 | 67.87 | 57.85 | 67.21 |
| GPT-4-1106-preview | 65.93 | 74.77 | 82.68 | 66.22 | 61.64 | 62.96 | 67.05 | 51.81 | 60.29 |
| Claude-3 Opus | 62.53 | 70.23 | 75.47 | 63.54 | 59.05 | 63.66 | 66.22 | 49.14 | 52.95 |
| Mistral Large | 60.84 | 68.76 | 72.32 | 56.83 | 61.21 | 55.35 | 70.17 | 43.44 | 58.66 |
| GPT-3.5 | 60.32 | 65.38 | 78.42 | 64.56 | 54.89 | 67.89 | 60.11 | 41.42 | 49.90 |
| Gemini 1.0 Pro | 54.06 | 56.04 | 72.35 | 56.35 | 47.70 | 50.56 | 61.02 | 35.89 | 52.55 |
| Llama3-70b-Instruct | 52.92 | 59.67 | 67.09 | 57.14 | 43.10 | 55.49 | 58.21 | 41.67 | 40.94 |

Small-Scale LLMs Leaderboard

| Model | Overall | MMLU | ARC | WG | PIQA | CSQA | RACE | MedMCQA | OBQA |
|---|---|---|---|---|---|---|---|---|---|
| Qwen1.5 (1.8B) | 21.68 | 9.99 | 15.84 | 40.96 | 15.52 | 31.13 | 34.91 | 4.70 | 20.37 |
| Gemma (2B) | 16.66 | 17.52 | 23.93 | 16.10 | 15.09 | 27.46 | 14.32 | 4.57 | 14.26 |
| SlimPajama-DC (1.3B) | 9.60 | 9.22 | 14.95 | 14.76 | 5.32 | 9.01 | 16.19 | 1.68 | 5.70 |
| RedPajama (1.3B) | 9.00 | 9.21 | 13.50 | 16.97 | 0.86 | 11.41 | 14.35 | 1.86 | 3.87 |
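
Across both tables, the Overall score appears to be the unweighted mean of the eight per-benchmark scores. A minimal Python sketch of that assumed averaging convention, using the GPT-4o-2024-05-13 row as a check:

```python
# Sketch: reproduce the "Overall" column as the unweighted mean of the
# eight per-benchmark scores (an assumed convention; the leaderboard does
# not state its weighting).

BENCHMARKS = ["MMLU", "ARC", "WG", "PIQA", "CSQA", "RACE", "MedMCQA", "OBQA"]

def overall(scores: dict) -> float:
    """Average the eight benchmark scores and round to two decimals."""
    return round(sum(scores[b] for b in BENCHMARKS) / len(BENCHMARKS), 2)

# GPT-4o-2024-05-13 row from the large-scale table.
gpt4o = {"MMLU": 79.09, "ARC": 86.31, "WG": 72.22, "PIQA": 60.34,
         "CSQA": 70.28, "RACE": 67.87, "MedMCQA": 57.85, "OBQA": 67.21}

print(overall(gpt4o))  # 70.15, matching the Overall column
```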

Understanding the Benchmarks

MMLU (Massive Multitask Language Understanding)

Tests knowledge across 57 subjects including science, humanities, engineering, and more.

ARC (AI2 Reasoning Challenge)

Evaluates grade-school level scientific reasoning and knowledge.

WinoGrande (WG)

Tests common sense reasoning through pronoun resolution tasks.

PIQA (Physical Interaction QA)

Evaluates physical commonsense knowledge.

CommonsenseQA (CSQA)

Tests common sense reasoning about everyday situations.

RACE (Reading Comprehension from Examinations)

Tests reading comprehension using passages and questions drawn from English exams.

MedMCQA

Tests medical domain knowledge and reasoning using questions drawn from medical entrance exams.

OpenBookQA (OBQA)

Tests application of basic science facts to novel situations.
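
Most of these benchmarks are distributed as public datasets, so individual questions are easy to inspect. A minimal sketch using the Hugging Face datasets library; the dataset ID, config, and field names are assumptions based on the public OpenBookQA release and may not match the leaderboard's own pipeline:

```python
# Sketch: pull a few benchmark questions with the Hugging Face `datasets`
# library. The dataset ID, config, and field names below are assumptions
# for OpenBookQA; the leaderboard's exact data pipeline may differ.
from datasets import load_dataset

obqa = load_dataset("openbookqa", "main", split="validation")

for row in obqa.select(range(3)):
    print(row["question_stem"])
    print(row["choices"]["text"], "->", row["answerKey"])
```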

Data sourced from the VILA-Lab Open-LLM-Leaderboard. This leaderboard is based on the OSQ-bench (Open-Style Question Evaluation) methodology. For more details, see their paper: Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation.
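
In the open-style setup, the model answers each question without seeing the answer options, and its free-form response is then graded, rather than picking a letter from a fixed list. The snippet below is only an illustrative contrast between the two prompt styles; the exact prompts and grading protocol used by OSQ-bench are defined in the paper:

```python
# Illustrative only: the same question phrased as a multiple-choice prompt
# versus an open-style prompt. The actual OSQ-bench prompts and grading
# are defined in the Open-LLM-Leaderboard paper.
question = "Which gas do plants primarily absorb for photosynthesis?"
options = ["Oxygen", "Carbon dioxide", "Nitrogen", "Hydrogen"]

multiple_choice_prompt = (
    question + "\n"
    + "\n".join(f"{label}. {text}" for label, text in zip("ABCD", options))
    + "\nAnswer with the letter of the correct option."
)

open_style_prompt = question + "\nAnswer with a short phrase."

print(multiple_choice_prompt)
print()
print(open_style_prompt)
```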