LLM Performance Leaderboard 2025
Welcome to the LLM Performance Leaderboard, which tracks and compares leading AI language models across eight benchmarks covering general knowledge, commonsense reasoning, reading comprehension, and medical question answering. Models are split into a large-scale and a small-scale tier below, and each benchmark is described at the end of the page.
Large-Scale LLMs Leaderboard
Model | Overall | MMLU | ARC | WG | PIQA | CSQA | RACE | MedMCQA | OBQA |
---|---|---|---|---|---|---|---|---|---|
GPT-4o-2024-05-13 | 70.15 | 79.09 | 86.31 | 72.22 | 60.34 | 70.28 | 67.87 | 57.85 | 67.21 |
GPT-4-1106-preview | 65.93 | 74.77 | 82.68 | 66.22 | 61.64 | 62.96 | 67.05 | 51.81 | 60.29 |
Claude-3 Opus | 62.53 | 70.23 | 75.47 | 63.54 | 59.05 | 63.66 | 66.22 | 49.14 | 52.95 |
Mistral Large | 60.84 | 68.76 | 72.32 | 56.83 | 61.21 | 55.35 | 70.17 | 43.44 | 58.66 |
GPT-3.5 | 60.32 | 65.38 | 78.42 | 64.56 | 54.89 | 67.89 | 60.11 | 41.42 | 49.90 |
Gemini 1.0 Pro | 54.06 | 56.04 | 72.35 | 56.35 | 47.70 | 50.56 | 61.02 | 35.89 | 52.55 |
Llama3-70b-Instruct | 52.92 | 59.67 | 67.09 | 57.14 | 43.10 | 55.49 | 58.21 | 41.67 | 40.94 |
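One pattern worth noting: in both tables, the Overall column appears to be the unweighted mean of the eight benchmark scores. A minimal Python sketch (values copied from the GPT-4o-2024-05-13 row above) to check this:

```python
# Sketch only, not part of the leaderboard tooling: check whether Overall
# is the unweighted mean of the eight benchmark scores.
# Values copied from the GPT-4o-2024-05-13 row above.
scores = {
    "MMLU": 79.09, "ARC": 86.31, "WG": 72.22, "PIQA": 60.34,
    "CSQA": 70.28, "RACE": 67.87, "MedMCQA": 57.85, "OBQA": 67.21,
}

overall = sum(scores.values()) / len(scores)
print(f"Computed overall: {overall:.2f}")  # prints 70.15, matching the table
```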
Small-Scale LLMs Leaderboard
Model | Overall | MMLU | ARC | WG | PIQA | CSQA | RACE | MedMCQA | OBQA |
---|---|---|---|---|---|---|---|---|---|
Qwen1.5 (1.8B) | 21.68 | 9.99 | 15.84 | 40.96 | 15.52 | 31.13 | 34.91 | 4.70 | 20.37 |
Gemma (2B) | 16.66 | 17.52 | 23.93 | 16.10 | 15.09 | 27.46 | 14.32 | 4.57 | 14.26 |
SlimPajama-DC (1.3B) | 9.60 | 9.22 | 14.95 | 14.76 | 5.32 | 9.01 | 16.19 | 1.68 | 5.70 |
RedPajama (1.3B) | 9.00 | 9.21 | 13.50 | 16.97 | 0.86 | 11.41 | 14.35 | 1.86 | 3.87 |
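To work with the rows programmatically, the pipe-separated lines can be parsed into dicts. The helper below is a hypothetical sketch (not taken from the leaderboard repository) that compares the top model of each tier benchmark by benchmark:

```python
# Hypothetical helper: parse one 'Model | score | ...' row from the tables
# above into a {column: value} dict, then compare the two tiers.
HEADER = ["Model", "Overall", "MMLU", "ARC", "WG", "PIQA", "CSQA",
          "RACE", "MedMCQA", "OBQA"]

def parse_row(line: str) -> dict:
    """Split a pipe-separated leaderboard row into named, numeric fields."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    return {"Model": cells[0],
            **{k: float(v) for k, v in zip(HEADER[1:], cells[1:])}}

best_large = parse_row("GPT-4o-2024-05-13 | 70.15 | 79.09 | 86.31 | 72.22 | "
                       "60.34 | 70.28 | 67.87 | 57.85 | 67.21 |")
best_small = parse_row("Qwen1.5 (1.8B) | 21.68 | 9.99 | 15.84 | 40.96 | "
                       "15.52 | 31.13 | 34.91 | 4.70 | 20.37 |")

# Per-benchmark gap between the top large-scale and top small-scale model.
for col in HEADER[1:]:
    print(f"{col:>8}: {best_large[col] - best_small[col]:+.2f}")
```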
Understanding the Benchmarks
MMLU (Massive Multitask Language Understanding)
Tests knowledge across 57 subjects including science, humanities, engineering, and more.
ARC (AI2 Reasoning Challenge)
Evaluates grade-school level scientific reasoning and knowledge.
WinoGrande (WG)
Tests common sense reasoning through pronoun resolution tasks.
PIQA (Physical Interaction QA)
Evaluates physical commonsense knowledge.
CommonsenseQA (CSQA)
Tests common sense reasoning about everyday situations.
RACE (Reading Comprehension from Examinations)
Tests reading comprehension with questions drawn from English exams for Chinese middle and high school students.
MedMCQA
Tests medical domain knowledge and reasoning with questions drawn from medical entrance exams.
OpenBookQA (OBQA)
Tests application of basic science facts to novel situations.
Data is sourced from the VILA-Lab Open-LLM-Leaderboard, which scores models using the OSQ-bench (Open-Style Question Evaluation) methodology. For more details, see their paper: Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation.
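For intuition only (the question and wording below are made up, not taken from OSQ-bench), the shift from multiple-choice to open-style evaluation amounts to dropping the answer options and grading the model's free-form response rather than a letter choice:

```python
# Hypothetical illustration of the multi-choice -> open-style shift described
# in the OSQ-bench paper; this question is invented, not from the benchmark.
question = "Which gas do plants primarily absorb during photosynthesis?"

multi_choice_prompt = (
    f"{question}\n"
    "A) Oxygen\nB) Carbon dioxide\nC) Nitrogen\nD) Hydrogen\n"
    "Answer with the letter of the correct option."
)

open_style_prompt = f"{question}\nAnswer in your own words."

# Multiple-choice scoring is an exact letter match; open-style scoring must
# judge the free-form answer for correctness (e.g., with an evaluator model).
```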