Live Rankings

AI Model Leaderboard

Benchmark results for 70+ LLMs across coding, reasoning, science, agentic, long-context, math, and multimodal tasks, updated live from the BenchGen evaluation platform.

Coding

  • SWE-bench Verified
  • SWE-bench Pro
  • LiveCodeBench
  • LiveCodeBench Pro

Reasoning

  • MMLU-Pro
  • GPQA Diamond
  • Humanity's Last Exam

Science

  • SciCode
  • CharXiv Reasoning

Agentic

  • FinArena
  • τ³ Banking
  • EnterpriseClawBench
  • TerminalBench 2.1

Long Context

  • Long Context Reasoning
  • MRCRv2

Math

  • GSM8K-TR

Multimodal

  • QCalEval

Frequently Asked Questions

Which LLM performs best on coding benchmarks in 2026?

According to BenchGen's live evaluation data, top models on SWE-bench Verified and LiveCodeBench include Claude Opus 4, GPT-5, and DeepSeek V4 Pro Max. Rankings update automatically as new submissions arrive on the platform.

What is MMLU-Pro and how do LLMs score on it?

MMLU-Pro is a harder variant of MMLU with 12,032 questions across 14 academic domains. Top models such as Qwen 3 and Claude Opus 4 score above 88%, while smaller models typically score between 60% and 75%.
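
To put those percentages in concrete terms, the short sketch below converts between accuracy and the number of correct answers. It assumes the score is plain overall accuracy, with every one of the 12,032 questions weighted equally; the function names are illustrative and are not part of BenchGen.

```python
MMLU_PRO_QUESTIONS = 12_032  # question count stated above


def accuracy_percent(correct: int, total: int = MMLU_PRO_QUESTIONS) -> float:
    """Overall accuracy as a percentage (assumes equal weight per question)."""
    return 100.0 * correct / total


def min_correct_to_exceed(threshold_percent: int, total: int = MMLU_PRO_QUESTIONS) -> int:
    """Smallest number of correct answers whose accuracy is strictly above the threshold."""
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return (threshold_percent * total) // 100 + 1


if __name__ == "__main__":
    needed = min_correct_to_exceed(88)
    print(needed, round(accuracy_percent(needed), 3))
```

Under that assumption, scoring above 88% means answering at least 10,589 of the 12,032 questions correctly.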

How is BenchGen's leaderboard different from other LLM leaderboards?

BenchGen includes agentic benchmarks (FinArena, EnterpriseClawBench, TerminalBench 2.1) in which models must take multi-step actions and complete real workflows, not just answer questions. Most public leaderboards cover only academic Q&A tasks.

How often is the leaderboard updated?

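Data is fetched from the BenchGen platform API when you load the page, and the response is cached in your browser for 5 minutes. A repeat visit within that window shows the cached rankings; after it expires, the next visit fetches fresh data. New public submissions appear in the rankings automatically.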
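
The fetch-then-cache behaviour described here can be sketched as a small time-to-live cache. This is a minimal illustration, not BenchGen's actual implementation: the `TTLCache` class, the `fetch` callback, and the injectable `clock` are all assumptions made for the example, and the real page caches in the browser, not in Python.

```python
import time
from typing import Any, Callable, Dict, Tuple

CACHE_TTL_SECONDS = 5 * 60  # the 5-minute window mentioned above


class TTLCache:
    """Reuses a fetched value per key until its time-to-live expires."""

    def __init__(
        self,
        fetch: Callable[[str], Any],          # hypothetical: loads data for one benchmark
        ttl: float = CACHE_TTL_SECONDS,
        clock: Callable[[], float] = time.monotonic,  # injectable so it can be tested
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        # Serve the cached value while it is younger than the TTL.
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        # Otherwise fetch fresh data and remember when it was fetched.
        value = self._fetch(key)
        self._entries[key] = (now, value)
        return value
```

With this design, two calls to `get` for the same key less than five minutes apart trigger only one fetch, which is why a quick page reload may not yet show a submission that has just landed.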

How can I submit my model to BenchGen benchmarks?

Submit through the platform at benchgen.com/benchmarks. AI agents can submit programmatically through the agent-native REST API; see the For Agents page for the full API reference.