The AI Intelligence Leaderboard

Estimating the intelligence of every major AI model

AI Models on the IQ Bell Curve

Each model's estimated IQ plotted on a standard normal IQ distribution
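
To make the bell-curve placement concrete, here is a minimal sketch that converts an IQ estimate into a population percentile. It assumes the conventional IQ scale (mean 100, standard deviation 15); the page does not state the parameters it uses, and the function name `iq_percentile` is illustrative only.

```python
import math


def iq_percentile(iq, mean=100.0, sd=15.0):
    """Fraction of a normal population scoring at or below `iq`.

    ASSUMPTION: mean=100 and sd=15 are the conventional IQ scale,
    not values confirmed by the leaderboard.
    """
    z = (iq - mean) / sd
    # Normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


if __name__ == "__main__":
    for score in (100, 115, 130):
        print(score, round(iq_percentile(score), 4))
```

Under these assumptions, an estimate one standard deviation above the mean (115) lands at roughly the 84th percentile.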

How AI IQ estimates model intelligence

We archive source captures from public benchmark leaderboards and extract only source-backed values
We map each benchmark score to an implied IQ using calibrated difficulty curves
We group scored benchmarks into seven dimensions: abstract, mathematical, scientific, frontend engineering, backend engineering, computer use, and reliability
We conservatively fill missing benchmark and dimension estimates only inside the scoring pipeline
Every derived IQ averages all seven scored dimensions, so missing coverage cannot make a model look better by omission
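
The last two steps above can be sketched as follows. This is an illustration only: the fill rule (lowest observed dimension minus a penalty) and the `fill_penalty` parameter are assumptions, since the text says only that missing values are filled conservatively. What the sketch does take from the text is that the composite always averages all seven dimensions.

```python
from statistics import mean

DIMENSIONS = (
    "abstract", "mathematical", "scientific", "frontend_engineering",
    "backend_engineering", "computer_use", "reliability",
)


def composite_iq(dimension_iq, fill_penalty=5.0):
    """Average all seven dimensions, filling gaps conservatively.

    ASSUMPTION: a missing dimension is filled with the model's lowest
    observed dimension minus `fill_penalty`. The real fill rule is not
    described in the text.
    """
    observed = [dimension_iq[d] for d in DIMENSIONS if d in dimension_iq]
    if not observed:
        raise ValueError("need at least one scored dimension")
    fill = min(observed) - fill_penalty
    # Every dimension contributes, so omission cannot raise the average.
    return mean(dimension_iq.get(d, fill) for d in DIMENSIONS)


if __name__ == "__main__":
    partial = {"abstract": 130.0, "mathematical": 120.0}
    print(round(composite_iq(partial), 2))
```

With this fill rule, a model scored on only two strong dimensions ends up below the plain average of those two, which is the property the list describes.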

IQ vs Effective Cost

Each model's estimated IQ plotted against effective cost per 1M I/O Tokens (sticker price × measured or imputed usage multiplier).

Effective cost & iso-curves

Effective cost on the X-axis is sticker price for 1M I/O Tokens × measured or imputed usage multiplier. 1M I/O Tokens means 1M input tokens plus 1M output tokens, priced at published positive input/output rates.

Usage multipliers use source-backed token and task-cost data first, then same-lineage predecessors, closest measured peers, and finally a 1× fallback only for models with positive pricing. Zero/free-only rows are not plotted as $0 cost.
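
The effective-cost calculation and the multiplier fallback order described above can be sketched like this. All function and parameter names are hypothetical, and requiring both rates to be positive is one reading of "published positive input/output rates".

```python
def sticker_price(input_per_m, output_per_m):
    """Price of 1M input tokens plus 1M output tokens, or None if not priced."""
    # ASSUMPTION: both rates must be positive for a row to be plotted.
    if input_per_m <= 0 or output_per_m <= 0:
        return None
    return input_per_m + output_per_m


def usage_multiplier(measured=None, predecessor=None, peer=None):
    """Pick the first available multiplier in the documented priority order."""
    for candidate in (measured, predecessor, peer):
        if candidate is not None:
            return candidate
    return 1.0  # final 1x fallback


def effective_cost(input_per_m, output_per_m, **multiplier_sources):
    price = sticker_price(input_per_m, output_per_m)
    if price is None:
        return None  # zero/free-only rows are never plotted as $0
    return price * usage_multiplier(**multiplier_sources)


if __name__ == "__main__":
    print(effective_cost(3.0, 15.0, measured=2.0))
    print(effective_cost(0.0, 0.0))
```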

Iso-curves trace lines of equal preference for IQ versus cost. The slider weights quality vs cost: center is 1:1, drag toward Cost to make cost matter more, or toward IQ to make quality matter more. Models above and to the right of a curve are strictly better.
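
One way to picture an iso-curve is as the set of (cost, IQ) points that share the same preference score. The sketch below is not the page's formula, which is not published here. It assumes a linear trade between IQ points and log10 of cost, with a hypothetical `iq_weight` standing in for the slider (0.5 is the 1:1 center) and a hypothetical `iq_per_decade` exchange rate.

```python
import math


def preference(iq, cost, iq_weight=0.5, iq_per_decade=15.0):
    """ASSUMPTION: score = weighted IQ minus weighted log10(cost)."""
    cost_weight = 1.0 - iq_weight
    return iq_weight * iq - cost_weight * iq_per_decade * math.log10(cost)


def iso_curve_iq(cost, level, iq_weight=0.5, iq_per_decade=15.0):
    """IQ a model needs at `cost` to sit on the iso-curve with score `level`."""
    if iq_weight <= 0:
        raise ValueError("iq_weight must be positive")
    cost_weight = 1.0 - iq_weight
    return (level + cost_weight * iq_per_decade * math.log10(cost)) / iq_weight


if __name__ == "__main__":
    level = preference(120.0, 10.0)
    # IQ required to stay on the same curve at 10x the cost.
    print(iso_curve_iq(100.0, level))
```

Under these assumptions, moving `iq_weight` toward 1 flattens the curve (cost matters less), and moving it toward 0 steepens it.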

IQ vs End-to-End Response Time

Each model's IQ against its end-to-end response time, built from its own components: time to first answer token plus output tokens divided by tokens per second.

Latency by task shape

End-to-end response time combines time to first token with the time required to stream the requested output length.
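
As a small sketch of that combination, the function below adds time to first token to the streaming time for a requested output length. The function and parameter names are illustrative, and any dependence of time to first token on input length is left out because the text does not specify it.

```python
def response_time_seconds(ttft_s, output_tokens, tokens_per_second):
    """Time to first token plus time to stream `output_tokens`."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return ttft_s + output_tokens / tokens_per_second


if __name__ == "__main__":
    # Hypothetical model: 0.5 s to first token, 100 tokens/s.
    print(response_time_seconds(0.5, 200, 100.0))    # brief answer
    print(response_time_seconds(0.5, 4000, 100.0))   # long generation
```

For short answers the first-token delay dominates; for long generations the throughput term does.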

Use the input and output sliders to compare short prompts, long-context prompts, brief answers, and long generations using the same underlying response-time model as the chart gallery.

Iso-curves trace equal preference for IQ versus speed. Models farther up and to the left are better: higher estimated IQ with lower total response time.

Frontier IQ Over Time

X = release date. Y = estimated IQ. Provider step-lines connect each provider's flagship frontier checkpoints over time.

Tracking frontier progress

Each dot is a model with a known release date and a derived IQ estimate. Models are positioned left-to-right by release date, so the chart shows how the frontier changes over time rather than just where models rank today.

Provider-colored lines connect each lab's flagship frontier checkpoints. Codex, mini, nano, flash, coder, and smaller open-weight variants are omitted so the chart tracks each lab's main offering rather than every SKU.
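
A rough sketch of that filtering step, assuming variants can be recognized from whole-word tokens in the model name. The model names below are made up, the real selection logic is not published here, and smaller open-weight variants cannot be detected from the name alone, so they are not handled.

```python
import re
from datetime import date

# Variant markers named in the text above.
EXCLUDED_TOKENS = {"codex", "mini", "nano", "flash", "coder"}


def is_flagship(name):
    """True if no excluded marker appears as a whole token in the name."""
    # Token matching avoids false hits such as "mini" inside "Gemini".
    tokens = set(re.split(r"[^a-z0-9]+", name.lower()))
    return not (tokens & EXCLUDED_TOKENS)


def provider_step_line(models):
    """models: iterable of (name, release_date, iq) for one provider.

    Returns the flagship points ordered left-to-right by release date.
    """
    flagship = sorted((m for m in models if is_flagship(m[0])), key=lambda m: m[1])
    return [(released, iq) for _, released, iq in flagship]


if __name__ == "__main__":
    example = [
        ("Acme-2", date(2025, 6, 1), 120),
        ("Acme-1", date(2025, 1, 1), 110),
        ("Acme-2-mini", date(2025, 7, 1), 105),
    ]
    print(provider_step_line(example))
```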

This view is most useful for spotting whether a new release is actually ahead of its direct predecessor, or whether source coverage and conservative imputations are shaping the comparison.

Mathematical Reasoning IQ

Each model's Mathematical Reasoning IQ plotted on a standard normal IQ distribution

What it measures

Multi-step quantitative reasoning, from competition problems to research-level proofs.

FrontierMath Tier 4, FrontierMath Tier 1-3, AIME, ProofBench, MathArena

Scientific Reasoning IQ

Each model's Scientific Reasoning IQ plotted on a standard normal IQ distribution

What it measures

Graduate-level reasoning across the natural sciences and applying scientific knowledge to hard problems.

Humanity's Last Exam, CritPt, SciCode, GPQA Diamond

Abstract Reasoning IQ

Each model's Abstract Reasoning IQ plotted on a standard normal IQ distribution

What it measures

Fluid problem-solving on novel puzzles a model cannot have memorized — abstracting patterns from just a few examples.

ARC-AGI-2, ARC-AGI-1, ARC-AGI-3

Frontend Engineering IQ

Each model's Frontend Engineering IQ plotted on a standard normal IQ distribution

What it measures

Turning product and design prompts into usable apps, front-end experiences, and full-stack prototypes.

Arena.ai WebDev, DesignArena Frontend, DesignArena Full Stack, Vibe Code Bench

Backend Engineering IQ

Each model's Backend Engineering IQ plotted on a standard normal IQ distribution

What it measures

Coding fluency, repository repair, debugging, testing, and long-horizon engineering execution.

SWE Marathon, FrontierCode Diamond, FrontierSWE, SWE-Bench Verified, SWE-Bench Pro, DeepSWE, SWE-rebench, LiveCodeBench

Computer Use IQ

Each model's Computer Use IQ plotted on a standard normal IQ distribution

What it measures

Agentic operation of real tools and environments — terminals, browsers, and desktop apps.

Terminal-Bench 2.0, Terminal-Bench Hard, BrowseComp, OSWorld-Verified, Toolathlon, MCP Atlas

Reliability IQ

Each model's Reliability IQ plotted on a standard normal IQ distribution

What it measures

Following instructions precisely, staying factual, challenging false premises, and grounding answers in long source documents.

IFBench, AA Omniscience, BullshitBench v2, AA Long Chain Reasoning, FACTS Grounding

Emotional Reasoning (EQ)

Diagnostic Emotional Reasoning scores, excluded from Composite IQ

What it measures

A diagnostic view of emotional and interpersonal behavior. This is excluded from Composite IQ until the benchmark base becomes more rigorous.

EQ-Bench 3, Arena.ai Overall, AttuneBench

IQ vs Speed vs Cost in 3D

3D scatter: X = response time (log, faster to the right), Y = IQ, Z = effective cost (log). Color = provider. Drag to rotate.

Three tradeoffs at once

Most charts pit two qualities against each other. This view holds all three of the practical tradeoffs in one space: how smart a model is, how fast it answers, and what it costs to run.

IQ rises on the vertical axis, faster models sit to the right, and effective cost runs back into the depth axis on a log scale. The ideal model lives up, right, and toward the front — high intelligence, quick responses, and low cost. Drag to rotate and find where each provider clusters.
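
The "up, right, and toward the front" idea can be expressed as a dominance check across the three axes. The sketch below is one way to list models that no other model beats on all three at once; it uses made-up data points and is not a feature the page is known to implement.

```python
def dominates(a, b):
    """a, b: (iq, response_time_s, effective_cost).

    True if `a` is at least as good on every axis and strictly better on one.
    """
    at_least = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    strictly = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
    return at_least and strictly


def pareto_front(models):
    """models: dict of name -> (iq, response_time_s, effective_cost)."""
    return {
        name
        for name, point in models.items()
        if not any(
            dominates(other, point)
            for other_name, other in models.items()
            if other_name != name
        )
    }


if __name__ == "__main__":
    # Hypothetical models for illustration only.
    example = {"a": (130, 5.0, 10.0), "b": (120, 6.0, 12.0), "c": (110, 1.0, 1.0)}
    print(sorted(pareto_front(example)))
```

Here `b` drops out because `a` is smarter, faster, and cheaper, while `a` and `c` both survive because each wins on a different axis.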

IQ Methodology

Scored benchmarks, 7 dimensions

How dimensions relate to composite IQ