The AI IQ Leaderboard
AI IQ intelligently estimates the IQs of popular AI models
How AI IQ estimates model intelligence
- We archive source captures from public benchmark leaderboards and extract only source-backed values
- We map each benchmark score to an implied IQ using calibrated difficulty curves
- We group scored benchmarks into seven dimensions: abstract, mathematical, scientific, software engineering, computer use, reliability, and social reasoning
- We fill missing benchmark and dimension estimates conservatively, and only inside the scoring pipeline
- Every derived IQ averages all seven dimensions, so missing coverage cannot make a model look better by omission
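The scoring steps above can be sketched roughly as follows. This is a minimal illustration, not the site's actual pipeline. The logistic curve shape, the `midpoint_iq` and `spread` parameters, and the fill rule (lowest observed dimension minus a fixed `penalty`) are all assumptions made for the example. The names `implied_iq` and `derived_iq` are hypothetical.

```python
import math
from statistics import mean

# The seven dimensions named in the list above.
DIMENSIONS = [
    "abstract", "mathematical", "scientific", "software_engineering",
    "computer_use", "reliability", "social_reasoning",
]

def implied_iq(score, midpoint_iq, spread=15.0):
    """Map a benchmark score in (0, 1) to an implied IQ.

    ASSUMPTION: the difficulty curve is an inverse logistic, where
    midpoint_iq is the IQ at which a model would score 50%.
    """
    s = min(max(score, 0.01), 0.99)  # clamp to avoid infinite log-odds
    return midpoint_iq + spread * math.log(s / (1.0 - s))

def derived_iq(dimension_iqs, penalty=5.0):
    """Average all seven dimensions, filling gaps conservatively.

    ASSUMPTION: a missing dimension is filled with the lowest observed
    dimension minus a penalty, so omission can never raise the average.
    """
    observed = [v for v in dimension_iqs.values() if v is not None]
    if not observed:
        raise ValueError("at least one dimension estimate is required")
    fill = min(observed) - penalty
    filled = []
    for name in DIMENSIONS:
        value = dimension_iqs.get(name)
        filled.append(fill if value is None else value)
    return mean(filled)

# Hypothetical model with only two dimensions covered.
partial = {"mathematical": 130.0, "scientific": 120.0}
print(derived_iq(partial))
```

Because the fill value never exceeds the lowest observed dimension, the seven-way average in this sketch is always at or below the average of the covered dimensions alone.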
Effective cost & iso-curves
Effective cost, plotted on the X-axis, is the sticker price for 1M I/O tokens multiplied by the model's token usage multiplier. "1M I/O tokens" means 1M input tokens plus 1M output tokens, each priced at the model's published rates.
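As a worked example of the effective-cost arithmetic, here is a small sketch. The function name `effective_cost` and the prices and multiplier in the example are hypothetical, chosen only to show the calculation.

```python
def effective_cost(input_price_per_m, output_price_per_m, usage_multiplier):
    """Effective cost = sticker price for 1M I/O tokens x token usage multiplier.

    The sticker price is the cost of 1M input tokens plus 1M output tokens
    at the published per-million rates.
    """
    sticker_price = input_price_per_m + output_price_per_m
    return sticker_price * usage_multiplier

# Hypothetical model: $3 per 1M input tokens, $15 per 1M output tokens,
# and a usage multiplier of 2.0 (it uses twice the baseline token count).
print(effective_cost(3.0, 15.0, 2.0))  # 36.0
```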
Iso-curves trace lines of equal preference for IQ versus cost. The slider weights quality against cost: the center is 1:1; drag toward Cost to make cost matter more, or toward IQ to make quality matter more. Models above and to the right of a curve are strictly better than models on it under the current weighting.
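One way to picture an iso-curve is as the set of (cost, IQ) points that share the same preference score. The sketch below assumes a specific preference function, which the page does not state: IQ points traded against the base-10 logarithm of effective cost, with `iq_weight` and `cost_weight` standing in for the slider. The functions `preference` and `iso_curve_iq` and the `iq_per_decade` parameter are hypothetical.

```python
import math

def preference(iq, cost, iq_weight=1.0, cost_weight=1.0, iq_per_decade=10.0):
    """ASSUMED preference score: weighted IQ minus a penalty for each
    tenfold increase in effective cost. Equal weights model the 1:1 center."""
    return iq_weight * iq - cost_weight * iq_per_decade * math.log10(cost)

def iso_curve_iq(level, cost, iq_weight=1.0, cost_weight=1.0, iq_per_decade=10.0):
    """IQ needed at a given cost to stay on the iso-curve with score `level`."""
    return (level + cost_weight * iq_per_decade * math.log10(cost)) / iq_weight

# Trace one curve: every (cost, IQ) pair printed here has the same score.
level = preference(120.0, 10.0)
for cost in (1.0, 10.0, 100.0):
    print(cost, iso_curve_iq(level, cost))
```

Raising `cost_weight` in this sketch makes the curve steeper, so a more expensive model needs a larger IQ advantage to remain equally preferred.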
Tracking frontier progress
Each dot is a model with a known release date and a derived IQ estimate. Models are positioned left-to-right by release date, so the chart shows how the frontier changes over time rather than just where models rank today.
Provider-colored lines connect each lab's flagship frontier checkpoints. Codex, mini, nano, flash, coder, and smaller open-weight variants are omitted so the chart tracks each lab's main offering rather than every SKU.
This view is most useful for spotting whether a new release is actually ahead of its direct predecessor, or whether source coverage and conservative imputations are shaping the comparison.
What each dimension measures
- Mathematical: Multi-step quantitative reasoning, from competition problems to research-level proofs.
- Scientific: Graduate-level reasoning across the natural sciences and applying scientific knowledge to hard problems.
- Software engineering: Real-world coding, including resolving issues in live repositories, building front-end apps, and competitive programming.
- Computer use: Agentic operation of real tools and environments (terminals, browsers, and desktop apps).
- Reliability: Following instructions precisely and knowing the limits of its own knowledge instead of guessing.
- Social reasoning: Emotional and social intelligence, including reading intent, attunement, and the quality of human interaction.