Methodology
AI IQ assigns each model an estimated IQ score by evaluating performance across 4 cognitive dimensions, each measured by multiple benchmarks. Hard, ungameable benchmarks retain full IQ curves, while easier or gameable benchmarks have compressed ceilings that limit their influence. Missing benchmarks and dimensions are conservatively imputed, and the composite IQ is the mean of all four dimension scores.
This page documents the full scoring system: how the four dimensions are defined, how raw scores map to IQ via piecewise-linear interpolation, how benchmark ceilings are compressed for gameability, and how missing values are imputed.
The 4-Dimension Framework
AI IQ organizes evaluation into four cognitive dimensions. Each dimension uses multiple benchmarks that are averaged together, with missing benchmarks conservatively imputed. The benchmarks fall into two tiers:
- Hard benchmarks are frontier-discriminating tests with low gameability. They retain full IQ curves with ceilings of 143–158 and can differentiate between the strongest models.
- Compressed benchmarks are easier or more gameable tests. Their anchor curves are compressed to lower ceilings (128–140), limiting how much a high score on a gameable benchmark can inflate the composite.
The composite IQ requires at least 2 of 4 dimensions to have data. Models with fewer scored dimensions fall back to a manual IQ estimate.
Formulas
Each benchmark raw score \(s\) is converted to an IQ value via piecewise-linear interpolation over that benchmark's anchor points \(\mathbf{A} = [(s_0, a_0),\, (s_1, a_1), \ldots]\):
\[
f(s) = a_i + \frac{s - s_i}{s_{i+1} - s_i}\,(a_{i+1} - a_i) \qquad \text{for } s_i \le s \le s_{i+1},
\]
clamped to \(a_0\) below the lowest anchor and to the ceiling anchor above the highest.
Each dimension averages the IQ values of its benchmarks. Missing benchmarks are conservatively imputed before averaging (see Benchmark-Level Imputation).
The four dimensions:
\[
\begin{aligned}
\mathrm{D1}\ (\text{Abstract}) &= \operatorname{mean}\!\big(f_{\text{ARC-AGI-2}},\; f^{135}_{\text{ARC-AGI-1}}\big) \\
\mathrm{D2}\ (\text{Math}) &= \operatorname{mean}\!\big(f_{\text{FrontierMath T4}},\; f^{135}_{\text{AIME}}\big) \\
\mathrm{D3}\ (\text{Programmatic}) &= \operatorname{mean}\!\big(f_{\text{Terminal-Bench 2.0}},\; f^{128}_{\text{SWE-bench}},\; f^{140}_{\text{SciCode}}\big) \\
\mathrm{D4}\ (\text{Academic}) &= \operatorname{mean}\!\big(f_{\text{HLE}},\; f_{\text{CritPt}},\; f^{135}_{\text{GPQA}}\big)
\end{aligned}
\]
Superscripts denote compressed ceilings. \(f\) is piecewise-linear interpolation over each benchmark's anchor curve, evaluated at the model's raw score.
The composite IQ is the mean of all four dimension scores, requiring at least 2 scored dimensions:
\[
\mathrm{IQ}_{\text{composite}} = \frac{1}{N}\sum_{k=1}^{N} \mathrm{D}_k, \qquad N \ge 2.
\]
D1: Abstract Reasoning
Abstract reasoning is the ability to solve novel problems without relying on prior knowledge. This is the closest analogue to the "g factor" in human psychometrics — raw problem-solving ability applied to patterns never seen before.
ARC-AGI-2 Hard
Each puzzle requires identifying a novel visual transformation rule from examples and applying it to a new input. The puzzles are unique and cannot be memorized. This is the purest test of abstract reasoning in the benchmark set — no prior knowledge helps, only the ability to infer abstract rules from examples. Far from saturation (top models ~85%). The curve compresses above IQ 140 to reflect diminishing returns in the superhuman range.
| Score % | IQ |
|---|---|
| 0 | 70 |
| 20 | 85 |
| 40 | 95 |
| 60 | 100 |
| 75 | 115 |
| 85 | 125 |
| 95 | 140 |
| 100 | 143 |
ARC-AGI-1 Compressed · ceil 135
Same format as ARC-AGI-2 but an easier problem set. Top models now score ~96%, so it no longer discriminates at the frontier. The anchor curve is compressed from a ceiling of 152 down to 135 to limit the influence of saturated scores.
| Score % | IQ |
|---|---|
| 0 | 78 |
| 15 | 92 |
| 30 | 102 |
| 50 | 111 |
| 70 | 119 |
| 85 | 127 |
| 95 | 132 |
| 100 | 135 |
D2: Mathematical Reasoning
Mathematical reasoning and quantitative problem-solving — the ability to work with mathematical structures, proofs, and analytical frameworks. The hard benchmarks test novel quantitative reasoning that cannot be memorized from training data.
FrontierMath T4 Hard
Extremely difficult original math problems from Tier 4 of the FrontierMath benchmark. Problems are novel and cannot be found in training data. Top models currently score ~25%. T4 is averaged with AIME to form the D2 (Mathematical Reasoning) dimension score. The curve compresses above IQ 140.
| Score % | IQ |
|---|---|
| 0 | 70 |
| 5 | 100 |
| 15 | 120 |
| 30 | 135 |
| 50 | 142 |
| 70 | 148 |
| 100 | 155 |
AIME Compressed · ceil 135
Competition mathematics with integer answers. Old AIME problems are widely available in training data, with studies detecting 10–20 point contamination boosts. Models at ~98%. The original anchor curve was reshaped to be flatter in the mid-range and steeper at the top (so 80–100% scores spread out instead of bunching), then compressed from a ceiling of 146 to 135 to limit the influence of contamination-driven scores.
| Score % | IQ |
|---|---|
| 0 | 82 |
| 20 | 95 |
| 40 | 103 |
| 60 | 109 |
| 80 | 117 |
| 90 | 124 |
| 100 | 135 |
D3: Programmatic Reasoning
Practical engineering ability — the capacity to solve real-world technical problems in code and systems. The hard benchmark tests execution-based tasks that require genuine interaction with systems, while the compressed benchmarks cover real-world software engineering and scientific computing.
Terminal-Bench 2.0 Hard
Models execute shell commands in isolated Docker containers to complete practical system administration and development tasks. The interactive, execution-based format makes memorization ineffective. One of the highest-integrity benchmarks in the set. The curve compresses above IQ 140.
| Score % | IQ |
|---|---|
| 0 | 70 |
| 10 | 100 |
| 25 | 115 |
| 40 | 125 |
| 55 | 135 |
| 65 | 140 |
| 80 | 145 |
| 100 | 150 |
SWE-bench Verified Compressed · ceil 128
Models generate patches to resolve real GitHub issues and pass unit tests. However, 94% of issues predate model training cutoffs and ~30% have solution leakage. This makes it one of the most gameable benchmarks, resulting in the most aggressive compression — from a ceiling of 144 down to 128.
| Score % | IQ |
|---|---|
| 0 | 80 |
| 15 | 92 |
| 30 | 102 |
| 50 | 110 |
| 65 | 117 |
| 80 | 123 |
| 100 | 128 |
SciCode Compressed · ceil 140
Scientific computing tasks requiring domain expertise in physics, chemistry, and biology alongside programming skill. The bottleneck is understanding the science, not the programming. The interdisciplinary nature provides partial protection against memorization from academic literature. Compressed from ceiling 158 to 140 to account for moderate gameability.
| Score % | IQ |
|---|---|
| 10 | 78 |
| 20 | 88 |
| 30 | 100 |
| 40 | 108 |
| 50 | 117 |
| 60 | 125 |
| 80 | 135 |
| 100 | 140 |
D4: Academic Reasoning
Breadth and depth of expert-level knowledge across academic domains. The hard benchmarks test whether a model can answer questions that push the boundaries of human expertise itself, while the compressed benchmark tests graduate-level science knowledge.
Humanity's Last Exam Hard
Questions contributed by domain experts and explicitly screened to ensure no existing model can answer them at creation time. The benchmark spans the full frontier of human expertise. Current top score is ~48%. The curve compresses significantly above IQ 140 — even though 100% would represent superhuman breadth of knowledge, the IQ ceiling is kept at 158 so that no single benchmark can inflate the composite above ~155.
| Score % | IQ |
|---|---|
| 0 | 70 |
| 5 | 95 |
| 10 | 110 |
| 15 | 120 |
| 20 | 130 |
| 25 | 140 |
| 35 | 145 |
| 50 | 150 |
| 75 | 155 |
| 100 | 158 |
CritPt Hard
Novel mathematical analysis problems that require identifying critical points and applying analytical reasoning. Problems are original, making memorization ineffective. Scores are on a 0–20 scale (not a percentage); the current top score is 13/20. The curve compresses above IQ 140.
| Score (0–20) | IQ |
|---|---|
| 0 | 70 |
| 0.6 | 120 |
| 1.6 | 130 |
| 3 | 135 |
| 5 | 140 |
| 8 | 145 |
| 12 | 150 |
| 20 | 155 |
GPQA Diamond Compressed · ceil 135
Graduate-level science questions written by PhD experts. A 25% score equals random guessing. Domain experts score 65–81%. The public question set is widely available in training data, making contamination a significant concern. The anchor curve is compressed from ceiling 148 to 135.
| Score % | IQ |
|---|---|
| 25 | 85 |
| 35 | 98 |
| 50 | 107 |
| 65 | 115 |
| 80 | 123 |
| 90 | 131 |
| 100 | 135 |
Piecewise-Linear Interpolation
Each benchmark defines a set of anchor points mapping raw scores to IQ values. For a score \(s\) that falls between two anchors \((s_i, a_i)\) and \((s_{i+1}, a_{i+1})\), we interpolate linearly:
\[
\mathrm{IQ}(s) = a_i + \frac{s - s_i}{s_{i+1} - s_i}\,(a_{i+1} - a_i).
\]
If the score is at or below the lowest anchor, the model receives that anchor's IQ. If at or above the highest, it receives the ceiling IQ. There is no extrapolation beyond the defined range.
This approach avoids assumptions about the distribution shape between anchors. Each segment can have a different slope, allowing the curve to be steeper where small score improvements represent large cognitive leaps (e.g., going from 0% to 5% on HLE) and flatter where additional points reflect diminishing differentiation.
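To make the mapping concrete, here is a minimal Python sketch of the clamped piecewise-linear conversion, using the ARC-AGI-2 anchor table above. The function and variable names are illustrative only and are not the production implementation.

```python
# Minimal sketch of the clamped piecewise-linear score -> IQ mapping.
# Anchor points taken from the ARC-AGI-2 table above; names are illustrative.

ARC_AGI_2_ANCHORS = [(0, 70), (20, 85), (40, 95), (60, 100),
                     (75, 115), (85, 125), (95, 140), (100, 143)]

def score_to_iq(score: float, anchors: list[tuple[float, float]]) -> float:
    """Map a raw benchmark score to an IQ value by interpolating between anchors."""
    # Clamp below the lowest and above the highest anchor (no extrapolation).
    if score <= anchors[0][0]:
        return anchors[0][1]
    if score >= anchors[-1][0]:
        return anchors[-1][1]
    # Find the segment containing the score and interpolate linearly within it.
    for (s0, a0), (s1, a1) in zip(anchors, anchors[1:]):
        if s0 <= score <= s1:
            t = (score - s0) / (s1 - s0)
            return a0 + t * (a1 - a0)

print(score_to_iq(80, ARC_AGI_2_ANCHORS))  # -> 120.0, halfway between the 75->115 and 85->125 anchors
```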
Benchmark Averaging & Compression
Each dimension averages all its benchmarks together, with missing benchmarks conservatively imputed. Rather than separating benchmarks into primary/fallback tiers with a hard cap, we use compressed anchor curves to limit the influence of easier or gameable benchmarks.
How Compression Works
For compressed benchmarks, the anchor curve is rescaled so that IQ values above 100 are proportionally reduced toward a lower ceiling:
\[
a' =
\begin{cases}
a & \text{if } a \le 100 \\[4pt]
100 + (a - 100)\,\dfrac{C_{\text{new}} - 100}{C_{\text{old}} - 100} & \text{if } a > 100
\end{cases}
\]
where \(a\) is an original anchor IQ, \(C_{\text{old}}\) is the original ceiling, and \(C_{\text{new}}\) is the compressed ceiling.
Values at or below IQ 100 are unchanged. This preserves the low-end of the curve (where models genuinely struggle) while compressing the high-end where gameable benchmarks over-reward.
Why compress instead of cap? A hard cap (e.g., IQ 115) discards all discrimination above the cap — a model scoring 80% and one scoring 100% on AIME would both receive 115. Compression preserves the rank ordering while reducing the magnitude of the advantage that gameable benchmarks can confer. A perfect AIME score now yields IQ 135 instead of 146, which still contributes meaningfully but cannot dominate the dimension average.
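As an illustration, the sketch below applies one plausible form of this rescaling — anchors at or below 100 untouched, anchors above 100 scaled linearly toward the new ceiling. The example anchor values are hypothetical; the production curves are the hand-calibrated tables above.

```python
# Sketch: proportionally compress anchor IQs above 100 toward a lower ceiling.
# Assumes the linear rescaling described above; the production formula may differ.

def compress_anchors(anchors, old_ceiling, new_ceiling):
    compressed = []
    for score, iq in anchors:
        if iq <= 100:
            compressed.append((score, iq))  # low end of the curve is preserved
        else:
            scale = (new_ceiling - 100) / (old_ceiling - 100)
            compressed.append((score, 100 + (iq - 100) * scale))
    return compressed

# Hypothetical curve: the 146 ceiling maps to 135; mid anchors above 100 shrink proportionally.
print(compress_anchors([(0, 82), (60, 120), (100, 146)], old_ceiling=146, new_ceiling=135))
```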
Benchmark-Level Imputation
When a model has scores for some but not all of a dimension's benchmarks, the missing benchmarks are filled in before the dimension IQ is averaged. We use two ingredients:
- The model's available-benchmark IQ average — how the model is performing on the benchmarks it does have in this dimension. This is the within-dimension signal: if a model is hitting IQ 130 on the dimension's other benchmarks, the missing one is probably also somewhere around 130.
- The benchmark's 80th-percentile IQ (\(P_{80}\)) — a per-benchmark ceiling derived from the actual data. Take every model that has a real score on that benchmark, convert each score to an implied IQ via the anchor curve, sort those IQs from low to high, and take the value at the 80th-percentile rank. So if 50 models have HLE scores yielding implied IQs ranging from 70 to 155, \(P_{80}(\text{HLE})\) is the implied IQ at the 80th-percentile rank in that sorted list. It's "where strong-but-not-frontier models actually land on this benchmark."
The imputed value is the minimum of the two:
\[
\mathrm{IQ}_{\text{imputed}}(b) = \min\!\big(\overline{\mathrm{IQ}}_{\text{available}},\; P_{80}(b)\big)
\]
where \(\overline{\mathrm{IQ}}_{\text{available}}\) is the model's average implied IQ over the dimension's available benchmarks and \(b\) is the missing benchmark.
Why min of the two? The model's own dimension average is the best within-dimension signal we have. Capping at the 80th-percentile prevents a strong model from being imputed past where the actual data has been observed — a model averaging IQ 145 in this dimension might hit the missing benchmark's ceiling, but the imputed value won't claim that without measurement. The min lets imputation move a missing score up or down toward what the rest of the dimension implies, while staying conservatively below where the field has empirically reached.
Imputation only fires inside a dimension that has at least one real benchmark. If a dimension has zero real benchmarks for a given model, no benchmark-level imputation runs there — the dimension itself is either left missing or filled at the dimension level (see Composite IQ Calculation below).
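A minimal sketch of benchmark-level imputation within one dimension, assuming the min-of-two rule above; the benchmark names and \(P_{80}\) values in the example are hypothetical.

```python
# Sketch of benchmark-level imputation inside one dimension.
# The p80 values here are hypothetical; in production they come from observed data.
from statistics import mean
from typing import Optional

def dimension_iq(benchmark_iqs: dict[str, Optional[float]],
                 p80: dict[str, float]) -> float:
    """Average a dimension's benchmarks, filling each missing one with
    min(average of available benchmarks, that benchmark's 80th-percentile IQ)."""
    available = [iq for iq in benchmark_iqs.values() if iq is not None]
    if not available:
        raise ValueError("imputation needs at least one real benchmark in the dimension")
    avg_available = mean(available)
    filled = {
        name: iq if iq is not None else min(avg_available, p80[name])
        for name, iq in benchmark_iqs.items()
    }
    return mean(filled.values())

# Example: two benchmarks measured, one missing (hypothetical numbers).
print(dimension_iq({"hle": 138.0, "critpt": None, "gpqa": 126.0},
                   p80={"critpt": 130.0}))  # critpt imputed at min(132, 130) = 130
```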
Composite IQ Calculation
Step 1: Score Each Dimension
For every dimension where the model has at least one real benchmark, compute the dimension IQ as the average of its benchmarks. If some benchmarks within the dimension are missing, fill them via the conservative imputation above before averaging.
Step 2: Dimension-Level Imputation (D2–D4 only)
If a model has at least 2 scored dimensions but is missing some of the others, the missing dimensions D2 (Math), D3 (Programmatic), or D4 (Academic) are imputed using the same min-of-two pattern as benchmark imputation, applied one level up:
\[
\mathrm{IQ}(D_k) = \min\!\big(\overline{\mathrm{IQ}}_{\text{scored dims}},\; P_{80}(D_k)\big)
\]
where \(\overline{\mathrm{IQ}}_{\text{scored dims}}\) is the model's average IQ across the dimensions it does have, and \(P_{80}(D_k)\) is the 80th-percentile dimension-IQ across all models that have real data on dimension \(D_k\) — the same construct as before, with dimension IQs (averages of their benchmarks) sorted in place of single-benchmark implied IQs.
D1 (Abstract) is never imputed at the dimension level. If a model has no ARC-AGI data at all, D1 is left missing and the composite is averaged over the remaining dimensions.
Step 3: Compute the Composite
\[
\mathrm{IQ}_{\text{composite}} = \frac{1}{N}\sum_{k=1}^{N} \mathrm{IQ}(D_k)
\]
where \(N\) is the number of dimensions actually used (real or imputed via Step 2). After Step 2, most models have \(N=4\).
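Putting Steps 1–3 together, a compact sketch of dimension-level imputation and the composite mean might look like the following; the \(P_{80}\) values are placeholders, and this illustrates the rules rather than the production code.

```python
# Sketch of Steps 1-3: dimension-level imputation (D2-D4 only) and the composite mean.
# Dimension IQs and p80 values are illustrative placeholders.
from statistics import mean
from typing import Optional

def composite_iq(dims: dict[str, Optional[float]],
                 p80: dict[str, float]) -> Optional[float]:
    scored = {k: v for k, v in dims.items() if v is not None}
    if len(scored) < 2:
        return None                      # fewer than 2 dimensions: manual IQ estimate instead
    scored_avg = mean(scored.values())
    used = dict(scored)
    for k in ("D2", "D3", "D4"):         # D1 (Abstract) is never imputed
        if dims.get(k) is None:
            used[k] = min(scored_avg, p80[k])
    return mean(used.values())

# Example: D4 missing, imputed conservatively; D1 present so all four dimensions are used.
print(composite_iq({"D1": 128.0, "D2": 135.0, "D3": 131.0, "D4": None},
                   p80={"D2": 132.0, "D3": 130.0, "D4": 129.0}))  # -> 130.75
```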
Key rules:
- Minimum 2 dimensions required. Models with fewer than 2 scored dimensions do not receive a derived composite IQ and instead display a manual estimate.
- D1 is special. ARC-AGI data is never substituted by another dimension's average. Models without ARC data simply have a 3-dimension composite.
- Transparent count. The display shows `X/4` so readers can see how many dimensions actually had data vs. were imputed.
- Equal weighting. All dimensions contribute equally. Compressed ceilings (not differential weighting) handle benchmark quality differences.
Rank Status
Each model receives a rank status reflecting the completeness of its evaluation:
- Full — All 4 dimensions scored. The most reliable composite.
- Partial — 2–3 dimensions scored. Composite is derived but based on incomplete coverage.
- Provisional — Only 1 dimension scored. Not enough for a derived composite; falls back to manual IQ.
- Unranked — No dimension data available. Uses manual IQ estimate only.
Benchmarks Not Included
Three benchmarks that were part of the previous (v1) flat-averaging system have been removed from the composite IQ calculation:
- LiveCodeBench — While it has very low gameability due to continuously refreshed problems, it overlaps heavily with the Programmatic Reasoning dimension already covered by Terminal-Bench. Its removal avoids double-counting coding ability.
- MMLU-Pro — A 10-choice multiple-choice knowledge test. Overlaps with the Academic Reasoning dimension (GPQA/HLE) and adds limited discrimination at the frontier. Models have converged to similar high scores.
- MMMU-Pro — Multimodal academic questions. While the vision component is interesting, most frontier model evaluation focuses on text-based reasoning. This benchmark is tracked in the data but excluded from the IQ composite.
These benchmarks remain in the database and are viewable on the data page — they are simply not included in the composite IQ computation.
Supplementary Math Benchmarks (scored, not yet weighted)
Two newer math benchmarks are scored against hand-calibrated anchor curves and surfaced on the home page as standalone bell curves and cost-scatters, but are not yet weighted into the composite IQ. They sit ready to be promoted into the D2 (Mathematical Reasoning) dimension once we're confident the calibration tracks the rest of the framework.
- FrontierMath Tier 1–3 (ceiling 152) — harder than AIME, easier than T4. Top general models score ~50%. Slots between AIME and T4 in difficulty and gives more discrimination in the middle of the math distribution where T4 is too sparse.
- ProofBench (ceiling 158) — formally-verified proof writing. A different cognitive task than the problem-solving benchmarks (you have to construct a verified proof, not just give an answer). Top general models ~56%; specialized math models ~71%.
EQ Scoring
AI IQ estimates an Emotional Quotient (EQ) for each model from two complementary signals: EQ-Bench 3 Elo (AI-judged emotional understanding in challenging roleplays) and Arena Elo (broad conversational quality as ranked by human-preference voting). Each Elo score is mapped to an implied EQ via a hand-calibrated anchor curve, and the two are combined into a 50/50 composite when both are available.
If only one source is available, the composite uses that source directly.
EQ-Bench 3 Elo → EQ
EQ-Bench 3 produces Elo ratings from head-to-head emotional-roleplay matchups judged by Claude. The Elo range observed in production runs from roughly 200 (very weak) to 2000 (top frontier). The mapping:
| EQ-Bench Elo | EQ |
|---|---|
| 200 | 78 |
| 600 | 88 |
| 900 | 93 |
| 1100 | 97 |
| 1300 | 105 |
| 1500 | 113 |
| 1700 | 125 |
| 2000 | 140 |
Arena Elo → EQ
LM Arena Elo reflects broad conversational quality as judged by human voters in head-to-head matchups. The observed Elo range is tighter (~1100–1520), so the anchor curve is calibrated separately:
| Arena Elo | EQ |
|---|---|
| 1100 | 70 |
| 1200 | 80 |
| 1300 | 95 |
| 1350 | 105 |
| 1400 | 113 |
| 1450 | 122 |
| 1500 | 132 |
| 1520 | 140 |
Anthropic Family-Bias Adjustment
EQ-Bench 3 is judged by Claude (an Anthropic model), which creates potential scoring bias in favor of Anthropic models. To correct for this, we subtract a 200-point Elo penalty from the EQ-Bench component for all Anthropic models before mapping to implied EQ. The Arena component is unaffected since it uses human judges.
Why two sources? EQ-Bench is a dedicated emotional-intelligence benchmark with explicit roleplay scenarios; it's the most direct measurement we have, but it's AI-judged. Arena is human-judged but covers general conversation, not emotional intelligence specifically. Averaging the two gives a balance of specificity and judgment-source diversity.
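A sketch of the full EQ pipeline under the rules above — anchor-curve interpolation for each Elo source, the 200-point Anthropic adjustment on the EQ-Bench side, and a 50/50 average when both sources exist. The Elo inputs in the example are hypothetical.

```python
# Sketch of the EQ composite. Anchor tables are copied from the two tables above;
# function names are illustrative, not the production implementation.

EQBENCH_ANCHORS = [(200, 78), (600, 88), (900, 93), (1100, 97),
                   (1300, 105), (1500, 113), (1700, 125), (2000, 140)]
ARENA_ANCHORS = [(1100, 70), (1200, 80), (1300, 95), (1350, 105),
                 (1400, 113), (1450, 122), (1500, 132), (1520, 140)]

def interp(x, anchors):
    """Clamped piecewise-linear interpolation over (elo, eq) anchor points."""
    if x <= anchors[0][0]:
        return anchors[0][1]
    if x >= anchors[-1][0]:
        return anchors[-1][1]
    for (x0, y0), (x1, y1) in zip(anchors, anchors[1:]):
        if x0 <= x <= x1:
            return y0 + (x - x0) / (x1 - x0) * (y1 - y0)

def eq_score(eqbench_elo=None, arena_elo=None, is_anthropic=False):
    parts = []
    if eqbench_elo is not None:
        # Family-bias adjustment: Anthropic models lose 200 Elo on the Claude-judged source.
        adjusted = eqbench_elo - 200 if is_anthropic else eqbench_elo
        parts.append(interp(adjusted, EQBENCH_ANCHORS))
    if arena_elo is not None:
        parts.append(interp(arena_elo, ARENA_ANCHORS))
    # 50/50 average when both sources exist; single source used directly otherwise.
    return sum(parts) / len(parts) if parts else None

print(eq_score(eqbench_elo=1600, arena_elo=1425))  # -> 118.25 (hypothetical Elo inputs)
```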
Cost & Speed Metrics
Token Cost — sticker price for a typical workload
Cost on the home page is anchored to a 2:1 input-to-output token mix — a deliberately input-heavy workload that reflects most real applications (RAG, long-context reasoning, agent loops). Token Cost is the sticker dollar amount to process 2M input tokens and generate 1M output tokens at a model's published rates:
\[
\text{Token Cost} = 2\,p_{\text{in}} + 1\,p_{\text{out}}
\]
where \(p_{\text{in}}\) and \(p_{\text{out}}\) are the published per-million-token prices in dollars.
Token Efficiency — how token-hungry is the model?
Sticker price alone hides large per-task differences in how many tokens a model burns to do a given amount of work. A frugal model can be cheaper in practice than a pricier one that uses fewer tokens, and reasoning models can spend dramatically more output tokens than non-reasoning ones. We measure this with the Artificial Analysis Index token-usage data — the total tokens (input + reasoning + output) each model consumes when running the AA evaluation suite:
\[
\text{Token Efficiency Multiplier} = \frac{T_{\text{model}}}{\operatorname{median}_{m}\, T_m}
\]
where \(T_{\text{model}}\) is the model's total tokens consumed running the AA evaluation suite, and the median is taken across all models with token-usage data. A multiplier of 1.0 means the model uses the same total tokens as the median; 2× means it uses twice as many; 0.5× means half.
Effective Cost — what it actually costs to do the same task
The product of the two:
\[
\text{Effective Cost} = \text{Token Cost} \times \text{Token Efficiency Multiplier}
\]
Reads as: what this model spends on a task that the median model handles with 2M input + 1M output tokens. Models below the diagonal (Effective Cost < Token Cost) are token-efficient and cheaper than their sticker suggests; models above are token-hungry. This is the cost axis on every cost-vs-quality chart on the home page.
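A small sketch tying the three cost metrics together; the prices and token totals in the example are hypothetical, not published figures.

```python
# Sketch of the three cost metrics for one model; inputs are illustrative only.
from statistics import median

def token_cost(p_in: float, p_out: float) -> float:
    """Sticker cost of 2M input + 1M output tokens at $/M-token prices."""
    return 2 * p_in + 1 * p_out

def efficiency_multiplier(model_tokens: float, all_model_tokens: list[float]) -> float:
    """Total tokens on the AA suite relative to the median model."""
    return model_tokens / median(all_model_tokens)

def effective_cost(p_in, p_out, model_tokens, all_model_tokens):
    return token_cost(p_in, p_out) * efficiency_multiplier(model_tokens, all_model_tokens)

# Example: $3/M input, $15/M output, and a model that burns 1.4x the median token budget.
fleet_tokens = [40e6, 55e6, 70e6, 90e6, 120e6]     # hypothetical AA-suite totals
print(token_cost(3, 15))                            # -> 21.0 (sticker)
print(effective_cost(3, 15, 98e6, fleet_tokens))    # -> 29.4 (token-hungry: costs more in practice)
```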
Response Time
Response time is the median seconds to a complete answer (lower is better), shown on a logarithmic scale. The IQ vs Response Time chart reverses the X axis so the upper-right corner represents the ideal — high intelligence at low latency.
Limitations & Transparency
- Dimension coverage varies. Some models have data for all 4 dimensions; others have as few as 2 (with the rest imputed). A model's composite IQ is most reliable when all dimensions are scored. Always check the `X/4` count and rank status.
- Benchmark mix matters. Two models with the same composite IQ may have very different underlying data quality. One might have all hard benchmarks (ungameable tests with full curves) while another relies mostly on compressed benchmarks (with lower ceilings). The rank status and dimension count help distinguish these cases.
- Imputation is conservative, not clairvoyant. Missing benchmarks and dimensions are filled with the min-of-two rule described above (the model's own average, capped at the 80th-percentile of observed results). These are reasonable estimates, not ground truth — a model's true ability on an unevaluated benchmark could be significantly higher or lower.
- Anchor calibration is subjective. The mapping from raw scores to IQ involves judgment calls about what different performance levels mean relative to human cognitive ability. We document our rationale for each benchmark, but reasonable people can disagree.
- IQ is a metaphor. Human IQ tests measure a specific construct via standardized instruments under controlled conditions. AI benchmark performance is a different thing. The IQ scale provides an intuitive frame of reference, not a claim of equivalence.
- Compressed ceilings are a design choice. The ceiling values directly affect which models benefit and which are penalized. Models that excel on compressed benchmarks will have their contributions limited, which may feel unfair if those benchmarks genuinely reflect high ability. We believe the trade-off — rewarding harder evaluation — is correct, but the specific ceiling values are judgment calls.
- Benchmarks become stale. As models improve and training data evolves, benchmark ceilings, gameability ratings, and compression levels may need revision. This methodology is a living document.
Asymptotic Compression Above IQ 140
The anchor point curves intentionally compress above IQ 140. Each additional percentage point on a benchmark contributes less to the IQ score in the superhuman range than in the human range. This reflects three realities:
- Human IQ distributions compress at the tails. The difference between IQ 100 and IQ 120 is much more common than the difference between IQ 140 and IQ 160.
- Superhuman benchmark scores are driven by breadth, not depth. A model scoring 50% on FrontierMath T4 isn't twice as smart as one scoring 25% — it covers more mathematical branches rather than being fundamentally more capable in any single branch.
- Practical discrimination. Without compression, reasoning vs. non-reasoning configurations of the same model produce 20+ point IQ gaps, which is unrealistic. With compression, the gap narrows to ~10-12 points (“smart” vs. “very smart” rather than “above average” vs. “genius”).
The compression ensures that no benchmark can single-handedly produce IQ values above ~155, regardless of raw score. The theoretical ceiling of the composite is approximately 150–155 under current benchmarks.
Sources
Benchmark scores, prices, and token usage come from publicly published leaderboards. Each source is sampled periodically and reconciled against published numbers before being applied.
- Artificial Analysis Intelligence Index — the primary aggregator. Provides scores for AIME, GPQA Diamond, SWE-Bench Verified, HLE, SciCode, Terminal-Bench 2.0, CritPt, LiveCodeBench, IFBench, MMMU-Pro, and the AA composite indices (Omniscience, GDPval, τ2-Bench Telecom, LCR), plus per-model pricing (input + output $/M tokens), response time, median throughput, total evaluation cost for the AA suite, and the token-usage breakdown (input + reasoning + output) used for token efficiency.
- LM Arena — head-to-head Elo ratings and ranks
- ARC Prize leaderboard — ARC-AGI-1 and ARC-AGI-2 scores and per-task cost
- Vals.ai — the Vals Index (accuracy, cost-per-test, and latency on a curated agentic-task suite) and ProofBench (formally-verified proof-writing accuracy)
- Epoch AI — FrontierMath Tier 1–3 and Tier 4 accuracy
- EQ-Bench 3 — emotional-intelligence Elo
The Artificial Analysis Intelligence Index can list two rows under one display name when the same underlying model has both a reasoning and a non-reasoning configuration (the reasoning row is marked with a 💡 lightbulb icon). When the two configurations differ meaningfully in cost, latency, or quality, they are tracked as separate model entries.