AI IQ


Methodology

AI IQ assigns each model an estimated IQ score by evaluating performance across 4 cognitive dimensions, each measured by multiple benchmarks. Hard, ungameable benchmarks retain full IQ curves, while easier or gameable benchmarks have compressed ceilings that limit their influence. Missing benchmarks and dimensions are conservatively imputed, and the composite IQ is the mean of all four dimension scores.

This page documents the full scoring system: how the four dimensions are defined, how raw scores map to IQ via piecewise-linear interpolation, how benchmark ceilings are compressed for gameability, and how missing values are imputed.

The 4-Dimension Framework

AI IQ organizes evaluation into four cognitive dimensions. Each dimension uses multiple benchmarks that are averaged together, with missing benchmarks conservatively imputed. The composite IQ requires at least 2 of the 4 dimensions to have data; models with fewer scored dimensions fall back to a manual IQ estimate. The four dimensions and their benchmarks:

D1 · Abstract Reasoning: ARC-AGI-2 (ceil 143), ARC-AGI-1 (compressed 135)
D2 · Mathematical Reasoning: FrontierMath T4 (ceil 155), AIME (compressed 130)
D3 · Programmatic Reasoning: Terminal-Bench 2.0 (ceil 150), SWE-bench (compressed 128), SciCode (compressed 140)
D4 · Academic Reasoning: Humanity's Last Exam (ceil 158), CritPt (ceil 155), GPQA Diamond (compressed 135)

Formulas

Each benchmark raw score \(s\) is converted to an IQ value via piecewise-linear interpolation over that benchmark's anchor points \(\mathbf{A} = [(s_0, a_0),\, (s_1, a_1), \ldots]\):

$$f(s) = a_i + \frac{s - s_i}{s_{i+1} - s_i}\,(a_{i+1} - a_i), \qquad s_i \le s \le s_{i+1}$$

Each dimension averages the IQ values of its benchmarks. Missing benchmarks are conservatively imputed before averaging using a symmetric 3-tier system (see Benchmark-Level Imputation).

The four dimensions:

$$\begin{array}{l l} \mathrm{IQ}_{\text{Abstract}} & = \operatorname{avg}\!\left(f(\text{ARC-AGI-2}),\; f(\text{ARC-AGI-1}^{135})\right) \\[6pt] \mathrm{IQ}_{\text{Math}} & = \operatorname{avg}\!\left(f(\text{FrontierMath T4}),\; f(\text{AIME}^{130})\right) \\[6pt] \mathrm{IQ}_{\text{Programmatic}} & = \operatorname{avg}\!\left(f(\text{Terminal-Bench 2.0}),\; f(\text{SWE-bench}^{128}),\; f(\text{SciCode}^{140})\right) \\[6pt] \mathrm{IQ}_{\text{Academic}} & = \operatorname{avg}\!\left(f(\text{HLE}),\; f(\text{CritPt}),\; f(\text{GPQA}^{135})\right) \end{array}$$

Superscripts denote compressed ceilings. \(f\) is piecewise-linear interpolation over each benchmark's anchor curve.

The composite IQ is the mean of all four dimension scores, requiring at least 2 scored dimensions:

$$\boxed{\;\mathrm{IQ} = \frac{1}{4}\!\left(\mathrm{IQ}_{\text{Abstract}} + \mathrm{IQ}_{\text{Math}} + \mathrm{IQ}_{\text{Programmatic}} + \mathrm{IQ}_{\text{Academic}}\right), \qquad n_{\text{scored}} \ge 2\;}$$

D1: Abstract Reasoning

Abstract reasoning is the ability to solve novel problems without relying on prior knowledge. This is the closest analogue to the "g factor" in human psychometrics — raw problem-solving ability applied to patterns never seen before.

ARC-AGI-2 Hard

Format: Visual grid puzzles (novel patterns)
Tasks: Unique visual pattern completion
Gameability: Essentially Ungameable
IQ Ceiling: 143

Each puzzle requires identifying a novel visual transformation rule from examples and applying it to a new input. The puzzles are unique and cannot be memorized, making this the purest test of abstract reasoning in the benchmark set: no prior knowledge helps, only the ability to infer abstract rules from examples. The benchmark is far from saturation (top models score ~85%), and the curve compresses above IQ 140 to reflect diminishing returns in the superhuman range.

Score (%)   IQ
0           70
20          85
40          95
60          100
75          115
85          125
95          140
100         143

ARC-AGI-1 Compressed · ceil 135

Format: Visual grid puzzles
Gameability: Ungameable (but saturating)
Original Ceiling: 152
Compressed Ceiling: 135

Same format as ARC-AGI-2 but an easier problem set. Top models now score ~96%, so it no longer discriminates at the frontier. The anchor curve is compressed from a ceiling of 152 down to 135 to limit the influence of saturated scores.

Score (%)   IQ
0           78
15          92
30          102
50          111
70          119
85          127
95          132
100         135

D2: Mathematical Reasoning

Mathematical reasoning and quantitative problem-solving — the ability to work with mathematical structures, proofs, and analytical frameworks. The hard benchmarks test novel quantitative reasoning that cannot be memorized from training data.

FrontierMath T4 Hard

Format: Novel research-level math problems
Tier: 4 (research-level)
Gameability: Very Low
IQ Ceiling: 155

Extremely difficult original math problems from Tier 4 of the FrontierMath benchmark. Problems are novel and cannot be found in training data. Top models score ~25%. Currently no T4 data is available for any model — this benchmark is included in the framework for future use. When data exists, it will be averaged with AIME for the D2 score. The curve compresses above IQ 140.

Score (%)   IQ
0           70
5           100
15          120
30          135
50          142
70          148
100         155

AIME Compressed · ceil 130

Format: Integer answers (0–999)
Questions: 15 per exam
Gameability: High
Original Ceiling: 146
Compressed Ceiling: 130

Competition mathematics with integer answers. Old AIME problems are widely available in training data, with studies detecting 10–20 point contamination boosts; top models now score ~98%. The anchor curve is compressed from a ceiling of 146 to 130 to limit the influence of contamination-driven scores.

Score (%)   IQ
0           82
20          95
40          104
60          112
80          120
90          124
100         130

D3: Programmatic Reasoning

Practical engineering ability — the capacity to solve real-world technical problems in code and systems. The hard benchmark tests execution-based tasks that require genuine interaction with systems, while the compressed benchmarks cover real-world software engineering and scientific computing.

Terminal-Bench 2.0 Hard

Format: Docker container tasks (shell commands)
Tasks: 89 practical tasks
Gameability: Low
IQ Ceiling: 150

Models execute shell commands in isolated Docker containers to complete practical system administration and development tasks. The interactive, execution-based format makes memorization ineffective. One of the highest-integrity benchmarks in the set. The curve compresses above IQ 140.

Score (%)   IQ
0           70
10          100
25          115
40          125
55          135
65          140
80          145
100         150

SWE-bench Verified Compressed · ceil 128

Format: Real GitHub issue resolution
Tasks: 500 verified issues
Gameability: Very High
Original Ceiling: 144
Compressed Ceiling: 128

Models generate patches to resolve real GitHub issues and pass unit tests. However, 94% of issues predate model training cutoffs and ~30% have solution leakage. This makes it one of the most gameable benchmarks, resulting in the most aggressive compression — from a ceiling of 144 down to 128.

Score (%)   IQ
0           80
15          92
30          102
50          110
65          117
80          123
100         128

SciCode Compressed · ceil 140

Format: Scientific coding tasks
Gameability: Moderate
Original Ceiling: 158
Compressed Ceiling: 140

Scientific computing tasks requiring domain expertise in physics, chemistry, and biology alongside programming skill. The bottleneck is understanding the science, not the programming. The interdisciplinary nature provides partial protection against memorization from academic literature. Compressed from ceiling 158 to 140 to account for moderate gameability.

Score (%)   IQ
10          78
20          88
30          100
40          108
50          117
60          125
80          135
100         140

D4: Academic Reasoning

Breadth and depth of expert-level knowledge across academic domains. The hard benchmarks test whether a model can answer questions that push the boundaries of human expertise itself, while the compressed benchmark tests graduate-level science knowledge.

Humanity's Last Exam Hard

Format: 76% exact-match, expert-contributed
Questions: 3,000 (expert-sourced, screened against models)
Gameability: Low
IQ Ceiling: 158

Questions contributed by domain experts and explicitly screened to ensure no existing model can answer them at creation time. The benchmark spans the full frontier of human expertise. Current top score is ~48%. The curve compresses significantly above IQ 140 — even though 100% would represent superhuman breadth of knowledge, the IQ ceiling is kept at 158 so that no single benchmark can inflate the composite above ~155.

Score (%)   IQ
0           70
5           95
10          110
15          120
20          130
25          140
35          145
50          150
75          155
100         158

CritPt Hard

Format: Critical-point analysis (novel problems)
Tasks: 20 problems
Gameability: Low
IQ Ceiling: 155

Novel mathematical analysis problems that require identifying critical points and applying analytical reasoning. Problems are original, making memorization ineffective. Current top score is 13/20. Scores are on a 0–20 scale (not percentage). The curve compresses above IQ 140.

Score (0–20)   IQ
0              70
0.6            120
1.6            130
3              135
5              140
8              145
12             150
20             155

GPQA Diamond Compressed · ceil 135

Format: 4-choice multiple choice
Questions: 198 (public set)
Gameability: Moderate-High
Original Ceiling: 148
Compressed Ceiling: 135

Graduate-level science questions written by PhD experts. A 25% score equals random guessing. Domain experts score 65–81%. The public question set is widely available in training data, making contamination a significant concern. The anchor curve is compressed from ceiling 148 to 135.

Score (%)   IQ
25          85
35          98
50          107
65          115
80          123
90          131
100         135

Piecewise-Linear Interpolation

Each benchmark defines a set of anchor points mapping raw scores to IQ values. For scores that fall between two anchors, we use piecewise-linear interpolation:

$$t = \frac{s - s_i}{s_{i+1} - s_i}, \qquad \mathrm{IQ} = a_i + t \cdot (a_{i+1} - a_i)$$

If the score is at or below the lowest anchor, the model receives that anchor's IQ. If at or above the highest, it receives the ceiling IQ. There is no extrapolation beyond the defined range.

This approach avoids assumptions about the distribution shape between anchors. Each segment can have a different slope, allowing the curve to be steeper where small score improvements represent large cognitive leaps (e.g., going from 0% to 5% on HLE) and flatter where additional points reflect diminishing differentiation.
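As a concrete sketch, this rule fits in a few lines of Python. `interp_iq` is an illustrative name, not the production code; the anchor list reproduces the ARC-AGI-2 table above:

```python
def interp_iq(score, anchors):
    """Piecewise-linear interpolation over (raw score, IQ) anchor
    points, clamped at the lowest and highest anchors (no
    extrapolation beyond the defined range)."""
    if score <= anchors[0][0]:
        return float(anchors[0][1])   # floor: lowest anchor's IQ
    if score >= anchors[-1][0]:
        return float(anchors[-1][1])  # ceiling IQ
    for (s0, a0), (s1, a1) in zip(anchors, anchors[1:]):
        if score <= s1:
            t = (score - s0) / (s1 - s0)  # position within the segment
            return a0 + t * (a1 - a0)

# ARC-AGI-2 anchors, taken from the table above:
ARC_AGI_2 = [(0, 70), (20, 85), (40, 95), (60, 100),
             (75, 115), (85, 125), (95, 140), (100, 143)]
```

For example, a score of 70% falls between the (60, 100) and (75, 115) anchors, giving t = 2/3 and an IQ of 110.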

Benchmark Averaging & Compression

Each dimension averages all its benchmarks together, with missing benchmarks conservatively imputed. Rather than separating benchmarks into primary/fallback tiers with a hard cap, we use compressed anchor curves to limit the influence of easier or gameable benchmarks.

How Compression Works

For compressed benchmarks, the anchor curve is rescaled so that IQ values above 100 are proportionally reduced toward a lower ceiling:

$$\mathrm{IQ}_{\text{compressed}} = 100 + (\mathrm{IQ}_{\text{orig}} - 100) \times \frac{C_{\text{new}} - 100}{C_{\text{orig}} - 100}, \qquad \mathrm{IQ}_{\text{orig}} > 100$$

Values at or below IQ 100 are unchanged. This preserves the low end of the curve (where models genuinely struggle) while compressing the high end, where gameable benchmarks over-reward.

Why compress instead of cap? A hard cap (e.g., IQ 115) discards all discrimination above the cap — a model scoring 80% and one scoring 100% on AIME would both receive 115. Compression preserves the rank ordering while reducing the magnitude of the advantage that gameable benchmarks can confer. A perfect AIME score now yields IQ 130 instead of 146, which still contributes meaningfully but cannot dominate the dimension average.
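The compression formula can be sketched directly in code; `compress` is an illustrative helper under the definitions above, not the production implementation:

```python
def compress(iq_orig, c_orig, c_new):
    """Rescale IQ values above 100 toward the lower ceiling c_new;
    values at or below IQ 100 pass through unchanged."""
    if iq_orig <= 100:
        return float(iq_orig)
    return 100 + (iq_orig - 100) * (c_new - 100) / (c_orig - 100)

# A perfect AIME score maps to IQ 146 on the original curve;
# compressing from ceiling 146 to 130 gives
#   100 + 46 * (30 / 46) = 130
```

Note that the rescaling is linear above 100, which is what preserves rank ordering: two models that differ on the original curve still differ, by a proportionally smaller amount, on the compressed curve.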

Benchmark-Level Imputation

When a model is missing benchmark scores, the missing values are filled in before dimension IQs are computed. A symmetric 3-tier imputation system is applied to all 10 benchmarks across all 4 dimensions. For each dimension, the imputation uses only real data from the other 3 dimensions as the predictor (leave-one-dimension-out), preventing circular dependencies.

  1. Tier 1 — Family match: If a weaker family member (same model family, leave-out IQ at least 3 points lower, benchmark distance ≤ 15) has real data for this benchmark, copy its score. The IQ margin ensures we only impute downward — a model never inherits a score from a stronger sibling.
  2. Tier 2 — Grouping regression: If the model’s grouping (e.g., China, OpenAI, Anthropic) has a positive-slope linear regression for this benchmark, and the model’s leave-out IQ falls within the grouping’s training range, predict from the within-grouping regression. The prediction is capped at the global regression to moderate outliers.
  3. Tier 3 — Conservative fallback: Use min(median score, global regression prediction), clamped to [0, 100]. This ensures models without strong cross-dimensional evidence cannot score above the median through imputation.
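The Tier 3 fallback can be sketched as follows. The `slope` and `intercept` parameters are a hypothetical stand-in for the fitted global regression of benchmark score on leave-out IQ, whose actual fitting details are internal to the pipeline:

```python
from statistics import median

def tier3_fallback(real_scores, leave_out_iq, slope, intercept):
    """Conservative fallback: min(median of real scores for this
    benchmark, global regression prediction), clamped to [0, 100].
    slope/intercept model a hypothetical fitted global regression."""
    predicted = slope * leave_out_iq + intercept
    value = min(median(real_scores), predicted)
    return max(0.0, min(100.0, value))
```

Taking the minimum is what makes the tier conservative: a model without strong cross-dimensional evidence can never be imputed above the median.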

Why leave-one-dimension-out? To impute a missing benchmark in dimension Di, we compute each model’s “leave-out IQ” from only the other 3 dimensions’ real data. This prevents imputed values from leaking into the predictor axis — all regressions and family comparisons use only original measurements. Every dimension is treated identically; there is no special ordering or phased imputation.

Why impute downward only (Tier 1)? Models from the same family can have very different capabilities. The −3 IQ margin ensures we only copy scores from a demonstrably weaker relative. For example, gpt-5-mini’s ARC-AGI scores can be used for gpt-oss-120b (since gpt-5-mini has lower leave-out IQ), but o3’s scores cannot — o3 may be substantially better at ARC despite similar overall IQ.

Composite IQ Calculation

After benchmark-level imputation fills in all missing scores, each dimension has a complete set of benchmarks. The composite IQ is always computed over all 4 dimensions.

Step 1: Score All Dimensions

For each dimension, the dimension IQ is computed by averaging all its benchmarks (hard + compressed). Because the 3-tier imputation has already filled in missing benchmarks, every model has scores for all 10 benchmarks and therefore all 4 dimensions.

Step 2: Safety-Net Dimension Imputation

In the rare case that a model has no real or imputed data for an entire dimension (D2–D4), a fallback applies:

$$\mathrm{IQ}_{D_k}^{\text{imputed}} = \min\!\left(\bar{D}_{\text{known}},\; P_{80}(D_k)\right) \qquad k \in \{2,3,4\}$$

This is a safety net that rarely triggers since the benchmark-level 3-tier system fills in missing scores first. D1 is never imputed at the dimension level — if a model has no D1 data even after benchmark imputation, the composite uses the remaining dimensions.
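A minimal sketch of this safety net under the formula above; the nearest-rank percentile picker here is a simple illustrative choice, not necessarily the production method:

```python
def impute_dimension(known_dim_iqs, all_models_dim_iqs):
    """Dimension-level safety net: min(mean of the model's known
    dimension IQs, 80th percentile of this dimension across all
    models). Uses a simple nearest-rank percentile as a stand-in."""
    mean_known = sum(known_dim_iqs) / len(known_dim_iqs)
    ranked = sorted(all_models_dim_iqs)
    p80 = ranked[int(0.8 * (len(ranked) - 1))]
    return min(mean_known, p80)
```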

Step 3: Compute the Composite

$$\mathrm{IQ} = \operatorname{round}\!\left(\frac{1}{N}\sum_{k=1}^{N}\mathrm{IQ}_{D_k}\right)$$

where \(N\) is the number of dimensions with data. With the 3-tier imputation, most models have \(N=4\).
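Putting the steps together, a minimal sketch of the composite; `None` marks a dimension without data, and a `None` result stands in for the manual-estimate fallback:

```python
def composite_iq(dim_iqs):
    """Mean of the scored dimension IQs, rounded to an integer.
    Requires at least 2 scored dimensions; returns None to signal
    the manual-estimate fallback otherwise."""
    scored = [iq for iq in dim_iqs if iq is not None]
    if len(scored) < 2:
        return None
    return round(sum(scored) / len(scored))
```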


Imputation Examples

The following table shows selected models with imputed benchmarks, illustrating how the 3 tiers work across all dimensions:

Model · IQ · Imputed · Tier Breakdown

gpt-5.3-codex · 129 · 5/10 · arcAgi2, arcAgi1, fmT4Acc, swebench from gpt-5.2-pro; aime from gpt-5.2 (Family)
gemini-3-deep-think · 129 · 5/10 · fmT4Acc, critPt, terminalbench, swebench, sciCode from gemini-3-flash (Family)
opus-4.6-nonreasoning · 118 · 5/10 · arcAgi2, arcAgi1, aime, terminalbench, swebench from sonnet-4.5 (Family)
gpt-oss-120b · 107 · 3/10 · arcAgi2, arcAgi1 from gpt-5-mini; fmT4Acc from gpt-5-nano (Family)
glm-4.7 · 112 · 2/10 · arcAgi2, arcAgi1 (Conservative — no weaker family match)
ernie-5.0-thinking-preview · 110 · 5/10 · arcAgi2, arcAgi1 (China regression); fmT4Acc, terminalbench, swebench (Conservative)
kimi-k2.5 · 117 · 1/10 · aime (China regression)
deepseek-r1 · 105 · 3/10 · fmT4Acc, terminalbench, swebench (Conservative)

Rank Status

Each model receives a rank status reflecting the completeness of its evaluation.

Benchmarks Not Included

Three benchmarks that were part of the previous (v1) flat-averaging system have been removed from the composite IQ calculation.

These benchmarks remain in the database and are viewable on the data page — they are simply not included in the composite IQ computation.

EQ Scoring

AI IQ also estimates an Emotional Quotient (EQ) for each model, measuring social and emotional intelligence across 11 sub-dimensions:

Humanlike: How natural and human the responses feel
Safety: Responsible and safe behavior
Assertive: Confidence and directness
Social IQ: Understanding of social dynamics
Warm: Friendliness and approachability
Analytic: Structured emotional reasoning
Insight: Depth of psychological understanding
Empathy: Ability to understand feelings
Compliant: Agreeableness and cooperation
Moralising: Tendency toward moral judgment
Pragmatic: Practical, solution-oriented responses

Each sub-dimension is scored on a 0–10 scale and mapped to an EQ value using shared anchor points:

Raw (0–10)   EQ
0            55
3            70
5            85
6            95
7            105
8            115
9            130
10           145

EQ-Bench Elo (Preferred Source)

When available, we use a model's EQ-Bench 3 Elo rating as the preferred EQ source. EQ-Bench is a dedicated emotional intelligence benchmark that produces Elo ratings reflecting relative emotional understanding:

EQ-Bench Elo   EQ
200            55
600            70
900            85
1100           95
1300           105
1500           115
1700           130
2000           145

When EQ-Bench Elo is not available, the composite EQ is computed as the mean of the 11 sub-dimension EQ scores (minimum 2 required).
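A sketch of this source-preference logic; the helper names are illustrative, and the Elo anchors reproduce the table above:

```python
# Elo → EQ anchors, taken from the table above.
ELO_ANCHORS = [(200, 55), (600, 70), (900, 85), (1100, 95),
               (1300, 105), (1500, 115), (1700, 130), (2000, 145)]

def interp(x, anchors):
    """Piecewise-linear interpolation, clamped at both ends."""
    if x <= anchors[0][0]:
        return float(anchors[0][1])
    if x >= anchors[-1][0]:
        return float(anchors[-1][1])
    for (x0, y0), (x1, y1) in zip(anchors, anchors[1:]):
        if x <= x1:
            return y0 + (x - x0) / (x1 - x0) * (y1 - y0)

def composite_eq(elo, subdim_eqs):
    """Prefer the EQ-Bench Elo mapping when an Elo exists; otherwise
    average the sub-dimension EQ values (at least 2 required)."""
    if elo is not None:
        return interp(elo, ELO_ANCHORS)
    scored = [eq for eq in subdim_eqs if eq is not None]
    if len(scored) < 2:
        return None
    return sum(scored) / len(scored)
```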

Cost & Speed Metrics

Query Assumptions

All cost calculations assume a standard query of 1,000 input tokens and 2,000 output tokens, representing a typical conversational exchange.

$$C_{\text{query}} = 1000 \cdot p_{\text{in}} + 2000 \cdot p_{\text{out}}$$
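In code, the standard-query cost is a one-liner; the prices in the example are hypothetical per-token USD rates, not any particular model's pricing:

```python
def query_cost(p_in, p_out):
    """Cost of the standard query: 1,000 input tokens plus
    2,000 output tokens at per-token prices p_in and p_out (USD)."""
    return 1000 * p_in + 2000 * p_out

# Hypothetical pricing of $3 per million input tokens and $15 per
# million output tokens:
#   1000 * 3e-6 + 2000 * 15e-6 = $0.033 per query
# i.e. $33 per 1,000 queries.
```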

Charts display cost per 1,000 queries on a logarithmic scale to handle the wide price range between models. Response time uses a log scale as well. Both axes are reversed so that the upper-right corner of every chart represents the best outcome: high intelligence at low cost and fast speed.

Limitations & Transparency

Asymptotic Compression Above IQ 140

The anchor point curves intentionally compress above IQ 140. Each additional percentage point on a benchmark contributes less to the IQ score in the superhuman range than in the human range. This reflects three realities:

  1. Human IQ distributions compress at the tails. The difference between IQ 100 and IQ 120 is much more common than the difference between IQ 140 and IQ 160.
  2. Superhuman benchmark scores are driven by breadth, not depth. A model scoring 50% on FrontierMath T4 isn't twice as smart as one scoring 25% — it covers more mathematical branches rather than being fundamentally more capable in any single branch.
  3. Practical discrimination. Without compression, reasoning vs. non-reasoning configurations of the same model produce 20+ point IQ gaps, which are unrealistic. With compression, the gap narrows to ~10–12 points (“smart” vs. “very smart” rather than “above average” vs. “genius”).

The compression ensures that no benchmark can single-handedly produce IQ values above ~155, regardless of raw score. The theoretical ceiling of the composite is approximately 150–155 under current benchmarks.