AI IQ


Methodology

AI IQ assigns each model an estimated IQ score by evaluating performance across 4 cognitive dimensions, each measured by multiple benchmarks. Hard, ungameable benchmarks retain full IQ curves, while easier or gameable benchmarks have compressed ceilings that limit their influence. Missing benchmarks and dimensions are conservatively imputed, and the composite IQ is the mean of all four dimension scores.

This page documents the full scoring system: how the four dimensions are defined, how raw scores map to IQ via piecewise-linear interpolation, how benchmark ceilings are compressed for gameability, and how missing values are imputed.

The 4-Dimension Framework

AI IQ organizes evaluation into four cognitive dimensions. Each dimension uses multiple benchmarks that are averaged together, with missing benchmarks conservatively imputed. The composite IQ requires at least 2 of the 4 dimensions to have data; models with fewer scored dimensions fall back to a manual IQ estimate. The four dimensions and their benchmarks:

D1 · Abstract Reasoning: ARC-AGI-2 (ceil 143), ARC-AGI-1 (compressed 135)
D2 · Mathematical Reasoning: FrontierMath T4 (ceil 155), AIME (compressed 130)
D3 · Programmatic Reasoning: Terminal-Bench 2.0 (ceil 150), SWE-bench (compressed 128), SciCode (compressed 140)
D4 · Academic Reasoning: Humanity's Last Exam (ceil 158), CritPt (ceil 155), GPQA Diamond (compressed 135)

Formulas

Each benchmark raw score \(s\) is converted to an IQ value via piecewise-linear interpolation over that benchmark's anchor points \(\mathbf{A} = [(s_0, a_0),\, (s_1, a_1), \ldots]\):

$$f(s) = a_i + \frac{s - s_i}{s_{i+1} - s_i}\,(a_{i+1} - a_i), \qquad s_i \le s \le s_{i+1}$$

Each dimension averages the IQ values of its benchmarks. Missing benchmarks are conservatively imputed before averaging using a symmetric 3-tier system (see Benchmark-Level Imputation).

The four dimensions:

$$\begin{array}{l l} \mathrm{IQ}_{\text{Abstract}} & = \operatorname{avg}\!\left(f(\text{ARC-AGI-2}),\; f(\text{ARC-AGI-1}^{135})\right) \\[6pt] \mathrm{IQ}_{\text{Math}} & = \operatorname{avg}\!\left(f(\text{FrontierMath T4}),\; f(\text{AIME}^{130})\right) \\[6pt] \mathrm{IQ}_{\text{Programmatic}} & = \operatorname{avg}\!\left(f(\text{Terminal-Bench 2.0}),\; f(\text{SWE-bench}^{128}),\; f(\text{SciCode}^{140})\right) \\[6pt] \mathrm{IQ}_{\text{Academic}} & = \operatorname{avg}\!\left(f(\text{HLE}),\; f(\text{CritPt}),\; f(\text{GPQA}^{135})\right) \end{array}$$

Superscripts denote compressed ceilings. \(f\) is piecewise-linear interpolation over each benchmark's anchor curve.

The composite IQ is the mean of all four dimension scores, requiring at least 2 scored dimensions:

$$\boxed{\;\mathrm{IQ} = \frac{1}{4}\!\left(\mathrm{IQ}_{\text{Abstract}} + \mathrm{IQ}_{\text{Math}} + \mathrm{IQ}_{\text{Programmatic}} + \mathrm{IQ}_{\text{Academic}}\right), \qquad n_{\text{scored}} \ge 2\;}$$

D1: Abstract Reasoning

Abstract reasoning is the ability to solve novel problems without relying on prior knowledge. This is the closest analogue to the "g factor" in human psychometrics — raw problem-solving ability applied to patterns never seen before.

ARC-AGI-2 Hard

Format: Visual grid puzzles (novel patterns)
Tasks: Unique visual pattern completion
Gameability: Essentially Ungameable
IQ Ceiling: 143

Each puzzle requires identifying a novel visual transformation rule from examples and applying it to a new input. The puzzles are unique and cannot be memorized, making this the purest test of abstract reasoning in the benchmark set: no prior knowledge helps, only the ability to infer abstract rules from examples. The benchmark is far from saturation (top models score ~85%), and the curve compresses above IQ 140 to reflect diminishing returns in the superhuman range.

Score (%)   IQ
0           70
20          85
40          95
60          100
75          115
85          125
95          140
100         143

ARC-AGI-1 Compressed · ceil 135

Format: Visual grid puzzles
Gameability: Ungameable (but saturating)
Original Ceiling: 152
Compressed Ceiling: 135

Same format as ARC-AGI-2 but an easier problem set. Top models now score ~96%, so it no longer discriminates at the frontier. The anchor curve is compressed from a ceiling of 152 down to 135 to limit the influence of saturated scores.

Score (%)   IQ
0           78
15          92
30          102
50          111
70          119
85          127
95          132
100         135

D2: Mathematical Reasoning

Mathematical reasoning and quantitative problem-solving — the ability to work with mathematical structures, proofs, and analytical frameworks. The hard benchmarks test novel quantitative reasoning that cannot be memorized from training data.

FrontierMath T4 Hard

Format: Novel research-level math problems
Tier: 4 (research-level)
Gameability: Very Low
IQ Ceiling: 155

Extremely difficult original math problems from Tier 4 of the FrontierMath benchmark. Problems are novel and cannot be found in training data. Top models score ~25%. Currently no T4 data is available for any model — this benchmark is included in the framework for future use. When data exists, it will be averaged with AIME for the D2 score. The curve compresses above IQ 140.

Score (%)   IQ
0           70
5           100
15          120
30          135
50          142
70          148
100         155

AIME Compressed · ceil 130

Format: Integer answers (0–999)
Questions: 15 per exam
Gameability: High
Original Ceiling: 146
Compressed Ceiling: 130

Competition mathematics with integer answers. Old AIME problems are widely available in training data, with studies detecting 10–20 point contamination boosts; top models now score ~98%. The anchor curve is compressed from a ceiling of 146 to 130 to limit the influence of contamination-driven scores.

Score (%)   IQ
0           82
20          95
40          104
60          112
80          120
90          124
100         130

D3: Programmatic Reasoning

Practical engineering ability — the capacity to solve real-world technical problems in code and systems. The hard benchmark tests execution-based tasks that require genuine interaction with systems, while the compressed benchmarks cover real-world software engineering and scientific computing.

Terminal-Bench 2.0 Hard

Format: Docker container tasks (shell commands)
Tasks: 89 practical tasks
Gameability: Low
IQ Ceiling: 150

Models execute shell commands in isolated Docker containers to complete practical system administration and development tasks. The interactive, execution-based format makes memorization ineffective. One of the highest-integrity benchmarks in the set. The curve compresses above IQ 140.

Score (%)   IQ
0           70
10          100
25          115
40          125
55          135
65          140
80          145
100         150

SWE-bench Verified Compressed · ceil 128

Format: Real GitHub issue resolution
Tasks: 500 verified issues
Gameability: Very High
Original Ceiling: 144
Compressed Ceiling: 128

Models generate patches to resolve real GitHub issues and pass unit tests. However, 94% of issues predate model training cutoffs and ~30% have solution leakage. This makes it one of the most gameable benchmarks, resulting in the most aggressive compression — from a ceiling of 144 down to 128.

Score (%)   IQ
0           80
15          92
30          102
50          110
65          117
80          123
100         128

SciCode Compressed · ceil 140

Format: Scientific coding tasks
Gameability: Moderate
Original Ceiling: 158
Compressed Ceiling: 140

Scientific computing tasks requiring domain expertise in physics, chemistry, and biology alongside programming skill. The bottleneck is understanding the science, not the programming. The interdisciplinary nature provides partial protection against memorization from academic literature. Compressed from ceiling 158 to 140 to account for moderate gameability.

Score (%)   IQ
10          78
20          88
30          100
40          108
50          117
60          125
80          135
100         140

D4: Academic Reasoning

Breadth and depth of expert-level knowledge across academic domains. The hard benchmarks test whether a model can answer questions that push the boundaries of human expertise itself, while the compressed benchmark tests graduate-level science knowledge.

Humanity's Last Exam Hard

Format: 76% exact-match, expert-contributed
Questions: 3,000 (expert-sourced, screened against models)
Gameability: Low
IQ Ceiling: 158

Questions contributed by domain experts and explicitly screened to ensure no existing model can answer them at creation time. The benchmark spans the full frontier of human expertise. Current top score is ~48%. The curve compresses significantly above IQ 140 — even though 100% would represent superhuman breadth of knowledge, the IQ ceiling is kept at 158 so that no single benchmark can inflate the composite above ~155.

Score (%)   IQ
0           70
5           95
10          110
15          120
20          130
25          140
35          145
50          150
75          155
100         158

CritPt Hard

Format: Critical-point analysis (novel problems)
Tasks: 20 problems
Gameability: Low
IQ Ceiling: 155

Novel mathematical analysis problems that require identifying critical points and applying analytical reasoning. Problems are original, making memorization ineffective. Current top score is 13/20. Scores are on a 0–20 scale (not percentage). The curve compresses above IQ 140.

Score (0–20)   IQ
0              70
0.6            120
1.6            130
3              135
5              140
8              145
12             150
20             155

GPQA Diamond Compressed · ceil 135

Format: 4-choice multiple choice
Questions: 198 (public set)
Gameability: Moderate-High
Original Ceiling: 148
Compressed Ceiling: 135

Graduate-level science questions written by PhD experts. A 25% score equals random guessing. Domain experts score 65–81%. The public question set is widely available in training data, making contamination a significant concern. The anchor curve is compressed from ceiling 148 to 135.

Score (%)   IQ
25          85
35          98
50          107
65          115
80          123
90          131
100         135

Piecewise-Linear Interpolation

Each benchmark defines a set of anchor points mapping raw scores to IQ values. For scores that fall between two anchors, we use piecewise-linear interpolation:

$$t = \frac{s - s_i}{s_{i+1} - s_i}, \qquad \mathrm{IQ} = a_i + t \cdot (a_{i+1} - a_i)$$

If the score is at or below the lowest anchor, the model receives that anchor's IQ. If at or above the highest, it receives the ceiling IQ. There is no extrapolation beyond the defined range.

This approach avoids assumptions about the distribution shape between anchors. Each segment can have a different slope, allowing the curve to be steeper where small score improvements represent large cognitive leaps (e.g., going from 0% to 5% on HLE) and flatter where additional points reflect diminishing differentiation.
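As a concrete sketch, this rule fits in a few lines of Python. `interp_iq` is an illustrative name, not the production code; the anchor list reproduces the ARC-AGI-2 table above:

```python
def interp_iq(score, anchors):
    """Piecewise-linear interpolation over (raw score, IQ) anchor
    points, clamped at the lowest and highest anchors (no
    extrapolation beyond the defined range)."""
    if score <= anchors[0][0]:
        return float(anchors[0][1])   # floor: lowest anchor's IQ
    if score >= anchors[-1][0]:
        return float(anchors[-1][1])  # ceiling IQ
    for (s0, a0), (s1, a1) in zip(anchors, anchors[1:]):
        if score <= s1:
            t = (score - s0) / (s1 - s0)  # position within the segment
            return a0 + t * (a1 - a0)

# ARC-AGI-2 anchors, taken from the table above:
ARC_AGI_2 = [(0, 70), (20, 85), (40, 95), (60, 100),
             (75, 115), (85, 125), (95, 140), (100, 143)]
```

For example, a score of 70% falls between the (60, 100) and (75, 115) anchors, giving t = 2/3 and an IQ of 110.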

Benchmark Averaging & Compression

Each dimension averages all its benchmarks together, with missing benchmarks conservatively imputed. Rather than separating benchmarks into primary/fallback tiers with a hard cap, we use compressed anchor curves to limit the influence of easier or gameable benchmarks.

How Compression Works

For compressed benchmarks, the anchor curve is rescaled so that IQ values above 100 are proportionally reduced toward a lower ceiling:

$$\mathrm{IQ}_{\text{compressed}} = 100 + (\mathrm{IQ}_{\text{orig}} - 100) \times \frac{C_{\text{new}} - 100}{C_{\text{orig}} - 100}, \qquad \mathrm{IQ}_{\text{orig}} > 100$$

Values at or below IQ 100 are unchanged. This preserves the low end of the curve (where models genuinely struggle) while compressing the high end, where gameable benchmarks over-reward.

Why compress instead of cap? A hard cap (e.g., IQ 115) discards all discrimination above the cap — a model scoring 80% and one scoring 100% on AIME would both receive 115. Compression preserves the rank ordering while reducing the magnitude of the advantage that gameable benchmarks can confer. A perfect AIME score now yields IQ 130 instead of 146, which still contributes meaningfully but cannot dominate the dimension average.
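The compression formula can be sketched directly in code; `compress` is an illustrative helper under the definitions above, not the production implementation:

```python
def compress(iq_orig, c_orig, c_new):
    """Rescale IQ values above 100 toward the lower ceiling c_new;
    values at or below IQ 100 pass through unchanged."""
    if iq_orig <= 100:
        return float(iq_orig)
    return 100 + (iq_orig - 100) * (c_new - 100) / (c_orig - 100)

# A perfect AIME score maps to IQ 146 on the original curve;
# compressing from ceiling 146 to 130 gives
#   100 + 46 * (30 / 46) = 130
```

Note that the rescaling is linear above 100, which is what preserves rank ordering: two models that differ on the original curve still differ, by a proportionally smaller amount, on the compressed curve.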

Benchmark-Level Imputation

When a model is missing benchmark scores, the missing values are filled in before dimension IQs are computed. A symmetric 3-tier imputation system is applied to all 10 benchmarks across all 4 dimensions. For each dimension, the imputation uses only real data from the other 3 dimensions as the predictor (leave-one-dimension-out), preventing circular dependencies.

  1. Tier 1 — Family match: If a weaker family member (same model family, leave-out IQ at least 3 points lower, benchmark distance ≤ 15) has real data for this benchmark, copy its score. The IQ margin ensures we only impute downward — a model never inherits a score from a stronger sibling.
  2. Tier 2 — Grouping regression: If the model’s grouping (e.g., China, OpenAI, Anthropic) has a positive-slope linear regression for this benchmark, and the model’s leave-out IQ falls within the grouping’s training range, predict from the within-grouping regression. The prediction is capped at the global regression to moderate outliers.
  3. Tier 3 — Conservative fallback: Use min(median score, global regression prediction), clamped to [0, 100]. This ensures models without strong cross-dimensional evidence cannot score above the median through imputation.
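The Tier 3 fallback can be sketched as follows. The `slope` and `intercept` parameters are a hypothetical stand-in for the fitted global regression of benchmark score on leave-out IQ, whose actual fitting details are internal to the pipeline:

```python
from statistics import median

def tier3_fallback(real_scores, leave_out_iq, slope, intercept):
    """Conservative fallback: min(median of real scores for this
    benchmark, global regression prediction), clamped to [0, 100].
    slope/intercept model a hypothetical fitted global regression."""
    predicted = slope * leave_out_iq + intercept
    value = min(median(real_scores), predicted)
    return max(0.0, min(100.0, value))
```

Taking the minimum is what makes the tier conservative: a model without strong cross-dimensional evidence can never be imputed above the median.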

Why leave-one-dimension-out? To impute a missing benchmark in dimension Di, we compute each model’s “leave-out IQ” from only the other 3 dimensions’ real data. This prevents imputed values from leaking into the predictor axis — all regressions and family comparisons use only original measurements. Every dimension is treated identically; there is no special ordering or phased imputation.

Why impute downward only (Tier 1)? Models from the same family can have very different capabilities. The −3 IQ margin ensures we only copy scores from a demonstrably weaker relative. For example, gpt-5-mini’s ARC-AGI scores can be used for gpt-oss-120b (since gpt-5-mini has lower leave-out IQ), but o3’s scores cannot — o3 may be substantially better at ARC despite similar overall IQ.

Composite IQ Calculation

After benchmark-level imputation fills in all missing scores, each dimension has a complete set of benchmarks. The composite IQ is always computed over all 4 dimensions.

Step 1: Score All Dimensions

For each dimension, the dimension IQ is computed by averaging all its benchmarks (hard + compressed). Because the 3-tier imputation has already filled in missing benchmarks, every model has scores for all 10 benchmarks and therefore all 4 dimensions.

Step 2: Safety-Net Dimension Imputation

In the rare case that a model has no real or imputed data for an entire dimension (D2–D4), a fallback applies:

$$\mathrm{IQ}_{D_k}^{\text{imputed}} = \min\!\left(\bar{D}_{\text{known}},\; P_{80}(D_k)\right) \qquad k \in \{2,3,4\}$$

This is a safety net that rarely triggers since the benchmark-level 3-tier system fills in missing scores first. D1 is never imputed at the dimension level — if a model has no D1 data even after benchmark imputation, the composite uses the remaining dimensions.
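A minimal sketch of this safety net under the formula above; the nearest-rank percentile picker here is a simple illustrative choice, not necessarily the production method:

```python
def impute_dimension(known_dim_iqs, all_models_dim_iqs):
    """Dimension-level safety net: min(mean of the model's known
    dimension IQs, 80th percentile of this dimension across all
    models). Uses a simple nearest-rank percentile as a stand-in."""
    mean_known = sum(known_dim_iqs) / len(known_dim_iqs)
    ranked = sorted(all_models_dim_iqs)
    p80 = ranked[int(0.8 * (len(ranked) - 1))]
    return min(mean_known, p80)
```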

Step 3: Compute the Composite

$$\mathrm{IQ} = \operatorname{round}\!\left(\frac{1}{N}\sum_{k=1}^{N}\mathrm{IQ}_{D_k}\right)$$

where \(N\) is the number of dimensions with data. With the 3-tier imputation, most models have \(N=4\).
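Putting the steps together, a minimal sketch of the composite; `None` marks a dimension without data, and a `None` result stands in for the manual-estimate fallback:

```python
def composite_iq(dim_iqs):
    """Mean of the scored dimension IQs, rounded to an integer.
    Requires at least 2 scored dimensions; returns None to signal
    the manual-estimate fallback otherwise."""
    scored = [iq for iq in dim_iqs if iq is not None]
    if len(scored) < 2:
        return None
    return round(sum(scored) / len(scored))
```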


Imputation Examples

The following table shows selected models with imputed benchmarks, illustrating how the 3 tiers work across all dimensions:

Model · IQ · Imputed · Tier Breakdown

gpt-5.3-codex · 129 · 5/10 · arcAgi2, arcAgi1, fmT4Acc, swebench from gpt-5.2-pro; aime from gpt-5.2 (Family)
gemini-3-deep-think · 129 · 5/10 · fmT4Acc, critPt, terminalbench, swebench, sciCode from gemini-3-flash (Family)
opus-4.6-nonreasoning · 118 · 5/10 · arcAgi2, arcAgi1, aime, terminalbench, swebench from sonnet-4.5 (Family)
gpt-oss-120b · 107 · 3/10 · arcAgi2, arcAgi1 from gpt-5-mini; fmT4Acc from gpt-5-nano (Family)
glm-4.7 · 112 · 2/10 · arcAgi2, arcAgi1 (Conservative — no weaker family match)
ernie-5.0-thinking-preview · 110 · 5/10 · arcAgi2, arcAgi1 (China regression); fmT4Acc, terminalbench, swebench (Conservative)
kimi-k2.5 · 117 · 1/10 · aime (China regression)
deepseek-r1 · 105 · 3/10 · fmT4Acc, terminalbench, swebench (Conservative)

Rank Status

Each model receives a rank status reflecting the completeness of its evaluation.

Benchmarks Not Included

Three benchmarks that were part of the previous (v1) flat-averaging system have been removed from the composite IQ calculation.

These benchmarks remain in the database and are viewable on the data page — they are simply not included in the composite IQ computation.

EQ Scoring

AI IQ also estimates an Emotional Quotient (EQ) for each model, measuring social and emotional intelligence across 11 sub-dimensions:

Humanlike: How natural and human the responses feel
Safety: Responsible and safe behavior
Assertive: Confidence and directness
Social IQ: Understanding of social dynamics
Warm: Friendliness and approachability
Analytic: Structured emotional reasoning
Insight: Depth of psychological understanding
Empathy: Ability to understand feelings
Compliant: Agreeableness and cooperation
Moralising: Tendency toward moral judgment
Pragmatic: Practical, solution-oriented responses

Each sub-dimension is scored on a 0–10 scale and mapped to an EQ value using shared anchor points:

Raw (0–10)   EQ
0            55
3            70
5            85
6            95
7            105
8            115
9            130
10           145

EQ-Bench Elo (Preferred Source)

When available, we use a model's EQ-Bench 3 Elo rating as the preferred EQ source. EQ-Bench is a dedicated emotional intelligence benchmark that produces Elo ratings reflecting relative emotional understanding:

EQ-Bench Elo   EQ
200            55
600            70
900            85
1100           95
1300           105
1500           115
1700           130
2000           145

When EQ-Bench Elo is not available, the composite EQ is computed as the mean of the 11 sub-dimension EQ scores (minimum 2 required).
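A sketch of this source-preference logic; the helper names are illustrative, and the Elo anchors reproduce the table above:

```python
# Elo → EQ anchors, taken from the table above.
ELO_ANCHORS = [(200, 55), (600, 70), (900, 85), (1100, 95),
               (1300, 105), (1500, 115), (1700, 130), (2000, 145)]

def interp(x, anchors):
    """Piecewise-linear interpolation, clamped at both ends."""
    if x <= anchors[0][0]:
        return float(anchors[0][1])
    if x >= anchors[-1][0]:
        return float(anchors[-1][1])
    for (x0, y0), (x1, y1) in zip(anchors, anchors[1:]):
        if x <= x1:
            return y0 + (x - x0) / (x1 - x0) * (y1 - y0)

def composite_eq(elo, subdim_eqs):
    """Prefer the EQ-Bench Elo mapping when an Elo exists; otherwise
    average the sub-dimension EQ values (at least 2 required)."""
    if elo is not None:
        return interp(elo, ELO_ANCHORS)
    scored = [eq for eq in subdim_eqs if eq is not None]
    if len(scored) < 2:
        return None
    return sum(scored) / len(scored)
```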

Cost & Speed Metrics

Query Assumptions

All cost calculations assume a standard query of 1,000 input tokens and 2,000 output tokens, representing a typical conversational exchange.

$$C_{\text{query}} = 1000 \cdot p_{\text{in}} + 2000 \cdot p_{\text{out}}$$
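In code, the standard-query cost is a one-liner; the prices in the example are hypothetical per-token USD rates, not any particular model's pricing:

```python
def query_cost(p_in, p_out):
    """Cost of the standard query: 1,000 input tokens plus
    2,000 output tokens at per-token prices p_in and p_out (USD)."""
    return 1000 * p_in + 2000 * p_out

# Hypothetical pricing of $3 per million input tokens and $15 per
# million output tokens:
#   1000 * 3e-6 + 2000 * 15e-6 = $0.033 per query
# i.e. $33 per 1,000 queries.
```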

Charts display cost per 1,000 queries on a logarithmic scale to handle the wide price range between models. Response time uses a log scale as well. Both axes are reversed so that the upper-right corner of every chart represents the best outcome: high intelligence at low cost and fast speed.

Limitations & Transparency

Asymptotic Compression Above IQ 140

The anchor point curves intentionally compress above IQ 140. Each additional percentage point on a benchmark contributes less to the IQ score in the superhuman range than in the human range. This reflects three realities:

  1. Human IQ distributions compress at the tails. The difference between IQ 100 and IQ 120 is much more common than the difference between IQ 140 and IQ 160.
  2. Superhuman benchmark scores are driven by breadth, not depth. A model scoring 50% on FrontierMath T4 isn't twice as smart as one scoring 25% — it covers more mathematical branches rather than being fundamentally more capable in any single branch.
  3. Practical discrimination. Without compression, reasoning vs. non-reasoning configurations of the same model produce 20+ point IQ gaps, which are unrealistic. With compression, the gap narrows to ~10–12 points (“smart” vs. “very smart” rather than “above average” vs. “genius”).

The compression ensures that no benchmark can single-handedly produce IQ values above ~155, regardless of raw score. The theoretical ceiling of the composite is approximately 150–155 under current benchmarks.