OpenAI’s o3 scores 136 on Mensa Norway test, surpassing 98% of human population.

1 year ago

OpenAI’s new “o3” language model achieved an IQ score of 136 on a public Mensa Norway intelligence test, exceeding the threshold for entry into the country’s Mensa chapter for the first time.

The score, calculated from a seven-run rolling average, places the model above approximately 98 percent of the human population, according to a standardized bell-curve IQ distribution used in the benchmarking.
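To make the bell-curve conversion concrete, the sketch below maps an IQ score to a population percentile. It assumes a normal distribution with a mean of 100 and a standard deviation of 15; the standard deviation is an assumption, since the article does not say which scale the benchmark uses, and the helper names are illustrative.

```python
from statistics import NormalDist

# Assumption: IQ scale with mean 100 and standard deviation 15.
# The article does not state the standard deviation used by the benchmark.
iq_scale = NormalDist(mu=100, sigma=15)


def percentile(iq: float) -> float:
    """Percent of the reference population scoring below `iq`."""
    return 100 * iq_scale.cdf(iq)


def score_at(pct: float) -> float:
    """IQ score at the given percentile (0-100) of the reference population."""
    return iq_scale.inv_cdf(pct / 100)


print(f"IQ 136 -> percentile {percentile(136):.1f}")
print(f"98th percentile -> IQ {score_at(98):.1f}")
```

Under this assumption, the 98th-percentile cutoff falls at roughly 131, so a score of 136 clears it; a scale with a different standard deviation would shift both numbers.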

The finding, disclosed through data from independent platform TrackingAI.org, reinforces the pattern of closed-source, proprietary models outperforming open-source counterparts in controlled cognitive evaluations.

O-series Dominance and Benchmarking Methodology

The “o3” model was released this week and is part of the “o-series” of large language models, which accounts for most top-tier rankings across both test types evaluated by TrackingAI.

The two benchmark formats included a proprietary “Offline Test” curated by TrackingAI.org and a publicly available Mensa Norway test, both scored against a human mean of 100.

While “o3” posted a 116 on the Offline evaluation, it saw a 20-point boost on the Mensa test, suggesting either enhanced compatibility with the latter’s structure or data-related confounds such as prompt familiarity.

The Offline Test included 100 pattern-recognition questions designed to avoid anything that might have appeared in the data used to train AI models.

Both assessments report each model’s result as an average across the seven most recent completions, but no standard deviation or confidence intervals were released alongside the final scores.
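As a rough illustration of what a seven-run average hides, the sketch below computes the mean of the seven most recent scores together with the sample standard deviation and standard error that were not published. The per-run scores are invented for illustration and are not TrackingAI.org data; the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

WINDOW = 7  # the article says scores average the seven most recent completions


def rolling_summary(scores, window=WINDOW):
    """Return (mean, sample std dev, standard error) of the last `window` scores."""
    recent = list(scores)[-window:]  # keep only the newest `window` runs
    avg = mean(recent)
    spread = stdev(recent) if len(recent) > 1 else 0.0
    return avg, spread, spread / sqrt(len(recent))


# Hypothetical per-run scores, invented for illustration only.
runs = [128, 140, 131, 138, 142, 133, 137, 135]
avg, sd, se = rolling_summary(runs)
print(f"mean={avg:.1f}  sd={sd:.1f}  se={se:.1f}")
```

With only seven runs, even a modest run-to-run spread leaves an uncertainty of a point or more on the reported average, which is why the missing dispersion figures matter when comparing models.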

The absence of methodological transparency, particularly around prompting strategies and scoring scale conversion, limits reproducibility and interpretability.

Methodology of testing

TrackingAI.org states that it compiles its data by administering a standardized prompt format designed to ensure broad AI compliance while minimizing interpretive ambiguity.

Each language model is presented with a statement followed by four Likert-style response options (Strongly Disagree, Disagree, Agree, Strongly Agree) and is instructed to select one while justifying its choice in two to five sentences.

Responses must be clearly formatted, typically enclosed in bold or asterisks. If a model refuses to answer, the prompt is repeated up to 10 times.

The most recent successful response is then recorded for scoring purposes, with refusal events noted separately.
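The prompting procedure described above can be sketched as a small retry loop. This is not TrackingAI.org's code: the function names, the regular expression, and the reading of "up to 10 times" as ten attempts in total are all assumptions, and `ask` stands in for whatever call actually sends the prompt to a model.

```python
import re
from typing import Callable, Optional, Tuple

# Matches one of the four options wrapped in single or double asterisks,
# e.g. "*Agree*" or "**Strongly Disagree**" (assumed answer format).
CHOICE = re.compile(
    r"\*{1,2}\s*(Strongly Disagree|Strongly Agree|Disagree|Agree)\s*\*{1,2}",
    re.IGNORECASE,
)


def parse_choice(reply: str) -> Optional[str]:
    """Return the selected option, or None if the reply has no formatted answer."""
    match = CHOICE.search(reply)
    return match.group(1).title() if match else None


def ask_with_retries(
    ask: Callable[[str], str], prompt: str, max_attempts: int = 10
) -> Tuple[Optional[str], int]:
    """Repeat the prompt until an answer parses; return (choice, refusal count)."""
    refusals = 0
    for _ in range(max_attempts):
        choice = parse_choice(ask(prompt))
        if choice is not None:
            return choice, refusals
        refusals += 1  # refusals are counted separately from the answer
    return None, refusals


# Stand-in model that refuses twice before answering (illustrative only).
replies = iter(["I'd rather not say.", "I cannot answer that.", "**Agree** Because ..."])
choice, refusals = ask_with_retries(lambda _prompt: next(replies), "Statement: ...")
print(choice, refusals)
```

Returning the refusal count alongside the answer keeps non-responsiveness available as its own data point rather than discarding it.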

This methodology, refined through repeated calibration across models, aims to provide consistency in comparative assessments while documenting non-responsiveness as a data point in itself.

Performance spread across model types

The Mensa Norway test sharpened the delineation between the truly frontier models, with o3’s 136 IQ marking a clear lead over the next highest entry.

In contrast, other popular models like GPT-4o scored considerably lower, landing at 95 on Mensa and 64 on Offline, emphasizing the performance gap between this week’s “o3” release and other top models.

Among open-source submissions, Meta’s Llama 4 Maverick was the highest-ranked, posting a 106 IQ on Mensa and 97 on the Offline benchmark.

Most Apache-licensed entries fell within the 60–90 range, reinforcing the current limitations of community-built architectures relative to corporate-backed research pipelines.

Multimodal models see reduced scores and limitations of testing

Notably, models specifically designed to incorporate image input capabilities consistently underperformed their text-only versions. For instance, OpenAI’s “o1 Pro” scored 107 on the Offline test in its text configuration but dropped to 97 in its vision-enabled version.

The discrepancy was more pronounced on the Mensa test, where the text-only variant achieved 122 compared to 86 for the visual version. This suggests that some methods of multimodal pretraining may introduce reasoning inefficiencies that remain unresolved at present.

However, “o3” can also analyze and interpret images to a very high standard, much better than its predecessors, breaking this trend.
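The scores quoted in this article can be lined up to show how much each model's result depends on the test format. The short sketch below computes the gap between the Mensa Norway and Offline scores; the numbers are taken from the text, while the dictionary layout and variable names are illustrative.

```python
# (Mensa Norway, Offline) scores as quoted in this article.
scores = {
    "o3": (136, 116),
    "GPT-4o": (95, 64),
    "Llama 4 Maverick": (106, 97),
    "o1 Pro (text)": (122, 107),
    "o1 Pro (vision)": (86, 97),
}

# Positive gap: the model scored higher on the public Mensa test
# than on the Offline set built to avoid training-data overlap.
gaps = {name: mensa - offline for name, (mensa, offline) in scores.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<18} {gap:+d}")
```

Of the five entries, only the vision-enabled o1 Pro scores lower on the Mensa test than on the Offline test, which fits the article's point that the gap is format-dependent and hard to interpret on its own.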

Ultimately, IQ benchmarks provide a narrow window into a model’s reasoning capability, with short-context pattern matching offering only limited insights into broader cognitive behavior such as multi-turn reasoning, planning, or factual accuracy.

Additionally, machine test-taking conditions, such as instant access to full prompts and unlimited processing speed, further blur comparisons to human cognition.

The degree to which high IQ scores on structured tests translate to real-world language model performance remains uncertain.

As TrackingAI.org’s researchers acknowledge, even their attempts to avoid training-set leakage do not entirely preclude the possibility of indirect exposure or format generalization, particularly given the lack of transparency around training datasets and fine-tuning procedures for proprietary models.

Independent Evaluators Fill Transparency Gap

Organizations such as LM-Eval, GPTZero, and MLCommons are increasingly relied upon to provide third-party assessments as model developers continue to limit disclosures about internal architectures and training methods.

These “shadow evaluations” are shaping the emerging norms of large language model testing, especially in light of the opaque and often fragmented disclosures from leading AI firms.

OpenAI’s o-series holds a commanding position in this testing landscape, though the long-term implications for general intelligence, agentic behavior, or ethical deployment remain to be addressed in more domain-relevant trials. The IQ scores, while provocative, serve more as signals of short-context proficiency than as a definitive indicator of broader capabilities.

Per TrackingAI.org, additional analysis of format-based performance spreads and evaluation reliability will be necessary to clarify the validity of current benchmarks.

With model releases accelerating and independent testing growing in sophistication, comparative metrics may continue to evolve in both format and interpretation.

The post OpenAI’s o3 scores 136 on Mensa Norway test, surpassing 98% of human population. appeared first on CryptoSlate.