OpenAI's o3 Model Excels in AI Reasoning Test – Yet Falls Short of AGI

20 Dec 2024

Impressive advancements in artificial intelligence reasoning highlight progress, but true general intelligence remains a distant goal.

In 2019, François Chollet introduced the Abstraction and Reasoning Corpus (ARC) Challenge to evaluate how effectively artificial intelligence (AI) systems could identify the patterns connecting pairs of colored grids. These visual puzzles were designed to assess general intelligence and basic reasoning abilities. To prevent systems from brute-forcing the puzzles with excessive computational resources rather than reasoning about them, the competition enforces strict limits on computing power for official score submissions.
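To make the puzzle format concrete, the sketch below models an ARC-style task in Python: a handful of demonstration (train) input-output grid pairs, plus a test input the solver must complete. Grids are 2-D arrays of integers, with each integer encoding a color. The toy task and the candidate rule (a horizontal flip) are illustrative examples, not drawn from the actual ARC corpus.

```python
# ARC-style task: infer the transformation from the train pairs,
# then apply it to the test input. Grids are lists of rows of
# small integers, where each integer stands for a color.
# This toy task is hypothetical, not a real ARC puzzle.

task = {
    "train": [
        {"input": [[1, 0], [0, 2]], "output": [[0, 1], [2, 0]]},
        {"input": [[3, 3, 0], [0, 4, 0]], "output": [[0, 3, 3], [0, 4, 0]]},
    ],
    "test": [
        {"input": [[5, 0, 0], [0, 0, 6]]},
    ],
}

def flip_horizontal(grid):
    """Candidate rule for this toy task: mirror each row left-to-right."""
    return [row[::-1] for row in grid]

# A solver must find a rule consistent with every demonstration pair...
assert all(flip_horizontal(p["input"]) == p["output"] for p in task["train"])

# ...and then apply that rule to the unseen test input.
prediction = flip_horizontal(task["test"][0]["input"])
print(prediction)  # [[0, 0, 5], [6, 0, 0]]
```

The difficulty in real ARC tasks is that the transformation differs from task to task and must be inferred from just a few examples, which is why the benchmark is pitched as a test of general reasoning rather than memorization.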

OpenAI recently announced its o3 model, set for release in early 2025, which achieved a breakthrough score of 75.7% on the ARC Challenge’s “semi-private” test, used for public leaderboard rankings. This was accomplished at an average cost of $20 per visual puzzle, adhering to the competition’s total budget cap of $10,000. However, the more challenging “private” test—which determines grand prize eligibility—imposes a stricter computational budget, equivalent to just $0.10 per puzzle, a target OpenAI did not meet.

The o3 model achieved an unofficial score of 87.5% by using approximately 172 times more computational power than it did for its official submission. For comparison, the average human score on the ARC Challenge is 84%, and a score of 85%, combined with meeting the computational cost limits, would secure the $600,000 grand prize. Despite its impressive performance, the o3 model still failed to solve more than 100 of the visual puzzles even with the significantly increased computational resources, according to Mike Knoop, an ARC Challenge organizer at Zapier, in a post on social media platform X.
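The figures reported above imply a striking cost gap between the two runs and the grand-prize track. The back-of-the-envelope calculation below uses only the numbers from the article and assumes, as a simplification, that per-puzzle cost scales roughly linearly with compute; OpenAI has not published exact figures for the high-compute run, so the $3,440 estimate is illustrative.

```python
# Rough cost comparison from the reported figures.
# Assumption (not confirmed by OpenAI): cost scales linearly with compute.
official_cost_per_puzzle = 20.0   # USD, reported average for the 75.7% run
compute_multiplier = 172          # high-compute run vs. official submission
private_test_cap = 0.10           # USD per puzzle on the grand-prize track

# Estimated cost per puzzle for the unofficial 87.5% run.
high_compute_cost = official_cost_per_puzzle * compute_multiplier
print(f"~${high_compute_cost:,.0f} per puzzle")  # ~$3,440 per puzzle

# Even the official run exceeds the grand-prize budget by a wide margin.
overshoot = official_cost_per_puzzle / private_test_cap
print(f"{overshoot:.0f}x over the $0.10-per-puzzle cap")  # 200x
```

In other words, the official submission already spends about 200 times the grand-prize budget per puzzle, and the high-compute run widens that gap by a further factor of 172.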

Critics have raised concerns about the reliance on brute force in achieving high scores. Melanie Mitchell of the Santa Fe Institute commented on Bluesky that using brute computational power undermines the ARC Challenge’s original intent. François Chollet himself noted in a post on X that while the o3 model is an impressive milestone, it still struggles with several relatively simple tasks. He emphasized that human-level intelligence will only be evident once it becomes impossible to create tasks that are easy for humans but challenging for AI.

Thomas Dietterich at Oregon State University offered another perspective on recognizing artificial general intelligence (AGI). He proposed that AGI would require systems to incorporate essential cognitive functions, such as episodic memory, planning, logical reasoning, and meta-cognition—capabilities that current AI models, including the o3, still lack.
