Advanced AI models face ‘complete accuracy collapse’: Apple


New AI variants called large reasoning models face ‘a complete accuracy collapse’ beyond certain complexities, a new paper published by Apple researchers argues.
Key Takeaways
  • Large Reasoning Models (LRMs), which generate detailed thinking processes before providing answers, face “a complete accuracy collapse beyond certain complexities”, a new paper by Apple researchers posits.
  • The paper, called ‘The Illusion of Thinking’, finds these models, which include OpenAI’s o1 and o3, DeepSeek R1 and Google’s Gemini Flash Thinking, perform worse than Large Language Models (LLMs) on simple puzzles and collapse completely when faced with hard ones.
  • The study also found that LRMs began reducing their reasoning effort as problems approached the collapse point, even as complexity kept increasing.
Key background

Large reasoning models are a new breed of AI system, designed to generate detailed thinking processes before providing answers. By breaking problems down and showing their ‘reasoning’, LRMs are – supposedly – easier to interpret and are claimed to be less prone to errors.

But this latest paper, published by Apple researchers during the company’s Worldwide Developers Conference, shows that recent generations of LRMs, like OpenAI’s o1 and o3 and Google’s Gemini Flash Thinking, have their limits – in fact, they collapse completely when faced with problems ‘beyond certain complexities’.

The paper, titled The Illusion of Thinking, saw researchers test LRMs against puzzles of varying difficulty to determine whether they actually reason or are simply pattern matching like their large language model (LLM) predecessors. It found that LRMs outperformed LLMs on medium-difficulty puzzles, but actually performed worse on simple puzzles and collapsed completely when faced with hard ones.

“Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities,” the paper read. “This indicates LRMs possess limited self-correction capabilities that, while valuable, reveal fundamental inefficiencies and clear scaling limitations.”

The puzzles, which included the Tower of Hanoi (where the model must move a stack of different-sized disks between pegs) and checkers jumping (where the model must swap the positions of coloured tokens), showed that LRMs suffer from an ‘overthinking’ phenomenon. They stopped working altogether when faced with high-complexity tasks.
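
To give a sense of how quickly such puzzles scale, here is a minimal Python sketch (an illustration, not code from the Apple paper) of the classic Tower of Hanoi solution. The optimal answer for n disks takes 2^n − 1 moves, so each extra disk roughly doubles the length of a correct solution trace a model would have to produce.

  # Classic recursive Tower of Hanoi solver (illustrative only, not the
  # paper's evaluation harness). Returns the optimal move list for n disks.
  def hanoi_moves(n, source="A", target="C", spare="B"):
      if n == 0:
          return []
      return (
          hanoi_moves(n - 1, source, spare, target)    # park n-1 disks on the spare peg
          + [(source, target)]                         # move the largest disk
          + hanoi_moves(n - 1, spare, target, source)  # restack the n-1 disks on top
      )

  for n in range(1, 11):
      print(n, "disks ->", len(hanoi_moves(n)), "moves")  # lengths follow 2**n - 1

At ten disks the optimal solution is already 1,023 moves long, which gives a feel for the ‘certain complexities’ at which the researchers report accuracy collapsing.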

The ultimate take? LRMs do not truly reason, and they fail at basic logic once problems get hard enough.

Crucial quote

“Large Reasoning Models have shown emergent behaviors such as discrepancy between thought traces and final answers as well as efficiency concerns through what researchers term the ‘overthinking phenomenon’, where models produce verbose, redundant outputs, even after finding the solution, creating significant inference computational overhead.”

Big picture

The finding cuts against what AI proponents have been telling us about LRMs: that they’re the next frontier of AI models and are supposed to work better than LLMs.

Tangent

Apple unveiled its new iPhone software, iOS 26, at this year’s WWDC, including changes to the Phone app such as call screening and hold assist. There’ll be more to come at the week-long conference.

Anastasia Santoreneos
Forbes Staff