A Look at the "Illusion of Thinking"
In the relentless hype cycle of artificial intelligence, we're often told that we're on a fast track to true Artificial General Intelligence (AGI). But what if the engines driving us there aren't as powerful as they seem? In a fascinating turn, researchers from Apple have published two papers that serve as a crucial reality check on the current state of AI.

These studies have sparked intense debate by suggesting that the impressive capabilities of our models might be more of a sophisticated illusion than genuine intelligence. Let's dive into what they found and why it matters.
GSM-Symbolic: When AI Fails at Basic Math
The first major challenge came in a 2024 paper titled "GSM-Symbolic." Led by researchers Iman Mirzadeh and Mehrdad Farajtabar, the team created a new benchmark to test how well Large Language Models (LLMs) handle mathematical reasoning. Instead of just testing if a model could get the right answer, they tested how robust its reasoning was.
The findings were revealing:
- Fragile Logic: The models' performance dropped significantly when the researchers changed only the numbers in a word problem while keeping the underlying mathematical logic identical. A model that could solve the "2+3" version of a story might fail on the "4+5" version of the very same story (a sketch of this kind of variant generation appears after this list).
- Easily Distracted: When a single, seemingly relevant but ultimately useless piece of information was added to a problem, the performance of all leading AI models plummeted—in some cases by as much as 65%.
- The Core Conclusion: The study strongly suggested that these models aren't performing true logical reasoning. Instead, they are engaging in highly advanced pattern matching, essentially looking for familiar problem structures from their training data to find a solution.
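To make the methodology concrete, here is a minimal sketch of the GSM-Symbolic idea. The template, names, and number ranges are invented for illustration and are not the paper's actual data; the point is that the story and logic stay frozen while the numbers are re-sampled, and an irrelevant "no-op" clause can optionally be appended. A model that truly reasons should be indifferent to both changes.

```python
import random

# Illustrative template: the narrative and the underlying logic never change,
# only the numbers (and an optional irrelevant clause) do.
TEMPLATE = (
    "Liam picked {a} apples in the morning and {b} apples in the afternoon. "
    "{noop}How many apples did Liam pick in total?"
)
NOOP = "Five of the apples were slightly smaller than the rest. "  # irrelevant detail

def make_variant(with_noop: bool = False) -> tuple[str, int]:
    """Return (question, ground-truth answer) for one randomized instance."""
    a, b = random.randint(2, 90), random.randint(2, 90)
    question = TEMPLATE.format(a=a, b=b, noop=NOOP if with_noop else "")
    return question, a + b  # the correct answer follows from the fixed logic

if __name__ == "__main__":
    for flag in (False, True):
        q, answer = make_variant(with_noop=flag)
        print(q, "->", answer)
```

Because the ground-truth answer is recomputed from the template itself, any number of fresh variants can be generated and scored automatically, which is what makes it possible to measure how much a model's accuracy wobbles when nothing but the surface details change.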
This was the first hint that something was amiss under the hood. But it was the follow-up study that truly shook the conversation.
"The Illusion of Thinking": AI Hits a Wall
In June 2025, a paper titled "The Illusion of Thinking," spearheaded by Parshin Shojaee and Iman Mirzadeh, took this investigation a step further. The team tested so-called "Large Reasoning Models" (LRMs), models designed specifically for complex problem-solving, against a set of classic logic puzzles with adjustable difficulty (a short sketch of what that means follows the list), including:
- Towers of Hanoi
- The River Crossing Problem
- Checker Jumping
- Blocks World
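Towers of Hanoi is a useful illustration of what "adjustable difficulty" means here: the puzzle has a single complexity knob (the number of disks), and the optimal solution length grows as 2^n − 1, so each extra disk doubles how long a correct answer must be. The sketch below is a generic reference solver, not the paper's evaluation harness.

```python
# Towers of Hanoi: each extra disk doubles the optimal solution length
# (2**n - 1 moves), so the puzzle's compositional depth can be dialed up precisely.
def hanoi_moves(n: int, src: str = "A", aux: str = "B", dst: str = "C") -> list[tuple[str, str]]:
    """Return the optimal move sequence for n disks as (from_peg, to_peg) pairs."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, src, dst, aux)    # park the top n-1 disks on the auxiliary peg
        + [(src, dst)]                       # move the largest disk to the destination
        + hanoi_moves(n - 1, aux, src, dst)  # re-stack the n-1 disks on top of it
    )

if __name__ == "__main__":
    for n in (3, 7, 10, 15):
        moves = hanoi_moves(n)
        assert len(moves) == 2**n - 1
        print(f"{n} disks -> {len(moves)} moves in the optimal solution")
```

With a reference solver like this, a model's proposed move list can be checked step by step against the rules of the puzzle, which is roughly how accuracy on problems of escalating size can be scored.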
The results were nothing short of stunning.
- The "Accuracy Cliff": The models performed well on simpler versions of the puzzles. But as the complexity was dialed up, their performance didn't gracefully decline; it fell off a cliff, dropping dramatically to zero accuracy.
- Paradoxical Scaling: Even more bizarrely, when faced with harder problems, the models often used fewer computational steps (or "thinking tokens"). It was as if, upon recognizing a challenge beyond its capabilities, the AI simply "gave up" rather than trying harder.
- Three Performance Regimes: The researchers identified three distinct zones. At low complexity, standard LLMs sometimes did better. At medium complexity, the specialized LRMs had an edge. But at high complexity, every single model failed completely.
The researchers' conclusion was blunt and powerful: these models create "the illusion of formal reasoning" but are actually performing a brittle form of pattern matching that can be broken by something as simple as changing a name in a puzzle.
The Debate and Apple's Motivation
Naturally, these findings didn't go unchallenged. The scientific community engaged in a vigorous debate. Some critics, like Alex Lawsen in a response titled "The Illusion of the Illusion of Thinking," argued that flaws in the experimental setup—such as using unsolvable versions of the River Crossing problem or token limits that forced models to quit—were to blame, not a fundamental flaw in the models themselves.
This scientific back-and-forth is healthy and necessary. But it's also worth considering the context. Apple has been playing catch-up in the AI race. While its competitors have soared on the AI boom, Apple has proceeded more cautiously. Publishing research that highlights the fundamental weaknesses of the current dominant approach could be a strategic move to reshape the narrative, arguing that a slower, more deliberate path is wiser than the current "scale is all you need" philosophy.
What This Means for the Future of AI
The implications of Apple's research are profound and force us to confront uncomfortable questions:
- Is Real Reasoning Possible? Are current LLM architectures fundamentally incapable of achieving true, generalized reasoning, no matter how large they become?
- The End of Scaling Laws? This research casts doubt on the prevailing "scaling law"—the idea that simply adding more data and more computing power will inevitably lead to greater intelligence.
- A Call for Innovation: If current methods have a hard ceiling, then achieving AGI may require entirely new architectural innovations beyond the transformer models that power today's AI.
Apple’s research doesn't claim that AI is useless; its power as a tool is undeniable. However, it provides a sobering and evidence-based counter-narrative to the relentless hype. It suggests that the path to truly intelligent machines may not be a straight climb upward, and that getting there may require going back to the drawing board.