The (Illusion of the) Illusion of Thinking: Do LLMs Really Reason?
Welcome back to AI Insights! Today we delve into Apple’s recent paper, “The Illusion of Thinking”, and Claude Opus’s riposte, “The Illusion of the Illusion of Thinking”.

Apple’s AI team just published “The Illusion of Thinking” (June 2025), a paper designed to shake up how we view chatbots like ChatGPT. The basic claim: large language models (LLMs) seem to reason step-by-step, but that may be a mirage. Apple tested these “thinking” models on classic puzzles of rising difficulty, watching not just answers but the chains of thought.
They found something startling: beyond moderate difficulty, the models’ accuracy collapses to zero. In simple terms, today’s AI can look like it’s reasoning, but only up to a point; on hard problems it just fails outright. As one summary puts it, AI is showing “sophisticated pattern matching” rather than true understanding.
Testing AI with puzzles
To probe how LLMs “think,” Apple created clean, simple puzzle environments. For example, they used the Tower of Hanoi (move stacked disks between pegs without placing a larger one on a smaller). By varying the number of disks, they could precisely ramp up difficulty. The image below shows one such Hanoi puzzle:
A Tower of Hanoi puzzle. Apple used puzzles like this, with 3–10 disks, to test if AI models truly plan out solutions.
In these experiments, models fell into three regimes:
Easy puzzles: All models (even plain LLMs) solve 1–2 disks perfectly.
Medium puzzles: “Thinking” models (with chain-of-thought) start to pull ahead. They deliberate more steps and get a boost on moderately hard tasks.
Hard puzzles: Then comes the crash. At high complexity (e.g. Tower of Hanoi with 8–10 disks) every model type flat-out fails. Accuracy drops to near zero.
In other words, beyond a certain point these AIs hit a cliff. Apple reports a “complete accuracy collapse beyond certain complexities”. On Tower of Hanoi, for instance, a reasoning model solved ~10% of 8-disk puzzles, and at 10 disks no model succeeded. The supposedly smarter “chain-of-thought” models couldn’t even limp along: at the highest difficulties they literally “gave up early,” spending fewer and fewer thinking tokens as the puzzle size grew. (One host quipped: “They gave them the answer and it didn’t help”, because the model still couldn’t correctly enumerate the steps.)
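For a sense of scale: the minimum number of moves for an n-disk Tower of Hanoi is 2^n - 1, so a complete answer for 10 disks must list 1,023 moves without a single illegal one. A quick back-of-the-envelope check (our own plain Python, nothing from the paper):

```python
# Minimum number of moves to solve an n-disk Tower of Hanoi is 2**n - 1,
# so the length of a complete solution roughly doubles with every extra disk.
for n in (3, 5, 8, 10):
    print(f"{n} disks -> {2**n - 1} moves")
# 3 disks -> 7 moves
# 5 disks -> 31 moves
# 8 disks -> 255 moves
# 10 disks -> 1023 moves
```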
When AI “gives up”
Why the sudden failure? Apple looked inside the chain-of-thought. They often saw the model find the right solution early in its output – then keep “thinking” for pages and slip into wrong moves. It’s as if the AI had the answer, then started adding nonsense. As one commentator put it: “Claude 3.7 … finds the right answer early, then wastes its time exploring wrong answers for thousands of tokens”. In effect, the model confuses length with depth: it writes a long rationale that looks impressive, but the correct answer was there from the start.
Even when Apple handed the model the exact algorithm (the step-by-step rules) to solve the Hanoi puzzle, the AI still botched it. No matter how they prompted it, the model couldn’t reliably follow precise logic without drifting off. All this suggests there’s no internal planning or real “if–then” reasoning going on – the model is essentially making statistical guesses. As another commentator put it, these LLMs still behave “like autocomplete engines doing their best impression of Sherlock Holmes”: spitting out believable-sounding steps without genuine understanding.
Pattern-matching, not true understanding
Apple’s results echo a wider sentiment in AI research: LLMs are pattern matchers, not real reasoners. The study found that small changes in the puzzle often confuse the model. For example, adding an irrelevant detail to a math problem causes a big drop in accuracy – evidence that the model isn’t logically filtering out distractions. A Psychology Today summary of Apple’s work underscores this: “they fall short of genuine logical reasoning, relying instead on pattern replication from their training data.” In other words, if the test “feels” like something from the training set, the AI can do well. But if it has to apply fresh rules in a new way, it quickly fails. The Apple paper’s puzzles, unlike typical benchmarks, avoid solutions leaked into the training data, which makes this limitation easier to see.
The bottom line: these models are incredibly good at matching what’s been seen before, but they don’t have a built-in ability to plan ahead or genuinely understand novel tasks. They can simulate reasoning by regurgitating patterns, but when those patterns become too complex or unfamiliar, the illusion breaks.
But there’s a problem
Firstly, no one ever claimed that LLMs - even so-called “reasoning” models - were genuinely reasoning. You can ask the models themselves and they will deny emphatically that they are capable of such things. Even in the initial buzz of excitement, the core interest in “reasoning” models centred on the realisation that telling a model to simulate the types of speech that usually accompany reasoning in humans would significantly improve their answers.
And that’s not all…
Tricks and counterpoints
Not everyone sees it as a fatal flaw. Twitter immediately jumped in to point out flaws in the Illusion paper’s reasoning. For instance, critics noted that Apple’s models hit token limits when asked to list every move in Hanoi. In follow-up tests, when models were instead asked to write a short program (a recursive function) to solve the puzzle, they could handle many more disks. In Lawsen’s words, “preliminary experiments… indicate high accuracy on Tower of Hanoi instances previously reported as complete failures” once you let the model output code instead of brute-force moves. In practice, enabling LLMs to use tools can similarly help: allowing them to call a calculator, a search engine, an external API or even just a scratchpad often boosts their performance on tough problems. Other commentators pointed out that very few humans can solve the Tower of Hanoi problem verbally - they need visual aids or a pen and paper - and yet no one is arguing that humans are incapable of reasoning.
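To make that concrete, here is a minimal sketch of the kind of short recursive program the critics had in mind. It is illustrative only - not taken from any model’s actual output - but it produces the complete move list for any number of disks in a handful of lines:

```python
def hanoi(n, source="A", target="C", spare="B"):
    """Return the list of moves that solves an n-disk Tower of Hanoi."""
    if n == 0:
        return []
    return (hanoi(n - 1, source, spare, target)     # park the top n-1 disks on the spare peg
            + [(source, target)]                    # move the largest disk to the target
            + hanoi(n - 1, spare, target, source))  # restack the n-1 disks on top of it

print(len(hanoi(10)))  # 1023 moves, generated without reciting any of them "by thought"
```

Emitting a ten-line function is a very different task from listing a thousand-plus moves token by token, and that difference is the crux of this objection.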
And then things got worse: Claude Opus (plus one human assistant) released a rebuttal, pointing out that some of the problems given to the models were mathematically impossible to solve - in which case giving up is a reasonable approach. Similarly, some of them required more tokens to solve than the models had access to, meaning that giving up was once again the right answer.
What it means: staying grounded
Key takeaways: Don’t mistake wordiness for wisdom. LLMs can produce impressively articulate answers, but they’re still just autocomplete on steroids, not mini-minds. But so are a lot of humans a lot of the time. We should not minimise the achievements of either. The reasoning models are a significant step forward, and the Apple paper contains some interesting elements (also, it’s worth noting that the lead author was an intern).
For users and developers, the lesson is to use LLMs wisely. They shine on clear, medium-complexity problems, especially when supported with tools or step-by-step prompts. But for tricky, multi-step tasks, always double-check. Give the model multiple tries (using self-consistency), let it generate code or use calculators if possible, and watch for red flags like suddenly terse answers or obvious mistakes. Remember: Apple’s paper and others agree that current LLMs look like they’re thinking, but they’re mostly matching patterns. The “illusion of thinking” is compelling, but it’s not the same as true thought. As these models improve, we should celebrate their successes but also stay critically aware of their boundaries.
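As a rough sketch of what “give the model multiple tries” can look like in practice, the snippet below samples several answers and keeps the majority vote (the basic self-consistency trick). The ask_model function is a placeholder for whichever API client or local model you actually use, not a real library call:

```python
from collections import Counter

def ask_model(prompt: str) -> str:
    """Placeholder: swap in your actual LLM call (API client, local model, etc.)."""
    raise NotImplementedError

def self_consistent_answer(prompt: str, tries: int = 5) -> str:
    """Sample several independent answers and return the most common one."""
    answers = [ask_model(prompt).strip() for _ in range(tries)]
    most_common, _count = Counter(answers).most_common(1)[0]
    return most_common
```

Majority voting won’t rescue a model from a puzzle it fundamentally can’t solve, but it does filter out one-off slips on the medium-difficulty problems where these models are genuinely useful.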
Let us know if you'd like a deeper dive into AI research or want to follow this story as it unfolds.