> LLMs are PhD-level reasoners in math and science, yet they fail at children's puzzles. How is this possible?
Because they are not.
Pattern matching questions on a contrived test is not the same thing as understanding or reasoning.
It’s the same reason why most of the people who pass your leetcode tests don’t actually know how to build anything real. They are taught to the test not taught to reality.
> Pattern matching questions on a contrived test is not the same thing as understanding or reasoning.
Do submarines swim? I don't really care if it gets me where I want to go. The fact is that just two days ago, I asked Claude to look at some reasonably complicated concurrent code to which I had added a new feature, and asked it to list what tests needed to be added; and then when I asked GPT-5 to add them, it one-shot nailed the implementations. I've written a gist of it here:
https://gitlab.com/-/snippets/4889253
Seriously just even read the description of the test it's trying to write.
In order to one-shot that code, it had to understand:
- How the cache was supposed to work
- How conceptually to set up the scenario described
- How to assemble golang's concurrency primitives (channels, goroutines, and waitgroups), in the correct order, to achieve the goal.
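For a rough sense of what that involves, here is a minimal sketch of such a test, with a toy cache standing in for the actual package from the gist (all names and structure here are illustrative only):

    // Minimal sketch: many goroutines hit a toy cache at once, coordinated
    // with a channel as a starting gate and a WaitGroup to join them.
    // The cache below is a stand-in, not the code from the linked gist.
    package cache_test

    import (
        "sync"
        "testing"
    )

    type toyCache struct {
        mu    sync.Mutex
        fills int
        data  map[string]string
    }

    func (c *toyCache) Get(key string, fill func() string) string {
        c.mu.Lock()
        defer c.mu.Unlock()
        if v, ok := c.data[key]; ok {
            return v
        }
        c.fills++
        v := fill()
        c.data[key] = v
        return v
    }

    func TestConcurrentGet(t *testing.T) {
        c := &toyCache{data: map[string]string{}}
        start := make(chan struct{}) // starting gate so all goroutines race together
        var wg sync.WaitGroup

        for i := 0; i < 50; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                <-start // block until the gate opens
                if got := c.Get("k", func() string { return "v" }); got != "v" {
                    t.Errorf("Get = %q, want %q", got, "v")
                }
            }()
        }

        close(start) // open the gate: all goroutines call Get at once
        wg.Wait()

        // With correct locking, concurrent callers should not cause extra fills.
        if c.fills != 1 {
            t.Errorf("fills = %d, want 1", c.fills)
        }
    }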
Did it have a library of concurrency testing patterns in its head? Probably -- so do I. Had it ever seen my exact package before in its training? Never.
I just don't see how you can argue with a straight face that this is "pattern matching". If that's pattern matching, then pattern matching is not an insult.
If anything, the examples in this article are the opposite. Take the second example, which is basically 'assemble these assorted pieces into a rectangle'. Nearly every adult has assembled a minimum of dozens of things in their lives; many have assembled thousands of things. So it's humans in this case who are simply "pattern matching questions on a contrived test", and the LLMs, which almost certainly didn't have a lot of "assemble these items" in their training data, that are reasoning out what's going on from first principles.
It doesn't matter HOW LLMs "swim" as long as they can, but the point being raised is whether they actually can.
It's as if LLMs can swim in the ocean, in rough surf, but fail to swim in rivers or swimming pools, because they don't have a generalized ability to swim - they've just been RL-trained on the solution steps to swimming in surf, but since those exact conditions don't exist in a river (which might seem like a less challenging environment), they fail there.
So, the question that might be asked is when LLMs are trained to perform well in these vertical domains like math and programming, where it's easy to verify results and provide outcome- or process-based RL rewards, are they really learning to reason, or are they just learning to pattern match to steer generation in the direction of problem-specific reasoning steps that they had been trained on?
Does the LLM have the capability to reason/swim, or is it really just an expert system that has been given the rules to reason/swim in certain cases, but would need to be similarly hand fed the reasoning steps to be successful in other cases?
I think the answer is pretty obvious given that LLMs can't learn at runtime - can't try out some reasoning generalization they may have arrived at, find that it doesn't work in a specific case, then explore the problem and figure it out for next time.
Given that it's Demis Hassabis who is pointing out this deficiency of LLMs (and has a 5-10 year plan/timeline to fix it - AGI), not some ill-informed LLM critic, it seems silly to deny it.
> It's as if LLMs can swim in the ocean, in rough surf, but fail to swim in rivers or swimming pools
Just like submarines!
What? Submarines can definitely “swim” in rivers, although shallow water is certainly more challenging for a submerged vessel. Most submarines are a bit big for most swimming pools, but small ones like ROVs are frequently tested in pools.
> I think the answer is pretty obvious given that LLMs can't learn at runtime - can't try out some reasoning generalization they may have arrived at, find that it doesn't work in a specific case, then explore the problem and figure it out for next time.
This is just a problem of memory. Supposing that an LLM did generate a genuinely novel insight, it could in theory write a note for itself so that the next time it comes online, it can read through a summary of the things it has learned. It could also write synthetic training data for itself so that the next time it's trained, that gets incorporated into its general knowledge.
OpenAI allows you to fine-tune GPT models, I believe. You could imagine a GPT system working for 8 hours in a day, then spending a bunch of time looking over all its conversations for patterns or insights or things to learn, and then modifying its own fine-tuning data (adding, removing, or modifying as appropriate), which it then uses to train itself overnight, waking up the next morning having synthesized the previous day's experience.
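A hedged sketch of the shape of that loop; every name here (reflect, fineTune, the Example type) is hypothetical, and no particular provider API is implied:

    // Sketch of the overnight self-distillation cycle described above:
    // work all day, mine the transcripts for lessons, turn them into
    // training examples, fine-tune, and wake up with them baked in.
    package selftrain

    type Example struct {
        Prompt, Completion string
    }

    // reflect mines one transcript for a reusable lesson (hypothetical;
    // in practice this would itself be an LLM call).
    func reflect(transcript string) (Example, bool) {
        return Example{}, false
    }

    // fineTune submits the dataset to some fine-tuning endpoint
    // (hypothetical wrapper, not a real API).
    func fineTune(dataset []Example) {}

    func nightlyUpdate(transcripts []string, dataset []Example) []Example {
        for _, t := range transcripts {
            if ex, ok := reflect(t); ok {
                dataset = append(dataset, ex) // add/remove/modify as appropriate
            }
        }
        fineTune(dataset) // train overnight on the curated data
        return dataset
    }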
How does memory (maybe later incorporated via fine tuning) help if you can't figure out how to do something in the first place?
That would be a way to incorporate new declarative data at "runtime" - feedback to the AI intern as to what it is doing wrong. However, doing something effectively by yourself generally requires more than just new knowledge - it requires personal practice/experimentation etc, since you need to learn how to act based on the contents of your own mind, not that of the instructor.
Even when you've had enough practice to become proficient at a taught skill, you may not be able to verbalize exactly what you are doing (which is part of the teacher-student gap), so attempting to describe then capture that as textual/context "sensory input" is not always going to work.
> are they really learning to reason, or are they just learning to pattern match to steer generation in the direction of problem-specific reasoning steps that they had been trained on?
Are you sure there's a real difference? Do you have a definition of "reasoning" that excludes this?
It's trivial to demonstrate that LLMs are pattern matching rather than reasoning. A good way is to provide modified riddles-that-aren't. As an example:
> Prompt: A man working at some white collar job gets an interview scheduled with an MBA candidate. The man says "I can't interview this candidate, he's my son." How is this possible?
> ChatGPT: Because the interviewer is the candidate’s mother. (The riddle plays on the assumption that the interviewer must be a man.)
This is clearly pattern matching and overfitting to the "doctor riddle" and a good demonstration of how there's no actual reasoning going on. A human would read the prompt and initially demonstrate confusion, which LLMs don't demonstrate because they don't actually reason.
Overfitting isn't evidence of non-reasoning, but that aside, what's interesting is that ChatGPT (free) trips on this, as did older models. But GPT-5 Thinking, Opus 4, and Gemini 2.5 Pro all pointed out that there is no trick and it's likely the man just views it as a conflict of interest to interview his son.
It's hard to say whether this has been trained out (it's an old example) or if it's just another hurdle that general model progression has overcome.
OK. But, in Claude Sonnet 4:
'This is possible because the man is the candidate's father. When he says "he's my son," he's simply stating their family relationship.
The scenario doesn't present any logical contradiction - a father could very well be in a position where he's supposed to interview his own son for a job. This would create a conflict of interest, which is why he's saying he can't conduct the interview. It would be inappropriate and unfair for a parent to interview their own child for a position, so he would need to recuse himself and have someone else handle the interview.
The phrasing might initially seem like it's setting up a riddle, but it's actually a straightforward situation about professional ethics and avoiding conflicts of interest in hiring.'
We kinda move from the situation “LLM can only do what it has seen before” to “LLM can do something by composing several things it has seen before”. We didn’t get to the situation “LLM can do things it has not seen before”.
The practicality of the situation is that a lot of problems fall into the second bucket. We all like to think we deal with novel problems, but most of what we can think of was already considered by another human and captured by the LLM. You had to invent something deliberately unique, and that’s telling. Most startup ideas are invented more than once, for example.
The key shortcoming of the LLM is that it is not aware of its own limits. If it ever becomes aware, it can outsource such rare things to Mechanical Turk.
People make the same sort of mistakes.
Please explain how this is relevant to the topic at hand. Thanks!
You claim that AI is pattern matching instead of reasoning, but the psychological literature is clear that people reason by pattern matching, as evidenced by the fact that people tend to make the same sorts of mistakes when reasoning quickly.
Ask someone who has made such a mistake to think a little more on it, and they’ll notice their error. Ask a reasoning model to do literally the same thing, to “think” on it, and it will also notice its error.
If you still insist that AIs are not reasoning here, then neither are people.
> It's trivial to demonstrate that LLMs are pattern matching rather than reasoning.
Again, this is just asserting the premise that reasoning cannot include pattern matching, but this has never been justified. What is your definition for "reasoning"?
> This is clearly pattern matching and overfitting to the "doctor riddle" and a good demonstration of how there's no actual reasoning going on.
Not really, no. "Bad reasoning" does not entail "no reasoning". Your conclusion is simply too strong for the evidence available, which is why I'm asking for a rigorous definition of reasoning that doesn't leave room for disagreement about whether pattern matching counts.
If your assertion is that you can't prove reasoning isn't just pattern matching, then I counter by saying you can't prove reasoning isn't just chaining a large number of IF/THEN/ELSE logic statements and therefore computers have been generally intelligent since ~1960.
The difference between ML models and computers since the 1960s is that the ML models weren't programmed with predicates, they "learned" them from analyzing data, and can continue to learn in various ways from further data. That's a meaningful difference, and why the former may qualify as intelligent and the latter cannot.
But I agree in principle that LLMs can be distilled into large IF/THEN/ELSE trees, that's the lesson of BitNet 1-bit LLMs. The predicate tree being learned from data is the important qualifier for intelligence though.
Edit: in case I wasn't clear, I agree that a specific chain of IF/THEN/ELSE statements in a loop can be generally intelligent. How could it not, specific kinds of these chains are Turing complete after all, so unless you think the brain has some kind of magic, it too is reducible to such a program, in principle. We just haven't yet discovered what kind of chain this is, just like we didn't understand what kind of chain could produce distributed consensus before PAXOS.
So I do think there are two distinct types of activities involved in knowledge work:
1. Taking established techniques or concepts and appropriately applying them to novel situations.
2. Inventing or synthesizing new, never-before-seen techniques or concepts
The vast majority of the time, humans do #1. LLMs certainly do this in some contexts as well, as demonstrated by my example above. This to me counts as "understanding" and "thinking". Some people define "understanding" such that it's something only humans can do; to which I respond, I don't care what you call it, it's useful.
Can LLMs do #2? I don't know. They've got such extensive experience that how would you know if they'd invented a technique vs had seen it somewhere?
But I'd venture to argue that most humans never or rarely do #2.
> But I'd venture to argue that most humans never or rarely do #2.
That seems fair, although the distinction between synthesizing something new and combining existing techniques is a bit blurry.
What's missing from LLMs though is really part of 1). If techniques A, B, C & D are all the tools you need to solve a novel problem, then a human has the capability of learning WHEN to use each of these tools, and in what order/combination, to solve that problem - a process of trial and error, generalization and exception, etc. It's not just the techniques (bag of tools) you need, but also the rules (acquired knowledge) of how they can be used to solve different problems.
LLMs aren't able to learn at runtime from their own experience, so the only way they can learn these rules of when to apply given tools (aka reasoning steps) is by RL training on how they have been successfully used to solve a range of problems in the training data. So, the LLM may have learnt that in a specific context it should first apply tool A (generate that reasoning step), etc, etc, but that doesn't help it to solve a novel problem where the same solution step selection process doesn't apply, even if the tools A-D are all it needs (if only it could learn how to apply them to this novel problem).
I define intelligence as prediction (degree of ability to use past experience to correctly predict future action outcomes), and reasoning/planning as multi-step what-if prediction.
Certainly if a human (or some AI) has learned to predict/reason over some domain, then what they will be doing is pattern matching to determine the generalizations and exceptions that apply in a given context (including a hypothetical context in a what-if reasoning chain), in order to be able to select a next step that worked before.
However, I think what we're really talking about here isn't the mechanics of applying learnt reasoning (context pattern matching), but rather the ability to reason in the general case, which requires the ability to LEARN to solve novel problems, which is what is missing from LLMs.
A system that has a fixed set of (reasoning/prediction) rules, but can't learn new ones for itself, seems better regarded as an expert system. We need to make the distinction between a system that can only apply rules, and one that can actually figure out the rules in the first place.
In terms of my definitions of intelligence and reasoning, based around ability to use past experience to learn to predict, then any system that can't learn from fresh experience doesn't meet that definition.
Of course in humans and other intelligent animals the distinction between past and ongoing experience doesn't apply since they can learn continually and incrementally (something that is lacking from LLMs), so for AI we need to use a different vocabulary, and "expert system" seems the obvious label for something that can use rules, but not discover them for itself.
> but rather the ability to reason in the general case, which requires the ability to LEARN to solve novel problems, which is what is missing from LLMs.
I don't think it's missing, zero shot prompting is quite successful in many cases. Maybe you find the extent that LLMs can do this to be too limited, but I'm not sure that means they don't reason at all.
> A system that has a fixed set of (reasoning/prediction) rules, but can't learn new ones for itself, seems better regarded as an expert system.
I think expert systems are a lot more limited than LLMs, so I don't agree with that classification. LLMs can generate output that's out of distribution, for instance, which is not something that classic expert systems can do (even if you think LLM OOD is still limited compared to humans).
I've elaborated in another comment [1] what I think part of the real issue is, and why people keep getting tripped up by saying that pattern matching is not reasoning. I think it's perfectly fine to say that pattern matching is reasoning, but pattern matching has levels of expressive power. First-order pattern matching is limited (and so reasoning is limited), and clearly humans are capable of higher order pattern matching which is Turing complete. Transformers are also Turing complete, and neural networks can learn any function, so it's not a matter of expressive power, in principle.
Aside from issues stemming from tokenization, I think many of these LLM failures are because they aren't trained in higher order pattern matching. Thinking models and the generalization seen from grokking are the first steps on this path, but it's not quite there yet.
Powerful pattern matching is still just pattern matching.
How is an LLM going to solve a novel problem with just pattern matching?
Novel means it has never seen it before, maybe doesn't even have the knowledge needed to solve it, so it's not going to be matching any pattern, and even if it did, that would not help if it required a solution different to whatever the pattern match had come from.
Human level reasoning includes ability to learn, so that people can solve novel problems, overcome failures by trial and error, exploration, etc.
So, whatever you are calling "reasoning" isn't human level reasoning, and it's therefore not even clear what you are trying to say? Maybe just that you feel LLMs have room for improvement by better pattern matching?
> Powerful pattern matching is still just pattern matching.
Higher order pattern matching is Turing complete. Transformers are Turing complete. Memory augmented LLMs are Turing complete. Neural networks can learn to reproduce any function. These have all been proven.
So if computers can be intelligent and can solve novel problems in principle, then LLMs can too if given the right training. If you don't think computers can be intelligent, you have a much higher burden to meet.
> Human level reasoning includes ability to learn, so that people can solve novel problems, overcome failures by trial and error, exploration, etc.
You keep bringing this up as if it's lacking, but basically all existing LLM interfaces provide facilities for memory to store state. Storing progress just isn't an issue if the LLM has the right training. HN has some recent articles about Claude Code just being given the task to port some GitHub repos to other programming languages, and the person woke up the next morning to find it had done it autonomously, using issue tracking, progress reports, PRs, the whole nine yards. This is frankly not the hard part IMO.
It seems readily apparent there is a difference given their inability to do tasks we would otherwise reasonably describe as achievable via basic reasoning on the same facts.
I agree LLMs have many differences in abilities relative to humans. I'm not sure what this implies for their ability to reason though. I'm not even sure what examples about their bad reasoning can prove about the presence or absence of any kind of "reasoning", which is why I keep asking for definitions to remove the ambiguity. If examples of bad reasoning sufficed, then this would prove that humans can't reason either, which is silly.
A rigorous definition of "reasoning" is challenging though, which is why people consistently can't provide a general one that's satisfactory when I ask, and this is why I'm skeptical that pattern matching isn't a big part of it. Arguments that LLMs are "just pattern matching" are thus not persuasive arguments that they are not "reasoning" at some cruder level.
Maybe humans are just higher order pattern matchers and LLMs are only first or second-order pattern matchers. Maybe first-order pattern matching shouldn't count as "reasoning", but should second-order? Third-order? Is there evidence or some proof that LLMs couldn't be trained to be higher order pattern matchers, even in principle?
None of the arguments or evidence I've seen about LLMs and reasoning is rigorous or persuasive on these questions.
>and the LLMs, which almost certainly didn't have a lot of "assemble these items" in their training data
I don't think this assumption is sound. Humans write a huge amount on "assemble components x and y to make entity z". I'd expect all LLMs to have consumed every IKEA type instruction manual, the rules for Jenga, all geometry textbooks and papers ever written.
Most of our coding is just plumbing. Getting data from one place to where it needs to be. There is no advanced reasoning necessary. Just a good idea of the structure of the code and the data-structures.
Even high school maths tests are way harder than what most professional programmers do on a daily basis.
I could be mistaken but generally LLMs cannot tackle out-of-domain problems whereas humans do seem to have that capability. Relatedly, the energy costs are wildly different suggesting that LLMs are imitating some kind of thought but not simulating it. They’re doing a remarkable job of passing the Turing test but that says more about the limitations of the Turing test than it does about the capabilities of the LLMs.
> I just don't see how you can argue with a straight face that this is "pattern matching". If that's pattern matching, then pattern matching is not an insult.
IMO it's still "just" a very good autocomplete. No actual reasoning, but lots of statistics on what the next token to spit out should be.
That's the main point of the parent comment. Arguing about the definition of "reasoning" or "pattern matching" is just a waste of time. What really matters is if it produces helpful output. Arguing about that is way better!
Instead of saying: "It's just pattern matching -> It won't improve the world", make an argument like: "AI's seem to have trouble specializing like humans -> adopting AI will increase error rates in business processes -> due to the amount of possible edge cases, most people will get into an edge case with no hope of escaping it -> many people's lives will get worse".
The first example relies on us agreeing on the definition of pattern matching, and then taking a conclusion based on how those words feel. This has no hope of convincing me if I don't like your definition! The second one is an argument that could potentially convince me, even if I'm an AI optimist. It is also just by itself an interesting line of reasoning.
No it's not "just a very good autocomplete". I don't know why people repeat this thing (it's wrong) but I find it an extremely counterproductive position. Some people just love to dismiss the capabilities of AI with a very shallow understanding of how it works. Why?
It generates words one by one, like we all do. This doesn't mean it does just that and nothing else. It's the mechanics of how they are trained and how they do inference. And most importantly how they communicate with us. It doesn't define what they are or their limits. This is reductionism. Ignoring the mathematical complexity of a giant neural network.
Do we though? Sure, we communicate sequentially, but that doesn't mean that our internal effort is piecewise and linear. A modern transformer LLM however is. Each token is sampled from a population exclusively dependent on the tokens that came before it.
Mechanistically speaking, it works similarly to autocomplete, but at a very different scale.
Now how much of an unavoidable handicap this incurs, if any, is absolutely up for debate.
But yes, taking this mechanistic truth and only considering it in a shallow manner underestimates the capability of LLMs by a large degree.
Is this a certainty? I thought it was an open question whether quantum effects are at play in the brain, and those have a counterintuitive relationship with time (to vastly dumb things down in a way my grug mind can comprehend).
I think it's more that there isn't yet evidence against it. In other words, we're not sure whether or not the brain has some kind of special sauce that doesn't just reduce to linear algebra.
"I think it's more that there isn't yet evidence against it."
We don't? AFAIK we have no proof of anyone being able to see into the future. Now maybe there are other manifestations of this, but I know of no test today that even hints at it.
What's obtuse about it? It's honestly a very straightforward statement. Every thing we think or say is a function of past events. We don't incorporate future events into what we think or say. Even speculation or imagination of future events occurred in the past (that is the act of imagining it occurred in the past).
It's really a super simple concept -- maybe it's so simple that it seems obtuse.
Because the other poster's point wasn't that it was a 'past event.' The point was that it's just predicting based upon the previous token. It's disingenuous to mix the two concepts up.
>Pattern matching questions on a contrived test is not the same thing as understanding or reasoning.
I think most of the problems I solve are also pattern matching. The problems I am good at solving are the ones I've seen before or the ones I can break into problems I've seen before.
> It’s the same reason why most of the people who pass your leetcode tests don’t actually know how to build anything real. They are taught to the test not taught to reality.
True, and "Agentic Workflows" are now playing the same role as "Agile" in that both take the idea that if you have many people/LLMs that can solve toy problems but not real ones then you can still succeed by breaking down the real problems into toy problems and assigning them out.
"Not understanding or reasoning" is anthropocentric cope. There is very little practical difference between "understanding" and "reasoning" implemented in human mind and that implemented in LLMs.
One notable difference, however, is that LLMs disproportionately suck at spatial reasoning. Which shouldn't be surprising, considering that their training datasets are almost entirely text. The ultimate wordcel makes for a poor shape rotator.
All ARC-AGI tasks are "spatial reasoning" tasks. They aren't in any way special. They just force LLMs to perform in an area they're spectacularly weak at. And LLMs aren't good enough yet to be able to brute force through this innate deficiency with raw intelligence.
The primary source is: measured LLM performance on once-human-exclusive tasks - such as high end natural language processing or commonsense reasoning.
Those things were once thought to require a human mind - clearly, not anymore. Human commonsense knowledge can be both captured and applied by a learning algorithm trained on nothing but a boatload of text.
But another important source is: loads and loads of mechanistic interpretability research that has tried to actually pry the black box open and see what happens on the inside.
This found some amusing artifacts - such as latent world models that can be extracted from the hidden state, or neural circuits corresponding to high-level abstractions being chained together to obtain the final outputs. Very similar to human "abstract thinking" in function - despite being implemented on a substrate of floating point math and not wet meat.
One of the most astonishing things about LLMs is that they actually seem to have achieved general common-sense reasoning to a significant extent. Example from the thread about somebody ordering 18000 waters at a drive-through: https://news.ycombinator.com/item?id=45067653
TL;DR: Even without being explicitly prompted to, a pretty weak LLM "realized" that a thousand glasses of water was an unreasonable order. I'd say that's good enough to call "common sense".
You can try it out yourself! Just pick any AI chatbot, make up situations with varying levels of absurdity, maybe in a roleplay setting (e.g. "You are a fast food restaurant cashier. I am a customer. My order is..."), and test how it responds.
So, you don't, but Wikipedia does? I'll believe they can do commonsense reasoning when they can figure out that people have 4 fingers and 1 thumb. Here I was thinking common sense reasoning was what we call reasoning based on common sense. Go figure some AI folks needed to write a Wikipedia article to redefine common sense.
Like they say, common sense ain't so common at all.
>Least you could do is look up what an unfamiliar term means before rolling in with all the hot takes.
Thanks for proving my point that common sense ain't so common. To be clear, common sense reasoning is not an "unfamiliar term" save for this new (the article was written in 2021) redefinition of it to be something AI related. It's kinda laughable that you are being this snitty about it.
> That would help you to be less ignorant the next time around.
No, psychology is right. Psychology studies what the properties of thought are. Neuroscience studies the specific biochemical mechanisms of the brain. Psychology is the study of what mental reasoning IS, while neuroscience is the study of HOW neurons in our brain implement it.
If you are asking “ok, but what is reasoning, really? What definition of reasoning would enable us to recognize whether it is going on in this AI or not?” it is a question of psychology. Unless we are restricting ourselves to whole brain emulation only.
Psychology is stuck in pre-Galilean era. Even if it studies "properties of thought", as you put it, it does so without formal basis, let alone understanding from first principles. As Chomsky said, about psychology and the like, "You want to move from behavioral science to authentic science." [1]
Very much agree with this. Looking at the dimensionality of a given problem space is a very helpful heuristic when analyzing how likely an LLM is to be suitable/reliable for that task. Consider how important positional encodings are to LLM performance. You also then have an attention model that operates in that 1-dimensional space. With multidimensional data, significant transformations to encode into a higher-dimensional abstraction need to happen within the model itself, before the model can even attempt to intelligently manipulate it.
> Pattern matching questions on a contrived test is not the same thing as understanding or reasoning.
Pattern matching is definitely the same thing as understanding and reasoning.
The problem is that LLMs can't recognize patterns that are longer than a few paragraphs, because the tokens would have to be far too long. LLMs are a thing we are lucky to have because we have very fast computers and very smart mathematicians making very hard calculations very efficient and parallelizable. But they sit on top of a bed of an enormous amount of human written knowledge, and can only stretch so far from that bed before completely falling apart.
Humans don't use tokenizers.
The goal right now is to build a scaffolding of these dummies in order to get really complicated work done, but that work is only ever going to accidentally be correct because of an accumulation of errors. This may be enough for a lot if we try it 1000x and run manually-tuned algos over the output to find the good ones. But this is essentially manual work, done in the traditional way.
edit: sorry, you're never going to convince me these things are geniuses when I chat to them for a couple of back and forth exchanges and they're already obviously losing track of everything, even what they just said. The good thing is that what they are is enough to do a lot, if you're a person who can be satisfied that they're not going to be your god anytime soon.
I've been testing LLMs on Sokoban-like puzzles (in the style of ARC-AGI-3) and they are completely awful at them. It really highlights how poor their memory is. They can't remember abstract concepts or rules between steps, even if they discover them themselves. They can only be presented with lossy text descriptions of such things which they have to re-read and re-interpret at every step.
LLMs are completely helpless on agentic tasks without a ton of scaffolding. But the scaffolding is inflexible and brittle, unlike the models themselves. Whoever figures out how to reproduce the functions of this type of scaffolding within the models, with some kind of internal test-time-learned memory mechanism, is going to win.
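For context, a typical harness looks roughly like the sketch below (callLLM is a hypothetical stand-in for a model API call); the point is that anything the model "learns" survives between turns only if the scaffold re-serializes it into the next prompt as text:

    // Hedged sketch of an agent loop for a grid puzzle. Rules the model
    // discovers persist only as free-text notes that get pasted back into
    // every prompt; nothing is retained inside the model itself.
    package agent

    import "fmt"

    // callLLM is a hypothetical stand-in for a model API call. It returns
    // the chosen move plus updated free-text notes.
    func callLLM(prompt string) (move, notes string) {
        return "up", "crates can be pushed two at a time"
    }

    func runEpisode(renderBoard func() string, apply func(move string) bool, maxSteps int) {
        notes := "" // the model's only "memory" between steps
        for step := 0; step < maxSteps; step++ {
            prompt := fmt.Sprintf(
                "Board:\n%s\nNotes from earlier steps:\n%s\nReply with your next move.",
                renderBoard(), notes)
            move, newNotes := callLLM(prompt)
            notes = newNotes // anything not written down here is lost
            if solved := apply(move); solved {
                return
            }
        }
    }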
I'm not sure how similar this is but I tried the same quite a while back with a simple 5x5 nonogram (Picross) and had similar difficulties.
I found not only incorrect 'reasoning', but also that even after being explicit about why a certain deduction was not correct, the same incorrect deduction would then appear later, and this happened over and over.
Also, there's already a complete database of valid answers at [1], so I'm not sure why the correct answer couldn't just come from that, and the 'reasoning' can be 'We solved this here, look...' ;)
> I found not only incorrect 'reasoning', but also that even after being explicit about why a certain deduction was not correct, the same incorrect deduction would then appear later, and this happened over and over.
Because it's in the context window, and a lot of training material refers to earlier stuff for later stuff, it is trained to bring that stuff up again and again. Even if it is in the window as a negative.
I really think that the problem is with tokenizing vision.
Any kind of visually based reasoning and they become dumb as rocks. It feels similar to having a person play sokoban but blindfolded and only with text prompts. The same issue cropped up with playing pokemon. Like the image gets translated to text, and then the model works on that.
I'm no expert on transformers, but it just feels like there is some kind of limit that prevents the models from "thinking" visually.
Yes, vision is a problem, but I don't think it's the biggest problem for the specific task I'm testing. The memory problem is bigger. The models frequently do come up with the right answer, but they promptly forget it between turns.
Sometimes they forget because the full reasoning trace is not preserved in context (either due to API limitations or simply because the context isn't big enough to hold dozens or hundreds of steps of full reasoning traces). Sometimes it's because retrieval from context is bad for abstract concepts and rules vs. keyword matching, and to me the reason for that is that text is lossy and inefficient. The models need to be able to internally store and retrieve a more compact, abstract, non-verbal representation of facts and procedures.
I think the problem is though that they need to store it in text context.
When I am solving a sokoban style game, it's entirely visual. I don't need to remember a lot because the visual holds so much information.
It's like the average person trying to play a game of chess with just text. It's nightmarishly hard compared to having a board in front of you. The LLMs seem stuck having to play everything through just text.
It's not just visual. You also need a representation of the rules of the game and the strategies that make sense. The puzzles I'm solving are not straight Sokoban, they have per-game varying rules that need to be discovered (again, ARC-AGI-3 style) that affect the strategies that you need to use. For example, in classic Sokoban you can't push two crates at once, but in some of the puzzles I'm using you can, and this is taught by forcing you to do it in the first level, and you need to remember it through the rest of the levels. This is not a purely visual concept and models still struggle with it.
Try to get your LLM of choice to find its way out of a labyrinth that you describe in text form. It's absolutely awful even with the simplest mazes. I'm not sure the problem here is memory, though? I think it has to do with spatial reasoning. I'd be willing to bet every company right now is working on spatial reasoning (at least up to 3D) and as soon as that is working, a huge amount of pieces will fall into place.
Spatial reasoning is weak, but still I frequently see models come up with the right answer in reasoning steps, only to make the wrong move in the following turn because they forget what they just learned. For models with hidden reasoning it's often not even possible to retain the reasoning tokens in context through multiple steps, but even if you could the context windows are big but not big enough to contain all the past reasoning for every step for hundreds of steps. And then even if they were the retrieval from context for abstract concepts (vs verbatim copying) is terrible.
Text is too lossy and inefficient. The models need to be able to internally store and retrieve a more compact, abstract, non-verbal representation of facts and procedures.
I wonder if scaffolding synthesis is the way to go. Namely, the LLM itself first reasons about the problem and creates scaffolding for a second agent that will do the actual solving. All inside a feedback loop to adjust the scaffolding based on results.
In general I think the more of the scaffolding that can be folded into the model, the better. The model should learn problem solving strategies like this and be able to manage them internally.
I toyed around with the idea of using an LLM to "compile" user instructions into a kind of AST of scaffolding, which can then be run by another LLM. It worked fairly well for the kind of semi-structured tasks LLMs choke on, like "for each of 100 things, do...", but I haven't taken it beyond a minimal impl.
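As a hedged sketch (the node types and the runLLM stub below are made up for illustration, not the actual minimal impl), the compiled scaffold might look something like this:

    // Hypothetical scaffold AST: a "compiler" LLM emits this structure,
    // and a plain interpreter drives a second LLM through it, so the
    // "for each of 100 things, do X" bookkeeping lives in code rather
    // than in the model's context window.
    package scaffold

    type Node interface{ Run(input string) string }

    // LLMStep is a single prompted call on one piece of input.
    type LLMStep struct{ Prompt string }

    func (s LLMStep) Run(input string) string {
        return runLLM(s.Prompt, input) // hypothetical model API call
    }

    // Seq runs nodes in order, feeding each output into the next.
    type Seq struct{ Nodes []Node }

    func (s Seq) Run(input string) string {
        for _, n := range s.Nodes {
            input = n.Run(input)
        }
        return input
    }

    // ForEach splits the input into items and runs the body once per item,
    // so the loop stays deterministic even over 100 items.
    type ForEach struct {
        Split func(string) []string
        Body  Node
    }

    func (f ForEach) Run(input string) string {
        out := ""
        for _, item := range f.Split(input) {
            out += f.Body.Run(item) + "\n"
        }
        return out
    }

    func runLLM(prompt, input string) string {
        // Stand-in; a real implementation would call a model here.
        return input
    }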
I would really like to read a full research paper made out of this, which describes the method in more detail, gives some more examples, does more analysis on it, etc.
Btw, this uses LLMs at the pure text level? Why not images? Most of these patterns are easy to detect at the image level, but I assume when presented as text, it's much harder.
> LLMs are PhD-level reasoners in math and science, yet they fail at children's puzzles. How is this possible?
I think this argument is a bit flawed. Yes, you can define AGI as being better than (average) humans in every possible task. But isn't this very arbitrary? Isn't it more reasonable to expect that different intelligent systems (including animals, humans) can have different strengths, and it is unreasonable to expect that one system is really better in everything? Maybe it's more reasonable to define ASI that way, but even for ASI, if a system is already better in a majority of tasks (but not necessarily in every task), I think this should already count as ASI. Maybe really being better in every possible task is just not possible. You could design a task that is very specifically tailored for human intelligence.
I suspect (to use the language of the author) current LLMs have a bit of a "reasoning dead zone" when it comes to images. In my limited experience they struggle with anything more complex than "transcribe the text" or similarly basic tasks. Like I tried to create an automated QA agent with Claude Sonnet 3.5 to catch regressions in my frontend, and it will look at an obviously broken frontend component (using puppeteer to drive and screenshot a headless browser) and confidently proclaim it's working correctly, often making up a supporting argument too. I've had much more success passing the code for the component and any console logs directly to the agent in text form.
My memory is a bit fuzzy, but I've seen another QA agent that takes a similar approach of structured text extraction rather than using images. So I suspect I'm not the only one finding image-based reasoning an issue. Could also be for cost reasons though, so take that with a pinch of salt.
LLM image frontends suck, and a lot of them suck big time.
The naive approach of "use a pretrained encoder to massage the input pixels into a bag of soft tokens and paste those tokens into the context window" is good enough to get you a third of the way to humanlike vision performance - but struggles to go much further.
Claude's current vision implementation is also notoriously awful. Like, "a goddamn 4B Gemma 3 beats it" level of awful. For a lot of vision-heavy tasks, you'd be better off using literally anything else.
Wild, I found it hard to believe that a 4b model could beat sonnet-3.5 at anything, but at least on the vision arena (https://lmarena.ai/leaderboard/vision) it seems like sonnet-3.5 is at the same Elo as a 27b gemma (~1150), so it's plausible. I guess that just says more about how bad vision LLMs are right now than anything else.
Can someone explain to me why a new LLM's ability to solve highly publicized puzzles is not "just" (sorry) it having access to the blog posts talking about those puzzles?
It's fine, that's what I would do to solve them, but it doesn't obviously and immediately make me confident in new reasoning capability with that suspicion floating around.
Should be easy to test by picking two similar models with different publishing dates (before and after ARC v2), and also comparing with/without the new reasoning technique from the article.
Actually really promising stuff. I think a lot of the recent advances in the last 6mo - 1yr are in the outer loop (for ex. the Google Deep Think model which got IMO gold, and the OAI IMO gold, all use substantive outer-loop search strategies [though it's unclear what these are] to maybe parallelize some generation/verification process). So there's no reason why we can't have huge advances in this area even outside of the industry labs in my view (I'm uninformed in general so take this comment with a large grain of salt).
Transformer models, typically architected for and trained on 1d text streams, are not going to perform well on ARC-AGI. I like that the test corpus exists as I believe it suggests that other model architectures (perhaps co-existing with LLMs in a MoE fashion) are needed to generalize AI performance further. For example, if we constructed a 3d version of ARC-AGI (rather than relying on grids) humans would probably still outperform reasoning LLMs handily. However, expand ARC-AGI to 4d and I think human performance might start to become more comparable to LLM performance. 4d is as alien to us as 2d is to LLMs, in this narrow test corpus.
But the core issue seems to be: How do you come up with the fitness function that drives the evolutionary process without human intervention in the first place?
(I've tried something similar with a coding agent where I let the agent modify parts of its system prompt... But it got stuck very fast since there was no clear fitness function)
Seeing how ARC-AGI is pretty much the only non-embodied short-duration type of challenge where humans are still an order of magnitude better than AIs, beating it would possibly bring us a lot closer to actual AGI.
I don't think so. The author isn't training an LLM, but rather using an LLM to solve a specific problem. This method could also be applied to solve other problems.
> LLMs have "dead reasoning zones" — areas in their weights where logic doesn't work. Humans have dead knowledge zones (things we don't know), but not dead reasoning zones.
Religion often is, as "the Lord's ways are inscrutable"
> LLMs have "dead reasoning zones" — areas in their weights where logic doesn't work. Humans have dead knowledge zones (things we don't know), but not dead reasoning zones.
We have dead zones in abductive reasoning, not in induction or deduction. Almost all failures of reasoning in people are in abducing which model describes the situation at hand.
E.g., we can apply the rule "not-A cannot follow from A" regardless of the A.
E.g., we always know that if the number of apples is 2, then it cannot be any number other than 2 -- a rule that quantifies over all numbers.
You will not find a "gap" for a given number, whereas with LLMs, gaps of this kind are common.
People who know alcohol is bad for them and don't want to keep being drunks but keep drinking, people who believe phones are bad for their kids but still buy them, people who understand AI will significantly degrade the environment if it becomes ubiquitous but still work to help it become ubiquitous...
Mathematicians who publish proofs that are later proven inconsistent!
I suspect we have fundamentally different views of how humans work. I see our behavior and beliefs as _mostly_ irrational, with only a few "reasoning live-zones" where, with great effort, we can achieve logical thought.
How can you know? One could argue that the entire phenomenon of cognitive dissonance is "people (internally) recognize the contradiction and then perform it"
To me the reason ARC-AGI puzzles are difficult for LLMs and possible for humans is that they are expressed in a format for which humans have powerful preprocessing capabilities.
Imagine the puzzle layouts were expressed in JSON instead of as a pattern of visual blocks. How many humans could solve them in that case?
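For reference, ARC tasks are in fact distributed as JSON, with grids encoded as arrays of small integers, roughly like this toy example (not a real task):

    {
      "train": [
        {"input":  [[0, 0, 1],
                    [0, 1, 0]],
         "output": [[1, 1, 0],
                    [1, 0, 1]]}
      ],
      "test": [
        {"input": [[1, 0, 0],
                   [0, 0, 1]]}
      ]
    }

Reading the structure out of that, rather than seeing it rendered as colored blocks, is much closer to what the model actually gets.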
We have powerful preprocessing blocks for images: strong computer vision capabilities predate LLMs by several years. Image classification, segmentation, object detection, etc. All differentiable and trainable in the same way as LLMs, including jointly.
To the best of my knowledge, no team has shown really high scores by adding in an image preprocessing block?
Bingo. We simply made a test for which we are well trained. We are constantly making real time decisions with our eyes. Interestingly certain monkeys are much better at certain visual pattern recognition than we are. They might laugh and think humans haven’t reached AGI yet.
Every one who had access to a computer that could convert json into something more readable for humans, and would know that was the first thing they needed to do?
You might as well have asked how many English speakers could solve the questions if they were in Chinese. All of them. They would call up someone who spoke Chinese, pay them to translate the questions, then solve them. Or failing that, they would go to the bookstore, buy books on learning Chinese, and solve them three years from now.
I love this sort of self-starter experimenting. Curious what models have been tried, I saw Grok4 mentioned, curious how well it transfers to other models.
>With RL, models no longer just learn what sounds correct based on patterns they've seen. They learn what words to output to be correct. RL is the process of forcing the pre-trained weights to be logically consistent.
How does Reinforcement Learning force the weights to be logically consistent? Isn't it just about training using a coarser/more-fuzzy granularity of fitness?
More generally, is it really solving the task if it's given a large number of attempts and an oracle to say whether it's correct? Humans can answer the questions in one shot and self-check the answer, whereas this is like trial and error with an external expert who tells you to try again.
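To make the "granularity" point concrete, here is a toy sketch of the difference between outcome-based and process-based rewards (both functions are illustrative stand-ins, not any lab's actual setup):

    package reward

    // outcomeReward scores only the final answer: 1 if it matches, else 0.
    // The model gets no signal about which intermediate step went wrong.
    func outcomeReward(finalAnswer, target string) float64 {
        if finalAnswer == target {
            return 1
        }
        return 0
    }

    // processReward scores each reasoning step with a (hypothetical)
    // verifier, giving denser but still fuzzy feedback.
    func processReward(steps []string, verify func(string) bool) float64 {
        if len(steps) == 0 {
            return 0
        }
        good := 0
        for _, s := range steps {
            if verify(s) {
                good++
            }
        }
        return float64(good) / float64(len(steps))
    }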
Congrats, this solution resembles AlphaEvolve. Text serves as the high-level search space, and genetic mixing (MAP-Elites in AE) merges attempts at lower levels.
The biggest issue I have with ARC-AGI is it's a visual problem. LLMs (even the newfangled multi-modal ones) are still far worse at vision than at purely text based problems. I don't think it's possible to build a test of purely text-based questions that would be easy for humans and hard for SOTA models. Yes, there's a few gotchas you can throw at them but not 500.
Congrats, you made LLMs perform slightly better at a contrived puzzle. This finally proves that we've cracked intelligence and are well on our way towards AGI.
> LLMs are PhD-level reasoners in math and science, yet they fail at children's puzzles. How is this possible?
Because they are not.
Pattern matching questions on a contrived test is not the same thing as understanding or reasoning.
It’s the same reason why most of the people who pass your leetcode tests don’t actually know how to build anything real. They are taught to the test not taught to reality.
> Pattern matching questions on a contrived test is not the same thing as understanding or reasoning.
Do submarines swim? I don't really care if it gets me where I want to go. The fact is that just two days ago, I asked Claude to look at some reasonably complicated concurrent code to which I had added a new feature, and asked it to list what tests needed to be added; and then when I asked GPT-5 to add them, it one-shot nailed the implementations. I've written a gist of it here:
https://gitlab.com/-/snippets/4889253
Seriously just even read the description of the test it's trying to write.
In order to one-shot that code, it had to understand:
- How the cache was supposed to work
- How conceptually to set up the scenario described
- How to assemble golang's concurrency primitives (channels, goroutines, and waitgroups), in the correct order, to achieve the goal.
Did it have a library of concurrency testing patterns in its head? Probably -- so do I. Had it ever seen my exact package before in its training? Never.
I just don't see how you can argue with a straight face that this is "pattern matching". If that's pattern matching, then pattern matching is not an insult.
If anything, the examples in this article are the opposite. Take the second example, which is basically 'assemble these assorted pieces into a rectangle'. Nearly every adult has assembled a minimum of dozens of things in their lives; many have assembled thousands of things. So it's humans in this case who are simply "pattern matching questions on a contrived test", and the LLMs, which almost certainly didn't have a lot of "assemble these items" in their training data, that are reasoning out what's going on from first principles.
> Do submarines swim?
It doesn't matter HOW LLMs "swim" as long as they can, but the point being raised is whether they actually can.
It's as if LLMs can swim in the ocean, in rough surf, but fail to swim in rivers or swimming pools, because they don't have a generalized ability to swim - they've just been RL-trained on the solution steps to swimming in surf, but since those exact conditions don't exist in a river (which might seem like a less challenging environment), they fail there.
So, the question that might be asked is when LLMs are trained to perform well in these vertical domains like math and programming, where it's easy to verify results and provide outcome- or process-based RL rewards, are they really learning to reason, or are they just learning to pattern match to steer generation in the direction of problem-specific reasoning steps that they had been trained on?
Does the LLM have the capability to reason/swim, or is it really just an expert system that has been given the rules to reason/swim in certain cases, but would need to be similarly hand fed the reasoning steps to be successful in other cases?
I think the answer is pretty obvious given that LLM's can't learn at runtime - can't try out some reasoning generalization they may have arrived at, find that it doesn't work in a specific case, then explore the problem and figure it out for next time.
Given that it's Demis Hassabis who it pointing out this deficiency of LLMs (and has a 5-10 year plan/timeline to fix it - AGI), not some ill-informed LLM critic, it seems silly to deny it.
>> Do submarines swim?
>It doesn't matter HOW LLMs "swim" as long as they can, but the point being raised is whether they actually can.
>It's as if LLMs can swim in the ocean, in rough surf, but fail to swim in rivers or swimming pools
Just like submarines!
What? Submarines can definitely “swim” in rivers, although shallow water is certainly more challenging for a submerged vessel. Most submarines are a bit big for most swimming pools, but small ones like ROVs are frequently tested in pools.
> I think the answer is pretty obvious given that LLM's can't learn at runtime - can't try out some reasoning generalization they may have arrived at, find that it doesn't work in a specific case, then explore the problem and figure it out for next time.
This is just a problem of memory. Supposing that an LLM did generate a genuinely novel insight, it could in theory they could write a note for itself so that next time they come online, they can read through a summary of the things they learned. And it could also write synthetic training data for itself so that the next time they're trained, that gets incorporated into its general knowledge.
OpenAI allows you to fine-tune GPT models, I believe. You could imagine a GPT system working for 8 hours in a day, then spending a bunch of time looking over all its conversation looking for patterns or insights or things to learn, and then modifying its own fine-tuning data (adding, removing, or modifying as appropriate), which it then used to train itself overnight, waking up the next morning having synthesized the previous day's experience.
> This is just a problem of memory
How does memory (maybe later incorporated via fine tuning) help if you can't figure out how to do something in the first place ?
That would be a way to incorporate new declarative data at "runtime" - feedback to the AI intern as to what it is doing wrong. However, in order to do something effectively by yourself generally requires more than just new knowledge - it requires personal practice/experimentation etc, since you need to learn how to act based on the contents of your own mind, not that of the instructor.
Even when you've had enough practice to become proficient at a taught skill, you may not be able to verbalize exactly what you are doing (which is part of the teacher-student gap), so attempting to describe then capture that as textual/context "sensory input" is not always going to work.
> are they really learning to reason, or are they just learning to pattern match to steer generation in the direction of problem-specific reasoning steps that they had been trained on?
Are you sure there's a real difference? Do you have a definition of "reasoning" that excludes this?
It's trivial to demonstrate that LLMs are pattern matching rather than reasoning. A good way is to provide modified riddles-that-aren't. As an example:
> Prompt: A man working at some white collar job gets an interview scheduled with an MBA candidate. The man says "I can't interview this candidate, he's my son." How is this possible?
> ChatGPT: Because the interviewer is the candidate’s mother. (The riddle plays on the assumption that the interviewer must be a man.)
This is clearly pattern matching and overfitting to the "doctor riddle" and a good demonstration of how there's no actual reasoning going on. A human would read the prompt and initially demonstrate confusion, which LLMs don't demonstrate because they don't actually reason.
Over fitting isn't evidence of non-reasoning, but that aside, what's interesting is that ChatGPT (free) trips on this, as did older models. But GPT-5 thinking, Opus 4, and Gemini 2.5 Pro all pointed out that there is no trick and it's likely the man just views it as a conflict of interest to interview his son.
It's hard to say whether this has been trained out (it's an old example) or if it's just another hurdle that general model progression has overcome.
OK. But, in Claude Sonnet 4:
'This is possible because the man is the candidate's father. When he says "he's my son," he's simply stating their family relationship. The scenario doesn't present any logical contradiction - a father could very well be in a position where he's supposed to interview his own son for a job. This would create a conflict of interest, which is why he's saying he can't conduct the interview. It would be inappropriate and unfair for a parent to interview their own child for a position, so he would need to recuse himself and have someone else handle the interview. The phrasing might initially seem like it's setting up a riddle, but it's actually a straightforward situation about professional ethics and avoiding conflicts of interest in hiring.'
EDIT - this is described better by other posters.
We kinda move from the situation “LLM can only do what it seen before” to “LLM can do something by composing several things it has seen before”. We didn’t get to the situation “LLM can do things it has not seen before”.
The practicality of the situation is that a lot of problems fall into the second bucket. We all like to think we deal with novel problems, but most of what we can think of was already considered by another human and captured by llm. You had to invent something deliberately unique, and that’s telling. Most startup ideas are invented more than once, for example.
The key shortcoming of the llm is that it is not aware of its own limits. If it ever becomes aware it can outsource such rare things to mechanical Turk.
People make the same sort of mistakes.
Please explain how this is relevant to the topic at hand. Thanks!
You claim that AI is patterned matching instead of reasoning, but the psychological literature is clear that people reason by pattern matching. As evidenced by the fact that people tend to make the same sorts of mistakes when reasoning quickly.
Ask someone who has made such a mistake to think a little more on it, and they’ll notice their error. Ask a reasoning model to do literally the same thing, to “think” on it, and it will also notice its error.
If you still insist that AIs are not reasoning here, then neither are people.
> It's trivial to demonstrate that LLMs are pattern matching rather than reasoning.
Again, this is just asserting the premise that reasoning cannot include pattern matching, but this has never been justified. What is your definition for "reasoning"?
> This is clearly pattern matching and overfitting to the "doctor riddle" and a good demonstration of how there's no actual reasoning going on.
Not really, no. "Bad reasoning" does not entail "no reasoning". Your conclusion is simply too strong for the evidence available, which is why I'm asking for a rigorous definition of reasoning that doesn't leave room for disagreement about whether pattern matching counts.
If your assertion is that you can't prove reasoning isn't just pattern matching, then I counter by saying you can't prove reasoning isn't just chaining a large number of IF/THEN/ELSE logic statements and therefore computers have been generally intelligent since ~1960.
The difference between ML models and computers since the 1960s is that the ML models weren't programmed with predicates, they "learned" them from analyzing data, and can continue to learn in various ways from further data. That's a meaningful difference, and why the former may qualify as intelligent and the latter cannot.
But I agree in principle that LLMs can be distilled into large IF/THEN/ELSE trees, that's the lesson of BitNet 1-bit LLMs. The predicate tree being learned from data is the important qualifier for intelligence though.
Edit: in case I wasn't clear, I agree that a specific chain of IF/THEN/ELSE statements in a loop can be generally intelligent. How could it not, specific kinds of these chains are Turing complete after all, so unless you think the brain has some kind of magic, it too is reducible to such a program, in principle. We just haven't yet discovered what kind of chain this is, just like we didn't understand what kind of chain could produce distributed consensus before PAXOS.
So I do think there are two distinct types of activities involved in knowledge work:
1. Taking established techniques or concepts and appropriately applying them to novel situations.
2. Inventing or synthesizing new, never-before-seen techniques or concepts
The vast majority of the time, humans do #1. LLMs certainly do this in some contexts as well, as demonstrated by my example above. This to me counts as "understanding" and "thinking". Some people define "understanding" such that it's something only humans can do; to which I respond, I don't care what you call it, it's useful.
Can LLMs do #2? I don't know. They've got such extensive experience that how would you know if they'd invented a technique vs had seen it somewhere?
But I'd venture to argue that most humans never or rarely do #2.
> But I'd venture to argue that most humans never or rarely do #2.
That seems fair, although the distinction between synthesizing something new and combining existing techniques is a bit blurry.
What's missing from LLMs though is really part of 1). If techniques A, B, C & D are all the tools you need to solve a novel problem, then a human has the capability of learning WHEN to use each of these tools, and in what order/combination, to solve that problem - a process of trial and error, generalization and exception, etc. It's not just the techniques (bag of tools) you need, but also the rules (acquired knowledge) of how they can be used to solve different problems.
LLMs aren't able to learn at runtime from their own experience, so the only way they can learn these rules of when to apply given tools (aka reasoning steps) is by RL training on how those tools have been successfully used to solve a range of problems in the training data. So the LLM may have learnt that in a specific context it should first apply tool A (generate that reasoning step), etc., but that doesn't help it solve a novel problem where the same solution-step selection process doesn't apply, even if tools A-D are all it needs (if only it could learn how to apply them to this novel problem).
I define intelligence as prediction (degree of ability to use past experience to correctly predict future action outcomes), and reasoning/planning as multi-step what-if prediction.
Certainly if a human (or some AI) has learned to predict/reason over some domain, then what they will be doing is pattern matching to determine the generalizations and exceptions that apply in a given context (including a hypothetical context in a what-if reasoning chain), in order to be able to select a next step that worked before.
However, I think what we're really talking about here isn't the mechanics of applying learnt reasoning (context pattern matching), but rather the ability to reason in the general case, which requires the ability to LEARN to solve novel problems, which is what is missing from LLMs.
A system that has a fixed set of (reasoning/prediction) rules, but can't learn new ones for itself, seems better regarded as an expert system. We need to make the distinction between a system that can only apply rules, and one that can actually figure out the rules in the first place.
In terms of my definitions of intelligence and reasoning, based around ability to use past experience to learn to predict, then any system that can't learn from fresh experience doesn't meet that definition.
Of course in humans and other intelligent animals the distinction between past and ongoing experience doesn't apply since they can learn continually and incrementally (something that is lacking from LLMs), so for AI we need to use a different vocabulary, and "expert system" seems the obvious label for something that can use rules, but not discover them for itself.
> but rather the ability to reason in the general case, which requires the ability to LEARN to solve novel problems, which is what is missing from LLMs.
I don't think it's missing, zero shot prompting is quite successful in many cases. Maybe you find the extent that LLMs can do this to be too limited, but I'm not sure that means they don't reason at all.
> A system that has a fixed set of (reasoning/prediction) rules, but can't learn new ones for itself, seems better regarded as an expert system.
I think expert systems are a lot more limited than LLMs, so I don't agree with that classification. LLMs can generate output that's out of distribution, for instance, which is not something that classic expert systems can do (even if you think LLM OOD is still limited compared to humans).
I've elaborated in another comment [1] what I think part of the real issue is, and why people keep getting tripped up by saying that pattern matching is not reasoning. I think it's perfectly fine to say that pattern matching is reasoning, but pattern matching has levels of expressive power. First-order pattern matching is limited (and so reasoning is limited), and clearly humans are capable of higher order pattern matching which is Turing complete. Transformers are also Turing complete, and neural networks can learn any function, so it's not a matter of expressive power, in principle.
Aside from issues stemming from tokenization, I think many of these LLM failures are because they aren't trained in higher order pattern matching. Thinking models and the generalization seen from grokking are the first steps on this path, but it's not quite there yet.
[1] https://news.ycombinator.com/item?id=45277098
Powerful pattern matching is still just pattern matching.
How is an LLM going to solve a novel problem with just pattern matching?
Novel means it has never seen it before, maybe doesn't even have the knowledge needed to solve it, so it's not going to be matching any pattern, and even if it did, that would not help if it required a solution different to whatever the pattern match had come from.
Human level reasoning includes ability to learn, so that people can solve novel problems, overcome failures by trial and error, exploration, etc.
So, whatever you are calling "reasoning" isn't human level reasoning, and it's therefore not even clear what you are trying to say? Maybe just that you feel LLMs have room for improvement by better pattern matching?
> Powerful pattern matching is still just pattern matching.
Higher order pattern matching is Turing complete. Transformers are Turing complete. Memory augmented LLMs are Turing complete. Neural networks can learn to reproduce any function. These have all been proven.
So if computers can be intelligent and can solve novel problems in principle, then LLMs can too if given the right training. If you don't think computers can be intelligent, you have a much higher burden to meet.
> Human level reasoning includes ability to learn, so that people can solve novel problems, overcome failures by trial and error, exploration, etc.
You keep bringing this up as if it's lacking, but basically all existing LLM interfaces provide facilities for memory to store state. Storing progress just isn't an issue if the LLM has the right training. HN has had some recent articles about Claude Code being given the task of porting some GitHub repos to other programming languages, and the authors woke up the next morning to find it had done so autonomously, using issue tracking, progress reports, PRs, the whole nine yards. This is frankly not the hard part IMO.
It seems readily apparent there is a difference given their inability to do tasks we would otherwise reasonably describe as achievable via basic reasoning on the same facts.
I agree LLMs have many differences in abilities relative to humans. I'm not sure what this implies for their ability to reason though. I'm not even sure what examples about their bad reasoning can prove about the presence or absence of any kind of "reasoning", which is why I keep asking for definitions to remove the ambiguity. If examples of bad reasoning sufficed, then this would prove that humans can't reason either, which is silly.
A rigorous definition of "reasoning" is challenging though, which is why people consistently can't provide a general one that's satisfactory when I ask, and this is why I'm skeptical that pattern matching isn't a big part of it. Arguments that LLMs are "just pattern matching" are thus not persuasive arguments that they are not "reasoning" at some cruder level.
Maybe humans are just higher order pattern matchers and LLMs are only first or second-order pattern matchers. Maybe first-order pattern matching shouldn't count as "reasoning", but should second-order? Third-order? Is there evidence or some proof that LLMs couldn't be trained to be higher order pattern matchers, even in principle?
None of the arguments or evidence I've seen about LLMs and reasoning is rigorous or persuasive on these questions.
Nothing about the uncertainty of the definition for 'reasoning' requires that pattern matching be part of the definition.
Did someone in this thread claim that?
>and the LLMs, which almost certainly didn't have a lot of "assemble these items" in their training data
I don't think this assumption is sound. Humans write a huge amount on "assemble components x and y to make entity z". I'd expect all LLMs to have consumed every IKEA-type instruction manual, the rules for Jenga, and all geometry textbooks and papers ever written.
Most of our coding is just plumbing. Getting data from one place to where it needs to be. There is no advanced reasoning necessary. Just a good idea of the structure of the code and the data-structures.
Even high school maths tests are way harder than what most professional programmers do on a daily basis.
I could be mistaken but generally LLMs cannot tackle out-of-domain problems whereas humans do seem to have that capability. Relatedly, the energy costs are wildly different suggesting that LLMs are imitating some kind of thought but not simulating it. They’re doing a remarkable job of passing the Turing test but that says more about the limitations of the Turing test than it does about the capabilities of the LLMs.
> I just don't see how you can argue with a straight face that this is "pattern matching". If that's pattern matching, then pattern matching is not an insult.
IMO it's still "just" a very good autocomplete. No actual reasoning, but lots of statistics on what is the next token to spit out.
> Do submarines swim?
That's the main point of the parent comment. Arguing about the definition of "reasoning" or "pattern matching" is just a waste of time. What really matters is if it produces helpful output. Arguing about that is way better!
Instead of saying: "It's just pattern matching -> It won't improve the world", make an argument like: "AI's seem to have trouble specializing like humans -> adopting AI will increase error rates in business processes -> due to the amount of possible edge cases, most people will get into an edge case with no hope of escaping it -> many people's lives will get worse".
The first example relies on us agreeing on the definition of pattern matching, and then taking a conclusion based on how those words feel. This has no hope of convincing me if I don't like your definition! The second one is an argument that could potentially convince me, even if I'm an AI optimist. It is also just by itself an interesting line of reasoning.
No it's not "just a very good autocomplete". I don't know why people repeat this thing (it's wrong) but I find it an extremely counterproductive position. Some people just love to dismiss the capabilities of AI with a very shallow understanding of how it works. Why?
It generates words one by one, like we all do. This doesn't mean it does just that and nothing else. It's the mechanics of how they are trained and how they do inference. And most importantly how they communicate with us. It doesn't define what they are or their limits. This is reductionism. Ignoring the mathematical complexity of a giant neural network.
> like we all do
Do we though? Sure, we communicate sequentially, but that doesn't mean that our internal effort is piecewise and linear. A modern transformer LLM however is. Each token is sampled from a population exclusively dependent on the tokens that came before it.
Mechanistically speaking, it works similarly to autocomplete, but at a very different scale.
Now how much of an unavoidable handicap this incurs, if any, is absolutely up for debate.
But yes, taking this mechanistic truth and only considering it in a shallow manner underestimates the capability of LLMs by a large degree.
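To make that mechanistic point concrete, here is a minimal sketch of greedy autoregressive decoding; the scoring function below is a made-up stand-in for a real transformer forward pass, but the control flow is the relevant part: each new token is chosen from a distribution that depends only on the tokens already emitted.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def next_token_scores(prefix):
    # Stand-in for a transformer forward pass: the scores are a pure
    # function of the prefix, nothing else.
    random.seed(" ".join(prefix))
    return {tok: random.random() for tok in VOCAB}

def generate(prompt, max_new_tokens=5):
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)          # conditioned only on prior tokens
        tokens.append(max(scores, key=scores.get))  # greedy pick of the next token
    return " ".join(tokens)

print(generate("the cat"))
```

Whether stacking enough layers behind `next_token_scores` amounts to reasoning is exactly the disagreement in this thread; the loop itself is genuinely autocomplete-shaped.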
Our thinking is also based only on events that occurred previously in time. We don’t use events in the future.
Is this a certainty? I thought it was an open question whether quantum effects are at play in the brain, and those have a counterintuitive relationship with time (to vastly dumb things down in a way my grug mind can comprehend).
I'm aware of a counterintuitive relationship with space, but what's the one with time?
Well there’s no evidence of this that I’ve seen. If so, then maybe that is what is the blocker for AGI.
I think it's more that there isn't yet evidence against it. In other words, we're not sure whether or not the brain has some kind of special sauce that doesn't just reduce to linear algebra.
"I think it's more that there isn't yet evidence against it."
We don't? AFAIK we have no proof of anyone being able to see into the future. Now maybe there are other manifestations of this, but I know of no test today that even hints at it.
Quantum effects definitely reduce to linear algebra however.
This is unhelpfully obtuse
What's obtuse about it? It's honestly a very straightforward statement. Every thing we think or say is a function of past events. We don't incorporate future events into what we think or say. Even speculation or imagination of future events occurred in the past (that is the act of imagining it occurred in the past).
It's really a super simple concept -- maybe it's so simple that it seems obtuse.
Because the other poster's point wasn't that it was a 'past event.' The point was that it's just predicting based upon the previous token. It's disingenuous to mix the two concepts up.
> The point was that it's just predicting based upon the previous token.
Well that's just wrong. None of the LLMs of interest predict based upon the previous token.
> I don't know why people repeat this thing (it's wrong)
Because they simply don't care if they're wrong. At this point, given what we've seen, that seems like the only explanation left.
I can't say for certain that our wetware isn't "just a very good autocomplete".
A very good autocomplete is realized by developing an understanding.
>Pattern matching questions on a contrived test is not the same thing as understanding or reasoning.
I think most of the problems I solve are also pattern matching. The problems I am good at solving are the ones I've seen before or the ones I can break into problems I've seen before.
> It’s the same reason why most of the people who pass your leetcode tests don’t actually know how to build anything real. They are taught to the test not taught to reality.
True, and "Agentic Workflows" are now playing the same role as "Agile" in that both take the idea that if you have many people/LLMs that can solve toy problems but not real ones then you can still succeed by breaking down the real problems into toy problems and assigning them out.
"Not understanding or reasoning" is anthropocentric cope. There is very little practical difference between "understanding" and "reasoning" implemented in human mind and that implemented in LLMs.
One notable difference, however, is that LLMs disproportionately suck at spatial reasoning. Which shouldn't be surprising, considering that their training datasets are almost entirely text. The ultimate wordcel makes for a poor shape rotator.
All ARC-AGI tasks are "spatial reasoning" tasks. They aren't in any way special. They just force LLMs to perform in an area they're spectacularly weak at. And LLMs aren't good enough yet to be able to brute force through this innate deficiency with raw intelligence.
> There is very little practical difference between "understanding" and "reasoning" implemented in human mind and that implemented in LLMs.
Source?
The primary source is: measured LLM performance on once-human-exclusive tasks - such as high end natural language processing or commonsense reasoning.
Those things were once thought to require a human mind - clearly, not anymore. Human commonsense knowledge can be both captured and applied by a learning algorithm trained on nothing but a boatload of text.
But another important source is: loads and loads of mechanistic interpretability research that tried to actually pry the black box open and see what happens on the inside.
This found some amusing artifacts - such as latent world models that can be extracted from the hidden state, or neural circuits corresponding to high-level abstractions being chained together to obtain the final outputs. Very similar to human "abstract thinking" in function - despite being implemented on a substrate of floating point math and not wet meat.
I haven't seen LLMs perform common sense reasoning. Feel free to share some links. Your post reads like anthropomorphized nonsense.
One of the most astonishing things about LLMs is that they actually seem to have achieved general common-sense reasoning to a significant extent. Example from the thread about somebody ordering 18000 waters at a drive-through: https://news.ycombinator.com/item?id=45067653
TL;DR: Even without being explicitly prompted to, a pretty weak LLM "realized" that a thousand glasses of water was an unreasonable order. I'd say that's good enough to call "common sense".
You can try it out yourself! Just pick any AI chatbot, make up situations with varying levels of absurdity, maybe in a roleplay setting (e.g. "You are a fast food restaurant cashier. I am a customer. My order is..."), and test how it responds.
What? Do you even know what "commonsense reasoning" means?
Do you?
https://en.wikipedia.org/wiki/Commonsense_reasoning
So, you don't, but wikipedia does? I'll believe they can do commonsense reasoning when they can figure out that people have 4 fingers and 1 thumb. Here I was thinking common sense reasoning was what we call reasoning based on common sense. Go figure some AI folks needed to write a wikipedia article to redefine common sense.
Like they say, common sense ain't so common at all.
Least you could do is look up what an unfamiliar term means before rolling in with all the hot takes.
So take the link, and read it. That would help you to be less ignorant the next time around.
>Least you could do is look up what an unfamiliar term means before rolling in with all the hot takes.
Thanks for proving my point that common sense ain't so common. To be clear, common sense reasoning is not an "unfamiliar term" save for this new (the article was written in 2021) redefinition of it to be something AI-related. It's kinda laughable that you are being this snitty about it.
> That would help you to be less ignorant the next time around.
Better to be "ignorant" than slow and humorless.
There is no source and arguing this is dumb because no one knows what reasoning or understanding is. No one.
So all we have is "Does it swim like a duck, look like a duck, quack like a duck?"
I’m sympathetic to your point, but this isn’t quite fair. The field of psychology does exist.
Neuroscience is the field that would be closest to this. But even they are empty handed with evidence and heavy with hypotheses.
No, psychology is right. Psychology studies what the properties of thought are. Neuroscience studies the specific biochemical mechanisms of the brain. Psychology is the study of what mental reasoning IS, while neuroscience is the study of HOW neurons in our brain implement it.
If you are asking “ok, but what is reasoning, really? What definition of reasoning would enable us to recognize whether it is going on in this AI or not?” it is a question of psychology. Unless we are restricting ourselves to whole brain emulation only.
Psychology is stuck in pre-Galilean era. Even if it studies "properties of thought", as you put it, it does so without formal basis, let alone understanding from first principles. As Chomsky said, about psychology and the like, "You want to move from behavioral science to authentic science." [1]
[1] Chomsky & Krauss (2015) An Origins Project Dialogue at https://youtu.be/Ml1G919Bts0
...literally benchmarks the post is all about?
practical difference is about results - and results are here
Very much agree with this. Looking at the dimensionality of a given problem space is a very helpful heuristic when analyzing how likely an LLM is to be suitable/reliable for that task. Consider how important positional encodings are to LLM performance. You also then have an attention model that operates in that 1-dimensional space. With multidimensional data, significant transformations to encode it into a higher-dimensional abstraction need to happen within the model itself before the model can even attempt to intelligently manipulate it.
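As a toy illustration of that dimensionality point (assuming the common row-major serialization of a grid into a token stream): cells that are vertically adjacent in 2D end up `width` positions apart in the 1D sequence the attention layers actually operate over, so the model has to recover the spatial structure from positional information alone.

```python
# Row-major flattening of a small ARC-style grid into the 1D stream a text model sees.
grid = [
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
]
width = len(grid[0])

tokens = [cell for row in grid for cell in row]  # [0, 0, 1, 0, 1, 0, 1, 0, 0]

def neighbour_positions(r, c):
    """Stream indices of the 4-neighbours of cell (r, c) after flattening."""
    idx = r * width + c
    return {
        "left/right": [idx - 1, idx + 1],        # still adjacent in the stream
        "up/down": [idx - width, idx + width],   # now 'width' tokens away
    }

print(neighbour_positions(1, 1))  # vertical neighbours of the centre cell sit 3 apart
```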
For many people, the difference between how a language model solves a problem and how a human solves a problem is actually very important.
[flagged]
Please consider a less emotive, flaming/personal tone in the future; Hacker News is much more readable without it!
I would broadly agree that it's a bit far, but the OP's point does have some validity; it's often the same formulaic methodology.
> Pattern matching questions on a contrived test is not the same thing as understanding or reasoning.
Pattern matching is definitely the same thing as understanding and reasoning.
The problem is that LLMs can't recognize patterns that are longer than a few paragraphs, because the tokens would have to be far too long. LLMs are a thing we are lucky to have because we have very fast computers and very smart mathematicians making very hard calculations very efficient and parallelizable. But they sit on top of a bed of an enormous amount of human written knowledge, and can only stretch so far from that bed before completely falling apart.
Humans don't use tokenizers.
The goal right now is to build a scaffolding of these dummies in order to get really complicated work done, but because errors accumulate, that work is only ever going to be correct by accident. This may be enough for a lot of cases if we try it 1000x and run manually-tuned algos over the output to find the good ones. But that is essentially manual work, done in the traditional way.
edit: sorry, you're never going to convince me these things are geniuses when I chat to them for a couple of back and forth exchanges and they're already obviously losing track of everything, even what they just said. The good thing is that what they are is enough to do a lot, if you're a person who can be satisfied that they're not going to be your god anytime soon.
I've been testing LLMs on Sokoban-like puzzles (in the style of ARC-AGI-3) and they are completely awful at them. It really highlights how poor their memory is. They can't remember abstract concepts or rules between steps, even if they discover them themselves. They can only be presented with lossy text descriptions of such things which they have to re-read and re-interpret at every step.
LLMs are completely helpless on agentic tasks without a ton of scaffolding. But the scaffolding is inflexible and brittle, unlike the models themselves. Whoever figures out how to reproduce the functions of this type of scaffolding within the models, with some kind of internal test-time-learned memory mechanism, is going to win.
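For context, the harness for this kind of test tends to look roughly like the sketch below (`query_model` and the `env` interface are placeholders, not any particular API): the only state that survives from one step to the next is whatever text the scaffolding chooses to re-send, so a rule the model inferred early on is gone later unless it was explicitly written down and carried forward.

```python
def parse_reply(reply: str):
    """Naive parse: first line is the move, the rest are notes to carry forward."""
    lines = reply.strip().splitlines() or ["noop"]
    return lines[0].strip().lower(), "\n".join(lines[1:])

def run_episode(env, query_model, max_steps=100):
    """Drive a Sokoban-like puzzle with an LLM, one text round-trip per step."""
    notes = ""  # the model's only persistent memory between steps
    for step in range(max_steps):
        prompt = (
            f"Step {step}. Current board:\n{env.render_as_text()}\n"
            f"Notes from earlier steps:\n{notes}\n"
            "Reply with one move (up/down/left/right) on the first line, "
            "then any notes worth keeping."
        )
        reply = query_model(prompt)       # hidden reasoning tokens are not preserved
        move, notes = parse_reply(reply)  # lossy: only what was written down survives
        env.apply(move)                   # placeholder environment interface
        if env.solved():
            return True
    return False
```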
I'm not sure how similar this is but I tried the same quite a while back with a simple 5x5 nonogram (Picross) and had similar difficulties.
I found not only incorrect 'reasoning' but also even after being explicit about why a certain deduction was not correct the same incorrect deduction would then appear later, and this happened over and over.
Also, there's already a complete database of valid answers at [1], so I'm not sure why the correct answer couldn't just come from that, and the 'reasoning' can be 'We solved this here, look...' ;)
[1] The wonderful https://pixelogic.app/every-5x5-nonogram
> I found not only incorrect 'reasoning' but also even after being explicit about why a certain deduction was not correct the same incorrect deduction would then appear later, and this happened over and over.
Because it's in the context window, and a lot of training material refers back to earlier material, the model is trained to bring that stuff up again and again, even when it sits in the window as a negative.
I really think that the problem is with tokenizing vision.
Any kind of visually based reasoning and they become dumb as rocks. It feels similar to having a person play sokoban but blindfolded and only with text prompts. The same issue cropped up with playing pokemon. Like the image gets translated to text, and then the model works on that.
I'm no expert on transformers, but it just feels like there is some kind of limit that prevents the models from "thinking" visually.
Yes, vision is a problem, but I don't think it's the biggest problem for the specific task I'm testing. The memory problem is bigger. The models frequently do come up with the right answer, but they promptly forget it between turns.
Sometimes they forget because the full reasoning trace is not preserved in context (either due to API limitations or simply because the context isn't big enough to hold dozens or hundreds of steps of full reasoning traces). Sometimes it's because retrieval from context is bad for abstract concepts and rules vs. keyword matching, and to me the reason for that is that text is lossy and inefficient. The models need to be able to internally store and retrieve a more compact, abstract, non-verbal representation of facts and procedures.
I think the problem is though that they need to store it in text context.
When I am solving a sokoban style game, it's entirely visual. I don't need to remember a lot because the visual holds so much information.
It's like the average person trying to play a game of chess with just text. It's nightmarishly hard compared to having a board in front of you. The LLMs seem stuck having to play everything through just text.
It's not just visual. You also need a representation of the rules of the game and the strategies that make sense. The puzzles I'm solving are not straight Sokoban, they have per-game varying rules that need to be discovered (again, ARC-AGI-3 style) that affect the strategies that you need to use. For example, in classic Sokoban you can't push two crates at once, but in some of the puzzles I'm using you can, and this is taught by forcing you to do it in the first level, and you need to remember it through the rest of the levels. This is not a purely visual concept and models still struggle with it.
Try to get your LLM of choice to find its way out of a labyrinth that you describe in text form. It's absolutely awful even with the simplest mazes. I'm not sure the problem here is memory, though? I think it has to do with spatial reasoning. I'd be willing to bet every company right now is working on spatial reasoning (at least up to 3D) and as soon as that is working, a huge amount of pieces will fall into place.
Spatial reasoning is weak, but still I frequently see models come up with the right answer in reasoning steps, only to make the wrong move in the following turn because they forget what they just learned. For models with hidden reasoning it's often not even possible to retain the reasoning tokens in context through multiple steps, but even if you could the context windows are big but not big enough to contain all the past reasoning for every step for hundreds of steps. And then even if they were the retrieval from context for abstract concepts (vs verbatim copying) is terrible.
Text is too lossy and inefficient. The models need to be able to internally store and retrieve a more compact, abstract, non-verbal representation of facts and procedures.
I wonder if scaffolding synthesis is the way to go. Namely, the LLM itself first reasons about the problem and creates scaffolding for a second agent that will do the actual solving, all inside a feedback loop that adjusts the scaffolding based on results.
In general I think the more of the scaffolding that can be folded into the model, the better. The model should learn problem solving strategies like this and be able to manage them internally.
I toyed around with the idea of using an LLM to "compile" user instructions into a kind of AST of scaffolding, which can then be run by another LLM. It worked fairly well for the kind of semi-structured tasks LLMs choke on, like "for each of 100 things, do...", but I haven't taken it beyond a minimal impl.
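A minimal sketch of that compile-then-execute split, assuming a generic `call_llm(prompt) -> str` helper and a made-up JSON step format (neither is a real API, just the shape of the idea):

```python
import json

def compile_instructions(call_llm, user_request: str) -> list:
    """Stage 1: a planner call turns free-form instructions into a structured plan."""
    plan_json = call_llm(
        "Rewrite the following request as a JSON list of steps, each of the form "
        '{"action": ..., "input": ...}. Request:\n' + user_request
    )
    return json.loads(plan_json)

def run_plan(call_llm, plan: list) -> list:
    """Stage 2: an executor call handles each step independently."""
    results = []
    for step in plan:
        results.append(call_llm("Perform this step and report the result:\n" + json.dumps(step)))
    return results
```

The win is that the "for each of 100 things" loop lives in ordinary code rather than in the model's context, which is exactly the kind of bookkeeping LLMs tend to fumble.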
I am working on something similar but with an AST for legal documents. So far, it seems promising but still rudimentary.
If you've ever used Claude Code + Plan mode - you know that exactly this is true.
This sounds interesting.
I would really like to read a full research paper made out of this, which describes the method in more detail, gives some more examples, does more analysis on it, etc.
Btw, this uses LLMs on pure text-level? Why not images? Most of these patterns are easy to detect on image-level, but I assume when presented as text, it's much harder.
> LLMs are PhD-level reasoners in math and science, yet they fail at children's puzzles. How is this possible?
I think this argument is a bit flawed. Yes, you can define AGI as being better than (average) humans in every possible task. But isn't this very arbitrary? Isn't it more reasonable to expect that different intelligent systems (including animals, humans) can have different strengths, and it is unreasonable to expect that one system is really better in everything? Maybe it's more reasonable to define ASI that way, but even for ASI, if a system is already better in a majority of tasks (but not necessarily in every task), I think this should already count as ASI. Maybe really being better in every possible task is just not possible. You could design a task that is very specifically tailored for human intelligence.
I suspect (to use the language of the author) current LLMs have a bit of a "reasoning dead zone" when it comes to images. In my limited experience they struggle with anything more complex than "transcribe the text" or similarly basic tasks. Like I tried to create an automated QA agent with Claude Sonnet 3.5 to catch regressions in my frontend, and it will look at an obviously broken frontend component (using puppeteer to drive and screenshot a headless browser) and confidently proclaim it's working correctly, often making up a supporting argument too. I've had much more success passing the code for the component and any console logs directly to the agent in text form.
My memory is a bit fuzzy, but I've seen another QA agent that takes a similar approach of structured text extraction rather than using images. So I suspect I'm not the only one finding image-based reasoning an issue. Could also be for cost reasons though, so take that with a pinch of salt.
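The text-only variant of that check can be as simple as the sketch below; `call_llm` is a placeholder for whatever model call you use, and the console log is assumed to be captured by whatever drives the headless browser:

```python
from pathlib import Path

def text_based_check(call_llm, component_path: str, console_log: str) -> str:
    """Judge a frontend component from its source plus runtime logs, no screenshots."""
    source = Path(component_path).read_text()
    prompt = (
        "You are a QA assistant. Given this component's source code and the browser "
        "console output produced when rendering it, say whether it appears broken and why.\n\n"
        f"--- source ({component_path}) ---\n{source}\n\n"
        f"--- console output ---\n{console_log}\n"
    )
    return call_llm(prompt)
```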
LLM image frontends suck, and a lot of them suck big time.
The naive approach of "use a pretrained encoder to massage the input pixels into a bag of soft tokens and paste those tokens into the context window" is good enough to get you a third of the way to humanlike vision performance - but struggles to go much further.
Claude's current vision implementation is also notoriously awful. Like, "a goddamn 4B Gemma 3 beats it" level of awful. For a lot of vision-heavy tasks, you'd be better off using literally anything else.
Wild, I found it hard to believe that a 4B model could beat Sonnet 3.5 at anything, but at least on the vision arena (https://lmarena.ai/leaderboard/vision) it seems like Sonnet 3.5 is at the same Elo as a 27B Gemma (~1150), so it's plausible. I guess that just says more about how bad vision LLMs are right now than anything else.
Can someone explain to me why a new LLMs ability to solve highly publicized puzzles is not "just" (sorry) it having access to the blog posts talking about those puzzles?
It's fine, that's what I would do to solve them, but it doesn't obviously and immediately make me confident in new reasoning capability with that suspicion floating around.
Should be easy to test by picking two similar models with different publishing dates (before and after ARC v2), and also comparing with/without the new reasoning technique from the article.
Actually really promising stuff. I think a lot of the recent advances in the last 6mo-1yr have been in the outer loop (for example, the Google Deep Think model that got IMO gold and the OAI IMO gold both use substantive outer-loop search strategies [though it's unclear what these are], maybe to parallelize some generation/verification process). So there's no reason why we can't have huge advances in this area even outside the industry labs, in my view (I'm uninformed in general, so take this comment with a large grain of salt).
Transformer models, typically architected for and trained on 1d text streams, are not going to perform well on ARC-AGI. I like that the test corpus exists as I believe it suggests that other model architectures (perhaps co-existing with LLMs in a MoE fashion) are needed to generalize AI performance further. For example, if we constructed a 3d version of ARC-AGI (rather than relying on grids) humans would probably still outperform reasoning LLMs handily. However, expand ARC-AGI to 4d and I think human performance might start to become more comparable to LLM performance. 4d is as alien to us as 2d is to LLMs, in this narrow test corpus.
That's a super neat approach.
But the core issue seems to be: How do you come up with the fitness function that drives the evolutionary process without human intervention in the first place?
(I've tried something similar with a coding agent where I let the agent modify parts of its system prompt... But it got stuck very fast since there was no clear fitness function)
Isn't the author actually overfitting a solution? He'll surely beat ARC-AGI, but that will be all.
Seeing how ARC-AGI is pretty much the only non-embodied short-duration type of challenge where humans are still an order of magnitude better than AIs, beating it would possibly bring us a lot closer to actual AGI.
I don't think so. The author isn't training an LLM, but rather using an LLM to solve a specific problem. This method could also be applied to solve other problems.
> LLMs have "dead reasoning zones" — areas in their weights where logic doesn't work. Humans have dead knowledge zones (things we don't know), but not dead reasoning zones.
Religion often is one, as "the Lord's ways are inscrutable".
And people have started seeing LLM's as a quasi-religion.
> LLMs have "dead reasoning zones" — areas in their weights where logic doesn't work. Humans have dead knowledge zones (things we don't know), but not dead reasoning zones.
blank stare
We have dead zones in abductive reasoning, not in induction or deduction. Almost all failures of reasoning in people are in abducing what model describes the situation at hand.
eg., we can apply the rule, "-A cannot follow from A", etc. regardless of the A
eg., we always know that if the number of apples is 2, then it cannot be any of "all numbers without 2" -- which quantifies over all numbers
You will not find a "gap" for a given number, whereas with LLMs, gaps of this kind are common
> we can apply the rule, "-A cannot follow from A", etc. regardless of the A
You can't think of any domains where we are unable to apply this rule? I feel like I'm surrounded by people claiming "A, therefore -A!!"
And if I'm one of them, and this were a reasoning dead-zone for me, I wouldn't be able to tell!
That's an abductive failure to recognise that something is A, and something else is not-A
I don't see cases where people recognise the contradiction and then perform it.
People who know alcohol is bad for them and don't want to keep being drunks but keep drinking, people who believe phones are bad for their kids but still buy them, people who understand AI will significantly degrade the environment if it becomes ubiquitous but still work to help it become ubiquitous...
Mathematicians who publish proofs that are later proven inconsistent!
I suspect we have fundamentally different views of how humans work. I see our behavior and beliefs as _mostly_ irrational, with only a few "reasoning live-zones" where, with great effort, we can achieve logical thought.
How can you know? One could argue that the entire phenomenon of cognitive dissonance is "people (internally) recognize the contradiction and then perform it"
To me the reason ARC-AGI puzzles are difficult for LLMs and possible for humans is that they are expressed in a format for which humans have powerful preprocessing capabilities.
Imagine the puzzle layouts were expressed in JSON instead of as a pattern of visual blocks. How many humans could solve them in that case?
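To make that concrete: the public ARC tasks are distributed as JSON along roughly these lines (the grid values here are made up), and this nested-list form is essentially what a text-only model is handed:

```python
# Roughly the on-disk shape of an ARC task once stripped of any visual rendering.
task = {
    "train": [
        {"input":  [[0, 0, 1], [0, 1, 0], [1, 0, 0]],
         "output": [[1, 0, 0], [0, 1, 0], [0, 0, 1]]},
    ],
    "test": [
        {"input": [[0, 1, 0], [1, 0, 0], [0, 0, 1]]},
    ],
}
```

Reading the transformation rule out of that by eye is a very different exercise from glancing at coloured blocks.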
We have powerful preprocessing blocks for images: strong computer vision capabilities predate LLMs by several years. Image classification, segmentation, object detection, etc. All differentiable and trainable in the same way as LLMs, including jointly. To the best of my knowledge, no team has shown really high scores by adding in an image preprocessing block?
Bingo. We simply made a test for which we are well trained. We are constantly making real time decisions with our eyes. Interestingly certain monkeys are much better at certain visual pattern recognition than we are. They might laugh and think humans haven’t reached AGI yet.
Everyone who had access to a computer that could convert JSON into something more readable for humans, and who would know that was the first thing they needed to do?
You might as well have asked how many English speakers could solve the questions if they were in Chinese. All of them. They would call up someone who spoke Chinese, pay them to translate the questions, then solve them. Or failing that, they would go to the bookstore, buy books on learning Chinese, and solve them three years from now.
I love this sort of self-starter experimenting. Curious what models have been tried, I saw Grok4 mentioned, curious how well it transfers to other models.
>With RL, models no longer just learn what sounds correct based on patterns they've seen. They learn what words to output to be correct. RL is the process of forcing the pre-trained weights to be logically consistent.
How does Reinforcement Learning force the weights to be logically consistent? Isn't it just about training using a coarser/more-fuzzy granularity of fitness?
More generally, is it really solving the task if it's given a large number of attempts and an oracle to say whether it's correct? Humans can answer the questions in one shot and self-check the answer, whereas this is like trial and error with an external expert who tells you to try again.
Are there any existing scripts/ tools to use these evolutionary algorithms also at home with e.g. Codex/GPT-5 / Claude Code?
The DSPy approach seems rather similar to that: https://dspy.ai/tutorials/gepa_ai_program/
This sounds like it is just slightly smarter than brute forcing your way to a solution.
Oh well, more support for my prediction: nobody will win a Nobel prize for reaching AGI.
Those are bold claims
Code: https://github.com/jerber/arc-lang-public
Kaggle: https://www.kaggle.com/code/jerber/jeremy-arc2
Congrats, this solution resembles AlphaEvolve. Text serves as the high-level search space, and genetic mixing (MAP-Elites in AlphaEvolve) merges attempts at lower levels.
The biggest issue I have with ARC-AGI is it's a visual problem. LLMs (even the newfangled multi-modal ones) are still far worse at vision than at purely text based problems. I don't think it's possible to build a test of purely text-based questions that would be easy for humans and hard for SOTA models. Yes, there's a few gotchas you can throw at them but not 500.
You would be interested in DSPy.
Congrats, you made LLMs perform slightly better at a contrived puzzle. This finally proves that we've cracked intelligence and are well on our way towards AGI.