[HN Gopher] OK, I can partly explain the LLM chess weirdness now
       ___________________________________________________________________
        
       OK, I can partly explain the LLM chess weirdness now
        
       Author : dmazin
       Score  : 422 points
       Date   : 2024-11-21 17:55 UTC (1 day ago)
        
 (HTM) web link (dynomight.net)
 (TXT) w3m dump (dynomight.net)
        
       | amrrs wrote:
       | >Theory 1: Large enough base models are good at chess, but this
       | doesn't persist through instruction tuning to chat models.
       | 
        | I lean mostly towards this, and also towards the chess notation
        | itself - I'm not sure whether it gets chopped up during
        | tokenization unless it's processed very precisely.
        | 
        | It's like designing an LLM just for predicting protein sequences,
        | because the ordering of the sequence matters. The base data might
        | contain that, but I don't think the intention is for it to carry
        | through.
        
         | com2kid wrote:
         | This makes me wonder what scenarios would be unlocked if OpenAI
         | gave access to gpt4-instruct.
         | 
         | I wonder if they avoid that due to the potential for negative
         | press from the outputs of a more "raw" model.
        
       | tromp wrote:
       | > For one, gpt-3.5-turbo-instruct rarely suggests illegal moves,
       | even in the late game. This requires "understanding" chess.
       | 
       | Here's one way to test whether it really understands chess. Make
       | it play the next move in 1000 random legal positions (in which no
       | side is checkmated yet). Such positions can be generated using
       | the ChessPositionRanking project at [1]. Does it still rarely
       | suggest illegal moves in these totally weird positions, that will
       | be completely unlike any it would have seen in training (and in
       | which the legal move choice is often highly restricted) ?
       | 
       | While good for testing legality of next moves, these positions
       | are not so useful for distinguishing their quality, since usually
       | one side already has an overwhelming advantage.
       | 
       | [1] https://github.com/tromp/ChessPositionRanking
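        | 
        | A minimal sketch of such a test, assuming python-chess is
        | installed, that random_fens holds FENs (e.g. produced with the
        | ChessPositionRanking tooling), and that ask_llm_for_move() is a
        | hypothetical wrapper around the model under test returning a
        | move in SAN:
        | 
        |     import chess
        |     
        |     def ask_llm_for_move(fen: str) -> str:
        |         # hypothetical: prompt the model with the position
        |         # and return its reply, e.g. "Qxd5"
        |         raise NotImplementedError
        |     
        |     def legality_rate(random_fens):
        |         legal = 0
        |         for fen in random_fens:
        |             board = chess.Board(fen)
        |             try:
        |                 # parse_san() raises on illegal moves
        |                 board.parse_san(ask_llm_for_move(fen))
        |                 legal += 1
        |             except ValueError:
        |                 pass
        |         return legal / len(random_fens)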
        
         | BurningFrog wrote:
         | Not that I understand the internals of current AI tech, but...
         | 
         | I'd expect that an AI that has seen billions of chess
         | positions, and the moves played in them, can figure out the
         | rules for legal moves without being told?
        
           | rscho wrote:
           | Statistical 'AI' doesn't 'understand' anything, strictly
           | speaking. It predicts a move with high probability, which
           | could be legal or illegal.
        
             | griomnib wrote:
             | Likewise with LLM you don't know if it is truly in the
             | "chess" branch of the statistical distribution or it is
             | picking up something else entirely, like some arcane
             | overlap of tokens.
             | 
             | So much of the training data (eg common crawl, pile,
             | Reddit) is dogshit, so it generates reheated dogshit.
        
               | Helonomoto wrote:
               | You generalize this without mentioning that there are
               | LLMs which do not just use random 'dogshit'.
               | 
                | Also, what does a normal human do? They look at how to
                | move one piece at a time, using a very small dictionary /
                | set of basic rules to move it. I don't remember learning
                | by counting every piece and its options while looking up
                | the rulebook. I learned to 'see' how I can move each type
                | of chess piece.
                | 
                | If an LLM uses only these piece moves at a mathematical
                | level, it is doing the same thing I do.
                | 
                | And yes, there is also absolutely the possibility of an
                | LLM learning some kind of metagame.
        
             | Helonomoto wrote:
             | How do you define 'understand'?
             | 
              | There is plenty of AI that learns the rules of games, like
              | AlphaZero.
              | 
              | LLMs might not have the architecture to 'learn' this, but
              | they also might. If a model optimizes over all the possible
              | moves one chess piece can make (which is not that much to
              | learn), it can easily 'move' from one game state to another
              | using just that kind of dictionary.
        
               | rscho wrote:
                | Understanding a rules-based system (chess) means being
                | able to learn non-probabilistic rules (an abstraction
                | over the concrete world). Humans are a mix of symbolic
                | and probabilistic learning, allowing them to get a huge
                | boost in performance by admitting rules. It doesn't mean
                | a human will never make an illegal move, but it means a
                | much smaller probability of an illegal move, based on
                | less training data. Asymptotically, the performance of
                | humans and of purely probabilistic systems converges. But
                | that also means that in appropriate situations, humans
                | are hugely more data-efficient.
        
               | david-gpu wrote:
               | _> in appropriate situations, humans are hugely more
               | data-efficient_
               | 
               | After spending some years raising my children I gave up
               | the notion that humans are data efficient. It takes a
               | mind numbing amount of training to get them to learn the
               | most basic skills.
        
               | rscho wrote:
               | You could compare childhood with the training phase of a
               | model. Still think humans are not data-efficient ?
        
               | david-gpu wrote:
               | Yes, that is exactly the point I am making. It takes many
               | repetitions (epochs) to teach them anything.
        
               | rscho wrote:
               | Compared to the amount of data needed to train an even
               | remotely impressive 'AI' model , that is not even AGI and
               | hallucinates on a regular basis ? On the contrary, it
               | seems to me that humans and their children are hugely
               | efficient.
        
               | david-gpu wrote:
               | _> On the contrary, it seems to me that humans and their
               | children are hugely efficient._
               | 
               | Does a child remotely know as much as ChatGPT? Is it able
               | to reason remotely as well?
        
               | rscho wrote:
               | I'd say the kid knows more about the world than ChatGPT,
               | yes. For starters, the kid has representations of
               | concepts such as 'blue color' because eyes... ChatGPT can
               | answer difficult questions for sure, but overall I'd say
               | it's much more specialized and limited than a kid.
               | However, I also think that's mostly comparing apples and
               | oranges, and that one's judgement about that is very
               | personal. So, in the end I don't know.
        
               | chongli wrote:
               | Neither AlphaZero nor MuZero can learn the rules of chess
               | from an empty chess board and a pile of pieces. There is
               | no objective function so there's nothing to train upon.
               | 
               | That would be like alien archaeologists of the future
               | finding a chess board and some pieces in a capsule
               | orbiting Mars after the total destruction of Earth and
               | all recorded human thought. The archaeologists could
               | invent their own games to play on the chess board but
               | they'd have no way of ever knowing they were playing
               | chess.
        
               | BurningFrog wrote:
               | AlphaZero was given the rules of the game, but it figured
               | out how to beat everyone else all by itself!
        
               | rscho wrote:
               | All by itself, meaning playing against itself...
               | 
               | Interestingly, Bobby Fischer did it in the same way.
               | Maybe AlphaZero also hates chess ? :-)
        
             | fragmede wrote:
             | The illegal moves are interesting as it goes to
             | "understanding". In children learning to play chess, how
             | often do they try and make illegal moves? When first
             | learning the game I remember that I'd lose track of all the
             | things going on at once and try to make illegal moves, but
             | eventually the rules became second nature and I stopped
             | trying to make illegal moves. With an ELO of 1800, I'd
             | expect ChatGPT not to make any illegal moves.
        
             | sixfiveotwo wrote:
              | I think the article briefly touches on that topic at some
              | point:
             | 
             | > For one, gpt-3.5-turbo-instruct rarely suggests illegal
             | moves, even in the late game. This requires "understanding"
             | chess. If this doesn't convince you, I encourage you to
             | write a program that can take strings like 1. e4 d5 2. exd5
             | Qxd5 3. Nc3 and then say if the last move was legal.
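              | 
              | With a move-generation library that program is short - a
              | minimal sketch assuming python-chess (presumably the hard
              | part the article has in mind is doing this _without_ such
              | a library, i.e. implementing the rules yourself):
              | 
              |     import chess
              |     
              |     def last_move_was_legal(movetext: str) -> bool:
              |         # e.g. "1. e4 d5 2. exd5 Qxd5 3. Nc3"
              |         sans = [t for t in movetext.split()
              |                 if not t.endswith(".")]
              |         board = chess.Board()
              |         for san in sans[:-1]:
              |             board.push_san(san)  # assumes prefix is legal
              |         try:
              |             board.parse_san(sans[-1])  # raises if illegal
              |             return True
              |         except ValueError:
              |             return False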
             | 
             | However, I can't say if LLMs fall in the "statistical AI"
             | category.
        
           | pvitz wrote:
            | A system that just outputs the most probable tokens based on
            | the text it was fed, and that was trained on games played by
            | players rated above 1800, would certainly fail to output the
            | right moves in totally unlikely board positions.
        
           | Helonomoto wrote:
            | Yes, in theory it could. It depends on how it learns. Does it
            | learn by memorization or by learning the rules? That depends
            | on the architecture and the amount of 'pressure' you put on
            | it to be more efficient or not.
        
         | griomnib wrote:
          | I think at this point it's very clear LLMs aren't achieving any
          | form of "reasoning" as commonly understood. Among other
          | factors, it can be argued that true reasoning involves symbolic
          | logic and abstractions, and LLMs are next token predictors.
        
           | DiogenesKynikos wrote:
           | Effective next-token prediction requires reasoning.
           | 
           | You can also say humans are "just XYZ biological system," but
           | that doesn't mean they don't reason. The same goes for LLMs.
        
             | griomnib wrote:
             | Take a word problem for example. A child will be told the
             | first step is to translate the problem from human language
             | to mathematical notation (symbolic representation), then
             | solve the math (logic).
             | 
             | A human doesn't use next token prediction to solve word
             | problems.
        
               | Majromax wrote:
               | But the LLM isn't "using next-token prediction" to solve
               | the problem, that's only how it's evaluated.
               | 
               | The "real processing" happens through the various
               | transformer layers (and token-wise nonlinear networks),
               | where it seems as if progressively richer meanings are
               | added to each token. That rich feature set then _decodes_
               | to the next predicted token, but that decoding step is
               | throwing away a lot of information contained in the
               | latent space.
               | 
               | If language models (per Anthropic's work) can have a
               | direction in latent space correspond to the concept of
               | the Golden Gate Bridge, then I think it's reasonable
               | (albeit far from certain) to say that LLMs are performing
               | some kind of symbolic-ish reasoning.
        
               | griomnib wrote:
               | Anthropic had a vested interest in people thinking Claude
               | is reasoning.
               | 
               | However, in coding tasks I've been able to find it
               | directly regurgitating Stack overflow answers (like
               | literally a google search turns up the code).
               | 
                | Given that coding is supposed to be Claude's strength,
                | and it's clearly just parroting web data, I'm not seeing
                | any sort of "reasoning".
               | 
               | LLM may be _useful_ but they don't _think_. They've
               | already plateaued, and given the absurd energy
               | requirements I think they will prove to be far less
               | impactful than people think.
        
               | DiogenesKynikos wrote:
               | The claim that Claude is just regurgitating answers from
               | Stackoverflow is not tenable, if you've spent time
               | interacting with it.
               | 
               | You can give Claude a complex, novel problem, and it will
               | give you a reasonable solution, which it will be able to
               | explain to you and discuss with you.
               | 
               | You're getting hung up on the fact that LLMs are trained
               | on next-token prediction. I could equally dismiss human
               | intelligence: "The human brain is just a biological
               | neural network that is adapted to maximize the chance of
               | creating successful offspring." Sure, but the way it
               | solves that task is clearly intelligent.
        
               | griomnib wrote:
               | I've literally spent 100s of hours with it. I'm mystified
               | why so many people use the "you're holding it wrong"
               | explanation when somebody points out real limitations.
        
               | vidarh wrote:
               | When we've spent time with it and gotten novel code, then
               | if you claim that doesn't happen, it is natural to say
               | "you're holding it wrong". If you're just arguing it
               | doesn't happen _often enough_ to be useful to you, that
                | likely depends on your expectations and on how complex
                | the tasks you need it to carry out are.
        
               | gonab wrote:
                | In many ways, Claude feels like a miracle to me. I no
                | longer have to stress over semantics, or search for
                | patterns that I can recognize and work with but have
                | never actually coded myself in that language. Now I don't
                | have to waste energy looking up things that I find
                | boring.
        
               | int_19h wrote:
               | You might consider that other people have also spent
               | hundreds of hours with it, and have seen it correctly
               | solve tasks that cannot be explained by regurgitating
               | something from the training set.
               | 
               | I'm not saying that your observations aren't correct, but
               | this is not a binary. It is entirely possible that the
               | tasks you observe the models on are exactly the kind
               | where they tend to regurgitate. But that doesn't mean
               | that it is all they can do.
               | 
               | Ultimately, the question is whether there is a "there"
               | there at all. Even if 9 times out of 10, the model
               | regurgitates, but that one other time it can actually
               | reason, that means that it is _capable_ of reasoning in
               | principle.
        
               | vrighter wrote:
               | The LLM isn't solving the problem. The LLM is just
               | predicting the next word. It's not "using next-token
               | prediction to solve a problem". It has no concept of
               | "problem". All it can do is predict 1 (one) token that
               | follows another provided set. That running this in a loop
               | provides you with bullshit (with bullshit defined here as
               | things someone or something says neither with good nor
               | bad intent, but just with complete disregard for any
               | factual accuracy or lack thereof, and so the information
               | is unreliable for everyone) does not mean it is thinking.
        
               | DiogenesKynikos wrote:
               | All the human brain does is determine how to fire some
               | motor neurons. No, it does not reason.
               | 
               | No, the human brain does not "understand" language. It
               | just knows how to control the firing of neurons that
                | control the vocal cords, in order to maximize an
               | endocrine reward function that has evolved to maximize
               | biological fitness.
               | 
               | I can speak about human brains the same way you speak
               | about LLMs. I'm sure you can spot the problem in my
               | conclusions: just because the human brain is "only"
               | firing neurons, it does actually develop an understanding
               | of the world. The same goes for LLMs and next-word
               | prediction.
        
               | mhh__ wrote:
               | I don't see why this isn't a good model for how human
               | reasoning happens either, certainly as a first-order
               | assumption (at least).
        
               | TeMPOraL wrote:
               | > _A human doesn't use next token prediction to solve
               | word problems._
               | 
               | Of course they do, unless they're particularly
               | conscientious noobs that are able to repeatedly execute
               | the "translate to mathematical notation, then solve the
               | math" algorithm, without going insane. But those people
               | are the exception.
               | 
               | Everyone else either gets bored half-way through reading
               | the problem, or has already done dozens of similar
               | problems before, or both - and jump straight to "next
               | token prediction", aka. searching the problem space "by
               | feels", and checking candidate solutions to sub-problems
               | on the fly.
               | 
               | This kind of methodical approach you mention? We leave
               | that to symbolic math software. The "next token
               | prediction" approach is something we call
               | "experience"/"expertise" and a source of the thing we
               | call "insight".
        
               | vidarh wrote:
               | Indeed. Work on any project that requires humans to carry
               | out largely repetitive steps, and a large part of the
               | problem involves how to put processes around people to
               | work around humans "shutting off" reasoning and going
               | full-on automatic.
               | 
               | E.g. I do contract work on an LLM-related project where
               | one of the systemic changes introduced - in addition to
               | multiple levels of quality checks - is to force to make
               | people input a given sentence word for word followed by a
               | word from a set of 5 or so, and a _minority_ of the
               | submissions get that sentence correct including the final
               | word despite the system refusing to let you submit unless
               | the initial sentence is correct. Seeing the data has been
               | an absolutely shocking indictment of _human_ reasoning.
               | 
               | These are submissions from a pool of people who have
               | passed reasoning tests...
               | 
               | When I've tested the process myself as well, it takes
               | only a handful of steps before the tendency is to "drift
               | off" and start replacing a word here and there and fail
               | to complete even the initial sentence without a
               | correction. I shudder to think how bad the results would
               | be if there wasn't that "jolt" to try to get people back
               | to paying attention.
               | 
               | Keeping humans consistently carrying out a learned
               | process is incredibly hard.
        
               | fragmede wrote:
                | Is that based on a rigorous understanding of how humans
               | think, derived from watching people (children) learn to
               | solve word problems? How do thoughts get formed? Because
               | I remember being given word problems with extra
               | information, and some children trying to shove that
               | information into a math equation despite it not being
               | relevant. The "think things though" portion of ChatGPT
               | o1-preview is hidden from us, so even though a o1-preview
               | can solve word problems, we don't know how it internally
                | computes to arrive at that answer. But do we _really_
                | know how we do it? We can't even explain consciousness
                | in the first place.
        
           | brookst wrote:
            | > Among other factors, it can be argued that true reasoning
            | involves symbolic logic and abstractions, and LLMs are next
           | token predictors.
           | 
           | I think this is circular?
           | 
           | If an LLM is "merely" predicting the next tokens to put
           | together a description of symbolic reasoning and
            | abstractions... how is that different from really exercising
           | those things?
           | 
           | Can you give me an example of symbolic reasoning that I can't
           | handwave away as just the likely next words given the
           | starting place?
           | 
           | I'm not saying that LLMs have those capabilities; I'm
            | questioning whether there is any utility in distinguishing
            | the "actual" capability from identical outputs.
        
             | griomnib wrote:
             | Mathematical reasoning is the most obvious area where it
             | breaks down. This paper does an excellent job of proving
             | this point with some elegant examples:
             | https://arxiv.org/pdf/2410.05229
        
               | brookst wrote:
               | Sure, but _people_ fail at mathematical reasoning. That
                | doesn't mean people are incapable of reasoning.
               | 
               | I'm not saying LLMs are perfect reasoners, I'm
               | questioning the value of asserting that they cannot
               | reason with some kind of "it's just text that looks like
               | reasoning" argument.
        
               | dartos wrote:
               | People can communicate each step, and review each step as
               | that communication is happening.
               | 
               | LLMs must be prompted for everything and don't act on
               | their own.
               | 
               | The value in the assertion is in preventing laymen from
               | seeing a statistical guessing machine be correct and
               | assuming that it always will be.
               | 
               | It's dangerous to put so much faith in what in reality is
               | a very good guessing machine. You can ask it to retrace
                | its steps, but it's just guessing at what its steps
               | were, since it didn't actually go through real reasoning,
               | just generated text that reads like reasoning steps.
        
               | brookst wrote:
               | > since it didn't actually go through real reasoning,
               | just generated text that reads like reasoning steps.
               | 
               | Can you elaborate on the difference? Are you bringing
               | sentience into it? It kind of sounds like it from "don't
               | act on their own". But reasoning and sentience are wildly
               | different things.
               | 
               | > It's dangerous to put so much faith in what in reality
               | is a very good guessing machine
               | 
               | Yes, exactly. That's why I think it is good we are
               | supplementing fallible humans with fallible LLMs; we
               | already have the processes in place to assume that not
               | every actor is infallible.
        
               | david-gpu wrote:
               | So true. People who argue that we should not trust/use
               | LLMs because they sometimes get it wrong are holding them
               | to a higher standard than people -- we make mistakes too!
               | 
               | Do we blindly trust or believe every single thing we hear
               | from another person? Of course not. But hearing what they
               | have to say can still be fruitful, and it is not like we
               | have an oracle at our disposal who always speaks the
               | absolute truth, either. We make do with what we have, and
               | LLMs are another tool we can use.
        
               | vundercind wrote:
               | > Can you elaborate on the difference?
               | 
               | They'll fail in different ways than something that thinks
               | (and doesn't have some kind of major disease of the brain
               | going on) and often smack in the middle of _appearing_ to
               | think.
        
               | ben_w wrote:
               | > People can communicate each step, and review each step
               | as that communication is happening.
               | 
               | Can, but don't by default. Just as LLMs can be asked for
               | chain of thought, but the default for most users is just
               | chat.
               | 
               | This behaviour of humans is why we software developers
               | have daily standup meetings, version control, and code
               | review.
               | 
               | > LLMs must be prompted for everything and don't act on
               | their own
               | 
               | And this is why we humans have task boards like JIRA, and
               | quarterly goals set by management.
        
               | int_19h wrote:
               | A human brain in a vat doesn't act on its own, either.
        
               | vidarh wrote:
               | LLMs "don't act on their own" because we only reanimate
               | them when we want something from them. Nothing stops you
               | from wiring up an LLM to keep generating, and feeding it
               | sensory inputs to keep it processing. In other words,
               | that's a limitation of the harness we put them in, not of
               | LLMs.
               | 
               | As for people communicating each step, we have plenty of
               | experiments showing that it's pretty _hard_ to get people
               | to reliably report what they actually do as opposed to a
                | rationalization of what they've actually done (e.g.
               | split brain experiments have shown both your brain halves
               | will happily lie about having decided to do things they
               | haven't done if you give them reason to think they've
               | done something)
               | 
                | You can categorically _not_ trust people's reasoning about
               | "why" they've made a decision to reflect what actually
               | happened in their brain to make them do something.
        
               | NBJack wrote:
               | The idea is the average person would, sure. A
                | mathematically oriented person would fare far better.
               | 
               | Throw all the math problems you want at a LLM for
               | training; it will still fail if you step outside of the
               | familiar.
        
               | ben_w wrote:
               | > it will still fail if you step outside of the familiar.
               | 
               | To which I say:
               | 
               | so:do:humacns
        
               | trashtester wrote:
               | but humacn hubris prewent them from reaclizing thhact
        
               | ben_w wrote:
               | indeed:it:is:hubris
               | 
               | i:hacwe:often:seen:in:diskussions:suk:acs:this:klacims:th
               | act:humacn:minds:kacn:do:impossible:things:suk:acs:genera
               | cllyr:solwe:the:haclting:problem
               | 
               | edit: Snap, you said the same in your other comment :)
        
               | trashtester wrote:
               | Switching back to latin letters...
               | 
               | It seems to me that the idea of the Universal Turing
               | Machine is quite misleading for a lot of people, such as
               | David Deutsch.
               | 
               | My impression is that the amount of compute to solve most
               | problems that can really only be solved by Turing
               | Machines is always going to remain inaccessible (unless
                | they're trivially small).
               | 
               | But at the same time, the universe seems to obey a
               | principle of locality (as long as we only consider the
               | Quantum Wave Function, and don't postulate that it
               | collapses).
               | 
               | Also, the quantum fields are subject to some simple
               | (relative to LLMs) geometric symmetries, such as
               | invariance under the U(1)xSU(2)xSU(3) group.
               | 
               | As it turns out, similar group symmetries can be found in
               | all sorts of places in the real world.
               | 
               | Also it seems to me that at some level, both ANN's and
               | biological brains set up a similar system to this
               | physical reality, which may explain why brains develop
               | this way and why both kinds are so good at simulating at
               | least some aspects of the physical world, such as
               | translation, rotation, some types of deformation,
               | gravity, sound, light etc.
               | 
               | And when biological brains that initially developed to
                | predict the physical world are then used to create
                | language, that language is bound to use the same type of
                | machinery. And this may be why LLMs do language so well
               | with a similar architecture.
        
               | vidarh wrote:
               | There are _no_ problems that can be solved only by Turing
               | Machines as any Turing complete system can simulate any
               | other Turing complete system.
               | 
                | The point of UTMs is not to ever _use them_, but that
               | they're a shortcut to demonstrating Turing completeness
               | because of their simplicity. Once you've proven Turing
               | completeness, you've proven that your system can compute
               | all Turing computable functions _and simulate any other
                | Turing complete system_, and we _don't know of any
               | computable functions outside this set_.
        
               | Workaccount2 wrote:
               | Maybe I am not understanding the paper correctly, but it
               | seems they tested "state of the art models" which is
               | almost entirely composed of open source <27B parameter
               | models. Mostly 8B and 3B models. This is kind of like
               | giving algebra problems to 7 year olds to "test human
               | algebra ability."
               | 
               | If you are holding up a 3B parameter model as an example
               | of "LLM's can't reason" I'm not sure if the authors are
               | confused or out of touch.
               | 
                | I mean, they do test 4o and o1-preview, but their
                | performance is notably absent from the paper's
                | conclusion.
        
               | dartos wrote:
               | It's difficult to reproducibly test openai models, since
               | they can change from under you and you don't have control
               | over every hyperparameter.
               | 
               | It would've been nice to see one of the larger llama
               | models though.
        
               | og_kalu wrote:
                | The results are there, they're just hidden away in the
                | appendix. The result is that those models don't actually
                | suffer drops on 4/5 of the modified benchmarks. The one
                | benchmark that does see actual drops that aren't
                | explained by margin of error is the benchmark that adds
                | "seemingly relevant but ultimately irrelevant information
                | to problems".
               | 
               | Those results are absent from the conclusion because the
               | conclusion falls apart otherwise.
        
             | dartos wrote:
             | There isn't much utility, but tbf the outputs aren't
             | identical.
             | 
             | One danger is the human assumption that, since something
             | appears to have that capability in some settings, it will
             | have that capability in all settings.
             | 
              | That's a recipe for exploding bias, as we've seen with
             | classic statistical crime detection systems.
        
             | NBJack wrote:
             | Inferring patterns in unfamiliar problems.
             | 
             | Take a common word problem in a 5th grade math text book.
             | Now, change as many words as possible; instead of two
             | trains, make it two different animals; change the location
             | to a rarely discussed town; etc. Even better, invent
             | words/names to identify things.
             | 
             | Someone who has done a word problem like that will very
             | likely recognize the logic, even if the setting is
             | completely different.
             | 
             | Word tokenization alone should fail miserably.
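              | 
              | A toy version of that perturbation: keep the underlying
              | arithmetic fixed and resample every surface detail (the
              | nouns and place names below are invented):
              | 
              |     import random
              |     
              |     NOUNS = ["trains", "llamas", "zeppelins", "snails"]
              |     PLACES = ["Vexborough", "Quillmere", "Drumlin Falls"]
              |     
              |     def make_variant():
              |         v1 = random.randint(2, 90)
              |         v2 = random.randint(2, 90)
              |         t = random.randint(1, 9)
              |         problem = (
              |             f"Two {random.choice(NOUNS)} leave "
              |             f"{random.choice(PLACES)} at the same time, "
              |             f"moving apart at {v1} km/h and {v2} km/h. "
              |             f"How far apart are they after {t} hours?"
              |         )
              |         answer = (v1 + v2) * t  # the logic never changes
              |         return problem, answer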
        
               | djmips wrote:
                | I have noticed over my life that a lot of problems end up
                | being variations on solved problems from another, more
                | familiar domain, but frustratingly they take a long time
                | to solve before you realize this was just like that thing
                | you had already solved. Nevertheless, I do feel that
                | humans benefit from identifying meta-patterns, but as the
                | chess example shows, even we might be weak in unfamiliar
                | areas.
        
               | Propelloni wrote:
               | Learn how to solve one problem and apply the approach,
               | logic and patterns to different problems. In German
               | that's called "Transferleistung" (roughly "transfer
               | success") and a big thing at advanced schools. Or, at
               | least my teacher friends never stop talking about it.
               | 
               | We get better at it over time, as probably most of us can
               | attest.
        
               | roywiggins wrote:
               | A lot of LLMs do weird things on the question "A farmer
               | needs to get a bag of grain across a river. He has a boat
               | that can transport himself and the grain. How does he do
               | this?"
               | 
               | (they often pattern-match on the farmer/grain/sheep/fox
               | puzzle and start inventing pointless trips ("the farmer
               | returns alone. Then, he crosses again.") in a way that a
               | human wouldn't)
        
             | vidarh wrote:
             | It is. As it stands, throw a loop around an LLM and act as
             | the tape, and an LLM can obviously be made Turing complete
             | (you can get it to execute all the steps of a minimal
             | Turing machine, so drop temperature so its deterministic,
             | and you have a Turing complete system). To argue that they
             | _can 't_ be made to reason is effectively to argue that
             | there is some unknown aspect of the brain that allows us to
             | compute functions not in the Turing computable set, which
             | would be an astounding revelation if it could be proven.
             | Until someone comes up with evidence for that, it is more
             | reasonable to assume that it is a question of whether we
             | have yet found a training mechanism that can lead to
             | reasoning or not, not whether or not LLMs can learn to.
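              | 
              | A minimal sketch of that harness: the loop and the tape
              | live outside the model, and each transition is a single
              | lookup. Here the lookup is a hard-coded table for the
              | classic 3-state busy beaver so the loop actually runs; in
              | the setup described above, that one lookup would instead
              | be a low-temperature LLM call ("state A, symbol 1: what do
              | you write, which way do you move, what state next?"), with
              | the context acting as IO:
              | 
              |     from collections import defaultdict
              |     
              |     # (state, symbol) -> (write, move, next_state)
              |     RULES = {
              |         ("A", 0): (1, +1, "B"),
              |         ("A", 1): (1, +1, "H"),  # halt
              |         ("B", 0): (0, +1, "C"),
              |         ("B", 1): (1, +1, "B"),
              |         ("C", 0): (1, -1, "C"),
              |         ("C", 1): (1, -1, "A"),
              |     }
              |     
              |     def run(max_steps=1000):
              |         tape = defaultdict(int)  # the IO / "context"
              |         pos, state = 0, "A"
              |         for _ in range(max_steps):
              |             if state == "H":
              |                 break
              |             # in the LLM version, this lookup is the model
              |             write, move, state = RULES[(state, tape[pos])]
              |             tape[pos] = write
              |             pos += move
              |         return sum(tape.values())
              |     
              |     print(run())  # -> 6 ones written before halting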
        
               | vundercind wrote:
               | It doesn't follow that because a system is Turing
               | complete the _approach_ being used will eventually
               | achieve reasoning.
        
               | vidarh wrote:
               | No, but that was also not the claim I made.
               | 
               | The point is that as the person I replied to pointed out,
                | that LLMs are "next token predictors" is a meaningless
                | dismissal, as they can be both next token predictors and
                | Turing complete. Unless reasoning requires functions
                | outside the Turing computable set (and we know of no way
                | of constructing such functions, nor any way for them to
                | exist), calling them "next token predictors" says nothing
                | about their capabilities.
        
           | hathawsh wrote:
           | I think the question we're grappling with is whether token
           | prediction may be more tightly related to symbolic logic than
           | we all expected. Today's LLMs are so uncannily good at faking
           | logic that it's making me ponder logic itself.
        
             | griomnib wrote:
             | I felt the same way about a year ago, I've since changed my
             | mind based on personal experience and new research.
        
               | hathawsh wrote:
               | Please elaborate.
        
               | dartos wrote:
               | I work in the LLM search space and echo OC's sentiment.
               | 
               | The more I work with LLMs the more the magic falls away
               | and I see that they are just very good at guessing text.
               | 
               | It's very apparent when I want to get them to do a very
               | specific thing. They get inconsistent about it.
        
               | griomnib wrote:
               | Pretty much the same, I work on some fairly specific
               | document retrieval and labeling problems. After some
               | initial excitement I've landed on using LLM to help train
               | smaller, more focused, models for specific tasks.
               | 
               | Translation is a task I've had good results with,
               | particularly mistral models. Which makes sense as it's
               | basically just "repeat this series of tokens with
               | modifications".
               | 
               | The closed models are practically useless from an
               | empirical standpoint as you have no idea if the model you
               | use Monday is the same as Tuesday. "Open" models at least
               | negate this issue.
               | 
               | Likewise, I've found LLM code to be of poor quality. I
               | think that has to do with being a very experienced and
               | skilled programmer. What the LLM produce is at best the
               | top answer in stack overflow-level skill. The top answers
               | on stack overflow are typically not optimal solutions,
               | they are solutions up voted by novices.
               | 
               | I find LLM code is not only bad, but when I point this
               | out the LLM then "apologizes" and gives better code. My
               | worry is inexperienced people can't even spot that and
               | won't get this best answer.
               | 
               | In fact try this - ask an LLM to generate some code then
               | reply with "isn't there a simpler, more maintainable, and
               | straightforward way to do this?"
        
               | blharr wrote:
               | There have even been times where an LLM will spit out
               | _the exact same code_ and you have to give it the answer
               | or a hint how to do it better
        
               | david-gpu wrote:
               | Yeah. I had the same experience doing code reviews at
               | work. Sometimes people just get stuck on a problem and
               | can't think of alternative approaches until you give them
               | a good hint.
        
               | david-gpu wrote:
               | _> I've found LLM code to be of poor quality_
               | 
               | Yes. That was my experience with most human-produced code
               | I ran into professionally, too.
               | 
               |  _> In fact try this - ask an LLM to generate some code
               | then reply with "isn't there a simpler, more
               | maintainable, and straightforward way to do this?"_
               | 
               | Yes, that sometimes works with humans as well. Although
               | you usually need to provide more specific feedback to
               | nudge them in the right track. It gets tiring after a
               | while, doesn't it?
        
               | dartos wrote:
               | What is the point of your argument?
               | 
               | I keep seeing people say "yeah well I've seen humans that
               | can't do that either."
               | 
               | What's the point you're trying to make?
        
               | david-gpu wrote:
               | The point is that the person I responded to criticized
               | LLMs for making the exact sort of mistakes that
               | professional programmers make all the time:
               | 
               |  _> I've found LLM code to be of poor quality. I think
               | that has to do with being a very experienced and skilled
               | programmer. What the LLM produce is at best the top
               | answer in stack overflow-level skill. The top answers on
               | stack overflow are typically not optimal solutions_
               | 
               | Most professional developers are unable to produce code
                | up to the standard of _"the top answer in stack
               | overflow"_ that the commenter was complaining about, with
               | the additional twist that most developers' breadth of
               | knowledge is going to be limited to a very narrow range
               | of APIs/platforms/etc. whereas these LLMs are able to be
               | comparable to decent programmers in just about any
               | API/language/platform, _all at once_.
               | 
               | I've written code for thirty years and I wish I had the
               | breadth and depth of knowledge of the free version of
               | ChatGPT, even if I can outsmart it in narrow domains. It
               | is already very decent and I haven't even tried more
               | advanced models like o1-preview.
               | 
               | Is it perfect? No. But it is arguably better than most
               | programmers in at least some aspects. Not every
               | programmer out there is Fabrice Bellard.
        
               | dartos wrote:
               | But LLMs aren't people. And people do more than just
               | generate code.
               | 
               | The comparison is weird and dehumanizing.
               | 
               | I, personally, have never worked with someone who
               | consistently puts out code that is as bad as LLM
               | generated code either.
               | 
               | > Most professional developers are unable to produce code
               | up to the standard of "the top answer in stack overflow"
               | 
               | How could you possibly know that?
               | 
               | All these types of arguments come from a belief that your
               | fellow human is effectively useless.
               | 
               | It's sad and weird.
        
               | david-gpu wrote:
               | _> > > Most professional developers are unable to produce
               | code up to the standard of "the top answer in stack
               | overflow"_
               | 
               |  _> How could you possibly know that?_
               | 
               | I worked at four multinationals and saw a bunch of their
                | code. Most of it wasn't _"the top answer in stack
               | overflow"_. Was some of the code written by some of the
               | people better than that? Sure. And a lot of it wasn't, in
               | my opinion.
               | 
               |  _> All these types of arguments come from a belief that
               | your fellow human is effectively useless._
               | 
               | Not at all. I think the top answers in stack overflow
               | were written by humans, after all.
               | 
               |  _> It's sad and weird._
               | 
               | You are entitled to your own opinion, no doubt about it.
        
               | Sharlin wrote:
               | > In fact try this - ask an LLM to generate some code
               | then reply with "isn't there a simpler, more
               | maintainable, and straightforward way to do this?"
               | 
               | These are called "code reviews" and we do that amongst
               | human coders too, although they tend to be less Socratic
               | in nature.
               | 
               | I think it has been clear from day one that LLMs don't
               | display superhuman capabilities, and a human expert will
               | always outdo one in tasks related to their particular
               | field. But the _breadth_ of their knowledge is
                | unparalleled. They're the ultimate jacks-of-all-trades,
               | and the astonishing thing is that they're even "average
               | Joe" good at a vast number of tasks, never mind "fresh
               | college graduate" good.
               | 
               | The _real_ question has been: what happens when you scale
               | them up? As of now it appears that they scale decidedly
               | sublinearly, but it was not clear at all two or three
               | years ago, and it was definitely worth a try.
        
               | vidarh wrote:
               | I do contract work in the LLM space which involves me
              | seeing a lot of human prompts, and it's made the magic of
              | _human_ reasoning fall away: humans are shockingly bad at
              | reasoning in the large.
               | 
               | One of the things I find extremely frustrating is that
               | almost no research on LLM reasoning ability _benchmarks
               | them against average humans_.
               | 
               | Large proportions of humans struggle to comprehend even a
               | moderately complex sentence with any level of precision.
        
           | xg15 wrote:
           | I don't want to say that LLMs can reason, but this kind of
            | argument always feels too shallow to me. It's kind of like
           | saying that bats cannot possibly fly because they have no
           | feathers or that birds cannot have higher cognitive functions
           | because they have no neocortex. (The latter having been an
            | actual longstanding belief in science which was disproven
            | only a decade or so ago.)
           | 
           | The "next token prediction" is just the API, it doesn't tell
           | you anything about the complexity of the thing that actually
            | does the prediction. (I think there is some temptation to
           | view LLMs as glorified Markov chains - they aren't. They are
           | just "implementing the same API" as Markov chains).
           | 
           | There is still a limit how much an LLM could reason during
           | prediction of a single token, as there is no recurrence
           | between layers, so information can only be passed "forward".
           | But this limit doesn't exist if you consider the generation
           | of the entire text: Suddenly, you do have a recurrence, which
           | is the prediction loop itself: The LLM can "store"
           | information in a generated token and receive that information
           | back as input in the next loop iteration.
           | 
           | I think this structure makes it quite hard to really say how
           | much reasoning is possible.
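            | 
            | A sketch of that loop, with a hypothetical next_token()
            | standing in for one forward pass: the only state that
            | survives between iterations is the text itself, so anything
            | the model "stores" has to be written out as a token and read
            | back in on the next pass:
            | 
            |     def next_token(tokens):
            |         # hypothetical: one forward pass over the full
            |         # sequence, returning a single predicted token
            |         raise NotImplementedError
            |     
            |     def generate(prompt_tokens, max_new=256, eos="<eos>"):
            |         tokens = list(prompt_tokens)
            |         for _ in range(max_new):
            |             tok = next_token(tokens)  # no hidden state kept
            |             if tok == eos:
            |                 break
            |             tokens.append(tok)  # ...except what lands here
            |         return tokens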
        
             | griomnib wrote:
             | I agree with most of what you said, but "LLM can reason" is
             | an _insanely huge claim_ to make and most of the "evidence"
             | so far is a mixture of corporate propaganda, "vibes", and
             | the like.
             | 
             | I've yet to see anything close to the level of evidence
             | needed to support the claim.
        
               | Propelloni wrote:
               | It's largely dependent on what we think "reason" means,
               | is it not? That's not a pro argument from me, in my world
               | LLMs are stochastic parrots.
        
               | vidarh wrote:
               | To say _any specific_ LLM can reason is a somewhat
               | significant claim.
               | 
                | To say _LLMs as a class_ are _architecturally able to be
                | trained to reason_ is - in the complete absence of
                | evidence to suggest humans can compute functions outside
                | the Turing computable - effectively only an argument
                | that they can implement a minimal Turing machine, given
                | that the context is used as IO. Given the size of the
                | rules needed to implement the smallest known universal
                | Turing machines, it'd take a _really_ tiny model for
                | them to be unable to.
               | 
               | Now, you can then argue that it doesn't "count" if it
               | needs to be fed a huge program step by step via IO, but
                | if it _can_ do something that way, I'd need some really
                | convincing evidence for why the static elements of those
               | steps could not progressively be embedded into a model.
        
               | wizzwizz4 wrote:
               | No such evidence exists: we can construct such a model
               | manually. I'd need some quite convincing evidence that
               | any given training process is approximately equivalent to
               | that, though.
        
               | vidarh wrote:
               | That's fine. I've made no claim about any given training
               | process. I've addressed the annoying repetitive dismissal
               | via the "but they're next token predictors" argument. The
               | point is that being next token predictors does not limit
               | their theoretical limits, so it's a meaningless argument.
        
               | hackinthebochs wrote:
               | Then say "no one has demonstrated that LLMs can reason"
               | instead of "LLMs can't reason, they're just token
               | predictors". At least that would be intellectually
               | honest.
        
               | Xelynega wrote:
               | By that logic isn't it "intellectually dishonest" to say
               | "dowsing rods don't work" if the only evidence we have is
               | examples of them not working?
        
               | hackinthebochs wrote:
                | Not really. We know enough about how the world works to
                | know that dowsing rods have no plausible mechanism of
                | action. We do not know enough about
                | intelligence/reasoning or how
               | brains work to know that LLMs definitely aren't doing
               | anything resembling that.
        
               | int_19h wrote:
               | "LLM can reason" is trivially provable - all you need to
               | do is give it a novel task (e.g. a logical puzzle) that
               | requires reasoning, and observe it solving that puzzle.
        
               | staticman2 wrote:
               | How do you intend to show your task is novel?
        
             | vidarh wrote:
             | > But this limit doesn't exist if you consider the
             | generation of the entire text: Suddenly, you do have a
             | recurrence, which is the prediction loop itself: The LLM
             | can "store" information in a generated token and receive
             | that information back as input in the next loop iteration.
             | 
             | Now consider that you can trivially show that you can get
             | an LLM to "execute" on step of a Turing machine where the
             | context is used as an IO channel, and will have shown it to
             | be Turing complete.
             | 
             | > I think this structure makes it quite hard to really say
             | how much reasoning is possible.
             | 
             | Given the above, I think any argument that they can't be
             | made to reason is effectively an argument that humans can
             | compute functions outside the Turing computable set, which
             | we haven't the slightest shred of evidence to suggest.
        
               | Xelynega wrote:
                | It's kind of ridiculous to say that functions computable
                | by Turing machines are the only ones that can exist (and
                | that trained LLMs are Turing machines).
                | 
                | What evidence do you have for either of these? I don't
                | recall any proof that "functions computable by Turing
                | machines" is equal to the set of functions that can
                | exist. And I don't recall pretrained LLMs being proven to
                | be Turing machines.
        
               | vidarh wrote:
               | We don't have hard evidence that no other functions exist
               | that are computable, but we have no examples of any such
               | functions, and no theory for how to even begin to
               | formulate any.
               | 
               | As it stands, Church, Turing, and Kleene have proven that
               | the set of generally recursive functions, the lambda
               | calculus, and the Turing computable set are equivalent,
               | and no attempt to categorize computable functions outside
               | those sets has succeeded since.
               | 
               | If you want your name in the history books, all you need
               | to do is find a _single_ function that humans can compute
               | that a is outside the Turing computable set.
               | 
               | As for LLMs, you can trivially test that they can act
               | like a Turing machine if you give them a loop and use the
               | context to provide access to IO: Turn the temperature
               | down, and formulate a prompt to ask one to follow the
               | rules of the simplest known Turing machine. A reminder
               | that the simplest known Turing machine is a 2-state,
               | 3-symbol Turing machine. It's _quite hard_ to find a
               | system that can carry out any kind of complex function
               | that can 't act like a Turing machine if you allow it to
               | loop and give it access to IO.
        
           | nuancebydefault wrote:
           | After reading the article I am more convinced it does
           | reasoning. The base model's reasoning capabilities are partly
           | hidden by the chatty derived model's logic.
        
           | Uehreka wrote:
           | Does anyone have a hard proof that language doesn't somehow
           | encode reasoning in a deeper way than we commonly think?
           | 
           | I constantly hear people saying "they're not intelligent,
           | they're just predicting the next token in a sequence", and
           | I'll grant that I don't think of what's going on in my head
           | as "predicting the next token in a sequence", but I've seen
           | enough surprising studies about the nature of free will and
           | such that I no longer put a lot of stock in what seems
           | "obvious" to me about how my brain works.
        
             | spiffytech wrote:
             | > I'll grant that I don't think of what's going on in my
             | head as "predicting the next token in a sequence"
             | 
             | I can't speak to whether LLMs can think, but current
             | evidence indicates humans can perform complex reasoning
             | without the use of language:
             | 
             | > Brain studies show that language is not essential for the
             | cognitive processes that underlie thought.
             | 
             | > For the question of how language relates to systems of
             | thought, the most informative cases are cases of really
             | severe impairments, so-called global aphasia, where
             | individuals basically lose completely their ability to
             | understand and produce language as a result of massive
             | damage to the left hemisphere of the brain. ...
             | 
             | > You can ask them to solve some math problems or to
             | perform a social reasoning test, and all of the
             | instructions, of course, have to be nonverbal because they
             | can't understand linguistic information anymore. ...
             | 
             | > There are now dozens of studies that we've done looking
             | at all sorts of nonlinguistic inputs and tasks, including
             | many thinking tasks. We find time and again that the
             | language regions are basically silent when people engage in
             | these thinking activities.
             | 
             | https://www.scientificamerican.com/article/you-dont-need-
             | wor...
        
               | SAI_Peregrinus wrote:
               | I'd say that's a separate problem. It's not "is the use
               | of language necessary for reasoning?" which seems to be
               | obviously answered "no", but rather "is the use of
               | language sufficient for reasoning?".
        
               | cortic wrote:
               | > ..individuals basically lose completely their ability
               | to understand and produce language as a result of massive
               | damage to the left hemisphere of the brain. ...
               | 
               | The right hemisphere almost certainly uses internal
               | 'language' either consciously or unconsciously to define
               | objects, actions, intent.. the fact that they passed
               | these tests is evidence of that. The brain damage is
               | simply stopping them expressing that 'language'. But the
               | existence of language was expressed in the completion of
               | the task..
        
           | Scarblac wrote:
           | This is the argument that submarines don't really "swim" as
           | commonly understood, isn't it?
        
             | Jensson wrote:
             | And planes doesn't fly like a bird, it has very different
             | properties and many things birds can do can't be done by a
             | plane. What they do is totally different.
        
             | saithound wrote:
             | I think so, but the badness of that argument is context-
             | dependent. How about the hypothetical context where 70k+
             | startups are promising investors that they'll win the 50
             | meter freestyle in 2028 by entering a fine-tuned USS Los
             | Angeles?
        
           | Sharlin wrote:
           | What proof do you have that human reasoning involves
           | "symbolic logic and abstractions"? In daily life, that is,
           | not in a math exam. We know that people are actually quite
           | bad at reasoning [1][2]. And it definitely doesn't seem right
           | to define "reasoning" as only the sort that involves formal
           | logic.
           | 
           | [1] https://en.wikipedia.org/wiki/List_of_fallacies
           | 
           | [2] https://en.wikipedia.org/wiki/List_of_cognitive_biases
        
             | trashtester wrote:
             | Some very intelligent people, including Godel and Penrose,
             | seem to think that humans have some kind of ability to
             | arrive directly on correct propositions in ways that bypass
             | the incompleteness theorem. Penrose seems to think this can
             | be due to Quantum Mechanics, Goder may have thought it came
             | frome something divine.
             | 
             | While I think they're both wrong, a lot of people seem to
             | think they can do abstract reasoning for symbols or symbol-
             | like structures without having to use formal logic for
             | every step.
             | 
             | Personally, I think such beliefs about concepts like
             | consciousness, free will, qualia and emotions emerge from
             | how the human brain includes a simplified version of itself
             | when setting up a world model. In fact, I think many such
             | elements are pretty much hard coded (by our genes) into the
             | machinery that human brains use to generate such world
             | models.
             | 
             | Indeed, if this is true, concepts like consciousness, free
             | will, various qualia and emotions can in fact be considered
             | "symbols" within this world model. While the full reality
             | of what happens in the brain when we exercise what we
             | represent by "free will" may be very complex, the world
             | model may assign a boolean to each action we (and others)
             | perform, where the action is either grouped into "voluntary
             | action" or "involuntary action".
             | 
             | This may not always be accurate, but it saves a lot of
             | memory and compute costs for the brain when it tries to
             | optimize for the future. This optimization can (and usually
             | is) called "reasoning", even if the symbols have only an
             | approximated correspondence with physical reality.
             | 
             | For instance, if in our world model somebody does something
             | against us and we deem that it was done exercising "free
             | will", we will be much more likely to punish them than if
             | we categorize the action as "forced".
             | 
             | And on top of these basic concepts within our world model,
             | we tend to add a lot more, also in symbol form, to enable
             | us to use symbolic reasoning to support our interactions
             | with the world.
        
               | TeMPOraL wrote:
               | > _While I think they 're both wrong, a lot of people
               | seem to think they can do abstract reasoning for symbols
               | or symbol-like structures without having to use formal
               | logic for every step._
               | 
               | Huh.
               | 
               | I don't know bout incompleteness theorem, but I'd say
               | it's pretty obvious (both in introspection and in
               | observation of others) that people _don 't_ naturally use
               | formal logic for _anything_ , they only painstakingly
               | _emulate_ it when forced to.
               | 
               | If anything, "next token prediction" seems much closer to
               | how human thinking works than anything even remotely
               | formal or symbolic that was proposed before.
               | 
               | As for hardcoding things in world models, one thing that
               | LLMs do conclusively prove is that you can create a
               | coherent system capable of encoding and working with
               | meaning of concepts without providing anything that looks
               | like explicit "meaning". Meaning is not inherent to a
               | term, or a concept expressed by that term - it exists in
               | the relationships between an the concept, and all other
               | concepts.
        
               | ben_w wrote:
               | > I don't know bout incompleteness theorem, but I'd say
               | it's pretty obvious (both in introspection and in
               | observation of others) that people don't naturally use
               | formal logic for anything, they only painstakingly
               | emulate it when forced to.
               | 
               | Indeed, this is one reason why I assert that Wittgenstein
               | was wrong about the nature of human thought when writing:
               | 
               | """If there were a verb meaning "to believe falsely," it
               | would not have any significant first person, present
               | indicative."""
               | 
               | Sure, it's logically incoherent for us to have such a
               | word, but there's what seems like several different ways
               | for us to hold contradictory and incoherent beliefs
               | within our minds.
        
               | trashtester wrote:
               | ... but I'd say it's pretty obvious (both in
               | introspection and in observation of others) that people
               | don't naturally use formal logic for anything ...
               | 
               | Yes. But some place too much confidence in how "rational"
               | their intuition is, including some of the most
               | intelligent minds the world has seen.
               | 
               | Specifically, many operate as if their intuition (that
               | they treat as completely rational) has some kind of
               | supernatural/magic/divine origin, including many who
               | (imo) SHOULD know better.
               | 
               | While I think (like you do) that this intuition has a lot
               | in common with LLM's and other NN architectures than pure
               | logic, or even the scientific method.
        
               | raincole wrote:
               | > Some very intelligent people, including Godel and
               | Penrose, seem to think that humans have some kind of
               | ability to arrive directly on correct propositions in
               | ways that bypass the incompleteness theorem. Penrose
               | seems to think this can be due to Quantum Mechanics,
               | Goder may have thought it came frome something divine.
               | 
               | Did Godel really say this? It sounds like quite a stretch
               | of incompleteness theorem.
               | 
               | It's like saying because halting problem is undecidable,
               | but humans can debug programs, therefore human brains
               | must having some supernatural power.
        
           | olalonde wrote:
           | This argument reminds me the classic "intelligent design"
           | critique of evolution: "Evolution can't possibly create an
           | eye; it only works by selecting random mutations."
           | Personally, I don't see why a "next token predictor" couldn't
           | develop the capability to reason and form abstractions.
        
         | NitpickLawyer wrote:
         | Interesting tidbit I once learned from a chess livestream. Even
         | human super-GMs have a really hard time "scoring" or "solving"
         | extremely weird positions. That is, positions that shouldn't
         | come from logical opening - mid game - end game regular play.
         | 
         | It's absolutely amazing to see a super-GM (in that case it was
         | Hikaru) see a position, and basically "play-by-play" it from
         | the beginning, to show people how they got in that position. It
         | wasn't his game btw. But later in that same video when asked he
         | explained what I wrote in the first paragraph. It works with
         | proper games, but it rarely works with weird random chess
         | puzzles, as he put it. Or, in other words, chess puzzles that
         | come from real games are much better than "randomly generated",
         | and make more sense even to the best of humans.
        
           | saghm wrote:
           | Super interesting (although it also makes some sense that
           | experts would focus on "likely" subsets given how the number
           | of permutations of chess games is too high for it to be
           | feasible to learn them all)! That said, I still imagine that
           | even most intermediate chess players would perfectly make
           | only _legal_ moves in weird positions, even if they're low
           | quality.
        
           | MarcelOlsz wrote:
           | Would love a link to that video!
        
           | lukan wrote:
           | "Even human super-GMs have a really hard time "scoring" or
           | "solving" extremely weird positions. "
           | 
           | I can sort of confirm that. I never learned all the formal
           | theoretical standard chess strategies except for the basic
           | ones. So when playing against really good players, way above
           | my level, I could win sometimes (or allmost) simply by making
           | unconventional (dumb by normal strategy) moves in the
           | beginning - resulting in a non standard game where I could
           | apply pressure in a way the opponent was not prepared for
           | (also they underestimated me after the initial dumb moves).
           | For me, the unconventional game was just like a standard
           | game, I had no routine - but for the experienced one, it was
           | way more challenging. But then of course in the standard
           | situations, to which allmost every chess game evolves to -
           | they destroyed me, simply for experience and routine.
        
             | hhhAndrew wrote:
             | The book Chess for Tigers by Simon Webb explicitly advises
             | this. Against "heffalumps" who will squash you, make the
             | situation very complicated and strange. Against "rabbits",
             | keep the game simple.
        
             | Reimersholme wrote:
             | In The Art of Learning, Joshua Waitzkin talks about how
             | this was a strategy for him in tournaments as a child as
             | well. While most other players were focusing on opening
             | theory, he focused on end game and understanding how to use
             | the different pieces. Then, by going with unorthodox
             | openings, he could easily bring most players outside of
             | their comfort zone where they started making mistakes.
        
             | aw1621107 wrote:
             | > So when playing against really good players, way above my
             | level, I could win sometimes (or allmost) simply by making
             | unconventional (dumb by normal strategy) moves in the
             | beginning - resulting in a non standard game where I could
             | apply pressure in a way the opponent was not prepared for
             | (also they underestimated me after the initial dumb moves).
             | 
             | IIRC Magnus Carlsen is said to do something like this as
             | well - he'll play opening lines that are known to be
             | theoretically suboptimal to take his opponent out of prep,
             | after which he can rely on his own prep/skills to give him
             | better winning chances.
        
             | dmoy wrote:
             | Huh it's funny, in fencing that also works to a certain
             | degree.
             | 
             | You can score points against e.g. national team members
             | who've been 5-0'ing the rest of the pool by doing weird
             | cheap tricks. You won't win though, because after one or
             | two points they will adjust and then wreck you.
             | 
             | And on the flip side, if you're decently rated (B ~ A ish)
             | and are used to just standard fencing, if you run into
             | someone who's U ~ E and does something weird like literally
             | not move their feet, it can take you a couple touches to
             | readjust to someone who doesn't behave normally.
             | 
             | Unlike chess though, in fencing the unconventional stuff
             | only works for a couple points. You can't stretch that into
             | a victory, because after each point everything resets.
             | 
             | Maybe that's why pentathlon (single touch victory) fencing
             | is so weird.
        
               | Trixter wrote:
               | Watching my son compete at a fighting game tournament at
               | a professional level, can confirm this also exists in
               | that realm. And problem other realms; I think it's more
               | of a general concept of unsettling the better opponent so
               | that you can have a short-term advantage at the
               | beginning.
        
           | Someone wrote:
           | That Expert players are better at recreate real games than
           | 'fake' positions is one of the things Adriaan de Groot
           | (https://en.wikipedia.org/wiki/Adriaan_de_Groot) noticed in
           | his studies with expert chess players. ("Thought and choice
           | in chess" is worth reading if you're interested in how chess
           | players think. He anonymized his subjects, but Euwe
           | apparently was on of them)
           | 
           | Another thing he noticed is that, when asked to set up a game
           | they were shown earlier, the errors expert players made often
           | were insignificant. For example, they would set up the pawn
           | structure on the king side incorrectly if the game's action
           | was on the other side of the board, move a bishop by a square
           | in such a way didn't make a difference for the game, or even
           | add an piece that wasn't active on the board.
           | 
           | Beginners would make different errors, some of them hugely
           | affecting the position on the board.
        
           | samatman wrote:
           | As someone who finds chess problems interesting (I'm bad at
           | them), they're really a third sort of thing. In that good
           | chess problems are rarely taken from live play, they're a
           | specific sort of thing which follows its own logic.
           | 
           | Good ones are never randomly generated, however. Also, the
           | skill doesn't fully transfer in either direction between live
           | play and solving chess problems. Definitely not
           | reconstructing the prior state of the board, since there's
           | nothing there to reconstruct.
           | 
           | So yes, everything Hikaru was saying there makes sense to me,
           | but I don't think your last sentence follows from it. Good
           | chess problems come from good chess problem authors
           | (interestingly this included Vladimir Nabokov), they aren't
           | random, but they rarely come from games, and tickle a
           | different part of the brain from live play.
        
           | hyperpape wrote:
           | This is technically true, but the kind of comment that
           | muddies the waters. It's true that GM performance is better
           | in realistic games.
           | 
           | It is false that GMs would have any trouble determining legal
           | moves in randomly generated positions. Indeed, even a 1200
           | level player on chess.com will find that pretty trivial.
        
         | fragmede wrote:
         | How well does it play modified versions of chess? eg, a
         | modified opening board like the back row is all knights, or
         | modified movement eg rooks can move like a queen. A human
         | should be able to reason their way through playing a modified
         | game, but I'd expect an LLM, if it's just parroting its
         | training data, to suggest illegal moves, or stick to previously
         | legal moves.
        
         | snowwrestler wrote:
         | It's kind of crazy to assert that the systems understand chess,
         | and then disclose further down the article that sometimes he
         | failed to get a legal move after 10 tries and had to sub in a
         | random move.
         | 
         | A person who understands chess well (Elo 1800, let's say) will
         | essentially never fail to provide a legal move on the first
         | try.
        
           | og_kalu wrote:
           | He is testing several models, some of which cannot reliably
           | output legal moves. That's different from saying all models
           | including the one he thinks understands can't generate a
           | legal move in 10 tries.
           | 
           | 3.5-turbo-instruct's illegal move rate is about 5 or less in
           | 8205
        
             | IanCal wrote:
             | I also wonder what kind of invalid moves they are. There's
             | "you can't move your knight to j9 that's off the board",
             | "there's already a piece there" and "actually that would
             | leave you in check".
             | 
             | I think it's also significantly harder to play chess if you
             | were to hear a sequence of moves over the phone and had to
             | reply with a followup move, with no space or time to think
             | or talk through moves.
        
           | navane wrote:
           | Pretty sure elo 1200 will only give legal moves. It's really
           | not hard to make legal moves in chess.
        
             | thaumasiotes wrote:
             | Casual players make illegal moves all the time. The problem
             | isn't knowing how the pieces move. It's that it's illegal
             | to leave your own king in check. It's not so common to
             | accidentally move your king into check, though I'm sure it
             | happens, but it's very common to accidentally move a piece
             | that was blocking an attack on your king.
             | 
             | I would tend to agree that there's a big difference between
             | attempting to make a move that's illegal because of the
             | state of a different region of the board, and attempting to
             | make one that's illegal because of the identity of the
             | piece being moved, but if your only category of interest is
             | "illegal moves", you can't see that difference.
             | 
             | Software that knows the rules of the game shouldn't be
             | making either mistake.
        
               | philipwhiuk wrote:
               | Casual players don't make illegal moves so often that you
               | have to assign them a random move after 10 goes.
        
           | Certhas wrote:
           | What do you mean by "understand chess"?
           | 
           | I think you don't appreciate how good the level of chess
           | displayed here is. It would take an average adult years of
           | dedicated practice to get to 1800.
           | 
           | The article doesn't say how often the LLM fails to generate
           | legal moves in ten tries, but it can't be often or the level
           | of play would be much much much worse.
           | 
           | As seems often the case, the LLM seems to have a brilliant
           | intuition, but no precise rigid "world model".
           | 
           | Of course words like intuition are anthropomorphic. At best a
           | model for what LLMs are doing. But saying "they don't
           | understand" when they can do _this well_ is absurd.
        
             | vundercind wrote:
             | > I think you don't appreciate how good the level of chess
             | displayed here is. It would take an average adult years of
             | dedicated practice to get to 1800.
             | 
             | Since we already have programs that can do this, that
             | definitely aren't really thinking and don't "understand"
             | anything at all, I don't see the relevance of this part.
        
             | photonthug wrote:
             | > But saying "they don't understand" when they can do _this
             | well_ is absurd.
             | 
             | When we talk about understanding a simple axiomatic system,
             | understanding means exactly that the entirety of the axioms
             | are modeled and applied correctly 100% of the time. This is
             | chess, not something squishy like literary criticism.
             | There's no need to debate semantics at all. One illegal
             | move is a deal breaker
             | 
             | Undergraduate CS homework for playing any game with any
             | technique would probably have the stipulation that any
             | illegal move disqualifies the submission completely.
             | Whining that it works most of the time would just earn
             | extra pity/contempt as well as an F on the project.
             | 
             | We can argue whether an error rate of 1 in a million means
             | that it plays like a grandmaster or a novice, but that's
             | less interesting. It failed to model a simple system
             | correctly, and a much shorter/simpler program could do
             | that. Doesn't seem smart if our response to this as an
             | industry is to debate semantics, ignore the issue, and work
             | feverishly to put it to work modeling more complicated /
             | critical systems.
        
               | Certhas wrote:
               | You just made up a definition of "understand". According
               | to that definition, you are of course right. I just don't
               | think it's a reasonable definition. It's also
               | contradicted by the person I was replying to in the
               | sibling comment, where they argue that Stockfish doesn't
               | understand chess, despite Stockfish of course having the
               | "axioms" modeled and applied correctly 100% of the time.
               | 
               | Here are things people say:
               | 
               | Magnus Carlsen has a better understanding of chess than I
               | do. (Yet we both know the precise rules of the game.)
               | 
               | Grandmasters have a very deep understanding of Chess,
               | despite occasionally making illegal moves that are not
               | according to the rules
               | (https://www.youtube.com/watch?v=m5WVJu154F0).
               | 
               | "If AlphaZero were a conventional engine its developers
               | would be looking at the openings which it lost to
               | Stockfish, because those indicate that there's something
               | Stockfish understands better than AlphaZero."
               | (https://chess.stackexchange.com/questions/23206/the-
               | games-al...)
               | 
               | > Undergraduate CS homework for playing any game with any
               | technique would probably have the stipulation that any
               | illegal move disqualifies the submission completely.
               | Whining that it works most of the time would just earn
               | extra pity/contempt as well as an F on the project.
               | 
               | How exactly is this relevant to the question whether LLMs
               | can be said to have some understanding of chess? Can they
               | consistently apply the rules when game states are given
               | in pgn? No. _Very_ few humans without specialized
               | training could either (without using a board as a tool to
               | keep track of the implicit state). They certainly "know"
               | the rules (even if they can't apply them) in the sense
               | that they will state them correctly if you ask them to.
               | 
               | I am not particularly interested in "the industry". It's
               | obvious that if you want a system to play chess, you use
               | a chess engine, not an LLM. But I am interested in what
               | their chess abilities teaches us about how LLMs build
               | world models. E.g.:
               | 
               | https://aclanthology.org/2024.acl-srw.48/
        
               | photonthug wrote:
               | Thanks for your thoughtful comment and refs to chase
               | down.
               | 
               | > You just made up a definition of "understand".
               | According to that definition, you are of course right. I
               | just don't think it's a reasonable definition. ... Here
               | are things people say:
               | 
               | Fine. As others have pointed out and I hinted at..
               | debating terminology is kind of a dead end. I personally
               | don't expect that "understanding chess" is the same as
               | "understanding Picasso", or that those phrases would mean
               | the same thing if they were applied to people vs for AI.
               | Also.. I'm also not personally that interested in how
               | performance stacks up compared to humans. Even if it were
               | interesting, the topic of human-equivalent performance
               | would not have static expectations either. For example
               | human-equivalent error rates in AI are much easier for me
               | to expect and forgive in robotics than they are in
               | axiomatic game-play.
               | 
               | > I am interested in what their chess abilities teaches
               | us about how LLMs build world models
               | 
               | Focusing on the single datapoint that TFA is
               | establishing: some LLMs can play some chess with some
               | amount of expertise, with some amount of errors. With no
               | other information at all, this tells us that it failed to
               | model the rules, or it failed in the application of those
               | rules, or both.
               | 
               | Based on that, some questions worth asking: Which one of
               | these failure modes is really acceptable and in which
               | circumstances? Does this failure mode apply to domains
               | other than chess? Does it help if we give it the model
               | directly, say by explaining the rules directly in the
               | prompt and also explicitly stating to not make illegal
               | moves? If it's failing to apply rules, but excels as a
               | model-maker.. then perhaps it can spit out a model
               | directly from examples, and then I can feed the model
               | into a separate engine that makes correct, deterministic
               | steps that actually honor the model?
               | 
               | Saying that LLMs do or don't understand chess is lazy I
               | guess. My basic point is that the questions above and
               | their implications are so huge and sobering that I'm very
               | uncomfortable with premature congratulations and optimism
               | that seems to be in vogue. Chess performance is
               | ultimately irrelevant of course, as you say, but what
               | sits under the concrete question is more abstract but
               | very serious. Obviously it is dangerous to create
               | tools/processes that work "most of the time", especially
               | when we're inevitably going to be giving them tasks where
               | we can't check or confirm "legal moves".
        
           | stuaxo wrote:
           | I hate the use of words like "understand" in these
           | conversations.
           | 
           | The system understands nothing, it's anthropomorphising it to
           | say it does.
        
             | Sharlin wrote:
             | Trying to appropriate perfectly well generalizable terms as
             | "something that only humans do" brings zero value to a
             | conversation. It's a "god in the gaps" argument,
             | essentially, and we don't exactly have a great track record
             | of correctly identifying things that are uniquely human.
        
               | fao_ wrote:
               | There's very literally currently a whole wealth of papers
               | proving that LLMs do not understand, cannot reason, and
               | cannot perform basic kinds of reasoning that even a dog
               | can perform. But, ok.
        
               | TeMPOraL wrote:
               | There's very literally currently a whole wealth of papers
               | proving the opposite, too, so -\\_(tsu)_/-.
        
               | wizzwizz4 wrote:
               | There's a whole wealth of papers proving that LLMs do not
               | understand _the concepts they write about_. That doesn 't
               | mean they don't understand _grammar_ - which (as I 've
               | claimed since the GPT-2 days) we _should_ ,
               | theoretically, expect them to "understand". And what is
               | chess, but a particularly sophisticated grammar?
        
             | trashtester wrote:
             | I have the same conclusion, but for the opposite reason.
             | 
             | It seems like many people tend to use the word "understand"
             | to that not only does someone believe that a given move is
             | good, they also belive that this knowledge comes from a
             | rational evaluation.
             | 
             | Some attribute this to a non-material soul/mind, some to
             | quantum mechanics or something else that seems magic, while
             | others never realized the problem with such a belief in the
             | first place.
             | 
             | I would claim that when someone can instantly recognize
             | good moves in a given situation, it doesn't come from
             | rationality at all, but from some mix of memory and an
             | intuition that has been build by playing the game many
             | times, with only tiny elements of actual rational thought
             | sprinkled in.
             | 
             | This even holds true when these people start to calculate.
             | It is primarily their intuition that prevens them from
             | spending time on all sorts of unlikely moves.
             | 
             | And this intuition, I think, represents most of their real
             | "understanding" of the game. This is quite different from
             | understanding something like a mathematical proof, which is
             | almost exclusively inducive logic.
             | 
             | And since "understand" so often is associated with rational
             | inductive logic, I think the proper term would be to have
             | "good intuition" when playing the game.
             | 
             | And this "good intuition" seems to me precisely the kind of
             | thing that is trained within most neural nets, even LLM's.
             | (Q*, AlphaZero, etc also add the ability to "calculate",
             | meaning traverse the search space efficiently).
             | 
             | If we wanted to measure how good this intuition is compared
             | to human chess intuition, we could limit an engine like
             | AlphaZero to only evaluate the same number of moves per
             | second that good humans would be able to, which might be
             | around 10 or so.
             | 
             | Maybe with this limitation, the engine wouldn't currently
             | be able to beat the best humans, but even if it reaches a
             | rating of 2000-2500 this way, I would say it has a pretty
             | good intuitive understanding.
        
             | int_19h wrote:
             | The whole point of this exercise is to understand what
             | "understand" even means. Because we really don't have a
             | good definition for this, and until we do, statements like
             | "the system understands nothing" are vacuous.
        
         | cma wrote:
         | Its training set would include a lot of randomly generated
         | positions like that that then get played out by chess engines
         | wouldn't it? Just from people messing around andbposting
         | results. Not identical ones, but similarly oddball.
        
         | thaumasiotes wrote:
         | > Here's one way to test whether it really understands chess.
         | Make it play the next move in 1000 random legal positions
         | 
         | Suppose it tries to capture en passant. How do you know whether
         | that's legal?
        
           | BalinKing wrote:
           | I feel like you could add "do not capture en passant unless
           | it is the only possible move" to the test without changing
           | what it's trying to prove--if anything, some small
           | permutation like this might even make it a stronger test of
           | "reasoning capability." (Personally I'm unconvinced of the
           | utility of this test in the first place, but I think it can
           | be reasonably steelmanned.)
        
         | namaria wrote:
         | Assigning "understanding" to an undefined entity is an
         | undefined statement.
         | 
         | It isn't even wrong.
        
         | _heimdall wrote:
         | Would that be enough to prove it? If the LLM was trained only
         | on a set of legal moves, isn't it possible that it functionally
         | learned how each piece is allowed to move without learning how
         | to actually reason about it?
         | 
         | Said differently in case I phrased that poorly - couldn't the
         | LLM still learn the it only ever saw bishops move diagonally
         | and therefore only considering those moves without actually
         | reasoning through the concept of legal and illegal moves?
        
         | zbyforgotp wrote:
         | The problem is that the llm don't learn to play moves from a
         | position, the internet archives contain only game records. They
         | might be building something to represent position
         | internationally but it will not be automatically activated with
         | an encoded chess position.
        
           | tromp wrote:
           | The ChessPositionRanking project, with help from the Texel
           | chess engine author, tries to prove random positions (that
           | are not obviously illegal) legal by constructing a game
           | ending in the position. If that fails it tries to prove the
           | position illegal. This now works for over 99.99% of randomly
           | generated positions, so one can feed the legal game record
           | found for random legal positions.
        
       | viraptor wrote:
       | I'm glad he improved the promoting, but he's still leaving out
       | two likely huge improvements.
       | 
       | 1. Explain the current board position and the plan going
       | forwards, _before_ proposing a move. This lets the model actually
       | think more, kind of like o1, but here it would guarantee a more
       | focused processing.
       | 
       | 2. Actually draw the ascii board for each step. Hopefully
       | producing more valid moves since board + move is easier to
       | reliably process than 20xmove.
        
         | unoti wrote:
         | I came here to basically say the same thing. The improvements
         | the OP saw by asking it to repeat all the moves so far gives
         | the LLM more time and space to think. I have this hypothesis
         | giving it more time and space to think in other ways could
         | improve performance even more, something like showing the
         | current board position and asking it to perform an analysis of
         | the position, list key challenges and strengths, asking it for
         | a list of strategies possible from here, then asking it to
         | select a strategy amongst the listed strategies, then asking it
         | for its move. In general, asking it to really think rather than
         | blurt out a move. The examples would be key here.
         | 
         | These ideas were proven to work very well in the ReAct paper
         | (and by extension, the CoT Chain of Thought paper). Could also
         | extend this by asking it to do this N times and stop when we
         | get the same answer a majority of times (this is an idea stolen
         | from the CoT-SC paper, chain of through self-consistency).
        
           | viraptor wrote:
           | It would be awesome if the author released a framework to
           | play with this. I'd like to test things out, but I don't want
           | to spend time redoing all his work from scratch.
        
             | fragmede wrote:
             | Just have ChatGPT write the framework
        
         | duskwuff wrote:
         | > 2. Actually draw the ascii board for each step.
         | 
         | I doubt that this is going to make much difference. 2D
         | "graphics" like ASCII art are foreign to language models - the
         | models perceive text as a stream of tokens (including
         | newlines), so "vertical" relationships between lines of text
         | aren't obvious to them like they would be to a human viewer.
         | Having that board diagram in the context window isn't likely to
         | help the model reason about the game.
         | 
         | Having the model list out the positions of each piece on the
         | board in plain text (e.g. "Black knight at c5") might be a more
         | suitable way to reinforce the model's positional awareness.
        
           | yccs27 wrote:
           | With positional encoding, an ascii board diagram actually
           | shouldn't be that hard to read for an LLM. Columns and
           | diagonals are just different strides through the flattened
           | board representation.
        
           | magicalhippo wrote:
           | I've had _some_ success getting models to recognize simple
           | electronic circuits drawn using ASCII art, including stuff
           | like identifying a buck converter circuit in various guises.
           | 
           | However, as you point out, the way we feed these models
           | especially make them vertically challenged, so to speak. This
           | makes them unable to reliably identify vertically separated
           | components in a circuit for example.
           | 
           | With combined vision+text models becoming more common place,
           | perhaps running the rendered text input through the vision
           | model might help.
        
         | daveguy wrote:
         | > Actually draw the ascii board for each step.
         | 
         | The relative rarity of this representation in training data
         | means it would probably degrade responses rather than improve
         | them. I'd like to see the results of this, because I would be
         | very surprised if it improved the responses.
        
         | ilaksh wrote:
         | The fact that he hasn't tried this leads me to think that deep
         | down he doesn't want the models to succeed and really just
         | wants to make more charts.
        
         | TeMPOraL wrote:
         | RE 2., I doubt it'll help - for at least two reasons, already
         | mentioned by 'duskwuff and 'daveguy.
         | 
         | RE 1., definitely worth trying, and there's more variants of
         | such tricks specific to models. I'm out of date on OpenAI docs,
         | but with Anthropic models, the docs suggest _using XML
         | notation_ to label and categorize most important parts of the
         | input. This kind of soft structure seems to improve the results
         | coming from Claude models; I imagine they specifically trained
         | the model to recognize it.
         | 
         | See: https://docs.anthropic.com/en/docs/build-with-
         | claude/prompt-...
         | 
         | In author's case, for Anthropic models, the final prompt could
         | look like this:                 <role>You are a chess
         | grandmaster.</role>       <instructions>       You will be
         | given a partially completed game, contained in <game-log> tags.
         | After seeing it, you should repeat the ENTIRE GAME and then
         | give ONE new move       Use standard algebraic notation, e.g.
         | "e4" or "Rdf8" or "R1a3".       ALWAYS repeat the entire
         | representation of the game so far, putting it in <new-game-log>
         | tags.       Before giving the new game log, explain your
         | reasoning inside <thinking> tag block.       </instructions>
         | <example>         <request>           <game-log>
         | *** example game ***           </game-log>         </request>
         | <reply>           <thinking> *** some example explanation
         | ***</thinking>           <new-game-log> *** game log + next
         | move *** </new-game-log>         </reply>
         | </example>              <game-log>        *** the incomplete
         | game goes here ***       </game-log>
         | 
         | This kind of prompting is supposed to provide noticeable
         | improvement for Anthropic models. Ironically, I only discovered
         | it few weeks ago, despite having been using Claude 3.5 Sonnet
         | extensively for months. Which goes to say, _RTFM is still a
         | useful skill_. Maybe OpenAI models have similar affordances
         | too, simple but somehow unnoticed? (I 'll re-check the docs
         | myself later.)
        
         | tedsanders wrote:
         | Chain of thought helps with many problems, but it actually
         | tanks GPT's chess performance. The regurgitation trick was the
         | best (non-fine tuning) technique in my own chess experiments
         | 1.5 years ago.
        
       | seizethecheese wrote:
       | All the hand wringing about openAI cheating suggests a question:
       | why so much mistrust?
       | 
       | My guess would be that the persona of the openAI team on
       | platforms like Twitter is very cliquey. This, I think, naturally
       | leads to mistrust. A clique feels more likely to cheat than some
       | other sort of group.
        
         | simonw wrote:
         | I wrote about this last year. The levels of trust people have
         | in companies working in AI is notably low:
         | https://simonwillison.net/2023/Dec/14/ai-trust-crisis/
        
         | nuancebydefault wrote:
         | My take on this is that people tend to be afraid of what they
         | can't understand or explain. To do away with that feeling, they
         | just say 'it can't reason'. While nobody on earth can put a
         | finger on what reasoning is, other than that it is a human
         | trait.
        
       | gallerdude wrote:
       | Very interesting - have you tried using `o1` yet? I made a
       | program which makes LLM's complete WORDLE puzzles, and the
       | difference between `4o` and `o1` is absolutely astonishing.
        
         | gallerdude wrote:
         | 4o-mini: 16% 4o: 50% o1-mini: 97% o1: 100%
         | 
         | * disclaimer - only n=7 on o1. Others are like 100-300 each
        
         | simonw wrote:
         | OK, that was fun. I just tried o1-preview on today's Wordle and
         | it got it on the third guess:
         | https://chatgpt.com/share/673f9169-3654-8006-8c0b-07c53a2c58...
        
           | gallerdude wrote:
           | With some transcribing (using another LLM instance) I've even
           | gotten it to solve NYT mini crosswords.
        
       | ChrisArchitect wrote:
       | Related from last week:
       | 
       |  _Something weird is happening with LLMs and Chess_
       | 
       | https://news.ycombinator.com/item?id=42138276
        
       | kibwen wrote:
       | _> I was astonished that half the internet is convinced that
       | OpenAI is cheating._
       | 
       | If you have a problem and all of your potential solutions are
       | unlikely, then it's fine to assume the least unlikely solution
       | while acknowledging that it's statistically probable that you're
       | also wrong. IOW if you have ten potential solutions to a problem
       | and you estimate that the most likely solution has an 11% chance
       | of being true, it's fine to assume that solution despite the fact
       | that, by your own estimate, you have an 89% chance of being
       | wrong.
       | 
       | The "OpenAI is secretly calling out to a chess engine" hypothesis
       | always seemed unlikely to me (you'd think it would play much
       | better, if so), but it seemed the easiest solution (Occam's
       | razor) and I wouldn't have been _surprised_ to learn it was true
       | (it 's not like OpenAI has a reputation of being trustworthy).
        
         | bongodongobob wrote:
         | That's not really how Occam's razor works. The entire company
         | colluding and lying to the public isn't "easy". Easy is more
         | along the lines of "for some reason it is good at chess but
         | we're not sure why".
        
           | simonw wrote:
           | One of the reasons I thought that was unlikely was personal
           | pride. OpenAI researchers are proud of the work that they do.
           | Cheating by calling out to a chess engine is something they
           | would be ashamed of.
        
             | kibwen wrote:
             | _> OpenAI researchers are proud of the work that they do._
             | 
             | Well, the failed revolution from last year combined with
             | the non-profit bait-and-switch pretty much conclusively
             | proved that OpenAI researchers are in it for the money
             | first and foremost, and pride has a dollar value.
        
               | fkyoureadthedoc wrote:
               | How much say do individual researchers even have in this
               | move?
               | 
               | And how does that prove anything about their motivations
               | "first and foremost"? They could be in it because they
               | like the work itself, and secondary concerns like open or
               | not don't matter to them. There's basically infinite
               | interpretations of their motivations.
        
           | dogleash wrote:
           | > The entire company colluding and lying to the public isn't
           | "easy".
           | 
           | Why not? Stop calling it "the entire company colluding and
           | lying" and start calling it a "messaging strategy among the
           | people not prevented from speaking by NDA." That will pass a
           | casual Occam's test that "lying" failed. But they both mean
           | the same exact thing.
        
             | TeMPOraL wrote:
             | It won't, for the same reason - whenever you're proposing a
             | conspiracy theory, you have to explain what stops every
             | person involved from leaking the conspiracy, whether on
             | purpose or by accident. This gets superlinearly harder with
             | number of people involved, and extra hard when there are
             | incentives rewarding leaks (and leaking OpenAI secrets has
             | some strong potential rewards).
             | 
             | Occam's test applies to the full proposal, _including_ the
             | explanation of things outlined above.
        
         | og_kalu wrote:
         | >but it seemed the easiest solution (Occam's razor)
         | 
         | In my opinion, it only seems like the easiest solution on the
         | surface taking basically nothing into account. By the time you
         | start looking at everything in context, it just seems bizarre.
        
           | kibwen wrote:
           | To reiterate, your assessment is true and we can assign it a
           | low probability, but in the context of trying to explain why
           | one model would be an outrageous outlier, manual intervention
           | was the simplest solution out of all the other hypotheses,
           | despite being admittedly bizarre. The thrust of the prior
           | comment is precisely to caution against conflating relative
           | and absolute likelihoods.
        
         | slibhb wrote:
         | I don't think it has anything to do with your logic here.
         | Actually, people just like talking shit about OpenAI on HN. It
         | gets you upvotes.
        
           | Legend2440 wrote:
           | LLM cynicism exceeds LLM hype at this point.
        
         | influx wrote:
         | I wouldn't call delegating specialized problems to specialized
         | engines cheating. While it should be documented, in a full AI
         | system, I want the best answer regardless of the technology
         | used.
        
       | tmalsburg2 wrote:
       | Why not use temperature 0 for sampling? If the top-ranked move is
       | not legal, it can't play chess.
        
         | thornewolf wrote:
         | sometimes skilled chess players make illegal moves
        
           | atiedebee wrote:
           | Extremely rare. The only time this happened that I'm aware of
           | was quite recent but the players only had a second or 2
           | remaining on the clock, so time pressure is definitely the
           | reason there
        
             | GaggiX wrote:
             | It often happens when the players play blondfold chess, as
             | in this case.
        
               | a2128 wrote:
               | Is this really equivalent to blindfold chess? The LLM has
               | access to the full move history, unlike blindfold chess
               | where memorization is necessary
        
       | atemerev wrote:
       | Ah, half of the commentariat still think that "LLMs can't
       | reason". Even if they have enough state space for reasoning, and
       | clearly demonstrate that.
        
         | brookst wrote:
         | But it's not real reasoning because it is just outputting
         | likely next tokens that are identical to what we'd expect with
         | reasoning. /s
        
         | lottin wrote:
         | "The question of whether a computer can think is no more
         | interesting than the question of whether a submarine can swim."
         | - Edsger Dijkstra
        
         | sourcepluck wrote:
         | Most people, as far as I'm aware, don't have an issue with the
         | idea that LLMs are producing behaviour which gives the
         | appearance of reasoning as far as we understand it today. Which
         | essentially means, it makes sentences that are grammatical,
         | responsive and contextual based on what you said (quite often).
         | It's at least pretty cool that we've got machines to do that,
         | most people seem to think.
         | 
         | The issue is that there might be more to _reason_ than
         | _appearing to reason_. We just don't know. I'm not sure how
         | it's apparently so unknown or unappreciated by people in the
         | computer world, but there are major unresolved questions in
         | science and philosophy around things like thinking, reasoning,
         | language, consciousness, and the mind. No amount of techno-
         | optimism can change this fact.
         | 
         | The issue is we have not gotten further than more or less
         | educated guesses as to what those words mean. LLMs bring that
         | interesting fact to light, even providing humanity with a
         | wonderful nudge to keep grappling with these unsolved
         | questions, and perhaps make some progress.
         | 
         | To be clear, they certainly are sometimes passably good when it
         | comes to summarising selectively and responsively the terabytes
         | and terabytes of data they've been trained on, don't get me
         | wrong, and I am enjoying that new thing in the world. And if
         | you want to define _reason_ like that, feel free.
        
           | atemerev wrote:
           | LLMs can _play chess_. With the game positions previously
           | unseen. How's that not actual logical reasoning?
        
             | sourcepluck wrote:
             | I guess you don't follow TCEC, or computer chess
             | generally[0]. Chess engines have been _playing chess_ at
             | superhuman levels using neural networks for years now, it
             | was a revolution in the space. AlphaZero, Lc0, Stockfish
             | NNUE. I don't recall yards of commentary arguing that they
             | were _reasoning_.
             | 
             | Look, you can put as many underscores as you like, the
             | question of whether these machines are _really reasoning_
              | or _emulating reason_ is not a solved problem. We don't
             | know what reasoning is! We don't know if _we_ are really
             | reasoning, because we have major unresolved questions
             | regarding the mind and consciousness[1].
             | 
             | These may not be intractable problems either, there's
             | reason for hope. In particular, studying brains with more
             | precision is obviously exciting there. More computational
             | experiments, including the recent explosion in LLM
             | research, is also great.
             | 
             | Still, reflexively believing in the computational theory of
             | the mind[2] without engaging in the actual difficulty of
             | those questions, though commonplace, is not reasonable.
             | 
             | [0] Jozarov on YT has great commentary of top engine games,
             | worth checking out.
             | 
             | [1] https://plato.stanford.edu/entries/consciousness/
             | 
             | [2] https://plato.stanford.edu/entries/computational-mind/
        
               | atemerev wrote:
               | I am not implying that LLMs are conscious or something.
               | Just that they can reason, i.e. draw logical conclusions
               | from observations (or, in their case, textual inputs),
               | and make generalizations. This is a much weaker
               | requirement.
               | 
               | Chess engines can reason about chess (they can even
               | explain their reasoning). LLMs can reason about many
               | other things, with varied efficiency.
               | 
               | What everyone is currently trying to build is something
               | like AlphaZero (adversarial self-improvement for
               | superhuman performance) with the state space of LLMs
                | (general enough to be useful for most tasks). When we
                | have this, we'll have AGI.
        
           | og_kalu wrote:
           | If it displays the outwards appearances of reasoning then it
           | is reasoning. We don't evaluate humans any differently.
           | There's no magic intell-o-meter that can detect the amount of
           | intelligence flowing through a brain.
           | 
           | Anything else is just an argument of semantics. The idea that
           | there is "true" reasoning and "fake" reasoning but that we
           | can't tell the latter apart from the former is ridiculous.
           | 
           | You can't eat your cake and have it. Either "fake reasoning"
           | is a thing and can be distinguished or it can't and it's just
           | a made up distinction.
        
             | suddenlybananas wrote:
             | If I have a calculator with a look-up table of all
             | additions of natural numbers under 100, the calculator can
             | "appear" to be adding despite the fact it is not.
        
               | sourcepluck wrote:
               | Yes, indeed. Bullets know how to fly, and my kettle
                | somehow _knows_ that water boils at 373.15K! There's
               | been an explosion of intelligence since the LLMs came
               | about :D
        
               | og_kalu wrote:
               | Bullets don't have the outward appearance of flight. They
               | follow the motion of projectiles and look it. Finding the
               | distinction is trivial.
               | 
               | The look up table is the same. It will fall apart with
               | numbers above 100. That's the distinction.
               | 
               | People need to start bringing up the supposed distinction
               | that exists with LLMs instead of nonsense examples that
               | don't even pass the test outlined.
        
               | og_kalu wrote:
                | Until you ask it to add numbers above 100 and it falls
               | apart. That is the point here. You found a distinction.
               | If you can't find one then you're arguing semantics.
               | People who say LLMs can't reason are yet to find a
               | distinction that doesn't also disqualify a bunch of
               | humans.
        
               | int_19h wrote:
               | This argument would hold up if LMs were large enough to
               | hold a look-up table of all possible valid inputs that
               | they can correctly respond to. They're not.
        
       | furyofantares wrote:
       | LLMs are fundamentally text-completion. The Chat-based tuning
       | that goes on top of it is impressive but they are fundamentally
       | text-completion, that's where most of the training energy goes. I
       | keep this in mind with a lot of my prompting and get good
       | results.
       | 
       | Regurgitating and Examples are both ways to lean into that and
       | try to recover whatever has been lost by Chat-based tuning.
        
         | zi_ wrote:
         | what else do you think about when prompting, which you've found
         | to be useful?
        
       | blixt wrote:
       | Really interesting findings around fine-tuning. Goes to show it
       | doesn't really affect the deeper "functionality" of the LLM (if
       | you think of the LLM as running a set of small functions on very
       | high-dimensional numbers to produce a token).
       | 
       | Using regurgitation to get around the assistant/user token
       | separation is another fun tool for the toolbox, relevant for
       | whenever you want a model that doesn't support continuation to
       | actually perform continuation (at the cost of a lot of latency).
       | 
       | I wonder if any type of reflection or chains of thought would
       | help it play better. I wouldn't be surprised if getting the LLM
       | to write an analysis of the game in English is more likely to
       | move it out of distribution than to make it pick better chess
       | moves.
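       | 
       | A rough sketch of the regurgitation trick with a chat API (the
       | model name and the instruction wording are just illustrative):
       | 
       |     from openai import OpenAI
       | 
       |     client = OpenAI()
       |     game = "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4."
       | 
       |     # Have the chat model repeat the transcript inside its own
       |     # assistant turn, so the next tokens behave like a raw
       |     # continuation rather than a fresh chat reply.
       |     instruction = ("Repeat the following game exactly as written,"
       |                    " then continue it with the next move only:\n")
       |     resp = client.chat.completions.create(
       |         model="gpt-4o-mini",  # any chat model; name is an example
       |         temperature=0,
       |         messages=[{"role": "user",
       |                    "content": instruction + game}],
       |     )
       |     print(resp.choices[0].message.content)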
        
       | MisterTea wrote:
       | This happened to a friend who was trying to sim basketball games.
       | It kept forgetting who had the ball or outright made illegal or
       | confusing moves. After a few days of wrestling with the AI he
       | gave up. GPT is amazing at following a linear conversation but
       | has no cognitive ability to keep track of a dynamic scenario.
        
       | xg15 wrote:
       | > _In many ways, this feels less like engineering and more like a
       | search for spells._
       | 
       | This is still my impression of LLMs in general. It's amazing that
       | they work, but for the next tech disruption, I'd appreciate
       | something that doesn't make you feel like being in a bad sci-fi
       | movie all the time.
        
       | jey wrote:
       | Could be interesting to create a tokenizer that's optimized for
       | representing chess moves and then training a LLM (from scratch?)
       | on stockfish games. (Using a custom tokenizer should improve the
       | quality for a given size of the LLM, so it doesn't have to
       | waste a lot of layers on encode and decode, and the "natural"
       | latent representation is more straightforward)
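       | 
       | A rough sketch of what a move-level vocabulary could look like
       | with python-chess (the exact vocabulary choice is an assumption;
       | real training would need much more care):
       | 
       |     import chess
       | 
       |     # One token per from-square/to-square pair, plus promotions:
       |     # a few thousand tokens, so each move is a single token
       |     # instead of several BPE fragments. (Some entries are never
       |     # legal moves; that's fine for a sketch.)
       |     squares = [chess.square_name(s) for s in chess.SQUARES]
       |     vocab = [f + t for f in squares for t in squares]
       |     vocab += [f + t + p for f in squares for t in squares
       |               for p in "qrbn"]
       |     stoi = {m: i for i, m in enumerate(vocab)}
       | 
       |     def encode(uci_moves):
       |         return [stoi[m] for m in uci_moves]
       | 
       |     print(encode(["e2e4", "e7e5", "g1f3"]))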
        
       | sourcepluck wrote:
       | I don't like being directly critical, people learning in public
       | can be good and instructive. But I regret the time I've put into
       | both this article and the last one and perhaps someone else can
       | be saved the same time.
       | 
       | This is someone with limited knowledge of chess, statistics and
       | LLMs doing a series of public articles as they learn a little
       | tiny bit about chess, statistics and LLMs. And it garners upvotes
       | and attention off the coat-tails of AI excitement. Which is fair
       | enough, it's the (semi-)public internet, but it sort of
       | masquerades as being half-serious "research", and it kind of held
       | things together for the first article, but this one really is
       | thrown together to keep the buzz of the last one going.
       | 
       | The TL;DR :: one of the AIs being just-above-terrible, compared
       | to all the others being completely terrible, a fact already of
       | dubious interest, is down to - we don't know. Maybe a difference
       | in training sets. Tons of speculation. A few graphs.
        
       | phkahler wrote:
       | You can easily construct a game board from a sequence of moves by
       | maintaining the game state somewhere. But you can also know where
       | a piece is based on only its last move. I'm curious what happens
       | if you don't feed it a position, but feed it a sequence of moves
       | including illegal ones but end up at a given valid position. The
       | author mentions that LLMs will play differently when the same
       | position is arrived at via different sequences. I'm suggesting to
       | really play with that by putting illegal moves in the sequence.
       | 
       | I doubt it's doing much more than a static analysis of the
       | board position, or even moving based mostly on just a few recent
       | moves by key pieces.
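       | 
       | A sketch of how such prompts could be built with python-chess,
       | replaying a move list and noting which moves are illegal before
       | handing the whole (possibly corrupted) sequence to the model:
       | 
       |     import chess
       | 
       |     moves = ["e4", "e5", "Nf3", "Ke2", "Nc6"]  # "Ke2" is illegal
       | 
       |     board = chess.Board()
       |     for san in moves:
       |         try:
       |             board.push_san(san)      # raises on illegal moves
       |         except ValueError:
       |             print("illegal move in sequence:", san)
       | 
       |     # The prompt could still contain the full 'moves' list, to
       |     # test whether the model notices the junk or follows along.
       |     print("final position:", board.fen())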
        
       | drivingmenuts wrote:
       | Why would a chess-playing AI be tuned to do anything except play
       | chess? Just seems like a waste. A bunch of small, specialized
       | AI's seems like a better idea than spending time trying to build
       | a new one.
       | 
       | Maybe less morally challenging, as well. You wouldn't be trying
       | to install "sentience".
        
         | int_19h wrote:
         | It's the other way around - you might want a general-purpose
         | model to learn to play chess because e.g. it improves its
         | ability to reason logically in other cases (which has been
         | claimed for humans pretty much ever since chess was invented).
         | 
         | Considering that training models on code seems to improve their
         | abilities on non-coding tasks in actual testing, this isn't
         | even all that far-fetched. Perhaps that is why GPT-3.5 was
         | specifically trained on chess in the first place.
        
       | PaulHoule wrote:
       | People have to quit this kind of stumbling in the dark with
       | commercial LLMs.
       | 
       | To get to the bottom of this it would be interesting to train
       | LLMs on nothing but chess games (can synthesize them endlessly by
       | having Stockfish play against itself) with maybe a side helping
       | of chess commentary and examples of chess dialogs "how many pawns
       | are on the board?", "where are my rooks?", "draw the board",
       | competence at which would demonstrate that it has a
       | representation of the board.
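       | 
       | A minimal sketch of that synthesis step with python-chess and a
       | local Stockfish binary (the binary path and time limit are
       | assumptions; a real corpus would need millions of games):
       | 
       |     import chess
       |     import chess.engine
       | 
       |     engine = chess.engine.SimpleEngine.popen_uci("stockfish")
       | 
       |     def self_play_game(move_time=0.01):
       |         board = chess.Board()
       |         while not board.is_game_over():
       |             limit = chess.engine.Limit(time=move_time)
       |             board.push(engine.play(board, limit).move)
       |         san = chess.Board().variation_san(board.move_stack)
       |         return san, board.result()
       | 
       |     for _ in range(3):   # crank this up for a real training set
       |         moves, outcome = self_play_game()
       |         print(moves, outcome)
       | 
       |     engine.quit()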
       | 
       | I don't believe in "emergent phenomena" or that the general
       | linguistic competence or ability to feign competence is necessary
       | for chess playing (being smart at chess doesn't mean you are
       | smart at other things and vice versa). With experiments like this
       | you might prove me wrong though.
       | 
       | This paper came out about a week ago
       | 
       | https://arxiv.org/pdf/2411.06655
       | 
       | seems to get good results with a fine-tuned Llama. I also like
       | this one as it is about competence in chess commentary
       | 
       | https://arxiv.org/abs/2410.20811
        
         | toxik wrote:
         | Predicting next moves of some expert chess policy is just
         | imitation learning, a well-studied proposal. You can add
         | return-to-go to let the network try to learn what kinds of
         | moves are made in good vs bad games, which would be an offline
         | RL regime (eg, Decision Transformers).
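         | 
         | As a toy illustration, a return-to-go label per move can be as
         | simple as the final game result propagated backwards (real
         | Decision Transformer setups use per-step rewards and more
         | careful bookkeeping; this is just the shape of the data):
         | 
         |     # Toy return-to-go labels: every move inherits the final
         |     # result (+1 win, 0 draw, -1 loss) from the mover's side.
         |     moves = ["e4", "e5", "Nf3", "Nc6", "Bb5"]
         |     white_result = 1          # say White eventually won
         | 
         |     returns_to_go = [
         |         white_result if i % 2 == 0 else -white_result
         |         for i in range(len(moves))
         |     ]
         |     print(list(zip(returns_to_go, moves)))
         |     # -> [(1, 'e4'), (-1, 'e5'), (1, 'Nf3'), ...]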
         | 
         | I suspect chess skill is completely useless for LLMs in general
         | and not an emergent phenomenon, just consuming gradient
         | bandwidth and parameter space to do this neat trick. This is
         | clear to me because the LLMs that aren't trained specifically
         | on chess do not do chess well.
        
           | PaulHoule wrote:
           | In either language or chess I'm still a bit baffled how a
           | representation over continuous variables (differentiable no
           | less) works for something that is discrete such as words,
            | letters, chess moves, etc. Add the word "not" to a sentence and
           | it is not a perturbation of the meaning but a reversal (or is
           | it?)
           | 
           | A difference between communication and chess is that your
           | partner in conversation is your ally in meaning making and
           | will help fix your mistakes which is how they get away with
           | bullshitting. ("Personality" makes a big difference, by the
           | time you are telling your programming assistant "Dude,
           | there's a red squiggle on line 92" you are under its spell)
           | 
           | Chess on the other hand is adversarial and your mistakes are
           | just mistakes that your opponent will take advantage of. If
           | you make a move and your hunch that your pieces are not in
           | danger is just slightly wrong (one piece in danger) that's
           | almost as bad as having all your non-King pieces in danger
           | (they can only take one next turn.)
        
       | joshka wrote:
       | Why are you telling it not to explain? Allowing the LLM space to
       | "think" may be helpful, and would definitely be worth exploring.
       | 
       | Why are you manually guessing ways to improve this? Why not let
       | the LLMs do this for themselves and find iteratively better
       | prompts?
        
       | bambax wrote:
       | Very good follow-up to the original article. Thank you!
        
       | kqr wrote:
       | I get that it would make evals even more expensive, but I would
       | also try chain-of-thought! Have it explain its goals and
       | reasoning for the next move before making it. It might be an
       | awful idea for something like chess, but it seems to help
       | elsewhere.
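       | 
       | Something like the following prompt shape is what I have in mind
       | (the wording is only illustrative):
       | 
       |     game = "1. e4 e5 2. Nf3 Nc6 3. Bb5"
       |     prompt = (
       |         "You are a grandmaster chess player.\n"
       |         f"Here is the game so far: {game}\n"
       |         "First, briefly describe the threats and plans for both "
       |         "sides. Then, on a new line starting with 'MOVE:', give "
       |         "Black's next move in standard algebraic notation."
       |     )
       |     print(prompt)
       |     # move = completion_text.split("MOVE:")[-1].strip()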
        
       | Palmik wrote:
       | It might be worth trying the experiment where the prompt is
       | formatted such that each chess turn corresponds to one chat
       | message.
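       | 
       | A small sketch of that format (the model call is left out, and
       | the role mapping is an assumption about who plays which side):
       | 
       |     # One chat message per ply; the model's own moves become
       |     # assistant turns, the opponent's moves become user turns
       |     # (here the model is assumed to be playing White).
       |     moves = ["e4", "e5", "Nf3", "Nc6", "Bb5"]
       | 
       |     system = {"role": "system",
       |               "content": "Reply with your next move only."}
       |     messages = [system]
       |     for i, move in enumerate(moves):
       |         role = "assistant" if i % 2 == 0 else "user"
       |         messages.append({"role": role, "content": move})
       | 
       |     print(messages)
       |     # reply = client.chat.completions.create(..., messages=messages)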
        
       | Jean-Papoulos wrote:
       | >According to that figure, fine-tuning helps. And examples help.
       | But it's examples that make fine-tuning redundant, not the other
       | way around.
       | 
       | This is extremely interesting. In this specific case at least,
       | simply giving examples is equivalent to fine-tuning. This is a
       | great discovery for me, I'll try using examples more often.
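       | 
       | For anyone else wanting to try it, the few-shot version is just
       | prepending complete example games to the prompt (the example
       | games below are made up):
       | 
       |     examples = [
       |         "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7",
       |         "1. d4 d5 2. c4 e6 3. Nc3 Nf6 4. Bg5 Be7 5. e3 O-O",
       |     ]
       |     current_game = "1. e4 c5 2. Nf3 d6 3."
       | 
       |     prompt = "\n\n".join(examples) + "\n\n" + current_game
       |     print(prompt)
       |     # completion = client.completions.create(model=..., prompt=prompt)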
        
         | jdthedisciple wrote:
         | To me this is very intuitively true.
         | 
         | I can't explain why. I always had the intuition that fine-tuning
         | was overrated.
         | 
         | One reason perhaps is that examples are "right there" and thus
         | implicitly weighted much more in relation to the fine-tuned
         | neurons.
        
         | s5ma6n wrote:
         | Agreed, providing examples is definitely a useful insight vs
         | fine-tuning.
         | 
         | While it is not very important for this toy case, it's good to
         | keep in mind that each provided example in the input will
         | increase the prediction time and cost compared to fine-tuning.
        
       | marcus_holmes wrote:
       | I notice there's no prompt saying "you should try to win the
       | game" yet the results are measured by how much the LLM wins.
       | 
       | Is this implicit in the "you are a grandmaster chess player"
       | prompt?
       | 
       | Is there some part of the LLM training that does "if this is a
       | game, then I will always try to win"?
       | 
       | Could the author improve the LLM's odds of winning just by
       | telling it to try and win?
        
         | Nashooo wrote:
         | IMO this is clearly implicit in the "you are a grandmaster
         | chess player" prompt. As that should make generating best
         | possible move tokens more likely.
        
           | Ferret7446 wrote:
           | Is it? What if the AI is better than a grandmaster chess
           | player and is generating the most likely next move that a
           | grandmaster chess player might make and not the most likely
           | move to win, which may be different?
        
             | lukan wrote:
              | Depends on the training data I think. If the data is split
              | between games by top chess engines and games by human
              | players, then yes, it might make a difference to tell it to
              | play like a grandmaster of chess vs. to play like a top
              | chess engine.
        
           | cma wrote:
           | Grandmasters usually play grandmasters of similar ELO, so it
           | might think it doesn't always win. Even if it should
           | recognize the player isn't a grandmaster, it still may be
           | better to include that, though who knows without testing.
        
         | tinco wrote:
         | I think you're putting too much weight on its intentions; it
         | doesn't have intentions, it is a mathematical model that is
         | trained to give the most likely outcome.
         | 
         | In almost all examples and explanations it has seen from chess
         | games, each player would be trying to win, so it is simply the
         | most logical thing for it to make a winning move. So I wouldn't
         | expect explicitly prompting it to win to improve its
         | performance by much if at all.
         | 
         | The reverse would be interesting though, if you would prompt it
         | to make losing/bad moves, would it be effective in doing so,
         | and would the moves still be mostly legal? That might reveal a
         | bit more about how much relies on concepts it's seen before.
        
           | graypegg wrote:
           | Might also be interesting to see if mentioning a target ELO
           | score actually works over enough simulated games. I can
           | imagine there might be regular mentions of a player's ELO
           | score near their match history in the training data.
           | 
           | That way you're trying to emulate cases where someone is
           | trying, but isn't very good yet, versus trying to emulate
           | cases where someone is clearly and intentionally losing which
           | is going to be orders of magnitude less common in the
           | training data. (And I also would bet "losing" is also a
           | vector/token too closely tied to ANY losing game, but those
           | players were still putting up a fight to try and win the
           | game. Could still drift towards some good moves!)
        
         | montjoy wrote:
         | I came to the comments to say this too. If you're prompting
         | it to generate code, you generally get better results when you
         | ask it for a result. You don't just tell it, "You are a python
         | expert and here is some code". You give it a direction you want
         | the code to go. I was surprised that there wasn't something
         | like, "and win", or, "black wins", etc.
        
         | tananan wrote:
         | It would surely just be fluff in the prompt. The model's
         | ability to generate chess sequences will be bounded by the
         | expertise in the pool of games in the training set.
         | 
         | Even if the pool was poisoned by games in which some players
         | are trying to lose (probably insignificant), no one annotates
         | player intent in chess games, and so prompting it to win or
         | lose doesn't let the LLM pick up on this.
         | 
         | You can try this by asking an LLM to play to lose. ChatGPT ime
         | tries to set itself up for scholar's mate, but if you don't go
         | for it, it will implicitly start playing to win (e.g. taking
         | your unprotected pieces). If you ask it "why?", it gives you
         | the usual bs post-hoc rationalization.
        
           | danw1979 wrote:
           | > It would surely just be fluff in the prompt. The model's
           | ability to generate chess sequences will be bounded by the
           | expertise in the pool of games in the training set.
           | 
            | There are drawn and losing games in the training set though.
        
         | boredhedgehog wrote:
         | Further, the prompt also says to "choose the next move" instead
         | of the best move.
         | 
         | It would be fairly hilarious if the reinforcement training has
         | made the LLM unwilling to make the human feel bad through
         | losing a game.
        
       | byyoung3 wrote:
       | sometimes new training techniques will lead to regressions in
       | certain tasks. My guess is this is exactly what has happened.
        
       | boesboes wrote:
       | It would be interesting to see if it can also play chess with
       | altered rules, or actually just a novel 'game' that relies on
       | logic & reasoning. Still not sure if that would 'prove' LLMs do
       | reasoning, but I'd be pretty close to convinced.
        
         | blueboo wrote:
         | Fun idea. Let's change how the knight behaves. Or try it on
         | Really Bad Chess (puzzles with impossible layouts) or 6x6 chess
         | or 8x9 chess.
         | 
         | I wonder if there are variants that have good baselines. It
         | might be tough to evaluate vis a vis human performance on novel
         | games..
        
         | Miraltar wrote:
         | If they were trained on multiple chess variants that might work
         | but as is it's impossible I think. Their internal model to play
         | chess is probably very specific
        
       | leumassuehtam wrote:
       | I'm convinced that "completion" models are much more useful (and
       | smart) than "chat" models, being able to provide more nuanced and
       | original outputs. When gpt4 came out, text-davinci-003 would
       | still provide better completions with the correct prompt. Of
       | course this model was later replaced by gpt-3.5-turbo-instruct
       | which is explored in this post.
       | 
       | I believe the reason why such models were later deprecated was
       | "alignment".
        
         | a2128 wrote:
         | I don't believe alignment/safety is the only reason. You also
         | burn through significantly more output tokens in a back-and-
         | forth editing session because by default it keeps repeating the
         | entire code or document just to make one small change, and it
         | also adds useless fluff around the text ("You are absolutely
         | correct, and I apologize for the confusion...")
        
       | qnleigh wrote:
       | Two other theories that could explain why OpenAI's models do so
       | well:
       | 
       | 1. They generate chess games from chess engine self play and add
       | that to the training data (similar to the already-stated theory
       | about their training data).
       | 
       | 2. They have added chess reinforcement learning to the training
       | at some stage, and actually got it to work (but not very well).
        
       | sourcepluck wrote:
       | > Since gpt-3.5-turbo-instruct has been measured at around 1800
       | Elo
       | 
       | Where's the source for this? What's the reasoning? I don't see
       | it. I have just relooked, and still can't see it.
       | 
       | Is it 1800 lichess "Elo", or 1800 FIDE, that's being claimed? And
       | 1800 at what time control? Different time controls have different
       | ratings, as one would imagine/hope the author knows.
       | 
       | I'm guessing it's not 1800 FIDE, as the quality of the games
       | seems far too bad for that. So any clarity here would be
       | appreciated.
        
         | og_kalu wrote:
         | https://github.com/adamkarvonen/chess_gpt_eval
        
           | sourcepluck wrote:
           | Thank you. I had seen that, and had browsed through it, and
           | thought: I don't get it, the reason for this 1800 must be
           | elsewhere.
           | 
           | What am I missing? Where does it show there how the claim of
           | "1800 ELO" is arrived at?
           | 
           | I can see various things that might be relevant, for example,
           | the graph where it (GPT-3.5-turbo-instruct) is shown as going
           | from mostly winning to mostly losing when it gets to
           | Stockfish level 3. It's hard (/impossible) to estimate the
           | lichess or FIDE ELO of the different Stockfish levels, but
           | Lichess' Stockfish on level 3 is miles below 1800 FIDE, and
           | it seems to me very likely to be below lichess 1800.
           | 
           | I invite any FIDE 1800s and (especially) any Lichess 1800s to
           | play Stockfish level 3 and report back. Years ago when I
           | played a lot on Lichess I was low 2000s in rapid, and I win
           | comfortably up till Stockfish level 6, where I can win, but
           | also do lose sometimes. Basically I really have to start
           | paying attention at level 6.
           | 
           | Level 3 seems like it must be below lichess 1800, but it's
           | just my anecdotal feeling of the strengths. Seeing as how the
            | article is chock-a-block full of unfounded speculation and bias
           | though, maybe we can indulge ourselves too.
           | 
            | So: someone please explain the 1800 thing to me? And would
            | any lichess 1800s like to play guinea pig, play a series of
            | games against stockfish 3, and report back to us?
        
             | og_kalu wrote:
             | In Google's paper, then titled "Grandmaster level chess
             | without search", they evaluate turbo-instruct to have a
             | lichess Elo of 1755 (vs bots)
             | 
             | https://arxiv.org/abs/2402.04494
             | 
             | Admittedly, this isn't really "the source" though. The
             | first people to break the news on turbo-instruct's chess
             | ability all pegged it around 1800.
             | https://x.com/GrantSlatton/status/1703913578036904431
        
               | sourcepluck wrote:
               | Thank you, I do appreciate it. I had a quick search
               | through the paper, and can at least confirm for myself
               | that it's a Lichess Elo, and one of 1755, that is found
               | in that arxiv paper. That tweet there that says 1800,
               | without specifying it's a Lichess rating, I can't see
               | where he gets it from (but I don't have Twitter, I could
               | be missing something).
               | 
               | At least the arxiv paper is serious:
               | 
               | > A direct comparison between all engines comes with a
               | lot of caveats since some engines use the game history,
               | some have very different training protocols (i.e., RL via
               | self-play instead of supervised learning), and some use
               | search at test time. We show these comparisons to situate
               | the performance of our models within the wider landscape,
               | but emphasize that some conclusions can only be drawn
               | within our family of models and the corresponding
               | ablations that keep all other factors fixed.
        
       | sourcepluck wrote:
       | > For one, gpt-3.5-turbo-instruct rarely suggests illegal moves,
       | even in the late game.
       | 
       | It's claimed that this model "understands" chess, and can
       | "reason", and do "actual logic" (here in the comments).
       | 
       | I invite anyone making that claim to find me an "advanced
       | amateur" (as the article says of the LLM's level) chess player
       | who ever makes an illegal move. Anyone familiar with chess can
       | confirm that it doesn't really happen.
       | 
       | Is there a link to the games where the illegal moves are made?
        
         | zarzavat wrote:
         | An LLM is essentially playing blindfold chess if it just gets
         | the moves and not the position. You have to be fairly good to
         | never make illegal moves in blindfold.
        
           | fmbb wrote:
           | Does it not always have a list of all the moves in the game
            | at hand in the prompt?
           | 
           | You have to give this human the same log of the game to refer
           | to.
        
             | xg15 wrote:
             | I think even then it would still be blindfold chess,
             | because humans do a lot of "pattern matching" on the actual
             | board state in front of them. If you only have the moves,
             | you have to reconstruct this board state in your head.
        
           | pera wrote:
           | A chat conversation where every single move is written down
           | and accessible at any time is not the same as blindfold
           | chess.
        
             | zbyforgotp wrote:
             | You can make it available to the player and I suspect it
             | wouldn't change the outcomes.
        
             | gwd wrote:
             | OK, but the LLM is still playing without a board to look
             | at, except what's "in its head". How often would 1800 ELO
             | chess players make illegal moves when playing only using
             | chess notation over chat, with no board to look at?
             | 
             | What might be interesting is to see if there was some sort
             | of prompt the LLM could use to help itself; e.g., "After
             | repeating the entire game up until this point, describe
             | relevant strategic and tactical aspects of the current
             | board state, and then choose a move."
             | 
             | Another thing that's interesting is the 1800 ELO cut-off of
             | the training data. If the cut-off were 2000, or 2200, would
             | that improve the results?
             | 
             | Or, if you included training data but labeled with the
             | player's ELO, could you request play at a specific ELO?
             | Being able to play against a 1400 ELO computer _that made
             | the kind of mistakes a 1400 ELO human would make_ would be
             | amazing.
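              | 
              | A sketch of what Elo-labeled training text could look like
              | (the tag format here is made up; standard PGN headers
              | would work just as well):
              | 
              |     # (white_elo, black_elo, moves) - made-up games
              |     games = [
              |         (1400, 1450, "1. e4 e5 2. Qh5 Nc6 3. Bc4 g6"),
              |         (2200, 2180, "1. d4 Nf6 2. c4 e6 3. Nc3 Bb4"),
              |     ]
              | 
              |     def to_text(w, b, moves):
              |         return f"[WhiteElo {w}] [BlackElo {b}] {moves}"
              | 
              |     print("\n".join(to_text(*g) for g in games))
              |     # At inference time, prompt with
              |     # "[WhiteElo 1400] [BlackElo 1400] 1." to ask for play
              |     # at roughly that strength.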
        
               | wingmanjd wrote:
               | MaiaChess [1] supposedly plays at a specific ELO, making
               | similar mistakes a human would make at those levels.
               | 
               | It looks like they have 3 public bots on lichess.org:
               | 1100, 1500, and 1900
               | 
               | [1] https://www.maiachess.com/
        
             | lukeschlather wrote:
             | The LLM can't refer to notes, it is just relying on its
             | memory of what input tokens it had.
        
         | GaggiX wrote:
         | I can confirm that an advanced amateur can play illegal moves
         | by playing blindfold chess as shown in this article.
        
         | _heimdall wrote:
         | This is the problem with LLM researchers all but giving up on
         | the problem of inspecting how the LLM actually works
         | internally.
         | 
         | As long as the LLM is a black box, it's entirely possible that
         | (a) the LLM does reason through the rules and understands what
         | moves are legal or (b) was trained on a large set of legal
         | moves and therefore only learned to make legal moves. You can
         | claim either case is the real truth, but we have absolutely no
         | way to know because we have absolutely no way to actually
         | understand what the LLM was "thinking".
        
           | codeulike wrote:
           | Here's an article where they teach an LLM Othello and then
           | probe its internal state to assess whether it is 'modelling'
           | the Othello board internally
           | 
           | https://thegradient.pub/othello/
           | 
           | Associated paper: https://arxiv.org/abs/2210.13382
        
           | mattmcknight wrote:
           | It's weird because it is not a black box at the lowest level,
           | we can see exactly what all of the weights are doing. It's
           | just too complex for us to understand it.
           | 
           | What is difficult is finding some intermediate pattern in
           | between there which we can label with an abstraction that is
           | compatible with human understanding. It may not exist. For
           | example, it may be more like how our brain works to produce
           | language than it is like a logical rule based system. We
           | occasionally say the wrong word, skip a word, spell things
           | wrong...violate the rules of grammar.
           | 
           | The inputs and outputs of the model are human language, so at
           | least there we know the system as a black box can be
           | characterized, if not understood.
        
             | _heimdall wrote:
             | > The inputs and outputs of the model are human language,
             | so at least there we know the system as a black box can be
             | characterized, if not understood.
             | 
             | This is actually where the AI safety debates tend to lose.
             | From where I sit we can't characterize the black box
             | itself, we can only characterize the outputs themselves.
             | 
             | More specifically, we can decide what we think the quality
             | of the output for the given input and we can attempt to
             | infer what might have happened in between. We really have
             | no idea what happened in between, and though many of the
             | "doomers" raise concerns that seem far fetched, we have
             | absolutely no way of understanding whether they are
             | completely off base or raising concerns of a system that
             | just hasn't shown problems in the input/output pairs yet.
        
           | lukeschlather wrote:
           | > (a) the LLM does reason through the rules and understands
           | what moves are legal or (b) was trained on a large set of
           | legal moves and therefore only learned to make legal moves.
           | 
           | How can you learn to make legal moves without understanding
           | what moves are legal?
        
             | _heimdall wrote:
             | I'm spit balling here so definitely take this with a grain
             | of salt.
             | 
              | If I only see legal moves, I may not think outside the box
              | and come up with moves other than what I already saw.
              | Humans run into this all the time: we see things done a
              | certain way and effectively learn that that's just how to
              | do it, and we don't innovate.
             | 
             | Said differently, if the generative AI isn't actually being
              | generative at all, meaning it's just predicting based on the
             | training set, it could be providing only legal moves
             | without ever learning or understanding the rules of the
             | game.
        
             | ramraj07 wrote:
             | I think they'll acknowledge these models are truly
              | intelligent only when the LLMs also irrationally go in
              | circles around logic to insist LLMs are statistical parrots.
        
               | _heimdall wrote:
               | Acknowledging an LLM is intelligent requires a general
               | agreement of what intelligence is and how to measure it.
               | I'd also argue that it requires a way of understanding
               | _how_ an LLM comes to its answer rather than just inputs
               | and outputs.
               | 
               | To me that doesn't seem unreasonable and has nothing to
               | do with irrationally going in circles, curious if you
               | disagree though.
        
               | Retric wrote:
               | Humans judge if other humans are intelligent without
               | going into philosophical circles.
               | 
               | How well they learn completely novel tasks (fail in
               | conversation, pass with training). How well they do
                | complex tasks (debated, just look at this thread). How
                | generally knowledgeable they are (pass). How often they
                | do nonsensical things (fail).
               | 
                | So IMO it really comes down to whether you're judging by
                | peak performances or minimum standards. If I had an
                | employee that performed as well as an LLM I'd call them an idiot
               | because they needed constant supervision for even trivial
               | tasks, but that's not the standard everyone is using.
        
               | _heimdall wrote:
               | > Humans judge if other humans are intelligent without
               | going into philosophical circles
               | 
               | That's totally fair. I expect that to continue to work
               | well when kept in the context of something/someone else
               | that is roughly as intelligent as you are. Bonus points
               | for the fact that one human understands what it means to
               | be human and we all have _roughly_ similar experiences of
               | reality.
               | 
               | I'm not so sure if that kind of judging intelligence by
               | feel works when you are judging something that is (a)
                | totally different from you or (b) massively more (or
               | less) intelligent than you are.
               | 
               | For example, I could see something much smarter than me
               | as acting irrationally when in reality they may be
               | working with a much larger or complex set of facts and
               | context that don't make sense to me.
        
           | raincole wrote:
           | > we have absolutely no way to know
           | 
           | To me, this means that it absolutely doesn't matter whether
           | LLM does reason or not.
        
             | _heimdall wrote:
             | It might if AI/LLM safety is a concern. We can't begin to
             | really judge safety without understanding how they work
             | internally.
        
         | grumpopotamus wrote:
         | I am an expert level chess player and I have seen multiple
         | people around my level play illegal moves in classical time
         | control games over the board. I have also watched streamers
         | various levels above me try to play illegal moves repeatedly
         | before realizing the UI was rejecting the move because it is
         | illegal.
        
           | zoky wrote:
           | I've been to many USCF rated tournaments and have never once
           | seen or even heard of anyone over the age of 8 try to play an
           | illegal move. It may happen every now and then, but it's
           | exceedingly rare. LLMs, on the other hand, will gladly play
           | the Siberian Swipe, and why not? There's no consequence for
           | doing so as far as they are concerned.
        
             | Dr_Birdbrain wrote:
             | There are illegal moves and there are illegal moves. There
             | is trying to move your king five squares forward (which no
             | amateur would ever do) and there is trying to move your
             | King to a square controlled by an unseen piece, which can
             | happen to somebody who is distracted or otherwise off their
             | game.
             | 
             | Trying to castle through check is one that occasionally
             | happens to me (I am rated 1800 on lichess).
        
               | dgfitz wrote:
                | Moving your king to a square controlled by an unnoticed
                | opponent piece is simply responded to with "check", no?
        
               | james_marks wrote:
               | No, that would break the rule that one cannot move into
               | check
        
               | dgfitz wrote:
               | Sorry yes, I meant the opponent would point it out. I've
               | never played professional chess.
        
               | umanwizard wrote:
               | Sure, the opponent would point it out, just like they
               | would presumably point it out if you played any illegal
               | move. In serious tournament games they would probably
               | also stop the clock, call over the arbiter, and inform
               | him or her that you made an illegal move so you can be
               | penalized (e.g. under FIDE rules if you make an illegal
               | move your opponent gets 2 extra minutes on the clock).
               | 
               | That doesn't change that it's an illegal move.
        
               | CooCooCaCha wrote:
               | This is an important distinction. Anyone with chess
               | experience would never try to move their king 5 spaces,
               | but LLMs will do crazy things like that.
        
           | jeremyjh wrote:
           | I'm rated 1450 USCF and I think I've seen 3 attempts to play
           | an illegal move across around 300 classical games OTB. Only
           | one of them was me. In blitz it does happen more.
        
           | WhyOhWhyQ wrote:
           | Would you say the apparent contradiction between what you and
           | other commenters are saying is partly explained by the high
           | volume of games you're playing? Or do you think there is some
           | other reason?
        
             | da_chicken wrote:
             | I wouldn't. I never progressed beyond chess clubs in public
             | schools and I certainly remember people making illegal
             | moves in tournaments. Like that's why they make you both
             | record all the moves. Because people make mistakes. Though,
             | honestly, I remember more notation errors than play errors.
             | 
             | Accidentally moving into check is probably the most common
              | illegal move. Castling through check is surprisingly common,
             | too. Actually moving a piece incorrectly is fairly rare,
             | though. I remember one tournament where one of the matches
             | ended in a DQ because one of the players had two white
             | bishops.
        
               | ASUfool wrote:
               | Could one have two white bishops after promoting a pawn?
        
               | IanCal wrote:
               | Promoting to anything other than a queen is rare, and I
               | expect the next most common is to a knight. Promoting to
               | a bishop, while possible, is going to be extremely rare.
        
               | umanwizard wrote:
               | Yes it's theoretically possible to have two light-squared
               | bishops due to promotions but so exceedingly rare that I
               | think most professional chess players will go their whole
               | career without ever seeing that happen.
        
           | nurettin wrote:
           | At what level are you considered an expert? IM? CM? 1900 ELO
           | OTB?
        
             | umanwizard wrote:
             | In the US at least 2000 USCF is considered "expert".
        
         | rgoulter wrote:
         | > I invite anyone making that claim to find me an "advanced
         | amateur" (as the article says of the LLM's level) chess player
         | who ever makes an illegal move. Anyone familiar with chess can
         | confirm that it doesn't really happen.
         | 
         | This is somewhat imprecise (or inaccurate).
         | 
         | A quick search on YouTube for "GM illegal moves" indicates that
         | GMs have made illegal moves often enough for there to be
         | compilations.
         | 
         | e.g. https://www.youtube.com/watch?v=m5WVJu154F0 -- The Vidit
         | vs Hikaru one is perhaps the most striking, where Vidit uses
         | his king to attack Hikaru's king.
        
           | zoky wrote:
           | It's exceedingly rare, though. There's a big difference
            | between accidentally failing to notice a move that is illegal
           | in a complicated situation, and playing a move that may or
           | may not be illegal just because it sounds kinda "chessy",
           | which is pretty much what LLMs do.
        
             | ifdefdebug wrote:
             | yes but LLM illegal moves often are not chessy at all. A
             | chessy illegal move for instance would be trying to move a
             | rook when you don't notice that it's between your king and
             | an attacking bishop. LLMs would often happily play Ba4 when
              | there's no bishop anywhere near a square from which it
              | could reach a4, or even no bishop at all. That's
             | not chessy, that's just weird.
             | 
             | I have to admit it's been a while since I played chatgpt so
             | maybe it improved.
        
           | banannaise wrote:
           | A bunch of these are just improper procedure: several who hit
           | the clock before choosing a promotion piece, and one who
           | touches a piece that cannot be moved. Even those that aren't
           | look like rational chess moves, they just fail to notice a
           | detail of the board state (with the possible exception of
           | Vidit's very funny king attack, which actually might have
           | been clock manipulation to give him more time to think with
           | 0:01 on the clock).
           | 
           | Whereas the LLM makes "moves" that clearly indicate no
           | ability to play chess: moving pieces to squares well outside
           | their legal moveset, moving pieces that aren't on the board,
           | etc.
        
             | fl7305 wrote:
             | Can a blind man sculpt?
             | 
             | What if he makes mistakes that a seeing person would never
             | make?
             | 
             | Does that mean that the blind man is not capable of
             | sculpting at all?
        
             | sixfiveotwo wrote:
             | > Whereas the LLM makes "moves" that clearly indicate no
             | ability to play chess: moving pieces to squares well
             | outside their legal moveset, moving pieces that aren't on
             | the board, etc.
             | 
             | Do you have any evidence of that? TFA doesn't talk about
             | the nature of these errors.
        
               | krainboltgreene wrote:
               | Yeah like several hundred "Chess IM/GMs react to ChatGPT
               | playing chess" videos on youtube.
        
               | sixfiveotwo wrote:
               | Very strange, I cannot spot any specifically saying that
               | ChatGPT cheated or played an illegal move. Can you help?
        
               | SonOfLilit wrote:
               | But clearly the author got his GPT to play orders of
               | magnitude better than in those videos
        
           | quuxplusone wrote:
           | "Most striking" in the sense of "most obviously not ever even
           | remotely legal," yeah.
           | 
           | But the most interesting and thought-provoking one in there
           | is [1] Carlsen v Inarkiev (2017). Carlsen puts Inarkiev in
           | check. Inarkiev, instead of making a legal move to escape
           | check, does something else. Carlsen then replies to _that_
            | move. Inarkiev challenges: Carlsen's move was illegal,
           | because the only legal "move" at that point in the game was
           | to flag down an arbiter and claim victory, which Carlsen
           | didn't!
           | 
           | [1] - https://www.youtube.com/watch?v=m5WVJu154F0&t=7m52s
           | 
           | The tournament rules at the time, apparently, fully covered
           | the situation where the game state is legal but the move is
           | illegal. They didn't cover the situation where the game state
           | was actually illegal to begin with. I'm not a chess person,
           | but it sounds like the tournament rules may have been amended
           | after this incident to clarify what should happen in this
           | kind of situation. (And Carlsen was still declared the winner
           | of this game, after all.)
           | 
           | LLM-wise, you could spin this to say that the "rational
           | grandmaster" is as fictional as the "rational consumer":
           | Carlsen, from an actually invalid game state, played "a move
           | that may or may not be illegal just because it sounds kinda
           | "chessy"," as zoky commented below that an LLM would have
           | done. He responded to the gestalt (king in check, move the
           | king) rather than to the details (actually this board
           | position is impossible, I should enter a special case).
           | 
           | OTOH, the real explanation could be that Carlsen was just
           | looking ahead: surely he knew that after his last move,
           | Inarkiev's only legal moves were harmless to him (or
           | fatalistically bad for him? Rxb7 seems like Inarkiev's
           | correct reply, doesn't it? Again I'm not a chess person) and
           | so he could focus elsewhere on the board. He merely happened
           | not to double-check that Inarkiev had actually _played_ one
            | of the legal continuations he'd already enumerated in his
           | head. But in a game played by the rules, he shouldn't have to
           | double-check that -- it is already guaranteed _by_ the rules!
           | 
           | Anyway, that's why Carlsen v Inarkiev struck me as the most
           | thought-provoking illegal move, from a computer programmer's
           | perspective.
        
           | tacitusarc wrote:
           | The one where Caruana improperly presses his clock and then
           | claims he did not so as not to lose, and the judges believe
           | him, is frustrating to watch.
        
         | mattmcknight wrote:
         | > I invite anyone making that claim to find me an "advanced
         | amateur" (as the article says of the LLM's level) chess player
         | who ever makes an illegal move.
         | 
         | I would say the analogy is more like someone saying chess moves
         | aloud. So, just as we all misspeak or misspell things from time
         | to time, the model output will have an error rate.
        
         | jeremyjh wrote:
         | Yes, I don't even know what it means to say it's 1800 strength
         | and yet plays illegal moves frequently enough that you have to
         | code retry logic into the test harness. Under FIDE rules after
         | two illegal moves the game is declared lost by the player
         | making that move. If this rule were followed, I'm wondering
         | what its rating would be.
        
           | og_kalu wrote:
            | >Yes, I don't even know what it means to say it's 1800
           | strength and yet plays illegal moves frequently enough that
           | you have to code retry logic into the test harness.
           | 
           | People are really misunderstanding things here. The one model
           | that can actually play at lichess 1800 Elo does not need any
           | of those and will play thousands of moves before a single
           | illegal one. But he isn't just testing that one specific
           | model. He is testing several models, some of which cannot
           | reliably output legal moves (and as such, this logic is
           | required)
        
         | chis wrote:
         | I agree with others that it's similar to blindfold chess and
         | would also add that the AI gets no time to "think" without
         | chain of thought like the new o1 models. So it's equivalent to
         | an advanced player, blindfolded, making moves off pure
         | intuition without system 2 thought.
        
         | bjackman wrote:
         | So just because it has different failure modes it doesn't count
         | as reasoning? Is reasoning just "behaving exactly like a human"?
         | In that case the statement "LLMs can't reason" is unfalsifiable
         | and meaningless. (Which, yeah, maybe it is).
         | 
         | The bizarre intellectual quadrilles people dance to sustain
         | their denial of LLM capabilities will never cease to amaze me.
        
         | hamilyon2 wrote:
         | The discussion in this thread is amazing. People, even renowned
         | experts in their field, make mistakes - a lot of mistakes,
         | sometimes very costly and very obvious in retrospect - in their
         | own craft.
         | 
         | Yet when an LLM, trained on a corpus of human stupidity no
         | less, makes illegal moves in chess, our brain immediately goes:
         | I don't make illegal moves in chess, so how can the computer
         | play chess if it does?
         | 
         | Perfect examples of metacognitive bias and general attribution
         | error at least.
        
           | sourcepluck wrote:
           | You would be correct to be amazed if someone was arguing:
           | 
           | "Look! It made mistakes, therefore it's definitely _not_
           | reasoning!"
           | 
           | That's certainly not what I'm saying, anyway. I was
           | responding to the argument actually being made by many here,
           | which is:
           | 
           | "Look! It plays pretty poorly, but not totally crap, and it
           | wasn't trained for playing just-above-poor chess, therefore,
           | it _understands_ chess and definitely _is_ reasoning!"
           | 
           | I find this - and much of the surrounding discussion - to be
           | quite an amazing display of people's biases, myself. People
           | _want_ to believe LLMs are reasoning, and so we're treated
           | to these merry-go-round "investigations".
        
           | stonemetal12 wrote:
           | It isn't a binary does/doesn't question. It is a question
           | of frequency and "quality" of mistakes. If it is making
           | illegal moves 0.1% of the time then sure, everybody makes
           | mistakes. If it is 30% of the time then it isn't doing so
           | well. If the illegal moves it tries to make are basic
           | "pieces don't move like that" sort of errors then the
           | next-token prediction isn't predicting so well. If the
           | legality of the moves is more subtle then maybe it isn't
           | too bad.
           | 
           | But more than being able to make moves, if we claim it
           | understands chess, shouldn't it be able to explain why it
           | chose one move over another?
        
         | fl7305 wrote:
         | > It's claimed that this model "understands" chess, and can
         | "reason", and do "actual logic" (here in the comments).
         | 
         | You can divide reasoning into three levels:
         | 
         | 1) Can't reason - just regurgitates from memory
         | 
         | 2) Can reason, but makes mistakes
         | 
         | 3) Always reasons perfectly, never makes mistakes
         | 
         | If an LLM makes mistakes, you've proven that it doesn't reason
         | perfectly.
         | 
         | You haven't proven that it can't reason.
        
         | alain94040 wrote:
         | > find me an "advanced amateur" (as the article says of the
         | LLM's level) chess player who ever makes an illegal move
         | 
         | Without a board to look at, just with the same linear text
         | input given in the prompt? I bet a lot of amateurs would not
         | give you legal moves. No drawing or side piece of paper
         | allowed.
        
       | torginus wrote:
       | Sorry - I have a question - is it possible to train models as
       | instruct models straight away? Previously LLMs were trained on
       | raw text data, but now we can generate instruct data directly,
       | either from 'teaching LLMs' or by asking existing LLMs to
       | convert raw data into instruct format.
       | 
       | Or alternatively - if chat tuning diminishes some of the models'
       | capability, would it make sense to have a smaller chat model
       | prompt a large base model, and convert back the outputs?
        
         | DHRicoF wrote:
         | I don't think there is enough (non-synthetic) data available
         | to get near what we are used to.
         | 
         | The big breakthrough of GPT was exactly that: you can train a
         | model on a (for the time) stupidly large amount of data and it
         | becomes OK-ish at a lot of tasks you haven't explicitly
         | trained it on.
        
           | torginus wrote:
           | You can make GPT rewrite all existing textual info into
           | chatbot format, so there's no loss there.
           | 
           | With newer techniques, such as chain of thought and self-
           | checking, you can also generate a ton of high-quality
           | training data that won't degrade the output of the LLM.
           | Though the degree to which you can do that is not clear to
           | me.
           | 
           | Imo it makes sense to train an LLM as a chatbot from the
           | start.
        
       | GaggiX wrote:
       | You should not finetune the models on the strongest setting of
       | Stockfish: its moves will not be understandable unless you
       | really dig deep into the position, so the model would not be
       | able to find a pattern to make sense of them. Instead I suggest
       | training on human games at a certain Elo (below grandmaster).
        
       | codeflo wrote:
       | > everyone is wrong!
       | 
       | Well, not everyone. I wasn't the only one to mention this, so I'm
       | surprised it didn't show up in the list of theories, but here's
       | e.g. me, seven days ago (source
       | https://news.ycombinator.com/item?id=42145710):
       | 
       | > At this point, we have to assume anything that becomes a
       | published benchmark is specifically targeted during training.
       | 
       | This is not the same thing as cheating/replacing the LLM _output_,
       | the theory that's mentioned and debunked in the article. And
       | now the follow-up adds weight to this guess:
       | 
       | > Here's my best guess for what is happening: ... OpenAI trains
       | its base models on datasets with more/better chess games than
       | those used by open models. ... Meanwhile, in section A.2 of this
       | paper (h/t Gwern) some OpenAI authors mention that GPT-4 was
       | trained on chess games in PGN notation, filtered to only include
       | players with Elo at least 1800.
       | 
       | To me, it makes complete sense that OpenAI would "spike" their
       | training data with data for tasks that people might actually try.
       | There's nothing unethical about this. No dataset is ever truly
       | "neutral", you make choices either way, so why not go out of your
       | way to train the model on potentially useful answers?
        
         | stingraycharles wrote:
         | Yup, I remember reading your comment and that making the most
         | sense to me.
         | 
         | OpenAI just shifted their training targets: initially they
         | thought chess was cool; maybe tomorrow they'll think Go is
         | cool, or maybe the ability to write poetry. Who knows.
         | 
         | But it seems like the simplest explanation and makes the most
         | sense.
        
           | qup wrote:
           | At current sizes, these things are like humans. They gotta
           | specialize.
           | 
           | Maybe that'll be enough moat to save us from AGI.
        
         | demaga wrote:
         | Yes, and I would like this approach to also be used in other,
         | more practical areas. I mean, more "expert" content than
         | "amateur" content in training data, regardless of area of
         | expertise.
        
         | dr_dshiv wrote:
         | I made a suggestion that they may have trained the model to be
         | good at chess to see if it helped with general intelligence,
         | just as training with math and code seems to improve other
         | aspects of logical thinking. Because, after all, OpenAI has a
         | lot of experience with game playing AI.
         | https://news.ycombinator.com/item?id=42145215
        
         | gwern wrote:
         | I think this is a little paranoid. No one is training extremely
         | large expensive LLMs on huge datasets in the hope that a
         | blogger will stumble across poor 1800 Elo performance and tweet
         | about it!
         | 
         | 'Chess' is not a standard LLM benchmark worth Goodharting; OA
         | has generally tried to solve problems the right way rather than
         | by shortcuts & cheating, and the GPTs have not heavily overfit
         | on the standard benchmarks or counterexamples that they so
         | easily could which would be so much more valuable PR (imagine
         | how trivial it would be to train on, say, 'the strawberry
         | problem'?), whereas some _other_ LLM providers do see their
         | scores drop much more in anti-memorization papers; they have a
         | clear research use of their own in that very paper mentioning
         | the dataset; and there is some interest in chess as a model
         | organism of supervision and world-modeling in LLMs because we
         | have access to oracles (and it's less boring than many things
         | you could analyze), which explains why they would be doing
         | _some_ research (if not a whole lot). Like the bullet chess LLM
         | paper from DeepMind - they aren't doing that as part of a
         | cunning plan to make Gemini cheat on chess skills and help GCP
         | marketing!
        
       | deadbabe wrote:
       | If you randomly position pieces on the board and then ask the LLM
       | to play chess, where each piece still moves according to its
       | normal rules, does it know how to play still?
        
         | Hilift wrote:
         | If the goal is to create a model that simulates normal human
         | intelligence, yes. Some try to measure accuracy or performance
         | based on expertise though.
        
       | keskival wrote:
       | "I'm not sure, because OpenAI doesn't deign to share gpt-4-base,
       | nor to allow queries of gpt-4o in completion mode."
       | 
       | I would guess GPT-4o isn't first pre-trained and then instruct-
       | tuned, but trained directly with refined instruction-following
       | material.
       | 
       | This material probably contains way fewer chess games.
        
         | toxik wrote:
         | Why do you think that? InstructGPT was predominantly trained as
         | a next-token predictor on whatever soup of data OpenAI curated
         | at the time. The alignment signal (both the RL part and the
         | supervised prompt/answer pairs) is a tiny part of the gradient.
        
       | wavemode wrote:
       | I have the exact same problem with this article that I had with
       | the previous one - the author fails to provide any data on the
       | frequency of illegal moves.
       | 
       | Thus it's impossible to draw any meaningful conclusions. It would
       | be similar to if I claimed that an LLM is an expert doctor, but
       | in my data I've filtered out all of the times it gave incorrect
       | medical advice.
        
         | falcor84 wrote:
         | I would argue that it's more akin to filtering out the chit-
         | chat with the patient, where the doctor explained things in an
         | imprecise manner, keeping only the formal and valid medical
         | notation
        
           | caddemon wrote:
           | There is no legitimate reason to make an illegal move in
           | chess though? There are reasons why a good doctor might
           | intentionally explain things imprecisely to a patient.
        
             | hnthrowaway6543 wrote:
             | > There is no legitimate reason to make an illegal move in
             | chess though?
             | 
             | If you make an illegal move and the opponent doesn't notice
             | it, you gain a significant advantage. LLMs just have David
             | Sirlin's "Playing to Win" as part of their training data.
        
           | ses1984 wrote:
           | It's like the doctor saying, "you have cancer? Oh you don't?
           | Just kidding. Parkinson's. Oh it's not that either? How about
           | common cold?"
        
             | falcor84 wrote:
             | But the difference is that valid bad moves (equivalents of
             | "cancer") were included in the analysis, it's only invalid
             | ones (like "your body is kinda outgrowing itself") that
             | were excluded from the analysis
        
               | ses1984 wrote:
               | What makes a chess move invalid is the state of the
               | board. I don't think moves like "pick up the pawn and
               | throw it across the room" were considered.
        
               | toast0 wrote:
               | That's a valid move in Monopoly though. Although it's
               | much preferred to pick up the table and throw it.
        
         | sigmar wrote:
         | Don't think that analogy works unless you could write a script
         | that automatically removes incorrect medical advice, because
         | then you would indeed have an LLM-with-a-script that was an
         | expert doctor (which you can do for illegal chess moves, but
         | obviously not for evaluating medical advice)
        
           | kcbanner wrote:
           | It would be possible to employ an expert doctor, instead of
           | writing a script.
        
             | ben_w wrote:
             | Which is cheaper:
             | 
             | 1. having a human expert create every answer
             | 
             | or
             | 
             | 2. having an expert check 10 answers, each of which has a
             | 90% chance of being right, and then manually redoing the one
             | which was wrong
             | 
             | Now add the complications that:
             | 
             | * option 1 also isn't 100% correct
             | 
             | * nobody knows which things in option 2 are correlated or
             | not and if those are or aren't correlated with human errors
             | so we might be systematically unable to even recognise the
             | errors
             | 
             | * even if we could, humans not only get lazy without
             | practice but also get bored if the work is too easy, so a
             | short-term study in efficiency changes doesn't tell you
             | things like "after 2 years you get mass resignations by the
             | competent doctors, while the incompetent just say 'LGTM' to
             | all the AI answers"
        
           | wavemode wrote:
           | You can write scripts that correct bad math, too. In fact
           | most of the time ChatGPT will just call out to a calculator
           | function. This is a smart solution, and very useful for end
           | users! But, still, we should not try to use that to make the
           | claim that LLMs have a good understanding of math.
        
             | henryfjordan wrote:
             | At what point does "knows how to use a calculator" equate
             | to knowing how to do math? Feels pretty close to me...
        
               | Tepix wrote:
               | Well, LLMs are bad at math but they're ok at detecting
               | math and delegating it to a calculator program.
               | 
               | It's kind of like humans.
        
             | afro88 wrote:
             | If a script were applied that corrected "bad math" and now
             | the LLM could solve complex math problems that you can't
             | one-shot throw at a calculator, what would you call it?
        
               | sixfiveotwo wrote:
               | It's a good point.
               | 
               | But this math analogy is not quite appropriate: there's
               | abstract math and arithmetic. A good math practitioner
               | (LLM or human) can be bad at arithmetic, yet good at
               | abstract reasoning. The latter doesn't (necessarily)
               | require the former.
               | 
               | In chess, I don't think that you can build a good
               | strategy if it relies on illegal moves, because tactics
               | and strategies are tied.
        
             | vunderba wrote:
             | Agreed. It's not the same thing and we should strive for
             | precision (LLMs are already opaque enough as it is).
             | 
             | An LLM that recognizes an input as "math" and calls out to
             | a NON-LLM to solve the problem vs an LLM that recognizes an
             | input as "math" and also uses next-token prediction to
             | produce an accurate response _ARE DIFFERENT_.
        
         | og_kalu wrote:
         | 3.5-turbo-instruct makes about 5 or fewer illegal moves in
         | 8205. It's not in this article, but turbo-instruct has been
         | evaluated before.
         | 
         | https://github.com/adamkarvonen/chess_gpt_eval
        
         | timjver wrote:
         | > It would be similar to if I claimed that an LLM is an expert
         | doctor, but in my data I've filtered out all of the times it
         | gave incorrect medical advice.
         | 
         | Computationally it's trivial to detect illegal moves, so it's
         | nothing like filtering out incorrect medical advice.
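         | 
         | With the python-chess library, for instance, the legality
         | check is a few lines (a minimal sketch, assuming the move
         | arrives in SAN notation):
         | 
         |     import chess
         | 
         |     board = chess.Board()          # starting position
         |     board.push_san("e4")           # apply the moves so far
         |     legal = {board.san(m) for m in board.legal_moves}
         |     print("Nf6" in legal)          # True
         |     print("Ke2" in legal)          # False: illegal reply
         | 
         | There is no equivalent of board.legal_moves for medical
         | advice.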
        
           | wavemode wrote:
           | As I wrote in another comment - you can write scripts that
           | correct bad math, too. But we don't use that to claim that
           | LLMs have a good understanding of math.
        
             | ben_w wrote:
             | I'd say that's because we don't understand what we mean by
             | "understand".
             | 
             | Hardware that _accurately_ performs maths faster than all
             | of humanity combined is so cheap as to be disposable, but I've
             | yet to see anyone claim that a Pi Zero has
             | "understanding" of anything.
             | 
             | An LLM _can_ display the _viva voce_ approach that Turing
             | suggested[0], and do it well. Ironically for all those now
             | talking about  "stochastic parrots", the passage reads:
             | 
             | """... The game (with the player B omitted) is frequently
             | used in practice under the name of viva voce to discover
             | whether some one really understands something or has
             | 'learnt it parrot fashion'. ..."
             | 
             | Showing that not much has changed on the philosophy of this
             | topic since it was invented.
             | 
             | [0]
             | https://academic.oup.com/mind/article/LIX/236/433/986238
        
             | SpaceManNabs wrote:
             | I don't know. I have talked to a few math professors, and
             | they think LLMs are as good as a lot of their peers when it
             | comes to hallucinations and being able to discuss ideas on
             | very niche topics, as long as the context is fed in. If Tao
             | is calling some models "a mediocre, but not completely
             | incompetent [...] graduate student", then they seem to
             | understand math to some degree to me.
        
           | KK7NIL wrote:
           | > Computationally it's trivial to detect illegal moves
           | 
           | You're strictly correct, but the rules for chess are
           | infamously hard to implement (as anyone who's tried to write
           | a chess program will know), leading to minor bugs in a lot of
           | chess programs.
           | 
           | For example, there's this old myth about vertical castling
           | being allowed due to ambiguity in the ruleset:
           | https://www.futilitycloset.com/2009/12/11/outside-the-box/
           | (Probably not historically accurate).
           | 
           | If you move beyond legal positions into who wins when one
           | side flags, the rules state that the other side should be
           | awarded a victory if checkmate was possible with any legal
           | sequence of moves. This is so hard to check that no chess
           | program tries to implement it, instead using simpler rules to
           | achieve a very similar but slightly more conservative result.
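           | 
           | (For the curious, the usual conservative approximation is
           | roughly the following -- sketched here with python-chess;
           | real arbiters and servers differ in the details:
           | 
           |     import chess
           | 
           |     def result_on_flag(board, flagged):
           |         """Side `flagged` ran out of time. Award the win
           |         to the opponent unless they have insufficient
           |         material to ever deliver mate."""
           |         if board.has_insufficient_material(not flagged):
           |             return "draw"
           |         return "opponent wins"
           | 
           | whereas the FIDE rule asks whether _any_ legal sequence of
           | moves could lead to mate, which is much harder to decide.)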
        
             | rco8786 wrote:
             | I got a kick out of that link. Had certainly never heard of
             | "vertical castling" previously.
        
             | admax88qqq wrote:
             | > You're strictly correct, but the rules for chess are
             | infamously hard to implement
             | 
             | Come on. Yeah they're not trivial but they've been done
             | numerous times. There's been chess programs for almost as
             | long as there have been computers. Checking legal moves is
             | a _solved problem_.
             | 
             | Detecting valid medical advice is not. The two are not even
             | remotely comparable.
        
         | theptip wrote:
         | This is a crazy goal-post move. TFA is proving a positive
         | capability, and rejecting the null hypothesis that "LLMs can't
         | think they just regurgitate".
         | 
         | Making some illegal moves doesn't invalidate the demonstrated
         | situational logic intelligence required to play at Elo 1800.
         | 
         | (Another angle: a human on Chess.com also has any illegal move
         | they try to make ignored, too.)
        
           | wavemode wrote:
           | It's not a goalpost move. As I've already said, I have the
           | exact same problem with this article as I had with the
           | previous one. My goalposts haven't moved, and my standards
           | haven't changed. Just provide the data! How hard can it be?
           | Why leave it out in the first place?
        
           | photonthug wrote:
           | > Making some illegal moves doesn't invalidate the
           | demonstrated situational logic intelligence
           | 
           | That's exactly what it does. 1 illegal move in 1 million or
           | 100 million or any other sample size you want to choose means
           | it doesn't understand chess.
           | 
           | People in this thread are really distracted by the medical
           | analogy so I'll offer another: you've got a bridge that
           | allows millions of vehicles to cross, and randomly falls down
           | if you tickle it wrong, maybe a car of rare color. One key
           | aspect of bridges is that they work reliably for any vehicle,
           | and once they fail they don't work with any vehicle. A bridge
           | that sometimes fails and sometimes doesn't isn't a bridge as
           | much as a death trap.
        
             | og_kalu wrote:
             | >1 illegal move in 1 million or 100 million or any other
             | sample size you want to choose means it doesn't understand
             | chess
             | 
             | Highly rated chess players make illegal moves. It's rare
             | but it happens. They don't understand chess?
        
               | photonthug wrote:
               | > Then no human understands chess
               | 
               | Humans with correct models may nevertheless make errors
               | in rule applications. Machines are good at applying
               | rules, so when they fail to apply rules correctly, it
               | means they have incorrect, incomplete, or totally absent
               | models.
               | 
               | Without using a word like "understands" it seems clear
               | that the same _apparent_ mistake has different causes..
               | and model errors are very different from model-
               | application errors. In a math or physics class this is
               | roughly the difference between carry-the-one arithmetic
               | errors vs using an equation from a completely wrong
               | domain. The word "understands" is loaded in discussion of
               | LLMs, but everyone knows which mistake is going to get
               | partial credit vs zero credit on an exam.
        
               | og_kalu wrote:
               | >Humans with correct models may nevertheless make errors
               | in rule applications.
               | 
               | Ok.
               | 
               | >Machines are good at applying rules, so when they fail
               | to apply rules correctly, it means they have incorrect or
               | incomplete models.
               | 
               | I don't know why people continue to force the wrong
               | abstraction. LLMs do not work like 'machines'. They don't
               | 'follow rules' the way we understand normal machines to
               | 'follow rules'.
               | 
               | >so when they fail to apply rules correctly, it means
               | they have incorrect or incomplete models.
               | 
               | Everyone has incomplete or incorrect models. It doesn't
               | mean we always say they don't understand. Nobody says
               | Newton didn't understand gravity.
               | 
               | >Without using a word like "understands" it seems clear
               | that the same apparent mistake has different causes.. and
               | model errors are very different from model-application
               | errors.
               | 
               | It's not very apparent, no. You've just decided it has
               | different causes because of preconceived notions on how
               | you think all machines must operate in all
               | configurations.
               | 
               | LLMs are not the logic automatons in science fiction.
               | They don't behave or act like normal machines in any way.
               | The internals run some computations to make predictions
               | but so does your nervous system. Computation is
               | substrate-independent.
               | 
               | I don't even know how you can make this distinction
               | without seeing what sort of illegal moves it makes. If it
               | makes the sort high-rated players make, then what?
        
               | photonthug wrote:
               | I can't tell if you are saying the distinction between
               | model errors and model-application errors doesn't exist
               | or doesn't matter or doesn't apply here.
        
               | og_kalu wrote:
               | I'm saying:
               | 
               | - Generally, we do not say someone does not understand
               | just because of a model error. The model error has to be
               | sufficiently large or the model sufficiently narrow. No-
               | one says Newton didn't understand gravity just because
               | his model has an error in it but we might say he didn't
               | understand some aspects of it.
               | 
               | - You are saying the LLM is making a model error (rather
               | than an application error) only because of
               | preconceived notions of how 'machines' must behave, not
               | on any rigorous examination.
        
               | photonthug wrote:
               | Suppose you're right, the internal model of game rules is
               | perfect but the application of the model for next-move is
               | imperfect. Unless we can actually separate the two, does
               | it matter? Functionally I mean, not philosophically. If
               | the model was correct, maybe we could get a useful
               | version of it out by asking it to _write_ a chess engine
               | instead of _act_ as a chess engine. But when the prolog
               | code for that is as incorrect as the illegal chess move
               | was, will you say again that the model is correct, but
               | the usage of it resulted merely resulted in minor errors?
               | 
               | > You are saying the LLM is making a model error (rather
               | than an an application error) only because of
               | preconceived notions of how 'machines' must behave, not
               | on any rigorous examination.
               | 
               | Here's an anecdotal examination. After much talk about
               | LLMs and chess, and math, and formal logic here's the
               | state of the art, simplified from dialog with gpt today:
               | 
               | > blue is red and red is blue. what color is the sky? >>
               | <blah blah, restates premise, correctly answers "red">
               | 
               | At this point fans rejoice, saying it understands
               | hypotheticals and logic. Dialogue continues..
               | 
               | > name one red thing >> <blah blah, restates premise,
               | incorrectly offers "strawberries are red">
               | 
               | At this point detractors rejoice, declare that it doesn't
               | understand. Now the conversation devolves into semantics
               | or technicalities about prompt-hacks, training data,
               | weights. Whatever. We don't need chess. Just look at it,
               | it's broken as hell. Discussing whether the error is
               | human-equivalent isn't the point either. It's broken! A
               | partially broken process is no solid foundation to build
               | others on. And while there are some exceptions, an
               | unreliable tool/agent is often worse than none at all.
        
               | og_kalu wrote:
               | >It's broken! A partially broken process is no solid
               | foundation to build others on. And while there are some
               | exceptions, an unreliable tool/agent is often worse than
               | none at all.
               | 
               | Are humans broken? Because our reasoning is a very
               | broken process. You say it's no solid foundation? Take a
               | look around you. This broken processor is the foundation
               | of society and the conveniences you take for granted.
               | 
               | The vast vast majority of human history, there wasn't
               | anything even remotely resembling a non-broken general
               | reasoner. And you know the funny thing? There still
               | isn't. When people like you say LLMs don't reason, they
               | hold them to a standard that doesn't exist. Where is this
               | non-broken general reasoner in anywhere but fiction and
               | your own imagination?
               | 
               | >And while there are some exceptions, an unreliable
               | tool/agent is often worse than none at all.
               | 
               | Since you clearly mean 'unreliable' to be 'makes no
               | mistakes/is not broken', then no human is a reliable agent.
               | Clearly, the real exception is when an unreliable agent
               | is worse than nothing at all.
        
               | sixfiveotwo wrote:
               | > Machines are good at applying rules, so when they fail
               | to apply rules correctly, it means they have incorrect,
               | incomplete, or totally absent models.
               | 
               | That's assuming that, somehow, a LLM is a machine. Why
               | would you think that?
        
               | photonthug wrote:
               | Replace the word with one of your own choice if that will
               | help us get to the part where you have a point to make?
               | 
               | I think we are discussing whether LLMs can emulate chess
               | playing machines, regardless of whether they are actually
               | literally composed of a flock of stochastic parrots..
        
               | XenophileJKO wrote:
               | Engineers really have a hard time coming to terms with
               | probabilistic systems.
        
               | sixfiveotwo wrote:
               | That's simple logic. Quoting you again:
               | 
               | > Machines are good at applying rules, so when they fail
               | to apply rules correctly, it means they have incorrect,
               | incomplete, or totally absent models.
               | 
               | If this line of reasoning applies to machines, but LLMs
               | aren't machines, how can you derive any of these claims?
               | 
               | "A implies B" may be right, but you must first
               | demonstrate A before reaching conclusion B.
               | 
               | > I think we are discussing whether LLMs can emulate
               | chess playing machines
               | 
               | That is incorrect. We're discussing whether LLMs can play
               | chess. Unless you think that human players also emulate
               | chess playing machines?
        
             | benediktwerner wrote:
             | Try giving a random human 30 chess moves and asking them
             | to make a non-terrible legal move. Average humans quite
             | often try to make illegal moves even when clearly seeing
             | the board before them. There are even plenty of cases
             | where people reported a bug because the chess application
             | didn't let them make an illegal move they thought was
             | legal.
             | 
             | And the sudden comparison to something that's safety
             | critical is extremely dumb. Nobody said we should tie the
             | LLM to a nuclear bomb that explodes if it makes a single
             | mistake in chess.
             | 
             | The point is that it plays at a level far far above making
             | random legal moves or even average humans. To say that that
             | doesn't mean anything because it's not perfect is simply
             | insane.
        
               | photonthug wrote:
               | > And the sudden comparison to something that's safety
               | critical is extremely dumb. Nobody said we should tie the
               | LLM to a nuclear bomb that explodes if it makes a single
               | mistake in chess.
               | 
               | But it actually is safety critical very quickly whenever
               | you say something like "works fine most of the time, so
               | our plan going forward is to dismiss any discussion of
               | when it breaks and why".
               | 
               | A bridge failure feels like the right order of magnitude
               | for the error rate and effective misery that AI has
               | already quietly caused with biased models where one in a
               | million resumes or loan applications is thrown out. And a
               | nuclear bomb would actually kill less people than a full
               | on economic meltdown. But I'm sure no one is using LLMs
               | in finance at all right?
               | 
               | It's so arrogant and naive to ignore failure modes that
               | we don't even understand yet... at least bridges and steel
               | have specs. Software "engineering" was always a very
               | suspect name for the discipline but whatever claim we had
               | to it is worse than ever.
        
         | sixo wrote:
         | When I play chess I filter out all kinds of illegal moves. I
         | also filter out bad moves. A human is more like "recursively
         | thinking of ideas and then evaluating them with another part of
         | your model", so why not let the LLMs do the same?
        
           | skydhash wrote:
           | Because that's not what happens? We learn through symbolic
           | meaning and rules which then form a consistent system. Then
           | we can have a goal and continuously evaluate if we're within
           | the system and transitioning towards that goal. The nice
           | thing is that we don't have to compute the whole simulation
           | in our brains and can start again from the real world. The
           | more you train, the better your heuristics become and the
           | more your efficiency increases.
           | 
           | The internal model of a LLM is statistical text. Which is
           | linear and fixed. Not great other than generating text
           | similar to what was ingested.
        
             | hackinthebochs wrote:
             | >The internal model of a LLM is statistical text. Which is
             | linear and fixed.
             | 
             | Not at all. Like seriously, not in the slightest.
        
               | skydhash wrote:
               | What does it encode? Images? Scent? Touch? Some higher
               | dimensional qualia?
        
               | hackinthebochs wrote:
               | Well, a simple description is that they discover circuits
               | that reproduce the training sequence. It turns out that
               | in the process of this, they recover relevant
               | computational structures that generalize the training
               | sequence. The question of how far they generalize is
               | certainly up for debate. But you can't reasonably deny
               | that they generalize to a certain degree. After all, most
               | sentences they are prompted on are brand new and they
               | mostly respond sensibly.
               | 
               | Their representation of the input is also not linear.
               | Transformers use self-attention which relies on the
               | softmax function, which is non-linear.
        
             | fl7305 wrote:
             | > The internal model of a LLM is statistical text. Which is
             | linear and fixed. Not great other than generating text
             | similar to what was ingested.
             | 
             | The internal model of a CPU is linear and fixed. Yet, a CPU
             | can still generate an output which is very different from
             | the input. It is not a simple lookup table, instead it
             | executes complex algorithms.
             | 
             | An LLM has large amounts of input processing power. It has
             | a large internal state. It executes "cycle by cycle",
             | processing the inputs and internal state to generate output
             | data and a new internal state.
             | 
             | So why shouldn't LLMs be capable of executing complex
             | algorithms?
        
               | skydhash wrote:
               | It probably can, but how will those algorithms be
               | created? And the representation of both input and output.
               | If it's text, the most efficient way is to construct a
               | formal system. Or a statistical model if ambiguous and
               | incorrect results are OK in the grand scheme of things.
               | 
               | The issue is always input consumption, and output
               | correctness. In a CPU, we take great care with data
               | representation and protocol definition, then we do formal
               | verification on the algorithms, and we can be pretty sure
               | that the output are correct. So the issue is that the
               | internal models (for a given task) of LLMs are not
               | consistent enough and the referential window (keeping
               | track of each item in the system) is always too small.
        
         | GuB-42 wrote:
         | > Thus it's impossible to draw any meaningful conclusions. It
         | would be similar to if I claimed that an LLM is an expert
         | doctor, but in my data I've filtered out all of the times it
         | gave incorrect medical advice.
         | 
         | Not really, you can try to make illegal moves in chess, and
         | usually, you are given a time penalty and get to try again, so
         | even in a real chess game, illegal moves are "filtered out".
         | 
         | And for the "medical expert" analogy, let's say that you
         | compare systems based on the well-being of the patients after
         | they follow the advice. I think it is meaningful even if you
         | filter out advice that is obviously inapplicable, for example
         | because it refers to non-existent body parts.
        
         | koolala wrote:
         | I want to see graphs of moves the author randomly made too.
         | Maybe even plotting a random-move player on the performance
         | graphs vs. the AIs.
         | 
         | It's beginner chess and beginners make moves at random all the
         | time.
        
           | benediktwerner wrote:
           | 1750 Elo is extremely far from beginner chess. The random
           | mover bot on Lichess has like 700 rating.
           | 
           | And the article does show various graphs of the badly playing
           | models which will hardly play worse than random but are
           | clearly far below the good models.
        
         | Der_Einzige wrote:
         | Correct - dynamic grammar-based/constrained sampling can be
         | used to force the model, at each time step, to only make
         | valid moves (and you don't have to do it in the prompt like
         | this article does!!!)
         | 
         | I have NO idea why no one seems to do this. It's a similar
         | issue with LLM-as-judge evaluations. Often they are begging
         | to be combined with grammar-based/constrained/structured
         | sampling. So much good stuff in LLM land goes unused for no
         | good reason! There are several libraries for implementing
         | this easily: outlines, guidance, lm-format-enforcer, and
         | likely many more. You can even do it now with OpenAI!
         | 
         | Oobabooga text gen webUI literally has chess as one of its
         | candidate examples of grammar-based sampling!!!
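         | 
         | The gist of the constrained approach, as a rough sketch with
         | python-chess (the constrained_choice function here is a
         | stand-in for whatever library you use; a real implementation
         | would mask the model's logits down to the allowed strings):
         | 
         |     import chess
         |     import random
         | 
         |     def constrained_choice(prompt, choices):
         |         # Stand-in for a real constrained sampler (outlines,
         |         # guidance, ...); here we just pick at random from
         |         # the allowed set.
         |         return random.choice(choices)
         | 
         |     board = chess.Board()
         |     legal_san = [board.san(m) for m in board.legal_moves]
         |     move = constrained_choice("1.", legal_san)  # always legal
         |     board.push_san(move)
         | 
         | The model never even gets the chance to emit an illegal move.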
        
         | rcxdude wrote:
         | I don't think this is super relevant. I mean, it would be
         | interesting (especially if there was a meaningful difference in
         | the number of illegal move attempts between the different
         | approaches, doubly so if that didn't correlate with the
         | performance when illegal moves are removed), but I don't think
         | it really affects the conclusions of the article: picking
         | randomly from the set of legal moves makes for a truly terrible
         | chess player, so clearly the LLMs are bringing something to the
         | party such that sampling from their output performs
         | significantly better. Splitting hairs about the capability of
         | the LLM on its own (i.e. insisting on defining attempts at an
         | illegal move as a game loss for the purposes of rating) seems
         | pretty beside the point.
        
         | hansvm wrote:
         | There's a subtle distinction though; if you're able to filter
         | out illegal behavior, the move quality conditioned on legality
         | can be extremely different from arbitrary move quality (and, as
         | you might see in LLM json parsing, conditioning per-token can
         | be very different from conditioning per-response).
         | 
         | If you're arguing that the singularity already happened then
         | your criticism makes perfect sense; these are dumb machines,
         | not useful yet for most applications. If you just want to use
         | the LLM as a tool though, the behavior when you filter out
         | illegal responses (assuming you're able to do so) is the only
         | reasonable metric.
         | 
         | Analogizing to a task I care a bit about: Current-gen LLMs are
         | somewhere between piss-poor and moderate at generating recipes.
         | With a bit of prompt engineering most recipes pass my "bar",
         | but they're still often lacking in one or more important
         | characteristics. If you do nothing other than ask it to
         | generate many options and then as a person manually filter to
         | the subset of ideas (around 1/20) which look stellar, it's both
         | very effective at generating good recipes, and they're usually
         | much better than my other sources of stellar recipes (obviously
         | not generally applicable because you have to be able to tell
         | bad recipes from good at a glance for that workflow to make
         | sense). The fact that most of the responses are garbage doesn't
         | really matter; it's still an improvement to how I cook.
        
       | tech_ken wrote:
       | > It's ridiculously hard to find the optimal combination of
       | prompts and examples and fine-tuning, etc. It's a very large
       | space, there are no easy abstractions to allow you to search
       | through the space, LLMs are unpredictable and fragile, and these
       | experiments are slow and expensive.
       | 
       | Regardless of the actual experiment outcome, I think this is a
       | super valuable insight. The "Should we provide legal moves?"
       | section is an excellent case study of this: an extremely
       | prudent idea actually degrades model performance, and quite
       | badly. It's like
       | that crocodile game where you're pushing teeth until it clamps
       | onto your hand.
        
       | subarctic wrote:
       | The author either didn't read the Hacker News comments last time,
       | or he missed the top theory that said they probably used chess as
       | a benchmark when they developed the model that is good at chess
       | for whatever business reasons they had at the time.
        
         | wavemode wrote:
         | This is plausible. One of the top chess engines in the world
         | (Leela) is just a neural network trained on billions of chess
         | games.
         | 
         | So it makes sense that an LLM would also be able to acquire
         | some skill by simply having a large volume of chess games in
         | its training data.
         | 
         | OpenAI probably just eventually decided it wasn't useful to
         | keep pursuing chess skill.
        
         | devindotcom wrote:
         | fwiw this is exactly what i thought - oai pursued it as a
         | skillset (likely using a large chess dataset) for their own
         | reasons and then abandoned it as not particularly beneficial
         | outside chess.
         | 
         | It's still interesting to try to replicate how you would make a
         | generalist LLM good at chess, so i appreciated the post, but I
         | don't think there's a huge mystery!
        
         | brcmthrowaway wrote:
         | Oh really! What happened to the theory that training on code
         | magically caused some high level reasoning ability?
        
       | koolala wrote:
       | Next, test an image & text model! Chess is way easier when you can
       | see the board.
        
       | amelius wrote:
       | I wonder what would happen if they changed the prompt such that
       | the llm is asked to explain their strategy first. Or to explain
       | their opponent's strategy.
        
       | code51 wrote:
       | Initially, LLM researchers were saying training on code samples
       | made the "reasoning" better. Now, if the "language to world
       | model" thesis is working, shouldn't chess actually be the
       | smallest case for it?
       | 
       | I can't understand why no research group is going hard at this.
        
         | throwaway314155 wrote:
         | I don't think training on code and training on chess are even
         | remotely comparable in terms of available data and linguistic
         | competency required. Coding (in the general case, which is what
         | these models try to approach) is clearly the harder task and
         | contains _massive_ amounts of diverse data.
         | 
         | Having said all of that, it wouldn't surprise me if the
         | "language to world model" thesis you reference is indeed wrong.
         | But I don't think a model that plays chess well disproves it,
         | particularly since there are chess engines using old-fashioned
         | approaches that utterly destroy LLMs.
        
       | bee_rider wrote:
       | Extremely tangential, but how do chess engines do when playing
       | from illegal board states? Could the LLM have a chance of
       | competing with a real chess engine from there?
       | 
       | Understanding is a funny concept to try to apply to computer
       | programs anyway. But playing from an illegal state seems (to me
       | at least) to indicate something interesting about the ability to
       | comprehend the general idea of chess.
        
       | derefr wrote:
       | > Many, many people suggested that there must be some special
       | case in gpt-3.5-turbo-instruct that recognizes chess notation and
       | calls out to an external chess engine.
       | 
       | Not that I think there's anything inherently unreasonable about
       | an LLM understanding chess, but I think the author missed a
       | variant hypothesis here:
       | 
       | What if that specific model, when it recognizes chess notation,
       | is trained to silently "tag out" for _another, more specialized
       | LLM, that is specifically trained on a majority-chess dataset_?
       | (Or -- perhaps even more likely -- the model is trained to
       | recognize the need to activate a chess-playing _LoRA adapter_?)
       | 
       | It would still be an LLM, so things like "changing how you prompt
       | it changes how it plays" would still make sense. Yet it would be
       | one that has spent a lot more time modelling chess than other
       | things, and never ran into anything that distracted it enough to
       | catastrophically forget how chess works (i.e. to reallocate some
       | of the latent-space vocabulary on certain layers from modelling
       | chess, to things that matter more to the training function.)
       | 
       | And I could certainly see "playing chess" as a good proving
       | ground for testing the ability of OpenAI's backend to recognize
       | the need to "loop in" a LoRA in the inference of a response. It's
       | something LLM base models suck at; but it's also something you
       | intuitively _could_ train an LLM to do (to at least a proficient-
       | ish level, as seen here) if you had a model focus on just
       | learning that.
       | 
       | Thus, "ability of our [framework-mediated] model to play chess"
       | is easy to keep an eye on, long-term, as a proxy metric for "how
       | well our LoRA-activation system is working", without needing to
       | worry that your next generation of base models might suddenly
       | invalidate the metric by getting good at playing chess without
       | any "help." (At least not any time soon.)
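       | 
       | The routing piece of that hypothesis is cheap to build, for
       | what it's worth. A rough sketch of the kind of detector that
       | could gate a specialized adapter (the names here are my own
       | illustration, not anything OpenAI has described):
       | 
       |     import re
       | 
       |     # Crude SAN detector: castling or piece/pawn moves.
       |     SAN = (r"(?:O-O(?:-O)?|[KQRBN]?[a-h]?[1-8]?x?"
       |            r"[a-h][1-8](?:=[QRBN])?)[+#]?")
       |     # "Looks like PGN": two or more numbered move pairs.
       |     PGN_LIKE = re.compile(rf"(?:\d+\.\s*{SAN}\s+{SAN}\s*){{2,}}")
       | 
       |     def looks_like_chess(prompt: str) -> bool:
       |         return bool(PGN_LIKE.search(prompt))
       | 
       |     print(looks_like_chess("1. e4 e5 2. Nf3 Nc6"))  # True
       |     # if True: route to the chess LoRA / specialized model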
        
         | throwaway314155 wrote:
         | > but I think the author missed a variant hypothesis here:
         | 
         | > What if that specific model, when it recognizes chess
         | notation, is trained to silently "tag out" for another, more
         | specialized LLM, that is specifically trained on a majority-
         | chess dataset? (Or -- perhaps even more likely -- the model is
         | trained to recognize the need to activate a chess-playing LoRA
         | adapter?)
         | 
         | Pretty sure your variant hypothesis is sufficiently covered by
         | the author's writing.
         | 
         | So strange that people are so attached to conspiracy theories
         | in this instance. Why would OpenAI or anyone go through all the
         | trouble? The proposals outlined in the article make far more
         | sense and track well with established research (namely that
         | applying RLHF to a "text-only" model tends to wreak havoc on
         | said model).
        
       | bob1029 wrote:
       | I find it amusing that we would frame an ensemble of models as
       | "cheating". Routing to a collection of specialized models via
       | classification layers seems like the most obvious path for adding
       | practical value to these solutions.
       | 
       | Why conflate the parameters of chess with checkers and go if you
       | already have high quality models for each? I thought tool use and
       | RAG were fair game.
        
       | copperroof wrote:
       | I just want a hacker news no-LLM filter. The site has been almost
       | unusable for a year now.
        
       | XenophileJKO wrote:
       | So this article is what happens when people who don't really
       | understand the models "test" things.
       | 
       | There are several fatal flaws.
       | 
       | The first problem is that he isn't clearly and concisely
       | displaying the current board state. He is expecting the model to
       | attend a move sequence to figure out the board state.
       | 
       | Secondly, he isn't allowing the model to think elastically using
       | COT or other strategies.
       | 
       | Honestly, I am shocked it is working at all. He has basically
       | formulated the problem in the worst possible way.
        
         | yeevs wrote:
         | I'm not sure CoT would help in this situation. I am an
         | amateur at chess, but in my experience a large part of
         | playing is intuition, and I'm not confident the model could
         | even accurately summarise its thinking. There are tasks on
         | which models perform worse when explaining their reasoning.
         | However, this is completely vibes based.
        
           | XenophileJKO wrote:
           | Given my experience with the models, giving it the ability to
           | think would allow it to attend to different ramifications of
           | the current board layout. I would expect a non-trivial
           | performance gain.
        
       | cma wrote:
       | One thing missing from the graphs is whether 3.5-turbo-instruct
       | also gets better with the techniques. Is fine-tuning available for
       | it?
        
       | __MatrixMan__ wrote:
       | It would be fun to play against an LLM without having to think
       | about the prompting, if only as a novel way to get a "feel" for
       | how they "think".
        
       | timzaman wrote:
       | "all LLMs" - OP only tested OpenAI LLMs. Try Gemini.
        
       ___________________________________________________________________
       (page generated 2024-11-22 23:00 UTC)