[HN Gopher] OK, I can partly explain the LLM chess weirdness now
___________________________________________________________________
OK, I can partly explain the LLM chess weirdness now
Author : dmazin
Score : 422 points
Date : 2024-11-21 17:55 UTC (1 day ago)
(HTM) web link (dynomight.net)
(TXT) w3m dump (dynomight.net)
| amrrs wrote:
| >Theory 1: Large enough base models are good at chess, but this
| doesn't persist through instruction tuning to chat models.
|
| I lean mostly towards this, and also towards the chess notation
| itself - I'm not sure whether it gets chopped up during
| tokenization unless it's very precisely processed.
|
| It's like designing an LLM just for predicting protein
| sequences, because the ordering matters. The base data might
| contain it, but I don't think the intention is for that to
| carry through.
| com2kid wrote:
| This makes me wonder what scenarios would be unlocked if OpenAI
| gave access to gpt4-instruct.
|
| I wonder if they avoid that due to the potential for negative
| press from the outputs of a more "raw" model.
| tromp wrote:
| > For one, gpt-3.5-turbo-instruct rarely suggests illegal moves,
| even in the late game. This requires "understanding" chess.
|
| Here's one way to test whether it really understands chess. Make
| it play the next move in 1000 random legal positions (in which no
| side is checkmated yet). Such positions can be generated using
| the ChessPositionRanking project at [1]. Does it still rarely
| suggest illegal moves in these totally weird positions, which
| will be completely unlike any it would have seen in training
| (and in which the legal move choice is often highly restricted)?
|
| While good for testing legality of next moves, these positions
| are not so useful for distinguishing their quality, since usually
| one side already has an overwhelming advantage.
|
| [1] https://github.com/tromp/ChessPositionRanking
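|
| A rough sketch of such a test (assuming the random positions
| have already been exported as FEN strings, e.g. by the
| ChessPositionRanking tool, and with ask_model() as a
| hypothetical wrapper around whatever LLM is being probed),
| using the python-chess library for the legality check:
|
|     import chess
|
|     def ask_model(fen: str) -> str:
|         """Hypothetical stand-in for the LLM call; should
|         return a move in SAN for the given position."""
|         raise NotImplementedError
|
|     illegal = 0
|     with open("random_positions.fen") as f:
|         for line in f:
|             board = chess.Board(line.strip())
|             try:
|                 # parse_san raises ValueError on illegal moves
|                 board.parse_san(ask_model(line.strip()))
|             except ValueError:
|                 illegal += 1
|     print("illegal replies:", illegal)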
| BurningFrog wrote:
| Not that I understand the internals of current AI tech, but...
|
| I'd expect that an AI that has seen billions of chess
| positions, and the moves played in them, can figure out the
| rules for legal moves without being told?
| rscho wrote:
| Statistical 'AI' doesn't 'understand' anything, strictly
| speaking. It predicts a move with high probability, which
| could be legal or illegal.
| griomnib wrote:
| Likewise, with an LLM you don't know whether it is truly in
| the "chess" branch of the statistical distribution or whether
| it is picking up something else entirely, like some arcane
| overlap of tokens.
|
| So much of the training data (e.g. Common Crawl, The Pile,
| Reddit) is dogshit, so it generates reheated dogshit.
| Helonomoto wrote:
| You generalize this without mentioning that there are LLMs
| which are not just trained on random 'dogshit'.
|
| Also, what does a normal human do? They look at how to move
| one piece at a time, using a very small dictionary / set of
| basic rules to move it. I do not remember learning to count
| every piece and its options by looking things up in the
| rulebook; I learned to 'see' how each type of chess piece
| can move.
|
| If an LLM uses only these piece moves at a mathematical
| level, it would be doing the same thing I do.
|
| And yes, there is also absolutely the option for an LLM to
| learn some kind of metagame.
| Helonomoto wrote:
| How do you define 'understand'?
|
| There is plenty of AI that learns the rules of games, like
| AlphaZero.
|
| LLMs might not have the architecture to 'learn', but then
| again they might. If an LLM optimizes over all the possible
| moves a chess piece can make (which is not that much to
| learn), it could easily 'move' from one game state to
| another using that kind of dictionary.
| rscho wrote:
| Understanding a rules-based system (chess) means to be
| able to learn non-probabilistic rules (an abstraction
| over the concrete world). Humans are a mix of symbolic
| and probabilistic learning, allowing them to get a huge
| boost in performance by admitting rules. It doesn't mean
| a human will never make an illegal move, but it means a much
| smaller probability of an illegal move based on less
| training data. Asymptotically, the performance of humans and
| of purely probabilistic systems converges. But it also means
| that in appropriate situations, humans are hugely more
| data-efficient.
| david-gpu wrote:
| _> in appropriate situations, humans are hugely more
| data-efficient_
|
| After spending some years raising my children I gave up
| the notion that humans are data efficient. It takes a
| mind numbing amount of training to get them to learn the
| most basic skills.
| rscho wrote:
| You could compare childhood with the training phase of a
| model. Still think humans are not data-efficient?
| david-gpu wrote:
| Yes, that is exactly the point I am making. It takes many
| repetitions (epochs) to teach them anything.
| rscho wrote:
| Compared to the amount of data needed to train an even
| remotely impressive 'AI' model, one that is not even AGI and
| hallucinates on a regular basis? On the contrary, it seems to
| me that humans and their children are hugely efficient.
| david-gpu wrote:
| _> On the contrary, it seems to me that humans and their
| children are hugely efficient._
|
| Does a child remotely know as much as ChatGPT? Is it able
| to reason remotely as well?
| rscho wrote:
| I'd say the kid knows more about the world than ChatGPT,
| yes. For starters, the kid has representations of
| concepts such as 'blue color' because eyes... ChatGPT can
| answer difficult questions for sure, but overall I'd say
| it's much more specialized and limited than a kid.
| However, I also think that's mostly comparing apples and
| oranges, and that one's judgement about that is very
| personal. So, in the end I don't know.
| chongli wrote:
| Neither AlphaZero nor MuZero can learn the rules of chess
| from an empty chess board and a pile of pieces. There is
| no objective function so there's nothing to train upon.
|
| That would be like alien archaeologists of the future
| finding a chess board and some pieces in a capsule
| orbiting Mars after the total destruction of Earth and
| all recorded human thought. The archaeologists could
| invent their own games to play on the chess board but
| they'd have no way of ever knowing they were playing
| chess.
| BurningFrog wrote:
| AlphaZero was given the rules of the game, but it figured
| out how to beat everyone else all by itself!
| rscho wrote:
| All by itself, meaning playing against itself...
|
| Interestingly, Bobby Fischer did it in the same way.
| Maybe AlphaZero also hates chess? :-)
| fragmede wrote:
| The illegal moves are interesting as it goes to
| "understanding". In children learning to play chess, how
| often do they try and make illegal moves? When first
| learning the game I remember that I'd lose track of all the
| things going on at once and try to make illegal moves, but
| eventually the rules became second nature and I stopped
| trying to make illegal moves. With an Elo of 1800, I'd
| expect ChatGPT not to make any illegal moves.
| sixfiveotwo wrote:
| I think the article briefly touches on that topic at some
| point:
|
| > For one, gpt-3.5-turbo-instruct rarely suggests illegal
| moves, even in the late game. This requires "understanding"
| chess. If this doesn't convince you, I encourage you to
| write a program that can take strings like 1. e4 d5 2. exd5
| Qxd5 3. Nc3 and then say if the last move was legal.
|
| However, I can't say if LLMs fall in the "statistical AI"
| category.
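|
| For what it's worth, the checker the article dares you to
| write is only a few lines if you lean on the python-chess
| library; a minimal sketch (not the article's own code):
|
|     import chess
|
|     def last_move_is_legal(movetext: str) -> bool:
|         """Replay SAN moves like '1. e4 d5 2. exd5 Qxd5 3. Nc3'
|         and report whether the final move is legal."""
|         board = chess.Board()
|         sans = [t for t in movetext.split()
|                 if not t.endswith(".")]
|         for san in sans[:-1]:
|             board.push_san(san)   # assumes the prefix is legal
|         try:
|             board.push_san(sans[-1])
|             return True
|         except ValueError:        # illegal or unparseable move
|             return False
|
|     print(last_move_is_legal("1. e4 d5 2. exd5 Qxd5 3. Nc3"))
|     # True
|     print(last_move_is_legal("1. e4 d5 2. exd5 Qxd5 3. Ke3"))
|     # False
|
| Of course, the article's point is that the LLM has to do the
| equivalent without a move generator handed to it.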
| pvitz wrote:
| A system that just outputs the most probable tokens based on
| the text it was fed, trained on games played by players rated
| above 1800, would certainly fail to output the right moves in
| totally unlikely board positions.
| Helonomoto wrote:
| Yes, in theory it could. It depends on how it learns: does it
| learn by memorization or by learning the rules? That depends
| on the architecture and the amount of 'pressure' you put on
| it to be more efficient or not.
| griomnib wrote:
| I think at this point it's very clear LLMs aren't achieving any
| form of "reasoning" as commonly understood. Among other factors,
| it can be argued that true reasoning involves symbolic logic
| and abstractions, and LLMs are next token predictors.
| DiogenesKynikos wrote:
| Effective next-token prediction requires reasoning.
|
| You can also say humans are "just XYZ biological system," but
| that doesn't mean they don't reason. The same goes for LLMs.
| griomnib wrote:
| Take a word problem for example. A child will be told the
| first step is to translate the problem from human language
| to mathematical notation (symbolic representation), then
| solve the math (logic).
|
| A human doesn't use next token prediction to solve word
| problems.
| Majromax wrote:
| But the LLM isn't "using next-token prediction" to solve
| the problem, that's only how it's evaluated.
|
| The "real processing" happens through the various
| transformer layers (and token-wise nonlinear networks),
| where it seems as if progressively richer meanings are
| added to each token. That rich feature set then _decodes_
| to the next predicted token, but that decoding step is
| throwing away a lot of information contained in the
| latent space.
|
| If language models (per Anthropic's work) can have a
| direction in latent space correspond to the concept of
| the Golden Gate Bridge, then I think it's reasonable
| (albeit far from certain) to say that LLMs are performing
| some kind of symbolic-ish reasoning.
| griomnib wrote:
| Anthropic has a vested interest in people thinking Claude
| is reasoning.
|
| However, in coding tasks I've been able to find it
| directly regurgitating Stack Overflow answers (literally,
| a Google search turns up the code).
|
| Given that coding is supposed to be Claude's strength, and
| it's clearly just parroting web data, I'm not seeing any
| sort of "reasoning".
|
| LLM may be _useful_ but they don't _think_. They've
| already plateaued, and given the absurd energy
| requirements I think they will prove to be far less
| impactful than people think.
| DiogenesKynikos wrote:
| The claim that Claude is just regurgitating answers from
| Stackoverflow is not tenable, if you've spent time
| interacting with it.
|
| You can give Claude a complex, novel problem, and it will
| give you a reasonable solution, which it will be able to
| explain to you and discuss with you.
|
| You're getting hung up on the fact that LLMs are trained
| on next-token prediction. I could equally dismiss human
| intelligence: "The human brain is just a biological
| neural network that is adapted to maximize the chance of
| creating successful offspring." Sure, but the way it
| solves that task is clearly intelligent.
| griomnib wrote:
| I've literally spent 100s of hours with it. I'm mystified
| why so many people use the "you're holding it wrong"
| explanation when somebody points out real limitations.
| vidarh wrote:
| When we've spent time with it and gotten novel code, then
| if you claim that doesn't happen, it is natural to say
| "you're holding it wrong". If you're just arguing it
| doesn't happen _often enough_ to be useful to you, that
| likely depends on your expectations and how complex tasks
| you need it to carry out to be useful.
| gonab wrote:
| In many ways, Claude feels like a miracle to me. I no
| longer have to stress over semantics, or over searching for
| patterns I can recognize and work with but have never
| actually coded myself in that language. Now I don't have to
| waste energy looking up things that I find boring.
| int_19h wrote:
| You might consider that other people have also spent
| hundreds of hours with it, and have seen it correctly
| solve tasks that cannot be explained by regurgitating
| something from the training set.
|
| I'm not saying that your observations aren't correct, but
| this is not a binary. It is entirely possible that the
| tasks you observe the models on are exactly the kind
| where they tend to regurgitate. But that doesn't mean
| that it is all they can do.
|
| Ultimately, the question is whether there is a "there"
| there at all. Even if 9 times out of 10, the model
| regurgitates, but that one other time it can actually
| reason, that means that it is _capable_ of reasoning in
| principle.
| vrighter wrote:
| The LLM isn't solving the problem. The LLM is just
| predicting the next word. It's not "using next-token
| prediction to solve a problem". It has no concept of
| "problem". All it can do is predict 1 (one) token that
| follows another provided set. That running this in a loop
| provides you with bullshit (with bullshit defined here as
| things someone or something says neither with good nor
| bad intent, but just with complete disregard for any
| factual accuracy or lack thereof, and so the information
| is unreliable for everyone) does not mean it is thinking.
| DiogenesKynikos wrote:
| All the human brain does is determine how to fire some
| motor neurons. No, it does not reason.
|
| No, the human brain does not "understand" language. It
| just knows how to control the firing of neurons that
| control the vocal cords, in order to maximize an
| endocrine reward function that has evolved to maximize
| biological fitness.
|
| I can speak about human brains the same way you speak
| about LLMs. I'm sure you can spot the problem in my
| conclusions: just because the human brain is "only"
| firing neurons, it does actually develop an understanding
| of the world. The same goes for LLMs and next-word
| prediction.
| mhh__ wrote:
| I don't see why this isn't a good model for how human
| reasoning happens either, certainly as a first-order
| assumption (at least).
| TeMPOraL wrote:
| > _A human doesn't use next token prediction to solve
| word problems._
|
| Of course they do, unless they're particularly
| conscientious noobs that are able to repeatedly execute
| the "translate to mathematical notation, then solve the
| math" algorithm, without going insane. But those people
| are the exception.
|
| Everyone else either gets bored half-way through reading
| the problem, or has already done dozens of similar
| problems before, or both - and jumps straight to "next
| token prediction", aka searching the problem space "by
| feels", and checking candidate solutions to sub-problems
| on the fly.
|
| This kind of methodical approach you mention? We leave
| that to symbolic math software. The "next token
| prediction" approach is something we call
| "experience"/"expertise" and a source of the thing we
| call "insight".
| vidarh wrote:
| Indeed. Work on any project that requires humans to carry
| out largely repetitive steps, and a large part of the
| problem involves how to put processes around people to
| work around humans "shutting off" reasoning and going
| full-on automatic.
|
| E.g. I do contract work on an LLM-related project where
| one of the systemic changes introduced - in addition to
| multiple levels of quality checks - is to force people to
| input a given sentence word for word, followed by a word
| from a set of 5 or so, and a _minority_ of the submissions
| get that sentence correct including the final word, despite
| the system refusing to let you submit unless the initial
| sentence is correct. Seeing the data has been an absolutely
| shocking indictment of _human_ reasoning.
|
| These are submissions from a pool of people who have
| passed reasoning tests...
|
| When I've tested the process myself as well, it takes
| only a handful of steps before the tendency is to "drift
| off" and start replacing a word here and there and fail
| to complete even the initial sentence without a
| correction. I shudder to think how bad the results would
| be if there wasn't that "jolt" to try to get people back
| to paying attention.
|
| Keeping humans consistently carrying out a learned
| process is incredibly hard.
| fragmede wrote:
| Is that based on a rigorous understanding of how humans
| think, derived from watching people (children) learn to
| solve word problems? How do thoughts get formed? Because
| I remember being given word problems with extra
| information, and some children trying to shove that
| information into a math equation despite it not being
| relevant. The "think things through" portion of ChatGPT
| o1-preview is hidden from us, so even though o1-preview
| can solve word problems, we don't know how it internally
| computes to arrive at that answer. But do we _really_
| know how we do it? We can't even explain consciousness
| in the first place.
| brookst wrote:
| > Among other factors it can be argued that true reasoning
| involves symbolic logic and abstractions, and LLM are next
| token predictors.
|
| I think this is circular?
|
| If an LLM is "merely" predicting the next tokens to put
| together a description of symbolic reasoning and
| abstractions... how is that different from really exercisng
| those things?
|
| Can you give me an example of symbolic reasoning that I can't
| handwave away as just the likely next words given the
| starting place?
|
| I'm not saying that LLMs have those capabilities; I'm
| question whether there is any utility in distinguishing the
| "actual" capability from identical outputs.
| griomnib wrote:
| Mathematical reasoning is the most obvious area where it
| breaks down. This paper does an excellent job of proving
| this point with some elegant examples:
| https://arxiv.org/pdf/2410.05229
| brookst wrote:
| Sure, but _people_ fail at mathematical reasoning. That
| doesn't mean people are incapable of reasoning.
|
| I'm not saying LLMs are perfect reasoners, I'm
| questioning the value of asserting that they cannot
| reason with some kind of "it's just text that looks like
| reasoning" argument.
| dartos wrote:
| People can communicate each step, and review each step as
| that communication is happening.
|
| LLMs must be prompted for everything and don't act on
| their own.
|
| The value in the assertion is in preventing laymen from
| seeing a statistical guessing machine be correct and
| assuming that it always will be.
|
| It's dangerous to put so much faith in what in reality is
| a very good guessing machine. You can ask it to retrace
| its steps, but it's just guessing at what its steps
| were, since it didn't actually go through real reasoning,
| just generated text that reads like reasoning steps.
| brookst wrote:
| > since it didn't actually go through real reasoning,
| just generated text that reads like reasoning steps.
|
| Can you elaborate on the difference? Are you bringing
| sentience into it? It kind of sounds like it from "don't
| act on their own". But reasoning and sentience are wildly
| different things.
|
| > It's dangerous to put so much faith in what in reality
| is a very good guessing machine
|
| Yes, exactly. That's why I think it is good we are
| supplementing fallible humans with fallible LLMs; we
| already have the processes in place to assume that not
| every actor is infallible.
| david-gpu wrote:
| So true. People who argue that we should not trust/use
| LLMs because they sometimes get it wrong are holding them
| to a higher standard than people -- we make mistakes too!
|
| Do we blindly trust or believe every single thing we hear
| from another person? Of course not. But hearing what they
| have to say can still be fruitful, and it is not like we
| have an oracle at our disposal who always speaks the
| absolute truth, either. We make do with what we have, and
| LLMs are another tool we can use.
| vundercind wrote:
| > Can you elaborate on the difference?
|
| They'll fail in different ways than something that thinks
| (and doesn't have some kind of major disease of the brain
| going on) and often smack in the middle of _appearing_ to
| think.
| ben_w wrote:
| > People can communicate each step, and review each step
| as that communication is happening.
|
| Can, but don't by default. Just as LLMs can be asked for
| chain of thought, but the default for most users is just
| chat.
|
| This behaviour of humans is why we software developers
| have daily standup meetings, version control, and code
| review.
|
| > LLMs must be prompted for everything and don't act on
| their own
|
| And this is why we humans have task boards like JIRA, and
| quarterly goals set by management.
| int_19h wrote:
| A human brain in a vat doesn't act on its own, either.
| vidarh wrote:
| LLMs "don't act on their own" because we only reanimate
| them when we want something from them. Nothing stops you
| from wiring up an LLM to keep generating, and feeding it
| sensory inputs to keep it processing. In other words,
| that's a limitation of the harness we put them in, not of
| LLMs.
|
| As for people communicating each step, we have plenty of
| experiments showing that it's pretty _hard_ to get people
| to reliably report what they actually do, as opposed to a
| rationalization of what they've actually done (e.g.
| split-brain experiments have shown both your brain halves
| will happily lie about having decided to do things they
| haven't done if you give them reason to think they've
| done something).
|
| You can categorically _not_ trust people's reasoning about
| "why" they've made a decision to reflect what actually
| happened in their brain to make them do something.
| NBJack wrote:
| The idea is that the average person would, sure. A
| mathematically oriented person would fare far better.
|
| Throw all the math problems you want at an LLM for
| training; it will still fail if you step outside of the
| familiar.
| ben_w wrote:
| > it will still fail if you step outside of the familiar.
|
| To which I say:
|
| so:do:humacns
| trashtester wrote:
| but humacn hubris prewent them from reaclizing thhact
| ben_w wrote:
| indeed:it:is:hubris
|
| i:hacwe:often:seen:in:diskussions:suk:acs:this:klacims:th
| act:humacn:minds:kacn:do:impossible:things:suk:acs:genera
| cllyr:solwe:the:haclting:problem
|
| edit: Snap, you said the same in your other comment :)
| trashtester wrote:
| Switching back to latin letters...
|
| It seems to me that the idea of the Universal Turing
| Machine is quite misleading for a lot of people, such as
| David Deutsch.
|
| My impression is that the amount of compute to solve most
| problems that can really only be solved by Turing
| Machines is always going to remain inaccessible (unless
| they're trivially small).
|
| But at the same time, the universe seems to obey a
| principle of locality (as long as we only consider the
| Quantum Wave Function, and don't postulate that it
| collapses).
|
| Also, the quantum fields are subject to some simple
| (relative to LLMs) geometric symmetries, such as
| invariance under the U(1)xSU(2)xSU(3) group.
|
| As it turns out, similar group symmetries can be found in
| all sorts of places in the real world.
|
| Also it seems to me that at some level, both ANN's and
| biological brains set up a similar system to this
| physical reality, which may explain why brains develop
| this way and why both kinds are so good at simulating at
| least some aspects of the physical world, such as
| translation, rotation, some types of deformation,
| gravity, sound, light etc.
|
| And when biological brains that initially developed to
| predict the physical world are then used to create
| language, that language is bound to use the same type of
| machinery. And this may be why LLMs do language so well
| with a similar architecture.
| vidarh wrote:
| There are _no_ problems that can be solved only by Turing
| Machines as any Turing complete system can simulate any
| other Turing complete system.
|
| The point of UTMs is not to ever _use them_, but that
| they're a shortcut to demonstrating Turing completeness
| because of their simplicity. Once you've proven Turing
| completeness, you've proven that your system can compute
| all Turing computable functions _and simulate any other
| Turing complete system_, and we _don't know of any
| computable functions outside this set_.
| Workaccount2 wrote:
| Maybe I am not understanding the paper correctly, but it
| seems they tested "state of the art models" which is
| almost entirely composed of open source <27B parameter
| models. Mostly 8B and 3B models. This is kind of like
| giving algebra problems to 7-year-olds to "test human
| algebra ability."
|
| If you are holding up a 3B parameter model as an example
| of "LLM's can't reason" I'm not sure if the authors are
| confused or out of touch.
|
| I mean, they do test 4o and o1-preview, but their
| performance is notably absent from the paper's
| conclusion.
| dartos wrote:
| It's difficult to reproducibly test openai models, since
| they can change from under you and you don't have control
| over every hyperparameter.
|
| It would've been nice to see one of the larger llama
| models though.
| og_kalu wrote:
| The results are there, just hidden away in the appendix.
| Those models don't actually suffer drops on 4 of the 5
| modified benchmarks. The one benchmark that does see drops
| not explained by the margin of error is the one that adds
| "seemingly relevant but ultimately irrelevant information
| to problems".
|
| Those results are absent from the conclusion because the
| conclusion falls apart otherwise.
| dartos wrote:
| There isn't much utility, but tbf the outputs aren't
| identical.
|
| One danger is the human assumption that, since something
| appears to have that capability in some settings, it will
| have that capability in all settings.
|
| That's a recipe for exploding bias, as we've seen with
| classic statistical crime detection systems.
| NBJack wrote:
| Inferring patterns in unfamiliar problems.
|
| Take a common word problem in a 5th grade math text book.
| Now, change as many words as possible; instead of two
| trains, make it two different animals; change the location
| to a rarely discussed town; etc. Even better, invent
| words/names to identify things.
|
| Someone who has done a word problem like that will very
| likely recognize the logic, even if the setting is
| completely different.
|
| Word tokenization alone should fail miserably.
| djmips wrote:
| I have noted over my life that a lot of problems end up
| being a variation on solved problems from another, more
| familiar domain, but frustratingly take a long time to
| solve before you realize this was just like that thing you
| had already solved. Nevertheless, I do feel like humans
| benefit from identifying meta-patterns, but as the chess
| example shows, even we might be weak in unfamiliar areas.
| Propelloni wrote:
| Learn how to solve one problem and apply the approach,
| logic and patterns to different problems. In German
| that's called "Transferleistung" (roughly "transfer
| success") and a big thing at advanced schools. Or, at
| least my teacher friends never stop talking about it.
|
| We get better at it over time, as probably most of us can
| attest.
| roywiggins wrote:
| A lot of LLMs do weird things on the question "A farmer
| needs to get a bag of grain across a river. He has a boat
| that can transport himself and the grain. How does he do
| this?"
|
| (they often pattern-match on the farmer/grain/sheep/fox
| puzzle and start inventing pointless trips ("the farmer
| returns alone. Then, he crosses again.") in a way that a
| human wouldn't)
| vidarh wrote:
| It is. As it stands, throw a loop around an LLM and use the
| context as the tape, and an LLM can obviously be made Turing
| complete (you can get it to execute all the steps of a
| minimal Turing machine, so drop the temperature so it's
| deterministic, and you have a Turing complete system). To
| argue that they _can't_ be made to reason is effectively to
| argue that there is some unknown aspect of the brain that
| allows us to compute functions not in the Turing computable
| set, which would be an astounding revelation if it could be
| proven. Until someone comes up with evidence for that, it is
| more reasonable to assume that it is a question of whether we
| have yet found a training mechanism that can lead to
| reasoning, not whether or not LLMs can learn to.
| vundercind wrote:
| It doesn't follow that because a system is Turing
| complete the _approach_ being used will eventually
| achieve reasoning.
| vidarh wrote:
| No, but that was also not the claim I made.
|
| The point is that, as the person I replied to pointed out,
| dismissing LLMs as "next token predictors" is meaningless:
| they can be both next token predictors and Turing complete,
| and unless reasoning requires functions outside the Turing
| computable (we know of no way of constructing such
| functions, nor any way for them to exist), calling them
| "next token predictors" says nothing about their
| capabilities.
| hathawsh wrote:
| I think the question we're grappling with is whether token
| prediction may be more tightly related to symbolic logic than
| we all expected. Today's LLMs are so uncannily good at faking
| logic that it's making me ponder logic itself.
| griomnib wrote:
| I felt the same way about a year ago, I've since changed my
| mind based on personal experience and new research.
| hathawsh wrote:
| Please elaborate.
| dartos wrote:
| I work in the LLM search space and echo OC's sentiment.
|
| The more I work with LLMs the more the magic falls away
| and I see that they are just very good at guessing text.
|
| It's very apparent when I want to get them to do a very
| specific thing. They get inconsistent about it.
| griomnib wrote:
| Pretty much the same, I work on some fairly specific
| document retrieval and labeling problems. After some
| initial excitement I've landed on using LLM to help train
| smaller, more focused, models for specific tasks.
|
| Translation is a task I've had good results with,
| particularly mistral models. Which makes sense as it's
| basically just "repeat this series of tokens with
| modifications".
|
| The closed models are practically useless from an
| empirical standpoint as you have no idea if the model you
| use Monday is the same as Tuesday. "Open" models at least
| negate this issue.
|
| Likewise, I've found LLM code to be of poor quality. I
| think that has to do with my being a very experienced and
| skilled programmer. What the LLM produces is at best the
| top answer in stack overflow-level skill. The top answers
| on stack overflow are typically not optimal solutions;
| they are solutions upvoted by novices.
|
| I find LLM code is not only bad, but when I point this
| out the LLM then "apologizes" and gives better code. My
| worry is that inexperienced people can't even spot that
| and so won't get the better answer.
|
| In fact, try this - ask an LLM to generate some code, then
| reply with "isn't there a simpler, more maintainable, and
| straightforward way to do this?"
| blharr wrote:
| There have even been times when an LLM will spit out
| _the exact same code_, and you have to give it the answer
| or a hint at how to do it better.
| david-gpu wrote:
| Yeah. I had the same experience doing code reviews at
| work. Sometimes people just get stuck on a problem and
| can't think of alternative approaches until you give them
| a good hint.
| david-gpu wrote:
| _> I've found LLM code to be of poor quality_
|
| Yes. That was my experience with most human-produced code
| I ran into professionally, too.
|
| _> In fact try this - ask an LLM to generate some code
| then reply with "isn't there a simpler, more
| maintainable, and straightforward way to do this?"_
|
| Yes, that sometimes works with humans as well. Although
| you usually need to provide more specific feedback to
| nudge them onto the right track. It gets tiring after a
| while, doesn't it?
| dartos wrote:
| What is the point of your argument?
|
| I keep seeing people say "yeah well I've seen humans that
| can't do that either."
|
| What's the point you're trying to make?
| david-gpu wrote:
| The point is that the person I responded to criticized
| LLMs for making the exact sort of mistakes that
| professional programmers make all the time:
|
| _> I've found LLM code to be of poor quality. I think
| that has to do with my being a very experienced and skilled
| programmer. What the LLM produces is at best the top
| answer in stack overflow-level skill. The top answers on
| stack overflow are typically not optimal solutions_
|
| Most professional developers are unable to produce code
| up to the standard of _" the top answer in stack
| overflow"_ that the commenter was complaining about, with
| the additional twist that most developers' breadth of
| knowledge is going to be limited to a very narrow range
| of APIs/platforms/etc. whereas these LLMs are able to be
| comparable to decent programmers in just about any
| API/language/platform, _all at once_.
|
| I've written code for thirty years and I wish I had the
| breadth and depth of knowledge of the free version of
| ChatGPT, even if I can outsmart it in narrow domains. It
| is already very decent and I haven't even tried more
| advanced models like o1-preview.
|
| Is it perfect? No. But it is arguably better than most
| programmers in at least some aspects. Not every
| programmer out there is Fabrice Bellard.
| dartos wrote:
| But LLMs aren't people. And people do more than just
| generate code.
|
| The comparison is weird and dehumanizing.
|
| I, personally, have never worked with someone who
| consistently puts out code that is as bad as LLM
| generated code either.
|
| > Most professional developers are unable to produce code
| up to the standard of "the top answer in stack overflow"
|
| How could you possibly know that?
|
| All these types of arguments come from a belief that your
| fellow human is effectively useless.
|
| It's sad and weird.
| david-gpu wrote:
| _> > > Most professional developers are unable to produce
| code up to the standard of "the top answer in stack
| overflow"_
|
| _> How could you possibly know that?_
|
| I worked at four multinationals and saw a bunch of their
| code. Most of it wasn't _" the top answer in stack
| overflow"_. Was some of the code written by some of the
| people better than that? Sure. And a lot of it wasn't, in
| my opinion.
|
| _> All these types of arguments come from a belief that
| your fellow human is effectively useless._
|
| Not at all. I think the top answers in stack overflow
| were written by humans, after all.
|
| _> It's sad and weird._
|
| You are entitled to your own opinion, no doubt about it.
| Sharlin wrote:
| > In fact try this - ask an LLM to generate some code
| then reply with "isn't there a simpler, more
| maintainable, and straightforward way to do this?"
|
| These are called "code reviews" and we do that amongst
| human coders too, although they tend to be less Socratic
| in nature.
|
| I think it has been clear from day one that LLMs don't
| display superhuman capabilities, and a human expert will
| always outdo one in tasks related to their particular
| field. But the _breadth_ of their knowledge is
| unparalleled. They're the ultimate jacks-of-all-trades,
| and the astonishing thing is that they're even "average
| Joe" good at a vast number of tasks, never mind "fresh
| college graduate" good.
|
| The _real_ question has been: what happens when you scale
| them up? As of now it appears that they scale decidedly
| sublinearly, but it was not clear at all two or three
| years ago, and it was definitely worth a try.
| vidarh wrote:
| I do contract work in the LLM space which involves me
| seeing a lot of human prompts, and it's made the magic of
| _human_ reasoning fall away: humans are shockingly bad at
| reasoning in the large.
|
| One of the things I find extremely frustrating is that
| almost no research on LLM reasoning ability _benchmarks
| them against average humans_.
|
| Large proportions of humans struggle to comprehend even a
| moderately complex sentence with any level of precision.
| xg15 wrote:
| I don't want to say that LLMs can reason, but this kind of
| argument always feels too shallow to me. It's kind of like
| saying that bats cannot possibly fly because they have no
| feathers, or that birds cannot have higher cognitive functions
| because they have no neocortex. (The latter was an actual
| longstanding belief in science which was disproven only a
| decade or so ago.)
|
| The "next token prediction" is just the API; it doesn't tell
| you anything about the complexity of the thing that actually
| does the prediction. (I think there is some temptation to
| view LLMs as glorified Markov chains - they aren't. They are
| just "implementing the same API" as Markov chains.)
|
| There is still a limit how much an LLM could reason during
| prediction of a single token, as there is no recurrence
| between layers, so information can only be passed "forward".
| But this limit doesn't exist if you consider the generation
| of the entire text: Suddenly, you do have a recurrence, which
| is the prediction loop itself: The LLM can "store"
| information in a generated token and receive that information
| back as input in the next loop iteration.
|
| I think this structure makes it quite hard to really say how
| much reasoning is possible.
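|
| As a toy illustration of that recurrence (a sketch, with a
| deterministic stub standing in for the LLM call): the only
| "memory" between steps is the text itself. The stub reads
| the last number it emitted from the context and emits the
| next one, i.e. it stores state in its own output and
| receives it back as input on the following iteration.
|
|     def stub_model(context: str) -> str:
|         # stand-in for an LLM call; returns the "next token"
|         last = int(context.split()[-1])
|         return str(last + 1)
|
|     context = "count: 0"
|     for _ in range(5):
|         context += " " + stub_model(context)  # fed back in
|     print(context)  # count: 0 1 2 3 4 5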
| griomnib wrote:
| I agree with most of what you said, but "LLM can reason" is
| an _insanely huge claim_ to make and most of the "evidence"
| so far is a mixture of corporate propaganda, "vibes", and
| the like.
|
| I've yet to see anything close to the level of evidence
| needed to support the claim.
| Propelloni wrote:
| It's largely dependent on what we think "reason" means,
| is it not? That's not a pro argument from me, in my world
| LLMs are stochastic parrots.
| vidarh wrote:
| To say _any specific_ LLM can reason is a somewhat
| significant claim.
|
| To say _LLMs as a class_ are _architecturally able to be
| trained to reason_ is - in the complete absence of
| evidence to suggest humans can compute functions outside
| the Turing computable - effectively only an argument
| that they can implement a minimal Turing machine, given
| that the context is used as IO. Given the size of the
| rule tables needed to implement the smallest known Turing
| machines, it'd take a _really_ tiny model for them to be
| unable to.
|
| Now, you can then argue that it doesn't "count" if it
| needs to be fed a huge program step by step via IO, but
| if it _can_ do something that way, I'd need some really
| convincing evidence for why the static elements of those
| steps could not progressively be embedded into a model.
| wizzwizz4 wrote:
| No such evidence exists: we can construct such a model
| manually. I'd need some quite convincing evidence that
| any given training process is approximately equivalent to
| that, though.
| vidarh wrote:
| That's fine. I've made no claim about any given training
| process. I've addressed the annoying repetitive dismissal
| via the "but they're next token predictors" argument. The
| point is that being next token predictors does not
| constrain their theoretical capabilities, so it's a
| meaningless argument.
| hackinthebochs wrote:
| Then say "no one has demonstrated that LLMs can reason"
| instead of "LLMs can't reason, they're just token
| predictors". At least that would be intellectually
| honest.
| Xelynega wrote:
| By that logic isn't it "intellectually dishonest" to say
| "dowsing rods don't work" if the only evidence we have is
| examples of them not working?
| hackinthebochs wrote:
| Not really. We know enough about how the world works to
| know that dowsing rods have no plausible mechanism of
| action.
| We do not know enough about intelligence/reasoning or how
| brains work to know that LLMs definitely aren't doing
| anything resembling that.
| int_19h wrote:
| "LLM can reason" is trivially provable - all you need to
| do is give it a novel task (e.g. a logical puzzle) that
| requires reasoning, and observe it solving that puzzle.
| staticman2 wrote:
| How do you intend to show your task is novel?
| vidarh wrote:
| > But this limit doesn't exist if you consider the
| generation of the entire text: Suddenly, you do have a
| recurrence, which is the prediction loop itself: The LLM
| can "store" information in a generated token and receive
| that information back as input in the next loop iteration.
|
| Now consider that you can trivially show that you can get
| an LLM to "execute" one step of a Turing machine, where the
| context is used as an IO channel, and you will have shown
| it to be Turing complete.
|
| > I think this structure makes it quite hard to really say
| how much reasoning is possible.
|
| Given the above, I think any argument that they can't be
| made to reason is effectively an argument that humans can
| compute functions outside the Turing computable set, which
| we haven't the slightest shred of evidence to suggest.
| Xelynega wrote:
| It's kind of ridiculous to say that functions computable
| by Turing machines are the only ones that can exist (and
| that trained LLMs are Turing machines).
|
| What evidence do you have for either of these? I don't
| recall any proof that "functions computable by Turing
| machines" equals the set of functions that can exist, and
| I don't recall pretrained LLMs being proven to be Turing
| machines.
| vidarh wrote:
| We don't have hard evidence that no other functions exist
| that are computable, but we have no examples of any such
| functions, and no theory for how to even begin to
| formulate any.
|
| As it stands, Church, Turing, and Kleene have proven that
| the set of generally recursive functions, the lambda
| calculus, and the Turing computable set are equivalent,
| and no attempt to categorize computable functions outside
| those sets has succeeded since.
|
| If you want your name in the history books, all you need
| to do is find a _single_ function that humans can compute
| that is outside the Turing computable set.
|
| As for LLMs, you can trivially test that they can act
| like a Turing machine if you give them a loop and use the
| context to provide access to IO: Turn the temperature
| down, and formulate a prompt to ask one to follow the
| rules of the simplest known Turing machine. A reminder
| that the simplest known Turing machine is a 2-state,
| 3-symbol Turing machine. It's _quite hard_ to find a
| system that can carry out any kind of complex function
| yet can't act like a Turing machine if you allow it to
| loop and give it access to IO.
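|
| A toy version of that harness, as a sketch: the "tape" lives
| in state that the loop feeds back on every step. Here the
| step function is a plain table lookup standing in for a
| temperature-0 LLM that has been prompted with the same rules
| (the table below is the classic 2-state busy beaver rather
| than the 2-state, 3-symbol machine mentioned above):
|
|     from collections import defaultdict
|
|     RULES = {  # (state, symbol) -> (write, move, next_state)
|         ("A", 0): (1, +1, "B"),
|         ("A", 1): (1, -1, "B"),
|         ("B", 0): (1, -1, "A"),
|         ("B", 1): (1, +1, "HALT"),
|     }
|
|     def step(state, symbol):
|         # In the real experiment this would be an LLM call:
|         # prompt it with the rules, the current state and the
|         # symbol under the head, then parse its reply.
|         return RULES[(state, symbol)]
|
|     tape, head, state = defaultdict(int), 0, "A"
|     while state != "HALT":
|         write, move, state = step(state, tape[head])
|         tape[head] = write
|         head += move
|     print(sum(tape.values()))  # 4 ones on the tape, as expected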
| nuancebydefault wrote:
| After reading the article I am more convinced that it does
| reason. The base model's reasoning capabilities are partly
| hidden by the chatty derived model's logic.
| Uehreka wrote:
| Does anyone have a hard proof that language doesn't somehow
| encode reasoning in a deeper way than we commonly think?
|
| I constantly hear people saying "they're not intelligent,
| they're just predicting the next token in a sequence", and
| I'll grant that I don't think of what's going on in my head
| as "predicting the next token in a sequence", but I've seen
| enough surprising studies about the nature of free will and
| such that I no longer put a lot of stock in what seems
| "obvious" to me about how my brain works.
| spiffytech wrote:
| > I'll grant that I don't think of what's going on in my
| head as "predicting the next token in a sequence"
|
| I can't speak to whether LLMs can think, but current
| evidence indicates humans can perform complex reasoning
| without the use of language:
|
| > Brain studies show that language is not essential for the
| cognitive processes that underlie thought.
|
| > For the question of how language relates to systems of
| thought, the most informative cases are cases of really
| severe impairments, so-called global aphasia, where
| individuals basically lose completely their ability to
| understand and produce language as a result of massive
| damage to the left hemisphere of the brain. ...
|
| > You can ask them to solve some math problems or to
| perform a social reasoning test, and all of the
| instructions, of course, have to be nonverbal because they
| can't understand linguistic information anymore. ...
|
| > There are now dozens of studies that we've done looking
| at all sorts of nonlinguistic inputs and tasks, including
| many thinking tasks. We find time and again that the
| language regions are basically silent when people engage in
| these thinking activities.
|
| https://www.scientificamerican.com/article/you-dont-need-wor...
| SAI_Peregrinus wrote:
| I'd say that's a separate problem. It's not "is the use
| of language necessary for reasoning?" which seems to be
| obviously answered "no", but rather "is the use of
| language sufficient for reasoning?".
| cortic wrote:
| > ..individuals basically lose completely their ability
| to understand and produce language as a result of massive
| damage to the left hemisphere of the brain. ...
|
| The right hemisphere almost certainly uses internal
| 'language' either consciously or unconsciously to define
| objects, actions, intent.. the fact that they passed
| these tests is evidence of that. The brain damage is
| simply stopping them expressing that 'language'. But the
| existence of language was expressed in the completion of
| the task..
| Scarblac wrote:
| This is the argument that submarines don't really "swim" as
| commonly understood, isn't it?
| Jensson wrote:
| And planes don't fly like birds; they have very different
| properties, and many things birds can do can't be done by
| a plane. What they do is totally different.
| saithound wrote:
| I think so, but the badness of that argument is context-
| dependent. How about the hypothetical context where 70k+
| startups are promising investors that they'll win the 50
| meter freestyle in 2028 by entering a fine-tuned USS Los
| Angeles?
| Sharlin wrote:
| What proof do you have that human reasoning involves
| "symbolic logic and abstractions"? In daily life, that is,
| not in a math exam. We know that people are actually quite
| bad at reasoning [1][2]. And it definitely doesn't seem right
| to define "reasoning" as only the sort that involves formal
| logic.
|
| [1] https://en.wikipedia.org/wiki/List_of_fallacies
|
| [2] https://en.wikipedia.org/wiki/List_of_cognitive_biases
| trashtester wrote:
| Some very intelligent people, including Godel and Penrose,
| seem to think that humans have some kind of ability to
| arrive directly at correct propositions in ways that bypass
| the incompleteness theorem. Penrose seems to think this may
| be due to quantum mechanics; Godel may have thought it came
| from something divine.
|
| While I think they're both wrong, a lot of people seem to
| think they can do abstract reasoning for symbols or symbol-
| like structures without having to use formal logic for
| every step.
|
| Personally, I think such beliefs about concepts like
| consciousness, free will, qualia and emotions emerge from
| how the human brain includes a simplified version of itself
| when setting up a world model. In fact, I think many such
| elements are pretty much hard coded (by our genes) into the
| machinery that human brains use to generate such world
| models.
|
| Indeed, if this is true, concepts like consciousness, free
| will, various qualia and emotions can in fact be considered
| "symbols" within this world model. While the full reality
| of what happens in the brain when we exercise what we
| represent by "free will" may be very complex, the world
| model may assign a boolean to each action we (and others)
| perform, where the action is either grouped into "voluntary
| action" or "involuntary action".
|
| This may not always be accurate, but it saves a lot of
| memory and compute costs for the brain when it tries to
| optimize for the future. This optimization can (and usually
| is) called "reasoning", even if the symbols have only an
| approximated correspondence with physical reality.
|
| For instance, if in our world model somebody does something
| against us and we deem that it was done exercising "free
| will", we will be much more likely to punish them than if
| we categorize the action as "forced".
|
| And on top of these basic concepts within our world model,
| we tend to add a lot more, also in symbol form, to enable
| us to use symbolic reasoning to support our interactions
| with the world.
| TeMPOraL wrote:
| > _While I think they're both wrong, a lot of people
| seem to think they can do abstract reasoning for symbols
| or symbol-like structures without having to use formal
| logic for every step._
|
| Huh.
|
| I don't know about the incompleteness theorem, but I'd say
| it's pretty obvious (both in introspection and in
| observation of others) that people _don't_ naturally use
| formal logic for _anything_, they only painstakingly
| _emulate_ it when forced to.
|
| If anything, "next token prediction" seems much closer to
| how human thinking works than anything even remotely
| formal or symbolic that was proposed before.
|
| As for hardcoding things in world models, one thing that
| LLMs do conclusively prove is that you can create a
| coherent system capable of encoding and working with
| meaning of concepts without providing anything that looks
| like explicit "meaning". Meaning is not inherent to a
| term, or a concept expressed by that term - it exists in
| the relationships between an the concept, and all other
| concepts.
| ben_w wrote:
| > I don't know bout incompleteness theorem, but I'd say
| it's pretty obvious (both in introspection and in
| observation of others) that people don't naturally use
| formal logic for anything, they only painstakingly
| emulate it when forced to.
|
| Indeed, this is one reason why I assert that Wittgenstein
| was wrong about the nature of human thought when writing:
|
| """If there were a verb meaning "to believe falsely," it
| would not have any significant first person, present
| indicative."""
|
| Sure, it's logically incoherent for us to have such a
| word, but there's what seems like several different ways
| for us to hold contradictory and incoherent beliefs
| within our minds.
| trashtester wrote:
| > _... but I'd say it's pretty obvious (both in
| introspection and in observation of others) that people
| don't naturally use formal logic for anything ..._
|
| Yes. But some people place too much confidence in how
| "rational" their intuition is, including some of the most
| intelligent minds the world has seen.
|
| Specifically, many operate as if their intuition (which
| they treat as completely rational) has some kind of
| supernatural/magic/divine origin, including many who
| (imo) SHOULD know better.
|
| Meanwhile, I think (like you do) that this intuition has
| more in common with LLMs and other NN architectures than
| with pure logic, or even the scientific method.
| raincole wrote:
| > Some very intelligent people, including Godel and
| Penrose, seem to think that humans have some kind of
| ability to arrive directly at correct propositions in
| ways that bypass the incompleteness theorem. Penrose
| seems to think this may be due to quantum mechanics;
| Godel may have thought it came from something divine.
|
| Did Godel really say this? It sounds like quite a stretch
| of the incompleteness theorem.
|
| It's like saying that because the halting problem is
| undecidable but humans can debug programs, human brains
| must have some supernatural power.
| olalonde wrote:
| This argument reminds me of the classic "intelligent design"
| critique of evolution: "Evolution can't possibly create an
| eye; it only works by selecting random mutations."
| Personally, I don't see why a "next token predictor" couldn't
| develop the capability to reason and form abstractions.
| NitpickLawyer wrote:
| Interesting tidbit I once learned from a chess livestream. Even
| human super-GMs have a really hard time "scoring" or "solving"
| extremely weird positions. That is, positions that shouldn't
| arise from regular opening - middlegame - endgame play.
|
| It's absolutely amazing to see a super-GM (in that case it was
| Hikaru) look at a position and basically "play-by-play" it from
| the beginning, to show people how the players got into that
| position. It wasn't his game, btw. But later in that same
| video, when asked, he
| explained what I wrote in the first paragraph. It works with
| proper games, but it rarely works with weird random chess
| puzzles, as he put it. Or, in other words, chess puzzles that
| come from real games are much better than "randomly generated",
| and make more sense even to the best of humans.
| saghm wrote:
| Super interesting (although it also makes some sense that
| experts would focus on "likely" subsets given how the number
| of permutations of chess games is too high for it to be
| feasible to learn them all)! That said, I still imagine that
| even most intermediate chess players would perfectly make
| only _legal_ moves in weird positions, even if they're low
| quality.
| MarcelOlsz wrote:
| Would love a link to that video!
| lukan wrote:
| "Even human super-GMs have a really hard time "scoring" or
| "solving" extremely weird positions. "
|
| I can sort of confirm that. I never learned all the formal
| theoretical standard chess strategies except for the basic
| ones. So when playing against really good players, way above
| my level, I could sometimes win (or almost win) simply by
| making unconventional (dumb by normal strategy) moves in the
| beginning - resulting in a non-standard game where I could
| apply pressure in a way the opponent was not prepared for
| (also, they underestimated me after the initial dumb moves).
| For me, the unconventional game was just like a standard
| game, since I had no routine either way - but for the
| experienced player, it was far more challenging. Then of
| course, in the standard situations to which almost every
| chess game eventually evolves, they destroyed me, simply
| through experience and routine.
| hhhAndrew wrote:
| The book Chess for Tigers by Simon Webb explicitly advises
| this. Against "heffalumps" who will squash you, make the
| situation very complicated and strange. Against "rabbits",
| keep the game simple.
| Reimersholme wrote:
| In The Art of Learning, Joshua Waitzkin talks about how
| this was a strategy for him in tournaments as a child as
| well. While most other players were focusing on opening
| theory, he focused on end game and understanding how to use
| the different pieces. Then, by going with unorthodox
| openings, he could easily bring most players outside of
| their comfort zone where they started making mistakes.
| aw1621107 wrote:
| > So when playing against really good players, way above my
| level, I could win sometimes (or allmost) simply by making
| unconventional (dumb by normal strategy) moves in the
| beginning - resulting in a non standard game where I could
| apply pressure in a way the opponent was not prepared for
| (also they underestimated me after the initial dumb moves).
|
| IIRC Magnus Carlsen is said to do something like this as
| well - he'll play opening lines that are known to be
| theoretically suboptimal to take his opponent out of prep,
| after which he can rely on his own prep/skills to give him
| better winning chances.
| dmoy wrote:
| Huh it's funny, in fencing that also works to a certain
| degree.
|
| You can score points against e.g. national team members
| who've been 5-0'ing the rest of the pool by doing weird
| cheap tricks. You won't win though, because after one or
| two points they will adjust and then wreck you.
|
| And on the flip side, if you're decently rated (B ~ A ish)
| and are used to just standard fencing, if you run into
| someone who's U ~ E and does something weird like literally
| not move their feet, it can take you a couple touches to
| readjust to someone who doesn't behave normally.
|
| Unlike chess though, in fencing the unconventional stuff
| only works for a couple points. You can't stretch that into
| a victory, because after each point everything resets.
|
| Maybe that's why pentathlon (single touch victory) fencing
| is so weird.
| Trixter wrote:
| Watching my son compete at a fighting game tournament at
| a professional level, I can confirm this also exists in
| that realm. And probably other realms; I think it's more
| of a general concept of unsettling the better opponent so
| that you can gain a short-term advantage at the
| beginning.
| Someone wrote:
| That expert players are better at recreating real games than
| 'fake' positions is one of the things Adriaan de Groot
| (https://en.wikipedia.org/wiki/Adriaan_de_Groot) noticed in
| his studies with expert chess players. ("Thought and Choice
| in Chess" is worth reading if you're interested in how chess
| players think. He anonymized his subjects, but Euwe
| apparently was one of them.)
|
| Another thing he noticed is that, when asked to set up a game
| they were shown earlier, the errors expert players made were
| often insignificant. For example, they would set up the pawn
| structure on the king side incorrectly if the game's action
| was on the other side of the board, move a bishop by a square
| in a way that didn't make a difference for the game, or even
| add a piece that wasn't active on the board.
|
| Beginners would make different errors, some of them hugely
| affecting the position on the board.
| samatman wrote:
| As someone who finds chess problems interesting (I'm bad at
| them), they're really a third sort of thing. In that good
| chess problems are rarely taken from live play, they're a
| specific sort of thing which follows its own logic.
|
| Good ones are never randomly generated, however. Also, the
| skill doesn't fully transfer in either direction between live
| play and solving chess problems. Definitely not
| reconstructing the prior state of the board, since there's
| nothing there to reconstruct.
|
| So yes, everything Hikaru was saying there makes sense to me,
| but I don't think your last sentence follows from it. Good
| chess problems come from good chess problem authors
| (interestingly this included Vladimir Nabokov), they aren't
| random, but they rarely come from games, and tickle a
| different part of the brain from live play.
| hyperpape wrote:
| This is technically true, but the kind of comment that
| muddies the waters. It's true that GM performance is better
| in realistic games.
|
| It is false that GMs would have any trouble determining legal
| moves in randomly generated positions. Indeed, even a 1200
| level player on chess.com will find that pretty trivial.
| fragmede wrote:
| How well does it play modified versions of chess? eg, a
| modified opening board like the back row is all knights, or
| modified movement eg rooks can move like a queen. A human
| should be able to reason their way through playing a modified
| game, but I'd expect an LLM, if it's just parroting its
| training data, to suggest illegal moves, or stick to previously
| legal moves.
| snowwrestler wrote:
| It's kind of crazy to assert that the systems understand chess,
| and then disclose further down the article that sometimes he
| failed to get a legal move after 10 tries and had to sub in a
| random move.
|
| A person who understands chess well (Elo 1800, let's say) will
| essentially never fail to provide a legal move on the first
| try.
| og_kalu wrote:
| He is testing several models, some of which cannot reliably
| output legal moves. That's different from saying all models
| including the one he thinks understands can't generate a
| legal move in 10 tries.
|
| 3.5-turbo-instruct's illegal move rate is about 5 or less in
| 8205
| IanCal wrote:
| I also wonder what kind of invalid moves they are. There's
| "you can't move your knight to j9 that's off the board",
| "there's already a piece there" and "actually that would
| leave you in check".
|
| I think it's also significantly harder to play chess if you
| were to hear a sequence of moves over the phone and had to
| reply with a followup move, with no space or time to think
| or talk through moves.
| navane wrote:
| Pretty sure elo 1200 will only give legal moves. It's really
| not hard to make legal moves in chess.
| thaumasiotes wrote:
| Casual players make illegal moves all the time. The problem
| isn't knowing how the pieces move. It's that it's illegal
| to leave your own king in check. It's not so common to
| accidentally move your king into check, though I'm sure it
| happens, but it's very common to accidentally move a piece
| that was blocking an attack on your king.
|
| I would tend to agree that there's a big difference between
| attempting to make a move that's illegal because of the
| state of a different region of the board, and attempting to
| make one that's illegal because of the identity of the
| piece being moved, but if your only category of interest is
| "illegal moves", you can't see that difference.
|
| Software that knows the rules of the game shouldn't be
| making either mistake.
| philipwhiuk wrote:
| Casual players don't make illegal moves so often that you
| have to assign them a random move after 10 goes.
| Certhas wrote:
| What do you mean by "understand chess"?
|
| I think you don't appreciate how good the level of chess
| displayed here is. It would take an average adult years of
| dedicated practice to get to 1800.
|
| The article doesn't say how often the LLM fails to generate
| legal moves in ten tries, but it can't be often or the level
| of play would be much much much worse.
|
| As seems often the case, the LLM seems to have a brilliant
| intuition, but no precise rigid "world model".
|
| Of course words like intuition are anthropomorphic. At best a
| model for what LLMs are doing. But saying "they don't
| understand" when they can do _this well_ is absurd.
| vundercind wrote:
| > I think you don't appreciate how good the level of chess
| displayed here is. It would take an average adult years of
| dedicated practice to get to 1800.
|
| Since we already have programs that can do this, that
| definitely aren't really thinking and don't "understand"
| anything at all, I don't see the relevance of this part.
| photonthug wrote:
| > But saying "they don't understand" when they can do _this
| well_ is absurd.
|
| When we talk about understanding a simple axiomatic system,
| understanding means exactly that the entirety of the axioms
| are modeled and applied correctly 100% of the time. This is
| chess, not something squishy like literary criticism.
| There's no need to debate semantics at all. One illegal
| move is a deal breaker
|
| Undergraduate CS homework for playing any game with any
| technique would probably have the stipulation that any
| illegal move disqualifies the submission completely.
| Whining that it works most of the time would just earn
| extra pity/contempt as well as an F on the project.
|
| We can argue whether an error rate of 1 in a million means
| that it plays like a grandmaster or a novice, but that's
| less interesting. It failed to model a simple system
| correctly, and a much shorter/simpler program could do
| that. Doesn't seem smart if our response to this as an
| industry is to debate semantics, ignore the issue, and work
| feverishly to put it to work modeling more complicated /
| critical systems.
| Certhas wrote:
| You just made up a definition of "understand". According
| to that definition, you are of course right. I just don't
| think it's a reasonable definition. It's also
| contradicted by the person I was replying to in the
| sibling comment, where they argue that Stockfish doesn't
| understand chess, despite Stockfish of course having the
| "axioms" modeled and applied correctly 100% of the time.
|
| Here are things people say:
|
| Magnus Carlsen has a better understanding of chess than I
| do. (Yet we both know the precise rules of the game.)
|
| Grandmasters have a very deep understanding of Chess,
| despite occasionally making illegal moves that are not
| according to the rules
| (https://www.youtube.com/watch?v=m5WVJu154F0).
|
| "If AlphaZero were a conventional engine its developers
| would be looking at the openings which it lost to
| Stockfish, because those indicate that there's something
| Stockfish understands better than AlphaZero."
| (https://chess.stackexchange.com/questions/23206/the-
| games-al...)
|
| > Undergraduate CS homework for playing any game with any
| technique would probably have the stipulation that any
| illegal move disqualifies the submission completely.
| Whining that it works most of the time would just earn
| extra pity/contempt as well as an F on the project.
|
| How exactly is this relevant to the question whether LLMs
| can be said to have some understanding of chess? Can they
| consistently apply the rules when game states are given
| in pgn? No. _Very_ few humans without specialized
| training could either (without using a board as a tool to
| keep track of the implicit state). They certainly "know"
| the rules (even if they can't apply them) in the sense
| that they will state them correctly if you ask them to.
|
| I am not particularly interested in "the industry". It's
| obvious that if you want a system to play chess, you use
| a chess engine, not an LLM. But I am interested in what
| their chess abilities teaches us about how LLMs build
| world models. E.g.:
|
| https://aclanthology.org/2024.acl-srw.48/
| photonthug wrote:
| Thanks for your thoughtful comment and refs to chase
| down.
|
| > You just made up a definition of "understand".
| According to that definition, you are of course right. I
| just don't think it's a reasonable definition. ... Here
| are things people say:
|
| Fine. As others have pointed out and I hinted at..
| debating terminology is kind of a dead end. I personally
| don't expect that "understanding chess" is the same as
| "understanding Picasso", or that those phrases would mean
| the same thing if they were applied to people vs for AI.
| Also.. I'm also not personally that interested in how
| performance stacks up compared to humans. Even if it were
| interesting, the topic of human-equivalent performance
| would not have static expectations either. For example
| human-equivalent error rates in AI are much easier for me
| to expect and forgive in robotics than they are in
| axiomatic game-play.
|
| > I am interested in what their chess abilities teaches
| us about how LLMs build world models
|
| Focusing on the single datapoint that TFA is
| establishing: some LLMs can play some chess with some
| amount of expertise, with some amount of errors. With no
| other information at all, this tells us that it failed to
| model the rules, or it failed in the application of those
| rules, or both.
|
| Based on that, some questions worth asking: Which one of
| these failure modes is really acceptable and in which
| circumstances? Does this failure mode apply to domains
| other than chess? Does it help if we give it the model
| directly, say by explaining the rules directly in the
| prompt and also explicitly stating to not make illegal
| moves? If it's failing to apply rules, but excels as a
| model-maker.. then perhaps it can spit out a model
| directly from examples, and then I can feed the model
| into a separate engine that makes correct, deterministic
| steps that actually honor the model?
|
| Saying that LLMs do or don't understand chess is lazy I
| guess. My basic point is that the questions above and
| their implications are so huge and sobering that I'm very
| uncomfortable with premature congratulations and optimism
| that seems to be in vogue. Chess performance is
| ultimately irrelevant of course, as you say, but what
| sits under the concrete question is more abstract but
| very serious. Obviously it is dangerous to create
| tools/processes that work "most of the time", especially
| when we're inevitably going to be giving them tasks where
| we can't check or confirm "legal moves".
| stuaxo wrote:
| I hate the use of words like "understand" in these
| conversations.
|
| The system understands nothing, it's anthropomorphising it to
| say it does.
| Sharlin wrote:
| Trying to appropriate perfectly well generalizable terms as
| "something that only humans do" brings zero value to a
| conversation. It's a "god in the gaps" argument,
| essentially, and we don't exactly have a great track record
| of correctly identifying things that are uniquely human.
| fao_ wrote:
| There's very literally currently a whole wealth of papers
| proving that LLMs do not understand, cannot reason, and
| cannot perform basic kinds of reasoning that even a dog
| can perform. But, ok.
| TeMPOraL wrote:
| There's very literally currently a whole wealth of papers
| proving the opposite, too, so -\\_(tsu)_/-.
| wizzwizz4 wrote:
| There's a whole wealth of papers proving that LLMs do not
| understand _the concepts they write about_. That doesn 't
| mean they don't understand _grammar_ - which (as I 've
| claimed since the GPT-2 days) we _should_ ,
| theoretically, expect them to "understand". And what is
| chess, but a particularly sophisticated grammar?
| trashtester wrote:
| I have the same conclusion, but for the opposite reason.
|
| It seems like many people tend to use the word "understand"
| to that not only does someone believe that a given move is
| good, they also belive that this knowledge comes from a
| rational evaluation.
|
| Some attribute this to a non-material soul/mind, some to
| quantum mechanics or something else that seems magic, while
| others never realized the problem with such a belief in the
| first place.
|
| I would claim that when someone can instantly recognize
| good moves in a given situation, it doesn't come from
| rationality at all, but from some mix of memory and an
| intuition that has been build by playing the game many
| times, with only tiny elements of actual rational thought
| sprinkled in.
|
| This even holds true when these people start to calculate.
| It is primarily their intuition that prevens them from
| spending time on all sorts of unlikely moves.
|
| And this intuition, I think, represents most of their real
| "understanding" of the game. This is quite different from
| understanding something like a mathematical proof, which is
| almost exclusively inducive logic.
|
| And since "understand" so often is associated with rational
| inductive logic, I think the proper term would be to have
| "good intuition" when playing the game.
|
| And this "good intuition" seems to me precisely the kind of
| thing that is trained within most neural nets, even LLM's.
| (Q*, AlphaZero, etc also add the ability to "calculate",
| meaning traverse the search space efficiently).
|
| If we wanted to measure how good this intuition is compared
| to human chess intuition, we could limit an engine like
| AlphaZero to only evaluate the same number of moves per
| second that good humans would be able to, which might be
| around 10 or so.
|
| Maybe with this limitation, the engine wouldn't currently
| be able to beat the best humans, but even if it reaches a
| rating of 2000-2500 this way, I would say it has a pretty
| good intuitive understanding.
| int_19h wrote:
| The whole point of this exercise is to understand what
| "understand" even means. Because we really don't have a
| good definition for this, and until we do, statements like
| "the system understands nothing" are vacuous.
| cma wrote:
| Its training set would include a lot of randomly generated
| positions like that that then get played out by chess engines
| wouldn't it? Just from people messing around andbposting
| results. Not identical ones, but similarly oddball.
| thaumasiotes wrote:
| > Here's one way to test whether it really understands chess.
| Make it play the next move in 1000 random legal positions
|
| Suppose it tries to capture en passant. How do you know whether
| that's legal?
| BalinKing wrote:
| I feel like you could add "do not capture en passant unless
| it is the only possible move" to the test without changing
| what it's trying to prove--if anything, some small
| permutation like this might even make it a stronger test of
| "reasoning capability." (Personally I'm unconvinced of the
| utility of this test in the first place, but I think it can
| be reasonably steelmanned.)
| namaria wrote:
| Assigning "understanding" to an undefined entity is an
| undefined statement.
|
| It isn't even wrong.
| _heimdall wrote:
| Would that be enough to prove it? If the LLM was trained only
| on a set of legal moves, isn't it possible that it functionally
| learned how each piece is allowed to move without learning how
| to actually reason about it?
|
| Said differently in case I phrased that poorly - couldn't the
| LLM still learn the it only ever saw bishops move diagonally
| and therefore only considering those moves without actually
| reasoning through the concept of legal and illegal moves?
| zbyforgotp wrote:
| The problem is that the llm don't learn to play moves from a
| position, the internet archives contain only game records. They
| might be building something to represent position
| internationally but it will not be automatically activated with
| an encoded chess position.
| tromp wrote:
| The ChessPositionRanking project, with help from the Texel
| chess engine author, tries to prove random positions (that
| are not obviously illegal) legal by constructing a game
| ending in the position. If that fails it tries to prove the
| position illegal. This now works for over 99.99% of randomly
| generated positions, so one can feed the legal game record
| found for random legal positions.
| viraptor wrote:
| I'm glad he improved the promoting, but he's still leaving out
| two likely huge improvements.
|
| 1. Explain the current board position and the plan going
| forwards, _before_ proposing a move. This lets the model actually
| think more, kind of like o1, but here it would guarantee a more
| focused processing.
|
| 2. Actually draw the ascii board for each step. Hopefully
| producing more valid moves since board + move is easier to
| reliably process than 20xmove.
| unoti wrote:
| I came here to basically say the same thing. The improvements
| the OP saw by asking it to repeat all the moves so far gives
| the LLM more time and space to think. I have this hypothesis
| giving it more time and space to think in other ways could
| improve performance even more, something like showing the
| current board position and asking it to perform an analysis of
| the position, list key challenges and strengths, asking it for
| a list of strategies possible from here, then asking it to
| select a strategy amongst the listed strategies, then asking it
| for its move. In general, asking it to really think rather than
| blurt out a move. The examples would be key here.
|
| These ideas were proven to work very well in the ReAct paper
| (and by extension, the CoT Chain of Thought paper). Could also
| extend this by asking it to do this N times and stop when we
| get the same answer a majority of times (this is an idea stolen
| from the CoT-SC paper, chain of through self-consistency).
| viraptor wrote:
| It would be awesome if the author released a framework to
| play with this. I'd like to test things out, but I don't want
| to spend time redoing all his work from scratch.
| fragmede wrote:
| Just have ChatGPT write the framework
| duskwuff wrote:
| > 2. Actually draw the ascii board for each step.
|
| I doubt that this is going to make much difference. 2D
| "graphics" like ASCII art are foreign to language models - the
| models perceive text as a stream of tokens (including
| newlines), so "vertical" relationships between lines of text
| aren't obvious to them like they would be to a human viewer.
| Having that board diagram in the context window isn't likely to
| help the model reason about the game.
|
| Having the model list out the positions of each piece on the
| board in plain text (e.g. "Black knight at c5") might be a more
| suitable way to reinforce the model's positional awareness.
| yccs27 wrote:
| With positional encoding, an ascii board diagram actually
| shouldn't be that hard to read for an LLM. Columns and
| diagonals are just different strides through the flattened
| board representation.
| magicalhippo wrote:
| I've had _some_ success getting models to recognize simple
| electronic circuits drawn using ASCII art, including stuff
| like identifying a buck converter circuit in various guises.
|
| However, as you point out, the way we feed these models
| especially make them vertically challenged, so to speak. This
| makes them unable to reliably identify vertically separated
| components in a circuit for example.
|
| With combined vision+text models becoming more common place,
| perhaps running the rendered text input through the vision
| model might help.
| daveguy wrote:
| > Actually draw the ascii board for each step.
|
| The relative rarity of this representation in training data
| means it would probably degrade responses rather than improve
| them. I'd like to see the results of this, because I would be
| very surprised if it improved the responses.
| ilaksh wrote:
| The fact that he hasn't tried this leads me to think that deep
| down he doesn't want the models to succeed and really just
| wants to make more charts.
| TeMPOraL wrote:
| RE 2., I doubt it'll help - for at least two reasons, already
| mentioned by 'duskwuff and 'daveguy.
|
| RE 1., definitely worth trying, and there's more variants of
| such tricks specific to models. I'm out of date on OpenAI docs,
| but with Anthropic models, the docs suggest _using XML
| notation_ to label and categorize most important parts of the
| input. This kind of soft structure seems to improve the results
| coming from Claude models; I imagine they specifically trained
| the model to recognize it.
|
| See: https://docs.anthropic.com/en/docs/build-with-
| claude/prompt-...
|
| In author's case, for Anthropic models, the final prompt could
| look like this: <role>You are a chess
| grandmaster.</role> <instructions> You will be
| given a partially completed game, contained in <game-log> tags.
| After seeing it, you should repeat the ENTIRE GAME and then
| give ONE new move Use standard algebraic notation, e.g.
| "e4" or "Rdf8" or "R1a3". ALWAYS repeat the entire
| representation of the game so far, putting it in <new-game-log>
| tags. Before giving the new game log, explain your
| reasoning inside <thinking> tag block. </instructions>
| <example> <request> <game-log>
| *** example game *** </game-log> </request>
| <reply> <thinking> *** some example explanation
| ***</thinking> <new-game-log> *** game log + next
| move *** </new-game-log> </reply>
| </example> <game-log> *** the incomplete
| game goes here *** </game-log>
|
| This kind of prompting is supposed to provide noticeable
| improvement for Anthropic models. Ironically, I only discovered
| it few weeks ago, despite having been using Claude 3.5 Sonnet
| extensively for months. Which goes to say, _RTFM is still a
| useful skill_. Maybe OpenAI models have similar affordances
| too, simple but somehow unnoticed? (I 'll re-check the docs
| myself later.)
| tedsanders wrote:
| Chain of thought helps with many problems, but it actually
| tanks GPT's chess performance. The regurgitation trick was the
| best (non-fine tuning) technique in my own chess experiments
| 1.5 years ago.
| seizethecheese wrote:
| All the hand wringing about openAI cheating suggests a question:
| why so much mistrust?
|
| My guess would be that the persona of the openAI team on
| platforms like Twitter is very cliquey. This, I think, naturally
| leads to mistrust. A clique feels more likely to cheat than some
| other sort of group.
| simonw wrote:
| I wrote about this last year. The levels of trust people have
| in companies working in AI is notably low:
| https://simonwillison.net/2023/Dec/14/ai-trust-crisis/
| nuancebydefault wrote:
| My take on this is that people tend to be afraid of what they
| can't understand or explain. To do away with that feeling, they
| just say 'it can't reason'. While nobody on earth can put a
| finger on what reasoning is, other than that it is a human
| trait.
| gallerdude wrote:
| Very interesting - have you tried using `o1` yet? I made a
| program which makes LLM's complete WORDLE puzzles, and the
| difference between `4o` and `o1` is absolutely astonishing.
| gallerdude wrote:
| 4o-mini: 16% 4o: 50% o1-mini: 97% o1: 100%
|
| * disclaimer - only n=7 on o1. Others are like 100-300 each
| simonw wrote:
| OK, that was fun. I just tried o1-preview on today's Wordle and
| it got it on the third guess:
| https://chatgpt.com/share/673f9169-3654-8006-8c0b-07c53a2c58...
| gallerdude wrote:
| With some transcribing (using another LLM instance) I've even
| gotten it to solve NYT mini crosswords.
| ChrisArchitect wrote:
| Related from last week:
|
| _Something weird is happening with LLMs and Chess_
|
| https://news.ycombinator.com/item?id=42138276
| kibwen wrote:
| _> I was astonished that half the internet is convinced that
| OpenAI is cheating._
|
| If you have a problem and all of your potential solutions are
| unlikely, then it's fine to assume the least unlikely solution
| while acknowledging that it's statistically probable that you're
| also wrong. IOW if you have ten potential solutions to a problem
| and you estimate that the most likely solution has an 11% chance
| of being true, it's fine to assume that solution despite the fact
| that, by your own estimate, you have an 89% chance of being
| wrong.
|
| The "OpenAI is secretly calling out to a chess engine" hypothesis
| always seemed unlikely to me (you'd think it would play much
| better, if so), but it seemed the easiest solution (Occam's
| razor) and I wouldn't have been _surprised_ to learn it was true
| (it 's not like OpenAI has a reputation of being trustworthy).
| bongodongobob wrote:
| That's not really how Occam's razor works. The entire company
| colluding and lying to the public isn't "easy". Easy is more
| along the lines of "for some reason it is good at chess but
| we're not sure why".
| simonw wrote:
| One of the reasons I thought that was unlikely was personal
| pride. OpenAI researchers are proud of the work that they do.
| Cheating by calling out to a chess engine is something they
| would be ashamed of.
| kibwen wrote:
| _> OpenAI researchers are proud of the work that they do._
|
| Well, the failed revolution from last year combined with
| the non-profit bait-and-switch pretty much conclusively
| proved that OpenAI researchers are in it for the money
| first and foremost, and pride has a dollar value.
| fkyoureadthedoc wrote:
| How much say do individual researchers even have in this
| move?
|
| And how does that prove anything about their motivations
| "first and foremost"? They could be in it because they
| like the work itself, and secondary concerns like open or
| not don't matter to them. There's basically infinite
| interpretations of their motivations.
| dogleash wrote:
| > The entire company colluding and lying to the public isn't
| "easy".
|
| Why not? Stop calling it "the entire company colluding and
| lying" and start calling it a "messaging strategy among the
| people not prevented from speaking by NDA." That will pass a
| casual Occam's test that "lying" failed. But they both mean
| the same exact thing.
| TeMPOraL wrote:
| It won't, for the same reason - whenever you're proposing a
| conspiracy theory, you have to explain what stops every
| person involved from leaking the conspiracy, whether on
| purpose or by accident. This gets superlinearly harder with
| number of people involved, and extra hard when there are
| incentives rewarding leaks (and leaking OpenAI secrets has
| some strong potential rewards).
|
| Occam's test applies to the full proposal, _including_ the
| explanation of things outlined above.
| og_kalu wrote:
| >but it seemed the easiest solution (Occam's razor)
|
| In my opinion, it only seems like the easiest solution on the
| surface taking basically nothing into account. By the time you
| start looking at everything in context, it just seems bizarre.
| kibwen wrote:
| To reiterate, your assessment is true and we can assign it a
| low probability, but in the context of trying to explain why
| one model would be an outrageous outlier, manual intervention
| was the simplest solution out of all the other hypotheses,
| despite being admittedly bizarre. The thrust of the prior
| comment is precisely to caution against conflating relative
| and absolute likelihoods.
| slibhb wrote:
| I don't think it has anything to do with your logic here.
| Actually, people just like talking shit about OpenAI on HN. It
| gets you upvotes.
| Legend2440 wrote:
| LLM cynicism exceeds LLM hype at this point.
| influx wrote:
| I wouldn't call delegating specialized problems to specialized
| engines cheating. While it should be documented, in a full AI
| system, I want the best answer regardless of the technology
| used.
| tmalsburg2 wrote:
| Why not use temperature 0 for sampling? If the top-ranked move is
| not legal, it can't play chess.
| thornewolf wrote:
| sometimes skilled chess players make illegal moves
| atiedebee wrote:
| Extremely rare. The only time this happened that I'm aware of
| was quite recent but the players only had a second or 2
| remaining on the clock, so time pressure is definitely the
| reason there
| GaggiX wrote:
| It often happens when the players play blondfold chess, as
| in this case.
| a2128 wrote:
| Is this really equivalent to blindfold chess? The LLM has
| access to the full move history, unlike blindfold chess
| where memorization is necessary
| atemerev wrote:
| Ah, half of the commentariat still think that "LLMs can't
| reason". Even if they have enough state space for reasoning, and
| clearly demonstrate that.
| brookst wrote:
| But it's not real reasoning because it is just outputting
| likely next tokens that are identical to what we'd expect with
| reasoning. /s
| lottin wrote:
| "The question of whether a computer can think is no more
| interesting than the question of whether a submarine can swim."
| - Edsger Dijkstra
| sourcepluck wrote:
| Most people, as far as I'm aware, don't have an issue with the
| idea that LLMs are producing behaviour which gives the
| appearance of reasoning as far as we understand it today. Which
| essentially means, it makes sentences that are gramatical,
| responsive and contextual based on what you said (quite often).
| It's at least pretty cool that we've got machines to do that,
| most people seem to think.
|
| The issue is that there might be more to _reason_ than
| _appearing to reason_. We just don 't know. I'm not sure how
| it's apparently so unknown or unappreciated by people in the
| computer world, but there are major unresolved questions in
| science and philosophy around things like thinking, reasoning,
| language, consciousness, and the mind. No amount of techno-
| optimism can change this fact.
|
| The issue is we have not gotten further than more or less
| educated guesses as to what those words mean. LLMs bring that
| interesting fact to light, even providing humanity with a
| wonderful nudge to keep grappling with these unsolved
| questions, and perhaps make some progress.
|
| To be clear, they certainly are sometimes passably good when it
| comes to summarising selectively and responsively the terabytes
| and terabytes of data they've been trained on, don't get me
| wrong, and I am enjoying that new thing in the world. And if
| you want to define _reason_ like that, feel free.
| atemerev wrote:
| LLMs can _play chess_. With the game positions previously
| unseen. How's that not actual logical reasoning?
| sourcepluck wrote:
| I guess you don't follow TCEC, or computer chess
| generally[0]. Chess engines have been _playing chess_ at
| superhuman levels using neural networks for years now, it
| was a revolution in the space. AlphaZero, Lc0, Stockfish
| NNUE. I don't recall yards of commentary arguing that they
| were _reasoning_.
|
| Look, you can put as many underscores as you like, the
| question of whether these machines are _really reasoning_
| or _emulating reason_ is not a solved problem. We don 't
| know what reasoning is! We don't know if _we_ are really
| reasoning, because we have major unresolved questions
| regarding the mind and consciousness[1].
|
| These may not be intractable problems either, there's
| reason for hope. In particular, studying brains with more
| precision is obviously exciting there. More computational
| experiments, including the recent explosion in LLM
| research, is also great.
|
| Still, reflexively believing in the computational theory of
| the mind[2] without engaging in the actual difficulty of
| those questions, though commonplace, is not reasonable.
|
| [0] Jozarov on YT has great commentary of top engine games,
| worth checking out.
|
| [1] https://plato.stanford.edu/entries/consciousness/
|
| [2] https://plato.stanford.edu/entries/computational-mind/
| atemerev wrote:
| I am not implying that LLMs are conscious or something.
| Just that they can reason, i.e. draw logical conclusions
| from observations (or, in their case, textual inputs),
| and make generalizations. This is a much weaker
| requirement.
|
| Chess engines can reason about chess (they can even
| explain their reasoning). LLMs can reason about many
| other things, with varied efficiency.
|
| What everyone is currently trying to build is something
| like AlphaZero (adversarial self-improvement for
| superhuman performance) with the state space of LLMs
| (general enough to be useful for most tasks). When we'll
| have this, we'll have AGI.
| og_kalu wrote:
| If it displays the outwards appearances of reasoning then it
| is reasoning. We don't evaluate humans any differently.
| There's no magic intell-o-meter that can detect the amount of
| intelligence flowing through a brain.
|
| Anything else is just an argument of semantics. The idea that
| there is "true" reasoning and "fake" reasoning but that we
| can't tell the latter apart from the former is ridiculous.
|
| You can't eat your cake and have it. Either "fake reasoning"
| is a thing and can be distinguished or it can't and it's just
| a made up distinction.
| suddenlybananas wrote:
| If I have a calculator with a look-up table of all
| additions of natural numbers under 100, the calculator can
| "appear" to be adding despite the fact it is not.
| sourcepluck wrote:
| Yes, indeed. Bullets know how to fly, and my kettle
| somehow _knows_ that water boils at 373.15K! There 's
| been an explosion of intelligence since the LLMs came
| about :D
| og_kalu wrote:
| Bullets don't have the outward appearance of flight. They
| follow the motion of projectiles and look it. Finding the
| distinction is trivial.
|
| The look up table is the same. It will fall apart with
| numbers above 100. That's the distinction.
|
| People need to start bringing up the supposed distinction
| that exists with LLMs instead of nonsense examples that
| don't even pass the test outlined.
| og_kalu wrote:
| Until you ask it to add number above 100 and it falls
| apart. That is the point here. You found a distinction.
| If you can't find one then you're arguing semantics.
| People who say LLMs can't reason are yet to find a
| distinction that doesn't also disqualify a bunch of
| humans.
| int_19h wrote:
| This argument would hold up if LMs were large enough to
| hold a look-up table of all possible valid inputs that
| they can correctly respond to. They're not.
| furyofantares wrote:
| LLMs are fundamentally text-completion. The Chat-based tuning
| that goes on top of it is impressive but they are fundamentally
| text-completion, that's where most of the training energy goes. I
| keep this in mind with a lot of my prompting and get good
| results.
|
| Regurgitating and Examples are both ways to lean into that and
| try to recover whatever has been lost by Chat-based tuning.
| zi_ wrote:
| what else do you think about when prompting, which you've found
| to be useful?
| blixt wrote:
| Really interesting findings around fine-tuning. Goes to show it
| doesn't really affect the deeper "functionality" of the LLM (if
| you think of the LLM running a set of small functions on very
| high-dimensional numbers to produce a token).
|
| Using regurgitation to get around the assistant/user token
| separation is another fun tool for the toolbox, relevant for
| whenever you want a model that doesn't support continuation
| actually perform continuation (at the cost of a lot of latency).
|
| I wonder if any type of reflection or chains of thought would
| help it play better. I wouldn't be surprised if getting the LLM
| to write an analysis of the game in English is more likely to
| move it out of distribution than to make it pick better chess
| moves.
| MisterTea wrote:
| This happened to a friend who was trying to sim basketball games.
| It kept forgetting who had the ball or outright made illegal or
| confusing moves. After a few days of wrestling with the AI he
| gave up. GPT is amazing at following a linear conversation but
| had no cognitive ability to keep track of a dynamic scenario.
| xg15 wrote:
| > _In many ways, this feels less like engineering and more like a
| search for spells._
|
| This is still my impression of LLMs in general. It's amazing that
| they work, but for the next tech disruption, I'd appreciate
| something that doesn't make you feel like being in a bad sci-fi
| movie all the time.
| jey wrote:
| Could be interesting to create a tokenizer that's optimized for
| representing chess moves and then training a LLM (from scratch?)
| on stockfish games. (Using a custom tokenizer should improve the
| quality for a given size of the LLM model. So it doesn't have to
| waste a lot of layers on encode and decode, and the "natural"
| latent representation is more straightforward)
| sourcepluck wrote:
| I don't like being directly critical, people learning in public
| can be good and instructive. But I regret the time I've put into
| both this article and the last one and perhaps someone else can
| be saved the same time.
|
| This is someone with limited knowledge of chess, statistics and
| LLMs doing a series of public articles as they learn a little
| tiny bit about chess, statistics and LLMs. And it garners upvotes
| and attention off the coat-tails of AI excitement. Which is fair
| enough, it's the (semi-)public internet, but it sort of
| masquerades as being half-serious "research", and it kind of held
| things together for the first article, but this one really is
| thrown together to keep the buzz going of the last one.
|
| The TL;DR :: one of the AIs being just-above-terrible, compared
| to all the others being completely terrible, a fact already of
| dubious interest, is down to - we don't know. Maybe a difference
| in training sets. Tons of speculation. A few graphs.
| phkahler wrote:
| You can easily construct a game board from a sequence of moves by
| maintaining the game state somewhere. But you can also know where
| a piece is bases on only its last move. I'm curious what happens
| if you don't feed it a position, but feed it a sequence of moves
| including illegal ones but end up at a given valid position. The
| author mention that LLMs will play differently when the same
| position is arrived at via different sequences. I'm suggesting to
| really play with that by putting illegal moves in the sequence.
|
| I doubt it's doing much more than a static analysis of the a
| board position, or even moving based mostly on just a few recent
| moves by key pieces.
| drivingmenuts wrote:
| Why would a chess-playing AI be tuned to do anything except play
| chess? Just seems like a waste. A bunch of small, specialized
| AI's seems like a better idea than spending time trying to build
| a new one.
|
| Maybe less morally challenging, as well. You wouldn't be trying
| to install "sentience".
| int_19h wrote:
| It's the other way around - you might want a general-purpose
| model to learn to play chess because e.g. it improves its
| ability to reason logically in other cases (which has been
| claimed for humans pretty much ever since chess was invented).
|
| Considering that training models on code seems to improve their
| abilities on non-coding tasks in actual testing, this isn't
| even all that far-fetched. Perhaps that is why GPT-3.5 was
| specifically trained on chess in the first place.
| PaulHoule wrote:
| People have to quit this kind of stumbling in the dark with
| commercial LLMs.
|
| To get to the bottom of this it would be interesting to train
| LLMs on nothing but chess games (can synthesize them endlessly by
| having Stockfish play against itself) with maybe a side helping
| of chess commentary and examples of chess dialogs "how many pawns
| are on the board?", "where are my rooks?", "draw the board",
| competence at which would demonstrate that it has a
| representation of the board.
|
| I don't believe in "emergent phenomena" or that the general
| linguistic competence or ability to feign competence is necessary
| for chess playing (being smart at chess doesn't mean you are
| smart at other things and vice versa). With experiments like this
| you might prove me wrong though.
|
| This paper came out about a week ago
|
| https://arxiv.org/pdf/2411.06655
|
| seems to get good results with a fine-tuned Llama. I also like
| this one as it is about competence in chess commentary
|
| https://arxiv.org/abs/2410.20811
| toxik wrote:
| Predicting next moves of some expert chess policy is just
| imitation learning, a well-studied proposal. You can add
| return-to-go to let the network try to learn what kinds of
| moves are made in good vs bad games, which would be an offline
| RL regime (eg, Decision Transformers).
|
| I suspect chess skill is completely useless for LLMs in general
| and not an emergent phenomenon, just consuming gradient
| bandwidth and parameter space to do this neat trick. This is
| clear to me because the LLMs that aren't trained specifically
| on chess do not do chess well.
| PaulHoule wrote:
| In either language or chess I'm still a bit baffled how a
| representation over continuous variables (differentiable no
| less) works for something that is discrete such as words,
| letters, chess moves, etc. Add the word "not" a sentence and
| it is not a perturbation of the meaning but a reversal (or is
| it?)
|
| A difference between communication and chess is that your
| partner in conversation is your ally in meaning making and
| will help fix your mistakes which is how they get away with
| bullshitting. ("Personality" makes a big difference, by the
| time you are telling your programming assistant "Dude,
| there's a red squiggle on line 92" you are under its spell)
|
| Chess on the other hand is adversarial and your mistakes are
| just mistakes that your opponent will take advantage of. If
| you make a move and your hunch that your pieces are not in
| danger is just slightly wrong (one piece in danger) that's
| almost as bad as having all your non-King pieces in danger
| (they can only take one next turn.)
| joshka wrote:
| Why are you telling it not to explain? Allowing the LLM space to
| "think" may be helpful, and would be definitely worth explorying?
|
| Why are you manually guessing ways to improve this? Why not let
| the LLMs do this for themselves and find iteratively better
| prompts?
| bambax wrote:
| Very good follow-up to the original article. Thank you!
| kqr wrote:
| I get that it would make evals even more expensive, but I would
| also try chain-of-thought! Have it explain its goals and
| reasoning for the next move before making it. It might be an
| awful idea for something like chess, but it seems to help
| elsewhere.
| Palmik wrote:
| It might be worth trying the experiment where the prompt is
| formatted such that each chess turn corresponds to one chat
| message.
| Jean-Papoulos wrote:
| >According to that figure, fine-tuning helps. And examples help.
| But it's examples that make fine-tuning redundant, not the other
| way around.
|
| This is extremely interesting. In this specific case at least,
| simply giving examples is equivalent to fine-tuning. This is a
| great discovery for me, I'll try using examples more often.
| jdthedisciple wrote:
| To me this is very intuitively true.
|
| I can't explain why.I always had the intuition that fine-tuning
| was overrated.
|
| One reason perhaps is that examples are "right there" and thus
| implicitly weighted much more in relation to the fine-tuned
| neurons.
| s5ma6n wrote:
| Agreed on providing examples is definitely a useful insight vs
| fine-tuning.
|
| While it is not very important for this toy case, it's good to
| keep in mind that each provided example in the input will
| increase the prediction time and cost compared to fine-tuning.
| marcus_holmes wrote:
| I notice there's no prompt saying "you should try to win the
| game" yet the results are measured by how much the LLM wins.
|
| Is this implicit in the "you are a grandmaster chess player"
| prompt?
|
| Is there some part of the LLM training that does "if this is a
| game, then I will always try to win"?
|
| Could the author improve the LLM's odds of winning just by
| telling it to try and win?
| Nashooo wrote:
| IMO this is clearly implicit in the "you are a grandmaster
| chess player" prompt. As that should make generating best
| possible move tokens more likely.
| Ferret7446 wrote:
| Is it? What if the AI is better than a grandmaster chess
| player and is generating the most likely next move that a
| grandmaster chess player might make and not the most likely
| move to win, which may be different?
| lukan wrote:
| Depends on the training data I think. If the data divides
| in games by top chess engines - and human players, then
| yes, it might make a difference to tell it, to play like a
| grandmaster of chess vs. to play like the top chess engine.
| cma wrote:
| Grandmasters usually play grandmasters of similar ELO, so it
| might think it doesn't always win. Even if it should
| recognize the player isn't a grandmaster, it still may be
| better to include that, though who knows without testing.
| tinco wrote:
| I think you're putting too much weight on its intentions, it
| doesn't have intentions it is a mathematical model that is
| trained to give the most likely outcome.
|
| In almost all examples and explanations it has seen from chess
| games, each player would be trying to win, so it is simply the
| most logical thing for it to make a winning move. So I wouldn't
| expect explicitly prompting it to win to improve its
| performance by much if at all.
|
| The reverse would be interesting though, if you would prompt it
| to make losing/bad moves, would it be effective in doing so,
| and would the moves still be mostly legal? That might reveal a
| bit more about how much relies on concepts it's seen before.
| graypegg wrote:
| Might also be interesting to see if mentioning a target ELO
| score actually works over enough simulated games. I can
| imagine there might be regular mentions of a player's ELO
| score near their match history in the training data.
|
| That way you're trying to emulate cases where someone is
| trying, but isn't very good yet, versus trying to emulate
| cases where someone is clearly and intentionally losing which
| is going to be orders of magnitude less common in the
| training data. (And I also would bet "losing" is also a
| vector/token too closely tied to ANY losing game, but those
| players were still putting up a fight to try and win the
| game. Could still drift towards some good moves!)
| montjoy wrote:
| I came to the comments to say this too. If you were prompting
| it to generate code, you generally get better results when you
| ask it for a result. You don't just tell it, "You are a python
| expert and here is some code". You give it a direction you want
| the code to go. I was surprised that there wasn't something
| like, "and win", or, "black wins", etc.
| tananan wrote:
| It would surely just be fluff in the prompt. The model's
| ability to generate chess sequences will be bounded by the
| expertise in the pool of games in the training set.
|
| Even if the pool was poisoned by games in which some players
| are trying to lose (probably insignificant), no one annotates
| player intent in chess games, and so prompting it to win or
| lose doesn't let the LLM pick up on this.
|
| You can try this by asking an LLM to play to lose. ChatGPT ime
| tries to set itself up for scholar's mate, but if you don't go
| for it, it will implicitly start playing to win (e.g. taking
| your unprotected pieces). If you ask it "why?", it gives you
| the usual bs post-hoc rationalization.
| danw1979 wrote:
| > It would surely just be fluff in the prompt. The model's
| ability to generate chess sequences will be bounded by the
| expertise in the pool of games in the training set.
|
| There are drawn and loosing games in the training set though.
| boredhedgehog wrote:
| Further, the prompt also says to "choose the next move" instead
| of the best move.
|
| It would be fairly hilarious if the reinforcement training has
| made the LLM unwilling to make the human feel bad through
| losing a game.
| byyoung3 wrote:
| sometimes new training techniques will lead to regressions in
| certain tasks. My guess is this is exactly what has happened.
| boesboes wrote:
| It would be interesting to see if it can also play chess with
| altered rules, or actually just a novel 'game' that relies on
| logic & reasoning. Still not sure if that would 'prove' LLMs do
| reasoning, but I'd be pretty close to convinced.
| blueboo wrote:
| Fun idea. Let's change how the knight behaves. Or try it on
| Really Bad Chess (puzzles with impossible layouts) or 6x6 chess
| or 8x9 chess.
|
| I wonder if there are variants that have good baselines. It
| might be tough to evaluate vis a vis human performance on novel
| games..
| Miraltar wrote:
| If they were trained on multiple chess variants that might work
| but as is it's impossible I think. Their internal model to play
| chess is probably very specific
| leumassuehtam wrote:
| I'm convinced that "completion" models are much more useful (and
| smart) than "chat" models, being able to provide more nuanced and
| original outputs. When gpt4 come out, text-davinci-003 would
| still provide better completions with the correct prompt. Of
| course this model was later replaced by gpt-3.5-turbo-instruct
| which is explored in this post.
|
| I believe the reason why such models were later deprecated was
| "alignment".
| a2128 wrote:
| I don't believe alignment/safety is the only reason. You also
| burn through significantly more output tokens in a back-and-
| forth editing session because by default it keeps repeating the
| entire code or document just to make one small change, and it
| also adds useless fluff around the text ("You are absolutely
| correct, and I apologize for the confusion...")
| qnleigh wrote:
| Two other theories that could explain why OpenAI's models do so
| well:
|
| 1. They generate chess games from chess engine self play and add
| that to the training data (similar to the already-stated theory
| about their training data).
|
| 2. They have added chess reinforcement learning to the training
| at some stage, and actually got it to work (but not very well).
| sourcepluck wrote:
| > Since gpt-3.5-turbo-instruct has been measured at around 1800
| Elo
|
| Where's the source for this? What's the reasoning? I don't see
| it. I have just relooked, and stil l can't see it.
|
| Is it 1800 lichess "Elo", or 1800 FIDE, that's being claimed? And
| 1800 at what time control? Different time controls have different
| ratings, as one would imagine/hope the author knows.
|
| I'm guessing it's not 1800 FIDE, as the quality of the games
| seems far too bad for that. So any clarity here would be
| appreciated.
| og_kalu wrote:
| https://github.com/adamkarvonen/chess_gpt_eval
| sourcepluck wrote:
| Thank you. I had seen that, and had browsed through it, and
| thought: I don't get it, the reason for this 1800 must be
| elsewhere.
|
| What am I missing? Where does it show there how the claim of
| "1800 ELO" is arrived at?
|
| I can see various things that might be relevant, for example,
| the graph where it (GPT-3.5-turbo-instruct) is shown as going
| from mostly winning to mostly losing when it gets to
| Stockfish level 3. It's hard (/impossible) to estimate the
| lichess or FIDE ELO of the different Stockfish levels, but
| Lichess' Stockfish on level 3 is miles below 1800 FIDE, and
| it seems to me very likely to be below lichess 1800.
|
| I invite any FIDE 1800s and (especially) any Lichess 1800s to
| play Stockfish level 3 and report back. Years ago when I
| played a lot on Lichess I was low 2000s in rapid, and I win
| comfortably up till Stockfish level 6, where I can win, but
| also do lose sometimes. Basically I really have to start
| paying attention at level 6.
|
| Level 3 seems like it must be below lichess 1800, but it's
| just my anecdotal feeling of the strengths. Seeing as how the
| article is chocabloc full of unfounded speculation and bias
| though, maybe we can indulge ourselves too.
|
| So: someone please explain the 1800 thing to me? And any
| lichess 1800s like to play guinea pig, and play a series of
| games against stockfish 3, and report back to us?
| og_kalu wrote:
| In Google's paper, then titled "Grandmaster level chess
| without search", they evaluate turbo-instruct to have a
| lichess Elo of 1755 (vs bots)
|
| https://arxiv.org/abs/2402.04494
|
| Admittedly, this isn't really "the source" though. The
| first people to break the news on turbo-instruct's chess
| ability all pegged it around 1800.
| https://x.com/GrantSlatton/status/1703913578036904431
| sourcepluck wrote:
| Thank you, I do appreciate it. I had a quick search
| through the paper, and can at least confirm for myself
| that it's a Lichess Elo, and one of 1755, that is found
| in that arxiv paper. That tweet there that says 1800,
| without specifying it's a Lichess rating, I can't see
| where he gets it from (but I don't have Twitter, I could
| be missing something).
|
| At least the arxiv paper is serious:
|
| > A direct comparison between all engines comes with a
| lot of caveats since some engines use the game history,
| some have very different training protocols (i.e., RL via
| self-play instead of supervised learning), and some use
| search at test time. We show these comparisons to situate
| the performance of our models within the wider landscape,
| but emphasize that some conclusions can only be drawn
| within our family of models and the corresponding
| ablations that keep all other factors fixed.
| sourcepluck wrote:
| > For one, gpt-3.5-turbo-instruct rarely suggests illegal moves,
| even in the late game.
|
| It's claimed that this model "understands" chess, and can
| "reason", and do "actual logic" (here in the comments).
|
| I invite anyone making that claim to find me an "advanced
| amateur" (as the article says of the LLM's level) chess player
| who ever makes an illegal move. Anyone familiar with chess can
| confirm that it doesn't really happen.
|
| Is there a link to the games where the illegal moves are made?
| zarzavat wrote:
| An LLM is essentially playing blindfold chess if it just gets
| the moves and not the position. You have to be fairly good to
| never make illegal moves in blindfold.
| fmbb wrote:
| Does it not always have a list of all the moves in the game
| always at hand in the prompt?
|
| You have to give this human the same log of the game to refer
| to.
| xg15 wrote:
| I think even then it would still be blindfold chess,
| because humans do a lot of "pattern matching" on the actual
| board state in front of them. If you only have the moves,
| you have to reconstruct this board state in your head.
| pera wrote:
| A chat conversation where every single move is written down
| and accessible at any time is not the same as blindfold
| chess.
| zbyforgotp wrote:
| You can make it available to the player and I suspect it
| wouldn't change the outcomes.
| gwd wrote:
| OK, but the LLM is still playing without a board to look
| at, except what's "in its head". How often would 1800 ELO
| chess players make illegal moves when playing only using
| chess notation over chat, with no board to look at?
|
| What might be interesting is to see if there was some sort
| of prompt the LLM could use to help itself; e.g., "After
| repeating the entire game up until this point, describe
| relevant strategic and tactical aspects of the current
| board state, and then choose a move."
|
| Another thing that's interesting is the 1800 ELO cut-off of
| the training data. If the cut-off were 2000, or 2200, would
| that improve the results?
|
| Or, if you included training data but labeled with the
| player's ELO, could you request play at a specific ELO?
| Being able to play against a 1400 ELO computer _that made
| the kind of mistakes a 1400 ELO human would make_ would be
| amazing.
| wingmanjd wrote:
| MaiaChess [1] supposedly plays at a specific ELO, making
| similar mistakes a human would make at those levels.
|
| It looks like they have 3 public bots on lichess.org:
| 1100, 1500, and 1900
|
| [1] https://www.maiachess.com/
| lukeschlather wrote:
| The LLM can't refer to notes, it is just relying on its
| memory of what input tokens it had.
| GaggiX wrote:
| I can confirm that an advanced amateur can play illegal moves
| by playing blindfold chess as shown in this article.
| _heimdall wrote:
| This is the problem with LLM researchers all but giving up on
| the problem of inspecting how the LLM actually works
| internally.
|
| As long as the LLM is a black box, it's entirely possible that
| (a) the LLM does reason through the rules and understands what
| moves are legal or (b) was trained on a large set of legal
| moves and therefore only learned to make legal moves. You can
| claim either case is the real truth, but we have absolutely no
| way to know because we have absolutely no way to actually
| understand what the LLM was "thinking".
| codeulike wrote:
| Here's an article where they teach an LLM Othello and then
| probe its internal state to assess whether it is 'modelling'
| the Othello board internally
|
| https://thegradient.pub/othello/
|
| Associated paper: https://arxiv.org/abs/2210.13382
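|
| The probing idea itself is simple: freeze the model, grab hidden
| activations for many game states, and fit a small classifier per
| board square. A toy sketch with synthetic stand-in data (the real
| work is extracting the activations and labels from the model):
|
|   import numpy as np
|   from sklearn.linear_model import LogisticRegression
|
|   rng = np.random.default_rng(0)
|   # Stand-ins: `acts` would be hidden states for N positions,
|   # `labels` the true contents of one square (empty/own/other).
|   acts = rng.normal(size=(1000, 512))
|   labels = rng.integers(0, 3, size=1000)
|
|   probe = LogisticRegression(max_iter=1000)
|   probe.fit(acts[:800], labels[:800])
|   # Random data stays near 1/3; real activations that encode the
|   # board should score far higher.
|   print("probe accuracy:", probe.score(acts[800:], labels[800:]))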
| mattmcknight wrote:
| It's weird because it is not a black box at the lowest level,
| we can see exactly what all of the weights are doing. It's
| just too complex for us to understand it.
|
| What is difficult is finding some intermediate pattern in
| between there which we can label with an abstraction that is
| compatible with human understanding. It may not exist. For
| example, it may be more like how our brain works to produce
| language than it is like a logical rule based system. We
| occasionally say the wrong word, skip a word, spell things
| wrong...violate the rules of grammar.
|
| The inputs and outputs of the model are human language, so at
| least there we know the system as a black box can be
| characterized, if not understood.
| _heimdall wrote:
| > The inputs and outputs of the model are human language,
| so at least there we know the system as a black box can be
| characterized, if not understood.
|
| This is actually where the AI safety debates tend to lose.
| From where I sit we can't characterize the black box
| itself, we can only characterize the outputs themselves.
|
| More specifically, we can decide what we think the quality
| of the output is for a given input, and we can attempt to
| infer what might have happened in between. We really have
| no idea what happened in between, and though many of the
| "doomers" raise concerns that seem far fetched, we have
| absolutely no way of understanding whether they are
| completely off base or raising concerns of a system that
| just hasn't shown problems in the input/output pairs yet.
| lukeschlather wrote:
| > (a) the LLM does reason through the rules and understands
| what moves are legal or (b) was trained on a large set of
| legal moves and therefore only learned to make legal moves.
|
| How can you learn to make legal moves without understanding
| what moves are legal?
| _heimdall wrote:
| I'm spit balling here so definitely take this with a grain
| of salt.
|
| If I only see legal moves, I may not think outside the box and
| come up with moves other than what I already saw. Humans
| run into this all the time: we see things done a certain way
| and effectively learn that that's just how to do it, and we
| don't innovate.
|
| Said differently, if the generative AI isn't actually being
| generative at all, meaning it's just predicting based on the
| training set, it could be providing only legal moves
| without ever learning or understanding the rules of the
| game.
| ramraj07 wrote:
| I think they'll acknowledge these models are truly
| intelligent only when the LLMs also irrationally go in circles
| around logic to insist that LLMs are statistical parrots.
| _heimdall wrote:
| Acknowledging an LLM is intelligent requires a general
| agreement of what intelligence is and how to measure it.
| I'd also argue that it requires a way of understanding
| _how_ an LLM comes to its answer rather than just inputs
| and outputs.
|
| To me that doesn't seem unreasonable and has nothing to
| do with irrationally going in circles, curious if you
| disagree though.
| Retric wrote:
| Humans judge if other humans are intelligent without
| going into philosophical circles.
|
| How well they learn completely novel tasks (fail in
| conversation, pass with training). How well they do
| complex tasks (debated just look at this thread). How
| generally knowledgeable they are (pass). How often they
| do nonsensical things (fail).
|
| So IMO it really comes down to whether you're judging by
| peak performance or minimum standards. If I had an employee
| that performed as well as an LLM, I'd call them an idiot
| because they needed constant supervision for even trivial
| tasks, but that's not the standard everyone is using.
| _heimdall wrote:
| > Humans judge if other humans are intelligent without
| going into philosophical circles
|
| That's totally fair. I expect that to continue to work
| well when kept in the context of something/someone else
| that is roughly as intelligent as you are. Bonus points
| for the fact that one human understands what it means to
| be human and we all have _roughly_ similar experiences of
| reality.
|
| I'm not so sure if that kind of judging intelligence by
| feel works when you are judging something that is (a)
| totally different from you or (b) massively more (or
| less) intelligent than you are.
|
| For example, I could see something much smarter than me
| as acting irrationally when in reality they may be
| working with a much larger or more complex set of facts and
| context that don't make sense to me.
| raincole wrote:
| > we have absolutely no way to know
|
| To me, this means that it absolutely doesn't matter whether
| LLM does reason or not.
| _heimdall wrote:
| It might if AI/LLM safety is a concern. We can't begin to
| really judge safety without understanding how they work
| internally.
| grumpopotamus wrote:
| I am an expert-level chess player and I have seen multiple
| people around my level play illegal moves in classical time
| control games over the board. I have also watched streamers
| various levels above me repeatedly try to play illegal moves
| before realizing the UI was rejecting the move because it was
| illegal.
| zoky wrote:
| I've been to many USCF rated tournaments and have never once
| seen or even heard of anyone over the age of 8 try to play an
| illegal move. It may happen every now and then, but it's
| exceedingly rare. LLMs, on the other hand, will gladly play
| the Siberian Swipe, and why not? There's no consequence for
| doing so as far as they are concerned.
| Dr_Birdbrain wrote:
| There are illegal moves and there are illegal moves. There
| is trying to move your king five squares forward (which no
| amateur would ever do) and there is trying to move your
| King to a square controlled by an unseen piece, which can
| happen to somebody who is distracted or otherwise off their
| game.
|
| Trying to castle through check is one that occasionally
| happens to me (I am rated 1800 on lichess).
| dgfitz wrote:
| Moving your king to a square controlled by an unnoticed
| opponent piece is simply responded to with "check", no?
| james_marks wrote:
| No, that would break the rule that one cannot move into
| check
| dgfitz wrote:
| Sorry yes, I meant the opponent would point it out. I've
| never played professional chess.
| umanwizard wrote:
| Sure, the opponent would point it out, just like they
| would presumably point it out if you played any illegal
| move. In serious tournament games they would probably
| also stop the clock, call over the arbiter, and inform
| him or her that you made an illegal move so you can be
| penalized (e.g. under FIDE rules if you make an illegal
| move your opponent gets 2 extra minutes on the clock).
|
| That doesn't change that it's an illegal move.
| CooCooCaCha wrote:
| This is an important distinction. Anyone with chess
| experience would never try to move their king 5 spaces,
| but LLMs will do crazy things like that.
| jeremyjh wrote:
| I'm rated 1450 USCF and I think I've seen 3 attempts to play
| an illegal move across around 300 classical games OTB. Only
| one of them was me. In blitz it does happen more.
| WhyOhWhyQ wrote:
| Would you say the apparent contradiction between what you and
| other commenters are saying is partly explained by the high
| volume of games you're playing? Or do you think there is some
| other reason?
| da_chicken wrote:
| I wouldn't. I never progressed beyond chess clubs in public
| schools and I certainly remember people making illegal
| moves in tournaments. Like that's why they make you both
| record all the moves. Because people make mistakes. Though,
| honestly, I remember more notation errors than play errors.
|
| Accidentally moving into check is probably the most common
| illegal move. Castling through check is surprisingly common,
| too. Actually moving a piece incorrectly is fairly rare,
| though. I remember one tournament where one of the matches
| ended in a DQ because one of the players had two white
| bishops.
| ASUfool wrote:
| Could one have two white bishops after promoting a pawn?
| IanCal wrote:
| Promoting to anything other than a queen is rare, and I
| expect the next most common is to a knight. Promoting to
| a bishop, while possible, is going to be extremely rare.
| umanwizard wrote:
| Yes it's theoretically possible to have two light-squared
| bishops due to promotions but so exceedingly rare that I
| think most professional chess players will go their whole
| career without ever seeing that happen.
| nurettin wrote:
| At what level are you considered an expert? IM? CM? 1900 ELO
| OTB?
| umanwizard wrote:
| In the US at least 2000 USCF is considered "expert".
| rgoulter wrote:
| > I invite anyone making that claim to find me an "advanced
| amateur" (as the article says of the LLM's level) chess player
| who ever makes an illegal move. Anyone familiar with chess can
| confirm that it doesn't really happen.
|
| This is somewhat imprecise (or inaccurate).
|
| A quick search on YouTube for "GM illegal moves" indicates that
| GMs have made illegal moves often enough for there to be
| compilations.
|
| e.g. https://www.youtube.com/watch?v=m5WVJu154F0 -- The Vidit
| vs Hikaru one is perhaps the most striking, where Vidit uses
| his king to attack Hikaru's king.
| zoky wrote:
| It's exceedingly rare, though. There's a big difference
| between accidentally failing to notice that a move is illegal
| in a complicated situation, and playing a move that may or
| may not be illegal just because it sounds kinda "chessy",
| which is pretty much what LLMs do.
| ifdefdebug wrote:
| yes but LLM illegal moves often are not chessy at all. A
| chessy illegal move for instance would be trying to move a
| rook when you don't notice that it's between your king and
| an attacking bishop. LLMs would often happily play Ba4 when
| there's no bishop anywhere near a square from where it
| could reach that square, or even no bishop at all. That's
| not chessy, that's just weird.
|
| I have to admit it's been a while since I played chatgpt so
| maybe it improved.
| banannaise wrote:
| A bunch of these are just improper procedure: several who hit
| the clock before choosing a promotion piece, and one who
| touches a piece that cannot be moved. Even those that aren't
| look like rational chess moves, they just fail to notice a
| detail of the board state (with the possible exception of
| Vidit's very funny king attack, which actually might have
| been clock manipulation to give him more time to think with
| 0:01 on the clock).
|
| Whereas the LLM makes "moves" that clearly indicate no
| ability to play chess: moving pieces to squares well outside
| their legal moveset, moving pieces that aren't on the board,
| etc.
| fl7305 wrote:
| Can a blind man sculpt?
|
| What if he makes mistakes that a seeing person would never
| make?
|
| Does that mean that the blind man is not capable of
| sculpting at all?
| sixfiveotwo wrote:
| > Whereas the LLM makes "moves" that clearly indicate no
| ability to play chess: moving pieces to squares well
| outside their legal moveset, moving pieces that aren't on
| the board, etc.
|
| Do you have any evidence of that? TFA doesn't talk about
| the nature of these errors.
| krainboltgreene wrote:
| Yeah like several hundred "Chess IM/GMs react to ChatGPT
| playing chess" videos on youtube.
| sixfiveotwo wrote:
| Very strange, I cannot spot any specifically saying that
| ChatGPT cheated or played an illegal move. Can you help?
| SonOfLilit wrote:
| But clearly the author got his GPT to play orders of
| magnitude better than in those videos
| quuxplusone wrote:
| "Most striking" in the sense of "most obviously not ever even
| remotely legal," yeah.
|
| But the most interesting and thought-provoking one in there
| is [1] Carlsen v Inarkiev (2017). Carlsen puts Inarkiev in
| check. Inarkiev, instead of making a legal move to escape
| check, does something else. Carlsen then replies to _that_
| move. Inarkiev challenges: Carlsen's move was illegal,
| because the only legal "move" at that point in the game was
| to flag down an arbiter and claim victory, which Carlsen
| didn't!
|
| [1] - https://www.youtube.com/watch?v=m5WVJu154F0&t=7m52s
|
| The tournament rules at the time, apparently, fully covered
| the situation where the game state is legal but the move is
| illegal. They didn't cover the situation where the game state
| was actually illegal to begin with. I'm not a chess person,
| but it sounds like the tournament rules may have been amended
| after this incident to clarify what should happen in this
| kind of situation. (And Carlsen was still declared the winner
| of this game, after all.)
|
| LLM-wise, you could spin this to say that the "rational
| grandmaster" is as fictional as the "rational consumer":
| Carlsen, from an actually invalid game state, played "a move
| that may or may not be illegal just because it sounds kinda
| "chessy"," as zoky commented below that an LLM would have
| done. He responded to the gestalt (king in check, move the
| king) rather than to the details (actually this board
| position is impossible, I should enter a special case).
|
| OTOH, the real explanation could be that Carlsen was just
| looking ahead: surely he knew that after his last move,
| Inarkiev's only legal moves were harmless to him (or
| fatalistically bad for him? Rxb7 seems like Inarkiev's
| correct reply, doesn't it? Again I'm not a chess person) and
| so he could focus elsewhere on the board. He merely happened
| not to double-check that Inarkiev had actually _played_ one
| of the legal continuations he'd already enumerated in his
| head. But in a game played by the rules, he shouldn't have to
| double-check that -- it is already guaranteed _by_ the rules!
|
| Anyway, that's why Carlsen v Inarkiev struck me as the most
| thought-provoking illegal move, from a computer programmer's
| perspective.
| tacitusarc wrote:
| The one where Caruana improperly presses his clock and then
| claims he did not so as not to lose, and the judges believe
| him, is frustrating to watch.
| mattmcknight wrote:
| > I invite anyone making that claim to find me an "advanced
| amateur" (as the article says of the LLM's level) chess player
| who ever makes an illegal move.
|
| I would say the analogy is more like someone saying chess moves
| aloud. So, just as we all misspeak or misspell things from time
| to time, the model output will have an error rate.
| jeremyjh wrote:
| Yes, I don't even know what it means to say it's 1800 strength
| and yet plays illegal moves frequently enough that you have to
| code retry logic into the test harness. Under FIDE rules after
| two illegal moves the game is declared lost by the player
| making that move. If this rule were followed, I'm wondering
| what its rating would be.
| og_kalu wrote:
| >Yes, I don't even know what it means to say its 1800
| strength and yet plays illegal moves frequently enough that
| you have to code retry logic into the test harness.
|
| People are really misunderstanding things here. The one model
| that can actually play at lichess 1800 Elo does not need any
| of those and will play thousands of moves before a single
| illegal one. But he isn't just testing that one specific
| model. He is testing several models, some of which cannot
| reliably output legal moves (and as such, this logic is
| required)
| chis wrote:
| I agree with others that it's similar to blindfold chess and
| would also add that the AI gets no time to "think" without
| chain of thought like the new o1 models. So it's equivalent to
| an advanced player, blindfolded, making moves off pure
| intuition without system 2 thought.
| bjackman wrote:
| So just because it has different failure modes it doesn't
| count as reasoning? Is reasoning just "behaving exactly like
| a human"? In that case the statement "LLMs can't reason" is
| unfalsifiable and meaningless. (Which, yeah, maybe it is.)
|
| The bizarre intellectual quadrilles people dance to sustain
| their denial of LLM capabilities will never cease to amaze me.
| hamilyon2 wrote:
| The discussion in this thread is amazing. People, even renowned
| experts in their field, make mistakes, a lot of mistakes,
| sometimes very costly and very obvious in retrospect, in their
| own craft.
|
| Yet when an LLM, trained on a corpus of human stupidity no
| less, makes illegal moves in chess, our brain immediately goes:
| I don't make illegal moves in chess, so how can a computer play
| chess if it does?
|
| Perfect examples of metacognitive bias and general attribution
| error, at least.
| sourcepluck wrote:
| You would be correct to be amazed if someone was arguing:
|
| "Look! It made mistakes, therefore it's definitely _not_
| reasoning! "
|
| That's certainly not what I'm saying, anyway. I was
| responding to the argument actually being made by many here,
| which is:
|
| "Look! It plays pretty poorly, but not totally crap, and it
| wasn't trained for playing just-above-poor chess, therefore,
| it _understands_ chess and definitely _is_ reasoning! "
|
| I find this - and much of the surrounding discussion - to be
| quite an amazing display of people's biases, myself. People
| _want_ to believe LLMs are reasoning, and so we're treated
| to these merry-go-round "investigations".
| stonemetal12 wrote:
| It isn't a binary does/doesn't question. It is a question of
| frequency and "quality" of mistakes. If it is making illegal
| moves 0.1% of the time then sure everybody makes mistakes. If
| it is 30% of the time then it isn't doing so well. If the
| illegal moves it tries to make are basic "pieces don't move
| like that" sort of errors then the predict next token isn't
| predicting so well. If the legality of the moves is more
| subtle then maybe it isn't too bad.
|
| But more than being able to make moves, if we claim it
| understands chess, shouldn't it be able to explain why it
| chose one move over another?
| fl7305 wrote:
| > It's claimed that this model "understands" chess, and can
| "reason", and do "actual logic" (here in the comments).
|
| You can divide reasoning into three levels:
|
| 1) Can't reason - just regurgitates from memory
|
| 2) Can reason, but makes mistakes
|
| 3) Always reasons perfectly, never makes mistakes
|
| If an LLM makes mistakes, you've proven that it doesn't reason
| perfectly.
|
| You haven't proven that it can't reason.
| alain94040 wrote:
| > find me an "advanced amateur" (as the article says of the
| LLM's level) chess player who ever makes an illegal move
|
| Without a board to look at, just with the same linear text
| input given in the prompt? I bet a lot of amateurs would not
| give you legal moves. No drawing or side piece of paper
| allowed.
| torginus wrote:
| Sorry - I have a somewhat tangential question - is it possible
| to train models as instruct models straight away? Previously
| LLMs were trained on raw text data, but now we can generate
| instruct data directly, either from 'teaching LLMs' or by
| asking existing LLMs to convert raw data into instruct format.
|
| Or alternatively - if chat tuning diminishes some of the models'
| capability, would it make sense to have a smaller chat model
| prompt a large base model, and convert back the outputs?
| DHRicoF wrote:
| I don't think there is enough (non-synthetic) data available to
| get near what we are used to.
|
| The big breakthrough of GPT was exactly that: you can train a
| model on (what was, at the time) a stupidly high amount of
| data and make it okay at a lot of tasks you haven't trained it
| on explicitly.
| torginus wrote:
| You can make GPT rewrite all existing textual info into
| chatbot format, so there's no loss there.
|
| With newer techniques, such as chain of thought and self-
| checking, you can also generate a ton of high-quality
| training data, that won't degrade the output of the LLM.
| Though the degree to which you can do that is not clear to
| me.
|
| Imo it makes sense to train an LLM as a chatbot from the
| start.
| GaggiX wrote:
| You should not finetune the models on the strongest setting of
| Stockfish, as the moves will not be understandable unless you
| really dig deep into the position, and the model would not be
| able to find a pattern to make sense of them. Instead I suggest
| training on human games of a certain ELO (less than
| grandmaster).
| codeflo wrote:
| > everyone is wrong!
|
| Well, not everyone. I wasn't the only one to mention this, so I'm
| surprised it didn't show up in the list of theories, but here's
| e.g. me, seven days ago (source
| https://news.ycombinator.com/item?id=42145710):
|
| > At this point, we have to assume anything that becomes a
| published benchmark is specifically targeted during training.
|
| This is not the same thing as cheating/replacing the LLM _output_
| , the theory that's mentioned and debunked in the article. And
| now the follow-up adds weight to this guess:
|
| > Here's my best guess for what is happening: ... OpenAI trains
| its base models on datasets with more/better chess games than
| those used by open models. ... Meanwhile, in section A.2 of this
| paper (h/t Gwern) some OpenAI authors mention that GPT-4 was
| trained on chess games in PGN notation, filtered to only include
| players with Elo at least 1800.
|
| To me, it makes complete sense that OpenAI would "spike" their
| training data with data for tasks that people might actually try.
| There's nothing unethical about this. No dataset is ever truly
| "neutral", you make choices either way, so why not go out of your
| way to train the model on potentially useful answers?
| stingraycharles wrote:
| Yup, I remember reading your comment and that making the most
| sense to me.
|
| OpenAI just shifted their training targets, initially they
| thought Chess was cool, maybe tomorrow they think Go is cool,
| or maybe the ability to write poetry. Who knows.
|
| But it seems like the simplest explanation and makes the most
| sense.
| qup wrote:
| At current sizes, these things are like humans. They gotta
| specialize.
|
| Maybe that'll be enough moat to save us from AGI.
| demaga wrote:
| Yes, and I would like this approach to also be used in other,
| more practical areas. I mean, more "expert" content than
| "amateur" content in training data, regardless of area of
| expertise.
| dr_dshiv wrote:
| I made a suggestion that they may have trained the model to be
| good at chess to see if it helped with general intelligence,
| just as training with math and code seems to improve other
| aspects of logical thinking. Because, after all, OpenAI has a
| lot of experience with game playing AI.
| https://news.ycombinator.com/item?id=42145215
| gwern wrote:
| I think this is a little paranoid. No one is training extremely
| large expensive LLMs on huge datasets in the hope that a
| blogger will stumble across poor 1800 Elo performance and tweet
| about it!
|
| 'Chess' is not a standard LLM benchmark worth Goodharting; OA
| has generally tried to solve problems the right way rather than
| by shortcuts & cheating, and the GPTs have not heavily overfit
| on the standard benchmarks or counterexamples that they so
| easily could which would be so much more valuable PR (imagine
| how trivial it would be to train on, say, 'the strawberry
| problem'?), whereas some _other_ LLM providers do see their
| scores drop much more in anti-memorization papers; they have a
| clear research use of their own in that very paper mentioning
| the dataset; and there is some interest in chess as a model
| organism of supervision and world-modeling in LLMs because we
| have access to oracles (and it's less boring than many things
| you could analyze), which explains why they would be doing
| _some_ research (if not a whole lot). Like the bullet chess LLM
| paper from Deepmind - they aren't doing that as part of a
| cunning plan to make Gemini cheat on chess skills and help GCP
| marketing!
| deadbabe wrote:
| If you randomly position pieces on the board and then ask the LLM
| to play chess, where each piece still moves according to its
| normal rules, does it know how to play still?
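|
| For anyone who wants to try it: one way to get such positions is
| rejection sampling with the python-chess library (a rough sketch;
| it only aims for "valid and not already over", not for a uniform
| distribution over positions):
|
|   import random
|   import chess
|
|   def random_position(n_extra: int = 8) -> chess.Board:
|       """Scatter two kings plus a few random pieces, retrying
|       until python-chess accepts the position."""
|       kinds = [chess.PAWN, chess.KNIGHT, chess.BISHOP,
|                chess.ROOK, chess.QUEEN]
|       while True:
|           board = chess.Board(None)   # empty board
|           squares = random.sample(chess.SQUARES, n_extra + 2)
|           board.set_piece_at(squares[0],
|                              chess.Piece(chess.KING, chess.WHITE))
|           board.set_piece_at(squares[1],
|                              chess.Piece(chess.KING, chess.BLACK))
|           for sq in squares[2:]:
|               piece = chess.Piece(random.choice(kinds),
|                                   random.choice(chess.COLORS))
|               board.set_piece_at(sq, piece)
|           board.turn = chess.WHITE
|           if board.is_valid() and not board.is_game_over():
|               return board
|
|   print(random_position().fen())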
| Hilift wrote:
| If the goal is to create a model that simulates normal human
| intelligence, yes. Some try to measure accuracy or performance
| based on expertise though.
| keskival wrote:
| "I'm not sure, because OpenAI doesn't deign to share gpt-4-base,
| nor to allow queries of gpt-4o in completion mode."
|
| I would guess GPT-4o isn't first pre-trained and then instruct-
| tuned, but trained directly with refined instruction-following
| material.
|
| This material probably contains way fewer chess games.
| toxik wrote:
| Why do you think that? InstructGPT was predominantly trained as
| a next-token predictor on whatever soup of data OpenAI curated
| at the time. The alignment signal (both RL part and the
| supervised prompt/answer pairs) are a tiny bit of the gradient.
| wavemode wrote:
| I have the exact same problem with this article that I had with
| the previous one - the author fails to provide any data on the
| frequency of illegal moves.
|
| Thus it's impossible to draw any meaningful conclusions. It would
| be similar to if I claimed that an LLM is an expert doctor, but
| in my data I've filtered out all of the times it gave incorrect
| medical advice.
| falcor84 wrote:
| I would argue that it's more akin to filtering out the chit-
| chat with the patient, where the doctor explained things in an
| imprecise manner, keeping only the formal and valid medical
| notation
| caddemon wrote:
| There is no legitimate reason to make an illegal move in
| chess though? There are reasons why a good doctor might
| intentionally explain things imprecisely to a patient.
| hnthrowaway6543 wrote:
| > There is no legitimate reason to make an illegal move in
| chess though?
|
| If you make an illegal move and the opponent doesn't notice
| it, you gain a significant advantage. LLMs just have David
| Sirlin's "Playing to Win" as part of their training data.
| ses1984 wrote:
| It's like the doctor saying, "you have cancer? Oh you don't?
| Just kidding. Parkinson's. Oh it's not that either? How about
| common cold?"
| falcor84 wrote:
| But the difference is that valid bad moves (equivalents of
| "cancer") were included in the analysis; it's only invalid
| ones (like "your body is kinda outgrowing itself") that
| were excluded from the analysis.
| ses1984 wrote:
| What makes a chess move invalid is the state of the
| board. I don't think moves like "pick up the pawn and
| throw it across the room" were considered.
| toast0 wrote:
| That's a valid move in Monopoly though. Although it's
| much preferred to pick up the table and throw it.
| sigmar wrote:
| Don't think that analogy works unless you could write a script
| that automatically removes incorrect medical advice, because
| then you would indeed have an LLM-with-a-script that was an
| expert doctor (which you can do for illegal chess moves, but
| obviously not for evaluating medical advice)
| kcbanner wrote:
| It would be possible to employ an expert doctor, instead of
| writing a script.
| ben_w wrote:
| Which is cheaper:
|
| 1. having a human expert creating every answer
|
| or
|
| 2. having an expert check 10 answers each of which have a
| 90% chance of being right and then manually redoing the one
| which was wrong
|
| Now add the complications that:
|
| * option 1 also isn't 100% correct
|
| * nobody knows which things in option 2 are correlated or
| not and if those are or aren't correlated with human errors
| so we might be systematically unable to even recognise the
| errors
|
| * even if we could, humans not only get lazy without
| practice but also get bored if the work is too easy, so a
| short-term study in efficiency changes doesn't tell you
| things like "after 2 years you get mass resignations by the
| competent doctors, while the incompetent just say 'LGTM' to
| all the AI answers"
| wavemode wrote:
| You can write scripts that correct bad math, too. In fact
| most of the time ChatGPT will just call out to a calculator
| function. This is a smart solution, and very useful for end
| users! But, still, we should not try to use that to make the
| claim that LLMs have a good understanding of math.
| henryfjordan wrote:
| At what point does "knows how to use a calculator" equate
| to knowing how to do math? Feels pretty close to me...
| Tepix wrote:
| Well, LLMs are bad at math but they're ok at detecting
| math and delegating it to a calculator program.
|
| It's kind of like humans.
| afro88 wrote:
| If a script were applied that corrected "bad math" and now
| the LLM could solve complex math problems that you can't
| one-shot throw at a calculator, what would you call it?
| sixfiveotwo wrote:
| It's a good point.
|
| But this math analogy is not quite appropriate: there's
| abstract math and arithmetic. A good math practitioner
| (LLM or human) can be bad at arithmetic, yet good at
| abstract reasoning. The later doesn't (necessarily)
| requires the former.
|
| In chess, I don't think that you can build a good
| strategy if it relies on illegal moves, because tactics
| and strategies are tied.
| vunderba wrote:
| Agreed. It's not the same thing and we should strive for
| precision (LLMs are already opaque enough as it is).
|
| An LLM that recognizes an input as "math" and calls out to
| a NON-LLM to solve the problem vs an LLM that recognizes an
| input as "math" and also uses next-token prediction to
| produce an accurate response _ARE DIFFERENT_.
| og_kalu wrote:
| 3.5-turbo-instruct makes about 5 or fewer illegal moves in 8205.
| It's not here but turbo instruct has been evaled before.
|
| https://github.com/adamkarvonen/chess_gpt_eval
| timjver wrote:
| > It would be similar to if I claimed that an LLM is an expert
| doctor, but in my data I've filtered out all of the times it
| gave incorrect medical advice.
|
| Computationally it's trivial to detect illegal moves, so it's
| nothing like filtering out incorrect medical advice.
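|
| A minimal sketch of such a legality filter with the python-chess
| library, given a SAN move string coming back from the model:
|
|   import chess
|
|   def is_legal_san(board: chess.Board, san: str) -> bool:
|       """True if `san` parses to a legal move in this position."""
|       try:
|           board.parse_san(san)   # raises ValueError if illegal
|           return True
|       except ValueError:
|           return False
|
|   board = chess.Board()
|   board.push_san("e4")
|   board.push_san("e5")
|   print(is_legal_san(board, "Nf3"))   # True
|   print(is_legal_san(board, "Ba4"))   # False: bishop can't get there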
| wavemode wrote:
| As I wrote in another comment - you can write scripts that
| correct bad math, too. But we don't use that to claim that
| LLMs have a good understanding of math.
| ben_w wrote:
| I'd say that's because we don't understand what we mean by
| "understand".
|
| Hardware that _accurately_ performs maths faster than all
| of humanity combined is so cheap as to be disposable, but I
| 've yet to see anyone claim that a Pi Zero has
| "understanding" of anything.
|
| An LLM _can_ display the _viva voce_ approach that Turing
| suggested[0], and do it well. Ironically for all those now
| talking about "stochastic parrots", the passage reads:
|
| """... The game (with the player B omitted) is frequently
| used in practice under the name of viva voce to discover
| whether some one really understands something or has
| 'learnt it parrot fashion'. ..."
|
| Showing that not much has changed on the philosophy of this
| topic since it was invented.
|
| [0]
| https://academic.oup.com/mind/article/LIX/236/433/986238
| SpaceManNabs wrote:
| I don't know. I have talked to a few math professors, and
| they think LLMs are as good as a lot of their peers when it
| comes hallucinations and being able to discuss ideas on
| very niche topics, as long as the context is fed in. If Tao
| is calling some models "a mediocre, but not completely
| incompetent [...] graduate student", then they seem to
| understand math to some degree to me.
| KK7NIL wrote:
| > Computationally it's trivial to detect illegal moves
|
| You're strictly correct, but the rules for chess are
| infamously hard to implement (as anyone who's tried to write
| a chess program will know), leading to minor bugs in a lot of
| chess programs.
|
| For example, there's this old myth about vertical castling
| being allowed due to ambiguity in the ruleset:
| https://www.futilitycloset.com/2009/12/11/outside-the-box/
| (Probably not historically accurate).
|
| If you move beyond legal positions into who wins when one
| side flags, the rules state that the other side should be
| awarded a victory if checkmate was possible with any legal
| sequence of moves. This is so hard to check that no chess
| program tries to implement it, instead using simpler rules to
| achieve a very similar but slightly more conservative result.
| rco8786 wrote:
| I got a kick out of that link. Had certainly never heard of
| "vertical castling" previously.
| admax88qqq wrote:
| > You're strictly correct, but the rules for chess are
| infamously hard to implement
|
| Come on. Yeah they're not trivial but they've been done
| numerous times. There's been chess programs for almost as
| long as there have been computers. Checking legal moves is
| a _solved problem_.
|
| Detecting valid medical advice is not. The two are not even
| remotely comparable.
| theptip wrote:
| This is a crazy goal-post move. TFA is proving a positive
| capability, and rejecting the null hypothesis that "LLMs can't
| think, they just regurgitate".
|
| Making some illegal moves doesn't invalidate the demonstrated
| situational logic intelligence required to play at ELO 1800.
|
| (Another angle: a human on Chess.com also has any illegal move
| they try to make ignored.)
| wavemode wrote:
| It's not a goalpost move. As I've already said, I have the
| exact same problem with this article as I had with the
| previous one. My goalposts haven't moved, and my standards
| haven't changed. Just provide the data! How hard can it be?
| Why leave it out in the first place?
| photonthug wrote:
| > Making some illegal moves doesn't invalidate the
| demonstrated situational logic intelligence
|
| That's exactly what it does. 1 illegal move in 1 million or
| 100 million or any other sample size you want to choose means
| it doesn't understand chess.
|
| People in this thread are really distracted by the medical
| analogy so I'll offer another: you've got a bridge that
| allows millions of vehicles to cross, and randomly falls down
| if you tickle it wrong, maybe a car of rare color. One key
| aspect of bridges is that they work reliably for any vehicle,
| and once they fail they don't work with any vehicle. A bridge
| that sometimes fails and sometimes doesn't isn't a bridge as
| much as a death trap.
| og_kalu wrote:
| >1 illegal move in 1 million or 100 million or any other
| sample size you want to choose means it doesn't understand
| chess
|
| Highly rated chess players make illegal moves. It's rare
| but it happens. They don't understand chess?
| photonthug wrote:
| > Then no human understands chess
|
| Humans with correct models may nevertheless make errors
| in rule applications. Machines are good at applying
| rules, so when they fail to apply rules correctly, it
| means they have incorrect, incomplete, or totally absent
| models.
|
| Without using a word like "understands" it seems clear
| that the same _apparent_ mistake has different causes..
| and model errors are very different from model-
| application errors. In a math or physics class this is
| roughly the difference between carry-the-one arithmetic
| errors vs using an equation from a completely wrong
| domain. The word "understands" is loaded in discussion of
| LLMs, but everyone knows which mistake is going to get
| partial credit vs zero credit on an exam.
| og_kalu wrote:
| >Humans with correct models may nevertheless make errors
| in rule applications.
|
| Ok.
|
| >Machines are good at applying rules, so when they fail
| to apply rules correctly, it means they have incorrect or
| incomplete models.
|
| I don't know why people continue to force the wrong
| abstraction. LLMs do not work like 'machines'. They don't
| 'follow rules' the way we understand normal machines to
| 'follow rules'.
|
| >so when they fail to apply rules correctly, it means
| they have incorrect or incomplete models.
|
| Everyone has incomplete or incorrect models. It doesn't
| mean we always say they don't understand. Nobody says
| Newton didn't understand gravity.
|
| >Without using a word like "understands" it seems clear
| that the same apparent mistake has different causes.. and
| model errors are very different from model-application
| errors.
|
| It's not very apparent no. You've just decided it has
| different causes because of preconceived notions on how
| you think all machines must operate in all
| configurations.
|
| LLMs are not the logic automatons in science fiction.
| They don't behave or act like normal machines in any way.
| The internals run some computations to make predictions
| but so does your nervous system. Computation is
| substrate-independent.
|
| I don't even know how you can make this distinction
| without seeing what sort of illegal moves it makes. If it
| makes the sort high-rated players make, then what?
| photonthug wrote:
| I can't tell if you are saying the distinction between
| model errors and model-application errors doesn't exist
| or doesn't matter or doesn't apply here.
| og_kalu wrote:
| I'm saying:
|
| - Generally, we do not say someone does not understand
| just because of a model error. The model error has to be
| sufficiently large or the model sufficiently narrow. No-
| one says Newton didn't understand gravity just because
| his model has an error in it but we might say he didn't
| understand some aspects of it.
|
| - You are saying the LLM is making a model error (rather
| than an an application error) only because of
| preconceived notions of how 'machines' must behave, not
| on any rigorous examination.
| photonthug wrote:
| Suppose you're right, the internal model of game rules is
| perfect but the application of the model for next-move is
| imperfect. Unless we can actually separate the two, does
| it matter? Functionally I mean, not philosophically. If
| the model was correct, maybe we could get a useful
| version of it out by asking it to _write_ a chess engine
| instead of _act_ as a chess engine. But when the prolog
| code for that is as incorrect as the illegal chess move
| was, will you say again that the model is correct, but
| the usage of it merely resulted in minor errors?
|
| > You are saying the LLM is making a model error (rather
| than an an application error) only because of
| preconceived notions of how 'machines' must behave, not
| on any rigorous examination.
|
| Here's an anecdotal examination. After much talk about
| LLMs and chess, and math, and formal logic here's the
| state of the art, simplified from dialog with gpt today:
|
| > blue is red and red is blue. what color is the sky? >>
| <blah blah, restates premise, correctly answer "red">
|
| At this point fans rejoice, saying it understands
| hypotheticals and logic. Dialogue continues..
|
| > name one red thing >> <blah blah, restates premise,
| incorrectly offers "strawberries are red">
|
| At this point detractors rejoice, declare that it doesn't
| understand. Now the conversation devolves into semantics
| or technicalities about prompt-hacks, training data,
| weights. Whatever. We don't need chess. Just look at it,
| it's broken as hell. Discussing whether the error is
| human-equivalent isn't the point either. It's broken! A
| partially broken process is no solid foundation to build
| others on. And while there are some exceptions, an
| unreliable tool/agent is often worse than none at all.
| og_kalu wrote:
| >It's broken! A partially broken process is no solid
| foundation to build others on. And while there are some
| exceptions, an unreliable tool/agent is often worse than
| none at all.
|
| Are humans broken? Because our reasoning is a very
| broken process. You say it's no solid foundation? Take a
| look around you. This broken processor is the foundation
| of society and the conveniences you take for granted.
|
| For the vast, vast majority of human history, there wasn't
| anything even remotely resembling a non-broken general
| reasoner. And you know the funny thing? There still
| isn't. When people like you say LLMs don't reason, they
| hold them to a standard that doesn't exist. Where is this
| non-broken general reasoner anywhere but in fiction and
| your own imagination?
|
| >And while there are some exceptions, an unreliable
| tool/agent is often worse than none at all.
|
| Since you clearly mean 'unreliable' to be 'makes no
| mistakes / is not broken', then no human is a reliable
| agent. Clearly, the real exception is when an unreliable
| agent is worse than nothing at all.
| sixfiveotwo wrote:
| > Machines are good at applying rules, so when they fail
| to apply rules correctly, it means they have incorrect,
| incomplete, or totally absent models.
|
| That's assuming that, somehow, a LLM is a machine. Why
| would you think that?
| photonthug wrote:
| Replace the word with one of your own choice if that will
| help us get to the part where you have a point to make?
|
| I think we are discussing whether LLMs can emulate chess
| playing machines, regardless of whether they are actually
| literally composed of a flock of stochastic parrots..
| XenophileJKO wrote:
| Engineers really have a hard time coming to terms with
| probabilistic systems.
| sixfiveotwo wrote:
| That's simple logic. Quoting you again:
|
| > Machines are good at applying rules, so when they fail
| to apply rules correctly, it means they have incorrect,
| incomplete, or totally absent models.
|
| If this line of reasoning applies to machines, but LLMs
| aren't machines, how can you derive any of these claims?
|
| "A implies B" may be right, but you must first
| demonstrate A before reaching conclusion B..
|
| > I think we are discussing whether LLMs can emulate
| chess playing machines
|
| That is incorrect. We're discussing whether LLMs can play
| chess. Unless you think that human players also emulate
| chess playing machines?
| benediktwerner wrote:
| Try giving a random human 30 chess moves and ask them to
| make a non-terrible legal move. Average humans even quite
| often try to make illegal moves when clearly seeing the
| board before them. There are even plenty of cases where
| people reported a bug because the chess application didn't
| let them do an illegal move they thought was legal.
|
| And the sudden comparison to something that's safety
| critical is extremely dumb. Nobody said we should tie the
| LLM to a nuclear bomb that explodes if it makes a single
| mistake in chess.
|
| The point is that it plays at a level far far above making
| random legal moves or even average humans. To say that that
| doesn't mean anything because it's not perfect is simply
| insane.
| photonthug wrote:
| > And the sudden comparison to something that's safety
| critical is extremely dumb. Nobody said we should tie the
| LLM to a nuclear bomb that explodes if it makes a single
| mistake in chess.
|
| But it actually is safety critical very quickly whenever
| you say something like "works fine most of the time, so
| our plan going forward is to dismiss any discussion of
| when it breaks and why".
|
| A bridge failure feels like the right order of magnitude
| for the error rate and effective misery that AI has
| already quietly caused with biased models where one in a
| million resumes or loan applications is thrown out. And a
| nuclear bomb would actually kill less people than a full
| on economic meltdown. But I'm sure no one is using LLMs
| in finance at all right?
|
| It's so arrogant and naive to ignore failure modes that
| we don't even understand yet.. at least bridges and steel
| have specs. Software "engineering" was always a very
| suspect name for the discipline but whatever claim we had
| to it is worse than ever.
| sixo wrote:
| When I play chess I filter out all kinds of illegal moves. I
| also filter out bad moves. A human is more like "recursively
| thinking of ideas and then evaluating them with another part of
| your model", why not let the LLMs do the same?
| skydhash wrote:
| Because that's not what happens? We learn through symbolic
| meaning and rules which then form a consistent system. Then
| we can have a goal and continuously evaluate if we're within
| the system and transitionning towards that goal. The nice
| thing is that we don't have to compute the whole simulation
| in our brains and can start again from the real world. The
| more you train, the better your heuristics become and the
| more your efficiency increases.
|
| The internal model of a LLM is statistical text. Which is
| linear and fixed. Not great other than generating text
| similar to what was ingested.
| hackinthebochs wrote:
| >The internal model of a LLM is statistical text. Which is
| linear and fixed.
|
| Not at all. Like seriously, not in the slightest.
| skydhash wrote:
| What does it encode? Images? Scent? Touch? Some higher
| dimensional qualia?
| hackinthebochs wrote:
| Well, a simple description is that they discover circuits
| that reproduce the training sequence. It turns out that
| in the process of this, they recover relevant
| computational structures that generalize the training
| sequence. The question of how far they generalize is
| certainly up for debate. But you can't reasonably deny
| that they generalize to a certain degree. After all, most
| sentences they are prompted on are brand new and they
| mostly respond sensibly.
|
| Their representation of the input is also not linear.
| Transformers use self-attention which relies on the
| softmax function, which is non-linear.
| fl7305 wrote:
| > The internal model of a LLM is statistical text. Which is
| linear and fixed. Not great other than generating text
| similar to what was ingested.
|
| The internal model of a CPU is linear and fixed. Yet, a CPU
| can still generate an output which is very different from
| the input. It is not a simple lookup table, instead it
| executes complex algorithms.
|
| An LLM has large amounts of input processing power. It has
| a large internal state. It executes "cycle by cycle",
| processing the inputs and internal state to generate output
| data and a new internal state.
|
| So why shouldn't LLMs be capable of executing complex
| algorithms?
| skydhash wrote:
| It probably can, but how will those algorithms be
| created? And the representation of both input and output.
| If it's text, the most efficient way is to construct a
| formal system. Or a statistical model if ambiguous and
| incorrect results are OK in the grand scheme of things.
|
| The issue is always input consumption and output
| correctness. In a CPU, we take great care with data
| representation and protocol definition, then we do formal
| verification on the algorithms, and we can be pretty sure
| that the output are correct. So the issue is that the
| internal model (for a given task) of LLMs are not
| consistent enough and the referential window (keeping
| track of each item in the system) is always too small.
| GuB-42 wrote:
| > Thus it's impossible to draw any meaningful conclusions. It
| would be similar to if I claimed that an LLM is an expert
| doctor, but in my data I've filtered out all of the times it
| gave incorrect medical advice.
|
| Not really, you can try to make illegal moves in chess, and
| usually, you are given a time penalty and get to try again, so
| even in a real chess game, illegal moves are "filtered out".
|
| And for the "medical expert" analogy, let's say that you
| compare two systems based on the well-being of the patients
| after they follow the advice. I think it is meaningful even if
| you filter out advice that is obviously inapplicable, for
| example because it refers to non-existing body parts.
| koolala wrote:
| I want to see graphs of moves the author randomly made too.
| Maybe even plotting a random-move player on the performance
| graphs vs. the AIs.
|
| It's beginner chess and beginners make moves at random all the
| time.
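|
| For reference, a random-move baseline is only a few lines with
| python-chess, if anyone wants to add that curve (a sketch; the
| scoring harness would be whatever the article already uses):
|
|   import random
|   import chess
|
|   def random_move(board: chess.Board) -> chess.Move:
|       """The baseline 'player': uniform over legal moves."""
|       return random.choice(list(board.legal_moves))
|
|   board = chess.Board()
|   while not board.is_game_over():
|       board.push(random_move(board))
|   print(board.result())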
| benediktwerner wrote:
| 1750 elo is extremely far from beginner chess. The random
| mover bot on Lichess has like 700 rating.
|
| And the article does show various graphs of the badly playing
| models which will hardly play worse than random but are
| clearly far below the good models.
| Der_Einzige wrote:
| Correct - Dynamic grammar based/constrained sampling can be
| used to, at each time-step, force the model to only make valid
| moves (and you don't have to do it in the prompt like this
| article does!!!)
|
| I have NO idea why no one seems to do this. It's a similar
| issue with LLM-as-judge evaluations. Often they are begging to
| be combined with grammar based/constrained/structured sampling.
| So much good stuff in LLM land isn't used for no good reason!
| There are several libraries for implementing this easily,
| outlines, guidance, lm-format-enforcer, and likely many more.
| You can even do it now with OpenAI!
|
| Oobabooga text gen webUI literally has chess as one of its
| candidate examples of grammar based sampling!!!
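|
| A rough sketch of the constrained-sampling idea (not the actual
| outlines/guidance API; `score_move` is a stand-in for however
| you score a candidate continuation with your LLM, e.g. summed
| token log-probs):
|
|   import random
|   import chess
|
|   def score_move(prompt: str, candidate: str) -> float:
|       # Placeholder for the LLM's log-probability of `candidate`
|       # given `prompt`.
|       return random.random()
|
|   def constrained_move(board: chess.Board, prompt: str) -> str:
|       # The "grammar" is just the finite set of legal SAN moves
|       # in this position; the model cannot emit anything else.
|       legal = [board.san(m) for m in board.legal_moves]
|       return max(legal, key=lambda san: score_move(prompt, san))
|
|   board = chess.Board()
|   board.push_san("e4")
|   print(constrained_move(board, "1. e4"))   # always a legal reply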
| rcxdude wrote:
| I don't think this is super relevant. I mean, it would be
| interesting (especially if there was a meaningful difference in
| the number of illegal move attempts between the different
| approaches, doubly so if that didn't correlate with the
| performance when illegal moves are removed), but I don't think
| it really affects the conclusions of the article: picking
| randomly from the set of legal moves makes for a truly terrible
| chess player, so clearly the LLMs are bringing something to the
| party such that sampling from their output performs
| significantly better. Splitting hairs about the capability of
| the LLM on its own (i.e. insisting on defining attempts at an
| illegal move as a game loss for the purposes of rating) seems
| pretty beside the point.
| hansvm wrote:
| There's a subtle distinction though; if you're able to filter
| out illegal behavior, the move quality conditioned on legality
| can be extremely different from arbitrary move quality (and, as
| you might see in LLM json parsing, conditioning per-token can
| be very different from conditioning per-response).
|
| If you're arguing that the singularity already happened then
| your criticism makes perfect sense; these are dumb machines,
| not useful yet for most applications. If you just want to use
| the LLM as a tool though, the behavior when you filter out
| illegal responses (assuming you're able to do so) is the only
| reasonable metric.
|
| Analogizing to a task I care a bit about: Current-gen LLMs are
| somewhere between piss-poor and moderate at generating recipes.
| With a bit of prompt engineering most recipes pass my "bar",
| but they're still often lacking in one or more important
| characteristics. If you do nothing other than ask it to
| generate many options and then as a person manually filter to
| the subset of ideas (around 1/20) which look stellar, it's both
| very effective at generating good recipes, and they're usually
| much better than my other sources of stellar recipes (obviously
| not generally applicable because you have to be able to tell
| bad recipes from good at a glance for that workflow to make
| sense). The fact that most of the responses are garbage doesn't
| really matter; it's still an improvement to how I cook.
| tech_ken wrote:
| > It's ridiculously hard to find the optimal combination of
| prompts and examples and fine-tuning, etc. It's a very large
| space, there are no easy abstractions to allow you to search
| through the space, LLMs are unpredictable and fragile, and these
| experiments are slow and expensive.
|
| Regardless of the actual experiment outcome, I think this is a
| super valuable insight. "Should we provide legal moves?" section
| is an excellent case study of this- extremely prudent idea
| actually degrades model performance, and quite badly. It's like
| that crocodile game where you're pushing teeth until it clamps
| onto your hand.
| subarctic wrote:
| The author either didn't read the hacker news comments last time,
| or he missed the top theory that said they probably used chess as
| a benchmark when they developed the model that is good at chess
| for whatever business reasons they had at the time.
| wavemode wrote:
| This is plausible. One of the top chess engines in the world
| (Leela) is just a neural network trained on billions of chess
| games.
|
| So it makes sense that an LLM would also be able to acquire
| some skill by simply having a large volume of chess games in
| its training data.
|
| OpenAI probably just eventually decided it wasn't useful to
| keep pursuing chess skill.
| devindotcom wrote:
| fwiw this is exactly what i thought - oai pursued it as a
| skillset (likely using a large chess dataset) for their own
| reasons and then abandoned it as not particularly beneficial
| outside chess.
|
| It's still interesting to try to replicate how you would make a
| generalist LLM good at chess, so i appreciated the post, but I
| don't think there's a huge mystery!
| brcmthrowaway wrote:
| Oh really! What happened to the theory that training on code
| magically caused some high level reasoning ability?
| koolala wrote:
| Next, test an image & text model! Chess is way easier when you
| can see the board.
| amelius wrote:
| I wonder what would happen if they changed the prompt such that
| the llm is asked to explain their strategy first. Or to explain
| their opponent's strategy.
| code51 wrote:
| Initially LLM researchers were saying training on code samples
| made the "reasoning" better. Now, if the "language to world
| model" thesis is working, shouldn't chess actually be the
| smallest test case for it?
|
| I can't understand why no research group is going hard at this.
| throwaway314155 wrote:
| I don't think training on code and training on chess are even
| remotely comparable in terms of available data and linguistic
| competency required. Coding (in the general case, which is what
| these models try to approach) is clearly the harder task and
| contains _massive_ amounts of diverse data.
|
| Having said all of that, it wouldn't surprise me if the
| "language to world model" thesis you reference is indeed wrong.
| But I don't think a model that plays chess well disproves it,
| particularly since there are chess engines using old fashioned
| approaches that utterly destroy LLM's.
| bee_rider wrote:
| Extremely tangential, but how do chess engines do when playing
| from illegal board states? Could the LLM have a chance of
| competing with a real chess engine from there?
|
| Understanding is a funny concept to try to apply to computer
| programs anyway. But playing from an illegal state seems (to me
| at least) to indicate something interesting about the ability to
| comprehend the general idea of chess.
| derefr wrote:
| > Many, many people suggested that there must be some special
| case in gpt-3.5-turbo-instruct that recognizes chess notation and
| calls out to an external chess engine.
|
| Not that I think there's anything inherently unreasonable about
| an LLM understanding chess, but I think the author missed a
| variant hypothesis here:
|
| What if that specific model, when it recognizes chess notation,
| is trained to silently "tag out" for _another, more specialized
| LLM, that is specifically trained on a majority-chess dataset_?
| (Or -- perhaps even more likely -- the model is trained to
| recognize the need to activate a chess-playing _LoRA adapter_?)
|
| It would still be an LLM, so things like "changing how you prompt
| it changes how it plays" would still make sense. Yet it would be
| one that has spent a lot more time modelling chess than other
| things, and never ran into anything that distracted it enough to
| catastrophically forget how chess works (i.e. to reallocate some
| of the latent-space vocabulary on certain layers from modelling
| chess, to things that matter more to the training function.)
|
| And I could certainly see "playing chess" as a good proving
| ground for testing the ability of OpenAI's backend to recognize
| the need to "loop in" a LoRA in the inference of a response. It's
| something LLM base models suck at; but it's also something you
| intuitively _could_ train an LLM to do (to at least a proficient-
| ish level, as seen here) if you had a model focus on just
| learning that.
|
| Thus, "ability of our [framework-mediated] model to play chess"
| is easy to keep an eye on, long-term, as a proxy metric for "how
| well our LoRA-activation system is working", without needing to
| worry that your next generation of base models might suddenly
| invalidate the metric by getting good at playing chess without
| any "help." (At least not any time soon.)
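|
| The routing half of that hypothesis is cheap to prototype:
| detect likely chess notation and dispatch to a specialized
| handler. All the function names below are hypothetical
| placeholders, not anything OpenAI has described:
|
|   import re
|
|   # Crude SAN detector: looks for a couple of numbered moves
|   # like "1. e4 e5 2. Nf3".
|   SAN = re.compile(r"\b\d+\.\s*[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8]")
|
|   def looks_like_chess(prompt: str) -> bool:
|       return len(SAN.findall(prompt)) >= 2
|
|   def chess_specialist(prompt: str) -> str:
|       return "e5"                 # placeholder chess-tuned model
|
|   def general_model(prompt: str) -> str:
|       return "(general answer)"   # placeholder base model
|
|   def answer(prompt: str) -> str:
|       if looks_like_chess(prompt):
|           return chess_specialist(prompt)
|       return general_model(prompt)
|
|   print(answer("1. e4 e5 2. Nf3"))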
| throwaway314155 wrote:
| > but I think the author missed a variant hypothesis here:
|
| > What if that specific model, when it recognizes chess
| notation, is trained to silently "tag out" for another, more
| specialized LLM, that is specifically trained on a majority-
| chess dataset? (Or -- perhaps even more likely -- the model is
| trained to recognize the need to activate a chess-playing LoRA
| adapter?)
|
| Pretty sure your variant hypothesis is sufficiently covered by
| the author's writing.
|
| So strange that people are so attached to conspiracy theories
| in this instance. Why would OpenAI or anyone go through all the
| trouble? The proposals outlined in the article make far more
| sense and track well with established research (namely that
| applying RLHF to a "text-only" model tends to wreak havoc on
| said model).
| bob1029 wrote:
| I find it amusing that we would frame an ensemble of models as
| "cheating". Routing to a collection of specialized models via
| classification layers seems like the most obvious path for adding
| practical value to these solutions.
|
| Why conflate the parameters of chess with checkers and go if you
| already have high quality models for each? I thought tool use and
| RAG were fair game.
| copperroof wrote:
| I just want a hacker news no-LLM filter. The site has been almost
| unusable for a year now.
| XenophileJKO wrote:
| So this article is what happens when people who don't really
| understand the models "test" things.
|
| There are several fatal flaws.
|
| The first problem is that he isn't clearly and concisely
| displaying the current board state. He is expecting the model to
| attend a move sequence to figure out the board state.
|
| Secondly, he isn't allowing the model to think elastically using
| COT or other strategies.
|
| Honestly, I am shocked it is working at all. He has basically
| formulated the problem in the worst possible way.
| yeevs wrote:
| I'm not sure COT would help in this situation. I am an amateur
| at chess, but in my experience a large part of playing is
| intuition, and I'm not confident the model could even
| accurately summarise its thinking. There are tasks on which
| models perform worse when explaining their reasoning. However,
| this is completely vibes based.
| XenophileJKO wrote:
| Given my experience with the models, giving it the ability to
| think would allow it to attend to different ramifications of
| the current board layout. I would expect a non trivial
| performance gain.
| cma wrote:
| One thing missing from the graphs is whether 3.5-turbo-instruct
| also gets better with the techniques? Is finetuning available for
| it?
| __MatrixMan__ wrote:
| It would be fun to play against an LLM without having to think
| about the prompting, if only as a novel way to get a "feel" for
| how they "think".
| timzaman wrote:
| "all LLMs" - OP only tested OpenAI LLMs. Try Gemini.
___________________________________________________________________
(page generated 2024-11-22 23:00 UTC)