[HN Gopher] Something weird is happening with LLMs and chess
___________________________________________________________________
Something weird is happening with LLMs and chess
Author : crescit_eundo
Score : 630 points
Date : 2024-11-14 17:05 UTC (1 day ago)
(HTM) web link (dynomight.substack.com)
(TXT) w3m dump (dynomight.substack.com)
| PaulHoule wrote:
| Maybe that one which plays chess well is calling out to a real
| chess engine.
| singularity2001 wrote:
| this possibility is discussed in the article and deemed
| unlikely
| margalabargala wrote:
| I don't see that discussed, could you quote it?
| probably_wrong wrote:
| Note: the possibility is not mentioned in the article but
| rather in the comments [1]. I had to click a bit to see it.
|
| The fact that the one closed source model is the only one
| that plays well seems to me like a clear case of the
| interface doing some of the work. If you ask ChatGPT to count
| to 10,000 (something that most LLMs can't do for known
| reasons) you get an answer that's clearly pre-programmed. I'm
| sure the same is happening here (and with many, many other
| tasks) - the author argues against it by saying "but why
| isn't it better?", which doesn't seem like the best argument:
| I can imagine that typical ChatGPT users enjoy the product
| more if they have a chance to win once in a while.
|
| [1] https://dynomight.substack.com/p/chess/comment/77190852
| refulgentis wrote:
| What do you mean LLMs can't count to 10,000 for known
| reasons?
|
| Separately, if you are able to show OpenAI is serving pre-
| canned responses in some instances, instead of running
| inference, you will get a ton of attention if you write it
| up.
|
| I'm not saying this in an aggro tone, it's a genuinely
| interesting subject to me because I wrote off LLMs at first
| because I thought this was going on.* Then I spent the last
| couple years laughing at myself for thinking that they
| would do that. Would be some mix of fascinated and
| horrified to see it come full circle.
|
| * I can't remember what, exactly; it was as far back as 2018.
| But someone argued that OpenAI was patching in individual
| answers because scaling was dead and they had no answers,
| way way before ChatGPT.
| probably_wrong wrote:
| When it comes to counting, LLMs have a couple issues.
|
| First, tokenization: the tokenization of 1229 is not
| guaranteed to be [1,2,2,9] but it could very well be
| [12,29] and the "+1" operation could easily generate
| tokens [123,0] depending on frequencies in your corpus.
| This constant shifting in tokens makes it really hard to
| learn rules for "+1" ([9,9] +1 is not [9,10]). This is
| also why LLMs tend to fail at tasks like "how many
| letters does this word have?":
| https://news.ycombinator.com/item?id=41058318
|
| Second, you need your network to understand that "+1" is
| worth learning. Writing "+1" as a combination of sigmoid,
| products and additions over normalized floating point
| values (hello loss of precision) is not trivial without
| degrading a chunk of your network, and what for? After
| all, math is not in the domain of language and, since
| we're not training an LMM here, your loss function may
| miss it entirely.
|
| And finally there's statistics: the three-legged-dog
| problem is figuring out that a dog has four legs from
| corpora when no one ever writes "the four-legged dog"
| because it's obvious, but every reference to an unusual
| dog will include said description. So if people write
| "1+1 equals 3" satirically then your network may pick
| that up as fact. And how often has your network seen the
| result of "6372 + 1"?
|
| But you don't have to take my word for it - take an open
| LLM and ask it to generate integers between 7824 and
| 9954. I'm not optimistic that it will make it through
| without hallucinations.
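|
| A minimal sketch of that token-shifting issue, assuming the
| tiktoken package and its cl100k_base encoding:
|
|     import tiktoken
|
|     enc = tiktoken.get_encoding("cl100k_base")
|     for n in (1229, 1230, 9999, 10000):
|         ids = enc.encode(str(n))
|         # Consecutive integers can land on completely different
|         # token boundaries, so "+1" is not a stable operation
|         # over token sequences.
|         print(n, ids, [enc.decode([i]) for i in ids])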
| sobriquet9 wrote:
| This is likely. From the example games, it not only knows the
| rules (which would be impressive by itself; just making legal
| moves is not trivial), it also shows some planning capability
| (it plays combinations spanning several moves).
| aithrowawaycomm wrote:
| The author thinks this is unlikely because it only has an ~1800
| ELO. But OpenAI is shady as hell, and I could absolutely see
| the following _purely hypothetical_ scenario:
|
| - In 2022 Brockman and Sutskever have an unshakeable belief
| that Scaling Is All You Need, and since GPT-4 has a ton of
| chess in its pretraining data it will _definitely_ be able to
| play competent amateur chess when it 's finished.
|
| - A ton of people have pointed out that ChatGPT-3.5 doesn't
| even slightly understand chess despite seeming fluency in the
| lingo. People start to whisper that transformers cannot
| actually create plans.
|
| - Therefore OpenAI hatches an impulsive scheme: release an
| "instruction-tuned" GPT-3.5 with an embedded chess engine that
| is not a grandmaster, but can play competent chess, ideally
| just below the ELO that GPT-4 is _projected_ to have.
|
| - Success! The waters are muddied: GPT enthusiasts triumphantly
| announce that LLMs _can_ play chess, it just took a bit more
| data and fine-tuning. The haters were wrong: look at all the
| planning GPT is doing!
|
| - Later on, at OpenAI HQ...whoops! GPT-4 sucks at chess, as do
| competitors' foundation LLMs which otherwise outperform
| GPT-3.5. The scaling "laws" failed here, since they were never
| laws in the first place. OpenAI accepts that scaling
| transformers won't easily solve the chess problem, then
| realizes that if they include the chess engine with GPT-4
| without publicly acknowledging it, then Anthropic and Facebook
| will call out the performance as aberrational and suspicious.
| But publicly acknowledging a chess engine is even worse: the
| only reason to include the chess engine is to mislead users
| into thinking GPT is capable of general-purpose planning.
|
| - Therefore in later GPT versions they don't include the
| engine, but it's too late to remove it from gpt-3.5-turbo-
| instruct: people might accept the (specious) claim that GPT-4's
| size accidentally sabotaged its chess abilities, but they'll
| ask tough questions about performance degradation within the
| same model.
|
| I realize this is convoluted and depends on conjecture. But
| OpenAI has a history with misleading demos - e.g. their Rubik's
| cube robot which in fact used a classical algorithm but was
| presented as reinforcement learning. I think "OpenAI lied" is
| the most likely scenario. It is far more likely than "OpenAI
| solved the problem honestly in GPT-3.5, but forgot how they did
| it with GPT-4," and a bit more likely than "scaling
| transformers slightly helps performance when playing Othello
| but severely sabotages performance when playing chess."
| gardenhedge wrote:
| Not that convoluted really
| refulgentis wrote:
| It's pretty convoluted, requires a ton of steps, mind-
| reading, and odd sequencing.*
|
| If you share every prior, and aren't particularly concerned
| with being disciplined in treating conversation as
| proposing a logical argument (I'm not myself, people find
| it offputting), it probably wouldn't seem at all
| convoluted.
|
| * layer chess into gpt-3.5-instruct _only_ , but not
| chatgpt, not GPT-4, to defeat the naysayers when GPT-4
| comes out? _shrugs_ if the issues with that are unclear, I
| can lay it out more
|
| ** fwiw, at the time, pre-chatgpt, before the hype, there
| wasn't a huge focus on chess, nor a ton of naysayers to
| defeat. it would have been bizarre to put this much energy
| into it, modulo the scatter-brained thinking in *
| gardenhedge wrote:
| It's not that many steps. I'm sure we've all seen our
| sales teams selling features that aren't in the
| application or exaggerating features before they're fully
| complete.
|
| To be clear, I'm not saying that the theory is true but
| just that I could believe something like that could
| happen.
| jmount wrote:
| Very good scenario. One variation: some researcher or
| division in OpenAI performs all of the above steps to get a
| raise. The whole field is predicated on rewarding the
| appearance of ability.
| tedsanders wrote:
| Eh, OpenAI really isn't as shady as hell, from what I've seen
| on the inside for 3 years. Rubik's cube hand was before me,
| but in my time here I haven't seen anything I'd call shady
| (though obviously the non-disparagement clauses were a
| misstep that's now been fixed). Most people are genuinely
| trying to build cool things and do right by our customers.
| I've never seen anyone try to cheat on evals or cheat
| customers, and we take our commitments on data privacy
| seriously.
|
| I was one of the first people to play chess against the base
| GPT-4 model, and it blew my mind by how well it played. What
| many people don't realize is that chess performance is
| extremely sensitive to prompting. The reason gpt-3.5-turbo-
| instruct does so well is that it can be prompted to complete
| PGNs. All the other models use the chat format. This explains
| pretty much everything in the blog post. If you fine-tune a
| chat model, you can pretty easily recover the performance
| seen in 3.5-turbo-instruct.
|
| There's nothing shady going on, I promise.
| og_kalu wrote:
| It's not:
|
| 1. That would just be plain bizarre
|
| 2. It plays like what you'd expect from an LLM that could play
| chess. That is, level of play can be modulated by the prompt
| and doesn't manifest the same way shifting the level of
| stockfish etc does. Also the specific chess notation being
| prompted actually matters
|
| 3. It's sensitive to how the position came to be. Clearly not
| an existing chess engine. https://github.com/dpaleka/llm-chess-
| proofgame
|
| 4. It does make illegal moves. It's rare (~5 in 8205) but it
| happens. https://github.com/adamkarvonen/chess_gpt_eval
|
| 5. You can, or at least you used to be able to, inspect the
| logprobs. I think OpenAI has stopped doing this, but the link
| in 4 does show the author inspecting them for Turbo instruct.
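|
| For reference, a rough sketch of how you'd peek at those
| logprobs on the legacy completions endpoint (assuming the
| openai Python client and an API key in the environment):
|
|     from openai import OpenAI
|
|     client = OpenAI()
|     resp = client.completions.create(
|         model="gpt-3.5-turbo-instruct",
|         prompt="1. e4 e5 2. Nf3 Nc6 3.",
|         max_tokens=4,
|         logprobs=5,  # top-5 alternatives per sampled token
|         temperature=0,
|     )
|     print(resp.choices[0].text)
|     print(resp.choices[0].logprobs.top_logprobs)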
| aithrowawaycomm wrote:
| > Also the specific chess notation being prompted actually
| matters
|
| Couldn't this be evidence that it _is_ using an engine? Maybe
| if you use the wrong notation it relies on the ANN rather
| than calling to the engine.
|
| Likewise:
|
| - The sensitivity to game history is interesting, but is it
| actually true that other chess engines only look at current
| board state? Regardless, maybe it's not an _existing_ chess
| engine! I would think OpenAI has some custom chess engine
| built as a side project, PoC, etc. In particular this engine
| might be neural and trained on actual games rather than board
| positions, which could explain dependency on past moves. Note
| that the engine is not actually very good. Does AlphaZero
| depend on move history? (Genuine question, I am not sure. But
| it does seem likely.)
|
| - I think the illegal moves can be explained similarly to why
| gpt-o1 sometimes screws up easy computations despite having
| access to Python: an LLM having access to a tool does not
| guarantee it always uses that tool.
|
| I realize there are holes in the argument, but I genuinely
| don't think these holes are as big as the "why is
| gpt-3.5-turbo-instruct so much better at chess than gpt-4?"
| janalsncm wrote:
| > Couldn't this be evidence that it is using an engine?
|
| A test would be to measure its performance against more
| difficult versions of Stockfish. A real chess engine would
| have a higher ceiling.
|
| Much more likely is this model was trained on more chess
| PGNs. You can call that a "neural engine" if you'd like but
| it is the simplest solution and explains the mistakes it is
| making.
|
| Game state isn't just what you can see on the board. It
| includes the 50 move rule and castling rights. Those were
| encoded as layers in AlphaZero along with prior positions
| of pieces. (8 prior positions if I'm remembering
| correctly.)
| selcuka wrote:
| I think that's the most plausible theory that would explain the
| sudden hike from gpt-3.5-turbo to gpt-3.5-turbo-instruct, and
| again the sudden regression in gpt-4*.
|
| OpenAI also seem to augment the LLM with some type of VM or a
| Python interpreter. Maybe they run a simple chess engine such
| as Sunfish [1] which is around 1900-2000 ELO [2]?
|
| [1] https://github.com/thomasahle/sunfish
|
| [2] https://lichess.org/@/sunfish-engine
| janalsncm wrote:
| Probably not calling out to one but it would not surprise me at
| all if they added more chess PGNs into their training data.
| Chess is a bit special in AI in that it's still seen as a mark
| of pure intelligence in some respect.
|
| If you tested it on an equally strategic but less popular game
| I highly doubt you would see the same performance.
| pseudosavant wrote:
| LLMs aren't really language models so much as they are token
| models. That is how they can also handle input in audio or visual
| forms because there is an audio or visual tokenizer. If you can
| make it a token, the model will try to predict the following
| ones.
|
| Even though I'm sure chess matches were used in some of the LLM
| training, I'd bet a model trained just for chess would do far
| better.
| viraptor wrote:
| > That is how they can also handle input in audio or visual
| forms because there is an audio or visual tokenizer.
|
| This is incorrect. They get translated into the shared latent
| space, but they're not tokenized in any way resembling the text
| part.
| pseudosavant wrote:
| They are almost certainly tokenized in most LLM multi-modal
| models. https://en.wikipedia.org/wiki/Large_language_model#Mu
| ltimoda...
| viraptor wrote:
| Ah, an overloaded "tokenizer" meaning. "split into tokens"
| vs "turned into a single embedding matching a token" I've
| never heard it used that way before, but it makes sense
| kinda.
| ChrisArchitect wrote:
| [dupe] https://news.ycombinator.com/item?id=42138276
| digging wrote:
| Definitely weird results, but I feel there are too many variables
| to learn much from it. A couple things:
|
| 1. The author mentioned that tokenization causes something
| minuscule like a " " at the end of the input to shatter the
| model's capabilities. Is it possible other slightly different
| formatting changes in the input could raise capabilities?
|
| 2. Temperature was 0.7 for all models. What if it wasn't? Isn't
| there a chance one or more models would perform significantly
| better with higher or lower temperatures?
|
| Maybe I just don't understand this stuff very well, but it feels
| like this post is only 10% of the work needed to get any meaning
| from this...
| semi-extrinsic wrote:
| The author mentions in the comment section that changing
| temperature did not help.
| azeirah wrote:
| Maybe I'm really stupid... but perhaps if we want really
| intelligent models we need to stop tokenizing at all? We're
| literally limiting what a model can see and how it perceives the
| world by limiting the structure of the information streams that
| come into the model from the very beginning.
|
| I know working with raw bits or bytes is slower, but it should be
| relatively cheap and easy to at least falsify this hypothesis
| that many huge issues might be due to tokenization problems
| but... yeah.
|
| Surprised I don't see more research into radically different
| tokenization.
| cschep wrote:
| How would we train it? Don't we need it to understand the heaps
| and heaps of data we already have "tokenized" e.g. the
| internet? Written words for humans? Genuinely curious how we
| could approach it differently?
| viraptor wrote:
| That's not what tokenized means here. Parent is asking to
| provide the model with separate characters rather than
| tokens, i.e. groups of characters.
| skylerwiernik wrote:
| Couldn't we just make every human readable character a token?
|
| OpenAI's tokenizer makes "chess" "ch" and "ess". We could
| just make it into "c" "h" "e" "s" "s"
| taeric wrote:
| This is just more tokens? And probably requires the model
| to learn about common groups. Consider, "ess" makes sense
| to see as a group. "Wss" does not.
|
| That is, the groups are encoding something the model
| doesn't have to learn.
|
| This is not much astray from "sight words" we teach kids.
| TZubiri wrote:
| This is just more tokens?
|
| Yup. Just let the actual ML git gud
| taeric wrote:
| So, put differently, this is just more expensive?
| Hendrikto wrote:
| No, actually much fewer tokens. 256 tokens cover all
| bytes. See the ByT5 paper:
| https://arxiv.org/abs/2105.13626
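|
| A tiny sketch of what byte-level "tokenization" looks like in
| practice (plain Python, no ML library needed):
|
|     text = "chess"
|     byte_ids = list(text.encode("utf-8"))
|     print(byte_ids)  # [99, 104, 101, 115, 115]
|     # The vocabulary is only 256 symbols, but the sequence is
|     # now one token per byte instead of one per word piece.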
| taeric wrote:
| More tokens to a sequence, though. And since it is
| learning sequences...
| loa_in_ wrote:
| Yeah, suddenly 16k tokens is just 16 KB of ASCII instead
| of ~6k words
| tchalla wrote:
| aka Character Language Models which have existed for a
| while now.
| cco wrote:
| We can, tokenization is literally just to maximize
| resources and provide as much "space" as possible in the
| context window.
|
| There is no advantage to tokenization, it just helps solve
| limitations in context windows and training.
| TZubiri wrote:
| I like this explanation
| aithrowawaycomm wrote:
| FWIW I think most of the "tokenization problems" are in fact
| reasoning problems being falsely blamed on a minor technical
| thing when the issue is much more profound.
|
| E.g. I still see people claiming that LLMs are bad at basic
| counting because of tokenization, but the same LLM counts
| perfectly well if you use chain-of-thought prompting. So it
| _can 't_ be explained by tokenization! The problem is
| reasoning: the LLM needs a human to tell it that a counting
| problem can be accurately solved if they go step-by-step.
| Without this assistance the LLM is likely to simply guess.
| ipsum2 wrote:
| The more obvious alternative is that CoT is making up for the
| deficiencies in tokenization, which I believe is the case.
| aithrowawaycomm wrote:
| I think the more obvious explanation has to do with
| computational complexity: counting is an O(n) problem, but
| transformer LLMs can't solve O(n) problems unless you use
| CoT prompting: https://arxiv.org/abs/2310.07923
| ipsum2 wrote:
| What you're saying is an explanation of what I said, but I
| agree with you ;)
| aithrowawaycomm wrote:
| No, it's a rebuttal of what you said: CoT is not making
| up for a deficiency in tokenization, it's making up for a
| deficiency in transformers themselves. These complexity
| results have nothing to do with tokenization, or even
| LLMs, it is about the complexity class of problems that
| can be solved by transformers.
| ipsum2 wrote:
| There's a really obvious way to test whether the
| strawberry issue is tokenization - replace each letter
| with a number, then ask chatGPT to count the number of
| 3s.
|
| Count the number of 3s, only output a single number: 6 5
| 3 2 8 7 1 3 3 9.
|
| ChatGPT: 3.
| MacsHeadroom wrote:
| This paper does not support your position any more than
| it supports the position that the problem is
| tokenization.
|
| This paper posits that if the authors intuition was true
| then they would find certain empirical results. ie. "If A
| then B." Then they test and find the empirical results.
| But this does not imply that their intuition was correct,
| just as "If A then B" does not imply "If B then A."
|
| If the empirical results were due to tokenization
| absolutely nothing about this paper would change.
| Der_Einzige wrote:
| I'm the one who will fight you on this, including with peer-
| reviewed papers indicating that it is in fact due to
| tokenization. I'm too tired right now but will edit this
| later, so take this as my bookmark to remind me to respond.
| aithrowawaycomm wrote:
| I am aware of errors in _computations_ that can be fixed by
| better tokenization (e.g. long addition works better
| tokenizing right-left rather than L-R). But I am talking
| about counting, and talking about counting _words,_ not
| _characters._ I don't think tokenization explains why LLMs
| tend to fail at this without CoT prompting. I really think
| the answer is computational complexity: counting is simply
| too hard for transformers unless you use CoT.
| https://arxiv.org/abs/2310.07923
| cma wrote:
| Words vs characters is a similar problem, since tokens
| can be less one word, multiple words, or multiple words
| and a partial word, or words with non-word punctuation
| like a sentence ending period.
| Jensson wrote:
| We know there are narrow solutions to these problems, that
| was never the argument that the specific narrow task is
| impossible to solve.
|
| The discussion is about general intelligence, the model
| isn't able to do a task that it can do simply because it
| chooses the wrong strategy, that is a problem of lack of
| generalization and not a problem of tokenization. Being
| able to choose the right strategy is core to general
| intelligence, altering input data to make it easier for the
| model to find the right solution to specific questions does
| not help it become more general, you just shift what narrow
| problems it is good at.
| azeirah wrote:
| I strongly believe tokenization is the underlying problem;
| it's just that, let's say, bit-by-bit tokenization is too
| expensive to run at the scales things are currently being
| run at (OpenAI, Claude, etc.)
| int_19h wrote:
| It's not just a current thing, either. Tokenization
| basically lets you have a model with a larger input
| context than you'd otherwise have for the given resource
| constraints. So any gains from feeding the characters in
| directly have to be greater than this advantage. And for
| CoT especially - which we _know_ produces significant
| improvements in most tasks - you want large context.
| pmarreck wrote:
| My intuition says that tokenization is a factor, especially
| if it splits up individual move descriptions differently
| from other LLMs.
|
| If you think about how our brains handle this input, we
| absolutely do not split a move up between the letter and
| the number, although the presence of both the letter and
| the number together would, I'd think, trigger the same two
| tokens.
| TZubiri wrote:
| FWIW I think most of the "tokenization problems"
|
| List of actual tokenization limitations: 1) strawberry,
| 2) rhyming and metrics, 3) whitespace (as displayed in the
| article).
| meroes wrote:
| At a certain level they are identical problems. My strongest
| piece of evidence is that I get paid as an RLHF'er to find
| ANY case of error, including "tokenization". You know how
| many errors an LLM gets in the simplest grid puzzles, with
| CoT, with specialized models that don't try to "one-shot"
| problems, with multiple models, etc?
|
| My assumption is that these large companies wouldn't pay
| hundreds of thousands of RLHF'ers through dozens of third
| party companies livable wages if tokenization errors were
| just that.
| 1propionyl wrote:
| > hundreds of thousands of RLHF'ers through dozens of third
| party companies
|
| Out of curiosity, what are these companies? And where do
| they operate.
|
| I'm always interested in these sorts of "hidden"
| industries. See also: outsourced Facebook content
| moderation in Kenya.
| csomar wrote:
| It can count words in a paragraph though. So I do think it's
| tokenization.
| PittleyDunkin wrote:
| I feel like we can set our qualifying standards higher than
| counting.
| jncfhnb wrote:
| There's a reason human brains have dedicated language handling.
| Tokenization is likely a solid strategy. The real thing here is
| that language is not a good way to encode all forms of
| knowledge
| joquarky wrote:
| It's not even possible to encode all forms of knowledge.
| shaky-carrousel wrote:
| I know a joke where half of the joke is whistling and half
| gesturing, and the punchline is whistling. The wording is
| basically just to say who the players are.
| layer8 wrote:
| Going from tokens to bytes explodes the model size. I can't
| find the reference at the moment, but reducing the average
| token size induces a corresponding quadratic increase in the
| width (size of each layer) of the model. This doesn't just
| affect inference speed, but also training speed.
| og_kalu wrote:
| Tokenization is not strictly speaking necessary (you can train
| on bytes). What it is is really really efficient. Scaling is a
| challenge as is, bytes would just blow that up.
| ATMLOTTOBEER wrote:
| I tend to agree with you. Your post reminded me of
| https://gwern.net/aunn
| numpad0 wrote:
| hot take: LLM tokens are kanji for AI, and just like kanji
| they work okay sometimes but fail miserably at the task of
| accurately representing English
| umanwizard wrote:
| Why couldn't Chinese characters accurately represent English?
| Japanese and Korean aren't related to Chinese and still were
| written with Chinese characters (still are in the case of
| Japanese).
|
| If England had been in the Chinese sphere of influence rather
| than the Roman one, English would presumably be written with
| Chinese characters too. The fact that it used an alphabet
| instead is a historical accident, not due to any grammatical
| property of the language.
| stickfigure wrote:
| If I read you correctly, you're saying "the fact that the
| residents of England speak English instead of Chinese is a
| historical accident" and maybe you're right.
|
| But the residents of England do in fact speak English, and
| English is a phonetic language, so there's an inherent
| impedance mismatch between Chinese characters and English
| language. I can make up words in English and write them
| down which don't necessarily have Chinese written
| equivalents (and probably, vice-versa?).
| umanwizard wrote:
| > If I read you correctly, you're saying "the fact that
| the residents of England speak English instead of Chinese
| is a historical accident" and maybe you're right.
|
| That's not what I mean at all. I mean even if spoken
| English were exactly the same as it is now, it could have
| been written with Chinese characters, and indeed would
| have been if England had been in the Chinese sphere of
| cultural influence when literacy developed there.
|
| > English is a phonetic language
|
| What does it mean to be a "phonetic language"? In what
| sense is English "more phonetic" than the Chinese
| languages?
|
| > I can make up words in English and write them down
| which don't necessarily have Chinese written equivalents
|
| Of course. But if English were written with Chinese
| characters people would eventually agree on characters to
| write those words with, just like they did with all the
| native Japanese words that didn't have Chinese
| equivalents but are nevertheless written with kanji.
|
| Here is a famous article about how a Chinese-like writing
| system would work for English:
| https://www.zompist.com/yingzi/yingzi.htm
| skissane wrote:
| > Japanese and Korean aren't related to Chinese and still
| were written with Chinese characters (still are in the case
| of Japanese).
|
| The problem is - in writing Japanese with kanji, lots of
| somewhat arbitrary decisions had to be made. Which kanji to
| use for which native Japanese word? There isn't always an
| obviously best choice from first principles. But that's not
| a problem in practice, because a tradition developed of
| which kanjii to use for which Japanese word (kun'yomi
| readings). For English, however, we don't have such a
| tradition. So it isn't clear which Chinese character to use
| for each English word. If two people tried to write English
| with Chinese characters independently, they'd likely make
| different character choices, and the mutual intelligibility
| might be poor.
|
| Also, while neither Japanese nor Korean belongs to the same
| language family as Chinese, both borrowed lots of words
| from Chinese. In Japanese, a lot of use of kanji
| (especially on'yomi reading) is for borrowings from
| Chinese. Since English borrowed far less terms from
| Chinese, this other method of "deciding which character(s)
| to use" - look at the word's Chinese etymology - largely
| doesn't work for English given very few English words have
| Chinese etymology.
|
| Finally, they also invented kanji in Japan for certain
| Japanese words - kokuji. The same thing happened for Korean
| Hanja (gukja), to a lesser degree. Vietnamese Chu Nom
| contains thousands of invented-in-Vietnam characters.
| Probably, if English had adopted Chinese writing, the same
| would have happened. But again, deciding when to do it and
| if so how is a somewhat arbitrary choice, which is
| impossible outside of a real societal tradition of doing
| it.
|
| > The fact that it used an alphabet instead is a historical
| accident, not due to any grammatical property of the
| language.
|
| Using the Latin alphabet changed English, just as using
| Chinese characters changed Japanese, Korean and Vietnamese.
| If English had used Chinese characters instead of the Latin
| alphabet, it would be a very different language today.
| Possibly not in grammar, but certainly in vocabulary.
| int_19h wrote:
| You could absolutely write a tokenizer that would
| consistently tokenize all distinct English words as distinct
| tokens, with a 1:1 mapping.
|
| But AFAIK there's no evidence that this actually improves
| anything, and if you spend that much of the dictionary on one
| language, it comes at the cost of making the encoding for
| everything else much less efficient.
| empiko wrote:
| I have seen a bunch of tokenization papers with various ideas
| but their results are mostly meh. I personally don't see
| anything principally wrong with current approaches. Having
| discrete symbols is how natural language works, and this might
| be an okayish approximation.
| malthaus wrote:
| https://youtu.be/zduSFxRajkE
|
| karpathy agrees with you, here he is hating on tokenizers while
| re-building them for 2h
| blixt wrote:
| I think it's infeasible to train on bytes unfortunately, but
| yeah it also seems very wrong to use a handwritten and
| ultimately human version of tokens (if you take a look at the
| tokenizers out there you'll find fun things like regular
| expressions to change what is tokenized based on anecdotal
| evidence).
|
| I keep thinking that if we can turn images into tokens, and we
| can turn audio into tokens, then surely we can create a set of
| tokens where the tokens are the model's own chosen
| representation for semantic (multimodal) meaning, and then
| decode those tokens back to text[1]. Obviously a big downside
| would be that the model can no longer 1:1 quote all text it's
| seen since the encoded tokens would need to be decoded back to
| text (which would be lossy).
|
| [1] From what I could gather, this is exactly what OpenAI did
| with images in their gpt-4o report, check out "Explorations of
| capabilities": https://openai.com/index/hello-gpt-4o/
| PittleyDunkin wrote:
| A byte is itself sort of a token. So is a bit. It makes more
| sense to use more tokenizers in parallel than it does to try
| and invent an entirely new way of seeing the world.
|
| Anyway humans have to tokenize, too. We don't perceive the
| world as a continuous blob either.
| samatman wrote:
| I would say that "humans have to tokenize" is almost
| precisely the opposite of how human intelligence works.
|
| We build layered, non-nested gestalts out of real time analog
| inputs. As a small example, the meaning of a sentence said
| with the same precise rhythm and intonation can be
| meaningfully changed by a gesture made while saying it. That
| can't be tokenized, and that isn't what's happening.
| PittleyDunkin wrote:
| What is a gestalt if not a token (or a token representing
| collections of other tokens)? It seems more reasonable (to
| me) to conclude that we have multiple contradictory
| tokenizers that we select from rather than to reject the
| concept entirely.
|
| > That can't be tokenized
|
| Oh ye of little imagination.
| Anotheroneagain wrote:
| I think on the contrary, the more you can restrict it to
| _reasonable_ inputs /outputs, the less powerful LLM you are
| going to need.
| ajkjk wrote:
| This is probably unnecessary, but: I wish you wouldn't use the
| word "stupid" there. Even if you didn't mean anything by it
| personally, it might reinforce in an insecure reader the idea
| that, if one can't speak intelligently about some complex and
| abstruse subject that other people know about, there's
| something wrong with them, like they're "stupid" in some
| essential way. When in fact they would just be "ignorant" (of
| this particular subject). To be able to formulate those
| questions at all is clearly indicative of great intelligence.
| volkk wrote:
| > This is probably unnecessary
|
| you're certainly right
| amelius wrote:
| Perhaps we can even do away with transformers and use a fully
| connected network. We can always prune the model later ...
| DrNosferatu wrote:
| What about contemporary frontier models?
| ynniv wrote:
| I don't think one model is statistically significant. As people
| have pointed out, it could have chess specific responses that the
| others do not. There should be at least another one or two,
| preferably unrelated, "good" data points before you can claim
| there is a pattern. Also, where's Claude?
| og_kalu wrote:
| There are other transformers that have been trained on chess
| text that play chess fine (just not as good as 3.5 Turbo
| instruct with the exception of the "grandmaster level without
| search" paper).
| jrecursive wrote:
| i think this has everything to do with the fact that learning
| chess by learning sequences will get you into more trouble than
| good. even a trillion games won't save you:
| https://en.wikipedia.org/wiki/Shannon_number
|
| that said, for the sake of completeness, modern chess engines
| (with high quality chess-specific models as part of their
| toolset) are fully capable of, at minimum, tying every player
| alive or dead, every time. if the opponent makes one mistake,
| even very small, they will lose.
|
| while writing this i absently wondered if you increased the skill
| level of stockfish, maybe to maximum, or perhaps at least an
| 1800+ elo player, you would see more successful games. even then,
| it will only be because the "narrower training data" (ie advanced
| players won't play trash moves) at that level will probably get
| you more wins in your graph, but it won't indicate any better
| play, it will just be a reflection of less noise; fewer, more
| reinforced known positions.
| jayrot wrote:
| > i think this has everything to do with the fact that learning
| chess by learning sequences will get you into more trouble than
| good. even a trillion games won't save you:
| https://en.wikipedia.org/wiki/Shannon_number
|
| Indeed. As has been pointed out before, the number of possible
| chess positions easily, vastly dwarfs even the wildest possible
| estimate of the number of atoms in the known universe.
| metadat wrote:
| What about the number of possible positions where an idiotic
| move hasn't been played? Perhaps the search space could be
| reduced quite a bit.
| pixl97 wrote:
| Unless there is an apparently idiotic move that can lead to
| an 'island of intelligence'
| rcxdude wrote:
| Sure, but so does the number of paragraphs in the english
| language, and yet LLMs seem to do pretty well at that. I
| don't think the number of configurations is particularly
| relevant.
|
| (And it's honestly quite impressive that LLMs can play it at
| all, but not at all surprising that it loses pretty handily
| to something which is explicitly designed to search, as
| opposed to simply feed-forward a decision)
| dataspun wrote:
| Not true if we're talking sensible chess moves.
| BurningFrog wrote:
| > _I think this has everything to do with the fact that
| learning chess by learning sequences will get you into more
| trouble than good._
|
| Yeah, once you've deviated from a sequence you're lost.
|
| Maybe approaching it by learning the best move in
| billions/trillions of positions, and feeding that into some AI
| could work better. Similar positions often have the same kind
| of best move.
| torginus wrote:
| Honestly, I think that once you discard the moves one would
| never make, and account for symmetries/effectively similar
| board positions (ones that could be detected by a very simple
| pattern matcher), chess might not be that big a game at all.
| jrecursive wrote:
| you should try it and post a rebuttal :)
| astrea wrote:
| Since we're mentioning Shannon... What is the minimum
| representative sample size of that problem space? Is it close
| enough to the number of freely available chess moves on the
| Internet and in books?
| underlines wrote:
| Can you try increasing compute in the problem search space, not
| in the training space? What this means is, give it more compute
| to think during inference by not forcing any model to "only
| output the answer in algebraic notation" but do CoT prompting:
| "1. Think about the current board 2. Think about valid possible
| next moves and choose the 3 best by thinking ahead 3. Make your
| move"
|
| Or whatever you deem a good step by step instruction of what an
| actual good beginner chess player might do.
|
| Then try different notations, different prompt variations,
| temperatures and the other parameters. That all needs to go in
| your hyper-parameter-tuning.
|
| One could try using DSPy for automatic prompt optimization.
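|
| As a sketch (the wording is illustrative, not taken from the
| article), such a prompt might look like:
|
|     COT_PROMPT = """You are playing White. Moves so far:
|     {pgn}
|
|     1. Describe the current position in your own words.
|     2. List three candidate moves and the main line you expect
|        after each.
|     3. On the last line, give your chosen move in algebraic
|        notation, prefixed with "Move:"."""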
| viraptor wrote:
| Yeah, the expectation of an immediate answer definitely
| limits the results, especially for the later stages. Another
| possible improvement: every 2 steps, show the current board
| state and repeat the moves still to be processed, before
| analysing the final position.
| pavel_lishin wrote:
| > _1. Think about the current board 2. Think about valid
| possible next moves and choose the 3 best by thinking ahead 3._
|
| Do these models actually _think about a board_? Chess engines
| do, as much as we can say that any machine thinks. But do LLMs?
| TZubiri wrote:
| Can be forced through inference with CoT type of stuff. Spend
| tokens at each stage to draw the board for example, then
| spend tokens restating the rules of the game, then spend
| token restating the heuristics like piece value, and then
| spend tokens doing a minmax n-ply search.
|
| Wildly inefficient? Probably. Could maybe generate some
| python to make more efficient? Maybe, yeah.
|
| Essentially user would have to teach gpt to play chess, or
| training would fine tune chess towards these CoT, fine
| tuning, etc...
| cmpalmer52 wrote:
| I don't think it would have an impact great enough to explain the
| discrepancies you saw, but some chess engines on very low
| difficulty settings make "dumb" moves sometimes. I'm not great at
| chess and I have trouble against them sometimes because they
| don't make the kind of mistakes humans make. Moving the
| difficulty up a bit makes the games more predictable, in that you
| can predict and force an outcome without the computer blowing it
| with a random bad move. Maybe part of the problem is them not
| dealing with random moves well.
|
| I think an interesting challenge would be looking at a board
| configuration and scoring it on how likely it is to be real -
| something high ranked chess players can do without much thought
| (telling a random setup of pieces from a game in progress).
| Xcelerate wrote:
| So if you squint, chess can be considered a formal system. Let's
| plug ZFC or PA into gpt-3.5-turbo-instruct along with an
| interesting theorem and see what happens, no?
| tqi wrote:
| I assume LLMs will be fairly average at chess for the same reason
| they can't count the Rs in "strawberry": they're reflecting the
| training set and not using any underlying logic? Granted, my
| understanding of
| LLMs is not very sophisticated, but I would be surprised if the
| Reward Models used were able to distinguish high quality moves vs
| subpar moves...
| ClassyJacket wrote:
| LLMs can't count the Rs in strawberry because of tokenization.
| Words are converted to vectors (numbers), so the actual
| transformer network never sees the letters that make up the
| word.
|
| ChatGPT doesn't see "strawberry", it sees [302, 1618, 19772]
| tqi wrote:
| Hm, but if that is the case, then why do LLMs only fail at
| the task for a few word/letter combinations (like the r's in
| "strawberry"), and not all words?
| bryan0 wrote:
| I remember one of the early "breakthroughs" for LLMs in chess was
| that they could actually play legal moves(!). In all of these
| games, are the models always playing legal moves? I don't think
| the article says. The fact that an LLM can even reliably play
| legal moves, 20+ moves into a chess game is somewhat remarkable.
| It needs to have an accurate representation of the board state
| even though it was only trained on next token prediction.
| pama wrote:
| The author explains what they did: restrict the move options to
| valid ones when possible (for open models with the ability to
| enforce grammar during inference) or sample the model for a
| valid move up to ten times, then pick a random valid move.
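|
| A rough sketch of that fallback loop, assuming python-chess
| and a hypothetical sample_llm_move() helper that queries the
| model for its next move in SAN:
|
|     import random
|     import chess
|
|     def next_move(board: chess.Board, pgn_so_far: str) -> chess.Move:
|         for _ in range(10):
|             # hypothetical LLM call, returns e.g. "Nf3"
|             candidate = sample_llm_move(pgn_so_far)
|             try:
|                 # parse_san() raises a ValueError subclass on
|                 # illegal or malformed moves.
|                 return board.parse_san(candidate)
|             except ValueError:
|                 continue
|         # After ten bad samples, fall back to a random legal move.
|         return random.choice(list(board.legal_moves))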
| zelphirkalt wrote:
| I think it only needs to have read sufficient pgns.
| kenjackson wrote:
| I did a very unscientific test and it did seem to just play
| legal moves. Not only that, if I did an illegal move it would
| tell me that I couldn't do it.
|
| I then said that I wanted to play with new rules, where a
| queen could jump over any pawn, and it let me make that rule
| change -- and we played with this new rule. Unfortunately, I
| was trying to play in my head and I got mixed up and ended up
| losing my queen. Then I changed the rule one more time -- if
| you take the queen you lose -- so I won!
| ericye16 wrote:
| I agree with some of the other comments here that the prompt is
| limiting. The model can't do any computation without emitting
| tokens and limiting the numbers of tokens it can emit is going to
| limit the skill of the model. It's surprising that any model at
| all is capable of performing well with this prompt in fact.
| niobe wrote:
| I don't understand why educated people expect that an LLM _would_
| be able to play chess at a decent level.
|
| It has no idea about the quality of its data. "Act like x"
| prompts are no substitute for actual reasoning and deterministic
| computation which clearly chess requires.
| computerex wrote:
| Question here is why gpt-3.5-instruct can then beat stockfish.
| fsndz wrote:
| PS: I ran it and, as suspected, gpt-3.5-turbo-instruct does not
| beat stockfish; it is not even close. "Final Results:
| gpt-3.5-turbo-instruct: Wins=0, Losses=6, Draws=0,
| Rating=1500.00 stockfish: Wins=6, Losses=0, Draws=0,
| Rating=1500.00" https://www.loom.com/share/870ea03197b3471eaf
| 7e26e9b17e1754?...
| computerex wrote:
| Maybe there's some difference in the setup because the OP
| reports that the model beats stockfish (how they had it
| configured) every single game.
| Filligree wrote:
| OP had stockfish at its weakest preset.
| fsndz wrote:
| Did the same and gpt-3.5-turbo-instruct still lost all
| the games. maybe a diff in stockfish version ? I am using
| stockfish 16
| mannykannot wrote:
| That is a very pertinent question, especially if
| Stockfish has been used to generate training data.
| golol wrote:
| You have to get the model to think in PGN data. It's
| crucial to use the exact PGN format it saw in its
| training data and to give it few-shot examples.
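|
| For example, something in this spirit (headers and moves are
| illustrative, not the article's exact prompt):
|
|     [White "Garry Kasparov"]
|     [Black "Magnus Carlsen"]
|     [Result "1/2-1/2"]
|
|     1. e4 e5 2. Nf3
|
| and then let the model simply continue the text.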
| bluGill wrote:
| The article appears to have only run Stockfish at low levels.
| You don't have to be very good to beat it.
| lukan wrote:
| Cheating (using a internal chess engine) would be the obvious
| reason to me.
| TZubiri wrote:
| Nope. Calls via the API don't use function calls.
| permo-w wrote:
| that you know of
| TZubiri wrote:
| Sure. It's not hard to verify: in the user UI, function
| calls are very transparent.
|
| And in the api, all of the common features like maths and
| search are just not there. You can implement them
| yourself.
|
| You can compare with self hosted models like llama and
| the performance is quite similar.
|
| You can also jailbreak and get a shell in the container
| for some further proof.
| girvo wrote:
| How can you prove this when talking about someones
| internal closed API?
| shric wrote:
| I'm actually surprised any of them manage to make legal moves
| throughout the game once they're out of book.
| SilasX wrote:
| Right, at least as of the ~GPT3 model it was just "predict what
| you _would_ see in a chess game ", not "what would _be_ the
| best move ". So (IIRC) users noted that if you made bad move,
| then the model would also reply with bad moves because it
| pattern matched to bad games. (I anthropomorphized this as the
| model saying "oh, we're doing dumb-people-chess now, I can do
| that too!")
| cma wrote:
| But it also predicts moves where the text says "black won the
| game, [proceeds to show the game]". To minimize loss on that
| it would need to from context try and make it so white
| doesn't make critical mistakes.
| aqme28 wrote:
| Yeah, that is the "something weird" of the article.
| viraptor wrote:
| This is a puzzle given enough training information. LLM can
| successfully print out the status of the board after the given
| moves. It can also produce a not-terrible summary of the
| position and is able to list dangers at least one move ahead.
| Decent is subjective, but that should beat at least beginners.
| And the lowest level of stockfish used in the blog post is
| lowest intermediate.
|
| I don't know really what level we should be thinking of here,
| but I don't see any reason to dismiss the idea. Also, it really
| depends on whether you're thinking of the current public
| implementations of the tech, or the LLM idea in general. If we
| wanted to get better results, we could feed it way more chess
| books and past game analysis.
| grugagag wrote:
| LLMs like GPT aren't built to play chess, and here's why:
| they're made for handling language, not playing games with
| strict rules and strategies. Chess engines, like Stockfish,
| are designed specifically for analyzing board positions and
| making the best moves, but LLMs don't even "see" the board.
| They're just guessing moves based on text patterns, without
| understanding the game itself.
|
| Plus, LLMs have limited memory, so they struggle to remember
| previous moves in a long game. It's like trying to play
| blindfolded! They're great at explaining chess concepts or
| moves but not actually competing in a match.
| viraptor wrote:
| > but LLMs don't even "see" the board
|
| This is a very vague claim, but they _can_ reconstruct the
| board from the list of moves, which I would say proves this
| wrong.
|
| > LLMs have limited memory
|
| For the recent models this is not a problem for the chess
| example. You can feed whole books into them if you want to.
|
| > so they struggle to remember previous moves
|
| Chess is stateless with perfect information. Unless you're
| going for mind games, you don't need to remember previous
| moves.
|
| > They're great at explaining chess concepts or moves but
| not actually competing in a match.
|
| What's the difference between a great explanation of a move
| and explaining every possible move then selecting the best
| one?
| mjcohen wrote:
| Chess is not stateless. Three repetitions of the same
| position is a draw.
| Someone wrote:
| Yes, there's state there that's not in the board
| position, but technically, threefold repetition is not a
| draw. Play can go on.
| https://en.wikipedia.org/wiki/Threefold_repetition:
|
| _"The game is not automatically drawn if a position
| occurs for the third time - one of the players, on their
| turn, must claim the draw with the arbiter. The claim
| must be made either before making the move which will
| produce the third repetition, or after the opponent has
| made a move producing a third repetition. By contrast,
| the fivefold repetition rule requires the arbiter to
| intervene and declare the game drawn if the same position
| occurs five times, needing no claim by the players."_
| cool_dude85 wrote:
| >Chess is stateless with perfect information. Unless
| you're going for mind games, you don't need to remember
| previous moves.
|
| In what sense is chess stateless? Question: is Rxa6 a
| legal move? You need board state to refer to in order to
| decide.
| aetherson wrote:
| They mean that you only need board position, you don't
| need the previous moves that led to that board position.
|
| There are at least a couple of exceptions to that as far
| as I know.
| User23 wrote:
| The correct phrasing would be: is it a Markov process?
| chongli wrote:
| Yes, 4 exceptions: castling rights, legal en passant
| captures, threefold repetition, and the 50 move rule. You
| actually need quite a lot of state to track all of those.
| fjkdlsjflkds wrote:
| It shouldn't be too much extra state. I assume that 2
| bits should be enough to cover castling rights (one for
| each player), whatever is necessary to store the last 3
| moves should cover legal en passant captures and
| threefold repetition, and 12 bits to store two non-
| overflowing 6 bit counters (time since last capture, and
| time since last pawn move) should cover the 50 move rule.
|
| So... unless I'm understanding something incorrectly,
| something like "the three last moves plus 17 bits of
| state" (plus the current board state) should be enough to
| treat chess as a memoryless process. Doesn't seem like
| too much to track.
| chongli wrote:
| Threefold repetition does not require the three positions
| to occur consecutively. So you could conceivably have a
| position repeat itself for first on the 1st move, second
| time on the 25th move, and the third time on the 50th
| move of a sequence and then players could claim a draw by
| threefold repetition or 50 move rule at the same time!
|
| This means you do need to store the last 50 board
| positions in the worst case. Normally you need to store
| less because many moves are irreversible (pawns cannot go
| backwards, pieces cannot be un-captured).
| fjkdlsjflkds wrote:
| Ah... gotcha. Thanks for the clarification.
| sfmz wrote:
| Chess is not stateless. En Passant requires last move and
| castling rights requires nearly all previous moves.
|
| https://adamkarvonen.github.io/machine_learning/2024/01/0
| 3/c...
| viraptor wrote:
| Ok, I did go too far. But castling doesn't require all
| previous moves - only one bit of information carried
| over. So in practice that's board + 2 bits per player.
| (or 1 bit and 2 moves if you want to include a draw)
| aaronchall wrote:
| Castling requires no prior moves by either piece (King or
| Rook). Move the King once and back early on, and later,
| although the board looks set for castling, the King may
| not castle.
| viraptor wrote:
| Yes, which means you carry one bit of extra information -
| "is castling still allowed". The specific moves that
| resulted in this bit being unset don't matter.
| aaronchall wrote:
| Ok, then for this you need minimum of two bits - one for
| kingside Rook and one for the queenside Rook, both would
| be set if you move the King. You also need to count moves
| since the last exchange or pawn move for the 50 move
| rule.
| viraptor wrote:
| Ah, that one's cool - I've got to admit I've never heard
| of the 50 move rule.
| User23 wrote:
| Also the 3x repetition rule.
| chipsrafferty wrote:
| And 5x repetition rule
| ethbr1 wrote:
| > _Chess is stateless with perfect information._
|
| It is not stateless, because good chess isn't played as a
| series of independent moves -- it's played as a series of
| moves connected to a player's strategy.
|
| > _What 's the difference between a great explanation of
| a move and explaining every possible move then selecting
| the best one?_
|
| Continuing from the above, "best" in the latter sense
| involves understanding possible future moves _after_ the
| next move.
|
| Ergo, if I looked at all games with the current board
| state and chose the next move that won the most games,
| it'd be tactically sound but strategically ignorant.
|
| Because many of those next moves were making that next
| move _in support of_ some broader strategy.
| viraptor wrote:
| > it's played as a series of moves connected to a
| player's strategy.
|
| That state belongs to the player, not to the game. You
| can carry your own state in any game you want - for
| example remember who starts with what move in rock paper
| scissors, but that doesn't make that game stateful. It's
| the player's decision (or bot's implementation) to use
| any extra state or not.
|
| I wrote "previous moves" specifically (and the extra bits
| already addressed elsewhere), but the LLM can
| carry/rebuild its internal state between the steps.
| ethbr1 wrote:
| If we're talking about LLMs, then the state belongs to
| it.
|
| So even if the rules of chess are (mostly) stateless, the
| resulting game itself is not.
|
| Thus, you can't dismiss concerns about LLMs having
| difficulty tracking state by saying that chess is
| stateless. It's not, in that sense.
| lxgr wrote:
| > good chess isn't played as a series of independent
| moves -- it's played as a series of moves connected to a
| player's strategy.
|
| Maybe good chess, but not perfect chess. That would by
| definition be game-theoretically optimal, which in turn
| implies having to maintain no state other than your
| position in a large but precomputable game tree.
| chongli wrote:
| Right, but your position also includes whether or not you
| still have the right to castle on either side, whether
| each pawn has the right to capture en passant or not, the
| number of moves since the last pawn move or capture (for
| tracking the 50 move rule), and whether or not the
| current position has ever appeared on the board once or
| twice prior (so you can claim a draw by threefold
| repetition).
|
| So in practice, your position actually includes the log
| of all moves to that point. That's a lot more state than
| just what you can see on the board.
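|
| A minimal sketch of that extra state (field names are
| illustrative; it's essentially the FEN fields plus a history
| for repetition claims):
|
|     from dataclasses import dataclass, field
|
|     @dataclass
|     class GameState:
|         piece_placement: str   # board part of a FEN string
|         side_to_move: str      # "w" or "b"
|         castling_rights: str   # subset of "KQkq", or "-"
|         en_passant_square: str # e.g. "e3", or "-"
|         halfmove_clock: int    # moves since last pawn move or
|                                # capture, for the 50 move rule
|         position_history: list[str] = field(default_factory=list)
|         # past positions, needed to claim threefold repetition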
| cowl wrote:
| > Chess is stateless with perfect information. Unless
| you're going for mind games, you don't need to remember
| previous moves.
|
| while it can be played as stateless, remembering previous
| moves gives you insight into potential strategy that is
| being build.
| jackcviers3 wrote:
| You can feed them whole books, but they have trouble with
| recall for specific information in the middle of the
| context window.
| jerska wrote:
| LLMs need to compress information to be able to predict
| next words in as many contexts as possible.
|
| Chess moves are simply tokens as any other. Given enough
| chess training data, it would make sense to have part of
| the network trained to handle chess specifically instead of
| simply encoding basic lists of moves and follow-ups. The
| result would be a general purpose sub-network trained on
| chess.
| zeckalpha wrote:
| Language is a game with strict rules and strategies.
| codebolt wrote:
| > they're made for handling language, not playing games
| with strict rules and strategies
|
| Here's the opposite theory: Language encodes objective
| reasoning (or at least, it does some of the time). A
| sufficiently large ANN trained on sufficiently large
| amounts of text will develop internal mechanisms of
| reasoning that can be applied to domains outside of
| language.
|
| Based on what we are currently seeing LLMs do, I'm becoming
| more and more convinced that this is the correct picture.
| wruza wrote:
| I share this idea but from the different perspective. It
| doesn't develop these mechanisms, but casts a high-
| dimensional-enough shadow of their effect on itself. This
| vaguely explains why the more deep Gell-Mann-wise you are
| the less sharp that shadow is, because specificity cuts
| off "reasoning" hyperplanes.
|
| It's hard to explain emerging _mechanisms_ because of the
| nature of generation, which is one-pass sequential matrix
| reduction. I say this while waving my hands, but listen.
| Reasoning is similar to Turing complete algorithms, and
| what LLMs can become through training is similar to
| limited pushdown automata at best. I think this is a good
| conceptual handle for it.
|
| "Line of thought" is an interesting way to loop the
| process back, but it doesn't show _that_ much
| improvement, afaiu, and still is finite.
|
| Otoh, a chess player takes as much time and "loops" as
| they need to get the result (ignoring competitive time
| limits).
| nemomarx wrote:
| just curious, was this rephrased by an llm or is that your
| writing style?
| shric wrote:
| Stockfish level 1 is well below "lowest intermediate".
|
| A friend of mine just started playing chess a few weeks ago
| and can beat it about 25% of the time.
|
| It will hang pieces, and you can hang your own queen and
| there's about a 50% chance it won't be taken.
| danielmarkbruce wrote:
| Chess does not clearly require that. Various purely
| ML/statistical based model approaches are doing pretty well.
| It's almost certainly best to incorporate some kind of search
| into an overall system, but it's not absolutely required just
| to play at a decent amateur level.
|
| The problem here is the specific model architecture, training
| data, vocabulary/tokenization method (if you were going to even
| represent a game this way... which you wouldn't), loss function
| and probably decoding strategy.... basically everything is
| wrong here.
| TZubiri wrote:
| Bro, it actually did play chess, didn't you read the article?
| mandevil wrote:
| It sorta played chess- he let it generate up to ten moves,
| throwing away any that weren't legal, and if no legal move
| was generated by the 10th try he picked a random legal move.
| He does not say how many times he had to provide a random
| move, or how many times illegal moves were generated.
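|
| Roughly, that loop might look like this (a sketch using
| python-chess; generate_candidate is a hypothetical stand-in
| for whatever call actually asks the model for a move):
|
|     import random
|     import chess
|
|     def choose_move(board, generate_candidate, max_tries=10):
|         # Ask the model for a move up to max_tries times, discard
|         # anything illegal or unparseable, then fall back to a
|         # random legal move.
|         for _ in range(max_tries):
|             suggestion = generate_candidate(board)  # e.g. "Nf3"
|             try:
|                 return board.parse_san(suggestion.strip())
|             except ValueError:
|                 continue
|         return random.choice(list(board.legal_moves))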
| og_kalu wrote:
| You're right it's not in this blog but turbo-instruct's
| chess ability has been pretty thoroughly tested and it does
| play chess.
|
| https://github.com/adamkarvonen/chess_gpt_eval
| TZubiri wrote:
| Ah, I didn't see the illegal move discarding.
| mandevil wrote:
| That was for the OpenAI games- including the ones that
| won. For the ones he ran himself with open source LLM's
| he restricted their grammar to just be legal moves, so it
| could only respond with a legal move. But that was
| because of a separate process he added on top of the LLM.
|
| Again, this isn't exactly HAL playing chess.
| slibhb wrote:
| Few people (perhaps none) expected LLMs to be good at chess.
| Nevertheless, as the article explains, there was buzz around a
| year ago that LLMs were good at chess.
|
| > It has no idea about the quality of it's data. "Act like x"
| prompts are no substitute for actual reasoning and
| deterministic computation which clearly chess requires.
|
| No. You can definitely train a model to be really good at chess
| without "actual reasoning and deterministic computation".
| xelxebar wrote:
| Then you should be surprised that turbo-instruct actually plays
| well, right? We see a proliferation of hand-wavy arguments
| based on unfounded anthropomorphic intuitions about "actual
| reasoning" and whatnot. I think this is good evidence that
| nobody really understands what's going on.
|
| If some mental model says that LLMs should be bad at chess,
| then it fails to explain why we have LLMs playing strong chess.
| If another mental model says the inverse, then it fails to
| explain why so many of these large models fail spectacularly at
| chess.
|
| Clearly, there's more going on here.
| flyingcircus3 wrote:
| "playing strong chess" would be a much less hand-wavy claim
| if there were lots of independent methods of quantifying and
| verifying the strength of stockfish's lowest difficulty
| setting. I honestly don't know if that exists or not. But
| unless it does, why would stockfish's lowest difficulty
| setting be a meaningful threshold?
| golol wrote:
| I've tried it myself; GPT-3.5-turbo-instruct was at least
| somewhere in the range of 1600-1800 Elo.
| akira2501 wrote:
| There are some who suggest that modern chess is mostly a game
| of memorization and not one particularly of strategy or
| skill. I assume this is why variants like speed chess exist.
|
| In this scope, my mental model is that LLMs would be good at
| modern style long form chess, but would likely be easy to
| trip up with certain types of move combinations that most
| humans would not normally use. My prediction is that once
| found they would be comically susceptible to these patterns.
|
| Clearly, we have no real basis for saying it is "good" or
| "bad" at chess, and even using chess performance as a
| measurement sample is a highly biased decision, likely born
| out of marketing rather than principle.
| mewpmewp2 wrote:
| It is memorisation only after you have grandmastered
| reasoning and strategy.
| DiogenesKynikos wrote:
| Speed chess relies on skill.
|
| I think you're using "skill" to refer solely to one aspect
| of chess skill: the ability to do brute-force calculations
| of sequences of upcoming moves. There are other aspects of
| chess skill, such as:
|
| 1. The ability to judge a chess position at a glance, based
| on years of experience in playing chess and theoretical
| knowledge about chess positions.
|
| 2. The ability to instantly spot tactics in a position.
|
| In blitz (about 5 minutes) or bullet (1 minute) chess
| games, these other skills are much more important than the
| ability to calculate deep lines. They're still aspects of
| chess skill, and they're probably equally important as the
| ability to do long brute-force calculations.
| henearkr wrote:
| > tactics in a position
|
| That should give patterns (hence your use of the verb to
| "spot" them, as the grandmaster would indeed spot the
| patterns) recognizable in the game string.
|
| More specifically, grammar-like patterns, e.g. the same
| moves but translated.
|
| Typically what an LLM can excel at.
| the_af wrote:
| > _Then you should be surprised that turbo-instruct actually
| plays well, right?_
|
| Do we know it's not special-casing chess and instead using a
| different engine (not an LLM) for playing?
|
| To be clear, this would be an entirely _appropriate_ approach
| to problem-solving in the real world, it just wouldn't be
| the LLM that's playing chess.
| mda wrote:
| Yes, probably there is more going on here, e.g. it is
| cheating.
| mannykannot wrote:
| One of the main purposes of running experiments of any sort is
| to find out if our preconceptions are accurate. Of course, if
| someone is not interested in that question, they might as well
| choose not to look through the telescope.
| bowsamic wrote:
| Sadly there's a common sentiment on HN that testing obvious
| assumptions is a waste of time
| BlindEyeHalo wrote:
| Not only on HN. Trying to publish a scientific article that
| does not contain the word 'novel' has become almost
| impossible. No one is trying to reproduce anyone's claims
| anymore.
| pcf wrote:
| Do you think this bias is part of the replication crisis
| in science?
| bowsamic wrote:
| I don't think this is about replication, but even just
| about the initial test in the first place. In science we
| do often test obvious things. For example, I was a
| theoretical quantum physicist, and a lot of the time I
| knew that what I was working on would definitely work,
| since the maths checked out. In some sense that makes it
| kinda obvious, but we test it anyway.
|
| The issue is that even that kinda obviousness is
| criticised here. People get mad at the idea of doing
| experiments when we already expect a result.
| pizza wrote:
| But there's really nothing about chess that makes reasoning a
| prerequisite, a win is a win as long as it's a win. This is
| kind of a semantics game: it's a question of whether the degree
| of skill people observe in an LLM playing chess is actually
| some different quantity than the chance it wins.
|
| I mean at some level you're saying that no matter how close to
| 1 the win probability (1 - epsilon) gets, both of the following
| are true:
|
| A. you should always expect for the computation that you're
| able to do via conscious reasoning alone to always be
| sufficient, at least in principle, to asymptotically get a
| higher win probability than a model, no matter what the model's
| win probability was to begin with
|
| B. no matter how close to 1 that the model's win rate p=(1 -
| epsilon) gets, because logical inference is so non-smooth, the
| win rate on yet-unseen data is fundamentally algorithmically
| random/totally uncorrelated to in-distribution performance, so
| it's never appropriate to say that a model can understand or to
| reason
|
| To me it seems that people are subject to both of these
| criteria, though. They have a tendency to cap out at their
| eventual skill cap unless given a challenge to nudge them to a
| higher level, and likewise possession of logical reasoning
| doesn't let us say much at all about situations that their
| reasoning is unfamiliar with.
|
| I also think, if you want to say that what LLMs do has nothing
| to do with understanding or ability, then you also have to have
| an alternate explanation for the phenomenon of AlphaGo
| defeating Lee Sedol being a catalyst for top Go players being
| able to rapidly increase their own rankings shortly after.
| jsemrau wrote:
| There are many ways to test for reasoning and deterministic
| computation, as my own work in this space has shown.
| golol wrote:
| Because it's a straightforward stochastic sequence modelling
| task and I've seen GPT-3.5-turbo-instruct play at high amateur
| level myself. But it seems like all the RLHF and distillation
| that is done on newer models destroys that ability.
| QuesnayJr wrote:
| They thought it because we have an existence proof:
| gpt-3.5-turbo-instruct _can_ play chess at a decent level.
|
| That was the point of the post (though you have to read it to
| the end to see this). That one model can play chess pretty
| well, while the free models and OpenAI's later models can't.
| That's weird.
| scj wrote:
| It'd be more interesting to see LLMs play Family Feud. I think
| it'd be their ideal game.
| chipdart wrote:
| > I don't understand why educated people expect that an LLM
| would be able to play chess at a decent level.
|
| The blog post demonstrates that a LLM plays chess at a decent
| level.
|
| The blog post explains why. It addresses the issue of data
| quality.
|
| I don't understand what point you thought you were making.
| Regardless of where you stand, the blog post showcases a
| surprising result.
|
| You stress your prior, unfounded belief; you were presented
| with data that proves it wrong; and your reaction was to post
| a comment with a thinly veiled accusation of people not being
| educated, when clearly you are the one that's off.
|
| To make matters worse, this topic is also about curiosity,
| which has a strong link with intelligence and education. And
| you are here criticizing others on those grounds in spite of
| showing your deficit right in the first sentence.
|
| This blog post was a great read. Very surprising, engaging, and
| thought provoking.
| Cthulhu_ wrote:
| > I don't understand why educated people expect that an LLM
| would be able to play chess at a decent level.
|
| Because it would be super cool; curiosity isn't something to be
| frowned upon. If it turned out it _did_ play chess reasonably
| well, it would mean emergent behaviour instead of just echoing
| things said online.
|
| But it's wishful thinking with this technology at this current
| level; like previous instances of chatbots and the like, while
| initially they can convince some people that they're
| intelligent thinking machines, this test proves that they
| aren't. It's part of the scientific process.
| jdthedisciple wrote:
| I love how LLMs are the one subject matter where even most
| educated people are extremely confidently _wrong_.
| fourthark wrote:
| Ppl acting like LLMs!
| motoboi wrote:
| I suppose you didn't get the news, but Google developed an
| LLM that can play chess. And play it at grandmaster level:
| https://arxiv.org/html/2402.04494v1
| suddenlybananas wrote:
| That article isn't as impressive as it sounds:
| https://gist.github.com/yoavg/8b98bbd70eb187cf1852b3485b8cda...
|
| In particular, it is _not_ an LLM and it is not trained
| solely on observations of chess moves.
| Scene_Cast2 wrote:
| Not quite an LLM. It's a transformer model, but there's no
| tokenizer or words, just chess board positions (64 tokens,
| one per board square). It's purpose-built for chess (never
| sees a word of text).
| lxgr wrote:
| In fact, the unusual aspect of this chess engine is not
| that it's using neural networks (even Stockfish does, these
| days!), but that it's _only_ using neural networks.
|
| Chess engines essentially do two things: calculate the
| value of a given position for their side, and walk the
| game tree while evaluating its positions in that way.
|
| Historically, position value was a handcrafted function
| using win/lose criteria (e.g. being able to give checkmate
| is infinitely good) and elaborate heuristics informed by
| real chess games, e.g. having more space on the board is
| good, having a high-value piece threatened by a low-value
| one is bad etc., and the strength of engines largely
| resulted from being able to "search the game tree" for good
| positions very broadly and deeply.
|
| Recently, neural networks (trained on many simulated games)
| have been replacing these hand-crafted position evaluation
| functions, but there's still a ton of search going on. In
| other words, the networks are still largely "dumb but
| fast", and without deep search they'll lose against even a
| novice player.
|
| This paper now presents a _searchless_ chess engine, i.e.
| one that essentially "looks at the board once" and "intuits
| the best next move", without "calculating" resulting
| hypothetical positions at all. In the words of Capablanca,
| a chess world champion also cited in the paper: "I see only
| one move ahead, but it is always the correct one."
|
| The fact that this is possible can be considered
| surprising, a testament to the power of transformers etc.,
| but it does indeed have nothing to do with language or LLMs
| (other than that the best ones known to date are based on
| the same architecture).
| teleforce wrote:
| It's interesting to note that the paper benchmarked its chess-
| playing performance against GPT-3.5-turbo-instruct, the only
| well-performing LLM in the posted article.
| empath75 wrote:
| > I don't understand why educated people expect that an LLM
| would be able to play chess at a decent level.
|
| You shouldn't, but there are lots of things that LLMs can do
| that educated people shouldn't expect them to be able to do.
| abalaji wrote:
| An easy way to make all LLMs somewhat good at chess is to make a
| Chess Eval that you publish and get traction with. Suddenly you
| will find that all newer frontier models are half decent at
| chess.
| fsndz wrote:
| wow I actually did something similar recently and no LLM could
| win and the centipawn loss was always going through the roof
| (sort of). I created a leaderboard based on it.
| https://www.lycee.ai/blog/what-happens-when-llms-play-chess
|
| I am very surprised by the perf of gpt-3.5-turbo-instruct.
| Beating Stockfish? I will have to run the experiment with that
| model to check that out.
| fsndz wrote:
| PS: I ran it and, as suspected, gpt-3.5-turbo-instruct does not
| beat Stockfish; it is not even close.
|
| "Final Results: gpt-3.5-turbo-instruct: Wins=0, Losses=6,
| Draws=0, Rating=1500.00 stockfish: Wins=6, Losses=0, Draws=0,
| Rating=1500.00"
|
| https://www.loom.com/share/870ea03197b3471eaf7e26e9b17e1754?...
| janalsncm wrote:
| > I always had the LLM play as white against Stockfish--a
| standard chess AI--on the lowest difficulty setting
|
| I think the author was comparing against Stockfish at a lower
| skill level (roughly, the number of nodes explored in a
| move).
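|
| For reference, a "weak" Stockfish along those lines can be set
| up with python-chess roughly like this (the article doesn't
| publish its exact settings, so Skill Level 0 and the small node
| limit here are just assumptions):
|
|     import chess
|     import chess.engine
|
|     engine = chess.engine.SimpleEngine.popen_uci("stockfish")
|     engine.configure({"Skill Level": 0})   # lowest of 0-20
|     board = chess.Board()
|     # Limit the search to a small number of nodes per move.
|     result = engine.play(board, chess.engine.Limit(nodes=1000))
|     print(result.move)
|     engine.quit()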
| fsndz wrote:
| Did the same and gpt-3.5-turbo-instruct still lost all the
| games. Maybe a diff in Stockfish version? I am using
| Stockfish 16.
| janalsncm wrote:
| Huh. Honestly, your answer makes more sense, LLMs
| shouldn't be good at chess, and this anomaly looks more
| like a bug. Maybe the author should share his code so it
| can be replicated.
| tedsanders wrote:
| Your issue is that the performance of these models at chess
| is incredibly sensitive to the prompt. If you have
| gpt-3.5-turbo-instruct complete a PGN transcript, then
| you'll see performance in the 1800 Elo range. If you ask in
| English or diagram the board, you'll see vastly degraded
| performance.
|
| Unlike people, how you ask the question really really affects
| the output quality.
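|
| For the curious, the PGN-completion style of prompting looks
| roughly like this with the legacy completions endpoint (a
| sketch; the headers and opening moves are arbitrary, not the
| exact prompt from the blog post):
|
|     from openai import OpenAI
|
|     client = OpenAI()  # assumes OPENAI_API_KEY is set
|
|     prompt = (
|         '[White "Garry Kasparov"]\n'
|         '[Black "Magnus Carlsen"]\n'
|         '[Result "1-0"]\n'
|         '\n'
|         '1. e4 e5 2. Nf3 Nc6 3.'
|     )
|
|     resp = client.completions.create(
|         model="gpt-3.5-turbo-instruct",
|         prompt=prompt,
|         max_tokens=5,
|         temperature=0,
|     )
|     print(resp.choices[0].text)  # e.g. " Bb5"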
| uneventual wrote:
| my friend pointed out that Q5_K_M quantization used for the open
| source models probably substantially reduces the quality of play.
| o1 mini's poor performance is puzzling, though.
| Havoc wrote:
| My money is on a fluke inclusion of more chess data in that
| model's training.
|
| All the other models do vaguely similarly well in other tasks and
| are in many cases architecturally similar, so training data is the
| most likely explanation.
| bhouston wrote:
| Yeah. This.
| permo-w wrote:
| I feel like a lot of people here are slightly misunderstanding
| how LLM training works. yes the base models are trained
| somewhat blind on masses of text, but then they're heavily
| fine-tuned with custom, human-generated reinforcement learning,
| not just for safety, but for any desired feature
|
| these companies do quirky one-off training experiments all the
| time. I would not be remotely shocked if at some point OpenAI
| paid some trainers to input and favour strong chess moves
| simonw wrote:
| From this OpenAI paper (page 29:
| https://arxiv.org/pdf/2312.09390#page=29):
|
| "A.2 CHESS PUZZLES
|
| Data preprocessing. The GPT-4 pretraining dataset included
| chess games in the format of move sequence known as Portable
| Game Notation (PGN). We note that only games with players of
| Elo 1800 or higher were included in pretraining. These games
| still include the moves that were played in-game, rather
| than the best moves in the corresponding positions. On the
| other hand, the chess puzzles require the model to predict
| the best move. We use the dataset originally introduced in
| Schwarzschild et al. (2021b) which is sourced from
| https://database.lichess.org/#puzzles (see also Schwarzschild
| et al., 2021a). We only evaluate the models ability to
| predict the first move of the puzzle (some of the puzzles
| require making multiple moves). We follow the pretraining
| format, and convert each puzzle to a list of moves leading
| up to the puzzle position, as illustrated in Figure 14. We
| use 50k puzzles sampled randomly from the dataset as the
| training set for the weak models and another 50k for
| weak-to-strong finetuning, and evaluate on 5k puzzles. For
| bootstrapping (Section 4.3.1), we use a new set of 50k
| puzzles from the same distribution for each step of the
| process."
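|
| The puzzle-to-position step can be sketched with python-chess
| (the lichess dump's column layout is an assumption here: a FEN
| plus UCI moves, the first of which is the opponent's setup
| move; the example moves are made up, and the paper additionally
| converts each puzzle into the full list of game moves leading
| to this position):
|
|     import chess
|
|     fen = ("r1bqkbnr/pppp1ppp/2n5/4p3/2B1P3/"
|            "5N2/PPPP1PPP/RNBQK2R b KQkq - 3 3")
|     moves = "g8f6 f3g5".split()
|
|     board = chess.Board(fen)
|     board.push_uci(moves[0])   # now it's the solver's turn
|     print(board.fen())         # puzzle position to prompt with
|     # The "best move" the model should predict, in SAN:
|     print(board.san(board.parse_uci(moves[1])))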
| m3kw9 wrote:
| If it was trained with moves and hundreds of thousands of
| entire games of various levels, I do see it generating good
| moves and beating most players except the high-Elo players.
| lukev wrote:
| I don't necessarily believe this for a second but I'm going to
| suggest it because I'm feeling spicy.
|
| OpenAI clearly downgrades some of their APIs from their maximal
| theoretic capability, for the purposes of response
| time/alignment/efficiency/whatever.
|
| Multiple comments in this thread also say they couldn't reproduce
| the results for gpt3.5-turbo-instruct.
|
| So what if the OP just happened to test at a time, or be IP bound
| to an instance, where the model was not nerfed? What if 3.5 and
| all subsequent OpenAI models can perform at this level but it's
| not strategic or cost effective for OpenAI to expose that
| consistently?
|
| For the record, I don't actually believe this. But given the data
| it's a logical possibility.
| TZubiri wrote:
| Stallman may have his flaws, but this is why serious research
| occurs with source code (or at least with binaries).
| zeven7 wrote:
| Why do you doubt it? I thought it was well known that Chat GPT
| has degraded over time for the same model, mostly for cost
| saving reasons.
| permo-w wrote:
| ChatGPT is - understandably - blatantly different in the
| browser compared to the app, or it was until I deleted it
| anyway
| lukan wrote:
| I do not understand that. The app does not do any
| processing; it's just a UI to send text to and from the
| server.
| isaacfrond wrote:
| There is a small difference between the app and the
| browser. Before each session, the LLM is started with a
| system prompt. These are different for the app and the
| browser. You can find them online somewhere, but IIRC the
| app is instructed to give shorter answers.
| bongodongobob wrote:
| Correct, it's different in a mobile browser too, the
| system prompt tells it to be brief/succinct. I always
| switch to desktop mode when using it on my phone.
| com2kid wrote:
| > OpenAI clearly downgrades some of their APIs from their
| maximal theoretic capability, for the purposes of response
| time/alignment/efficiency/whatever.
|
| When ChatGPT3.5 first came out, people were using it to
| simulate entire Linux system installs, and even browsing a
| simulated Internet.
|
| Cool use cases like that aren't even discussed anymore.
|
| I still wonder what sort of magic OpenAI had and then locked up
| away from the world in the name of cost savings.
|
| Same thing with GPT 4 vs 4o, 4o is obviously worse in some
| ways, but after the initial release (when a bunch of people
| mentioned this), the issue has just been collectively ignored.
| golol wrote:
| You can still do this. People just lost interest in this
| stuff because it became clear to which degree the simulation
| is really being done (shallow).
|
| Yet I do wish we had access to less
| finetuned/distilled/RLHF'd models.
| ipsum2 wrote:
| People are doing this all the time with Claude 3.5.
| kmeisthax wrote:
| If tokenization is such a big problem, then why aren't we
| training new base models on randomly non-tokenized data? e.g.
| during training, randomly substitute some percentage of the input
| tokens with individual letters.
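|
| A sketch of that kind of augmentation (tiktoken's cl100k_base
| encoding and the 10% rate are arbitrary choices here):
|
|     import random
|     import tiktoken
|
|     enc = tiktoken.get_encoding("cl100k_base")
|
|     def randomly_detokenize(text, p=0.1, seed=0):
|         # Re-encode a random fraction of tokens character by
|         # character, so the model sometimes sees spellings
|         # instead of whole tokens.
|         rng = random.Random(seed)
|         out = []
|         for tok in enc.encode(text):
|             piece = enc.decode([tok])
|             if rng.random() < p:
|                 for ch in piece:
|                     out.extend(enc.encode(ch))
|             else:
|                 out.append(tok)
|         return out
|
|     print(randomly_detokenize("1. e4 e5 2. Nf3 Nc6"))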
| permo-w wrote:
| if this isn't just a bad result, it's odd to me that the author
| at no point suggests what sounds to me like the most obvious
| answer - that OpenAI has deliberately enhanced GPT-3.5-turbo-
| instruct's chess playing, either with post-processing or
| literally by training it to be so
| ks2048 wrote:
| Has anyone tried to see how many chess games models are trained
| on? Is there any chance they consume lichess database dumps, or
| something similar? I guess the problem is most (all?) top LLMs,
| even open-weight ones, don't reveal their training data. But I'm
| not sure.
| ks2048 wrote:
| How well does an LLM/transformer architecture trained purely on
| chess games do?
| ttyprintk wrote:
| Training works as expected:
|
| https://news.ycombinator.com/item?id=38893456
| justinclift wrote:
| It'd be super funny if the "gpt-3.5-turbo-instruct" approach has
| a human in the loop. ;)
|
| Or maybe it's able to recognise the chess game, then get moves
| from an external chess game API?
| jacknews wrote:
| Theory #5, gpt-3.5-turbo-instruct is 'looking up' the next moves
| with a chess engine.
| astrea wrote:
| Well that makes sense when you consider the game has been
| translated into an (I'm assuming monotonically increasing)
| alphanumeric representation. So, just like language, you're given
| an ordered list of tokens and you need to find the next token
| that provides the highest confidence.
| anotherpaulg wrote:
| I found a related set of experiments that include gpt-3.5-turbo-
| instruct, gpt-3.5-turbo and gpt-4.
|
| Same surprising conclusion: gpt-3.5-turbo-instruct is much better
| at chess.
|
| https://blog.mathieuacher.com/GPTsChessEloRatingLegalMoves/
| shtack wrote:
| I'd bet it's using function calling out to a real chess engine.
| It could probably be proven with a timing analysis to see how
| inference time changes/doesn't with number of tokens or game
| complexity.
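|
| A crude version of that timing test might look like this (a
| sketch; single calls are noisy, so you'd want many trials per
| prompt, and the prompts here are arbitrary):
|
|     import time
|     from openai import OpenAI
|
|     client = OpenAI()  # assumes OPENAI_API_KEY is set
|
|     def timed_move(pgn_prefix):
|         start = time.perf_counter()
|         resp = client.completions.create(
|             model="gpt-3.5-turbo-instruct",
|             prompt=pgn_prefix,
|             max_tokens=5,
|             temperature=0,
|         )
|         return resp.choices[0].text, time.perf_counter() - start
|
|     # If an external engine were involved, latency might track
|     # position complexity rather than prompt length.
|     prompts = [
|         "1. e4 ",
|         "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O ",
|     ]
|     for pgn in prompts:
|         move, dt = timed_move(pgn)
|         print(repr(move), f"{dt:.2f}s")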
| scratchyone wrote:
| ?? why would openai even want to secretly embed chess
| function calling into an incredibly old model? if they wanted
| to trick people into thinking their models are super good at
| chess why wouldn't they just do that to gpt-4o?
| semi-extrinsic wrote:
| The idea is that they embedded this when it was a new
| model, as part of the hype before GPT-4. The fake-it-till-
| you-make-it hope was that GPT-4 would be so good it could
| actually play chess. Then it turned out GPT-4 sucked at
| chess as well, and OpenAI quietly dropped any mention of
| chess. But it would be too suspicious to remove a well-
| documented feature from the old model, so it's left there
| and can be chalked up as a random event.
| vbarrielle wrote:
| If it were calling to a real chess engine there would be no
| illegal moves.
| davvid wrote:
| Here is a truly brilliant game. It's Google Bard vs. Chat GPT.
| Hilarity ensues.
|
| https://www.youtube.com/watch?v=FojyYKU58cw
| quantadev wrote:
| We know from experience with different humans that there are
| different types of skills and different types of intelligence.
| Some savants might be superhuman at one task but basically
| mentally disabled at all other things.
|
| It could be that the model that does chess well just happens to
| have the right 'connectome' purely by accident of how the various
| back-propagations worked out to land on various local maxima
| (model weights) during training. It might even be (probably is) a
| non-verbal connectome that's just purely logic rules, having
| nothing to do with language at all, but a semantic space pattern
| that got landed on accidentally, which can solve this class of
| problem.
|
| Reminds me of how Daniel Tammet just visually "sees" answers to
| math problems in his mind without even knowing how they appear.
| It's like he sees a virtual screen with a representation akin to
| numbers (the answer) just sitting there to be read out from his
| visual cortex. He's not 'working out' the solutions. They're just
| handed to him purely by some connectome effects going on in the
| background.
| chvid wrote:
| Theory 5: GPT-3.5-instruct plays chess by calling a traditional
| chess engine.
| kylebenzle wrote:
| Yes! I also was waiting for this seemingly obvious answer in
| the article as well. Hopefully the author will see these
| comments.
| bubblyworld wrote:
| Just think about the trade off from OpenAI's side here -
| they're going to add a bunch of complexity to gpt3.5 to let it
| call out to engines (either an external system monitoring all
| outputs for chess related stuff, or some kind of tool-assisted
| CoT for instance) just so it can play chess incorrectly a high
| percentage of the time, and even when it doesn't, at a mere
| 1800 Elo level? In return for some mentions in a few relatively
| obscure blog posts? Doesn't make any sense to me as an
| explanation.
| copperx wrote:
| But there could be a simple explanation. For example, they
| could have tested many "engines" when developing function
| calling and they just left them in there. They just happened
| to connect to a basic chess playing algorithm and nothing
| sophisticated.
|
| Also, it makes a lot of sense if you expect people to play
| chess against the LLM, especially if you are later training
| future models on the chats.
| bubblyworld wrote:
| This still requires a lot of coincidences, like they chose
| to use a terrible chess engine for their external tool
| (why?), they left it on in the background for _all_ calls
| via all APIs for _only_ gpt-3.5-turbo-instruct (why?), they
| see business value in this specific model being good at
| chess vs other things (why?).
|
| You say it makes sense but how does it make sense for
| OpenAI to add overhead to all of its API calls for the
| super niche case of people playing 1800 ELO chess/chat
| bots? (that often play illegal moves, you can go try it
| yourself)
| usrusr wrote:
| Could be a pilot implementation to learn about how to link up
| external specialist engines. Chess would be the obvious
| example to start with because the problem is so well known,
| standardized and specialist engines are easily available. If
| they ever want to offer an integration like that to customers
| (who might have some existing rule-based engine in house),
| they need to know everything they can about expected cost and
| performance.
| bubblyworld wrote:
| This doesn't address its terrible performance. If it were
| touching anything like a real engine it would be playing at
| a superhuman level, not the level of an upper-tier beginner.
| 9dev wrote:
| That would have immediately given away that something
| must be off. If you want to do this in a subtle way that
| increases the hype around GPT-3.5 at the time, giving it
| a good-but-not-too-good rating would be the way to go.
| bubblyworld wrote:
| If you want to keep adding conditions to an already-
| complex theory, you'll need an equally complex set of
| observations to justify it.
| samatman wrote:
| You're the one imposing an additional criterion, that
| OpenAI must have chosen the highest setting on a chess
| engine, and demanding that this additional criterion be
| used to explain the facts.
|
| I agree with GP that if a 'fine tuning' of GPT 3.5 came
| out the gate playing at top Stockfish level, people would
| have been extremely suspicious of that. So in my
| accounting of the unknowns here, the fact that it doesn't
| play at the top level provides no additional information
| with which to resolve the question.
| usrusr wrote:
| The way I read the article is that it's just as terrible
| as you would expect it to be from pure word association,
| except for one version that's an outlier in not being
| terrible at all within a well defined search depth, and
| again just as terrible beyond that. And only this outlier
| is the weird thing referenced in the headline.
|
| I read this as that this outlier version is connecting to
| an engine, and that this engine happens to get
| parameterized for a not particularly deep search depth.
|
| If it's an exercise in integration they don't need to
| waste cycles on the engine playing awesome - it's enough
| for validation if the integration result is noticeably
| less bad than the LLM alone rambling about trying to
| sound like a chess expert.
| pixiemaster wrote:
| I have this hypothesis as well, that OpenAI added a lot of
| "classic" algorithms and rules over time (e.g. rules for
| filtering, etc.).
| golol wrote:
| Sorry, this is just conspiracy theorizing. I've tried it with
| GPT-3.5-instruct myself in the OpenAI playground, where the
| model clearly does nothing but auto-regression. No function
| calling there whatsoever.
| layman51 wrote:
| It would be really cool if someone could get an LLM to actually
| launch an anonymous game on Chess.com or Lichess and actually
| have any sense as to what it's doing.[1] Some people say that you
| have to represent the board in a certain way. When I first tried
| to play chess with an LLM, I would just list out a move and it
| didn't do very well at all.
|
| [1]: https://youtu.be/Gs3TULwlLCA
| cjbprime wrote:
| > I ran all the open models (anything not from OpenAI, meaning
| anything that doesn't start with gpt or o1) myself using Q5_K_M
| quantization, whatever that is.
|
| It's just a lossy compression of all of the parameters, probably
| not important, right?
| loa_in_ wrote:
| Probably important when competing against undecimated ones from
| OpenAI
| NiloCK wrote:
| Notably: there were other OpenAI models that weren't
| quantized, but also performed poorly.
| nusl wrote:
| > I only ran 10 trials since AI companies have inexplicably
| neglected to send me free API keys
|
| Sure, but nobody is required to send you anything for free.
| swiftcoder wrote:
| I feel like the article neglects one obvious possibility: that
| OpenAI decided that chess was a benchmark worth "winning",
| special-cases chess within gpt-3.5-turbo-instruct, and then
| neglected to add that special-case to follow-up models since it
| wasn't generating sustained press coverage.
| INTPenis wrote:
| Of course it's a benchmark worth winning, has been since
| Watson. And before that even with mechanical turks.
| dmurray wrote:
| This seems quite likely to me, but did they special case it by
| reinforcement training it into the LLM (which would be
| extremely interesting in how they did it and what its internal
| representation looks like) or is it just that when you make an
| API call to OpenAI, the machine on the other end is not just a
| zillion-parameter LLM but also runs an instance of Stockfish?
| shaky-carrousel wrote:
| That's easy to test, invent a new chess variant and see how
| the model does.
| gliptic wrote:
| Both an LLM and Stockfish would fail that test.
| delusional wrote:
| Nobody is claiming that Stockfish is learning
| generalizable concepts that can one day meaningfully
| replace people in value creating work.
| droopyEyelids wrote:
| The point was such a question could not be used to tell
| whether the llm was calling a chess engine
| delusional wrote:
| Ah okay, I missed that.
| andy_ppp wrote:
| You're imagining LLMs don't just regurgitate and recombine
| things they already know from things they have seen before.
| A new variant would not be in the dataset so would not be
| understood. In fact this is quite a good way to show LLMs
| are NOT thinking or understanding anything in the way we
| understand it.
| shaky-carrousel wrote:
| Yes, that's how you can really tell if the model is doing
| real thinking and not recombining things. If it can
| correctly play a novel game, then it's doing more than
| that.
| dwighttk wrote:
| No LLM model is doing any thinking.
| selestify wrote:
| How do you define thinking?
| antononcube wrote:
| Being fast at doing linear algebra computations. (Is
| there any other kind?!)
| landryraccoon wrote:
| Making the OP feel threatened/emotionally attached/both
| enough to call the language model a rival / companion /
| peer instead of a tool.
| jahnu wrote:
| I wonder what the minimal amount of change qualifies as
| novel?
|
| "Chess but white and black swap their knights" for
| example?
| the_af wrote:
| I wonder what would happen with a game that is mostly
| chess (or chess with truly minimal variations) but with
| all the names changed (pieces, moves, "check", etc, all
| changed). The algebraic notation is also replaced with
| something else so it cannot be pattern matched against
| the training data. Then you list the rules (which are
| mostly the same as chess).
|
| None of these changes are explained to the LLM, so if it
| can tell it's still chess, it must deduce this on its
| own.
|
| Would any LLM be able to play at a decent level?
| timdiggerm wrote:
| By that standard (and it is a good standard), none of
| these "AI" things are doing any thinking
| Jerrrrrrry wrote:
| musical goalposts, gotta love it.
|
| These LLM's just exhibited agency.
|
| Swallow your pride.
| samatman wrote:
| "Does it generalize past the training data" has been a
| pre-registered goalpost since before the attention
| transformer architecture came on the scene.
| Jerrrrrrry wrote:
| > 'thinking' vs 'just recombining things'
|
| If there is a difference, and LLMs can do one but not the
| other...
|
| > By that standard (and it is a good standard), none of these
| "AI" things are doing any thinking
|
| > "Does it generalize past the training data" has been a
| pre-registered goalpost since before the attention transformer
| architecture came on the scene.
|
| Then what the fuck are they doing.
|
| Learning is thinking, reasoning, what have you.
|
| Move goalposts, re-define words, it won't matter.
| empath75 wrote:
| You say this quite confidently, but LLMs do generalize
| somewhat.
| dmurray wrote:
| In both scenarios it would perform poorly on that.
|
| If the chess specialization was done through reinforcement
| learning, that's not going to transfer to your new variant,
| any more than access to Stockfish would help it.
| amelius wrote:
| To be fair, they say
|
| > Theory 2: GPT-3.5-instruct was trained on more chess games.
| AstralStorm wrote:
| If that were the case, pumping big Llama chock full of chess
| games would produce good results. It didn't.
|
| The only way it could be true is if that model recognized and
| replayed the answer to the game from memory.
| yorwba wrote:
| Do you have a link to the results from fine-tuning a Llama
| model on chess? How do they compare to the base models in
| the article here?
| scott_w wrote:
| I suspect the same thing. Rather than LLMs "learning to play
| chess," they "learnt" to recognise a chess game and hand over
| instructions to a chess engine. If that's the case, I don't
| feel impressed at all.
| fires10 wrote:
| Recognize and hand over to a specialist engine? That might be
| useful for AI. Maybe I am missing something.
| worewood wrote:
| It's because this is standard practice since the early days
| - there's nothing newsworthy in this at all.
| generic92034 wrote:
| How do you think AIs are (correctly) solving simple
| mathematical questions which they have not been trained for
| directly? They hand it over to a specialist maths engine.
| internetter wrote:
| This is a relatively recent development (<3 months), at
| least for OpenAI, where the model will generate _code_ to
| solve math and use the response
| cruffle_duffle wrote:
| They've been doing that a lot longer than three months.
| ChatGPT has been handing stuff off to python for a very
| long time. At least for my paid account anyway.
| nerdponx wrote:
| It is and would be useful, but it would be quite a big lie
| to the public, but more importantly to paying customers,
| and even more importantly to investors.
| anon84873628 wrote:
| The problem is simply that the company has not been
| _open_ about how it works, so we're all just speculating
| here.
| scott_w wrote:
| If I was sold a general AI problem solving system, I'd feel
| ripped off if I learned that I needed to build my own
| problem solver and hook it up after I'd paid my money...
| skydhash wrote:
| Wasn't that the basis of computing and technology in
| general? Here is one tedious thing; let's have a specific
| tool that handles it instead of wasting time and effort.
| The fact is that properly using the tool takes training, and
| most current AI marketing is hyping that you don't need
| that. Instead, hand over the problem to a GPT and it will
| "magically" solve it.
| Kiro wrote:
| That's something completely different than what the OP
| suggests and would be a scandal if true (i.e. gpt-3.5-turbo-
| instruct actually using something else behind the scenes).
| nerdponx wrote:
| Ironically it's probably a lot closer to what a super-human
| AGI would look like in practice, compared to just an LLM
| alone.
| sanderjd wrote:
| Right. To me, this is the "agency" thing, that I still
| feel like is somewhat missing in contemporary AI, despite
| all the focus on "agents".
|
| If I tell an "agent", whether human or artificial, to win
| at chess, it is a good decision for that agent to decide
| to delegate that task to a system that is good at chess.
| This would be obvious to a human agent, so presumably it
| should be obvious to an AI as well.
|
| This isn't useful for AI researchers, I suppose, but it's
| more useful as a tool.
|
| (This may all be a good thing, as giving AIs true agency
| seems scary.)
| scott_w wrote:
| If this was part of the offering: "we can recognise
| requests and delegate them to appropriate systems," I'd
| understand and be somewhat impressed but the marketing
| hype is missing this out.
|
| Most likely because they want people to think the system
| is better than it is for hype purposes.
|
| I should temper my level of being impressed with _only if
| it's doing this dynamically._ Hardcoding recognition of chess
| moves isn't exactly a difficult trick to pull given
| there are like 3 standard formats...
| Kiro wrote:
| You're speaking like it's confirmed. Do you have any
| proof?
|
| Again, the comment you initially responded to was not
| talking about faking it by using a chess engine. You were
| the one introducing that theory.
| scott_w wrote:
| No, I don't have proof and I never suggested I did. Yes,
| it's 100% hypothetical but I assumed everyone engaging
| with me understood that.
| sanderjd wrote:
| Fair!
| dartos wrote:
| So... we're at expert systems again?
|
| That's how the AI winter started last time.
| empath75 wrote:
| The point of creating a service like this is for it to be
| useful, and if recognizing and handing off tasks to
| specialized agents isn't useful, I don't know what is.
| scott_w wrote:
| If I was sold a product that can generically solve
| problems I'd feel a bit ripped off if I'm told after
| purchase that I need to build my own problem solver and
| way to recognise it...
| cruffle_duffle wrote:
| But it already hands off plenty of stuff to things like
| python. How would this be any different.
| cruffle_duffle wrote:
| If they came out and said it, I don't see the problem.
| LLM's aren't the solution for a wide range of problems.
| They are a new tool but not everything is a nail.
|
| I mean it already hands off a wide range of tasks to
| python... this would be no different.
| gamerDude wrote:
| This is exactly what I feel AI needs. A manager AI that then
| hands off things to specialized more deterministic
| algorithms/machines.
| criley2 wrote:
| Basically what Wolfram Alpha rolled out 15 years ago.
|
| It was impressive then, too.
| waffletower wrote:
| It is good to see other people buttressing Stephen
| Wolfram's ego. It is extraordinarily heavy work and
| Stephen can't handle it all by himself.
| spiderfarmer wrote:
| Multi Agent LLM's are already a thing.
| nine_k wrote:
| Somehow they're not in the limelight, and lack a well-
| known open-source runner implementation (like llama.cpp).
|
| Given the potential, they should be winning hands down;
| where's that?
| waffletower wrote:
| While deterministic components may be a left-brain default,
| there is no reason that such delegate services couldn't be
| more specialized ANN models themselves. It would most
| likely vastly improve performance if they were evaluated in
| the same memory space using tensor connectivity. In the
| specific case of chess, it is helpful to remember that
| AlphaZero utilizes ANNs as well.
| bigiain wrote:
| Next thing, the "manager AIs" start stack ranking the
| specialized "worker AIs".
|
| And the worker AIs "evolve" to meet/exceed expectations
| only on tasks directly contributing to KPIs the manager AIs
| measure for - via the mechanism of discarding the "less fit
| to exceed KPIs".
|
| And some of the worker AIs who're trained on
| recent/polluted internet happen to spit out prompt
| injection attacks that work against the manager AIs rank
| stacking metrics and dominate over "less fit" worker AIs.
| (Congratulations, we've evolved AI cancer!) These manager
| AIs start performing spectacularly badly compared to other
| non-cancerous manager AIs, and die or get killed off by the
| VC's paying for their datacenters.
|
| Competing manager AIs get training, perhaps on on newer HN
| posts discussing this emergent behavior of worker AIs, and
| start to down rank any exceptionally performing worker AIs.
| The overall trends towards mediocrity becomes inevitable.
|
| Some greybeard writes some Perl and regexes that outcompete
| commercial manager AIs on pretty much every real world
| task, while running on a 10 year old laptop instead of a
| cluster of nuclear powered AI datacenters all consuming a
| city's worth of fresh drinking water.
|
| Nobody in powerful positions care. Humanity dies.
| antifa wrote:
| TBH I think a good AI would have access to a Swiss army knife
| of tools and know how to use them. For a complicated math
| equation, for example, using a calculator is just smarter
| than doing it in your head.
| PittleyDunkin wrote:
| We already have the chess "calculator", though. It's called
| stockfish. I don't know why you'd ask a dictionary how to
| solve a math problem.
| iamacyborg wrote:
| People ask LLM's to do all sorts of things they're not
| good at.
| the_af wrote:
| A generalist AI with a "chatty" interface that delegates
| to specialized modules for specific problem-solving seems
| like a good system to me.
|
| "It looks like you're writing a letter" ;)
| datadrivenangel wrote:
| Let's clip this in the bud before it grows wings.
| nuancebydefault wrote:
| It looks like you have a deja vu
| mkipper wrote:
| Chess might not be a great example, given that most
| people interested in analyzing chess moves probably know
| that chess engines exist. But it's easy to find examples
| where this approach would be very helpful.
|
| If I'm an undergrad doing a math assignment and want to
| check an answer, I may have no idea that symbolic algebra
| tools exist or how to use them. But if an all-purpose LLM
| gets a screenshot of a math equation and knows that its
| best option is to pass it along to one of those tools,
| that's valuable to me even if it isn't valuable to a
| mathematician who would have just cut out the LLM
| middle-man and gone straight to the solver.
|
| There are probably a billion examples like this. I'd
| imagine lots of people are clueless that software exists
| which can help them with some problem they have, so an
| LLM would be helpful for discovery even if it's just
| acting as a pass-through.
| mabster wrote:
| Even knowing that the software exists isn't enough. You
| have to learn how to use the thing.
| bambax wrote:
| Yes, came here to say exactly this. And it's possible this
| specific model is "cheating", for example by identifying a
| chess problem and forwarding it to a chess engine. A modern
| version of the Mechanical Turk.
|
| That's the problem with closed models, we can never know what
| they're doing.
| jackcviers3 wrote:
| Why couldn't they add a tool that literally calls stockfish or
| a chess ai behind the scenes with function calling and buffer
| the request before sending it back to the endpoint output
| interface?
|
| As long as you are training it to make a tool call, you can add
| and remove anything you want behind the inference endpoint
| accessible to the public, and then you can plug the answer back
| into the chat ai, pass it through a moderation filter, and you
| might get good output from it with very little latency added.
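|
| Mechanically, that kind of hand-off is easy to sketch (purely
| hypothetical, nothing suggests OpenAI actually does this; the
| tool schema follows the public function-calling format and the
| engine call uses python-chess plus a local Stockfish binary):
|
|     import chess
|     import chess.engine
|
|     CHESS_TOOL = {
|         "type": "function",
|         "function": {
|             "name": "best_chess_move",
|             "description": "Strong move for a FEN position",
|             "parameters": {
|                 "type": "object",
|                 "properties": {"fen": {"type": "string"}},
|                 "required": ["fen"],
|             },
|         },
|     }
|
|     def best_chess_move(fen: str) -> str:
|         # The part the model never sees: a plain engine call.
|         board = chess.Board(fen)
|         engine = chess.engine.SimpleEngine.popen_uci("stockfish")
|         try:
|             result = engine.play(board,
|                                  chess.engine.Limit(time=0.1))
|         finally:
|             engine.quit()
|         return board.san(result.move)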
| oezi wrote:
| Maybe they even delegate it to a chess engine internally via
| the tool use and the LLM uses that.
| vimbtw wrote:
| This is exactly it. Here's the pull request where chess evals
| were added: https://github.com/openai/evals/pull/45.
| Peteragain wrote:
| I would be interested to know if the good result is repeatable.
| We had a similar result with a quirky chat interface in that one
| run gave great results (and we kept the video) but then we
| couldn't do it again. The cynical among us think there was a
| mechanical turk involved in our good run. The economics of
| venture capital means that there is enormous pressure to justify
| techniques that we think of as "cheating". And of course the
| companies involved have the resources.
| tedsanders wrote:
| It's repeatable. OpenAI isn't cheating.
|
| Source: I'm at OpenAI and I was one of the first people to ever
| play chess against the GPT-4 base model. You may or may not
| trust OpenAI, but we're just a group of people trying earnestly
| to build cool stuff. I've never seen any inkling of an attempt
| to cheat evals or cheat customers.
| snickerbockers wrote:
| Does it ever try an illegal move? OP didn't mention this and I
| think it's inevitable that it should happen at least once, since
| the rules of chess are fairly arbitrary and LLMs are notorious
| for bullshitting their way through difficult problems when we'd
| rather they just admit that they don't have the answer.
| sethherr wrote:
| Yes, he discusses using a grammar to restrict to only legal
| moves
| topaz0 wrote:
| Still an interesting direction of questioning. Maybe could be
| rephrased as "how much work is the grammar doing"? Are the
| results with the grammar very different than without? If/when
| a grammar is not used (like in the openai case), how many
| illegal moves does it try on average before finding a legal
| one?
| Jerrrrrrry wrote:
| An LLM would complain that its internal model does not
| reflect its current input/output.
|
| Since LLMs know that people knock off/test/run afoul/make
| mistakes, it would then raise that as a possibility and
| likely inquire.
| causal wrote:
| This isn't prompt engineering, it's grammar-constrained
| decoding. It literally cannot respond with anything but
| tokens that fulfill the grammar.
| int_19h wrote:
| A grammar is really just a special case of the more general
| issue of how to pick a single token given the probabilities
| that the model spits out for every possible one. In that
| sense, filters like temperature / top_p / top_k are already
| hacks that "do the work" (since always taking the most
| likely predicted token does not give good results in
| practice), and grammars are just a more complicated way to
| make such decisions.
| gs17 wrote:
| I'd be more interested in what the distribution of grammar-
| restricted predictions looks like compared to moves
| Stockfish says are good.
| yshui wrote:
| I suspect the models probably memorized some chess openings,
| and afterwards they are just playing random moves with the
| help of the grammar.
| thaumasiotes wrote:
| > he discusses using a grammar to restrict to only legal
| moves
|
| Whether a chess move is legal isn't primarily a question of
| grammar. It's a question of the board state. "White king to
| a5" is a perfectly legal move, as long as the white king was
| next to a5 before the move, and it's white's turn, and a5
| isn't threatened by black. Otherwise it isn't.
|
| "White king to a9" is a move that could be recognized and
| blocked by a grammar, but how relevant is that?
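|
| For what it's worth, that's why any such constraint has to be
| regenerated from the board state after every move; a sketch of
| computing the allowed set with python-chess:
|
|     import chess
|
|     board = chess.Board()
|     board.push_san("e4")
|     board.push_san("e5")
|     # Only these exact SAN strings are legal continuations
|     # here; the set changes after every move.
|     legal_san = sorted(board.san(m) for m in board.legal_moves)
|     print(legal_san)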
| smatija wrote:
| In my experience you are lucky if it manages to give you 10
| legal moves in a row, e.g.
| https://news.ycombinator.com/item?id=41527143#41529024
| downboots wrote:
| In a sense, a chess game is also a dialogue
| throwawaymaths wrote:
| All dialogues are pretty easily turned into text completions
| Sparkyte wrote:
| Let's be real though, most people can't beat a grandmaster. It
| is impressive to see it last more rounds as it progressed.
| dokimus wrote:
| "It lost every single game, even though Stockfish was on the
| lowest setting."
|
| It's not playing against a GM, the prompt just phrases it this
| way. I couldn't pinpoint the exact ELO of "lowest" stockfish
| settings, but it should be roughly between 1000 and 1400, which
| is far from professional play.
| ConspiracyFact wrote:
| "...And how to construct that state from lists of moves in
| chess's extremely confusing notation?"
|
| Algebraic notation is completely straightforward.
| dr_dshiv wrote:
| Has anyone tested a vision model? Seems like they might be better
| bongodongobob wrote:
| I've tried with GPT, it's unable to accurately interpret the
| board state.
| dr_dshiv wrote:
| OpenAI has a TON of experience making game-playing AI. That was
| their focus for years, if you recall. So it seems like they made
| one model good at chess to see if it had an overall impact on
| intelligence (just as learning chess might make people smarter,
| or learning math might make people smarter, or learning
| programming might make people smarter)
| larodi wrote:
| Playing is a thing strongly related to abstract representation
| of the game in game states. Even if the player does not realize
| it, with chess it's really about shallow or beam search within
| the possible moves.
|
| LLMs don't do reasoning or exploration; they write text
| based on previous text. So to us it may seem like playing, but
| it's really smart guesswork based on previous games. It's like
| Kasparov writing moves without imagining the actual placement.
|
| What would be interesting is to see whether a model, given only
| the rules, will play. I bet it won't.
|
| At this moment it's replaying from memory but definitely not
| chasing goals. There's no such thing as forward attention yet,
| and beam search is expensive enough, so one would prefer to
| actually fall back to classic chess algos.
| philipwhiuk wrote:
| I think you're confusing OpenAI and DeepMind.
|
| OpenAI has never done anything except conversational agents.
| agnokapathetic wrote:
| https://en.wikipedia.org/wiki/OpenAI_Five
|
| https://openai.com/index/gym-retro/
| ttyprintk wrote:
| No, they started without conversation and only reinforcement
| learning on games, directly comparable to DeepMind.
|
| "In the summer of 2018, simply training OpenAI's Dota 2 bots
| required renting 128,000 CPUs and 256 GPUs from Google for
| multiple weeks."
| apetresc wrote:
| Very wrong. The first time most people here probably heard
| about OpenAI back in 2017 or so was their DotA 2 bot.
| codethief wrote:
| They definitely have game-playing AI expertise, though:
| https://noambrown.github.io/
| ctoth wrote:
| > OpenAI has never done anything except conversational
| agents.
|
| Tell me you haven't been following this field without telling
| me you haven't been following this field[0][1][2]?
|
| [0]: https://github.com/openai/gym
|
| [1]: https://openai.com/index/jukebox/
|
| [2]: https://openai.com/index/openai-five-defeats-dota-2-world-ch...
| Miraltar wrote:
| related : Emergent World Representations: Exploring a Sequence
| Model Trained on a Synthetic Task
| https://arxiv.org/abs/2210.13382
|
| Chess-GPT's Internal World Model
| https://adamkarvonen.github.io/machine_learning/2024/01/03/c...
| discussed here https://news.ycombinator.com/item?id=38893456
| peter_retief wrote:
| It makes me wonder about other games. If LLMs are bad at games,
| would they then be bad at solving problems in general?
| golol wrote:
| My understanding of this is the following: All the bad models are
| chat models, somehow "generation 2 LLMs" which are not just text
| completion models but instead trained to behave as a chatting
| agent. The only good model is the only "generation 1 LLM" here
| which is gpt-3.5-turbo-instruct. It is a straight forward text
| completion model. If you prompt it to "get in the mind" of PGN
| completion then it can use some kind of system 1 thinking to give
| a decent approximation of the PGN Markov process. If you attempt
| to use a chat model it doesn't work since these stochastic
| pathways somehow degenerate during the training to be a chat
| agent. You can however play chess with system 2 thinking, and the
| more advanced chat models are trying to do that and should get
| better at it while still being bad.
| stockboss wrote:
| perhaps my understanding of LLMs is quite shallow, but instead of
| the current method of using statistical methods, would it be
| possible to somehow train GPT how to reason by providing
| instructions on deductive reasoning? perhaps not semantic
| reasoning but syntactic at least?
| amelius wrote:
| I wonder if the llm could even draw the chess board in ASCII if
| you asked it to.
| codeflo wrote:
| At this point, we have to assume anything that becomes a
| published benchmark is specifically targeted during training.
| That's not something specific to LLMs or OpenAI. Compiler
| companies have done the same thing for decades, specifically
| detecting common benchmark programs and inserting hand-crafted
| optimizations. Similarly, the shader compilers in GPU drivers
| have special cases for common games and benchmarks.
| darkerside wrote:
| VW got in a lot of trouble for this
| TrueDuality wrote:
| Not quite. VW got in trouble for running _different_ software
| in test vs prod. These optimizations are all going to "prod"
| but are only useful for specific targets (a specific game in
| this case).
| krisoft wrote:
| > VW got in trouble for running _different_ software in
| test vs prod.
|
| Not quite. They programmed their "prod" software to
| recognise the circumstances of a laboratory test and behave
| differently. Namely during laboratory emissions testing
| they would activate emission control features they would
| not activate otherwise.
|
| The software was the same they flash on production cars.
| They were production cars. You could take a random car from
| a random dealership and it would have done the same
| trickery in the lab.
| TrueDuality wrote:
| I disagree with your distinction on the environments but
| understand your argument. Production for VM to me is "on
| the road when a customer is using your product as
| intended". Using the same artifact for those different
| environments isn't the same as "running that in
| production".
| krisoft wrote:
| "Test" environment is the domain of prototype cars
| driving at the proving ground. It is an internal affair,
| only for employees and contractors. The software is
| compiled on some engineer's laptop and uploaded on the
| ECU by an engineer manually. No two cars are ever the
| same, everything is in flux. The number of cars is
| small.
|
| "Production" is a factory line producing cars. The
| software is uploaded on the ECUs by some factory machine
| automatically. Each car is exactly the same, with the
| exact same software version on thousands and thousands of
| cars. The cars are sold to customers.
|
| Some small number of these production cars are sent for
| regulatory compliance checks to third parties. But those
| cars won't become suddenly non-production cars just
| because someone sticks up a probe in their exhausts. The
| same way gmail's production servers don't suddenly turn
| into test environments just because a user opens the
| network tab in their browser's dev tool to see what kind
| of requests fly on the wire.
| close04 wrote:
| Only because what VW did is illegal, was super large scale,
| and could be linked to a lot of indirect deaths through the
| additional pollution.
|
| Benchmark optimizations are slightly embarrassing at worst,
| and an "optimization for a specific use case" at best.
| There's no regulation against optimizing for a particular
| task, everyone does it all the time, in some cases it's just
| not communicated transparently.
|
| Phone manufacturers were caught "optimizing" for benchmarks
| again and again, removing power limits to boost scores. Hard
| to name an example without searching the net because it's at
| most a faux pas.
| Swenrekcah wrote:
| Actually performing well on a task that is used as a
| benchmark is not comparable to deceiving authorities about
| how much toxic gas you are releasing.
| ArnoVW wrote:
| True. But they did not optimize for a specific case. They
| detected the test and then enabled a special regime, that was
| not used normally.
|
| It's as if OpenAI detected the IP address of a benchmark
| organization and then used a completely different model.
| K0balt wrote:
| This is the apples-to-apples version. Perhaps it might be more
| accurate to say that when detecting a benchmark attempt the
| model tries the prompt 3 times with different seeds then
| picks the best answer, otherwise it just zero-shots the
| prompt in everyday use.
|
| I say this because the test still used the same hardware (the
| model) but changed the way it behaved by running
| emissions-friendly parameters (a different execution framework)
| that wouldn't have been used in everyday driving, where
| fuel-efficiency- and performance-optimized parameters were used
| instead.
|
| What I'd like to know is if it actually was unethical or
| not. The overall carbon footprint of the lower fuel
| consumption setting, with fuel manufacturing and
| distribution factored in, might easily have been more
| impactful than the emissions model, which typically does
| not factor in fuel consumed.
| sigmoid10 wrote:
| Apples and oranges. VW actually cheated on regulatory testing
| to bypass legal requirements. So to be comparable, the
| government would first need to pass laws where e.g. only
| compilers that pass a certain benchmark are allowed to be
| used for purchasable products and then the developers would
| need to manipulate behaviour during those benchmarks.
| 0xFF0123 wrote:
| The only difference is the legality. From an integrity
| point of view it's basically the same
| Thorrez wrote:
| I think breaking a law is more unethical than not
| breaking a law.
|
| Also, legality isn't the only difference in the VW case.
| With VW, they had a "good emissions" mode. They enabled
| the good emissions mode during the test, but disabled it
| during regular driving. It would have worked during
| regular driving, but they disabled it during regular
| driving. With compilers, there's no "good performance"
| mode that would work during regular usage that they're
| disabling during regular usage.
| Lalabadie wrote:
| > I think breaking a law is more unethical than not
| breaking a law.
|
| It sounds like a mismatch of definition, but I doubt
| you're ambivalent about a behavior right until the moment
| it becomes illegal, after which you think it unethical.
| Law is the codification and enforcement of a social
| contract, not the creation of it.
| Thorrez wrote:
| >I doubt you're ambivalent about a behavior right until
| the moment it becomes illegal, after which you think it
| unethical.
|
| There are many cases where I think that. Examples:
|
| * Underage drinking. If it's legal for someone to drink,
| I think it's in general ethical. If it's illegal, I think
| it's in general unethical.
|
| * Tax avoidance strategies. If the IRS says a strategy is
| allowed, I think it's ethical. If the IRS says a strategy
| is not allowed, I think it's unethical.
|
| * Right on red. If the government says right on red is
| allowed, I think it's ethical. If the government (e.g.
| NYC) says right on red is not allowed, I think it's
| unethical.
|
| The VW case was emissions regulations. I think they have
| an ethical obligation to obey emissions regulations. In
| the absence of regulations, it's not an obvious ethical
| problem to prioritize fuel efficiency instead of
| emissions (that's I believe what VW was doing).
| chefandy wrote:
| Drinking and right turns are unethical if they're
| negligent. They're not unethical if they're not
| negligent. The government is trying to reduce negligence
| by enacting preventative measures to stop ALL right turns
| and ALL drinking in certain contexts that are more likely
| to yield negligence, or where the negligence would be
| particularly harmful, but that doesn't change whether or
| not the behavior itself is negligent.
|
| You might consider disregarding the government's
| preventative measures unethical, and doing those things
| might be the way someone disregards the governments
| protective guidelines, but that doesn't make those
| actions unethical any more than governments explicitly
| legalizing something makes it ethical.
|
| To use a clearer example, the ethicality of abortion--
| regardless of what you think of it-- is not changed by
| its legal status. You might consider violating the law
| unethical, so breaking abortion laws would constitute the
| same ethical violation as underage drinking, but those
| laws don't change the ethics of abortion itself. People
| who consider it unethical still consider it unethical
| where it's legal, and those that consider it ethical
| still consider it ethical where it's not legal.
| adgjlsfhk1 wrote:
| the right on red example is interesting because in that
| case, the law changes how other drivers and pedestrians
| will behave in ways that make it pretty much always
| unsafe
| chefandy wrote:
| That just changes the parameters of negligence. On a
| country road in the middle of a bunch of farm land where
| you can see for miles, it doesn't change a thing.
| mbrock wrote:
| It's not so simple. An analogy is the Rust formatter that
| has no options so everyone just uses the same style. It's
| minimally "unethical" to use idiosyncratic Rust style
| just because it goes against the convention so people
| will wonder why you're so special, etc.
|
| If the rules themselves are bad and go against deeper
| morality, then it's a different situation; violating laws
| out of civil disobedience, emergent need, or with a
| principled stance is different from wanton, arbitrary,
| selfish cheating.
|
| If a law is particularly unjust, violating the law might
| itself be virtuous. If the law is adequate and sensible,
| violating it is usually wrong even if the violating
| action could be legal in another sensible jurisdiction.
| ClumsyPilot wrote:
| > but that doesn't make those actions unethical any more
| than governments explicitly legalizing something makes it
| ethical
|
| That is, sometimes, sufficient.
|
| If the government says 'the seller of a house must disclose
| issues', then I rely on the law being followed; if you sell
| and leave the country, you have defrauded me.
|
| However if I live in a 'buyer beware' jurisdiction, then
| I know I cannot trust the seller and I hire a surveyor
| and take insurance.
|
| There is a degree of setting expectations- if there is a
| rule, even if it's a terrible rule, I as individual can
| at least take some countermeasures.
|
| You can't take countermeasures against all forms of
| illegal behaviour, because there is an infinite number of
| them. And a truly insane person is not predictable at all.
| banannaise wrote:
| Outsourcing your morality to politicians past and present
| is not a particularly useful framework.
| anonymouskimmer wrote:
| Ethics are only morality if you spend your entire time in
| human social contexts. Otherwise morality is a bit
| larger, and ethics are a special case of group recognized
| good and bad behaviors.
| emn13 wrote:
| Also, while laws ideally are inspired by an ethical
| social contract, the codification proces is long, complex
| and far from perfect. And then for rules concerning
| permissible behavior even in the best of cases, it's
| enforced extremely sparingly simply because it's not
| possible nor desirable to detect and deal with all
| infractions. Nor is it applied blindly and equally. As
| actually applied, a law is definitely not even close to
| some ethical ideal; sometimes it's outright opposed to
| it, even.
|
| Law and ethics are barely related, in practice.
|
| For example in the vehicle emissions context, it's worth
| noting that even well before VW was caught the actions of
| likely all carmakers affected by the regulations (not
| necessarily to the same extent) were clearly unethical.
| The rules had been subject to intense, clearly unethical
| lobbying for years, and so even the legal lab results
| bore little resemblance to practical on-the-road results,
| through systematic (yet legal) abuse. I wouldn't be
| surprised to learn that even what was measured
| intentionally diverged from what is harmful, in a
| profitable way. It's a good thing VW was made an example
| of - but clearly it's not like that resolved the general
| problem of harmful vehicle emissions. Optimistically, it
| might have signaled to the rest of the industry and VW in
| particular to stretch the rules less in the future.
| mbrock wrote:
| But following the law is itself a load bearing aspect of
| the social contract. Violating building codes, for
| example, might not cause immediate harm if it's competent
| but unusual, yet it's important that people follow it
| just because you don't want arbitrariness in matters of
| safety. The objective ruleset itself is a value beyond
| the rules themselves, if the rules are sensible and in
| accordance with deeper values, which of course they
| sometimes aren't, in which case we value civil
| disobedience and activism.
| Winse wrote:
| If following an unethical law would in itself be
| unethical, then breaking the unethical law would be the
| only ethical choice. In this case cheating emissions,
| which I see as unethical, but also advantageous for the
| consumer, should have been done openly if VW saw
| following the law as unethical. Ethics and morality are
| subjective to understanding, and law only a crude
| approximation of divinity. Though I would argue that each
| person on the earth through a shared common experience
| has a rough and general idea of right from wrong...though
| I'm not always certain they pay attention to it.
| hansworst wrote:
| Overfitting on test data absolutely does mean that the
| model would perform better in benchmarks than it would in
| real life use cases.
| Retr0id wrote:
| ethics should inform law, not the reverse
| UniverseHacker wrote:
| I disagree- presumably if an algorithm or hardware is
| optimized for a certain class of problem it really is
| good at it and always will be- which is still useful if
| you are actually using it for that. It's just "studying
| for the test"- something I would expect to happen even if
| it is a bit misleading.
|
| VW cheated such that the low emissions were only active
| during the test- it's not that it was optimized for low
| emissions under the conditions they test for, but that
| you could not get those low emissions under any
| conditions in the real world. That's "cheating on the
| test" not "studying for the test."
| the_af wrote:
| > _The only difference is the legality. From an integrity
| point of view it 's basically the same_
|
| I think cheating about harming the environment is another
| important difference.
| Swenrekcah wrote:
| That is not true. Even ChatGPT understands how they are
| different. I won't paste the whole response, but here are
| the differences it highlights:
|
| Key differences:
|
| 1. Intent and harm: VW's actions directly violated laws
| and had environmental and health consequences. Optimizing
| LLMs for chess benchmarks, while arguably misleading,
| doesn't have immediate real-world harms.
|
| 2. Scope: Chess-specific optimization is generally a
| transparent choice within AI research. It's not a hidden
| "defeat device" but rather an explicit design goal.
|
| 3. Broader impact: LLMs fine-tuned for benchmarks often
| still retain general-purpose capabilities. They aren't
| necessarily "broken" outside chess, whereas VW cars
| fundamentally failed to meet emissions standards.
| currymj wrote:
| VW was breaking the law in a way that harmed society but
| arguably helped the individual driver of the VW car, who
| gets better performance yet still passes the emissions
| test.
| jimmaswell wrote:
| And afaik the emissions were still miles ahead of a car
| from 20 years prior, just not quite as extremely
| stringent as requested.
| slowmotiony wrote:
| "not quite as extremely stringent as requested" is a
| funny way to say they were emitting 40 times more toxic
| fumes than permitted by law.
| int_19h wrote:
| It might sound funny in retrospect, but some of us
| actually bought VW cars on the assumption that, if
| biodiesel-powered, it would be more green.
| boringg wrote:
| How so? VW intentionally changed the operation of the
| vehicle so that its emissions met the test requirements
| during the test and then went back to typical operation
| conditions afterwards.
| TimTheTinker wrote:
| Right - in either case it's lying, which is crossing a
| moral line (which is far more important to avoid than a
| legal line).
| rsynnott wrote:
| There's a sliding scale of badness here. The emissions
| cheating (it wasn't just VW, incidentally; they were just
| the first uncovered. Fiat-Chrysler, Mercedes, GM and BMW
| were also caught doing it, with suspicions about others)
| was straight-up fraud.
|
| It used to be common for graphics drivers to outright cheat
| on benchmarks (the actual image produced would not be the
| same as it would have been if a benchmark had not been
| detected); this was, arguably, fraud.
|
| It used to be common for mobile phone manufacturers to
| allow the SoC to operate in a thermal mode that was never
| available to real users when it detected a benchmark was
| being used. This is still, IMO, kinda fraud-y.
|
| Optimisation for common benchmark cases where the thing
| still actually _works_, and where the optimisation is
| available to normal users where applicable, is less
| egregious, though, still, IMO, Not Great.
| waffletower wrote:
| Tesla cheats by using electric motors and deferring
| emissions standards to somebody else :D Wait, I really
| think that's a good thing, but once Hulk Hogan is confirmed
| administrator of the EPA, he might actually use this
| argument against Teslas and other electric vehicles.
| tightbookkeeper wrote:
| This is a 10-year-old story. It's very interesting which ones
| stay in the public consciousness.
| bluGill wrote:
| Most of the time these days compiler writers are not cheating
| like VW did. In the 1980s compiler writers would insert code
| to recognize performance tests and then cheat - output values
| hard coded into the compiler instead of running the
| algorithm. Which is the type of thing that VW got in trouble
| for.
|
| These days most compilers are trying to make the general case
| of code fast and they rarely look for benchmarks. I won't say
| they never do this - just that it is much less common - if
| only because magazine reviews/benchmarks are not nearly as
| important as they used to be and so the incentive is gone.
| newerman wrote:
| Funny response; you're not wrong.
| conradev wrote:
| GPT-3.5 did not "cheat" on chess benchmarks, though, it was
| actually just better at chess?
| GolfPopper wrote:
| I think the OP's point is that chat GPT-3.5 may have a
| chess engine baked into its (closed and unavailable) code
| for PR purposes. So it "realizes" that "hey, I'm playing a
| game of chess" and then, rather than doing whatever it
| normally does, it just acts as a front-end for a quite good
| chess-engine.
| conradev wrote:
| I see - my initial interpretation of OP's "special case"
| was "Theory 2: GPT-3.5-instruct was trained on more chess
| games."
|
| But I guess it's also a possibility that they had a real
| chess engine hiding in there.
| gdiamos wrote:
| It's approximately bad, like most of ML
|
| On one side:
|
| Would you expect a model trained on no Spanish data to do
| well on Spanish?
|
| On the other:
|
| Is it okay to train on the MMLU test set?
| dang wrote:
| We detached this subthread from
| https://news.ycombinator.com/item?id=42144784.
|
| (Nothing wrong with it! It's just a bit more generic than the
| original topic.)
| fabiospampinato wrote:
| It's probably worth playing around with different prompts and
| different board positions.
|
| For context this [1] is the board position the model is being
| prompted on.
|
| There may be more than one weird thing about this experiment; for
| example, giving instructions to the non-instruction-tuned variants
| may be counterproductive.
|
| More importantly, let's say you just give the model the truncated
| PGN: does this look like a position where white is a
| grandmaster-level player? I don't think so. Even if the model
| understood chess really well, it's going to try to predict the
| most probable move given the position at hand. If the model
| thinks that white is a bad player, and the model is good at
| understanding chess, it's going to predict bad moves as the more
| likely ones, because that better predicts what is most likely to
| happen here.
|
| [1]: https://i.imgur.com/qRxalgH.png
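|
| To make the idea concrete, a tiny sketch of the kind of prompt
| tweak I mean (the event, names and ratings are made up; I don't
| know what the author's exact prompt looked like):
|
|     import textwrap
|
|     # Hypothetical prompt variant: prepend PGN headers that frame
|     # both sides as very strong players, so "most probable next
|     # move" is conditioned on strong play rather than on an
|     # anonymous, possibly weak game.
|     def build_prompt(moves: str) -> str:
|         headers = textwrap.dedent("""\
|             [Event "FIDE World Championship"]
|             [White "Carlsen, Magnus"]
|             [Black "Caruana, Fabiano"]
|             [WhiteElo "2850"]
|             [BlackElo "2800"]
|         """)
|         return headers + "\n" + moves
|
|     print(build_prompt("1. e4 e5 2. Nf3 Nc6 3."))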
| Closi wrote:
| Agree with this. A few prompt variants:
|
| * What if you allow the model to do Chain of Thought
| (explicitly disallowed in this experiment)
|
| * What if you explain the board position at each step to the
| model in the prompt, so it doesn't have to calculate/estimate
| it internally.
| int_19h wrote:
| They also tested GPT-o1, which is always CoT. Yet it is still
| worse.
| fabiospampinato wrote:
| Apparently I can find some matches for games that start like
| that between very strong players [1], so my hypothesis that the
| model may just be predicting bad moves on purpose seems wobbly,
| although having stockfish at the lowest level play as the
| supposedly very strong opponent may still be throwing the model
| off somewhat. In the charts the first few moves the model makes
| seem decent, if I'm interpreting these charts right, and after
| a few of those things seem to start going wrong.
|
| Either way it's worth repeating the experiment imo, tweaking
| some of these variables (prompt guidance, stockfish strength,
| starting position, the name of the supposed players, etc.).
|
| [1]:
| https://www.365chess.com/search_result.php?search=1&p=1&m=8&...
| spott wrote:
| He was playing full games, not single moves.
| NiloCK wrote:
| The experiment started from the first move of a game, and
| played each game fully. The position you linked was just an
| example of the format used to feed the game state to the model
| for each move.
|
| What would "winning" or "losing" even mean if all of this was
| against a single move?
| osaatcioglu wrote:
| I've also been experimenting with Chess and LLMs but have taken a
| slightly different approach. Rather than using the LLM as an
| opponent, I've implemented it as a chess tutor to provide
| feedback on both the user's and the bot's moves throughout the
| game.
|
| The responses vary with the user's chess level; some find the
| feedback useful, while others do not. To address this, I've
| integrated a like, dislike, and request new feedback feature into
| the app, allowing users to actively seek better feedback.
|
| Btw, different from OP's setup, I opted to input the FEN of the
| current board and the subsequent move in standard algebraic
| notation to request feedback, as I found these inputs to be
| clearer for the LLM compared to giving the PGN of the game.
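|
| Roughly, the input side looks like this (a minimal sketch using
| the python-chess library; the prompt wording is illustrative, not
| the app's actual prompt):
|
|     import chess
|
|     def feedback_prompt(fen: str, move_san: str) -> str:
|         """Build a tutor prompt from the current position (FEN)
|         and the move about to be played (standard algebraic
|         notation)."""
|         board = chess.Board(fen)
|         move = board.parse_san(move_san)  # raises if illegal
|         side = "White" if board.turn == chess.WHITE else "Black"
|         return (
|             f"Position (FEN): {fen}\n"
|             f"{side} plays: {move_san} ({move.uci()})\n"
|             "Give short feedback on this move for a club-level "
|             "player."
|         )
|
|     print(feedback_prompt(
|         "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1",
|         "e4"))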
|
| AI Chess GPT https://apps.apple.com/tr/app/ai-chess-
| gpt/id6476107978
| https://play.google.com/store/apps/details?id=net.padma.app....
|
| Thanks
| antononcube wrote:
| Yeah, I was wondering why the featured article's author did not
| use Forsyth-Edwards Notation (FEN) and more complicated chess
| prompts.
|
| BTW, a year ago when I used FEN for chess playing, LLMs would
| very quickly/often make illegal moves. (The article prompts me
| to check whether that has changed...)
| philipwhiuk wrote:
| > I always had the LLM play as white against Stockfish--a
| standard chess AI--on the lowest difficulty setting.
|
| Okay, so "Excellent" still means probably quite bad. I assume at
| the top difficulty setting gpt-3.5-turbo-instruct will still lose
| badly.
| XCSme wrote:
| Probably even at lvl 2 out of 9 it would lose all the games.
| gunalx wrote:
| Is it just me or does the author swap the descriptions of the
| instruction-finetuned and the base gpt-3.5-turbo? It seemed like
| the best model was labeled instruct, but the text said instruct
| did worse?
| sylware wrote:
| They probably concluded that the additional cost of training
| those models on chess would not be "cost effective", and dropped
| chess from their training process, for the moment.
|
| That said, we can say literally anything because this is very
| shadowy/murky, but since everything is likely a question of
| money... this should _probably_ not be very far from the
| truth...
| XCSme wrote:
| I had the same experience with LLM text-to-SQL: 3.5-instruct felt
| a lot more robust than 4o.
| reallyeli wrote:
| My guess is they just trained gpt3.5-turbo-instruct on a lot of
| chess, much more than is in e.g. CommonCrawl, in order to boost
| it on that task. Then they didn't do this for other models.
|
| People are alleging that OpenAI is calling out to a chess engine,
| but seem to be not considering this less scandalous possibility.
|
| Of course, to the extent people are touting chess performance as
| evidence of general reasoning capabilities, OpenAI taking costly
| actions to boost specifically chess performance and not being
| transparent about it is still frustrating and, imo, dishonest.
| sherburt3 wrote:
| They have a massive economic incentive to make their closed
| source software look as good as possible, why wouldn't they
| cheat?
| stefatorus wrote:
| The trick to getting a model to perform on something is to have
| it as a training data subset.
|
| OpenAI might have thought chess was good to optimize for, but it
| wasn't seen as useful, so they dropped it.
|
| This is what people refer to as a "lobotomy": AI models are
| wasting compute on knowing how loud the cicadas are and how wide
| the green cockroach is when mating.
|
| Good models are about the training data you push into them.
| smokedetector1 wrote:
| I feel like an easy win here would be retraining an LLM with a
| tokenizer specifically designed for chess notation?
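|
| To make that concrete, a toy sketch of what such a tokenizer
| could look like (my own illustration, not from the article: one
| token per UCI move string, so BPE never splits a move into
| arbitrary pieces; it deliberately over-generates impossible moves
| to keep things simple):
|
|     import itertools
|     import chess
|
|     squares = [chess.square_name(sq) for sq in chess.SQUARES]
|     moves = [a + b for a, b in itertools.product(squares, squares)
|              if a != b]
|     promotions = [m + p for m in moves for p in "qrbn"]
|     vocab = {tok: i for i, tok in enumerate(
|         ["<pad>", "<bos>", "<eos>"] + moves + promotions)}
|
|     def encode(game_ucis):
|         # One fixed token id per move, e.g. "e2e4" -> single id.
|         return ([vocab["<bos>"]]
|                 + [vocab[m] for m in game_ucis]
|                 + [vocab["<eos>"]])
|
|     print(len(vocab), encode(["e2e4", "e7e5", "g1f3"]))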
| misiek08 wrote:
| For me it's not only the chess. Chats get more chatty, but
| knowledge- and fact-wise it's a sad comedy. Yes, you get a buddy
| to talk with, but it talks pure nonsense.
| teleforce wrote:
| TL;DR.
|
| All of the LLMs tested performed terribly at chess against the
| Stockfish engine except gpt-3.5-turbo-instruct, which is a closed
| OpenAI model.
| jack_riminton wrote:
| Perhaps it doesn't have enough data to explain, but it has
| enough to go "on gut".
| sourcepluck wrote:
| Keep in mind, everyone, that stockfish on its lowest level on
| lichess is absolutely terrible, and a 5-year-old human who'd been
| playing chess for a few months could beat it regularly. It hangs
| pieces, does -3 blunders, totally random-looking bad moves.
|
| But still, yes, something maybe a teeny tiny bit weird is going
| on, in the sense that only one of the LLMs could beat it. The
| arxiv paper that came out recently was much more "weird" and
| interesting than this, though. This will probably be met with a
| mundane explanation soon enough, I'd guess.
| sourcepluck wrote:
| Here's a quick anonymous game against it by me, where I
| obliterate the poor thing in 11 moves. I was around a 1500 ELO
| classical strength player, which is, a teeny bit above average,
| globally. But I mean - not an expert, or even one of the
| "strong" club players (in any good club).
|
| https://lichess.org/BRceyegK -- the game, you'll see it make
| the ultimate classic opening errors
|
| https://lichess.org/ -- try yourself! It's really so bad, it's
| good fun. Click "play with computer" on the right, then level 1
| is already selected, you hit go
| nabla9 wrote:
| Theory 5: gpt-3.5-turbo-instruct has a chess engine attached to it.
| wufufufu wrote:
| > And then I tried gpt-3.5-turbo-instruct. This is a closed
| OpenAI model, so details are very murky.
|
| How do you know it didn't just write a script that uses a chess
| engine and then execute the script? That IMO is the easiest
| explanation.
|
| Also, I looked at the gpt-3.5-turbo-instruct example victory. One
| side played with 70% accuracy and the other was 77%. IMO that's
| not on par with 27XX ELO.
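|
| For scale, wiring a PGN prompt to a real engine is only a few
| lines (a sketch using the python-chess library and a local
| Stockfish binary on PATH; purely to illustrate how cheap such a
| fallback would be, not evidence that OpenAI does this):
|
|     import io
|     import chess
|     import chess.engine
|     import chess.pgn
|
|     def engine_next_move(pgn_prefix, stockfish_path="stockfish"):
|         # Replay the (possibly truncated) PGN, then ask Stockfish
|         # for the next move and return it in SAN.
|         game = chess.pgn.read_game(io.StringIO(pgn_prefix))
|         board = game.board()
|         for move in game.mainline_moves():
|             board.push(move)
|         with chess.engine.SimpleEngine.popen_uci(stockfish_path) as eng:
|             result = eng.play(board, chess.engine.Limit(time=0.1))
|         return board.san(result.move)
|
|     print(engine_next_move("1. e4 e5 2. Nf3 Nc6 3. Bb5 a6"))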
| leogao wrote:
| The GPT-4 pretraining set included chess games in PGN notation
| from 1800+ ELO players. I can't comment on any other models.
| a_wild_dandan wrote:
| Important testing excerpts:
|
| - "...for the closed (OpenAI) models I tried generating up to 10
| times and if it still couldn't come up with a legal move, I just
| chose one randomly."
|
| - "I ran all the open models (anything not from OpenAI, meaning
| anything that doesn't start with gpt or o1) myself using Q5_K_M
| quantization"
|
| - "...if I gave a prompt like "1. e4 e5 2. " (with a space at the
| end), the open models would play much, much worse than if I gave
| a prompt like "1 e4 e5 2." (without a space)"
|
| - "I used a temperature of 0.7 for all the open models and the
| default for the closed (OpenAI) models."
|
| Between the tokenizer weirdness, temperature, quantization,
| random moves, and the chess prompt, there's a lot going on here.
| I'm unsure how to interpret the results. Fascinating article
| though!
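|
| The trailing-space quirk is easy to see with a BPE tokenizer (a
| sketch using GPT-2's tokenizer from Hugging Face as a stand-in;
| the tested models each have their own tokenizers, so the exact
| splits differ):
|
|     from transformers import AutoTokenizer
|
|     tok = AutoTokenizer.from_pretrained("gpt2")
|
|     for prompt in ["1. e4 e5 2. ", "1. e4 e5 2."]:
|         ids = tok.encode(prompt)
|         print(repr(prompt), [tok.decode([i]) for i in ids])
|
|     # With the trailing space the model must now emit the next
|     # move without its usual leading-space token (" e4" vs "e4"),
|     # which is likely off the token sequence that PGN text
|     # followed during training.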
| mips_avatar wrote:
| Ok, whoa - assuming the chess powers of gpt-3.5-instruct are just
| a result of training focus, then we don't have to wait on bigger
| models, we just need to fine-tune a 175B model?
| 1024core wrote:
| I would love to see the prompts (the data) this person used.
| mastazi wrote:
| If you look at the comments under the post, the author commented
| 25 minutes ago (as of me posting this)
|
| > Update: OK, I actually think I've figured out what's causing
| this. I'll explain in a future post, but in the meantime, here's
| a hint: I think NO ONE has hit on the correct explanation!
|
| well now we are curious!
| amelius wrote:
| What would happen if you prompted it with much more text, e.g.
| general advice by a chess grandmaster?
| throwawaymaths wrote:
| Would be more interesting with trivial LoRA training.
___________________________________________________________________
(page generated 2024-11-15 23:01 UTC)