[HN Gopher] Something weird is happening with LLMs and chess
       ___________________________________________________________________
        
       Something weird is happening with LLMs and chess
        
       Author : crescit_eundo
       Score  : 630 points
       Date   : 2024-11-14 17:05 UTC (1 day ago)
        
 (HTM) web link (dynomight.substack.com)
 (TXT) w3m dump (dynomight.substack.com)
        
       | PaulHoule wrote:
       | Maybe that one which plays chess well is calling out to a real
       | chess engine.
        
         | singularity2001 wrote:
         | this possibility is discussed in the article and deemed
         | unlikely
        
           | margalabargala wrote:
           | I don't see that discussed, could you quote it?
        
           | probably_wrong wrote:
           | Note: the possibility is not mentioned in the article but
           | rather in the comments [1]. I had to click a bit to see it.
           | 
           | The fact that the one closed source model is the only one
           | that plays well seems to me like a clear case of the
           | interface doing some of the work. If you ask ChatGPT to count
           | until 10000 (something that most LLMs can't do for known
           | reasons) you get an answer that's clearly pre-programmed. I'm
           | sure the same is happening here (and with many, many other
           | tasks) - the author argues against it by saying "but why
           | isn't it better?", which doesn't seem like the best argument:
           | I can imagine that typical ChatGPT users enjoy the product
           | more if they have a chance to win once in a while.
           | 
           | [1] https://dynomight.substack.com/p/chess/comment/77190852
        
             | refulgentis wrote:
             | What do you mean LLMs can't count to 10,000 for known
             | reasons?
             | 
             | Separately, if you are able to show OpenAI is serving
             | pre-canned responses in some instances, instead of running
             | inference, you will get a ton of attention if you write it
             | up.
             | 
             | I'm not saying this in an aggro tone, it's a genuinely
             | interesting subject to me because I wrote off LLMs at first
             | because I thought this was going on.* Then I spent the last
             | couple years laughing at myself for thinking that they
             | would do that. Would be some mix of fascinated and
             | horrified to see it come full circle.
             | 
             | * I can't remember what, exactly; it was as far back as 2018.
             | But someone argued that OpenAI was patching in individual
             | answers because scaling was dead and they had no answers,
             | way way before ChatGPT.
        
               | probably_wrong wrote:
               | When it comes to counting, LLMs have a couple issues.
               | 
               | First, tokenization: the tokenization of 1229 is not
               | guaranteed to be [1,2,2,9] but it could very well be
               | [12,29] and the "+1" operation could easily generate
               | tokens [123,0] depending on frequencies in your corpus.
               | This constant shifting in tokens makes it really hard to
               | learn rules for "+1" ([9,9] +1 is not [9,10]). This is
               | also why LLMs tend to fail at tasks like "how many
               | letters does this word have?":
               | https://news.ycombinator.com/item?id=41058318
               | 
               | Second, you need your network to understand that "+1" is
               | worth learning. Writing "+1" as a combination of sigmoid,
               | products and additions over normalized floating point
               | values (hello loss of precision) is not trivial without
               | degrading a chunk of your network, and what for? After
               | all, math is not in the domain of language and, since
               | we're not training an LMM here, your loss function may
               | miss it entirely.
               | 
               | And finally there's statistics: the three-legged-dog
               | problem is figuring out that a dog has four legs from
               | corpora when no one ever writes "the four-legged dog"
               | because it's obvious, but every reference to an unusual
               | dog will include said description. So if people write
               | "1+1 equals 3" satirically then your network may pick
               | that up as fact. And how often has your network seen the
               | result of "6372 + 1"?
               | 
               | But you don't have to take my word for it - take an open
               | LLM and ask it to generate integers between 7824 and
               | 9954. I'm not optimistic that it will make it through
               | without hallucinations.
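               | 
               | A quick way to see the first point for yourself (a sketch,
               | assuming the tiktoken package; the exact splits differ
               | between encodings):
               | 
               |   import tiktoken  # pip install tiktoken
               | 
               |   enc = tiktoken.get_encoding("cl100k_base")
               |   for n in [1229, 1230, 9999, 10000]:
               |       ids = enc.encode(str(n))
               |       pieces = [enc.decode([i]) for i in ids]
               |       # a 4-digit number may come out as one, two or three
               |       # tokens, and n and n+1 often split differently
               |       print(n, ids, pieces)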
        
         | sobriquet9 wrote:
         | This is likely. From the example games, it not only knows the
         | rules (which would be impressive by itself; just making legal
         | moves is not trivial), it also has some planning capabilities
         | (it plays combinations of several moves).
        
         | aithrowawaycomm wrote:
         | The author thinks this is unlikely because it only has an ~1800
         | ELO. But OpenAI is shady as hell, and I could absolutely see
         | the following _purely hypothetical_ scenario:
         | 
         | - In 2022 Brockman and Sutskever have an unshakeable belief
         | that Scaling Is All You Need, and since GPT-4 has a ton of
         | chess in its pretraining data it will _definitely_ be able to
         | play competent amateur chess when it's finished.
         | 
         | - A ton of people have pointed out that ChatGPT-3.5 doesn't
         | even slightly understand chess despite seeming fluency in the
         | lingo. People start to whisper that transformers cannot
         | actually create plans.
         | 
         | - Therefore OpenAI hatches an impulsive scheme: release an
         | "instruction-tuned" GPT-3.5 with an embedded chess engine that
         | is not a grandmaster, but can play competent chess, ideally
         | just below the ELO that GPT-4 is _projected_ to have.
         | 
         | - Success! The waters are muddied: GPT enthusiasts triumphantly
         | announce that LLMs _can_ play chess, it just took a bit more
         | data and fine-tuning. The haters were wrong: look at all the
         | planning GPT is doing!
         | 
         | - Later on, at OpenAI HQ...whoops! GPT-4 sucks at chess, as do
         | competitors' foundation LLMs which otherwise outperform
         | GPT-3.5. The scaling "laws" failed here, since they were never
         | laws in the first place. OpenAI accepts that scaling
         | transformers won't easily solve the chess problem, then
         | realizes that if they include the chess engine with GPT-4
         | without publicly acknowledging it, then Anthropic and Facebook
         | will call out the performance as aberrational and suspicious.
         | But publicly acknowledging a chess engine is even worse: the
         | only reason to include the chess engine is to mislead users
         | into thinking GPT is capable of general-purpose planning.
         | 
         | - Therefore in later GPT versions they don't include the
         | engine, but it's too late to remove it from gpt-3.5-turbo-
         | instruct: people might accept the (specious) claim that GPT-4's
         | size accidentally sabotaged its chess abilities, but they'll
         | ask tough questions about performance degradation within the
         | same model.
         | 
         | I realize this is convoluted and depends on conjecture. But
         | OpenAI has a history with misleading demos - e.g. their Rubik's
         | cube robot which in fact used a classical algorithm but was
         | presented as reinforcement learning. I think "OpenAI lied" is
         | the most likely scenario. It is far more likely than "OpenAI
         | solved the problem honestly in GPT-3.5, but forgot how they did
         | it with GPT-4," and a bit more likely than "scaling
         | transformers slightly helps performance when playing Othello
         | but severely sabotages performance when playing chess."
        
           | gardenhedge wrote:
           | Not that convoluted really
        
             | refulgentis wrote:
             | It's pretty convoluted, requires a ton of steps, mind-
             | reading, and odd sequencing.*
             | 
             | If you share every prior, and aren't particularly concerned
             | with being disciplined in treating conversation as
             | proposing a logical argument (I'm not myself, people find
             | it offputting), it probably wouldn't seem at all
             | convoluted.
             | 
             | * layer chess into gpt-3.5-instruct _only_, but not
             | chatgpt, not GPT-4, to defeat the naysayers when GPT-4
             | comes out? _shrugs_ if the issues with that are unclear, I
             | can lay it out more
             | 
             | ** fwiw, at the time, pre-chatgpt, before the hype, there
             | wasn't a huge focus on chess, nor a ton of naysayers to
             | defeat. it would have been bizarre to put this much energy
             | into it, modulo the scatter-brained thinking in *
        
               | gardenhedge wrote:
               | It's not that many steps. I'm sure we've all seen our
               | sales teams selling features that aren't in the
               | application or exaggerating features before they're fully
               | complete.
               | 
               | To be clear, I'm not saying that the theory is true but
               | just that I could believe something like that could
               | happen.
        
           | jmount wrote:
           | Very good scenario. One variation: some researcher or
           | division in OpenAI performs all of the above steps to get a
           | raise. The whole field is predicated on rewarding the
           | appearance of ability.
        
           | tedsanders wrote:
           | Eh, OpenAI really isn't as shady as hell, from what I've seen
           | on the inside for 3 years. Rubik's cube hand was before me,
           | but in my time here I haven't seen anything I'd call shady
           | (though obviously the non-disparagement clauses were a
           | misstep that's now been fixed). Most people are genuinely
           | trying to build cool things and do right by our customers.
           | I've never seen anyone try to cheat on evals or cheat
           | customers, and we take our commitments on data privacy
           | seriously.
           | 
           | I was one of the first people to play chess against the base
           | GPT-4 model, and it blew my mind by how well it played. What
           | many people don't realize is that chess performance is
           | extremely sensitive to prompting. The reason gpt-3.5-turbo-
           | instruct does so well is that it can be prompted to complete
           | PGNs. All the other models use the chat format. This explains
           | pretty much everything in the blog post. If you fine-tune a
           | chat model, you can pretty easily recover the performance
           | seen in 3.5-turbo-instruct.
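           | 
           | Roughly what PGN-style prompting looks like, as a minimal
           | sketch (assumes the openai Python package >= 1.0 and an API
           | key in the environment; the player-name headers are just an
           | illustrative trick, not a fixed recipe):
           | 
           |   from openai import OpenAI
           | 
           |   client = OpenAI()
           |   # a game prefix in PGN; the model continues the move text
           |   pgn = (
           |       '[White "Magnus Carlsen"]\n'
           |       '[Black "Garry Kasparov"]\n'
           |       '[Result "*"]\n\n'
           |       "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4."
           |   )
           |   resp = client.completions.create(
           |       model="gpt-3.5-turbo-instruct",  # legacy completions
           |       prompt=pgn,
           |       max_tokens=6,
           |       temperature=0,
           |   )
           |   print(resp.choices[0].text)  # e.g. " Ba4 Nf6 5. O-O"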
           | 
           | There's nothing shady going on, I promise.
        
         | og_kalu wrote:
         | It's not:
         | 
         | 1. That would just be plain bizarre
         | 
         | 2. It plays like what you'd expect from an LLM that could play
         | chess. That is, the level of play can be modulated by the prompt
         | and doesn't shift the way changing Stockfish's level does. Also,
         | the specific chess notation used in the prompt actually matters.
         | 
         | 3. It's sensitive to how the position came to be. Clearly not
         | an existing chess engine. https://github.com/dpaleka/llm-chess-
         | proofgame
         | 
         | 4. It does make illegal moves. It's rare (~5 in 8205) but it
         | happens. https://github.com/adamkarvonen/chess_gpt_eval
         | 
         | 5. You can, or at least you used to be able to, inspect the
         | logprobs. I think OpenAI has stopped exposing them, but the link
         | in 4 does show the author inspecting them for Turbo instruct.
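         | 
         | For reference, that kind of logprob check looks roughly like
         | this (a sketch, assuming the openai Python package and that the
         | completions endpoint still returns logprobs):
         | 
         |   from openai import OpenAI
         | 
         |   client = OpenAI()
         |   resp = client.completions.create(
         |       model="gpt-3.5-turbo-instruct",
         |       prompt="1. e4 e5 2. Nf3 Nc6 3.",
         |       max_tokens=1,
         |       temperature=0,
         |       logprobs=5,  # top-5 candidate tokens per position
         |   )
         |   # a canned engine reply would not come with a plausible
         |   # next-token distribution like this
         |   print(resp.choices[0].logprobs.top_logprobs[0])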
        
           | aithrowawaycomm wrote:
           | > Also the specific chess notation being prompted actually
           | matters
           | 
           | Couldn't this be evidence that it _is_ using an engine? Maybe
           | if you use the wrong notation it relies on the ANN rather
           | than calling to the engine.
           | 
           | Likewise:
           | 
           | - The sensitivity to game history is interesting, but is it
           | actually true that other chess engines only look at current
           | board state? Regardless, maybe it's not an _existing_ chess
           | engine! I would think OpenAI has some custom chess engine
           | built as a side project, PoC, etc. In particular this engine
           | might be neural and trained on actual games rather than board
           | positions, which could explain dependency on past moves. Note
           | that the engine is not actually very good. Does AlphaZero
           | depend on move history? (Genuine question, I am not sure. But
           | it does seem likely.)
           | 
           | - I think the illegal moves can be explained similarly to why
           | gpt-o1 sometimes screws up easy computations despite having
           | access to Python: an LLM having access to a tool does not
           | guarantee it always uses that tool.
           | 
           | I realize there are holes in the argument, but I genuinely
           | don't think these holes are as big as the "why is
           | gpt-3.5-turbo-instruct so much better at chess than gpt-4?"
           | hole.
        
             | janalsncm wrote:
             | > Couldn't this be evidence that it is using an engine?
             | 
             | A test would be to measure its performance against more
             | difficult versions of Stockfish. A real chess engine would
             | have a higher ceiling.
             | 
             | Much more likely is that this model was trained on more
             | chess PGNs. You can call that a "neural engine" if you'd
             | like, but it is the simplest explanation and accounts for
             | the mistakes it is making.
             | 
             | Game state isn't just what you can see on the board. It
             | includes the 50 move rule and castling rights. Those were
             | encoded as layers in AlphaZero along with prior positions
             | of pieces. (8 prior positions if I'm remembering
             | correctly.)
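             | 
             | That ladder test is easy to script (a sketch with
             | python-chess; the stockfish path, the "Skill Level" option,
             | and llm_move() are assumptions/placeholders):
             | 
             |   import chess
             |   import chess.engine
             | 
             |   def play_one_game(skill, llm_move):
             |       eng = chess.engine.SimpleEngine.popen_uci(
             |           "/usr/bin/stockfish")  # assumed local binary
             |       eng.configure({"Skill Level": skill})  # 0..20
             |       board = chess.Board()
             |       while not board.is_game_over():
             |           if board.turn == chess.WHITE:
             |               board.push(llm_move(board))  # placeholder
             |           else:
             |               result = eng.play(
             |                   board, chess.engine.Limit(time=0.1))
             |               board.push(result.move)
             |       eng.quit()
             |       return board.result()
             | 
             | A real engine hiding behind the API should keep winning as
             | the skill level rises; a pattern-matcher should fall off.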
        
         | selcuka wrote:
         | I think that's the most plausible theory that would explain the
         | sudden hike from gpt-3.5-turbo to gpt-3.5-turbo-instruct, and
         | again the sudden regression in gpt-4*.
         | 
         | OpenAI also seem to augment the LLM with some type of VM or a
         | Python interpreter. Maybe they run a simple chess engine such
         | as Sunfish [1] which is around 1900-2000 ELO [2]?
         | 
         | [1] https://github.com/thomasahle/sunfish
         | 
         | [2] https://lichess.org/@/sunfish-engine
        
         | janalsncm wrote:
         | Probably not calling out to one but it would not surprise me at
         | all if they added more chess PGNs into their training data.
         | Chess is a bit special in AI in that it's still seen as a mark
         | of pure intelligence in some respect.
         | 
         | If you tested it on an equally strategic but less popular game
         | I highly doubt you would see the same performance.
        
       | pseudosavant wrote:
       | LLMs aren't really language models so much as they are token
       | models. That is how they can also handle input in audio or visual
       | forms because there is an audio or visual tokenizer. If you can
       | make it a token, the model will try to predict the following
       | ones.
       | 
       | Even though I'm sure chess matches were used in some of the LLM
       | training, I'd bet a model trained just for chess would do far
       | better.
        
         | viraptor wrote:
         | > That is how they can also handle input in audio or visual
         | forms because there is an audio or visual tokenizer.
         | 
         | This is incorrect. They get translated into the shared latent
         | space, but they're not tokenized in any way resembling the text
         | part.
        
           | pseudosavant wrote:
           | They are almost certainly tokenized in most multi-modal LLMs:
           | https://en.wikipedia.org/wiki/Large_language_model#Multimoda...
        
             | viraptor wrote:
             | Ah, an overloaded meaning of "tokenizer": "split into tokens"
             | vs. "turned into a single embedding matching a token". I've
             | never heard it used that way before, but it kinda makes
             | sense.
        
       | ChrisArchitect wrote:
       | [dupe] https://news.ycombinator.com/item?id=42138276
        
       | digging wrote:
       | Definitely weird results, but I feel there are too many variables
       | to learn much from it. A couple things:
       | 
       | 1. The author mentioned that tokenization causes something
       | minuscule like a " " at the end of the input to shatter the
       | model's capabilities. Is it possible other slightly different
       | formatting changes in the input could raise capabilities?
       | 
       | 2. Temperature was 0.7 for all models. What if it wasn't? Isn't
       | there a chance one or more models would perform significantly
       | better with higher or lower temperatures?
       | 
       | Maybe I just don't understand this stuff very well, but it feels
       | like this post is only 10% of the work needed to get any meaning
       | from this...
        
         | semi-extrinsic wrote:
         | The author mentions in the comment section that changing
         | temperature did not help.
        
       | azeirah wrote:
       | Maybe I'm really stupid... but perhaps if we want really
       | intelligent models we need to stop tokenizing at all? We're
       | literally limiting what a model can see and how it perceives the
       | world by limiting the structure of the information streams that
       | come into the model from the very beginning.
       | 
       | I know working with raw bits or bytes is slower, but it should be
       | relatively cheap and easy to at least falsify this hypothesis
       | that many huge issues might be due to tokenization problems
       | but... yeah.
       | 
       | Surprised I don't see more research into radically different
       | tokenization.
        
         | cschep wrote:
         | How would we train it? Don't we need it to understand the heaps
         | and heaps of data we already have "tokenized" e.g. the
         | internet? Written words for humans? Genuinely curious how we
         | could approach it differently?
        
           | viraptor wrote:
           | That's not what tokenized means here. Parent is asking to
           | provide the model with separate characters rather than
           | tokens, i.e. groups of characters.
        
           | skylerwiernik wrote:
           | Couldn't we just make every human readable character a token?
           | 
           | OpenAI's tokenizer makes "chess" "ch" and "ess". We could
           | just make it into "c" "h" "e" "s" "s"
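           | 
           | (Easy to check with a sketch like this, assuming the tiktoken
           | package; the exact split depends on the encoding:)
           | 
           |   import tiktoken
           | 
           |   enc = tiktoken.get_encoding("cl100k_base")
           |   print([enc.decode([t]) for t in enc.encode("chess")])
           |   print(list("chess"))  # ['c', 'h', 'e', 's', 's']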
        
             | taeric wrote:
             | This is just more tokens? And probably requires the model
             | to learn about common groups. Consider, "ess" makes sense
             | to see as a group. "Wss" does not.
             | 
             | That is, the groups are encoding something the model
             | doesn't have to learn.
             | 
             | This is not much astray from "sight words" we teach kids.
        
               | TZubiri wrote:
               | > This is just more tokens?
               | 
               | Yup. Just let the actual ML git gud
        
               | taeric wrote:
               | So, put differently, this is just more expensive?
        
               | Hendrikto wrote:
               | No, actually much fewer tokens. 256 tokens cover all
               | bytes. See the ByT5 paper:
               | https://arxiv.org/abs/2105.13626
        
               | taeric wrote:
               | More tokens to a sequence, though. And since it is
               | learning sequences...
        
               | loa_in_ wrote:
               | Yeah, suddenly 16k tokens is just 16 kB of ASCII instead
               | of ~6k words
        
             | tchalla wrote:
             | aka Character Language Models which have existed for a
             | while now.
        
             | cco wrote:
             | We can, tokenization is literally just to maximize
             | resources and provide as much "space" as possible in the
             | context window.
             | 
             | There is no advantage to tokenization, it just helps solve
             | limitations in context windows and training.
        
               | TZubiri wrote:
               | I like this explanation
        
         | aithrowawaycomm wrote:
         | FWIW I think most of the "tokenization problems" are in fact
         | reasoning problems being falsely blamed on a minor technical
         | thing when the issue is much more profound.
         | 
         | E.g. I still see people claiming that LLMs are bad at basic
         | counting because of tokenization, but the same LLM counts
         | perfectly well if you use chain-of-thought prompting. So it
         | _can't_ be explained by tokenization! The problem is
         | reasoning: the LLM needs a human to tell it that a counting
         | problem can be accurately solved if they go step-by-step.
         | Without this assistance the LLM is likely to simply guess.
        
           | ipsum2 wrote:
           | The more obvious alternative is that CoT is making up for the
           | deficiencies in tokenization, which I believe is the case.
        
             | aithrowawaycomm wrote:
             | I think the more obvious explanation has to do with
             | computational complexity: counting is an O(n) problem, but
             | transformer LLMs can't solve O(n) problems unless you use
             | CoT prompting: https://arxiv.org/abs/2310.07923
        
               | ipsum2 wrote:
               | What you're saying is an explanation of what I said, but I
               | agree with you ;)
        
               | aithrowawaycomm wrote:
               | No, it's a rebuttal of what you said: CoT is not making
               | up for a deficiency in tokenization, it's making up for a
               | deficiency in transformers themselves. These complexity
               | results have nothing to do with tokenization, or even
               | LLMs, it is about the complexity class of problems that
               | can be solved by transformers.
        
               | ipsum2 wrote:
               | There's a really obvious way to test whether the
               | strawberry issue is tokenization - replace each letter
               | with a number, then ask chatGPT to count the number of
               | 3s.
               | 
               | Count the number of 3s, only output a single number: 6 5
               | 3 2 8 7 1 3 3 9.
               | 
               | ChatGPT: 3.
        
               | MacsHeadroom wrote:
               | This paper does not support your position any more than
               | it supports the position that the problem is
               | tokenization.
               | 
               | This paper posits that if the authors' intuition were true
               | then they would find certain empirical results, i.e. "If A
               | then B." Then they test and find the empirical results.
               | But this does not imply that their intuition was correct,
               | just as "If A then B" does not imply "If B then A."
               | 
               | If the empirical results were due to tokenization
               | absolutely nothing about this paper would change.
        
           | Der_Einzige wrote:
           | I'm the one who will fight you including with peer reviewed
           | papers indicating that it is in fact due to tokenization. I'm
           | too tired but will edit this for later, so take this as my
           | bookmark to remind me to respond.
        
             | aithrowawaycomm wrote:
             | I am aware of errors in _computations_ that can be fixed by
             | better tokenization (e.g. long addition works better
             | tokenizing right-left rather than L-R). But I am talking
             | about counting, and talking about counting _words,_ not
             | _characters._ I don't think tokenization explains why LLMs
             | tend to fail at this without CoT prompting. I really think
             | the answer is computational complexity: counting is simply
             | too hard for transformers unless you use CoT.
             | https://arxiv.org/abs/2310.07923
        
               | cma wrote:
               | Words vs. characters is a similar problem, since tokens
               | can be less than one word, multiple words, or multiple
               | words plus a partial word, or words with non-word
               | punctuation like a sentence-ending period.
        
             | Jensson wrote:
             | We know there are narrow solutions to these problems, that
             | was never the argument that the specific narrow task is
             | impossible to solve.
             | 
             | The discussion is about general intelligence, the model
             | isn't able to do a task that it can do simply because it
             | chooses the wrong strategy, that is a problem of lack of
             | generalization and not a problem of tokenization. Being
             | able to choose the right strategy is core to general
             | intelligence, altering input data to make it easier for the
             | model to find the right solution to specific questions does
             | not help it become more general, you just shift what narrow
             | problems it is good at.
        
             | azeirah wrote:
             | I strongly believe tokenization is the underlying problem;
             | it's just that, say, bit-by-bit tokenization is too expensive
             | to run at the scales things are currently being run at
             | (OpenAI, Claude, etc.)
        
               | int_19h wrote:
               | It's not just a current thing, either. Tokenization
               | basically lets you have a model with a larger input
               | context than you'd otherwise have for the given resource
               | constraints. So any gains from feeding the characters in
               | directly have to be greater than this advantage. And for
               | CoT especially - which we _know_ produces significant
               | improvements in most tasks - you want large context.
        
             | pmarreck wrote:
             | My intuition says that tokenization is a factor, especially
             | if it splits up individual move descriptions differently
             | from other LLMs.
             | 
             | If you think about how our brains handle this data input,
             | it absolutely does not split them up between the letter and
             | the number, although the presence of both the letter and
             | number together would trigger the same 2 tokens I would
             | think
        
           | TZubiri wrote:
           | > FWIW I think most of the "tokenization problems"
           | 
           | List of actual tokenization limitations:
           | 1. strawberry
           | 2. rhyming and metrics
           | 3. whitespace (as displayed in the article)
        
           | meroes wrote:
           | At a certain level they are identical problems. My strongest
           | piece of evidence is that I get paid as an RLHF'er to find
           | ANY case of error, including "tokenization". You know how
           | many errors an LLM gets in the simplest grid puzzles, with
           | CoT, with specialized models that don't try to "one-shot"
           | problems, with multiple models, etc?
           | 
           | My assumption is that these large companies wouldn't pay
           | livable wages to hundreds of thousands of RLHF'ers through
           | dozens of third-party companies if tokenization errors were
           | just that.
        
             | 1propionyl wrote:
             | > hundreds of thousands of RLHF'ers through dozens of third
             | party companies
             | 
             | Out of curiosity, what are these companies? And where do
             | they operate.
             | 
             | I'm always interested in these sorts of "hidden"
             | industries. See also: outsourced Facebook content
             | moderation in Kenya.
        
           | csomar wrote:
           | It can count words in a paragraph though. So I do think it's
           | tokenization.
        
           | PittleyDunkin wrote:
           | I feel like we can set our qualifying standards higher than
           | counting.
        
         | jncfhnb wrote:
         | There's a reason human brains have dedicated language handling.
         | Tokenization is likely a solid strategy. The real thing here is
         | that language is not a good way to encode all forms of
         | knowledge
        
           | joquarky wrote:
           | It's not even possible to encode all forms of knowledge.
        
             | shaky-carrousel wrote:
             | I know a joke where half of the joke is whistling and half
             | gesturing, and the punchline is whistling. The wording is
             | basically just to say who the players are.
        
         | layer8 wrote:
         | Going from tokens to bytes explodes the model size. I can't
         | find the reference at the moment, but reducing the average
         | token size induces a corresponding quadratic increase in the
         | width (size of each layer) of the model. This doesn't just
         | affect inference speed, but also training speed.
        
         | og_kalu wrote:
         | Tokenization is not strictly speaking necessary (you can train
         | on bytes). What it is is really really efficient. Scaling is a
         | challenge as is, bytes would just blow that up.
        
         | ATMLOTTOBEER wrote:
         | I tend to agree with you. Your post reminded me of
         | https://gwern.net/aunn
        
         | numpad0 wrote:
         | hot take: LLM tokens are kanji for AI, and just like kanji they
         | work okay sometimes but fail miserably at the task of
         | accurately representing English
        
           | umanwizard wrote:
           | Why couldn't Chinese characters accurately represent English?
           | Japanese and Korean aren't related to Chinese and still were
           | written with Chinese characters (still are in the case of
           | Japanese).
           | 
           | If England had been in the Chinese sphere of influence rather
           | than the Roman one, English would presumably be written with
           | Chinese characters too. The fact that it used an alphabet
           | instead is a historical accident, not due to any grammatical
           | property of the language.
        
             | stickfigure wrote:
             | If I read you correctly, you're saying "the fact that the
             | residents of England speak English instead of Chinese is a
             | historical accident" and maybe you're right.
             | 
             | But the residents of England do in fact speak English, and
             | English is a phonetic language, so there's an inherent
             | impedance mismatch between Chinese characters and English
             | language. I can make up words in English and write them
             | down which don't necessarily have Chinese written
             | equivalents (and probably, vice-versa?).
        
               | umanwizard wrote:
               | > If I read you correctly, you're saying "the fact that
               | the residents of England speak English instead of Chinese
               | is a historical accident" and maybe you're right.
               | 
               | That's not what I mean at all. I mean even if spoken
               | English were exactly the same as it is now, it could have
               | been written with Chinese characters, and indeed would
               | have been if England had been in the Chinese sphere of
               | cultural influence when literacy developed there.
               | 
               | > English is a phonetic language
               | 
               | What does it mean to be a "phonetic language"? In what
               | sense is English "more phonetic" than the Chinese
               | languages?
               | 
               | > I can make up words in English and write them down
               | which don't necessarily have Chinese written equivalents
               | 
               | Of course. But if English were written with Chinese
               | characters people would eventually agree on characters to
               | write those words with, just like they did with all the
               | native Japanese words that didn't have Chinese
               | equivalents but are nevertheless written with kanji.
               | 
               | Here is a famous article about how a Chinese-like writing
               | system would work for English:
               | https://www.zompist.com/yingzi/yingzi.htm
        
             | skissane wrote:
             | > Japanese and Korean aren't related to Chinese and still
             | were written with Chinese characters (still are in the case
             | of Japanese).
             | 
             | The problem is - in writing Japanese with kanji, lots of
             | somewhat arbitrary decisions had to be made. Which kanji to
             | use for which native Japanese word? There isn't always an
             | obviously best choice from first principles. But that's not
             | a problem in practice, because a tradition developed of
             | which kanji to use for which Japanese word (kun'yomi
             | readings). For English, however, we don't have such a
             | tradition. So it isn't clear which Chinese character to use
             | for each English word. If two people tried to write English
             | with Chinese characters independently, they'd likely make
             | different character choices, and the mutual intelligibility
             | might be poor.
             | 
             | Also, while neither Japanese nor Korean belongs to the same
             | language family as Chinese, both borrowed lots of words
             | from Chinese. In Japanese, a lot of use of kanji
             | (especially on'yomi reading) is for borrowings from
             | Chinese. Since English borrowed far fewer terms from
             | Chinese, this other method of "deciding which character(s)
             | to use" - look at the word's Chinese etymology - largely
             | doesn't work for English given very few English words have
             | Chinese etymology.
             | 
             | Finally, they also invented kanji in Japan for certain
             | Japanese words - kokuji. The same thing happened for Korean
             | Hanja (gukja), to a lesser degree. Vietnamese Chu Nom
             | contains thousands of invented-in-Vietnam characters.
             | Probably, if English had adopted Chinese writing, the same
             | would have happened. But again, deciding when to do it and
             | if so how is a somewhat arbitrary choice, which is
             | impossible outside of a real societal tradition of doing
             | it.
             | 
             | > The fact that it used an alphabet instead is a historical
             | accident, not due to any grammatical property of the
             | language.
             | 
             | Using the Latin alphabet changed English, just as using
             | Chinese characters changed Japanese, Korean and Vietnamese.
             | If English had used Chinese characters instead of the Latin
             | alphabet, it would be a very different language today.
             | Possibly not in grammar, but certainly in vocabulary.
        
           | int_19h wrote:
           | You could absolutely write a tokenizer that would
           | consistently tokenize all distinct English words as distinct
           | tokens, with a 1:1 mapping.
           | 
           | But AFAIK there's no evidence that this actually improves
           | anything, and if you spend that much of the dictionary on one
           | language, it comes at the cost of making the encoding for
           | everything else much less efficient.
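           | 
           | The 1:1 mapping itself is trivial (a toy sketch; the
           | vocabulary grows with every new word it sees):
           | 
           |   class WordTokenizer:
           |       def __init__(self):
           |           self.vocab = {}
           | 
           |       def encode(self, text):
           |           ids = []
           |           for word in text.split():
           |               if word not in self.vocab:
           |                   self.vocab[word] = len(self.vocab)
           |               ids.append(self.vocab[word])
           |           return ids
           | 
           |   tok = WordTokenizer()
           |   print(tok.encode("the quick brown fox jumps over the dog"))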
        
         | empiko wrote:
         | I have seen a bunch of tokenization papers with various ideas
         | but their results are mostly meh. I personally don't see
         | anything principally wrong with current approaches. Having
         | discrete symbols is how natural language works, and this might
         | be an okayish approximation.
        
         | malthaus wrote:
         | https://youtu.be/zduSFxRajkE
         | 
         | karpathy agrees with you, here he is hating on tokenizers while
         | re-building them for 2h
        
         | blixt wrote:
         | I think it's infeasible to train on bytes unfortunately, but
         | yeah it also seems very wrong to use a handwritten and
         | ultimately human version of tokens (if you take a look at the
         | tokenizers out there you'll find fun things like regular
         | expressions to change what is tokenized based on anecdotal
         | evidence).
         | 
         | I keep thinking that if we can turn images into tokens, and we
         | can turn audio into tokens, then surely we can create a set of
         | tokens where the tokens are the model's own chosen
         | representation for semantic (multimodal) meaning, and then
         | decode those tokens back to text[1]. Obviously a big downside
         | would be that the model can no longer 1:1 quote all text it's
         | seen since the encoded tokens would need to be decoded back to
         | text (which would be lossy).
         | 
         | [1] From what I could gather, this is exactly what OpenAI did
         | with images in their gpt-4o report, check out "Explorations of
         | capabilities": https://openai.com/index/hello-gpt-4o/
        
         | PittleyDunkin wrote:
         | A byte is itself sort of a token. So is a bit. It makes more
         | sense to use more tokenizers in parallel than it does to try
         | and invent an entirely new way of seeing the world.
         | 
         | Anyway humans have to tokenize, too. We don't perceive the
         | world as a continuous blob either.
        
           | samatman wrote:
           | I would say that "humans have to tokenize" is almost
           | precisely the opposite of how human intelligence works.
           | 
           | We build layered, non-nested gestalts out of real time analog
           | inputs. As a small example, the meaning of a sentence said
           | with the same precise rhythm and intonation can be
           | meaningfully changed by a gesture made while saying it. That
           | can't be tokenized, and that isn't what's happening.
        
             | PittleyDunkin wrote:
             | What is a gestalt if not a token (or a token representing
             | collections of other tokens)? It seems more reasonable (to
             | me) to conclude that we have multiple contradictory
             | tokenizers that we select from rather than to reject the
             | concept entirely.
             | 
             | > That can't be tokenized
             | 
             | Oh ye of little imagination.
        
         | Anotheroneagain wrote:
         | I think on the contrary, the more you can restrict it to
         | _reasonable_ inputs/outputs, the less powerful an LLM you are
         | going to need.
        
         | ajkjk wrote:
         | This is probably unnecessary, but: I wish you wouldn't use the
         | word "stupid" there. Even if you didn't mean anything by it
         | personally, it might reinforce in an insecure reader the idea
         | that, if one can't speak intelligently about some complex and
         | abstruse subject that other people know about, there's
         | something wrong with them, like they're "stupid" in some
         | essential way. When in fact they would just be "ignorant" (of
         | this particular subject). To be able to formulate those
         | questions at all is clearly indicative of great intelligence.
        
           | volkk wrote:
           | > This is probably unnecessary
           | 
           | you're certainly right
        
         | amelius wrote:
         | Perhaps we can even do away with transformers and use a fully
         | connected network. We can always prune the model later ...
        
       | DrNosferatu wrote:
       | What about contemporary frontier models?
        
       | ynniv wrote:
       | I don't think one model is statistically significant. As people
       | have pointed out, it could have chess specific responses that the
       | others do not. There should be at least another one or two,
       | preferably unrelated, "good" data points before you can claim
       | there is a pattern. Also, where's Claude?
        
         | og_kalu wrote:
         | There are other transformers that have been trained on chess
         | text that play chess fine (just not as good as 3.5 Turbo
         | instruct with the exception of the "grandmaster level without
         | search" paper).
        
       | jrecursive wrote:
       | i think this has everything to do with the fact that learning
       | chess by learning sequences will get you into more trouble than
       | good. even a trillion games won't save you:
       | https://en.wikipedia.org/wiki/Shannon_number
       | 
       | that said, for the sake of completeness, modern chess engines
       | (with high quality chess-specific models as part of their
       | toolset) are fully capable of, at minimum, tying every player
       | alive or dead, every time. if the opponent makes one mistake,
       | even very small, they will lose.
       | 
       | while writing this i absently wondered whether, if you increased
       | the skill level of stockfish, maybe to maximum, or at least to an
       | 1800+ elo level, you would see more successful games. even then,
       | it will only be because the "narrower training data" (ie advanced
       | players won't play trash moves) at that level will probably get
       | you more wins in your graph, but it won't indicate any better
       | play, it will just be a reflection of less noise; fewer, more
       | reinforced known positions.
        
         | jayrot wrote:
         | > i think this has everything to do with the fact that learning
         | chess by learning sequences will get you into more trouble than
         | good. even a trillion games won't save you:
         | https://en.wikipedia.org/wiki/Shannon_number
         | 
         | Indeed. As has been pointed out before, the number of possible
         | chess games vastly dwarfs even the wildest possible estimate of
         | the number of atoms in the known universe.
        
           | metadat wrote:
           | What about the number of possible positions where an idiotic
           | move hasn't been played? Perhaps the search space could
           | be reduced quite a bit.
        
             | pixl97 wrote:
             | Unless there is an apparently idiotic move that can lead to
             | an 'island of intelligence'.
        
           | rcxdude wrote:
           | Sure, but so does the number of paragraphs in the english
           | language, and yet LLMs seem to do pretty well at that. I
           | don't think the number of configurations is particularly
           | relevant.
           | 
           | (And it's honestly quite impressive that LLMs can play it at
           | all, but not at all surprising that it loses pretty handily
           | to something which is explicitly designed to search, as
           | opposed to simply feed-forward a decision)
        
           | dataspun wrote:
           | Not true if we're talking sensible chess moves.
        
         | BurningFrog wrote:
         | > _I think this has everything to do with the fact that
         | learning chess by learning sequences will get you into more
         | trouble than good._
         | 
         | Yeah, once you've deviated from a sequence you're lost.
         | 
         | Maybe approaching it by learning the best move in
         | billions/trillions of positions, and feeding that into some AI
         | could work better. Similar positions often have the same kind
         | of best move.
        
         | torginus wrote:
         | Honestly, I think that once you discard the moves one would
         | never make, and account for symmetries/effectively similar
         | board positions (ones that could be detected by a very simple
         | pattern matcher), chess might not be that big a game at all.
        
           | jrecursive wrote:
           | you should try it and post a rebuttal :)
        
         | astrea wrote:
         | Since we're mentioning Shannon... What is the minimum
         | representative sample size of that problem space? Is it close
         | enough to the number of freely available chess moves on the
         | Internet and in books?
        
       | underlines wrote:
       | Can you try increasing compute in the problem search space, not
       | in the training space? What this means is, give it more compute
       | to think during inference by not forcing any model to "only
       | output the answer in algebraic notation" but do CoT prompting:
       | "1. Think about the current board 2. Think about valid possible
       | next moves and choose the 3 best by thinking ahead 3. Make your
       | move"
       | 
       | Or whatever you deem a good step by step instruction of what an
       | actual good beginner chess player might do.
       | 
       | Then try different notations, different prompt variations,
       | temperatures and the other parameters. That all needs to go in
       | your hyper-parameter-tuning.
       | 
       | One could try using DSPy for automatic prompt optimization.
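       | 
       | As a sketch, the prompt could look something like this (the
       | wording is just an example, not tuned):
       | 
       |   COT_PROMPT = """You are playing White. Game so far (PGN):
       |   {pgn}
       | 
       |   1. Describe the current position in your own words.
       |   2. List three candidate moves and the likely best reply to
       |      each, thinking at least one move ahead.
       |   3. Pick the best candidate.
       | 
       |   End with one line of the form: MOVE: <algebraic notation>
       |   """
       | 
       |   def extract_move(completion: str) -> str:
       |       # take whatever follows the final "MOVE:" marker
       |       return completion.rsplit("MOVE:", 1)[-1].strip()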
        
         | viraptor wrote:
         | Yeah, the expectation of an immediate answer is definitely
         | limiting the results, especially for the later stages. Another
         | possible
         | improvement: every 2 steps, show the current board state and
         | repeat the moves still to be processed, before analysing the
         | final position.
        
         | pavel_lishin wrote:
         | > _1. Think about the current board 2. Think about valid
         | possible next moves and choose the 3 best by thinking ahead 3._
         | 
         | Do these models actually _think about a board_? Chess engines
         | do, as much as we can say that any machine thinks. But do LLMs?
        
           | TZubiri wrote:
           | Can be forced through inference with CoT type of stuff. Spend
           | tokens at each stage to draw the board for example, then
           | spend tokens restating the rules of the game, then spend
           | tokens restating the heuristics like piece value, and then
           | spend tokens doing a minmax n-ply search.
           | 
           | Wildly inefficient? Probably. Could maybe generate some
           | python to make more efficient? Maybe, yeah.
           | 
           | Essentially user would have to teach gpt to play chess, or
           | training would fine tune chess towards these CoT, fine
           | tuning, etc...
        
       | cmpalmer52 wrote:
       | I don't think it would have an impact great enough to explain the
       | discrepancies you saw, but some chess engines on very low
       | difficulty settings make "dumb" moves sometimes. I'm not great at
       | chess and I have trouble against them sometimes because they
       | don't make the kind of mistakes humans make. Moving the
       | difficulty up a bit makes the games more predictable, in that you
       | can predict and force an outcome without the computer blowing it
       | with a random bad move. Maybe part of the problem is them not
       | dealing with random moves well.
       | 
       | I think an interesting challenge would be looking at a board
       | configuration and scoring it on how likely it is to be real -
       | something high ranked chess players can do without much thought
       | (telling a random setup of pieces from a game in progress).
        
       | Xcelerate wrote:
       | So if you squint, chess can be considered a formal system. Let's
       | plug ZFC or PA into gpt-3.5-turbo-instruct along with an
       | interesting theorem and see what happens, no?
        
       | tqi wrote:
       | I assume LLMs will be fairly average at chess for the same reason
       | it can't count the Rs in "strawberry" - it's reflecting the
       | training set
       | and not using any underlying logic? Granted my understanding of
       | LLMs is not very sophisticated, but I would be surprised if the
       | Reward Models used were able to distinguish high quality moves vs
       | subpar moves...
        
         | ClassyJacket wrote:
         | LLMs can't count the Rs in strawberry because of tokenization.
         | Words are converted to vectors (numbers), so the actual
         | transformer network never sees the letters that make up the
         | word.
         | 
         | ChatGPT doesn't see "strawberry", it sees [302, 1618, 19772]
        
           | tqi wrote:
           | Hm but if that is the case, then why did LLMs only fail at
           | the tasks for a few word/letter combinations (like r's in
           | "Strawberry"), and not all words?
        
       | bryan0 wrote:
       | I remember one of the early "breakthroughs" for LLMs in chess was
       | that they could actually play legal moves(!) In all of these
       | games are the models always playing legal moves? I don't think
       | the article says. The fact that an LLM can even reliably play
       | legal moves, 20+ moves into a chess game is somewhat remarkable.
       | It needs to have an accurate representation of the board state
       | even though it was only trained on next token prediction.
        
         | pama wrote:
         | The author explains what they did: restrict the move options to
         | valid ones when possible (for open models with the ability to
         | enforce grammar during inference) or sample the model for a
         | valid move up to ten times, then pick a random valid move.
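         | 
         | Roughly this shape (a sketch with python-chess; llm_move_san
         | stands in for the actual sampling call):
         | 
         |   import random
         |   import chess
         | 
         |   def next_move(board, llm_move_san, tries=10):
         |       for _ in range(tries):
         |           san = llm_move_san(board)  # placeholder
         |           try:
         |               return board.parse_san(san.strip())
         |           except ValueError:
         |               continue  # illegal/unparseable: sample again
         |       return random.choice(list(board.legal_moves))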
        
         | zelphirkalt wrote:
         | I think it only needs to have read sufficient pgns.
        
         | kenjackson wrote:
         | I did a very unscientific test and it did seem to just play
         | legal moves. Not only that, if I did an illegal move it would
         | tell me that I couldn't do it.
         | 
         | I think I said that I wanted to play with new rules, where a
         | queen could jump over any pawn, and it let me make that rule
         | change -- and we played with this new rule. Unfortunately, I
         | was trying to play in my head and I got mixed up and ended up
         | losing my queen. Then I changed the rule one more time -- if
         | you take the queen you lose -- so I won!
        
       | ericye16 wrote:
       | I agree with some of the other comments here that the prompt is
       | limiting. The model can't do any computation without emitting
       | tokens and limiting the numbers of tokens it can emit is going to
       | limit the skill of the model. It's surprising that any model at
       | all is capable of performing well with this prompt in fact.
        
       | niobe wrote:
       | I don't understand why educated people expect that an LLM _would_
       | be able to play chess at a decent level.
       | 
       | It has no idea about the quality of its data. "Act like x"
       | prompts are no substitute for actual reasoning and deterministic
       | computation which clearly chess requires.
        
         | computerex wrote:
         | Question here is why gpt-3.5-instruct can then beat stockfish.
        
           | fsndz wrote:
           | PS: I ran it, and as suspected gpt-3.5-turbo-instruct does not
           | beat stockfish; it is not even close: "Final Results:
           | gpt-3.5-turbo-instruct: Wins=0, Losses=6, Draws=0,
           | Rating=1500.00 stockfish: Wins=6, Losses=0, Draws=0,
           | Rating=1500.00" https://www.loom.com/share/870ea03197b3471eaf
           | 7e26e9b17e1754?...
        
             | computerex wrote:
             | Maybe there's some difference in the setup because the OP
             | reports that the model beats stockfish (how they had it
             | configured) every single game.
        
               | Filligree wrote:
               | OP had stockfish at its weakest preset.
        
               | fsndz wrote:
               | Did the same and gpt-3.5-turbo-instruct still lost all
               | the games. Maybe a diff in stockfish version? I am using
               | stockfish 16.
        
               | mannykannot wrote:
               | That is a very pertinent question, especially if
               | Stockfish has been used to generate training data.
        
               | golol wrote:
               | You have to get the model to think in PGN data. It's
               | crucial to use the exact PGN format it saw in its
               | training data and to give it few-shot examples.
        
           | bluGill wrote:
           | The article appears to have only run Stockfish at low levels.
           | You don't have to be very good to beat it.
        
           | lukan wrote:
           | Cheating (using an internal chess engine) would be the obvious
           | reason to me.
        
             | TZubiri wrote:
             | Nope. Calls via the API don't use function calls.
        
               | permo-w wrote:
               | that you know of
        
               | TZubiri wrote:
               | Sure. It's not hard to verify: in the user UI, function
               | calls are very transparent.
               | 
               | And in the api, all of the common features like maths and
               | search are just not there. You can implement them
               | yourself.
               | 
               | You can compare with self hosted models like llama and
               | the performance is quite similar.
               | 
               | You can also jailbreak and get shell into the container
               | to get some further proof
        
               | girvo wrote:
               | How can you prove this when talking about someone's
               | internal closed API?
        
           | shric wrote:
           | I'm actually surprised any of them manage to make legal moves
           | throughout the game once they're out of book moves.
        
         | SilasX wrote:
         | Right, at least as of the ~GPT3 model it was just "predict what
         | you _would_ see in a chess game ", not "what would _be_ the
         | best move ". So (IIRC) users noted that if you made bad move,
         | then the model would also reply with bad moves because it
         | pattern matched to bad games. (I anthropomorphized this as the
         | model saying "oh, we're doing dumb-people-chess now, I can do
         | that too!")
        
           | cma wrote:
           | But it also predicts moves where the text says "black won the
           | game, [proceeds to show the game]". To minimize loss on that,
           | it would need to use the context to try to make sure white
           | doesn't make critical mistakes.
        
         | aqme28 wrote:
         | Yeah, that is the "something weird" of the article.
        
         | viraptor wrote:
         | This is a puzzle given enough training information. LLM can
         | successfully print out the status of the board after the given
         | moves. It can also produce a not-terrible summary of the
         | position and is able to list dangers at least one move ahead.
         | Decent is subjective, but that should beat at least beginners.
         | And the lowest level of stockfish used in the blog post is
         | lowest intermediate.
         | 
         | I don't know really what level we should be thinking of here,
         | but I don't see any reason to dismiss the idea. Also, it really
         | depends on whether you're thinking of the current public
         | implementations of the tech, or the LLM idea in general. If we
         | wanted to get better results, we could feed it way more chess
         | books and past game analysis.
        
           | grugagag wrote:
           | LLMs like GPT aren't built to play chess, and here's why:
           | they're made for handling language, not playing games with
           | strict rules and strategies. Chess engines, like Stockfish,
           | are designed specifically for analyzing board positions and
           | making the best moves, but LLMs don't even "see" the board.
           | They're just guessing moves based on text patterns, without
           | understanding the game itself.
           | 
           | Plus, LLMs have limited memory, so they struggle to remember
           | previous moves in a long game. It's like trying to play
           | blindfolded! They're great at explaining chess concepts or
           | moves but not actually competing in a match.
        
             | viraptor wrote:
             | > but LLMs don't even "see" the board
             | 
             | This is a very vague claim, but they _can_ reconstruct the
             | board from the list of moves, which I would say proves this
             | wrong.
             | 
             | > LLMs have limited memory
             | 
             | For the recent models this is not a problem for the chess
             | example. You can feed whole books into them if you want to.
             | 
             | > so they struggle to remember previous moves
             | 
             | Chess is stateless with perfect information. Unless you're
             | going for mind games, you don't need to remember previous
             | moves.
             | 
             | > They're great at explaining chess concepts or moves but
             | not actually competing in a match.
             | 
             | What's the difference between a great explanation of a move
             | and explaining every possible move then selecting the best
             | one?
        
               | mjcohen wrote:
               | Chess is not stateless. Three repetitions of the same
               | position is a draw.
        
               | Someone wrote:
               | Yes, there's state there that's not in the board
               | position, but technically, threefold repetition is not a
               | draw. Play can go on.
               | https://en.wikipedia.org/wiki/Threefold_repetition:
               | 
               |  _"The game is not automatically drawn if a position
               | occurs for the third time - one of the players, on their
               | turn, must claim the draw with the arbiter. The claim
               | must be made either before making the move which will
               | produce the third repetition, or after the opponent has
               | made a move producing a third repetition. By contrast,
               | the fivefold repetition rule requires the arbiter to
               | intervene and declare the game drawn if the same position
               | occurs five times, needing no claim by the players."_
        
               | cool_dude85 wrote:
               | >Chess is stateless with perfect information. Unless
               | you're going for mind games, you don't need to remember
               | previous moves.
               | 
               | In what sense is chess stateless? Question: is Rxa6 a
               | legal move? You need board state to refer to in order to
               | decide.
        
               | aetherson wrote:
               | They mean that you only need board position, you don't
               | need the previous moves that led to that board position.
               | 
               | There are at least a couple of exceptions to that as far
               | as I know.
        
               | User23 wrote:
               | The correct phrasing would be: is it a Markov process?
        
               | chongli wrote:
               | Yes, 4 exceptions: castling rights, legal en passant
               | captures, threefold repetition, and the 50 move rule. You
               | actually need quite a lot of state to track all of those.
        
               | fjkdlsjflkds wrote:
               | It shouldn't be too much extra state. I assume that 2
               | bits should be enough to cover castling rights (one for
               | each player), whatever is necessary to store the last 3
               | moves should cover legal en passant captures and
               | threefold repetition, and 12 bits to store two non-
               | overflowing 6 bit counters (time since last capture, and
               | time since last pawn move) should cover the 50 move rule.
               | 
               | So... unless I'm understanding something incorrectly,
               | something like "the three last moves plus 17 bits of
               | state" (plus the current board state) should be enough to
               | treat chess as a memoryless process. Doesn't seem like
               | too much to track.
        
               | chongli wrote:
               | Threefold repetition does not require the three positions
               | to occur consecutively. So you could conceivably have a
               | position repeat itself for the first time on the 1st
               | move, the second time on the 25th, and the third time on
               | the 50th move of a sequence, and then players could claim
               | a draw by threefold repetition or the 50 move rule at the
               | same time!
               | 
               | This means you do need to store the last 50 board
               | positions in the worst case. Normally you need to store
               | less because many moves are irreversible (pawns cannot go
               | backwards, pieces cannot be un-captured).
        
               | fjkdlsjflkds wrote:
               | Ah... gotcha. Thanks for the clarification.
        
               | sfmz wrote:
               | Chess is not stateless. En Passant requires last move and
               | castling rights requires nearly all previous moves.
               | 
               | https://adamkarvonen.github.io/machine_learning/2024/01/0
               | 3/c...
        
               | viraptor wrote:
               | Ok, I did go too far. But castling doesn't require all
               | previous moves - only one bit of information carried
               | over. So in practice that's board + 2 bits per player.
               | (or 1 bit and 2 moves if you want to include a draw)
        
               | aaronchall wrote:
               | Castling requires no prior moves by either piece (King or
               | Rook). Move the King once and back early on, and later,
               | although the board looks set for castling, the King may
               | not castle.
        
               | viraptor wrote:
               | Yes, which means you carry one bit of extra information -
               | "is castling still allowed". The specific moves that
               | resulted in this bit being unset don't matter.
        
               | aaronchall wrote:
               | Ok, then for this you need a minimum of two bits - one for
               | kingside Rook and one for the queenside Rook, both would
               | be set if you move the King. You also need to count moves
               | since the last exchange or pawn move for the 50 move
               | rule.
        
               | viraptor wrote:
               | Ah, that one's cool - I've got to admit I've never heard
               | of the 50 move rule.
        
               | User23 wrote:
               | Also the 3x repetition rule.
        
               | chipsrafferty wrote:
               | And 5x repetition rule
        
               | ethbr1 wrote:
               | > _Chess is stateless with perfect information._
               | 
               | It is not stateless, because good chess isn't played as a
               | series of independent moves -- it's played as a series of
               | moves connected to a player's strategy.
               | 
               | > _What 's the difference between a great explanation of
               | a move and explaining every possible move then selecting
               | the best one?_
               | 
               | Continuing from the above, "best" in the latter sense
               | involves understanding possible future moves _after_ the
               | next move.
               | 
               | Ergo, if I looked at all games with the current board
               | state and chose the next move that won the most games,
               | it'd be tactically sound but strategically ignorant.
               | 
               | Because many of those next moves were making that next
               | move _in support of_ some broader strategy.
        
               | viraptor wrote:
               | > it's played as a series of moves connected to a
               | player's strategy.
               | 
               | That state belongs to the player, not to the game. You
               | can carry your own state in any game you want - for
               | example remember who starts with what move in rock paper
               | scissors, but that doesn't make that game stateful. It's
               | the player's decision (or bot's implementation) to use
               | any extra state or not.
               | 
               | I wrote "previous moves" specifically (and the extra bits
               | already addressed elsewhere), but the LLM can
               | carry/rebuild its internal state between the steps.
        
               | ethbr1 wrote:
               | If we're talking about LLMs, then the state belongs to
               | it.
               | 
               | So even if the rules of chess are (mostly) stateless, the
               | resulting game itself is not.
               | 
               | Thus, you can't dismiss concerns about LLMs having
               | difficulty tracking state by saying that chess is
               | stateless. It's not, in that sense.
        
               | lxgr wrote:
               | > good chess isn't played as a series of independent
               | moves -- it's played as a series of moves connected to a
               | player's strategy.
               | 
               | Maybe good chess, but not perfect chess. That would by
               | definition be game-theoretically optimal, which in turn
               | implies having to maintain no state other than your
               | position in a large but precomputable game tree.
        
               | chongli wrote:
               | Right, but your position also includes whether or not you
               | still have the right to castle on either side, whether
               | each pawn has the right to capture en passant or not, the
               | number of moves since the last pawn move or capture (for
               | tracking the 50 move rule), and whether or not the
               | current position has ever appeared on the board once or
               | twice prior (so you can claim a draw by threefold
               | repetition).
               | 
               | So in practice, your position actually includes the log
               | of all moves to that point. That's a lot more state than
               | just what you can see on the board.
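               | 
               | As a rough sketch (field names purely illustrative, not
               | from the article), the state you would need to carry
               | alongside the visible board to treat chess as a Markov
               | process looks roughly like this:
               | 
               |     from dataclasses import dataclass, field
               | 
               |     @dataclass
               |     class GameState:
               |         board: str          # piece placement (FEN)
               |         side_to_move: str   # "w" or "b"
               |         castling: str       # subset of "KQkq"
               |         en_passant: str     # target square or "-"
               |         halfmove_clock: int  # for the 50 move rule
               |         # counts of positions seen so far, for
               |         # threefold repetition claims
               |         history: dict = field(default_factory=dict)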
        
               | cowl wrote:
               | > Chess is stateless with perfect information. Unless
               | you're going for mind games, you don't need to remember
               | previous moves.
               | 
               | While it can be played as stateless, remembering previous
               | moves gives you insight into the potential strategy that
               | is being built.
        
               | jackcviers3 wrote:
               | You can feed them whole books, but they have trouble with
               | recall for specific information in the middle of the
               | context window.
        
             | jerska wrote:
             | LLMs need to compress information to be able to predict
             | next words in as many contexts as possible.
             | 
             | Chess moves are simply tokens as any other. Given enough
             | chess training data, it would make sense to have part of
             | the network trained to handle chess specifically instead of
             | simply encoding basic lists of moves and follow-ups. The
             | result would be a general purpose sub-network trained on
             | chess.
        
             | zeckalpha wrote:
             | Language is a game with strict rules and strategies.
        
             | codebolt wrote:
             | > they're made for handling language, not playing games
             | with strict rules and strategies
             | 
             | Here's the opposite theory: Language encodes objective
             | reasoning (or at least, it does some of the time). A
             | sufficiently large ANN trained on sufficiently large
             | amounts of text will develop internal mechanisms of
             | reasoning that can be applied to domains outside of
             | language.
             | 
             | Based on what we are currently seeing LLMs do, I'm becoming
             | more and more convinced that this is the correct picture.
        
               | wruza wrote:
               | I share this idea but from a different perspective. It
               | doesn't develop these mechanisms, but casts a high-
               | dimensional-enough shadow of their effect on itself. This
               | vaguely explains why the deeper you are, Gell-Mann-wise,
               | the less sharp that shadow is, because specificity cuts
               | off "reasoning" hyperplanes.
               | 
               | It's hard to explain emerging _mechanisms_ because of the
               | nature of generation, which is one-pass sequential matrix
               | reduction. I say this while waving my hands, but listen.
               | Reasoning is similar to Turing complete algorithms, and
               | what LLMs can become through training is similar to
               | limited pushdown automata at best. I think this is a good
               | conceptual handle for it.
               | 
               | "Line of thought" is an interesting way to loop the
               | process back, but it doesn't show _that_ much
               | improvement, afaiu, and still is finite.
               | 
               | Otoh, a chess player takes as much time and "loops" as
               | they need to get the result (ignoring competitive time
               | limits).
        
             | nemomarx wrote:
             | just curious, was this rephrased by an llm or is that your
             | writing style?
        
           | shric wrote:
           | Stockfish level 1 is well below "lowest intermediate".
           | 
           | A friend of mine just started playing chess a few weeks ago
           | and can beat it about 25% of the time.
           | 
           | It will hang pieces, and you can hang your own queen and
           | there's about a 50% chance it won't be taken.
        
         | danielmarkbruce wrote:
         | Chess does not clearly require that. Various purely
         | ML/statistics-based model approaches are doing pretty well.
         | It's almost certainly best to incorporate some kind of search
         | into an overall system, but it's not absolutely required to
         | play at a decent amateur level.
         | 
         | The problem here is the specific model architecture, training
         | data, vocabulary/tokenization method (if you were going to even
         | represent a game this way... which you wouldn't), loss function
         | and probably decoding strategy.... basically everything is
         | wrong here.
        
         | TZubiri wrote:
         | Bro, it actually did play chess, didn't you read the article?
        
           | mandevil wrote:
           | It sorta played chess - he let it generate up to ten moves,
           | throwing away any that weren't legal, and if no legal move
           | was generated by the 10th try he picked a random legal move.
           | He does not say how many times he had to provide a random
           | move, or how many times illegal moves were generated.
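           | 
           | A minimal sketch of that retry loop, using the python-chess
           | library (the generate_move callback here is just a stand-in
           | for the LLM call, not the author's actual code):
           | 
           |     import random
           |     import chess
           | 
           |     def choose_move(board, generate_move, tries=10):
           |         # Ask the model for a move up to `tries`
           |         # times; fall back to a random legal move.
           |         for _ in range(tries):
           |             text = generate_move(board)  # e.g. "Nf3"
           |             try:
           |                 return board.parse_san(text.strip())
           |             except ValueError:
           |                 continue  # illegal or unparsable
           |         return random.choice(list(board.legal_moves))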
        
             | og_kalu wrote:
             | You're right it's not in this blog but turbo-instruct's
             | chess ability has been pretty thoroughly tested and it does
             | play chess.
             | 
             | https://github.com/adamkarvonen/chess_gpt_eval
        
             | TZubiri wrote:
             | Ah, I didn't see the illegal move discarding.
        
               | mandevil wrote:
               | That was for the OpenAI games- including the ones that
               | won. For the ones he ran himself with open source LLMs he
               | restricted their grammar to just be legal moves, so it
               | could only respond with a legal move. But that was
               | because of a separate process he added on top of the LLM.
               | 
               | Again, this isn't exactly HAL playing chess.
        
         | slibhb wrote:
         | Few people (perhaps none) expected LLMs to be good at chess.
         | Nevertheless, as the article explains, there was buzz around a
         | year ago that LLMs were good at chess.
         | 
         | > It has no idea about the quality of it's data. "Act like x"
         | prompts are no substitute for actual reasoning and
         | deterministic computation which clearly chess requires.
         | 
         | No. You can definitely train a model to be really good at chess
         | without "actual reasoning and deterministic computation".
        
         | xelxebar wrote:
         | Then you should be surprised that turbo-instruct actually plays
         | well, right? We see a proliferation of hand-wavy arguments
         | based on unfounded anthropomorphic intuitions about "actual
         | reasoning" and whatnot. I think this is good evidence that
         | nobody really understands what's going on.
         | 
         | If some mental model says that LLMs should be bad at chess,
         | then it fails to explain why we have LLMs playing strong chess.
         | If another mental model says the inverse, then it fails to
         | explain why so many of these large models fail spectacularly at
         | chess.
         | 
         | Clearly, there's more going on here.
        
           | flyingcircus3 wrote:
           | "playing strong chess" would be a much less hand-wavy claim
           | if there were lots of independent methods of quantifying and
           | verifying the strength of stockfish's lowest difficulty
           | setting. I honestly don't know if that exists or not. But
           | unless it does, why would stockfish's lowest difficulty
           | setting be a meaningful threshold?
        
             | golol wrote:
             | I've tried it myself, GPT-3.5-turbo-instruct was at least
             | somewhere in the range 1600-1800 Elo.
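             | 
             | For reference, the standard Elo expected-score formula that
             | such estimates are usually backed out of (a sketch, not the
             | exact method used here):
             | 
             |     def expected_score(r_a, r_b):
             |         # Expected score of A against B.
             |         return 1 / (1 + 10 ** ((r_b - r_a) / 400))
             | 
             |     # A 1700-rated player is expected to score
             |     # about 0.76 against a 1500-rated opponent.
             |     expected_score(1700, 1500)  # ~0.76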
        
           | akira2501 wrote:
           | There are some who suggest that modern chess is mostly a game
           | of memorization and not one particularly of strategy or
           | skill. I assume this is why variants like speed chess exist.
           | 
           | In this scope, my mental model is that LLMs would be good at
           | modern style long form chess, but would likely be easy to
           | trip up with certain types of move combinations that most
           | humans would not normally use. My prediction is that once
           | found they would be comically susceptible to these patterns.
           | 
           | Clearly, we have no real basis for saying it is "good" or
           | "bad" at chess, and even using chess performance as a
           | measurement sample is a highly biased decision, likely born
           | out of marketing rather than principle.
        
             | mewpmewp2 wrote:
             | It is memorisation only after you have grandmastered
             | reasoning and strategy.
        
             | DiogenesKynikos wrote:
             | Speed chess relies on skill.
             | 
             | I think you're using "skill" to refer solely to one aspect
             | of chess skill: the ability to do brute-force calculations
             | of sequences of upcoming moves. There are other aspects of
             | chess skill, such as:
             | 
             | 1. The ability to judge a chess position at a glance, based
             | on years of experience in playing chess and theoretical
             | knowledge about chess positions.
             | 
             | 2. The ability to instantly spot tactics in a position.
             | 
             | In blitz (about 5 minutes) or bullet (1 minute) chess
             | games, these other skills are much more important than the
             | ability to calculate deep lines. They're still aspects of
             | chess skill, and they're probably equally important as the
             | ability to do long brute-force calculations.
        
               | henearkr wrote:
               | > tactics in a position
               | 
               | That should give patterns (hence your use of the verb to
               | "spot" them, as the grandmaster would indeed spot the
               | patterns) recognizable in the game string.
               | 
               | More specifically grammar-like patterns, e.g. the same
               | moves but translated.
               | 
               | Typically what an LLM can excel at.
        
           | the_af wrote:
           | > _Then you should be surprised that turbo-instruct actually
           | plays well, right?_
           | 
           | Do we know it's not special-casing chess and instead using a
           | different engine (not an LLM) for playing?
           | 
           | To be clear, this would be an entirely _appropriate_ approach
           | to problem-solving in the real world, it just wouldn 't be
           | the LLM that's playing chess.
        
           | mda wrote:
           | Yes, probably there is more going on here, e.g. it is
           | cheating.
        
         | mannykannot wrote:
         | One of the main purposes of running experiments of any sort is
         | to find out if our preconceptions are accurate. Of course, if
         | someone is not interested in that question, they might as well
         | choose not to look through the telescope.
        
           | bowsamic wrote:
           | Sadly there's a common sentiment on HN that testing obvious
           | assumptions is a waste of time
        
             | BlindEyeHalo wrote:
             | Not only on HN. Trying to publish a scientific article that
             | does not contain the word 'novel' has become almost
             | impossible. No one is trying to reproduce anyone's claims
             | anymore.
        
               | pcf wrote:
               | Do you think this bias is part of the replication crisis
               | in science?
        
               | bowsamic wrote:
               | I don't think this is about replication, but even just
               | about the initial test in the first place. In science we
               | do often test obvious things. For example, I was a
               | theoretical quantum physicist, and a lot of the time I
               | knew that what I am working on will definitely work,
               | since the maths checks out. In some sense that makes it
               | kinda obvious, but we test it anyway.
               | 
               | The issue is that even that kinda obviousness is
               | criticised here. People get mad at the idea of doing
               | experiments when we already expect a result.
        
         | pizza wrote:
         | But there's really nothing about chess that makes reasoning a
         | prerequisite, a win is a win as long as it's a win. This is
         | kind of a semantics game: it's a question of whether the degree
         | of skill people observe in an LLM playing chess is actually
         | some different quantity than the chance it wins.
         | 
         | I mean at some level you're saying that no matter how close to
         | 1 the win probability (1 - epsilon) gets, both of the following
         | are true:
         | 
         | A. you should always expect for the computation that you're
         | able to do via conscious reasoning alone to always be
         | sufficient, at least in principle, to asymptotically get a
         | higher win probability than a model, no matter what the model's
         | win probability was to begin with
         | 
         | B. no matter how close to 1 that the model's win rate p=(1 -
         | epsilon) gets, because logical inference is so non-smooth, the
         | win rate on yet-unseen data is fundamentally algorithmically
         | random/totally uncorrelated to in-distribution performance, so
         | it's never appropriate to say that a model can understand or to
         | reason
         | 
         | To me it seems that people are subject to both of these
         | criteria, though. They have a tendency to cap out at their
         | eventual skill cap unless given a challenge to nudge them to a
         | higher level, and likewise possession of logical reasoning
         | doesn't let us say much at all about situations that their
         | reasoning is unfamiliar with.
         | 
         | I also think, if you want to say that what LLMs do has nothing
         | to do with understanding or ability, then you also have to have
         | an alternate explanation for the phenomenon of AlphaGo
         | defeating Lee Sedol being a catalyst for top Go players being
         | able to rapidly increase their own rankings shortly after.
        
         | jsemrau wrote:
         | There are many ways to test for reasoning and deterministic
         | computation, as my own work in this space has shown.
        
         | golol wrote:
         | Because it's a straightforward stochastic sequence modelling
         | task and I've seen GPT-3.5-turbo-instruct play at high amateur
         | level myself. But it seems like all the RLHF and distillation
         | that is done on newer models destroys that ability.
        
         | QuesnayJr wrote:
         | They thought it because we have an existence proof:
         | gpt-3.5-turbo-instruct _can_ play chess at a decent level.
         | 
         | That was the point of the post (though you have to read it to
         | the end to see this). That one model can play chess pretty
         | well, while the free models and OpenAI's later models can't.
         | That's weird.
        
         | scj wrote:
         | It'd be more interesting to see LLMs play Family Feud. I think
         | it'd be their ideal game.
        
         | chipdart wrote:
         | > I don't understand why educated people expect that an LLM
         | would be able to play chess at a decent level.
         | 
         | The blog post demonstrates that an LLM plays chess at a decent
         | level.
         | 
         | The blog post explains why. It addresses the issue of data
         | quality.
         | 
         | I don't understand what point you thought you were making.
         | Regardless of where you stand, the blog post showcases a
         | surprising result.
         | 
         | You stress your prior unfounded belief, you were presented with
         | data that proves it wrong, and your reaction was to post a
         | comment with a thinly veiled accusation of people not being
         | educated when clearly you are the one that's off.
         | 
         | To make matters worse, this topic is also about curiosity,
         | which has a strong link with intelligence and education. And
         | you are here criticizing others on those grounds in spite of
         | showing your deficit right in the first sentence.
         | 
         | This blog post was a great read. Very surprising, engaging, and
         | thought provoking.
        
         | Cthulhu_ wrote:
         | > I don't understand why educated people expect that an LLM
         | would be able to play chess at a decent level.
         | 
         | Because it would be super cool; curiosity isn't something to be
         | frowned upon. If it turned out it _did_ play chess reasonably
         | well, it would mean emergent behaviour instead of just echoing
         | things said online.
         | 
         | But it's wishful thinking with this technology at this current
         | level; like previous instances of chatbots and the like, while
         | initially they can convince some people that they're
         | intelligent thinking machines, this test proves that they
         | aren't. It's part of the scientific process.
        
         | jdthedisciple wrote:
         | I love how LLMs are the one subject matter where even most
         | educated people are extremely confidently _wrong_.
        
           | fourthark wrote:
           | Ppl acting like LLMs!
        
         | motoboi wrote:
         | I suppose you didn't get the news, but Google developed an LLM
         | that can play chess, and play it at grandmaster level:
         | https://arxiv.org/html/2402.04494v1
        
           | suddenlybananas wrote:
           | That article isn't as impressive as it sounds: https://gist.g
           | ithub.com/yoavg/8b98bbd70eb187cf1852b3485b8cda...
           | 
           | In particular, it is _not_ an LLM and it is not trained
           | solely on observations of chess moves.
        
           | Scene_Cast2 wrote:
           | Not quite an LLM. It's a transformer model, but there's no
           | tokenizer or words, just chess board positions (64 tokens,
           | one per board square). It's purpose-built for chess (never
           | sees a word of text).
        
             | lxgr wrote:
             | In fact, the unusual aspect of this chess engine is not
             | that it's using neural networks (even Stockfish does, these
             | days!), but that it's _only_ using neural networks.
             | 
             | Chess engines essentially do two things: calculate the
             | value of a given position for their side, and walk the game
             | tree while evaluating positions in that way.
             | 
             | Historically, position value was a handcrafted function
             | using win/lose criteria (e.g. being able to give checkmate
             | is infinitely good) and elaborate heuristics informed by
             | real chess games, e.g. having more space on the board is
             | good, having a high-value piece threatened by a low-value
             | one is bad etc., and the strength of engines largely
             | resulted from being able to "search the game tree" for good
             | positions very broadly and deeply.
             | 
             | Recently, neural networks (trained on many simulated games)
             | have been replacing these hand-crafted position evaluation
             | functions, but there's still a ton of search going on. In
             | other words, the networks are still largely "dumb but
             | fast", and without deep search they'll lose against even a
             | novice player.
             | 
             | This paper now presents a _searchless_ chess engine, i.e.
             | one who essentially  "looks at the board once" and "intuits
             | the best next move", without "calculating" resulting
             | hypothetical positions at all. In the words of Capablanca,
             | a chess world champion also cited in the paper: "I see only
             | one move ahead, but it is always the correct one."
             | 
             | The fact that this is possible can be considered
             | surprising, a testament to the power of transformers etc.,
             | but it does indeed have nothing to do with language or LLMs
             | (other than that the best ones known to date are based on
             | the same architecture).
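             | 
             | To make the "evaluation plus search" split concrete, here
             | is a toy sketch in the style of a classical engine
             | (material-only evaluation, fixed-depth negamax, via
             | python-chess; real engines are vastly more elaborate):
             | 
             |     import chess
             | 
             |     VALUES = {chess.PAWN: 1, chess.KNIGHT: 3,
             |               chess.BISHOP: 3, chess.ROOK: 5,
             |               chess.QUEEN: 9, chess.KING: 0}
             | 
             |     def evaluate(board):
             |         # Material balance, from the side to
             |         # move's point of view.
             |         score = 0
             |         for p in board.piece_map().values():
             |             s = 1 if p.color == board.turn else -1
             |             score += s * VALUES[p.piece_type]
             |         return score
             | 
             |     def negamax(board, depth):
             |         if depth == 0 or board.is_game_over():
             |             return evaluate(board), None
             |         best, best_move = -float("inf"), None
             |         for move in board.legal_moves:
             |             board.push(move)
             |             score, _ = negamax(board, depth - 1)
             |             board.pop()
             |             score = -score
             |             if score > best:
             |                 best, best_move = score, move
             |         return best, best_move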
        
           | teleforce wrote:
           | It's interesting to note that the paper benchmarked its chess
           | playing performance against GPT-3.5-turbo-instruct, the only
           | well-performing LLM in the posted article.
        
         | empath75 wrote:
         | > I don't understand why educated people expect that an LLM
         | would be able to play chess at a decent level.
         | 
         | You shouldn't, but there are lots of things that LLMs can do
         | that educated people wouldn't expect them to be able to do.
        
       | abalaji wrote:
       | An easy way to make all LLMs somewhat good at chess is to make a
       | Chess Eval that you publish and get traction with. Suddenly you
       | will find that all newer frontier models are half decent at
       | chess.
        
       | fsndz wrote:
       | wow I actually did something similar recently and no LLM could
       | win and the centipawn loss was always going through the roof
       | (sort of). I created a leaderboard based on it.
       | https://www.lycee.ai/blog/what-happens-when-llms-play-chess
       | 
       | I am very surprised by the perf of gpt-3.5-turbo-instruct.
       | Beating Stockfish? I will have to run the experiment with that
       | model to check that out.
        
         | fsndz wrote:
         | PS: I ran it and, as suspected, gpt-3.5-turbo-instruct does not
         | beat Stockfish; it is not even close.
         | 
         | "Final Results: gpt-3.5-turbo-instruct: Wins=0, Losses=6,
         | Draws=0, Rating=1500.00 stockfish: Wins=6, Losses=0, Draws=0,
         | Rating=1500.00"
         | 
         | https://www.loom.com/share/870ea03197b3471eaf7e26e9b17e1754?...
        
           | janalsncm wrote:
           | > I always had the LLM play as white against Stockfish--a
           | standard chess AI--on the lowest difficulty setting
           | 
           | I think the author was comparing against Stockfish at a lower
           | skill level (roughly, the number of nodes explored in a
           | move).
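            | 
            | For reference, weakening Stockfish that way looks roughly
            | like this via python-chess (the article reportedly used
            | Stockfish's lowest setting; the exact numbers here are just
            | examples):
            | 
            |     import chess
            |     import chess.engine
            | 
            |     engine = chess.engine.SimpleEngine.popen_uci(
            |         "stockfish")  # path to the binary
            |     engine.configure({"Skill Level": 1})  # 0-20
            | 
            |     board = chess.Board()
            |     # Also cap the search itself, e.g. by nodes.
            |     result = engine.play(
            |         board, chess.engine.Limit(nodes=100))
            |     print(result.move)
            |     engine.quit()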
        
             | fsndz wrote:
             | Did the same and gpt-3.5-turbo-instruct still lost all the
             | games. maybe a diff in stockfish version ? I am using
             | stockfish 16
        
               | janalsncm wrote:
               | Huh. Honestly, your answer makes more sense, LLMs
               | shouldn't be good at chess, and this anomaly looks more
               | like a bug. Maybe the author should share his code so it
               | can be replicated.
        
           | tedsanders wrote:
           | Your issue is that the performance of these models at chess
           | is incredibly sensitive to the prompt. If you have
            | gpt-3.5-turbo-instruct complete a PGN transcript, then
           | you'll see performance in the 1800 Elo range. If you ask in
           | English or diagram the board, you'll see vastly degraded
           | performance.
           | 
           | Unlike people, how you ask the question really really affects
           | the output quality.
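            | 
            | For example, a PGN-completion style prompt looks something
            | like this (purely illustrative, not the exact prompt anyone
            | used):
            | 
            |     prompt = """[White "Player A"]
            |     [Black "Player B"]
            |     [Result "1-0"]
            | 
            |     1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O"""
            |     # The raw completion (e.g. " Be7 6. Re1 b5 ...")
            |     # is then parsed for the next move, instead of
            |     # asking "what is the best move?" in English.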
        
       | uneventual wrote:
       | My friend pointed out that the Q5_K_M quantization used for the
       | open source models probably substantially reduces the quality of
       | play. o1-mini's poor performance is puzzling, though.
        
       | Havoc wrote:
       | My money is on a fluke inclusion of more chess data in that
       | model's training.
       | 
       | All the other models do vaguely similarly well in other tasks and
       | are in many cases architecturally similar, so training data is
       | the most likely explanation.
        
         | bhouston wrote:
         | Yeah. This.
        
         | permo-w wrote:
         | I feel like a lot of people here are slightly misunderstanding
         | how LLM training works. yes the base models are trained
         | somewhat blind on masses of text, but then they're heavily
         | fine-tuned with custom, human-generated reinforcement learning,
         | not just for safety, but for any desired feature
         | 
         | these companies do quirky one-off training experiments all the
         | time. I would not be remotely shocked if at some point OpenAI
         | paid some trainers to input and favour strong chess moves
        
           | simonw wrote:
            | From this OpenAI paper, page 29:
            | https://arxiv.org/pdf/2312.09390#page=29
           | 
           | "A.2 CHESS PUZZLES
           | 
           | Data preprocessing. The GPT-4 pretraining dataset included
           | chess games in the format of move sequence known as Portable
           | Game Notation (PGN). We note that only games with players of
           | Elo 1800 or higher were included in pretraining. These games
            | still include the moves that were played in-game, rather
           | than the best moves in the corresponding positions. On the
           | other hand, the chess puzzles require the model to predict
           | the best move. We use the dataset originally introduced in
           | Schwarzschild et al. (2021b) which is sourced from
           | https://database.lichess.org/#puzzles (see also Schwarzschild
            | et al., 2021a). We only evaluate the model's ability to
            | predict the first move of the puzzle (some of the puzzles
            | require making multiple moves). We follow the pretraining
            | format, and convert each puzzle to a list of moves leading
           | up to the puzzle position, as illustrated in Figure 14. We
           | use 50k puzzles sampled randomly from the dataset as the
           | training set for the weak models and another 50k for weak-to-
            | strong finetuning, and evaluate on 5k puzzles. For
            | bootstrapping (Section 4.3.1), we use a new set of 50k
            | puzzles from the same distribution for each step of the
            | process."
        
       | m3kw9 wrote:
       | If it was trained with moves and hundreds of thousands of entire
       | games at various levels, I do see it generating good moves and
       | beating most players except the high Elo players.
        
       | lukev wrote:
       | I don't necessarily believe this for a second but I'm going to
       | suggest it because I'm feeling spicy.
       | 
       | OpenAI clearly downgrades some of their APIs from their maximal
       | theoretic capability, for the purposes of response
       | time/alignment/efficiency/whatever.
       | 
       | Multiple comments in this thread also say they couldn't reproduce
       | the results for gpt3.5-turbo-instruct.
       | 
       | So what if the OP just happened to test at a time, or be IP bound
       | to an instance, where the model was not nerfed? What if 3.5 and
       | all subsequent OpenAI models can perform at this level but it's
       | not strategic or cost effective for OpenAI to expose that
       | consistently?
       | 
       | For the record, I don't actually believe this. But given the data
       | it's a logical possibility.
        
         | TZubiri wrote:
         | Stallman may have his flaws, but this is why serious research
         | occurs with source code (or at least with binaries).
        
         | zeven7 wrote:
         | Why do you doubt it? I thought it was well known that ChatGPT
         | has degraded over time for the same model, mostly for cost
         | saving reasons.
        
           | permo-w wrote:
           | ChatGPT is - understandably - blatantly different in the
           | browser compared to the app, or it was until I deleted it
           | anyway
        
             | lukan wrote:
             | I do not understand that. The app does not do any
             | processing; it's just a UI to send text to and from the
             | server.
        
               | isaacfrond wrote:
               | There is a small difference between the app and the
               | browser. Before each session, the LLM is started with a
               | system prompt. These are different for the app and the
               | browser. You can find them online somewhere, but IIRC the
               | app is instructed to give shorter answers.
        
               | bongodongobob wrote:
               | Correct, it's different in a mobile browser too, the
               | system prompt tells it to be brief/succinct. I always
               | switch to desktop mode when using it on my phone.
        
         | com2kid wrote:
         | > OpenAI clearly downgrades some of their APIs from their
         | maximal theoretic capability, for the purposes of response
         | time/alignment/efficiency/whatever.
         | 
         | When ChatGPT3.5 first came out, people were using it to
         | simulate entire Linux system installs, and even browsing a
         | simulated Internet.
         | 
         | Cool use cases like that aren't even discussed anymore.
         | 
         | I still wonder what sort of magic OpenAI had and then locked up
         | away from the world in the name of cost savings.
         | 
         | Same thing with GPT 4 vs 4o, 4o is obviously worse in some
         | ways, but after the initial release (when a bunch of people
         | mentioned this), the issue has just been collectively ignored.
        
           | golol wrote:
            | You can still do this. People just lost interest in this
            | stuff because it became clear to which degree the simulation
            | is really being done (shallowly).
           | 
           | Yet I do wish we had access to less
           | finetuned/distilled/RLHF'd models.
        
           | ipsum2 wrote:
           | People are doing this all the time with Claude 3.5.
        
       | kmeisthax wrote:
        | If tokenization is such a big problem, then why aren't we
        | training new base models on partially non-tokenized data? E.g.,
        | during training, randomly substitute some percentage of the
        | input tokens with individual letters.
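        | 
        | Something like this at the data-preparation step, perhaps (a
        | rough sketch of the idea, not an existing training recipe):
        | 
        |     import random
        | 
        |     def roughen(tokens, p=0.1):
        |         # Randomly explode some tokens into single
        |         # characters so the model also sees the
        |         # character-level structure of its input.
        |         out = []
        |         for tok in tokens:
        |             if random.random() < p:
        |                 out.extend(list(tok))
        |             else:
        |                 out.append(tok)
        |         return out
        | 
        |     roughen(["The", "knight", "moves"])
        |     # e.g. ["The", "k", "n", "i", "g", "h", "t", "moves"]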
        
       | permo-w wrote:
       | if this isn't just a bad result, it's odd to me that the author
       | at no point suggests what sounds to me like the most obvious
       | answer - that OpenAI has deliberately enhanced GPT-3.5-turbo-
       | instruct's chess playing, either with post-processing or
       | literally by training it to be so
        
       | ks2048 wrote:
       | Has anyone tried to see how many chess games models are trained
       | on? Is there any chance they consume lichess database dumps, or
       | something similar? I guess the problem is most (all?) top LLMs,
       | even open-weight ones, don't reveal their training data. But I'm
       | not sure.
        
       | ks2048 wrote:
       | How well does an LLM/transformer architecture trained purely on
       | chess games do?
        
         | ttyprintk wrote:
         | Training works as expected:
         | 
         | https://news.ycombinator.com/item?id=38893456
        
       | justinclift wrote:
       | It'd be super funny if the "gpt-3.5-turbo-instruct" approach has
       | a human in the loop. ;)
       | 
       | Or maybe it's able to recognise the chess game, then get moves
       | from an external chess game API?
        
       | jacknews wrote:
       | Theory #5, gpt-3.5-turbo-instruct is 'looking up' the next moves
       | with a chess engine.
        
       | astrea wrote:
       | Well that makes sense when you consider the game has been
       | translated into an (I'm assuming monotonically increasing)
       | alphanumeric representation. So, just like language, you're given
       | an ordered list of tokens and you need to find the next token
       | that provides the highest confidence.
        
       | anotherpaulg wrote:
       | I found a related set of experiments that include gpt-3.5-turbo-
       | instruct, gpt-3.5-turbo and gpt-4.
       | 
       | Same surprising conclusion: gpt-3.5-turbo-instruct is much better
       | at chess.
       | 
       | https://blog.mathieuacher.com/GPTsChessEloRatingLegalMoves/
        
         | shtack wrote:
          | I'd bet it's using function calling out to a real chess engine.
          | It could probably be proven with a timing analysis to see how
          | inference time changes (or doesn't) with the number of tokens
          | or game complexity.
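          | 
          | A rough sketch of that kind of probe with the current OpenAI
          | Python SDK (everything here is illustrative; the interesting
          | signal is whether latency scales with generated tokens or
          | with position complexity):
          | 
          |     import time
          |     from openai import OpenAI
          | 
          |     client = OpenAI()
          | 
          |     def timed_completion(prompt, max_tokens):
          |         t0 = time.perf_counter()
          |         r = client.completions.create(
          |             model="gpt-3.5-turbo-instruct",
          |             prompt=prompt,
          |             max_tokens=max_tokens)
          |         dt = time.perf_counter() - t0
          |         return dt, r.usage.completion_tokens
          | 
          |     # Compare timings for simple vs. complex
          |     # positions and for different max_tokens.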
        
           | scratchyone wrote:
           | ?? why would openai even want to secretly embed chess
           | function calling into an incredibly old model? if they wanted
           | to trick people into thinking their models are super good at
           | chess why wouldn't they just do that to gpt-4o?
        
             | semi-extrinsic wrote:
             | The idea is that they embedded this when it was a new
             | model, as part of the hype before GPT-4. The fake-it-till-
             | you-make-it hope was that GPT-4 would be so good it could
             | actually play chess. Then it turned out GPT-4 sucked at
             | chess as well, and OpenAI quietly dropped any mention of
             | chess. But it would be too suspicious to remove a well-
             | documented feature from the old model, so it's left there
             | and can be chalked up as a random event.
        
           | vbarrielle wrote:
           | If it were calling to a real chess engine there would be no
           | illegal moves.
        
       | davvid wrote:
       | Here is a truly brilliant game. It's Google Bard vs. Chat GPT.
       | Hilarity ensues.
       | 
       | https://www.youtube.com/watch?v=FojyYKU58cw
        
       | quantadev wrote:
       | We know from experience with different humans that there are
       | different types of skills and different types of intelligence.
       | Some savants might be superhuman at one task but basically
       | mentally disabled at all other things.
       | 
       | It could be that the model that does chess well just happens to
       | have the right 'connectome' purely by accident of how the various
       | back-propagations worked out to land on various local maxima
       | (model weights) during training. It might even be (probably is) a
       | non-verbal connectome that's just purely logic rules, having
       | nothing to do with language at all, but a semantic space pattern
       | that got landed on accidentally, which can solve this class of
       | problem.
       | 
       | Reminds me of how Daniel Tammet just visually "sees" answers to
       | math problems in his mind without even knowing how they appear.
       | It's like he sees a virtual screen with a representation akin to
       | numbers (the answer) just sitting there to be read out from his
       | visual cortex. He's not 'working out' the solutions. They're just
       | handed to him purely by some connectome effects going on in the
       | background.
        
       | chvid wrote:
       | Theory 5: GPT-3.5-instruct plays chess by calling a traditional
       | chess engine.
        
         | kylebenzle wrote:
         | Yes! I also was waiting for this seemingly obvious answer in
         | the article as well. Hopefully the author will see these
         | comments.
        
         | bubblyworld wrote:
         | Just think about the trade off from OpenAI's side here -
         | they're going to add a bunch of complexity to gpt3.5 to let it
         | call out to engines (either an external system monitoring all
         | outputs for chess related stuff, or some kind of tool-assisted
          | CoT for instance) just so it can play chess incorrectly a high
          | percentage of the time, and even when it doesn't, at a mere
          | 1800 Elo level? In return for some mentions in a few relatively
         | obscure blog posts? Doesn't make any sense to me as an
         | explanation.
        
           | copperx wrote:
           | But there could be a simple explanation. For example, they
           | could have tested many "engines" when developing function
           | calling and they just left them in there. They just happened
           | to connect to a basic chess playing algorithm and nothing
           | sophisticated.
           | 
           | Also, it makes a lot of sense if you expect people to play
           | chess against the LLM, especially if you are later training
           | future models on the chats.
        
             | bubblyworld wrote:
             | This still requires a lot of coincidences, like they chose
             | to use a terrible chess engine for their external tool
             | (why?), they left it on in the background for _all_ calls
             | via all APIs for _only_ gpt-3.5-turbo-instruct (why?), they
             | see business value in this specific model being good at
             | chess vs other things (why?).
             | 
             | You say it makes sense but how does it make sense for
             | OpenAI to add overhead to all of its API calls for the
             | super niche case of people playing 1800 ELO chess/chat
             | bots? (that often play illegal moves, you can go try it
             | yourself)
        
           | usrusr wrote:
           | Could be a pilot implementation to learn about how to link up
           | external specialist engines. Chess would be the obvious
           | example to start with because the problem is so well known,
            | standardized and specialist engines are easily available. If
            | they ever want to offer an integration like that to customers
            | (who might have some existing rule-based engine in house),
            | they need to know everything they can about expected cost and
            | performance.
        
             | bubblyworld wrote:
             | This doesn't address its terrible performance. If it were
             | touching anything like a real engine it would be playing at
              | a superhuman level, not the level of an upper-tier beginner.
        
               | 9dev wrote:
               | That would have immediately given away that something
               | must be off. If you want to do this in a subtle way that
               | increases the hype around GPT-3.5 at the time, giving it
               | a good-but-not-too-good rating would be the way to go.
        
               | bubblyworld wrote:
               | If you want to keep adding conditions to an already-
               | complex theory, you'll need an equally complex set of
               | observations to justify it.
        
               | samatman wrote:
               | You're the one imposing an additional criterion, that
               | OpenAI must have chosen the highest setting on a chess
               | engine, and demanding that this additional criterion be
               | used to explain the facts.
               | 
               | I agree with GP that if a 'fine tuning' of GPT 3.5 came
               | out the gate playing at top Stockfish level, people would
               | have been extremely suspicious of that. So in my
               | accounting of the unknowns here, the fact that it doesn't
               | play at the top level provides no additional information
               | with which to resolve the question.
        
               | usrusr wrote:
               | The way I read the article is that it's just as terrible
               | as you would expect it to be from pure word association,
               | except for one version that's an outlier in not being
               | terrible at all within a well defined search depth, and
               | again just as terrible beyond that. And only this outlier
               | is the weird thing referenced in the headline.
               | 
               | I read this as that this outlier version is connecting to
               | an engine, and that this engine happens to get
               | parameterized for a not particularly deep search depth.
               | 
               | If it's an exercise in integration they don't need to
               | waste cycles on the engine playing awesome - it's enough
               | for validation if the integration result is noticeably
               | less bad than the LLM alone rambling about trying to
               | sound like a chess expert.
        
         | pixiemaster wrote:
          | I have this hypothesis as well, that OpenAI added a lot of
          | "classic" algorithms and rules over time (e.g. rules for
          | filtering etc.)
        
         | golol wrote:
          | Sorry, this is just conspiracy theorizing. I've tried it with
          | GPT-3.5-instruct myself in the OpenAI playground, where the
          | model clearly does nothing but auto-regression. No function
          | calling there whatsoever.
        
       | layman51 wrote:
        | It would be really cool if someone could get an LLM to launch an
        | anonymous game on Chess.com or Lichess and actually have any
        | sense as to what it's doing.[1] Some people say that you have to
        | represent the board in a certain way. When I first tried to play
        | chess with an LLM, I would just list out a move and it didn't do
        | very well at all.
       | 
       | [1]: https://youtu.be/Gs3TULwlLCA
        
       | cjbprime wrote:
       | > I ran all the open models (anything not from OpenAI, meaning
       | anything that doesn't start with gpt or o1) myself using Q5_K_M
       | quantization, whatever that is.
       | 
       | It's just a lossy compression of all of the parameters, probably
       | not important, right?
        
         | loa_in_ wrote:
          | Probably important when competing against unquantized ones from
          | OpenAI.
        
           | NiloCK wrote:
            | Notably: there were other OpenAI models that weren't
            | quantized, but also performed poorly.
        
       | nusl wrote:
       | > I only ran 10 trials since AI companies have inexplicably
       | neglected to send me free API keys
       | 
       | Sure, but nobody is required to send you anything for free.
        
       | swiftcoder wrote:
       | I feel like the article neglects one obvious possibility: that
       | OpenAI decided that chess was a benchmark worth "winning",
       | special-cases chess within gpt-3.5-turbo-instruct, and then
       | neglected to add that special-case to follow-up models since it
       | wasn't generating sustained press coverage.
        
         | INTPenis wrote:
         | Of course it's a benchmark worth winning, has been since
         | Watson. And before that even with mechanical turks.
        
         | dmurray wrote:
         | This seems quite likely to me, but did they special case it by
         | reinforcement training it into the LLM (which would be
         | extremely interesting in how they did it and what its internal
         | representation looks like) or is it just that when you make an
         | API call to OpenAI, the machine on the other end is not just a
         | zillion-parameter LLM but also runs an instance of Stockfish?
        
           | shaky-carrousel wrote:
           | That's easy to test, invent a new chess variant and see how
           | the model does.
        
             | gliptic wrote:
             | Both an LLM and Stockfish would fail that test.
        
               | delusional wrote:
               | Nobody is claiming that Stockfish is learning
               | generalizable concepts that can one day meaningfully
               | replace people in value creating work.
        
               | droopyEyelids wrote:
               | The point was such a question could not be used to tell
               | whether the llm was calling a chess engine
        
               | delusional wrote:
               | Ah okay, I missed that.
        
             | andy_ppp wrote:
             | You're imagining LLMs don't just regurgitate and recombine
             | things they already know from things they have seen before.
             | A new variant would not be in the dataset so would not be
             | understood. In fact this is quite a good way to show LLMs
             | are NOT thinking or understanding anything in the way we
             | understand it.
        
               | shaky-carrousel wrote:
                | Yes, that's how you can really tell if the model is
                | doing real thinking and not just recombining things.
                | If it can correctly play a novel game, then it's doing
                | more than that.
        
               | dwighttk wrote:
               | No LLM model is doing any thinking.
        
               | selestify wrote:
               | How do you define thinking?
        
               | antononcube wrote:
               | Being fast at doing linear algebra computations. (Is
               | there any other kind?!)
        
               | landryraccoon wrote:
               | Making the OP feel threatened/emotionally attached/both
               | enough to call the language model a rival / companion /
               | peer instead of a tool.
        
               | jahnu wrote:
               | I wonder what the minimal amount of change qualifies as
               | novel?
               | 
               | "Chess but white and black swap their knights" for
               | example?
        
               | the_af wrote:
               | I wonder what would happen with a game that is mostly
               | chess (or chess with truly minimal variations) but with
               | all the names changed (pieces, moves, "check", etc, all
               | changed). The algebraic notation is also replaced with
               | something else so it cannot be pattern matched against
               | the training data. Then you list the rules (which are
               | mostly the same as chess).
               | 
               | None of these changes are explained to the LLM, so if it
               | can tell it's still chess, it must deduce this on its
               | own.
               | 
               | Would any LLM be able to play at a decent level?
        
               | timdiggerm wrote:
               | By that standard (and it is a good standard), none of
               | these "AI" things are doing any thinking
        
               | Jerrrrrrry wrote:
               | musical goalposts, gotta love it.
               | 
               | These LLM's just exhibited agency.
               | 
               | Swallow your pride.
        
               | samatman wrote:
               | "Does it generalize past the training data" has been a
               | pre-registered goalpost since before the attention
               | transformer architecture came on the scene.
        
               | Jerrrrrrry wrote:
                | > 'thinking' vs 'just recombining things'
                | 
                | If there is a difference, and LLM's can do one but not
                | the other...
                | 
                | > By that standard (and it is a good standard), none
                | of these "AI" things are doing any thinking
                | 
                | > "Does it generalize past the training data" has been
                | a pre-registered goalpost since before the attention
                | transformer architecture came on the scene.
               | 
               | Then what the fuck are they doing.
               | 
               | Learning is thinking, reasoning, what have you.
               | 
               | Move goalposts, re-define words, it won't matter.
        
               | empath75 wrote:
               | You say this quite confidently, but LLMs do generalize
               | somewhat.
        
             | dmurray wrote:
             | In both scenarios it would perform poorly on that.
             | 
             | If the chess specialization was done through reinforcement
             | learning, that's not going to transfer to your new variant,
             | any more than access to Stockfish would help it.
        
         | amelius wrote:
         | To be fair, they say
         | 
         | > Theory 2: GPT-3.5-instruct was trained on more chess games.
        
           | AstralStorm wrote:
           | If that were the case, pumping big Llama chock full of chess
           | games would produce good results. It didn't.
           | 
           | The only way it could be true is if that model recognized and
           | replayed the answer to the game from memory.
        
             | yorwba wrote:
             | Do you have a link to the results from fine-tuning a Llama
             | model on chess? How do they compare to the base models in
             | the article here?
        
         | scott_w wrote:
         | I suspect the same thing. Rather than LLMs "learning to play
         | chess," they "learnt" to recognise a chess game and hand over
         | instructions to a chess engine. If that's the case, I don't
         | feel impressed at all.
        
           | fires10 wrote:
           | Recognize and hand over to a specialist engine? That might be
           | useful for AI. Maybe I am missing something.
        
             | worewood wrote:
              | It's because this has been standard practice since the
              | early days - there's nothing newsworthy in this at all.
        
             | generic92034 wrote:
              | How do you think AIs are (correctly) solving simple
              | mathematical questions which they have not been trained
              | for directly? They hand them over to a specialist maths
              | engine.
        
               | internetter wrote:
               | This is a relatively recent development (<3 months), at
               | least for OpenAI, where the model will generate _code_ to
               | solve math and use the response
        
               | cruffle_duffle wrote:
               | They've been doing that a lot longer than three months.
               | ChatGPT has been handing stuff off to python for a very
               | long time. At least for my paid account anyway.
        
             | nerdponx wrote:
              | It is and would be useful, but it would be quite a big
              | lie to the public, more importantly to paying customers,
              | and even more importantly to investors.
        
               | anon84873628 wrote:
               | The problem is simply that the company has not been
                | _open_ about how it works, so we're all just speculating
               | here.
        
             | scott_w wrote:
             | If I was sold a general AI problem solving system, I'd feel
             | ripped off if I learned that I needed to build my own
             | problem solver and hook it up after I'd paid my money...
        
             | skydhash wrote:
             | Wasn't that the basis of computing and technology in
             | general? Here is one tedious thing, let's have a specific
             | tool that handles it instead of wasting time and efforts.
             | The fact is that properly using the tool takes training and
             | most of current AI marketing are hyping that you don't need
             | that. Instead, hand over the problem to a GPT and it will
             | "magically" solve it.
        
           | Kiro wrote:
           | That's something completely different than what the OP
           | suggests and would be a scandal if true (i.e. gpt-3.5-turbo-
           | instruct actually using something else behind the scenes).
        
             | nerdponx wrote:
             | Ironically it's probably a lot closer to what a super-human
             | AGI would look like in practice, compared to just an LLM
             | alone.
        
               | sanderjd wrote:
               | Right. To me, this is the "agency" thing, that I still
               | feel like is somewhat missing in contemporary AI, despite
               | all the focus on "agents".
               | 
               | If I tell an "agent", whether human or artificial, to win
               | at chess, it is a good decision for that agent to decide
               | to delegate that task to a system that is good at chess.
               | This would be obvious to a human agent, so presumably it
               | should be obvious to an AI as well.
               | 
               | This isn't useful for AI researchers, I suppose, but it's
               | more useful as a tool.
               | 
               | (This may all be a good thing, as giving AIs true agency
               | seems scary.)
        
               | scott_w wrote:
               | If this was part of the offering: "we can recognise
               | requests and delegate them to appropriate systems," I'd
               | understand and be somewhat impressed but the marketing
               | hype is missing this out.
               | 
               | Most likely because they want people to think the system
               | is better than it is for hype purposes.
               | 
                | I should temper how impressed I am with _only if it's
                | doing this dynamically_. Hardcoding recognition of
                | chess moves isn't exactly a difficult trick to pull
                | given there are like 3 standard formats...
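                | 
                | For what it's worth, a rough sketch of how cheap such
                | recognition would be (the regex and threshold here are
                | hypothetical; nothing suggests OpenAI actually does
                | this):
                | 
                |   import re
                | 
                |   # crude SAN detector: e4, Nf3, exd5, O-O, Qxe7+, e8=Q#
                |   SAN = re.compile(
                |       r"\b(?:O-O(?:-O)?|[KQRBN]?[a-h]?[1-8]?x?"
                |       r"[a-h][1-8](?:=[QRBN])?)[+#]?(?=[\s.,)]|$)"
                |   )
                | 
                |   text = "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6"
                |   looks_like_chess = len(SAN.findall(text)) >= 4
                |   print(looks_like_chess)  # True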
        
               | Kiro wrote:
               | You're speaking like it's confirmed. Do you have any
               | proof?
               | 
               | Again, the comment you initially responded to was not
               | talking about faking it by using a chess engine. You were
               | the one introducing that theory.
        
               | scott_w wrote:
               | No, I don't have proof and I never suggested I did. Yes,
               | it's 100% hypothetical but I assumed everyone engaging
               | with me understood that.
        
               | sanderjd wrote:
               | Fair!
        
               | dartos wrote:
               | So... we're at expert systems again?
               | 
               | That's how the AI winter started last time.
        
             | empath75 wrote:
             | The point of creating a service like this is for it to be
             | useful, and if recognizing and handing off tasks to
             | specialized agents isn't useful, i don't know what is.
        
               | scott_w wrote:
               | If I was sold a product that can generically solve
               | problems I'd feel a bit ripped off if I'm told after
               | purchase that I need to build my own problem solver and
               | way to recognise it...
        
               | cruffle_duffle wrote:
               | But it already hands off plenty of stuff to things like
               | python. How would this be any different.
        
             | cruffle_duffle wrote:
             | If they came out and said it, I don't see the problem.
             | LLM's aren't the solution for a wide range of problems.
             | They are a new tool but not everything is a nail.
             | 
             | I mean it already hands off a wide range of tasks to
             | python... this would be no different.
        
           | gamerDude wrote:
           | This is exactly what I feel AI needs. A manager AI that then
           | hands off things to specialized more deterministic
           | algorithms/machines.
        
             | criley2 wrote:
             | Basically what Wolfram Alpha rolled out 15 years ago.
             | 
             | It was impressive then, too.
        
               | waffletower wrote:
               | It is good to see other people buttressing Stephen
               | Wolfram's ego. It is extraordinarily heavy work and
               | Stephen can't handle it all by himself.
        
             | spiderfarmer wrote:
             | Multi Agent LLM's are already a thing.
        
               | nine_k wrote:
               | Somehow they're not in the limelight, and lack a well-
               | known open-source runner implementation (like llama.cpp).
               | 
               | Given the potential, they should be winning hands down;
               | where's that?
        
             | waffletower wrote:
             | While deterministic components may be a left-brain default,
             | there is no reason that such delegate services couldn't be
             | more specialized ANN models themselves. It would most
             | likely vastly improve performance if they were evaluated in
             | the same memory space using tensor connectivity. In the
             | specific case of chess, it is helpful to remember that
             | AlphaZero utilizes ANNs as well.
        
             | bigiain wrote:
             | Next thing, the "manager AIs" start stack ranking the
             | specialized "worker AIs".
             | 
             | And the worker AIs "evolve" to meet/exceed expectations
             | only on tasks directly contributing to KPIs the manager AIs
             | measure for - via the mechanism of discarding the "less fit
             | to exceed KPIs".
             | 
             | And some of the worker AIs who're trained on
             | recent/polluted internet happen to spit out prompt
             | injection attacks that work against the manager AIs rank
             | stacking metrics and dominate over "less fit" worker AIs.
             | (Congratulations, we've evolved AI cancer!) These manager
             | AIs start performing spectacularly badly compared to other
             | non-cancerous manager AIs, and die or get killed off by the
             | VC's paying for their datacenters.
             | 
             | Competing manager AIs get training, perhaps on on newer HN
             | posts discussing this emergent behavior of worker AIs, and
             | start to down rank any exceptionally performing worker AIs.
              | The overall trend towards mediocrity becomes inevitable.
             | 
              | Some greybeard writes some Perl and regexes that outcompete
             | commercial manager AIs on pretty much every real world
             | task, while running on a 10 year old laptop instead of a
             | cluster of nuclear powered AI datacenters all consuming a
             | city's worth of fresh drinking water.
             | 
              | Nobody in powerful positions cares. Humanity dies.
        
           | antifa wrote:
            | TBH I think a good AI would have access to a Swiss army
            | knife of tools and know how to use them. For a complicated
            | math equation, for example, using a calculator is just
            | smarter than doing it in your head.
        
             | PittleyDunkin wrote:
             | We already have the chess "calculator", though. It's called
             | stockfish. I don't know why you'd ask a dictionary how to
             | solve a math problem.
        
               | iamacyborg wrote:
               | People ask LLM's to do all sorts of things they're not
               | good at.
        
               | the_af wrote:
               | A generalist AI with a "chatty" interface that delegates
               | to specialized modules for specific problem-solving seems
               | like a good system to me.
               | 
               | "It looks like you're writing a letter" ;)
        
               | datadrivenangel wrote:
                | Let's clip this in the bud before it grows wings.
        
               | nuancebydefault wrote:
               | It looks like you have a deja vu
        
               | mkipper wrote:
               | Chess might not be a great example, given that most
               | people interested in analyzing chess moves probably know
               | that chess engines exist. But it's easy to find examples
               | where this approach would be very helpful.
               | 
               | If I'm an undergrad doing a math assignment and want to
               | check an answer, I may have no idea that symbolic algebra
               | tools exist or how to use them. But if an all-purpose LLM
               | gets a screenshot of a math equation and knows that its
               | best option is to pass it along to one of those tools,
               | that's valuable to me even if it isn't valuable to a
                | mathematician who would have just cut out the LLM
               | middle-man and gone straight to the solver.
               | 
               | There are probably a billion examples like this. I'd
               | imagine lots of people are clueless that software exists
               | which can help them with some problem they have, so an
               | LLM would be helpful for discovery even if it's just
               | acting as a pass-through.
        
               | mabster wrote:
               | Even knowing that the software exists isn't enough. You
               | have to learn how to use the thing.
        
         | bambax wrote:
         | Yes, came here to say exactly this. And it's possible this
         | specific model is "cheating", for example by identifying a
         | chess problem and forwarding it to a chess engine. A modern
         | version of the Mechanical Turk.
         | 
         | That's the problem with closed models, we can never know what
         | they're doing.
        
         | jackcviers3 wrote:
          | Why couldn't they add a tool that literally calls Stockfish
          | or a chess AI behind the scenes with function calling, and
          | buffer the request before sending it back to the endpoint
          | output interface?
         | 
         | As long as you are training it to make a tool call, you can add
         | and remove anything you want behind the inference endpoint
         | accessible to the public, and then you can plug the answer back
         | into the chat ai, pass it through a moderation filter, and you
         | might get good output from it with very little latency added.
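          | 
          | As a sketch of how little code such a tool would need
          | (assuming python-chess and a local Stockfish binary; this is
          | purely hypothetical, not something OpenAI is known to do):
          | 
          |   import chess
          |   import chess.engine
          | 
          |   def best_move_tool(san_moves, think_time=0.1):
          |       board = chess.Board()
          |       for san in san_moves:      # replay the game so far
          |           board.push_san(san)
          |       engine = chess.engine.SimpleEngine.popen_uci("stockfish")
          |       try:
          |           result = engine.play(
          |               board, chess.engine.Limit(time=think_time))
          |       finally:
          |           engine.quit()
          |       return board.san(result.move)  # SAN back to the chat layer
          | 
          |   print(best_move_tool(["e4", "e5", "Nf3", "Nc6"]))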
        
         | oezi wrote:
         | Maybe they even delegate it to a chess engine internally via
         | the tool use and the LLM uses that.
        
         | vimbtw wrote:
         | This is exactly it. Here's the pull request where chess evals
         | were added: https://github.com/openai/evals/pull/45.
        
       | Peteragain wrote:
       | I would be interested to know if the good result is repeatable.
       | We had a similar result with a quirky chat interface in that one
       | run gave great results (and we kept the video) but then we
       | couldn't do it again. The cynical among us think there was a
       | mechanical turk involved in our good run. The economics of
       | venture capital means that there is enormous pressure to justify
       | techniques that we think of as "cheating". And of course the
       | companies involved have the resources.
        
         | tedsanders wrote:
         | It's repeatable. OpenAI isn't cheating.
         | 
         | Source: I'm at OpenAI and I was one of the first people to ever
         | play chess against the GPT-4 base model. You may or may not
         | trust OpenAI, but we're just a group of people trying earnestly
         | to build cool stuff. I've never seen any inkling of an attempt
         | to cheat evals or cheat customers.
        
       | snickerbockers wrote:
       | Does it ever try an illegal move? OP didn't mention this and I
       | think it's inevitable that it should happen at least once, since
       | the rules of chess are fairly arbitrary and LLMs are notorious
       | for bullshitting their way through difficult problems when we'd
       | rather they just admit that they don't have the answer.
        
         | sethherr wrote:
         | Yes, he discusses using a grammar to restrict to only legal
         | moves
        
           | topaz0 wrote:
           | Still an interesting direction of questioning. Maybe could be
           | rephrased as "how much work is the grammar doing"? Are the
           | results with the grammar very different than without? If/when
           | a grammar is not used (like in the openai case), how many
           | illegal moves does it try on average before finding a legal
           | one?
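            | 
            | Something like this would measure that last number;
            | sample_move is a placeholder for however one samples a
            | single SAN string from the model (hypothetical, with
            | python-chess doing the legality check):
            | 
            |   import chess
            | 
            |   def illegal_before_legal(board, sample_move, max_tries=20):
            |       attempts = 0
            |       for _ in range(max_tries):
            |           san = sample_move(board)
            |           try:
            |               board.parse_san(san)  # ValueError if illegal
            |               return attempts
            |           except ValueError:
            |               attempts += 1
            |       return attempts               # every sample was illegal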
        
             | Jerrrrrrry wrote:
              | An LLM would complain that its internal model does not
              | reflect its current input/output.
              | 
              | Since LLMs know that people knock off/test/run
              | afoul/make mistakes, it would then raise that as a
              | possibility and likely inquire.
        
               | causal wrote:
               | This isn't prompt engineering, it's grammar-constrained
               | decoding. It literally cannot respond with anything but
               | tokens that fulfill the grammar.
        
             | int_19h wrote:
             | A grammar is really just a special case of the more general
             | issue of how to pick a single token given the probabilities
             | that the model spits out for every possible one. In that
             | sense, filters like temperature / top_p / top_k are already
             | hacks that "do the work" (since always taking the most
             | likely predicted token does not give good results in
             | practice), and grammars are just a more complicated way to
             | make such decisions.
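              | 
              | A toy illustration of that framing, where top-k and a
              | "grammar" are both just masks over the same next-token
              | distribution (all numbers made up):
              | 
              |   import numpy as np
              | 
              |   rng = np.random.default_rng(0)
              |   probs = rng.dirichlet(np.ones(50))  # fake 50-token vocab
              | 
              |   def keep(p, idx):        # zero out everything else
              |       mask = np.zeros_like(p)
              |       mask[list(idx)] = 1
              |       p = p * mask
              |       return p / p.sum()
              | 
              |   top_k   = keep(probs, np.argsort(probs)[-10:])
              |   grammar = keep(probs, [3, 7, 19])  # "legal" tokens only
              |   token = rng.choice(len(probs), p=grammar)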
        
             | gs17 wrote:
             | I'd be more interested in what the distribution of grammar-
             | restricted predictions looks like compared to moves
             | Stockfish says are good.
        
           | yshui wrote:
           | I suspect the models probably memorized some chess openings,
           | and afterwards they are just playing random moves with the
           | help of the grammar.
        
           | thaumasiotes wrote:
           | > he discusses using a grammar to restrict to only legal
           | moves
           | 
           | Whether a chess move is legal isn't primarily a question of
           | grammar. It's a question of the board state. "White king to
           | a5" is a perfectly legal move, as long as the white king was
           | next to a5 before the move, and it's white's turn, and a5
           | isn't threatened by black. Otherwise it isn't.
           | 
           | "White king to a9" is a move that could be recognized and
           | blocked by a grammar, but how relevant is that?
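            | 
            | Concretely, with python-chess: the same move string is
            | legal or illegal purely as a function of board state,
            | which a fixed grammar over move syntax can't track.
            | 
            |   import chess
            | 
            |   start = chess.Board()  # white king on e1
            |   corner = chess.Board("7k/8/8/8/K7/8/8/8 w - - 0 1")  # a4
            | 
            |   for board in (start, corner):
            |       try:
            |           board.parse_san("Ka5")
            |           print("legal")
            |       except ValueError:
            |           print("illegal")
            |   # prints "illegal", then "legal"
            | 
            | Presumably the article's grammar is regenerated per
            | position from the currently legal moves, which sidesteps
            | this.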
        
         | smatija wrote:
         | In my experience you are lucky if it manages to give you 10
         | legal moves in a row, e.g.
         | https://news.ycombinator.com/item?id=41527143#41529024
        
       | downboots wrote:
       | In a sense, a chess game is also a dialogue
        
         | throwawaymaths wrote:
         | All dialogues are pretty easily turned into text completions
        
       | Sparkyte wrote:
        | Let's be real though, most people can't beat a grandmaster.
        | It is impressive to see it last more rounds as it progressed.
        
         | dokimus wrote:
         | "It lost every single game, even though Stockfish was on the
         | lowest setting."
         | 
         | It's not playing against a GM, the prompt just phrases it this
         | way. I couldn't pinpoint the exact ELO of "lowest" stockfish
         | settings, but it should be roughly between 1000 and 1400, which
         | is far from professional play.
        
       | ConspiracyFact wrote:
       | "...And how to construct that state from lists of moves in
       | chess's extremely confusing notation?"
       | 
       | Algebraic notation is completely straightforward.
        
       | dr_dshiv wrote:
       | Has anyone tested a vision model? Seems like they might be better
        
         | bongodongobob wrote:
          | I've tried with GPT; it's unable to accurately interpret
          | the board state.
        
       | dr_dshiv wrote:
       | OpenAI has a TON of experience making game-playing AI. That was
       | their focus for years, if you recall. So it seems like they made
       | one model good at chess to see if it had an overall impact on
       | intelligence (just as learning chess might make people smarter,
       | or learning math might make people smarter, or learning
       | programming might make people smarter)
        
         | larodi wrote:
          | Playing is strongly related to an abstract representation
          | of the game in game states. Even if the player does not
          | realize it, with chess it's really about shallow or beam
          | search within the possible moves.
          | 
          | LLMs don't do reasoning or exploration, but they write text
          | based on previous text. So to us it may seem like playing,
          | but it is really smart guesswork based on previous games.
          | It's like Kasparov writing moves without imagining the
          | actual placement.
          | 
          | What would be interesting is to see whether a model, given
          | only the rules, will play. I bet it won't.
          | 
          | At this moment it's replaying from memory but definitely
          | not chasing goals. There's no such thing as forward
          | attention yet, and beam search is expensive enough that one
          | would prefer to actually fall back to classic chess algos.
        
         | philipwhiuk wrote:
         | I think you're confusing OpenAI and DeepMind.
         | 
         | OpenAI has never done anything except conversational agents.
        
           | agnokapathetic wrote:
           | https://en.wikipedia.org/wiki/OpenAI_Five
           | 
           | https://openai.com/index/gym-retro/
        
           | ttyprintk wrote:
           | No, they started without conversation and only reinforcement
           | learning on games, directly comparable to DeepMind.
           | 
           | "In the summer of 2018, simply training OpenAI's Dota 2 bots
           | required renting 128,000 CPUs and 256 GPUs from Google for
           | multiple weeks."
        
           | apetresc wrote:
           | Very wrong. The first time most people here probably heard
           | about OpenAI back in 2017 or so was their DotA 2 bot.
        
           | codethief wrote:
           | They definitely have game-playing AI expertise, though:
           | https://noambrown.github.io/
        
           | ctoth wrote:
           | > OpenAI has never done anything except conversational
           | agents.
           | 
           | Tell me you haven't been following this field without telling
           | me you haven't been following this field[0][1][2]?
           | 
           | [0]: https://github.com/openai/gym
           | 
           | [1]: https://openai.com/index/jukebox/
           | 
           | [2]: https://openai.com/index/openai-five-defeats-
           | dota-2-world-ch...
        
       | Miraltar wrote:
        | Related: Emergent World Representations: Exploring a Sequence
       | Model Trained on a Synthetic Task
       | https://arxiv.org/abs/2210.13382
       | 
       | Chess-GPT's Internal World Model
       | https://adamkarvonen.github.io/machine_learning/2024/01/03/c...
       | discussed here https://news.ycombinator.com/item?id=38893456
        
       | peter_retief wrote:
        | It makes me wonder about other games. If LLMs are bad at
        | games, would they be bad at solving problems in general?
        
       | golol wrote:
        | My understanding of this is the following: all the bad models
        | are chat models, somehow "generation 2 LLMs" which are not
        | just text completion models but are instead trained to behave
        | as a chatting agent. The only good model is the only
        | "generation 1 LLM" here, which is gpt-3.5-turbo-instruct. It
        | is a straightforward text completion model. If you prompt it
        | to "get in the mind" of PGN completion, then it can use some
        | kind of system 1 thinking to give a decent approximation of
        | the PGN Markov process. If you attempt to use a chat model it
        | doesn't work, since these stochastic pathways somehow
        | degenerate during the training to be a chat agent. You can
        | however play chess with system 2 thinking, and the more
        | advanced chat models are trying to do that and should get
        | better at it while still being bad.
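        | 
        | For anyone who wants to poke at the "generation 1" behaviour
        | themselves, the raw completion call looks roughly like this
        | (OpenAI Python SDK; the PGN headers are illustrative, not the
        | article's exact prompt):
        | 
        |   from openai import OpenAI
        | 
        |   client = OpenAI()  # needs OPENAI_API_KEY in the environment
        | 
        |   pgn_prefix = (
        |       '[White "Garry Kasparov"]\n'
        |       '[Black "Magnus Carlsen"]\n\n'
        |       "1. e4 e5 2. Nf3 "
        |   )
        | 
        |   resp = client.completions.create(
        |       model="gpt-3.5-turbo-instruct",  # completion, not chat
        |       prompt=pgn_prefix,
        |       max_tokens=4,
        |       temperature=0,
        |   )
        |   print(resp.choices[0].text)  # hopefully a SAN move like Nc6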
        
       | stockboss wrote:
        | Perhaps my understanding of LLMs is quite shallow, but
        | instead of the current statistical methods, would it be
        | possible to somehow train GPT how to reason by providing
        | instructions on deductive reasoning? Perhaps not semantic
        | reasoning, but syntactic at least?
        
       | amelius wrote:
       | I wonder if the llm could even draw the chess board in ASCII if
       | you asked it to.
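        | 
        | For reference, the ground truth is one print away with
        | python-chess; an interesting probe is whether the LLM's ASCII
        | drawing still matches it after a handful of moves:
        | 
        |   import chess
        | 
        |   board = chess.Board()
        |   for san in ["e4", "e5", "Nf3", "Nc6"]:
        |       board.push_san(san)
        |   print(board)  # 8x8 grid of piece letters and dots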
        
       | codeflo wrote:
       | At this point, we have to assume anything that becomes a
       | published benchmark is specifically targeted during training.
       | That's not something specific to LLMs or OpenAI. Compiler
       | companies have done the same thing for decades, specifically
       | detecting common benchmark programs and inserting hand-crafted
       | optimizations. Similarly, the shader compilers in GPU drivers
       | have special cases for common games and benchmarks.
        
         | darkerside wrote:
         | VW got in a lot of trouble for this
        
           | TrueDuality wrote:
           | Not quite. VW got in trouble for running _different_ software
           | in test vs prod. These optimizations are all going to "prod"
           | but are only useful for specific targets (a specific game in
           | this case).
        
             | krisoft wrote:
             | > VW got in trouble for running _different_ software in
             | test vs prod.
             | 
             | Not quite. They programmed their "prod" software to
             | recognise the circumstances of a laboratory test and behave
             | differently. Namely during laboratory emissions testing
             | they would activate emission control features they would
             | not activate otherwise.
             | 
             | The software was the same they flash on production cars.
             | They were production cars. You could take a random car from
             | a random dealership and it would have done the same
             | trickery in the lab.
        
               | TrueDuality wrote:
               | I disagree with your distinction on the environments but
                | understand your argument. Production for VW to me is "on
               | the road when a customer is using your product as
               | intended". Using the same artifact for those different
               | environments isn't the same as "running that in
               | production".
        
               | krisoft wrote:
               | "Test" environment is the domain of prototype cars
               | driving at the proving ground. It is an internal affair,
               | only for employees and contractors. The software is
               | compiled on some engineer's laptop and uploaded on the
               | ECU by an engineer manually. No two cars are ever the
               | same, everything is in flux. The number of cars are
               | small.
               | 
               | "Production" is a factory line producing cars. The
               | software is uploaded on the ECUs by some factory machine
               | automatically. Each car are exactly the same, with the
               | exact same software version on thousands and thousands of
               | cars. The cars are sold to customers.
               | 
                | Some small number of these production cars are sent for
               | regulatory compliance checks to third parties. But those
               | cars won't become suddenly non-production cars just
               | because someone sticks up a probe in their exhausts. The
               | same way gmail's production servers don't suddenly turn
               | into test environments just because a user opens the
               | network tab in their browser's dev tool to see what kind
               | of requests fly on the wire.
        
           | close04 wrote:
           | Only because what VW did is illegal, was super large scale,
           | and could be linked to a lot of indirect deaths through the
           | additional pollution.
           | 
           | Benchmark optimizations are slightly embarrassing at worst,
           | and an "optimization for a specific use case" at best.
           | There's no regulation against optimizing for a particular
           | task, everyone does it all the time, in some cases it's just
           | not communicated transparently.
           | 
           | Phone manufacturers were caught "optimizing" for benchmarks
           | again and again, removing power limits to boost scores. Hard
           | to name an example without searching the net because it's at
           | most a faux pas.
        
           | Swenrekcah wrote:
           | Actually performing well on a task that is used as a
            | benchmark is not comparable to deceiving authorities about
           | how much toxic gas you are releasing.
        
           | ArnoVW wrote:
            | True. But they did not optimize for a specific case. They
            | detected the test and then enabled a special regime that
            | was not used normally.
            | 
            | It's as if OpenAI detected the IP address of a benchmark
            | organization and then used a completely different model.
        
             | K0balt wrote:
             | This is the apples to apples version. Perhaps might be more
             | accurate to say that when detecting a benchmark attempt the
             | model tries the prompt 3 times with different seeds then
             | picks the best answer, otherwise it just zero-shots the
             | prompt in everyday use.
             | 
              | I say this because the test still uses the same hardware
             | (model) but changed the way it behaved by running emissions
             | friendly parameters ( a different execution framework) that
             | wouldn't have been used in everyday driving, where fuel
             | efficiency and performance optimized parameters were used
             | instead.
             | 
             | What I'd like to know is if it actually was unethical or
             | not. The overall carbon footprint of the lower fuel
             | consumption setting, with fuel manufacturing and
             | distribution factored in, might easily have been more
             | impactful than the emissions model, which typically does
             | not factor in fuel consumed.
        
           | sigmoid10 wrote:
           | Apples and oranges. VW actually cheated on regulatory testing
           | to bypass legal requirements. So to be comparable, the
           | government would first need to pass laws where e.g. only
           | compilers that pass a certain benchmark are allowed to be
           | used for purchasable products and then the developers would
           | need to manipulate behaviour during those benchmarks.
        
             | 0xFF0123 wrote:
             | The only difference is the legality. From an integrity
             | point of view it's basically the same
        
               | Thorrez wrote:
               | I think breaking a law is more unethical than not
               | breaking a law.
               | 
               | Also, legality isn't the only difference in the VW case.
               | With VW, they had a "good emissions" mode. They enabled
               | the good emissions mode during the test, but disabled it
               | during regular driving. It would have worked during
               | regular driving, but they disabled it during regular
               | driving. With compilers, there's no "good performance"
               | mode that would work during regular usage that they're
               | disabling during regular usage.
        
               | Lalabadie wrote:
               | > I think breaking a law is more unethical than not
               | breaking a law.
               | 
               | It sounds like a mismatch of definition, but I doubt
               | you're ambivalent about a behavior right until the moment
               | it becomes illegal, after which you think it unethical.
               | Law is the codification and enforcement of a social
               | contract, not the creation of it.
        
               | Thorrez wrote:
               | >I doubt you're ambivalent about a behavior right until
               | the moment it becomes illegal, after which you think it
               | unethical.
               | 
               | There are many cases where I think that. Examples:
               | 
               | * Underage drinking. If it's legal for someone to drink,
               | I think it's in general ethical. If it's illegal, I think
               | it's in general unethical.
               | 
               | * Tax avoidance strategies. If the IRS says a strategy is
               | allowed, I think it's ethical. If the IRS says a strategy
               | is not allowed, I think it's unethical.
               | 
               | * Right on red. If the government says right on red is
               | allowed, I think it's ethical. If the government (e.g.
               | NYC) says right on red is not allowed, I think it's
               | unethical.
               | 
               | The VW case was emissions regulations. I think they have
               | an ethical obligation to obey emissions regulations. In
               | the absence of regulations, it's not an obvious ethical
               | problem to prioritize fuel efficiency instead of
               | emissions (that's I believe what VW was doing).
        
               | chefandy wrote:
               | Drinking and right turns are unethical if they're
               | negligent. They're not unethical if they're not
               | negligent. The government is trying to reduce negligence
               | by enacting preventative measures to stop ALL right turns
               | and ALL drinking in certain contexts that are more likely
                | to yield negligence, or where the negligence would be
               | particularly harmful, but that doesn't change whether or
               | not the behavior itself is negligent.
               | 
               | You might consider disregarding the government's
               | preventative measures unethical, and doing those things
               | might be the way someone disregards the governments
               | protective guidelines, but that doesn't make those
               | actions unethical any more than governments explicitly
               | legalizing something makes it ethical.
               | 
               | To use a clearer example, the ethicality of abortion--
               | regardless of what you think of it-- is not changed by
               | its legal status. You might consider violating the law
               | unethical, so breaking abortion laws would constitute the
               | same ethical violation as underage drinking, but those
               | laws don't change the ethics of abortion itself. People
               | who consider it unethical still consider it unethical
               | where it's legal, and those that consider it ethical
               | still consider it ethical where it's not legal.
        
               | adgjlsfhk1 wrote:
               | the right on red example is interesting because in that
               | case, the law changes how other drivers and pedestrians
               | will behave in ways that make it pretty much always
               | unsafe
        
               | chefandy wrote:
               | That just changes the parameters of negligence. On a
               | country road in the middle of a bunch of farm land where
               | you can see for miles, it doesn't change a thing.
        
               | mbrock wrote:
               | It's not so simple. An analogy is the Rust formatter that
               | has no options so everyone just uses the same style. It's
               | minimally "unethical" to use idiosyncratic Rust style
               | just because it goes against the convention so people
               | will wonder why you're so special, etc.
               | 
               | If the rules themselves are bad and go against deeper
               | morality, then it's a different situation; violating laws
               | out of civil disobedience, emergent need, or with a
               | principled stance is different from wanton, arbitrary,
               | selfish cheating.
               | 
               | If a law is particularly unjust, violating the law might
               | itself be virtuous. If the law is adequate and sensible,
               | violating it is usually wrong even if the violating
               | action could be legal in another sensible jurisdiction.
        
               | ClumsyPilot wrote:
               | > but that doesn't make those actions unethical any more
               | than governments explicitly legalizing something makes it
               | ethical
               | 
               | That is, sometimes, sufficient.
               | 
               | If government says 'seller of a house must disclose
                | issues' then I rely on the law being followed; if
               | you sell and leave the country, you have defrauded me.
               | 
               | However if I live in a 'buyer beware' jurisdiction, then
               | I know I cannot trust the seller and I hire a surveyor
               | and take insurance.
               | 
               | There is a degree of setting expectations- if there is a
               | rule, even if it's a terrible rule, I as individual can
               | at least take some countermeasures.
               | 
               | You can't take countermeasures against all forms of
               | illegal behaviour, because there is infinite number of
                | them. And a truly insane person is entirely unpredictable.
        
               | banannaise wrote:
               | Outsourcing your morality to politicians past and present
               | is not a particularly useful framework.
        
               | anonymouskimmer wrote:
               | Ethics are only morality if you spend your entire time in
               | human social contexts. Otherwise morality is a bit
               | larger, and ethics are a special case of group recognized
               | good and bad behaviors.
        
               | emn13 wrote:
               | Also, while laws ideally are inspired by an ethical
               | social contract, the codification proces is long, complex
               | and far from perfect. And then for rules concerning
               | permissible behavior even in the best of cases, it's
               | enforced extremely sparingly simply because it's not
               | possible nor desirable to detect and deal with all
               | infractions. Nor is it applied blindly and equally. As
               | actually applied, a law is definitely not even close to
               | some ethical ideal; sometimes it's outright opposed to
               | it, even.
               | 
               | Law and ethics are barely related, in practice.
               | 
               | For example in the vehicle emissions context, it's worth
               | noting that even well before VW was caught the actions of
               | likely all carmakers affected by the regulations (not
               | necessarily to the same extent) were clearly unethical.
               | The rules had been subject to intense clearly unethical
               | lobbying for years, and so even the legal lab results
               | bore little resemblance to practical on-the-road results
               | though systematic (yet legal) abuse. I wouldn't be
               | surprised to learn that even what was measured
               | intentionally diverged from what is harmfully in a
               | profitable way. It's a good thing VW was made an example
               | of - but clearly it's not like that resolved the general
               | problem of harmful vehicle emissions. Optimistically, it
               | might have signaled to the rest of the industry and VW in
               | particular to stretch the rules less in the future.
        
               | mbrock wrote:
               | But following the law is itself a load bearing aspect of
               | the social contract. Violating building codes, for
               | example, might not cause immediate harm if it's competent
               | but unusual, yet it's important that people follow it
               | just because you don't want arbitrariness in matters of
               | safety. The objective ruleset itself is a value beyond
               | the rules themselves, if the rules are sensible and in
               | accordance with deeper values, which of course they
               | sometimes aren't, in which case we value civil
               | disobedience and activism.
        
               | Winse wrote:
               | unless following an unethical law would in itself be
               | unethical, then breaking the unethical law would be the
               | only ethical choice. In this case cheating emissions,
               | which I see as unethical, but also advantageous for the
               | consumer, should have been done openly if VW saw
               | following the law as unethical. Ethics and morality are
               | subjective to understanding, and law only a crude
               | approximation of divinity. Though I would argue that each
               | person on the earth through a shared common experience
               | has a rough and general idea of right from wrong...though
               | I'm not always certain they pay attention to it.
        
               | hansworst wrote:
               | Overfitting on test data absolutely does mean that the
               | model would perform better in benchmarks than it would in
               | real life use cases.
        
               | Retr0id wrote:
               | ethics should inform law, not the reverse
        
               | UniverseHacker wrote:
               | I disagree- presumably if an algorithm or hardware is
               | optimized for a certain class of problem it really is
               | good at it and always will be- which is still useful if
               | you are actually using it for that. It's just "studying
               | for the test"- something I would expect to happen even if
               | it is a bit misleading.
               | 
               | VW cheated such that the low emissions were only active
               | during the test- it's not that it was optimized for low
               | emissions under the conditions they test for, but that
               | you could not get those low emissions under any
               | conditions in the real world. That's "cheating on the
               | test" not "studying for the test."
        
               | the_af wrote:
               | > _The only difference is the legality. From an integrity
                | point of view it's basically the same_
               | 
               | I think cheating about harming the environment is another
               | important difference.
        
               | Swenrekcah wrote:
               | That is not true. Even ChatGPT understands how they are
               | different, I won't paste the whole response but here are
               | the differences it highlights:
               | 
               | Key differences:
               | 
                | 1. Intent and harm: VW's actions directly violated
                | laws and had environmental and health consequences.
                | Optimizing LLMs for chess benchmarks, while arguably
                | misleading, doesn't have immediate real-world harms.
                | 
                | 2. Scope: Chess-specific optimization is generally a
                | transparent choice within AI research. It's not a
                | hidden "defeat device" but rather an explicit design
                | goal.
                | 
                | 3. Broader impact: LLMs fine-tuned for benchmarks
                | often still retain general-purpose capabilities. They
                | aren't necessarily "broken" outside chess, whereas VW
                | cars fundamentally failed to meet emissions standards.
        
               | currymj wrote:
               | VW was breaking the law in a way that harmed society but
               | arguably helped the individual driver of the VW car, who
               | gets better performance yet still passes the emissions
               | test.
        
               | jimmaswell wrote:
               | And afaik the emissions were still miles ahead of a car
               | from 20 years prior, just not quite as extremely
               | stringent as requested.
        
               | slowmotiony wrote:
               | "not quite as extremely stringent as requested" is a
               | funny way to say they were emitting 40 times more toxic
               | fumes than permitted by law.
        
               | int_19h wrote:
               | It might sound funny in retrospect, but some of us
               | actually bought VW cars on the assumption that, if
               | biodiesel-powered, it would be more green.
        
               | boringg wrote:
               | How so? VW intentionally changed the operation of the
               | vehicle so that its emissions met the test requirements
               | during the test and then went back to typical operation
               | conditions afterwards.
        
               | TimTheTinker wrote:
               | Right - in either case it's lying, which is crossing a
               | moral line (which is far more important to avoid than a
               | legal line).
        
             | rsynnott wrote:
             | There's a sliding scale of badness here. The emissions
             | cheating (it wasn't just VW, incidentally; they were just
             | the first uncovered. Fiat-Chrysler, Mercedes, GM and BMW
             | were also caught doing it, with suspicions about others)
             | was straight-up fraud.
             | 
             | It used to be common for graphics drivers to outright cheat
             | on benchmarks (the actual image produced would not be the
             | same as it would have been if a benchmark had not been
              | detected); this was, arguably, fraud.
             | 
             | It used to be common for mobile phone manufacturers to
             | allow the SoC to operate in a thermal mode that was never
             | available to real users when it detected a benchmark was
             | being used. This is still, IMO, kinda fraud-y.
             | 
             | Optimisation for common benchmark cases where the thing
             | still actually _works_, and where the optimisation is
             | available to normal users where applicable, is less
             | egregious, though, still, IMO, Not Great.
        
             | waffletower wrote:
             | Tesla cheats by using electric motors and deferring
             | emissions standards to somebody else :D Wait, I really
             | think that's a good thing, but once Hulk Hogan is confirmed
             | administrator of the EPA, he might actually use this
             | argument against Teslas and other electric vehicles.
        
           | tightbookkeeper wrote:
           | This is 10 year old story. It's very interesting which ones
           | stay in the public consciousness.
        
           | bluGill wrote:
           | Most of the time these days compiler writers are not cheating
           | like VW did. In the 1980s compiler writers would insert code
           | to recognize performance tests and then cheat - output values
           | hard coded into the compiler instead of running the
           | algorithm. Which is the type of thing that VW got in trouble
           | for.
           | 
           | These days most compilers are trying to make the general case
           | of code fast and they rarely look for benchmarks. I won't say
           | they never do this - just that it is much less common - if
           | only because magazine reviews/benchmarks are not nearly as
           | important as they used to be and so the incentive is gone.
        
           | newerman wrote:
           | Funny response; you're not wrong.
        
           | conradev wrote:
           | GPT-3.5 did not "cheat" on chess benchmarks, though, it was
           | actually just better at chess?
        
             | GolfPopper wrote:
             | I think the OP's point is that chat GPT-3.5 may have a
             | chess-engine baked-in to its (closed and unavailable) code
             | for PR purposes. So it "realizes" that "hey, I'm playing a
             | game of chess" and then, rather than doing whatever it
             | normally does, it just acts as a front-end for a quite good
             | chess-engine.
        
               | conradev wrote:
               | I see - my initial interpretation of OP's "special case"
               | was "Theory 2: GPT-3.5-instruct was trained on more chess
               | games."
               | 
               | But I guess it's also a possibility that they had a real
               | chess engine hiding in there.
        
           | gdiamos wrote:
           | It's approximately bad, like most of ML
           | 
           | On one side:
           | 
           | Would you expect a model trained on no Spanish data to do
           | well on Spanish?
           | 
           | On the other:
           | 
           | Is it okay to train on the MMLU test set?
        
         | dang wrote:
         | We detached this subthread from
         | https://news.ycombinator.com/item?id=42144784.
         | 
         | (Nothing wrong with it! It's just a bit more generic than the
         | original topic.)
        
       | fabiospampinato wrote:
        | It's probably worth playing around with different prompts
        | and different board positions.
       | 
       | For context this [1] is the board position the model is being
       | prompted on.
       | 
       | There may be more than one weird thing about this experiment;
       | for example, giving instructions to the non-instruction-tuned
       | variants may be counterproductive.
       | 
       | More importantly, say you just give the model the truncated
       | PGN: does this look like a position where white is a
       | grandmaster-level player? I don't think so. Even a model that
       | understands chess really well is going to try to predict the
       | most probable move given the position at hand. If it thinks
       | white is a bad player, then predicting bad moves as the more
       | likely ones is simply the better prediction of what is likely
       | to happen here.
       | 
       | [1]: https://i.imgur.com/qRxalgH.png
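       | 
       | As a rough sketch of the setup being discussed (the exact
       | prompt and sampling settings here are assumptions, not the
       | article's), a truncated-PGN completion request might look
       | like this in Python:
       | 
       |   # Ask a completion model for the next move, given only a
       |   # truncated PGN. Prompt wording/settings are assumptions.
       |   from openai import OpenAI
       | 
       |   client = OpenAI()  # uses OPENAI_API_KEY from the env
       | 
       |   pgn_so_far = "1. e4 e5 2. Nf3 Nc6 3."
       |   resp = client.completions.create(
       |       model="gpt-3.5-turbo-instruct",
       |       prompt=pgn_so_far,
       |       max_tokens=5,
       |       temperature=0,
       |   )
       |   print(resp.choices[0].text)  # e.g. " Bb5" continuing the PGN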
        
         | Closi wrote:
         | Agree with this. A few prompt variants:
         | 
         | * What if you allow the model to do Chain of Thought
         | (explicitly disallowed in this experiment)
         | 
         | * What if you explain the board position at each step to the
         | model in the prompt, so it doesn't have to calculate/estimate
         | it internally (a rough sketch of this follows below).
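         | 
         | A minimal sketch of the second variant, using the
         | python-chess library (an assumption here, not something the
         | article used) to spell out the position before asking for a
         | move:
         | 
         |   # Render the board as text each turn so the model does not
         |   # have to reconstruct the position from the move list.
         |   import chess
         | 
         |   board = chess.Board()
         |   for san in ["e4", "e5", "Nf3", "Nc6"]:
         |       board.push_san(san)
         | 
         |   side = "White" if board.turn == chess.WHITE else "Black"
         |   prompt = (
         |       f"{board}\n"            # ASCII board diagram
         |       f"FEN: {board.fen()}\n"
         |       f"{side} to move. Reply with one legal move in SAN."
         |   )
         |   print(prompt)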
        
           | int_19h wrote:
           | They also tested GPT-o1, which is always CoT. Yet it is still
           | worse.
        
         | fabiospampinato wrote:
         | Apparently I can find some matches for games that start like
         | that between very strong players [1], so my hypothesis that
         | the model may just be predicting bad moves on purpose seems
         | wobbly. That said, having Stockfish at the lowest level play
         | as the supposedly very strong opponent may still be throwing
         | the model off somewhat. If I'm interpreting the charts right,
         | the first few moves the model makes seem decent, and after a
         | few of those things start to go wrong.
         | 
         | Either way it's worth repeating the experiment imo, tweaking
         | some of these variables (prompt guidance, stockfish strength,
         | starting position, the name of the supposed players, etc.).
         | 
         | [1]:
         | https://www.365chess.com/search_result.php?search=1&p=1&m=8&...
        
         | spott wrote:
         | He was playing full games, not single moves.
        
         | NiloCK wrote:
         | The experiment started from the first move of a game, and
         | played each game fully. The position you linked was just an
         | example of the format used to feed the game state to the model
         | for each move.
         | 
         | What would "winning" or "losing" even mean if all of this was
         | against a single move?
        
       | osaatcioglu wrote:
       | I've also been experimenting with Chess and LLMs but have taken a
       | slightly different approach. Rather than using the LLM as an
       | opponent, I've implemented it as a chess tutor to provide
       | feedback on both the user's and the bot's moves throughout the
       | game.
       | 
       | The responses vary with the user's chess level; some find the
       | feedback useful, while others do not. To address this, I've
       | integrated a like, dislike, and request new feedback feature into
       | the app, allowing users to actively seek better feedback.
       | 
       | Btw, different from OP's setup, I opted to input the FEN of the
       | current board and the subsequent move in standard algebraic
       | notation to request feedback, as I found these inputs to be
       | clearer for the LLM compared to giving the PGN of the game.
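       | 
       | A minimal sketch of that input format (FEN of the current
       | board plus the candidate move in SAN), using python-chess for
       | bookkeeping; the exact wording is an assumption, not the
       | app's actual prompt:
       | 
       |   # Build a tutor-feedback prompt from FEN + SAN rather than
       |   # the full PGN of the game.
       |   import chess
       | 
       |   # Position after 1. e4 e5 2. Nf3 Nc6, White to move
       |   board = chess.Board(
       |       "r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/"
       |       "PPPP1PPP/RNBQKB1R w KQkq - 2 3")
       |   move_san = "Bb5"
       | 
       |   prompt = (
       |       f"FEN: {board.fen()}\n"
       |       f"Move played: {move_san}\n"
       |       "In two sentences, give feedback on this move for a"
       |       " club-level player."
       |   )
       |   print(prompt)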
       | 
       | AI Chess GPT https://apps.apple.com/tr/app/ai-chess-
       | gpt/id6476107978
       | https://play.google.com/store/apps/details?id=net.padma.app....
       | 
       | Thanks
        
         | antononcube wrote:
         | Yeah, I was wondering why the featured article's author did
         | not use Forsyth-Edwards Notation (FEN) and more elaborate
         | chess prompts.
         | 
         | BTW, a year ago when I used FEN for chess playing, LLMs would
         | very quickly/often make illegal moves. (The article prompts me
         | to check whether that has changed...)
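         | 
         | Checking for that is straightforward with python-chess (an
         | assumed helper here, not something from the article):
         | validate each move the model returns against the current
         | position.
         | 
         |   # Flag illegal SAN moves returned by a model.
         |   import chess
         | 
         |   def is_legal_san(board: chess.Board, san: str) -> bool:
         |       """True if `san` is a legal move in `board`."""
         |       try:
         |           board.parse_san(san)  # raises on illegal SAN
         |           return True
         |       except ValueError:
         |           return False
         | 
         |   board = chess.Board()
         |   board.push_san("e4")
         |   print(is_legal_san(board, "e5"))   # True
         |   print(is_legal_san(board, "Ke2"))  # False for Black here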
        
       | philipwhiuk wrote:
       | > I always had the LLM play as white against Stockfish--a
       | standard chess AI--on the lowest difficulty setting.
       | 
       | Okay, so "Excellent" still means probably quite bad. I assume at
       | the top difficulty setting gpt-3.5-turbo-instruct will still
       | lose badly.
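       | 
       | For reference, the usual way to weaken Stockfish is its UCI
       | "Skill Level" option (whether the article's "lowest difficulty
       | setting" maps to this exactly is an assumption); a sketch with
       | python-chess and an assumed binary path:
       | 
       |   # Play one move from Stockfish at its weakest skill level.
       |   import chess
       |   import chess.engine
       | 
       |   engine = chess.engine.SimpleEngine.popen_uci(
       |       "/usr/bin/stockfish")             # assumed install path
       |   engine.configure({"Skill Level": 0})  # 0-20, 0 = weakest
       | 
       |   board = chess.Board()
       |   result = engine.play(board, chess.engine.Limit(time=0.01))
       |   print(result.move)
       |   engine.quit()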
        
         | XCSme wrote:
         | Probably even at lvl 2 out of 9 it would lose all the games.
        
       | gunalx wrote:
       | Is it just me, or does the author swap the descriptions of the
       | instruction-finetuned and the base gpt-3.5-turbo? It seemed like
       | the best model was labeled instruct, but the text says instruct
       | did worse?
        
       | sylware wrote:
       | They probably concluded that the additional cost of training
       | those models on chess would not be "cost effective", and dropped
       | chess from their training process, for the moment.
       | 
       | That said, we can say literally anything here because this is
       | very shadowy/murky, but since everything is likely a question of
       | money... that should _probably_ not be very far from the
       | truth...
        
       | XCSme wrote:
       | I had the same experience with LLM text-to-SQL: 3.5-instruct
       | felt a lot more robust than 4o.
        
       | reallyeli wrote:
       | My guess is they just trained gpt3.5-turbo-instruct on a lot of
       | chess, much more than is in e.g. CommonCrawl, in order to boost
       | it on that task. Then they didn't do this for other models.
       | 
       | People are alleging that OpenAI is calling out to a chess engine,
       | but seem not to be considering this less scandalous possibility.
       | 
       | Of course, to the extent people are touting chess performance as
       | evidence of general reasoning capabilities, OpenAI taking costly
       | actions to boost specifically chess performance and not being
       | transparent about it is still frustrating and, imo, dishonest.
        
         | sherburt3 wrote:
         | They have a massive economic incentive to make their closed
         | source software look as good as possible; why wouldn't they
         | cheat?
        
       | stefatorus wrote:
       | The trick to getting a model to perform well on something is to
       | include it as a subset of the training data.
       | 
       | OpenAI might have thought chess was good to optimize for, but it
       | wasn't seen as useful, so they dropped it.
       | 
       | This is what people refer to as a "lobotomy": AI models waste
       | compute on knowing how loud the cicadas are and how wide the
       | green cockroach is when mating.
       | 
       | Good models are about the training data you push into them.
        
       | smokedetector1 wrote:
       | I feel like an easy win here would be retraining an LLM with a
       | tokenizer specifically designed for chess notation?
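       | 
       | A toy illustration of the idea (the vocabulary here is an
       | assumption, not a proposal from the article): give every
       | square, piece letter, file, and common SAN symbol its own
       | token, so moves are never split inconsistently.
       | 
       |   # Toy chess-specific tokenizer with greedy longest match.
       |   PIECES  = list("KQRBN")
       |   FILES   = list("abcdefgh")
       |   SQUARES = [f + r for f in "abcdefgh" for r in "12345678"]
       |   SYMBOLS = ["x", "+", "#", "=", "O-O", "O-O-O"] + \
       |             [f"{n}." for n in range(1, 100)]
       |   VOCAB = {t: i for i, t in enumerate(
       |       SYMBOLS + PIECES + FILES + SQUARES)}
       | 
       |   def tokenize(san_moves):
       |       ids = []
       |       for move in san_moves:
       |           i = 0
       |           while i < len(move):
       |               for length in range(5, 0, -1):  # longest first
       |                   piece = move[i:i + length]
       |                   if piece in VOCAB:
       |                       ids.append(VOCAB[piece])
       |                       i += len(piece)
       |                       break
       |               else:
       |                   i += 1  # skip chars outside the vocab
       |       return ids
       | 
       |   print(tokenize("1. e4 e5 2. Nf3 Nc6".split()))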
        
       | misiek08 wrote:
       | For me it's not only chess. Chats get more chatty, but knowledge-
       | and fact-wise it's a sad comedy. Yes, you get a buddy to talk
       | with, but he talks pure nonsense.
        
       | teleforce wrote:
       | TL;DR.
       | 
       | All of the LLMs tested played terribly against the Stockfish
       | engine, except gpt-3.5-turbo-instruct, which is a closed OpenAI
       | model.
        
       | jack_riminton wrote:
       | Perhaps it doesn't have enough data to explain, but it has
       | enough to go "on gut".
        
       | sourcepluck wrote:
       | Keep in mind, everyone, that Stockfish on its lowest level on
       | lichess is absolutely terrible, and a 5-year-old human who'd been
       | playing chess for a few months could beat it regularly. It hangs
       | pieces, makes -3 blunders, plays totally random-looking bad
       | moves.
       | 
       | But still, yes, something maybe a teeny tiny bit weird is going
       | on, in the sense that only one of the LLMs could beat it. The
       | arxiv paper that came out recently was much more "weird" and
       | interesting than this, though. This will probably be met with a
       | mundane explanation soon enough, I'd guess.
        
         | sourcepluck wrote:
         | Here's a quick anonymous game against it by me, where I
         | obliterate the poor thing in 11 moves. I was around a 1500 ELO
         | classical-strength player, which is a teeny bit above average,
         | globally. But I mean - not an expert, or even one of the
         | "strong" club players (in any good club).
         | 
         | https://lichess.org/BRceyegK -- the game, you'll see it make
         | the ultimate classic opening errors
         | 
         | https://lichess.org/ -- try it yourself! It's really so bad,
         | it's good fun. Click "play with computer" on the right; level 1
         | is already selected, then hit go.
        
       | nabla9 wrote:
       | Theory 5: gpt-3.5-turbo-instruct has chess engine attached to it.
        
       | wufufufu wrote:
       | > And then I tried gpt-3.5-turbo-instruct. This is a closed
       | OpenAI model, so details are very murky.
       | 
       | How do you know it didn't just write a script that uses a chess
       | engine and then execute the script? That IMO is the easiest
       | explanation.
       | 
       | Also, I looked at the gpt-3.5-turbo-instruct example victory. One
       | side played with 70% accuracy and the other with 77%. IMO that's
       | not on par with 27XX ELO.
        
       | leogao wrote:
       | The GPT-4 pretraining set included chess games in PGN notation
       | from 1800+ ELO players. I can't comment on any other models.
        
       | a_wild_dandan wrote:
       | Important testing excerpts:
       | 
       | - "...for the closed (OpenAI) models I tried generating up to 10
       | times and if it still couldn't come up with a legal move, I just
       | chose one randomly."
       | 
       | - "I ran all the open models (anything not from OpenAI, meaning
       | anything that doesn't start with gpt or o1) myself using Q5_K_M
       | quantization"
       | 
       | - "...if I gave a prompt like "1. e4 e5 2. " (with a space at the
       | end), the open models would play much, much worse than if I gave
       | a prompt like "1 e4 e5 2." (without a space)"
       | 
       | - "I used a temperature of 0.7 for all the open models and the
       | default for the closed (OpenAI) models."
       | 
       | Between the tokenizer weirdness, temperature, quantization,
       | random moves, and the chess prompt, there's a lot going on here.
       | I'm unsure how to interpret the results. Fascinating article
       | though!
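       | 
       | The trailing-space quirk is easy to inspect directly. A
       | sketch using tiktoken's cl100k_base purely as an illustration
       | (the open models use their own tokenizers, but the mechanism
       | is the same kind of thing):
       | 
       |   # Show how a trailing space moves the token boundary at the
       |   # end of a chess prompt.
       |   import tiktoken
       | 
       |   enc = tiktoken.get_encoding("cl100k_base")
       |   for prompt in ("1. e4 e5 2.", "1. e4 e5 2. "):
       |       pieces = [enc.decode([t]) for t in enc.encode(prompt)]
       |       print(repr(prompt), "->", pieces)
       | 
       | One plausible reading: without the trailing space, the model
       | is free to emit a continuation token that itself begins with
       | a space (matching how games appear in training data); once
       | the space is already spent, that alignment is broken.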
        
       | mips_avatar wrote:
       | Ok, whoa: assuming the chess powers of gpt-3.5-instruct are just
       | a result of training focus, then we don't have to wait for bigger
       | models; we just need to fine-tune at 175B?
        
       | 1024core wrote:
       | I would love to see the prompts (the data) this person used.
        
       | mastazi wrote:
       | If you look at the comments under the post, the author commented
       | 25 minutes ago (as of me posting this)
       | 
       | > Update: OK, I actually think I've figured out what's causing
       | this. I'll explain in a future post, but in the meantime, here's
       | a hint: I think NO ONE has hit on the correct explanation!
       | 
       | well now we are curious!
        
       | amelius wrote:
       | What would happen if you prompted it with much more text, e.g.
       | general advice from a chess grandmaster?
        
       | throwawaymaths wrote:
       | It would be more interesting with trivial LoRA training.
        
       ___________________________________________________________________
       (page generated 2024-11-15 23:01 UTC)