[HN Gopher] Manipulating Chess-GPT's World Model
       ___________________________________________________________________
        
       Manipulating Chess-GPT's World Model
        
       Author : seraine
       Score  : 71 points
        Date   : 2024-03-25 14:22 UTC (1 day ago)
        
 (HTM) web link (adamkarvonen.github.io)
 (TXT) w3m dump (adamkarvonen.github.io)
        
       | seraine wrote:
       | The code for this is located here:
       | https://github.com/adamkarvonen/chess_llm_interpretability
        
       | dang wrote:
       | Recent and related:
       | 
       |  _Chess-GPT 's Internal World Model_ -
       | https://news.ycombinator.com/item?id=38893456 - Jan 2024 (103
       | comments)
        
         | patresh wrote:
          | Another related one from last year, based on the Othello game
          | (cited in the above paper):
         | 
         |  _Do Large Language Models learn world models or just surface
         | statistics?_ - https://news.ycombinator.com/item?id=34474043 -
         | Jan 2023 (174 comments)
        
       | mxwsn wrote:
       | I'd be curious to see which moves gain or lose the most
       | probability when performing the skill intervention. Or the
       | correlation between predicted move probabilities and engine score
       | for moves, before and after skill intervention.
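A minimal sketch of the first analysis suggested above, assuming you already have the model's move distributions before and after the skill intervention (the dictionaries here are made-up stand-ins, not real Chess-GPT outputs):

```python
# Rank moves by how much probability they gained or lost under the
# skill intervention. Inputs are move -> probability dicts; the values
# below are illustrative, not actual model outputs.

def probability_deltas(before: dict, after: dict) -> list:
    """Return (move, delta) pairs sorted from biggest gain to biggest loss."""
    moves = set(before) | set(after)
    deltas = {m: after.get(m, 0.0) - before.get(m, 0.0) for m in moves}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)

before = {"e4": 0.30, "d4": 0.25, "a3": 0.10}  # hypothetical distribution
after = {"e4": 0.55, "d4": 0.20, "a3": 0.02}   # after skill intervention
print(probability_deltas(before, after))
```

The same delta list, joined against per-move engine evaluations, would give the suggested correlation between probability shifts and engine score.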
        
       | mewpmewp2 wrote:
       | I'm curious, what kind of setup is required to train something
       | like that? Can it be done with 4090 and how long would it take?
       | 50M or 25M parameters? And what is the most parameters 4090 can
       | do?
       | 
        | This inspires a few ideas.
        
         | anotherjesse wrote:
         | More details in the previous blog post:
         | https://adamkarvonen.github.io/machine_learning/2024/01/03/c...
         | 
         | > A 50 million parameter GPT trained on 5 million games of
         | chess learns to play at ~1300 Elo in one day on 4 RTX 3090
         | GPUs.
         | 
         | And from the paper: https://arxiv.org/abs/2403.15498
         | 
         | > The 25M parameter model took 72 hours to train on one RTX
         | 3090 GPU. The 50M parameter model took 38 hours to train on
         | four RTX 3090 GPUs.
         | 
         | definitely inspiring :)
        
         | refulgentis wrote:
         | I'm waiting for a build so I went fully down this trail. XD
         | 
         | A1) Yes it can be done with a 4090. (2a).
         | 
         | A2) 2 days. (4d/4e).
         | 
          | B) Up to you; the author did both and settled on 50M, which
          | they got to 1300 Elo. (Note: they also did subsequent work with
          | the same model to increase perf without further training.) (2a).
         | 
          | C) Mu: there's no hard limit, since you can page parameters
          | in/out (3a, 3c), and it's trivial to split a model across RAM
          | and VRAM. A 30B model fits in 32 GB of RAM or VRAM. (3e)
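A back-of-envelope check of the "30B in 32 GB" figure (assuming ~1 byte per parameter at 8-bit quantization; fp16 would be 2 bytes per parameter):

```python
# Memory needed for model weights is roughly n_params * bytes_per_param.
# 8-bit quantization -> ~1 byte/param; fp16 -> 2 bytes/param.
# This ignores activation/KV-cache overhead, so it's a lower bound.

def model_gb(n_params: float, bytes_per_param: float) -> float:
    return n_params * bytes_per_param / 1e9

print(model_gb(30e9, 1))  # 30.0 GB at 8-bit: squeaks into 32 GB RAM+VRAM
print(model_gb(30e9, 2))  # 60.0 GB at fp16: would not fit
```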
         | 
         | Starting from:
         | 
         | 1. article:
         | https://adamkarvonen.github.io/machine_learning/2024/03/20/c...
         | 
         | 2. second sentence links to post on training: https://adamkarvo
         | nen.github.io/machine_learning/2024/01/03/c....
         | 
         | 2a. "A 50 million parameter GPT trained on 5 million games of
         | chess learns to play at ~1300 Elo in one day on 4 RTX 3090
         | GPUs"
         | 
         | 2b. "The 50M parameter model played at 1300 Elo with 99.8% of
         | its moves being legal within one day of training"
         | 
         | 3. re: most parameters 4090 can do:
         | 
          | 3a. my understanding is there's no hard limit, in that you
          | don't _need_ to have every parameter in memory at all times,
          | during either training or inference.
         | 
         | 3b. google "are amount of parameters in llm limited by vram
         | size"
         | 
         | 3c. go to /r/LocalLLaMa link:
         | https://www.reddit.com/r/LocalLLaMA/comments/15j0mvm/what_ar...
          | (why? it's a favorite of mine; I believe it's the closest you
          | get to local-training folks talking in an open space that's not
          | Discord)
         | 
         | 3d. understanding in 3a is correct.
         | 
         | 3e. "The model must fit in your RAM or VRAM, but you can split
         | the model between them. With 32gb ram you could fit a 30b
         | model"
         | 
         | 4. 3090 vs. 4090 training speed
         | 
         | 4a. google "rtx 3090 vs. 4090 llm training perf"
         | 
         | 4b. wander through top reddit links. useful info, but a little
         | too technical to share in a way that doesn't require explaining
         | a lot.
         | 
          | 4c. down the list: Lambda Labs link from October 2022:
         | https://lambdalabs.com/blog/nvidia-rtx-4090-vs-rtx-3090-deep...
         | 
          | 4d. the 4090 is roughly 2x the perf on Transformer-XL Large,
          | the closest match to a transformer-based large model, i.e. an
          | LLM.
         | 
          | 4e. training took 4 3090-days for the author, so about 2 days
          | on a 4090.
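The 4d/4e estimate as explicit arithmetic (assuming the ~2x Transformer-XL ratio from the Lambda Labs benchmark carries over, and near-linear multi-GPU scaling):

```python
# 50M model: four RTX 3090s for ~1 day = ~4 3090-days of training.
# If one 4090 is ~2x a 3090 on transformer training, a single 4090
# needs roughly 4 / 2 = 2 days. Both factors are rough assumptions.

gpu_days_3090 = 4 * 1        # four 3090s for one day
speedup_4090_vs_3090 = 2.0   # rough benchmark ratio
days_on_one_4090 = gpu_days_3090 / speedup_4090_vs_3090
print(days_on_one_4090)      # 2.0
```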
        
           | anotherjesse wrote:
           | Additionally instructions on training/inference on mac -
           | https://github.com/adamkarvonen/nanoGPT
           | 
           | > To sample on Mac, uncomment line 21 in sample.py. To train
           | on Mac, rename train_shakespeare_char_mac.py to
           | train_shakespeare_char.py
           | 
            | The `mac` file changed several things - I decided to try
            | running training with the original config file, changing
            | device to mps / compile to false:
            | 
            |     iter 100: loss 2.0268, time 815.43ms, mfu 3.24%
            |     iter 200: loss 1.8523, time 818.79ms, mfu 3.24%
            |     iter 300: loss 1.7799, time 823.05ms, mfu 3.23%
            |     iter 400: loss 1.6887, time 819.08ms, mfu 3.23%
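For reference, the overrides described above can be written as a nanoGPT-style config fragment (a sketch; nanoGPT configs are plain Python assignments, and these two names match the ones the comment mentions changing):

```python
# nanoGPT-style config overrides for training on Apple Silicon
device = 'mps'    # use Apple's Metal backend instead of 'cuda'
compile = False   # skip torch.compile, which targets CUDA
```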
           | 
           | Training is ~4x slower than the speed reported on the
           | original multi-GPU run: https://wandb.ai/adam-karvonen/chess-
           | gpt-batch/runs/zt5htyl6...
           | 
            | Not bad for an M2 Studio that's running lots of other
            | workloads at the same time.
        
       | Der_Einzige wrote:
        | All of the related work, such as activation/representation
        | engineering and control/steering vectors, is also really neat!
       | 
       | You can play with steering vectors within oobabooga now:
       | https://github.com/Hellisotherpeople/llm_steer-oobabooga
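A minimal, framework-free sketch of the steering-vector idea (real tools like llm_steer apply this to actual transformer layers via PyTorch forward hooks; the toy activations below are made up):

```python
# Steering: add a scaled, fixed direction to a layer's hidden states so
# every forward pass is nudged along that direction in activation space.

def steer(hidden_states, steering_vector, scale=1.0):
    """Shift each token position's activation along the steering vector."""
    return [
        [h + scale * s for h, s in zip(position, steering_vector)]
        for position in hidden_states
    ]

# toy "activations": 2 token positions with a 3-dim hidden state
hidden = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
direction = [1.0, 0.0, -1.0]  # hypothetical steering direction
print(steer(hidden, direction, scale=2.0))
```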
        
       | jprival wrote:
       | Reminds me tangentially of the study that found that strong chess
       | players were much worse at recalling randomized positions than
       | realistic ones:
       | https://link.springer.com/content/pdf/10.3758/BF03200937.pdf
       | 
       | I'm wary of inviting the too-easy "it's just like how it works
       | for people!" comparison, but the implied context and history of a
       | game state seems to be important in processing it.
        
       | elif wrote:
       | > had Chess-GPT play against Stockfish with this random
       | initialization. Its performance plummeted. The larger 50 million
       | parameter model's win rate dropped from 70% to 17%.
       | 
        | Beating Stockfish 17% of the time is still incredible for any
        | engine.
        | 
        | And being trained on human moves rather than engine moves will
        | make cheating analysis so much more annoying...
        
         | jsmith99 wrote:
         | Article says: 'Chess-GPT also played chess well, with the best
         | model playing at approximately 1500 Elo.'
         | 
         | So I'm guessing this wasn't full strength stockfish.
        
           | noSyncCloud wrote:
            | The blog post notes that it was SF level 0.
        
           | jxy wrote:
           | The previous blog post [0] mentioned they were using
           | stockfish with 0.1 seconds per move.
           | 
           | [0] https://adamkarvonen.github.io/machine_learning/2024/01/0
           | 3/c...
        
         | goatlover wrote:
          | It would be impressive if it were playing Stockfish at a
          | grandmaster level. But it looks like it's lower-level. Still
         | impressive that it can learn to play decently at some level.
         | Doubt it will ever be competitive with top engine levels.
        
         | seraine wrote:
          | I updated the article (the second blog post) to mention this
          | as well. All games were against Stockfish level 0, with nodes
          | searched capped at 100,000 rather than a time-based search
          | limit, to reduce variability across different processors.
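In UCI terms, that configuration corresponds to setting Stockfish's `Skill Level` option and issuing a node-limited search:

```
setoption name Skill Level value 0
position startpos
go nodes 100000
```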
        
           | afro88 wrote:
           | I googled stockfish "level 0" and it came up with no results.
           | What do you mean by level 0?
        
             | seraine wrote:
              | The Stockfish program can be set to play at strength
              | levels 0-20. Estimates of the levels' Elo are provided
              | here: https://github.com/official-
              | stockfish/Stockfish/commit/a08b8...
        
         | swyx wrote:
          | I think I don't understand what Stockfish does, then. Doesn't
          | it play basically the perfect move at each step? How is it
          | possible to beat Stockfish 70% of the time with a naive
          | GPT-type thing?
        
           | makeset wrote:
           | Chess is not a fully "solved" game, so the actual perfect
           | move is not generally known. Like any other engine, Stockfish
           | just tries its best and is not infallible.
        
       ___________________________________________________________________
       (page generated 2024-03-26 23:00 UTC)