[HN Gopher] Manipulating Chess-GPT's World Model
___________________________________________________________________
Manipulating Chess-GPT's World Model
Author : seraine
Score : 71 points
Date : 2024-03-25 14:22 UTC (1 day ago)
(HTM) web link (adamkarvonen.github.io)
(TXT) w3m dump (adamkarvonen.github.io)
| seraine wrote:
| The code for this is located here:
| https://github.com/adamkarvonen/chess_llm_interpretability
| dang wrote:
| Recent and related:
|
| _Chess-GPT's Internal World Model_ -
| https://news.ycombinator.com/item?id=38893456 - Jan 2024 (103
| comments)
| patresh wrote:
| Another related one from last year, based on the Othello game
| (cited in the paper above):
|
| _Do Large Language Models learn world models or just surface
| statistics?_ - https://news.ycombinator.com/item?id=34474043 -
| Jan 2023 (174 comments)
| mxwsn wrote:
| I'd be curious to see which moves gain or lose the most
| probability when performing the skill intervention, or the
| correlation between predicted move probabilities and engine
| scores for moves, before and after the skill intervention.
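|
| (A quick sketch of that first comparison - hypothetical, just
| assuming the model's move probabilities come back as dicts;
| none of these names are from the linked repo:)

        # rank moves by probability change across an intervention
        def top_probability_deltas(probs_before, probs_after, k=3):
            moves = set(probs_before) | set(probs_after)
            deltas = {m: probs_after.get(m, 0.0)
                         - probs_before.get(m, 0.0) for m in moves}
            ranked = sorted(deltas.items(), key=lambda kv: kv[1])
            return ranked[:k], ranked[-k:]  # losers, gainers

        # toy numbers, not real model output:
        losers, gainers = top_probability_deltas(
            {"e2e4": 0.30, "g1f3": 0.25, "a2a3": 0.01},
            {"e2e4": 0.55, "g1f3": 0.15, "a2a3": 0.02})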
| mewpmewp2 wrote:
| I'm curious, what kind of setup is required to train
| something like that? Can it be done with a 4090, and how long
| would it take? 50M or 25M parameters? And what's the largest
| parameter count a 4090 can handle?
|
| This inspires a few ideas.
| anotherjesse wrote:
| More details in the previous blog post:
| https://adamkarvonen.github.io/machine_learning/2024/01/03/c...
|
| > A 50 million parameter GPT trained on 5 million games of
| chess learns to play at ~1300 Elo in one day on 4 RTX 3090
| GPUs.
|
| And from the paper: https://arxiv.org/abs/2403.15498
|
| > The 25M parameter model took 72 hours to train on one RTX
| 3090 GPU. The 50M parameter model took 38 hours to train on
| four RTX 3090 GPUs.
|
| definitely inspiring :)
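|
| For scale, those sizes check out against the standard
| back-of-the-envelope estimate for a decoder-only transformer
| (~12 * n_layer * n_embd^2, ignoring embeddings). The layer and
| width shapes below are illustrative guesses; the real configs
| are in the linked repo:

        # approximate non-embedding parameter count of a GPT
        def approx_params(n_layer: int, n_embd: int) -> int:
            return 12 * n_layer * n_embd ** 2

        print(approx_params(8, 512))   # 25,165,824 -> "25M" scale
        print(approx_params(16, 512))  # 50,331,648 -> "50M" scale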
| refulgentis wrote:
| I'm waiting for a build so I went fully down this trail. XD
|
| A1) Yes, it can be done with a 4090. (2a).
|
| A2) 2 days. (4d/4e).
|
| B) Up to you; the author did both and settled on 50M, which
| they got to 1300 Elo. (Note: they also did subsequent work
| with the same model to increase perf without further
| training.) (2a)
|
| C) Mu: there's no limit, since you can page parameters in and
| out (3a, 3c), and it's trivial to split a model between RAM
| and VRAM. A 30B model fits in memory with 32 GB of RAM or
| VRAM (3e; rough arithmetic at the end of this comment).
|
| Starting from:
|
| 1. article:
| https://adamkarvonen.github.io/machine_learning/2024/03/20/c...
|
| 2. second sentence links to post on training:
| https://adamkarvonen.github.io/machine_learning/2024/01/03/c...
|
| 2a. "A 50 million parameter GPT trained on 5 million games of
| chess learns to play at ~1300 Elo in one day on 4 RTX 3090
| GPUs"
|
| 2b. "The 50M parameter model played at 1300 Elo with 99.8% of
| its moves being legal within one day of training"
|
| 3. re: the most parameters a 4090 can do:
|
| 3a. my understanding is that there's no hard limit, in that
| you don't _need_ to have every parameter in memory at all
| times, either during training or inference.
|
| 3b. google "are amount of parameters in llm limited by vram
| size"
|
| 3c. go to the /r/LocalLLaMA link:
| https://www.reddit.com/r/LocalLLaMA/comments/15j0mvm/what_ar...
| (why? it's a favorite of mine; I believe it's the closest you
| get to local-training people talking in the open that's not
| Discord)
|
| 3d. the understanding in 3a is correct.
|
| 3e. "The model must fit in your RAM or VRAM, but you can split
| the model between them. With 32gb ram you could fit a 30b
| model"
|
| 4. 3090 vs. 4090 training speed
|
| 4a. google "rtx 3090 vs. 4090 llm training perf"
|
| 4b. wander through the top Reddit links. Useful info, but a
| little too technical to share in a way that doesn't require
| explaining a lot.
|
| 4c. down the list: Lambda Labs link from October 2022:
| https://lambdalabs.com/blog/nvidia-rtx-4090-vs-rtx-3090-deep...
|
| 4d. the 4090 is roughly 2x the perf on Transformer-XL Large,
| the best match for a transformer-based large model, i.e. an
| LLM
|
| 4e. training took the author 4 3090-days (one day on four
| 3090s), so at ~2x per-GPU perf: roughly 2 days on one 4090.
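|
| To put rough numbers on 3e (plain bytes-per-parameter
| arithmetic, weights only - training needs several times more
| for gradients and optimizer state):

        # weight memory for a 30B-parameter model by precision
        for bits in (32, 16, 8, 4):
            gb = 30e9 * bits / 8 / 1e9
            print(f"30B params at {bits}-bit: ~{gb:.0f} GB")
        # -> 120, 60, 30, 15 GB: "30B in 32 GB" implies ~8-bit
        #    quantization or splitting across RAM + VRAM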
| anotherjesse wrote:
| Additionally, instructions for training/inference on a Mac -
| https://github.com/adamkarvonen/nanoGPT
|
| > To sample on Mac, uncomment line 21 in sample.py. To train
| on Mac, rename train_shakespeare_char_mac.py to
| train_shakespeare_char.py
|
| The `mac` file changed several things - I decided to try
| running training with the original config file, changing
| device to mps and compile to false:

        iter 100: loss 2.0268, time 815.43ms, mfu 3.24%
        iter 200: loss 1.8523, time 818.79ms, mfu 3.24%
        iter 300: loss 1.7799, time 823.05ms, mfu 3.23%
        iter 400: loss 1.6887, time 819.08ms, mfu 3.23%
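|
| (A minimal sketch of those two overrides as nanoGPT-style
| top-level config variables - names assumed from the stock
| train.py:)

        import torch

        # prefer Apple's MPS backend when available, and skip
        # torch.compile, which the Mac run above had disabled
        device = "mps" if torch.backends.mps.is_available() else "cpu"
        compile = False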
|
| Training is ~4x slower than the speed reported for the
| original multi-GPU run:
| https://wandb.ai/adam-karvonen/chess-gpt-batch/runs/zt5htyl6...
|
| Not bad for an M2 Studio that's running lots of other
| workloads at the same time.
| Der_Einzige wrote:
| All of the related work, such as activation/representation
| engineering and control/steering vectors, is also really
| neat!
|
| You can play with steering vectors within oobabooga now:
| https://github.com/Hellisotherpeople/llm_steer-oobabooga
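|
| The core trick is small enough to sketch as a PyTorch forward
| hook (illustrative only: the attachment point and scale are
| made up, and real transformer blocks often return tuples the
| hook would have to unpack):

        import torch

        def steering_hook(direction: torch.Tensor, alpha=5.0):
            # shift the block's output activations along
            # `direction`, scaled by `alpha`
            def hook(module, inputs, output):
                return output + alpha * direction
            return hook

        # hypothetical hookup for a GPT-2-style model:
        # handle = model.transformer.h[6].register_forward_hook(
        #     steering_hook(steer_vec))
        # handle.remove()  # detach when done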
| jprival wrote:
| Reminds me tangentially of the study that found that strong chess
| players were much worse at recalling randomized positions than
| realistic ones:
| https://link.springer.com/content/pdf/10.3758/BF03200937.pdf
|
| I'm wary of inviting the too-easy "it's just like how it
| works for people!" comparison, but the implied context and
| history of a game state seem to be important in processing
| it.
| elif wrote:
| > had Chess-GPT play against Stockfish with this random
| initialization. Its performance plummeted. The larger 50 million
| parameter model's win rate dropped from 70% to 17%.
|
| Beating Stockfish 17% of the time is still incredible for
| any engine.
|
| And being trained on human moves rather than engine moves
| will make cheating so much more annoying to detect...
| jsmith99 wrote:
| Article says: 'Chess-GPT also played chess well, with the best
| model playing at approximately 1500 Elo.'
|
| So I'm guessing this wasn't full-strength Stockfish.
| noSyncCloud wrote:
| The blog post notes that it was SF level 0.
| jxy wrote:
| The previous blog post [0] mentioned they were using
| Stockfish with 0.1 seconds per move.
|
| [0] https://adamkarvonen.github.io/machine_learning/2024/01/03/c...
| goatlover wrote:
| It would be impressive if it were playing Stockfish at a
| grandmaster level, but it looks like it's lower-level. Still
| impressive that it can learn to play decently at some level;
| I doubt it will ever be competitive with top engine levels.
| seraine wrote:
| I updated the article to mention this in the second blog
| post as well. All games were against Stockfish level 0, with
| the nodes searched capped at 100,000 rather than using a
| time-based search limit, to avoid variability across
| different processors.
| afro88 wrote:
| I googled stockfish "level 0" and it came up with no results.
| What do you mean by level 0?
| seraine wrote:
| The Stockfish program can be set to play at strength levels
| 0-20. Estimates of each level's Elo are provided here:
| https://github.com/official-stockfish/Stockfish/commit/a08b8...
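|
| For anyone wanting to reproduce that setup, a minimal sketch
| with python-chess ("stockfish" assumes the binary is on your
| PATH):

        import chess
        import chess.engine

        engine = chess.engine.SimpleEngine.popen_uci("stockfish")
        engine.configure({"Skill Level": 0})  # weakest of 0-20
        board = chess.Board()
        result = engine.play(board,
                             chess.engine.Limit(nodes=100_000))
        print(result.move)
        engine.quit()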
| swyx wrote:
| I think I don't understand what Stockfish does, then. Doesn't
| it play the basically perfect move at each step? How is it
| possible to beat Stockfish 70% of the time with a naive
| GPT-type thing?
| makeset wrote:
| Chess is not a fully "solved" game, so the actual perfect
| move is not generally known. Like any other engine, Stockfish
| just tries its best and is not infallible.
___________________________________________________________________
(page generated 2024-03-26 23:00 UTC)