https://pvdz.ee/weblog/450
Neural Chess
2024-01-21
---------------------------------------------------------------------
I've been trying to train a neural network to play chess. A few weeks
ago I got nerd sniped, nay, inspired by this video titled "AI plays
pokemon". It's a lovely exhibit on reinforcement learning and in
particular a showcase of coding something with the goal of playing a
game without actually telling it to play a game. Emergent gameplay. I
love that stuff.
You can find the video here.
For what it's worth; I have a degree in AI. "Cognitive artificial
intelligence", to be precise. It leans more into philosophy than an
applied study but it did touch on all five corner stones of AI;
philosophy "can it be done?", psychology "how do humans do it?",
linguistics "how do you communicate with it?", computer science "how
do you code it?", and math "... cause numbers". In the end my
knowledge in that area was wide, including knowing how to roll my own
and train a neural network, but not very deep. It did shape my life
philosophy very well and gave me a good rucksack of general
knowledge. Afterwards I was a bit frustrated with not having mastered
a concrete skill. So I picked JS, dug into the spec, and ended up
mastering that completely. I haven't really worked with AI in my
career, aside from a random encounter with a finite domain constraint
solver. So tldr; I know AI, haven't done much with it.
Chess
As you can read in my previous post, I tumbled into the world of
chess recently. And being triggered by the Pokemon visualization I
kind of wanted to try and achieve a similar thing. But in what? Well,
why not chess.
[450_chess_] That turned out to be a bigger challenge than I thought.
For one, you need to codify the rules of chess. That's when you learn
that some of these rules are quite gnarly. In particular, having to
do the "does this move leave you in check" check is more cognitive
work than I initially expected. Funny how that works.
The thing I really wanted to make, inspired by that Pokemon video,
was this big multi-board visualization of the AI playing chess.
(Click here to load a fancy 2 MB multi-board gif)
Full disclosure; that big image is just playing random (but valid)
moves. Ironically, I could have saved myself a couple of hours and,
instead of a neural network, just Math.random() brute forced my way to
this animation. But what's the fun in that :)
Engine
So while I was codifying the rules for chess I tumbled into that
rabbit hole and ended up creating a small engine for it, Yacge (code
on github). Not highly efficient, but sufficient to cover my needs.
After finalizing the engine I could proceed to the network training
part of it.
Neural network
Some preamble on neural networks. I won't bother to explain how they
work in the nitty gritty but suffice it to say "shit is hard".
Ultimately a neural network boils down to a network of layers of
interconnected "nodes", meaning each node in one layer is connected
to each node in the layer before and after it. Each node has a value
called "activation", which is zero-to-one. Each connection has a
"weight", a floating point number which can also be negative. For
output nodes, the activation would be interpreted as some kind of
probability. It is the result of multiplying activations by weights
and summing them, layer by layer, all the way from the input nodes to
the output node. Neural networks generally don't give absolute answers.
With "zero-to-one" I mean a floating point number that is between
zero and one, inclusive.
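To make that a bit more concrete, here's a tiny sketch (plain JS, nothing
to do with tensorflow yet) of how a single node computes its activation
from the layer before it; the sigmoid squash is just one common choice.
Code:
// A single node's activation: a weighted sum of the previous layer's
// activations (plus a bias), squashed into the zero-to-one range.
function sigmoid(x) {
  return 1 / (1 + Math.exp(-x));
}
function nodeActivation(prevActivations, weights, bias) {
  let sum = bias;
  for (let i = 0; i < prevActivations.length; ++i) {
    sum += prevActivations[i] * weights[i];
  }
  return sigmoid(sum); // zero-to-one
}
// Example: three inputs feeding one node
nodeActivation([0.2, 0.9, 0.5], [0.1, -0.3, 0.8], 0.05); // ~0.55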
You will have a set of input nodes that describes your state, like a
game board, image, or prompt text, or whatever. Your input states
ultimately need to be converted to some kind of model of zero-to-one
numbers.
The output is a set of "predictions", again nodes of zero-to-one.
When you train a network you give it an input and the expected
outcome. Then the network adjusts the weights of its connections (and
the biases of its nodes) based on what you tell it to expect. The
actual math on this is fairly straightforward; just a lot of
multiplications. A lot. The "training" aspect boils down to changing
the values such that the activation of the output nodes ends up closer
to the expectation than it was before. Not an absolute match, just a
bit closer.
That's not all. There's a bias, a training speed, the number of
hidden layers, the size of hidden layers, learning strategies, and so
forth. You can probably lose yourself for a month just trying
different tactics and getting a result. All without making any
changes to the actual neural network. Oof.
Modeling
Since a neural network is a mere set of floating point numbers, you
can't just throw a string at it and expect a string back, or feed it
an image and get an integer or word out. No, you have to model your
problem in terms of normalized floating point values and interpret
the output values. Values between zero and one, inclusive.
For example; images as input can be encoded as ones and zeroes (lots
of them). A tic-tac-toe board can be encoded as nine inputs where each
represents whether the square contains an x, an o, or nothing at all.
A classifier over a handful of labels can either have one output node
per possible label, or a single output node whose floating point range
is divided up into sections that each have a meaning.
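As a quick sketch, that tic-tac-toe encoding could look like this (the
1 / 0.5 / 0 values are an arbitrary choice):
Code:
// Encode a tic-tac-toe board as nine zero-to-one inputs; the float space
// of each input is (arbitrarily) divided into three meaningful values.
function encodeBoard(board) { // board: array of 9 cells, each 'x', 'o', or ''
  return board.map(cell => {
    if (cell === 'x') return 1;
    if (cell === 'o') return 0.5;
    return 0; // empty
  });
}
encodeBoard(['x', '', '', '', 'o', '', '', '', 'x']);
// -> [1, 0, 0, 0, 0.5, 0, 0, 0, 1]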
There are many possibilities and it should be obvious that the design
of your model has a big influence on the learning speed of the
network. If a neural network can't figure out patterns because your
model is bad then it won't be learning anything, or worse, the wrong
thing.
Chess model
I tried a few different models for my chess network.
Since Yacge internally represents the game as bitboards (a bitboard is
a 64bit integer where each bit represents the on/off value of one
square of the chess board), displayed below, I figured I could send
the status of these bitboards as inputs.
The engine has 8 of these bitboards; one for the black pieces, one for
the white pieces, and one for each kind of piece. Together, with
bitwise AND operations, you can reconstruct the entire game state. And
so this is what I fed to the network.
Code:
-- game --    --player w--  --opponent--  --kings-----  --queens----  --rooks-----  --bishops---  --knights---  --pawns-----
  abcdefgh      abcdefgh      abcdefgh      abcdefgh      abcdefgh      abcdefgh      abcdefgh      abcdefgh      abcdefgh
8          8  8 00000000 8  8 11111111 8  8 00001000 8  8 00010000 8  8 10000001 8  8 00100100 8  8 01000010 8  8 00000000 8
7          7  7 00000000 7  7 11111111 7  7 00000000 7  7 00000000 7  7 00000000 7  7 00000000 7  7 00000000 7  7 11111111 7
6          6  6 00000000 6  6 00000000 6  6 00000000 6  6 00000000 6  6 00000000 6  6 00000000 6  6 00000000 6  6 00000000 6
5          5  5 00000000 5  5 00000000 5  5 00000000 5  5 00000000 5  5 00000000 5  5 00000000 5  5 00000000 5  5 00000000 5
4          4  4 00000000 4  4 00000000 4  4 00000000 4  4 00000000 4  4 00000000 4  4 00000000 4  4 00000000 4  4 00000000 4
3          3  3 00000000 3  3 00000000 3  3 00000000 3  3 00000000 3  3 00000000 3  3 00000000 3  3 00000000 3  3 00000000 3
2          2  2 11111111 2  2 00000000 2  2 00000000 2  2 00000000 2  2 00000000 2  2 00000000 2  2 00000000 2  2 11111111 2
1          1  1 11111111 1  1 00000000 1  1 00001000 1  1 00010000 1  1 10000001 1  1 00100100 1  1 01000010 1  1 00000000 1
  abcdefgh      abcdefgh      abcdefgh      abcdefgh      abcdefgh      abcdefgh      abcdefgh      abcdefgh      abcdefgh
------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
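Feeding these to the network means flattening each bitboard into 64
zero-or-one inputs, 8*64 = 512 in total, which is what the inputShape:
[8*64] in the model code further down refers to. A minimal sketch of
that flattening, assuming the bitboards are available as BigInts (the
exact Yacge internals differ):
Code:
// Flatten 8 bitboards (BigInts) into 8*64 = 512 inputs of 0 or 1.
// Roughly what the toNormalizedBoardStateInput call used later boils down to.
function bitboardsToInputs(bitboards) { // array of 8 BigInts
  const inputs = [];
  for (const board of bitboards) {
    for (let i = 0; i < 64; ++i) {
      inputs.push(Number((board >> BigInt(i)) & 1n));
    }
  }
  return inputs;
}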
Another way I tried was to convert the board to 64 numbers and divide
the floating point space up into 13 sections, each range representing
a particular colored piece or empty.
As output I expected the network to give me a "from" and "to"
coordinate. I quickly realized that wouldn't work. So instead I
created two sets of 64 numbers where the first set represents the
index of the "from" square and the second the "to" square.
Given those two sets I would pick the index with the highest
activation in each and ask the engine whether that from-to pair was a
legit move in the current state. If not, repeat with the next best
pair until there are none left. If you can find a valid move this way
then make it, otherwise consider the game over.
Consider this table of predictions for the opening move for White. At
this point the activations for the White pawns and knights are
significantly higher than the other squares. But the "to" squares are
still a bit fuzzy, for example a2 is never a valid opening target.
Code:
-- p from ------------------------------------------------------------------    -- p to --------------------------------------------------------------------
     a        b        c        d        e        f        g        h                a        b        c        d        e        f        g        h
8 0.000717 0.000720 0.000606 0.000640 0.000646 0.000686 0.000778 0.000822 8     8 0.000617 0.000584 0.000477 0.000654 0.000707 0.000597 0.000623 0.000688 8
7 0.000612 0.000639 0.000566 0.000448 0.000800 0.000533 0.000553 0.000991 7     7 0.000623 0.000878 0.000858 0.000980 0.000844 0.001139 0.000813 0.001283 7
6 0.001180 0.000864 0.000833 0.001359 0.000885 0.000957 0.002028 0.001113 6     6 0.000741 0.001303 0.001018 0.001427 0.003672 0.003032 0.003038 0.004788 6
5 0.002021 0.004815 0.002025 0.003439 0.004112 0.003263 0.006398 0.002641 5     5 0.001858 0.004551 0.008008 0.004045 0.004396 0.005853 0.004450 0.008081 5
4 0.007167 0.006546 0.007606 0.007078 0.013818 0.012468 0.009795 0.010014 4     4 0.006896 0.023927 0.022808 0.029172 0.024975 0.052587 0.016379 0.036477 4
3 0.017303 0.009698 0.014770 0.010137 0.008952 0.009595 0.008527 0.013713 3     3 0.032853 0.062769 0.034909 0.051695 0.036063 0.052277 0.051271 0.033579 3
2 0.088726 0.094049 0.072685 0.088926 0.233186 0.071909 0.101385 0.090959 2     2 0.071619 0.008169 0.008953 0.008444 0.022026 0.039267 0.007543 0.010122 2
1 0.027890 0.077474 0.030337 0.039759 0.043032 0.042445 0.117007 0.023538 1     1 0.009813 0.005936 0.014396 0.008808 0.012505 0.014423 0.013239 0.018353 1
     a        b        c        d        e        f        g        h                a        b        c        d        e        f        g        h
------------------------------------------------------------------------------     ------------------------------------------------------------------------------
Another output model was to create 64x64 outputs; one output for each
combination of "from" and "to" square. The rest is pretty similar
insofar as the pair with the highest activation value would be
attempted first, and so on until a legal move was found. I would hope
that this makes it easier for a neural network to bootstrap the
problem since all the outputs have the same semantics, rather than
"half of them mean x and half of them mean y". But it obviously takes
much longer to train a network to get to a reliable set of outputs.
And that's still hampered by the training issues outlined below. Not
to mention the visualization.
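The bookkeeping for that variant is trivial; every from-to pair gets
its own output index (a sketch):
Code:
// Map a from/to square pair onto one of 64*64 = 4096 output indexes, and back.
const toOutputIndex = (from, to) => from * 64 + to;          // 0 .. 4095
const fromOutputIndex = (i) => [Math.floor(i / 64), i % 64]; // [from, to]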
Training
In general a network is trained by giving it examples. The more
examples the better the training. Beyond the model itself you have
overfitting, underfitting, training bias, and a bunch of other things
to worry about.
You can train on an existing set of inputs and outputs, like how you
can train a network on labeled images and tell it that when you feed
the cat.png image into it, you expect the outcome to be the "cat"
label. Given enough input samples this should train your network to
classify images.
You can also train your network by giving it some inputs, evaluating
the prediction, and determining a reward for that prediction. Did the
predictions lead to a win, a loss, an advantage, a desired pattern?
In chess, you could give rewards for making progress (pawn moves),
for capturing opponent pieces, or for winning the game.
Training chess
At first I thought I would download a chess database and train a
model using a bunch of legit games.
I found this massive database (Caissabase) with about 4M chess games
at master level, so I wanted to use that as a starting point. This
was before I learned how PGN and chess notation actually work.
Spoiler: you need a chess engine, because the "standard algebraic
notation" used in chess does not tell you which square a piece moved
from, so you have to disambiguate the move against all valid moves in
the current state of the game, for which you need an entire game
engine. You could pre-process the moves to get these coordinates, for
which you still need an engine, but then only once and not at
training time.
This led me to sidestep into Yacge as a necessary cornerstone of this
project.
PGN
Actually, my first thought was to just feed these pgn strings to a
network and create an LLM-esque sort of network. But I wasn't
convinced that I was going to get that anywhere within reasonable
time.
Then I realized that the thing I find most beautiful about this
project was a self-learning self-playing agent with minimal external
input. So I went with that and didn't bother with the database. Maybe
I should revisit that idea.
Self-playing
The approach to learning this way is called Reinforcement Learning
(RL). If you want to learn more about that I'd suggest you read this
article instead.
Tldr; given a neural network, you give it an input state like the
state of your game. Then based on that it gives you a prediction, the
move to make. You apply this move and apply a heuristic that tells
you how much you liked that move. If it won you the game you like it
a lot but if it lost you the game you hate the move. You then teach
the network by telling it how much you liked the move it predicted.
This update won't immediately force the outputs to match your
expectations. That would erase the outputs for other inputs. Instead
the model is updated ever so slightly such that the prediction
becomes closer to what you were expecting, hopefully without
destroying the existing other input-output mappings.
Chess
At first I tried to create a pure self-learning network. I would take
the moves with the highest activation, throw them against the engine,
and see what sticks. If a valid move was found I would train the
network with some reward. Otherwise I'd try to teach the network not
to predict that move by giving it the worst reward.
[450_chess_] While that worked to some degree, in the end it kept
predicting the same moves, continuously ending in a threefold
repetition draw. It kind of makes sense; a minimal change of the
inputs is prone to lead to a very similar output. So if two pieces go
back and forth then sooner or later they'll trigger the threefold
repetition rule. And overall, that won't end up generating a good game.
[450_chess_] I tried to implement logic that would prevent a move if
it would trigger a threefold draw but that didn't help much. I went
one step further and disallowed a move that would end in the same
game state, and now I was left with extremely long games that
triggered the 50-move draw limit because kings kept being moved with
no real goal. Greeaat.
In the end I wasn't really able to solve this reliably. The game
would always want to repeat itself, or end in endless games. This was
a problem because it would not hit the desired end states frequently
enough to get the reward for wanting more of it. And if a network
never learns to want the good stuff then it'll never favor it.
Interesting paradox.
[450_chess_] Other things I tried include adjusting the network size,
number of nodes, number of hidden layers, learning speed, tweaking
the model, tweaking the reward function, playing with activation
functions, etc. Sadly, I wasn't able to get a meaningful result.
Now don't get me wrong, sometimes it does get a win. But these are
more of a fluke than consistent outcomes.
Tensorflow.JS implementation
For my experiment I used tensorflow.js in node.js and a simple web
interface together with some CLI debugging output.
I used chatGPT to help me set up some of the tensorflow code. It was
surprisingly helpful in that. Although, if a certain prompt didn't
give me what I wanted, it was hard to get it to produce a good answer
anyways. All in all, most of the time it was more useful than google
searches for similar questions. That's refreshing :)
Setup
The installation is straightforward
Code:
npm install @tensorflow/tfjs-node-gpu
Cuda doesn't need to work on your machine, just ignore the warnings.
It didn't initially work for me. To get it to work I had to install a
few things:
Code:
sudo apt-get install nvidia-cuda-toolkit nvidia-cudnn
After this, nvidia-smi and nvcc would work. But on my machine, CUDA
would still not be recognized by Tensorflow.js and I had to do the
following to fix that:
Code:
for a in /sys/bus/pci/devices/*; do echo 0 | sudo tee -a $a/numa_node; done
This is necessary after each boot. With that done, CUDA would work for
me and during training I could see the GPU being utilized a bit
through nvidia-smi. It wasn't massive for me, dropping an iteration
from 150ms to 110ms. Utilization was about 5-15%. But that's probably
because my model and data set are tiny?
Code:
$ node train.js
...
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 5236 MB memory: -> device: 0, name: NVIDIA GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5
...
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05    Driver Version: 525.147.05    CUDA Version: 12.0   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   51C    P2    44W / 175W |   7110MiB /  8192MiB |      8%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
...
Anyways, with all of that said and done, I could import tensorflow
like this:
Code:
import * as tfd from '@tensorflow/tfjs-node-gpu';
const tf = tfd.default;
It's common to namespace the import into a tf variable, but somehow
the star import wouldn't let me so that's why I'm reassigning it.
Didn't care enough to investigate that further.
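If you're not locked into ES modules, the plain CommonJS form
sidesteps it entirely:
Code:
// CommonJS alternative; no default-export dance needed.
const tf = require('@tensorflow/tfjs-node-gpu');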
Creating a model
This code creates the model, or, if a model was stored before, loads
that one instead. In my limited experience here, Tensorflow is able
to restore my model accurately, which is great.
Code:
// Note: fs is node's fs module, Buffer is node's global Buffer.
async function createModel() {
  let model;
  let history = [];
  const existingModel = fs.existsSync('./model.json') && fs.readFileSync('./model.json', 'utf8');
  if (existingModel) {
    console.log('Using existing model');
    const state = JSON.parse(existingModel);
    const weightData = new Uint8Array(Buffer.from(state.weightData, 'base64')).buffer;
    model = await tf.loadLayersModel(tf.io.fromMemory(state.modelTopology, state.weightSpecs, weightData));
    history = JSON.parse(fs.readFileSync('./history.json', 'utf8'));
  } else {
    console.log('Compiling fresh model');
    model = tf.sequential();
    model.add(tf.layers.dense({ units: 50, inputShape: [8*64], activation: 'sigmoid', kernelInitializer: tf.initializers.glorotNormal() }));
    //model.add(tf.layers.dense({ units: 50 }));
    model.add(tf.layers.dense({ units: OUTPUT_COUNT, activation: 'sigmoid' }));
  }
  const optimizer = tf.train.adam(LEARNING_RATE);
  model.compile({loss: 'meanSquaredError', optimizer, metrics: ['accuracy']});
  model.summary();
  return {model: {model, optimizer}, history};
}
Some notes on that function:
The activation "sigmoid" means the values are individual normalized
floats, valued zero to one. You will also commonly see "softmax" in
some examples, this means the sum of all predicted values sums to 1.
There are more values like that but you'll have to search for them
yourself, or ask chatGPT.
There are different loss functions and optimizer functions which
determine how training is applied. I used "adam" and
"meanSquaredError". I feel like I would have come further by
experimenting more with these. But it also feels like I could lose
myself for a month just by doing so.
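For contrast with the sigmoid + meanSquaredError combo I used, a
softmax head with a matching loss would look something like this (a
sketch only):
Code:
// Sketch of the alternative: outputs that sum to 1, with a matching loss.
// This would replace the output layer and the compile call in createModel() above.
const model = tf.sequential();
model.add(tf.layers.dense({ units: 50, inputShape: [8*64], activation: 'sigmoid' }));
model.add(tf.layers.dense({ units: OUTPUT_COUNT, activation: 'softmax' }));
model.compile({ loss: 'categoricalCrossentropy', optimizer: tf.train.adam(LEARNING_RATE), metrics: ['accuracy'] });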
I restore the history of moves etc as well but obviously that's not
required to make the network load and function.
Episodes
Each game we play is called an "episode".
On a high level an episode does the following (a rough sketch of this
loop follows the list);
- Create a fresh chess game.
- Repeat until the game ends: feed the current game state into the
network and pick the best move from the resulting predictions. Make
and remember the best move that is legal and repeat. Stop if the game
ends or if there are no valid moves left.
- After the game ends, look at the moves made. Each move should have
some meta data available, whatever you need for the reward function.
For each move determine the reward. This could be based on the outcome
of the game, evaluation of the position after the move, material
advantage after the move, or as simple as wanting to move the game
forward (rewarding a pawn move).
- Then for each move made, add an entry to the inputs with the board
state before that move, and add an entry to the outputs where the
targeted "from" and "to" index is somehow updated with the reward
value for that move.
- Train the network on the resulting set.
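Here is that rough sketch. Apart from parseFen and
toNormalizedBoardStateInput, the helper names are made-up stand-ins
for the pieces shown in the rest of this section:
Code:
// Rough sketch of one episode; pickBestLegalMove, applyMove, isGameOver,
// determineRewards, and trainOnEpisode are stand-ins for the code below.
async function playEpisode(model) {
  let G = parseFen(FEN_NEW_GAME);
  const history = [];
  while (!isGameOver(G)) {
    const input = toNormalizedBoardStateInput(G);
    const move = pickBestLegalMove(model, input, G); // walk predictions, ask the engine
    if (!move) break; // no legal move left: checkmate or stalemate
    applyMove(G, move); // also records capture/piece meta data
    history.push({ input, move });
  }
  // One input board and one reward-adjusted output vector per move made
  const { inputBoards, outputSets } = determineRewards(history, G);
  await trainOnEpisode(model, inputBoards, outputSets); // model.fit, shown at the end
}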
First create a new game. Using Yacge that looks like this:
Code:
let G = parseFen(FEN_NEW_GAME);
Now repeat the logic until the game ends. The game either ends with a
draw or because there are no more moves. If there are no more valid
moves left the game ended either in a stalemate (also a draw) or a
checkmate. The no-more-moves case is unfortunate because it still
means testing 64*64=4k move pairs before being able to assert that
conclusion. But it's either that or over-testing for a check- or
stalemate after every move. I opted for this more implied outcome
(most of those 4k tests are cheap since you have at most 16 pieces
rather than 64 and the "is this square your piece" test is cheap, but
still).
In my code I convert the game state to an input model and then send
that to tensorflow to get a result back:
Code:
const normalizedBoardState = toNormalizedBoardStateInput(G);
const rawPredictions = tf.tidy(() => {
  const tensor = tf.tensor([normalizedBoardState], [1, 64*8], 'int32');
  const p = model.predict(tensor);
  return p.arraySync()[0]; // Only has one
});
The .tidy() part is there because these "tensors" (tensorflow's atomic
data objects) are retained in memory until told otherwise. The tidy
callback will destroy any of these objects created during the callback
once it returns. Otherwise you'd have to call .dispose() on them
yourself, which is also fine.
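Without tidy, the manual version of the same snippet would look like
this:
Code:
// Manual variant: dispose of every tensor you created once you're done with it.
const tensor = tf.tensor([normalizedBoardState], [1, 64*8], 'int32');
const p = model.predict(tensor);
const rawPredictions = p.arraySync()[0];
tensor.dispose();
p.dispose();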
So now you have a "prediction"; in my case a set of 128 floats. In my
code I interpret this as two sets of 64 floats. And each of these
maps to a square on the board. The first set represents the square
the move from, the second set represents the square to move to.
Together they should provide me with a valid move. This looks
somewhat like this:
Code:
--p from, best = b1,a1,c3,e1,h1,d1,b2,e2,g1,c1------------------------------     --p to, best = c1,b1,d1,g3,e3,f3,h3,f2,a2,d3--------------------------------
     a        b        c        d        e        f        g        h                a        b        c        d        e        f        g        h
8 0.000466 0.000474 0.000458 0.000480 0.000474 0.000484 0.000449 0.000467 8     8 0.000458 0.000457 0.000469 0.000473 0.000457 0.000484 0.000454 0.000449 8
7 0.000480 0.000464 0.000461 0.000465 0.000472 0.000450 0.000477 0.000465 7     7 0.000474 0.000476 0.000460 0.000462 0.000494 0.000505 0.000479 0.000462 7
6 0.000462 0.000464 0.000465 0.000466 0.000455 0.000462 0.000481 0.000464 6     6 0.000453 0.000458 0.000464 0.000452 0.000486 0.000456 0.000446 0.000462 6
5 0.000457 0.000535 0.000486 0.000945 0.000471 0.000453 0.000475 0.000478 5     5 0.000477 0.000492 0.000505 0.000514 0.000529 0.000458 0.000458 0.000439 5
4 0.001363 0.000537 0.000596 0.000444 0.000469 0.000468 0.000469 0.000484 4     4 0.000490 0.000485 0.000494 0.000505 0.000484 0.000498 0.000480 0.000478 4
3 0.000900 0.001112 0.011174 0.001514 0.001808 0.002027 0.001140 0.001363 3     3 0.000466 0.000521 0.000530 0.000550 0.000800 0.000768 0.000928 0.000606 3
2 0.008865 0.009094 0.008909 0.008958 0.009090 0.008887 0.008944 0.008994 2     2 0.000553 0.000448 0.000468 0.000470 0.000513 0.000557 0.000454 0.000475 2
1 0.018268 0.019505 0.009009 0.009096 0.009101 0.008827 0.009093 0.009096 1     1 0.000481 0.016113 0.018434 0.000995 0.000470 0.000459 0.000531 0.000453 1
     a        b        c        d        e        f        g        h                a        b        c        d        e        f        g        h
------------------------------------------------------------------------------     ------------------------------------------------------------------------------
In this example, the agent would start by trying to move from b1 to
c1, b1 to b1 (:facepalm:), b1 to d1, b1 to g3, etc, until finding
that b1 to a3 or c3 is a valid (knight) move.
I think, ultimately, a model with 4k outputs (one for each combination
of from-to square) would be less confusing to the network. But it's
also much more time consuming to train that, and my attempt didn't
seem to lead to anything. And it's a bit hard to visualize in a CLI :)
Having these two sets, you sort them by activation, map each index to
its square, and then try to evaluate moves in descending activation
order.
For example:
Code:
[0.999638, 0.983368, 0.999746, 0.999099, ... 0.993414, 0.989944, 0.954925, 0.943932, ...]
// Breaks down into
[0.999638, 0.983368, 0.999746, 0.999099, ...]
[0.993414, 0.989944, 0.954925, 0.943932, ...]
And then we map each to an [activation, index] pair, and
.sort(([a], [b]) => b - a) them.
These arrays are walked in order. The idea being that the highest
activation for the first array and second array should yield the most
likely correct move. Where it falls apart, I think, is assuming the
network can even predict this association this way. Although training
has shown some positive results in this. Then again, those may just
as well have been coincidence and confirmation bias on my part.
For each generated pair, I would first test whether the "from" square
contains a piece of the current player. Likewise, I would test whether
the "to" square does not contain a piece of the current player
(because you can never move your piece to a square containing your own
piece). Only after passing those two simple tests do we apply the much
more expensive canMove check, which confirms that the move is valid in
chess, valid in the context of the current game, and also asserts that
the king is not left in check after making this move, since that would
make the move illegal as well. Once it passes all those tests, the
move is accepted, applied, and stored for training later.
If a certain "from" square had no valid moves at all then the
iteration simply goes to the next "from" square. If none of the
"from" cells have a valid move then we have exhausted the possible
from-to pairs and the game must have ended with the player being
unable to make a move without leaving its king in check. This means a
checkmate if it starts checked or a stalemate if it was not checked
at the start of the turn.
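Putting those checks together, the selection loop looks roughly like
this (a sketch; isOwnPiece is a stand-in for the cheap bitboard test,
canMove is the engine check mentioned above):
Code:
// Walk the from/to candidates in descending activation order and return the
// first pair the engine accepts.
function pickMove(G, fromSorted, toSorted) { // arrays of [activation, squareIndex], sorted descending
  for (const [, from] of fromSorted) {
    if (!isOwnPiece(G, from)) continue; // cheap: must move one of your own pieces
    for (const [, to] of toSorted) {
      if (isOwnPiece(G, to)) continue; // cheap: can't move onto your own piece
      if (canMove(G, from, to)) return { from, to }; // expensive: legal and king not left in check
    }
  }
  return undefined; // no legal move at all: checkmate or stalemate
}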
Alternatively, there are some other reasons why the game may end, but
those only apply after making a move. For example, threefold
repetition, the fifty-move limit, and insufficient material all lead
to draws.
The end result of an "episode" is a list of moves and some meta data
for those moves. This meta data includes piece and capture
information, board state information, and also network efficiency
details.
Rewards heuristic
Once a game has been generated we have to determine the rewards based
on the outcome of this game.
There are so many ways to generate rewards that it's actually
annoying. For example:
- win / lose
- capturing / losing material
- pawn promotion
- game progression (preventing the 50-move draw)
- network efficiency
- game duration (move count)
- invalid moves
[450_chess_] And what are the actual values for this reward?
Ultimately the goal is to reward the network for things you want to
see more of. If you want the game to progress, you reward pawn moves
(they can't be reversed). However, certain rewards can introduce a
massive bias. For example, a pawn move is most likely at the start of
a game and less likely as it progresses. Rewarding pawn moves means
both players start with a pawn storm like there's no tomorrow. Is that
going to teach a network how to play chess?
In the end I wasn't able to come up with a reward heuristic that
reliably taught the network to start with a pawn or knight move,
ignoring other pieces. And this kind of makes sense; how would a
network ever learn a good move when it only flukes into one every once
in a while, making a majority of bad moves otherwise? The good
learning step is a drop in the ocean of bad learning steps. So either
your bad move reward has to drop significantly to make the good moves
more effective, or you have to train the good moves more frequently to
"engrain" the good move reward more thoroughly. Neither approach
seemed to work for me.
Of course rewards aren't just a matter of coming up with the right
value. It's also about the correct loss function, the correct amount
of training on the result, applying a fixed/linear/gradual reward
(later moves are more likely to contribute to a game outcome than
opening moves), learning speed, neural network composition, etc etc.
So many different parameters to twist and turn. And of course, for a
pattern as complex as chess, it may take hours or even days before
the desired patterns emerge. How do you know you're on the right path
when your feedback loop is that long?
Well, it's frustrating. And I don't have the time for it,
unfortunately.
This is roughly what my heuristic code looks like;
Code:
// Turn each move of the finished game into a reward and one training sample.
history.forEach((step, i) => {
  if (i % 2 === 1) return; // Train only on white moves?
  // The per-move meta data recorded during the episode; winner, pointsWhite,
  // and pointsBlack describe the overall game outcome and live outside this loop.
  const { piece, capture, promo, fromi, toi, ownCells, predictions, boardStateForMove } = step;
  let r = 0;
  //if (mated) r += 1; // Prefer a proper ending to a 50-move / threefold draw
  //if (piece === 'P') r += 1; // Promote pawn moves as they prevent certain draws
  if (capture) r += {Q: 8, R: 5, N: 3, B: 3, P: 1}[capture] || 0; // Promote captures
  //if (promo) r += 1; // Promote pawn promotions
  if (winner === 'white') r = 100; // Reward wins
  else if (winner !== 'black' && pointsWhite > pointsBlack) r = 100; // Consider a material advantage a win too, in case of a draw
  // For each prediction of this particular move, map the square index to its reward
  const moveReward = predictions.map((p, idx) => {
    if (isFromOrTo(idx, fromi, toi)) return r; // Apply reward to picked square pair
    if (idx <= 63 && ownCells[idx]) return 1; // Prevent own pieces from going to zero (even if they have no moves)
    return 0; // This could also return p, leaving the prediction alone
  });
  // Generate one set of input/output to use with training. The input and output sets must be of equal size.
  inputBoards.push(boardStateForMove);
  outputSets.push(moveReward);
});
So we magically come up with some kind of reward. I tried a few things
here but ultimately I'm not convinced I did this right. I tried giving
a range of values like above, but I also tried giving normalized
values (between zero and one). I didn't see a lot of difference
between either approach.
Here's an example of what a reward could look like for one opening
knight move:
Code:
-- new from: N b1 ----------------------------------   -- new to: N c3 ------------------------------------
a b c d e f g h a b c d e f g h
8 0 0 0 0 0 0 0 0 8 8 0 0 0 0 0 0 0 0 8
7 0 0 0 0 0 0 0 0 7 7 0 0 0 0 0 0 0 0 7
6 0 0 0 0 0 0 0 0 6 6 0 0 0 0 0 0 0 0 6
5 0 0 0 0 0 0 0 0 5 5 0 0 0 0 0 0 0 0 5
4 0 0 0 0 0 0 0 0 4 4 0 0 0 0 0 0 0 0 4
3 0 0 0 0 0 0 0 0 3 3 0 0 1 0 0 0 0 0 3
2 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 2 2 0 0 0 0 0 0 0 0 2
1 0.01 1 0.01 0.01 0.01 0.01 0.01 0.01 1 1 0 0 0 0 0 0 0 0 1
a b c d e f g h a b c d e f g h
----------------------------------------------------   ----------------------------------------------------
Where own "from" squares move closer to 0.01, the valid "from" square
goes to 1, the valid "to" square goes to 1, and the other squares go
to 0.
Applying the reward
The last step of the episode is to apply the reward to the network.
This is the teaching moment, yay.
Code:
// Convert data to tensor objects
const inputBoardTensor = tf.tensor2d(inputBoards);
const expectedOutput = tf.tensor2d(outputSets);
// Shuffle should improve training quality.
// Batch size doesn't matter for our problem.
// Epochs is an arbitrary number of times the samples are trained.
await model.fit(inputBoardTensor, expectedOutput, {epochs: 20, verbose: 0, batchSize: 20000, shuffle: true});
You can also teach it one-by-one but it's just so much slower.
There's probably a lot of overhead to be paid for preparing the
training data. So batching the moves together and running them
multiple times is just cheaper than doing it move-by-move.
CLI output
Here's what the CLI shows after each episode. It'll print the game
state, prediction, and rewards of the Nth move, in this case the
opening move.
It also shows which cells have been asserted to be invalid, like when
there's no piece or when the engine said so. The final line shows a
summary that includes how the game ended, the material points for
each side, the time taken to run the engine and time taken to train,
and how often the prediction was wrong before getting a valid move.
This was a random sample while training it with a pawn storm reward.
Code:
------- move 1 ---------------------------------------------------------------------------------------------------------------------------------
--game w u--  --player w--  --opponent--  --kings-----  --queens----  --rooks-----  --bishops---  --knights---  --pawns-----
  abcdefgh      abcdefgh      abcdefgh      abcdefgh      abcdefgh      abcdefgh      abcdefgh      abcdefgh      abcdefgh
8          8  8 00000000 8  8 11111111 8  8 00001000 8  8 00010000 8  8 10000001 8  8 00100100 8  8 01000010 8  8 00000000 8
7          7  7 00000000 7  7 11111111 7  7 00000000 7  7 00000000 7  7 00000000 7  7 00000000 7  7 00000000 7  7 11111111 7
6          6  6 00000000 6  6 00000000 6  6 00000000 6  6 00000000 6  6 00000000 6  6 00000000 6  6 00000000 6  6 00000000 6
5          5  5 00000000 5  5 00000000 5  5 00000000 5  5 00000000 5  5 00000000 5  5 00000000 5  5 00000000 5  5 00000000 5
4          4  4 00000000 4  4 00000000 4  4 00000000 4  4 00000000 4  4 00000000 4  4 00000000 4  4 00000000 4  4 00000000 4
3          3  3 00000000 3  3 00000000 3  3 00000000 3  3 00000000 3  3 00000000 3  3 00000000 3  3 00000000 3  3 00000000 3
2          2  2 11111111 2  2 00000000 2  2 00000000 2  2 00000000 2  2 00000000 2  2 00000000 2  2 00000000 2  2 11111111 2
1          1  1 11111111 1  1 00000000 1  1 00001000 1  1 00010000 1  1 10000001 1  1 00100100 1  1 01000010 1  1 00000000 1
  abcdefgh      abcdefgh      abcdefgh      abcdefgh      abcdefgh      abcdefgh      abcdefgh      abcdefgh      abcdefgh
------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
--p from, best = a2,d2,b2,h3,c2,e2,d3,f1,b3,g2--------------------------------------
     a        b        c        d        e        f        g        h
8 _.__2195 _.__1443 _.__1997 _.__1731 _.__1862 _.__1614 _.__1113 _.__1429 8
7 _.__1381 _.__1736 _.__2202 _.__1505 _.__1523 _.__2002 _.__1815 _.__1794 7
6 _.__1631 _.__2232 _.__1279 _.__1614 _.__1673 _.__1571 _.__1827 _.__1952 6
5 _.__1369 _.__1592 _.__1795 _.__1014 _.__1303 _.__1655 _.__2127 _.__1591 5
4 _._11357 _.__6727 _.__1848 _._14977 _.__2664 _.__2028 _.__1814 _.__8190 4
3 _._20234 _._23869 _.__8546 _._29754 _._13037 _.__8765 _.__6992 _._37838 3
2 _.161353 _._59909 _._35909 _._76755 _._32026 _._14384 _._23037 _._21860 2
1 _._10155 _._16474 _._16182 _._18081 _._22670 _._24611 _._12363 _._22117 1
     a        b        c        d        e        f        g        h
------------------------------------------------------------------------------------
--p to, best = b3,c3,e3,a2,d3,f3,b4,h3,e4,b2----------------------------------------
     a        b        c        d        e        f        g        h
8 _.__2027 _.__1786 _.__1499 _.__2023 _.__1303 _.___998 _.__1396 _.__2029 8
7 _.__1200 _.__2022 _.__1542 _.__1290 _.__1431 _.__1646 _.__1465 _.__1346 7
6 _.__1351 _.__1454 _.__1818 _.__1176 _.__1810 _.__1528 _.__2420 _.__1155 6
5 _.__1970 _.__1765 _.__2400 _.__1571 _.__3485 _.__1760 _.__1707 _.__1645 5
4 _.__1411 _._19643 _._11618 _.__2896 _._18135 _.__5058 _.__2284 _.__2911 4
3 _._16914 _._80085 _._56358 _._30357 _._51646 _._30339 _._12861 _._19281 3
2 _._33478 _._17652 _.__8320 _.__8741 _.__8156 _.__9200 _.__1591 _.__1473 2
1 _.__3255 _.__2344 _.__1724 _.__2002 _.__1199 _.__1449 _.__1345 _.__1991 1
     a        b        c        d        e        f        g        h
------------------------------------------------------------------------------------
--game w u--  --player w--  --opponent--  --kings-----  --queens----  --rooks-----  --bishops---  --knights---  --pawns-----  --action-f--  --action-t--
  abcdefgh      abcdefgh      abcdefgh      abcdefgh      abcdefgh      abcdefgh      abcdefgh      abcdefgh      abcdefgh      abcdefgh      abcdefgh
8          8  8 00000000 8  8 11111111 8  8 00001000 8  8 00010000 8  8 10000001 8  8 00100100 8  8 01000010 8  8 00000000 8  8 DDDDDDDD 8  8          8
7          7  7 00000000 7  7 11111111 7  7 00000000 7  7 00000000 7  7 00000000 7  7 00000000 7  7 00000000 7  7 11111111 7  7 DDDDDDDD 7  7          7
6          6  6 00000000 6  6 00000000 6  6 00000000 6  6 00000000 6  6 00000000 6  6 00000000 6  6 00000000 6  6 00000000 6  6 DDDDDDDD 6  6          6
5          5  5 00000000 5  5 00000000 5  5 00000000 5  5 00000000 5  5 00000000 5  5 00000000 5  5 00000000 5  5 00000000 5  5 DDDDDDDD 5  5          5
4          4  4 00000000 4  4 00000000 4  4 00000000 4  4 00000000 4  4 00000000 4  4 00000000 4  4 00000000 4  4 DDDDDDDD 4  4 c c      4
3          3  3 10000000 3  3 00000000 3  3 00000000 3  3 00000000 3  3 00000000 3  3 00000000 3  3 00000000 3  3 10000000 3  3 DDDDDDDD 3  3 !ccccc c 3
2          2  2 01111111 2  2 00000000 2  2 00000000 2  2 00000000 2  2 00000000 2  2 00000000 2  2 00000000 2  2 01111111 2  2 T        2  2 ssSSSSSS 2
1          1  1 11111111 1  1 00000000 1  1 00001000 1  1 00010000 1  1 10000001 1  1 00100100 1  1 01000010 1  1 00000000 1  1 T        1  1 SSSSSSSS 1
  abcdefgh      abcdefgh      abcdefgh      abcdefgh      abcdefgh      abcdefgh      abcdefgh      abcdefgh      abcdefgh      abcdefgh      abcdefgh
------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
--m1, new from, actual: P a2, trained: a2, better: neither -------------------------
  a b c d e f g h
8 0 0 0 0 0 0 0 0 8
7 0 0 0 0 0 0 0 0 7
6 0 0 0 0 0 0 0 0 6
5 0 0 0 0 0 0 0 0 5
4 0 0 0 0 0 0 0 0 4
3 0 0 0 0 0 0 0 0 3
2 _.1 _._1 _._1 _._1 _._1 _._1 _._1 _._1 2
1 _._1 _._1 _._1 _._1 _._1 _._1 _._1 _._1 1
  a b c d e f g h
------------------------------------------------------------------------------------
--m1, new to, actual: P a3, trained: P a3, -----------------------------------------
  a b c d e f g h
8 0 0 0 0 0 0 0 0 8
7 0 0 0 0 0 0 0 0 7
6 0 0 0 0 0 0 0 0 6
5 0 0 0 0 0 0 0 0 5
4 0 0 0 0 0 0 0 0 4
3 0 _.1 0 0 0 0 0 0 3
2 0 0 0 0 0 0 0 0 2
1 0 0 0 0 0 0 0 0 1
  a b c d e f g h
------------------------------------------------------------------------------------
Games: 5430 Moves: 50 Reason: 100x 0 0 . 59ms 77ms hundred, 546 canMove checks (10.92/move): 9/(1/11) 8/(1/9) 6/(2/6) 6/(2/6) 13/(2/19) 1/(3/1) 3/(4/3) 2/(3/3) 3/(6/4) 13/(4/19) ...
It looks more fluid when the numbers update after each episode but
that's gonna be a bit hard to do here :) And expensive as a gif/
movie, so this'll have to do.
Gifs
The gifs in this blog were simply created through gif.js (years old,
still works fine in the browser out of the box) and optimized through
ezgif.com which cut them in half or more.
The code itself is short: given an array of games represented as
string snapshots of the board, draw each snapshot onto a canvas, add
the canvas as a frame, and let gif.js render the result into a blob
you can drop into the html.
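A minimal sketch of that approach, assuming gif.js's standard
constructor/addFrame/render API and a hypothetical drawBoard helper
that paints one board string onto the canvas:
Code:
// Render each board snapshot onto a canvas and feed it to gif.js.
// drawBoard(ctx, snapshot) is a stand-in that paints one board string.
// Note: gif.js also needs its worker script (gif.worker.js) to be served.
const gif = new GIF({ workers: 2, quality: 10, width: 320, height: 320 });
const canvas = document.createElement('canvas');
canvas.width = 320;
canvas.height = 320;
const ctx = canvas.getContext('2d');
for (const snapshot of game) { // game: array of board snapshot strings
  drawBoard(ctx, snapshot);
  gif.addFrame(ctx, { copy: true, delay: 100 }); // copy the current canvas pixels
}
gif.on('finished', (blob) => {
  const img = document.createElement('img');
  img.src = URL.createObjectURL(blob);
  document.body.appendChild(img);
});
gif.render();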
Conclusion
I think I've rambled on long enough here.
I got inspired by somebody teaching a neural network to play Pokemon
as emergent behavior. I wanted to do the same for chess. Ultimately, I
didn't have enough time to bring that over a meaningful finish line. I
had my fun and now I need to move on from it, since I never intended
for anything serious to manifest here.
Kudos if you got this far :)