https://karpathy.ai/zero-to-hero.html
Neural Networks: Zero to Hero
A course by Andrej Karpathy on building neural networks, from
scratch, in code.
We start with the basics of backpropagation and build up to modern
deep neural networks, like GPT. In my opinion language models are an
excellent place to learn deep learning, even if your intention is to
eventually go to other areas like computer vision, because most of
what you learn will be immediately transferable. This is why we dive
into and focus on language models.
Prerequisites: solid programming (Python), intro-level math (e.g.
derivatives, Gaussians).
Learning is easier with others; come say hi in our Discord channel:
[3zy8kqD9Cp]
Syllabus
2h25m
The spelled-out intro to neural networks and backpropagation:
building micrograd
This is the most step-by-step spelled-out explanation of
backpropagation and training of neural networks. It only assumes
basic knowledge of Python and a vague recollection of calculus from
high school.
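As a taste of what the video builds, here is a minimal sketch of a scalar autograd engine in the spirit of micrograd; the Value class and its tiny API here are illustrative, not the full implementation from the video.

import math

class Value:
    """A scalar that remembers how it was produced, so gradients can be
    propagated backwards with the chain rule."""
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad              # d(a+b)/da = 1
            other.grad += out.grad             # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad  # d(a*b)/da = b
            other.grad += self.data * out.grad  # d(a*b)/db = a
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1 - t ** 2) * out.grad  # d(tanh x)/dx = 1 - tanh(x)^2
        out._backward = _backward
        return out

    def backward(self):
        # visit nodes in topological order, then apply the chain rule in reverse
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()

# a single neuron: out = tanh(w*x + b), then gradients of out w.r.t. the inputs
x, w, b = Value(2.0), Value(-3.0), Value(1.0)
out = (w * x + b).tanh()
out.backward()
print(out.data, x.grad, w.grad)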
1h57m
The spelled-out intro to language modeling: building makemore
We implement a bigram character-level language model, which we will
further complexify in follow-up videos into a modern Transformer
language model, like GPT. In this video, the focus is on (1)
introducing torch.Tensor and its subtleties and use in efficiently
evaluating neural networks and (2) the overall framework of language
modeling that includes model training, sampling, and the evaluation
of a loss (e.g. the negative log likelihood for classification).
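For a sense of that framework, here is a minimal sketch of a counting-based bigram character model, its average negative log likelihood, and sampling from it; the words list is placeholder data standing in for the names dataset used in the video.

import torch

words = ["emma", "olivia", "ava"]   # placeholder data for illustration
chars = sorted(set("".join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi['.'] = 0                        # '.' marks the start/end of a word
itos = {i: s for s, i in stoi.items()}
V = len(stoi)

# count bigrams
N = torch.zeros((V, V), dtype=torch.int32)
for w in words:
    cs = ['.'] + list(w) + ['.']
    for c1, c2 in zip(cs, cs[1:]):
        N[stoi[c1], stoi[c2]] += 1

# normalize rows into probabilities (add-one smoothing avoids zero probabilities)
P = (N + 1).float()
P /= P.sum(dim=1, keepdim=True)

# evaluate the average negative log likelihood of the data under the model
log_likelihood, n = 0.0, 0
for w in words:
    cs = ['.'] + list(w) + ['.']
    for c1, c2 in zip(cs, cs[1:]):
        log_likelihood += torch.log(P[stoi[c1], stoi[c2]])
        n += 1
print(f"avg NLL: {-log_likelihood / n:.4f}")

# sample a new name from the model
g = torch.Generator().manual_seed(2147483647)
ix, out = 0, []
while True:
    ix = torch.multinomial(P[ix], num_samples=1, generator=g).item()
    if ix == 0:
        break
    out.append(itos[ix])
print("".join(out))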
1h15m
Building makemore Part 2: MLP
We implement a multilayer perceptron (MLP) character-level language
model. In this video we also introduce many basics of machine
learning (e.g. model training, learning rate tuning, hyperparameters,
evaluation, train/dev/test splits, under/overfitting, etc.).
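A minimal sketch of the model in question: an embedding table, one hidden tanh layer, and an output layer trained with cross-entropy. The toy dataset, shapes, and fixed learning rate here are illustrative; the video additionally covers dev/test splits, learning-rate tuning, and evaluation.

import torch
import torch.nn.functional as F

words = ["emma", "olivia", "ava"]   # placeholder data for illustration
chars = sorted(set("".join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi['.'] = 0
V, block_size = len(stoi), 3        # vocab size, context length

def build_dataset(ws):
    X, Y = [], []
    for w in ws:
        context = [0] * block_size
        for ch in w + '.':
            X.append(context)
            Y.append(stoi[ch])
            context = context[1:] + [stoi[ch]]
    return torch.tensor(X), torch.tensor(Y)

Xtr, Ytr = build_dataset(words)     # the video splits data into train/dev/test;
                                    # only one split is shown here

g = torch.Generator().manual_seed(42)
C  = torch.randn((V, 10), generator=g)              # embedding table
W1 = torch.randn((block_size * 10, 200), generator=g) * 0.1
b1 = torch.randn(200, generator=g) * 0.01
W2 = torch.randn((200, V), generator=g) * 0.1
b2 = torch.randn(V, generator=g) * 0.01
params = [C, W1, b1, W2, b2]
for p in params:
    p.requires_grad = True

for step in range(200):
    emb = C[Xtr]                                    # (N, block_size, 10)
    h = torch.tanh(emb.view(emb.shape[0], -1) @ W1 + b1)
    logits = h @ W2 + b2
    loss = F.cross_entropy(logits, Ytr)
    for p in params:
        p.grad = None
    loss.backward()
    lr = 0.1                                        # learning rate, tuned in the video
    for p in params:
        p.data += -lr * p.grad
print(loss.item())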
1h55m
Building makemore Part 3: Activations & Gradients, BatchNorm
We dive into some of the internals of MLPs with multiple layers and
scrutinize the statistics of the forward pass activations, backward
pass gradients, and some of the pitfalls when they are improperly
scaled. We also look at the typical diagnostic tools and
visualizations you'd want to use to understand the health of your
deep network. We learn why training deep neural nets can be fragile
and introduce the first modern innovation that made doing so much
easier: Batch Normalization. Residual connections and the Adam
optimizer remain notable todos for a later video.
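As a rough illustration of the idea, the sketch below standardizes a badly-scaled batch of hidden pre-activations and compares tanh saturation with and without batch normalization; the tensor names and sizes are made up for the example.

import torch

torch.manual_seed(0)
hpreact = torch.randn(32, 200) * 5.0       # a badly-scaled batch of pre-activations

# batchnorm: standardize each feature over the batch, then scale and shift
bngain = torch.ones(1, 200)                # learnable gamma
bnbias = torch.zeros(1, 200)               # learnable beta
mean = hpreact.mean(0, keepdim=True)
var  = hpreact.var(0, keepdim=True)
hnorm = bngain * (hpreact - mean) / torch.sqrt(var + 1e-5) + bnbias

h_bad  = torch.tanh(hpreact)
h_good = torch.tanh(hnorm)

# diagnostic: fraction of tanh units that are saturated (|h| > 0.97)
print("saturated without BN:", (h_bad.abs() > 0.97).float().mean().item())
print("saturated with BN:   ", (h_good.abs() > 0.97).float().mean().item())

At inference time batch statistics are replaced by running estimates kept during training, one of the subtleties the video goes into.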
1h55m
Building makemore Part 4: Becoming a Backprop Ninja
We take the 2-layer MLP (with BatchNorm) from the previous video and
backpropagate through it manually without using PyTorch autograd's
loss.backward(): through the cross entropy loss, 2nd linear layer,
tanh, batchnorm, 1st linear layer, and the embedding table. Along the
way, we get a strong intuitive understanding about how gradients flow
backwards through the compute graph and on the level of efficient
Tensors, not just individual scalars like in micrograd. This helps
build competence and intuition around how neural nets are optimized
and sets you up to more confidently innovate on and debug modern
neural networks.
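As one concrete example of the kind of exercise in this video, the sketch below derives the gradient of the cross-entropy loss with respect to the logits by hand and checks it against PyTorch autograd; the sizes are illustrative.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, V = 32, 27                         # batch size, vocabulary size
logits = torch.randn(n, V, requires_grad=True)
y = torch.randint(0, V, (n,))

loss = F.cross_entropy(logits, y)
loss.backward()                       # autograd's gradient, for comparison

# manual gradient: d(loss)/d(logits) = (softmax(logits) - onehot(y)) / n
dlogits = F.softmax(logits.detach(), dim=1)
dlogits[torch.arange(n), y] -= 1.0
dlogits /= n

print(torch.allclose(dlogits, logits.grad, atol=1e-6))  # expect True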
56m
Building makemore Part 5: Building a WaveNet
We take the 2-layer MLP from the previous video and make it deeper with a
tree-like structure, arriving at a convolutional neural network
architecture similar to the WaveNet (2016) from DeepMind. In the
WaveNet paper, the same hierarchical architecture is implemented more
efficiently using causal dilated convolutions (not yet covered).
Along the way we get a better sense of torch.nn and what it is and
how it works under the hood, and what a typical deep learning
development process looks like (a lot of reading of documentation,
keeping track of multidimensional tensor shapes, moving between
jupyter notebooks and repository code, ...).
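The sketch below illustrates the tree-like fusion idea with a small FlattenConsecutive module that merges consecutive pairs of time steps between linear layers; the module name follows the video, but the dimensions and the use of plain torch.nn layers here are illustrative.

import torch
import torch.nn as nn

class FlattenConsecutive(nn.Module):
    """Groups every n consecutive time steps into one feature vector."""
    def __init__(self, n):
        super().__init__()
        self.n = n
    def forward(self, x):              # x: (B, T, C)
        B, T, C = x.shape
        x = x.view(B, T // self.n, C * self.n)
        return x.squeeze(1) if x.shape[1] == 1 else x

block_size, V, n_emb, n_hidden = 8, 27, 24, 128
model = nn.Sequential(
    nn.Embedding(V, n_emb),
    FlattenConsecutive(2), nn.Linear(n_emb * 2, n_hidden), nn.Tanh(),
    FlattenConsecutive(2), nn.Linear(n_hidden * 2, n_hidden), nn.Tanh(),
    FlattenConsecutive(2), nn.Linear(n_hidden * 2, n_hidden), nn.Tanh(),
    nn.Linear(n_hidden, V),
)
x = torch.randint(0, V, (4, block_size))   # a batch of 4 contexts of 8 characters
logits = model(x)
print(logits.shape)                        # torch.Size([4, 27])

Instead of crushing all 8 context characters into a single hidden layer, each level fuses only two neighbouring groups, which is the hierarchy the WaveNet paper implements with causal dilated convolutions.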
1h56m
Let's build GPT: from scratch, in code, spelled out.
We build a Generatively Pretrained Transformer (GPT), following the
paper "Attention is All You Need" and OpenAI's GPT-2 / GPT-3. We talk
about connections to ChatGPT, which has taken the world by storm. We
watch GitHub Copilot, itself a GPT, help us write a GPT (meta :D!) .
I recommend people watch the earlier makemore videos to get
comfortable with the autoregressive language modeling framework and
basics of tensors and PyTorch nn, which we take for granted in this
video.
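For reference, here is a minimal sketch of a single head of masked self-attention, the core operation the video builds up step by step; the sizes are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1337)
B, T, C = 4, 8, 32          # batch, time (context length), channels
head_size = 16
x = torch.randn(B, T, C)

key   = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

k, q, v = key(x), query(x), value(x)                # each (B, T, head_size)
wei = q @ k.transpose(-2, -1) * head_size ** -0.5   # (B, T, T) scaled scores
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))     # causal mask: no peeking ahead
wei = F.softmax(wei, dim=-1)                        # attention weights
out = wei @ v                                       # (B, T, head_size)
print(out.shape)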
ongoing...