[HN Gopher] Coding Self-Attention, Multi-Head Attention, Cross-A...
___________________________________________________________________
Coding Self-Attention, Multi-Head Attention, Cross-Attention,
Causal-Attention
Author : rasbt
Score : 110 points
Date : 2024-01-14 14:29 UTC (8 hours ago)
(HTM) web link (magazine.sebastianraschka.com)
(TXT) w3m dump (magazine.sebastianraschka.com)
| atticora wrote:
| conscious, kon'sh@s, adjective -- Characterized by or having an
| awareness of one's environment and one's own existence,
| sensations, and thoughts. synonym: aware.
|
| Self-attention seems to be at least a proxy for "awareness of ...
| one's own existence." If that closed loop is the thing that
| converts sensibility into sentience, then maybe it's the source
| of LLMs' leverage too. Is this language comprehension algorithm a
| sort of consciousness algorithm?
| dlkf wrote:
| It's debatable to what degree "attention" in LLMs relates to
| "attention" in psychology. See Cosma Shalizi's note on this
| http://bactra.org/notebooks/nn-attention-and-transformers.ht...
| sk11001 wrote:
| ML attention is nothing like human attention. I think it's
| madness to attempt to map concepts from one field we barely
| understand to another field we also barely understand just
| because they use overlapping language.
| jampekka wrote:
| Having done some research into human attention, I have to
| agree with Hommel et al.: No one knows what attention is [1].
|
| In current ANNs, "attention" is quite well defined: weighting
| some variables based on other variables. But anthropomorphizing
| such concepts indeed muddies things more than it clarifies, as
| does calling interconnected summation units with non-linear
| transformations "neural networks".
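|
| For concreteness, a minimal sketch of that definition (toy
| scaled dot-product attention; the shapes and the q/k/v names
| are just for illustration):
|
|     import torch
|     import torch.nn.functional as F
|
|     # toy setup: 4 "tokens", each with an 8-dimensional
|     # query, key, and value vector
|     q = torch.randn(4, 8)  # what each position is looking for
|     k = torch.randn(4, 8)  # what each position offers for matching
|     v = torch.randn(4, 8)  # the variables that actually get mixed
|
|     # "weigh some variables based on other variables": the
|     # similarity of q and k decides how much of each row of v
|     # goes into the output
|     scores = q @ k.T / k.shape[-1] ** 0.5  # (4, 4) similarities
|     weights = F.softmax(scores, dim=-1)    # each row sums to 1
|     out = weights @ v                      # weighted mixture of v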
|
| But such (wrong) intuition-pumping terminology does attract,
| well, attention, so it gets adopted.
|
| [1]
| https://link.springer.com/article/10.3758/s13414-019-01846-w
| ben_w wrote:
| Careful: depending on who you ask, there are 40 different
| definitions of the term. Any given mind, natural or artificial,
| may well pass some of these without passing all of them.
| kmeisthax wrote:
| No. Self-attention is more akin to kernel smoothing[0] on
| memorized training data that spits out a weighted probability
| graph. As for consciousness, LLMs are not particularly well
| aware of their own strengths and limitations, at least not
| unless you finetune them to know what they are and aren't good
| at. They also don't have sensors, so awareness of any
| environment is not possible.
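|
| To make the kernel-smoothing analogy concrete, a toy
| Nadaraya-Watson-style sketch (the function name and shapes are
| made up): the exponentiated dot product plays the role of the
| kernel, and the output is a kernel-weighted average of the
| stored values.
|
|     import torch
|
|     def kernel_smoothed_lookup(query, keys, values):
|         # exp(dot product) acts as the kernel; normalizing the
|         # kernel weights is exactly a softmax over the keys
|         kernel = torch.exp(query @ keys.T / keys.shape[-1] ** 0.5)
|         weights = kernel / kernel.sum(dim=-1, keepdim=True)
|         return weights @ values  # kernel-weighted average
|
|     # toy "memorized" data: 6 stored key/value pairs, 8-dim each
|     keys, values = torch.randn(6, 8), torch.randn(6, 8)
|     query = torch.randn(1, 8)
|     smoothed = kernel_smoothed_lookup(query, keys, values)  # (1, 8)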
|
| If you trained a neural network with an attention mechanism
| using data obtained from, say, robotics sensors, then it
| _might_ be able to at least have environmental awareness. The
| problem is that current LLM training approaches rely on large
| amounts of training data - easy to obtain for text, nonexistent
| for sensor input. I _suspect_ awareness of one's own
| existence, sensations, and thoughts would additionally require
| some kind of continuous weight update[1], but I have no proof
| for that yet.
|
| [0] https://en.wikipedia.org/wiki/Kernel_smoother
|
| [1] Neural network weights are almost always trained in one big
| run, occasionally updated with fine-tuning, and almost never
| modified during usage of the model. All of ChatGPT's ability to
| learn from prior input comes from in-context learning, which
| does not modify weights. This is also why it tends to forget
| during long conversations.
| f38zf5vdt wrote:
| As mentioned, these are all toy implementations and you should
| not use them in production. If you want the fast, easy, and
| extremely optimized way of doing things, use
| torch.nn.MultiheadAttention or
| torch.nn.functional.scaled_dot_product_attention so that you get
| the optimal implementations. You can use xformers scaled dot
| product attention if you want the bleeding edge of performance.
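|
| A rough usage sketch for the fused op (shapes are made up; the
| backend kernel is picked for you based on device, dtype, and
| inputs):
|
|     import torch
|     import torch.nn.functional as F
|
|     device = "cuda" if torch.cuda.is_available() else "cpu"
|     dtype = torch.float16 if device == "cuda" else torch.float32
|
|     # (batch, num_heads, seq_len, head_dim)
|     q, k, v = (torch.randn(2, 8, 128, 64, device=device, dtype=dtype)
|                for _ in range(3))
|
|     # dispatches to flash / memory-efficient / math kernels
|     out = F.scaled_dot_product_attention(q, k, v, is_causal=True)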
|
| > (Note that the code presented in this article is intended for
| illustrative purposes. If you plan to implement self-attention
| for training LLMs, I recommend considering optimized
| implementations like Flash Attention, which reduce memory
| footprint and computational load.)
|
| Flash attention is already part of torch's kernels as of torch 2,
| but the latest versions and optimizations land in xformers first.
| radarsat1 wrote:
| It seems that some popular attention variants, such as relative
| position embeddings and rotary embeddings (RoPE), are not
| possible to implement with PyTorch's fused implementation, if I
| understand correctly. Do these then require the "slow path"
| versions that can be more easily modified?
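|
| For concreteness, something like this rough sketch is what I
| have in mind, with the rotary part applied to q and k before
| the fused call (the helper name and shapes are just made up):
|
|     import torch
|     import torch.nn.functional as F
|
|     def apply_rope(x, base=10000.0):
|         # x: (..., seq_len, head_dim), head_dim even; rotate
|         # channel pairs by a position-dependent angle
|         seq_len, dim = x.shape[-2], x.shape[-1]
|         pos = torch.arange(seq_len, dtype=x.dtype).unsqueeze(-1)
|         freqs = base ** (-torch.arange(0, dim, 2, dtype=x.dtype) / dim)
|         cos, sin = (pos * freqs).cos(), (pos * freqs).sin()
|         x1, x2 = x[..., 0::2], x[..., 1::2]
|         out = torch.stack((x1 * cos - x2 * sin,
|                            x1 * sin + x2 * cos), dim=-1)
|         return out.flatten(-2)
|
|     # (batch, num_heads, seq_len, head_dim)
|     q, k, v = (torch.randn(1, 8, 16, 64) for _ in range(3))
|     q, k = apply_rope(q), apply_rope(k)  # only q and k are rotated
|     out = F.scaled_dot_product_attention(q, k, v, is_causal=True)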
| rasbt wrote:
| Yes, totally agree. These implementations are meant for
| educational purposes. You could in theory use them to train a
| model, though (GPT-2 also had a from-scratch implementation, if
| I recall correctly). In practice, you probably want to use
| FlashAttention; you can use it through
| `torch.nn.functional.scaled_dot_product_attention` etc.
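|
| A rough sketch of steering that call toward the FlashAttention
| backend (this assumes a recent PyTorch 2.x and a GPU/dtype
| combination the flash kernel supports):
|
|     import torch
|     import torch.nn.functional as F
|
|     q, k, v = (torch.randn(2, 8, 1024, 64, device="cuda",
|                            dtype=torch.float16) for _ in range(3))
|
|     # restrict the fused op to the FlashAttention kernel only
|     with torch.backends.cuda.sdp_kernel(enable_flash=True,
|                                         enable_math=False,
|                                         enable_mem_efficient=False):
|         out = F.scaled_dot_product_attention(q, k, v, is_causal=True)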
___________________________________________________________________
(page generated 2024-01-14 23:01 UTC)