[HN Gopher] Coding Self-Attention, Multi-Head Attention, Cross-A...
       ___________________________________________________________________
        
       Coding Self-Attention, Multi-Head Attention, Cross-Attention,
       Causal-Attention
        
       Author : rasbt
       Score  : 110 points
       Date   : 2024-01-14 14:29 UTC (8 hours ago)
        
 (HTM) web link (magazine.sebastianraschka.com)
 (TXT) w3m dump (magazine.sebastianraschka.com)
        
       | atticora wrote:
        | conscious, kŏn′shəs, adjective -- Characterized by or having an
        | awareness of one's environment and one's own existence,
        | sensations, and thoughts. synonym: aware.
       | 
       | Self-attention seems to be at least a proxy for "awareness of ...
       | one's own existence." If that closed loop is the thing that
       | converts sensibility into sentience, then maybe it's the source
        | of LLMs' leverage too. Is this language comprehension algorithm a
       | sort of consciousness algorithm?
        
         | dlkf wrote:
         | It's debatable to what degree "attention" in LLMs relates to
         | "attention" in psychology. See Cosma Shalizi's note on this
         | http://bactra.org/notebooks/nn-attention-and-transformers.ht...
        
         | sk11001 wrote:
         | ML attention is nothing like human attention. I think it's
         | madness to attempt to map concepts from one field we barely
         | understand to another field we also barely understand just
         | because they use overlapping language.
        
           | jampekka wrote:
           | Having done some research into human attention, I have to
           | agree with Hommel et al: No one knows what attention is [1].
           | 
            | In current ANNs "attention" is quite well defined: how to
            | weight some variables based on other variables (minimal
            | sketch below). But anthropomorphizing such concepts indeed
            | muddies things more than it clarifies. The same goes for
            | calling interconnected summation units with non-linear
            | transformations "neural networks".
            | 
            | But such (wrong) intuition-pumping terminology does attract,
            | well, attention, so it gets adopted.
           | 
           | [1]
           | https://link.springer.com/article/10.3758/s13414-019-01846-w
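            | 
            | In code, that definition is roughly the following (the
            | tensor names and shapes here are made up for illustration):
            | 
            |     import torch
            |     import torch.nn.functional as F
            | 
            |     # Weight the rows of `values` by how similar the rows
            |     # of `queries` are to the rows of `keys`.
            |     queries = torch.randn(4, 8)  # 4 tokens, 8 dims each
            |     keys    = torch.randn(4, 8)
            |     values  = torch.randn(4, 8)
            | 
            |     scores  = queries @ keys.T / 8 ** 0.5  # similarities
            |     weights = F.softmax(scores, dim=-1)    # rows sum to 1
            |     output  = weights @ values  # weighted mix of values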
        
         | ben_w wrote:
          | Careful, depending on who you ask there are 40 different
          | definitions of the term. Any given mind, natural or artificial,
         | may well pass some of these without passing all of them.
        
         | kmeisthax wrote:
         | No. Self-attention is more akin to kernel smoothing[0] on
         | memorized training data that spits out a weighted probability
         | graph. As for consciousness, LLMs are not particularly well
         | aware of their own strengths and limitations, at least not
         | unless you finetune them to know what they are and aren't good
         | at. They also don't have sensors, so awareness of any
         | environment is not possible.
         | 
         | If you trained a neural network with an attention mechanism
          | using data obtained from, say, robotics sensors, then it
         | _might_ be able to at least have environmental awareness. The
         | problem is that current LLM training approaches rely on large
         | amounts of training data - easy to obtain for text, nonexistent
          | for sensor input. I _suspect_ awareness of one's own
         | existence, sensations, and thoughts would additionally require
         | some kind of continuous weight update[1], but I have no proof
         | for that yet.
         | 
         | [0] https://en.wikipedia.org/wiki/Kernel_smoother
         | 
         | [1] Neural network weights are almost always trained in one big
         | run, occasionally updated with fine-tuning, and almost never
         | modified during usage of the model. All of ChatGPT's ability to
         | learn from prior input comes from in-context learning which
         | does not modify weights. This is also why it tends to forget
         | during long conversations.
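          | 
          | To make the kernel smoothing analogy concrete, here is a tiny
          | sketch (the function and variable names are invented): a
          | Nadaraya-Watson style estimator with a softmaxed dot-product
          | kernel is exactly the attention weighting.
          | 
          |     import torch
          |     import torch.nn.functional as F
          | 
          |     def kernel_smoother(x_query, x_train, y_train):
          |         # Prediction = weighted average of training targets,
          |         # with weights given by a similarity kernel.
          |         sims = x_query @ x_train.T   # dot-product kernel
          |         w = F.softmax(sims, dim=-1)  # normalize to sum to 1
          |         return w @ y_train           # weighted average
          | 
          |     # Read x_query/x_train as queries/keys and y_train as
          |     # values and this is (unscaled) dot-product attention.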
        
       | f38zf5vdt wrote:
       | As mentioned, these are all toy implementations and you should
       | not use them in production. If you want to the fast, easy, and
       | extremely optimized way of doing things, use
       | torch.nn.MultiheadAttention or
       | torch.nn.functional.scaled_dot_product_attention so that you get
        | the optimal implementations. You can use xformers' scaled dot
       | product attention if you want the bleeding edge of performance.
       | 
       | > (Note that the code presented in this article is intended for
       | illustrative purposes. If you plan to implement self-attention
       | for training LLMs, I recommend considering optimized
       | implementations like Flash Attention, which reduce memory
       | footprint and computational load.)
       | 
       | Flash attention is already part of torch's kernels as of torch 2,
       | but the latest versions and optimizations land in xformers first.
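        | 
        | For reference, a minimal usage sketch of the fused path (the
        | shapes here are arbitrary):
        | 
        |     import torch
        |     import torch.nn.functional as F
        | 
        |     # (batch, heads, seq_len, head_dim)
        |     q = torch.randn(1, 8, 128, 64)
        |     k = torch.randn(1, 8, 128, 64)
        |     v = torch.randn(1, 8, 128, 64)
        | 
        |     # Dispatches to a fused kernel (FlashAttention, memory-
        |     # efficient attention, or a math fallback) depending on
        |     # dtype, device, and mask.
        |     out = F.scaled_dot_product_attention(q, k, v,
        |                                          is_causal=True)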
        
         | radarsat1 wrote:
          | It seems that there are some popular attention variants, such
          | as relative positional embeddings and rotary embeddings
          | (RoPE?), that are not possible to implement using PyTorch's
          | built-in implementation, if I understand correctly. Do these
          | then require the "slow path" versions that can be more easily
          | modified?
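          | 
          | For instance, I would expect a relative bias to go in as a
          | float attn_mask that gets added to the scores, something like
          | this sketch (shapes arbitrary):
          | 
          |     import torch
          |     import torch.nn.functional as F
          | 
          |     # (batch, heads, seq_len, head_dim)
          |     q = torch.randn(1, 8, 128, 64)
          |     k = torch.randn(1, 8, 128, 64)
          |     v = torch.randn(1, 8, 128, 64)
          | 
          |     # T5-style additive bias over all query/key pairs
          |     rel_bias = torch.randn(1, 8, 128, 128)
          |     out = F.scaled_dot_product_attention(
          |         q, k, v, attn_mask=rel_bias)
          | 
          | but I'm not sure whether that still hits the fastest fused
          | kernels. (RoPE rotates q and k before the call, so I'd guess
          | it composes with the fused path either way.)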
        
         | rasbt wrote:
         | Yes, totally agree. These implementations are meant for
         | educational purposes. You could in theory use them to train a
         | model though (GPT-2 also had a from-scratch implementation if I
          | recall correctly). In practice, though, you probably want to
          | use FlashAttention, which you can access through
          | `torch.nn.functional.scaled_dot_product_attention` etc.
        
       ___________________________________________________________________
       (page generated 2024-01-14 23:01 UTC)