[HN Gopher] Visualizing Attention, a Transformer's Heart [video]
       ___________________________________________________________________
        
       Visualizing Attention, a Transformer's Heart [video]
        
       Author : rohitpaulk
       Score  : 824 points
       Date   : 2024-04-14 23:38 UTC (23 hours ago)
        
 (HTM) web link (www.3blue1brown.com)
 (TXT) w3m dump (www.3blue1brown.com)
        
       | promiseofbeans wrote:
       | His previous post 'But what is a GPT?' is also really good:
       | https://www.3blue1brown.com/lessons/gpt
        
       | nostrebored wrote:
       | Working in a closely related space and this instantly became part
       | of my team's onboarding docs.
       | 
        | Worth noting that a lot of the visualization code is available on
        | GitHub.
       | 
       | https://github.com/3b1b/videos/tree/master/_2024/transformer...
        
         | sthatipamala wrote:
         | Sounds interesting; what else is part of those onboarding docs?
        
       | jiggawatts wrote:
       | It always blows my mind that Grant Sanderson can explain complex
       | topics in such a clear, understandable way.
       | 
       | I've seen several tutorials, visualisations, and blogs explaining
       | Transformers, but I didn't fully understand them until this
       | video.
        
         | chrishare wrote:
          | His content and impact are phenomenal
        
       | bilsbie wrote:
       | I finally understand this! Why did every other video make it so
       | confusing!
        
         | chrishare wrote:
         | It is confusing, 3b1b is just that good.
        
           | visarga wrote:
           | At the same time it feels extremely simple
           | 
            | attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) @ V
           | 
            | is just half a line; the multi-head, masking and positional
            | stuff are just toppings
           | 
           | we have many basic algorithms in CS that are more involved,
           | it's amazing we get language understanding from such simple
           | math
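            | 
            | A minimal NumPy sketch of that one line (names, sizes and the
            | random inputs are just illustrative):
            | 
            |     import numpy as np
            | 
            |     def softmax(x, axis=-1):
            |         x = x - x.max(axis=axis, keepdims=True)  # numerical stability
            |         e = np.exp(x)
            |         return e / e.sum(axis=axis, keepdims=True)
            | 
            |     def attention(Q, K, V):
            |         d_k = K.shape[-1]
            |         scores = Q @ K.T / np.sqrt(d_k)      # query/key similarities
            |         return softmax(scores, axis=-1) @ V  # weighted mix of values
            | 
            |     # toy example: 4 tokens, head dimension 8
            |     rng = np.random.default_rng(0)
            |     Q, K, V = rng.standard_normal((3, 4, 8))
            |     out = attention(Q, K, V)   # shape (4, 8)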
        
             | bilsbie wrote:
             | For me I never had too much trouble understanding the
             | algorithm. But this is the first time I can see why it
             | works.
        
         | Solvency wrote:
         | Because:
         | 
         | 1. good communication requires an intelligence that most people
         | sadly lack
         | 
         | 2. because the type of people who are smart enough to invent
         | transformers have zero incentive to make them easily
         | understandable.
         | 
         | most documents are written by authors subconsciously desperate
         | to mentally flex on their peers.
        
           | penguin_booze wrote:
           | Pedagogy requires empathy, to know what it's like to not know
           | something. They'll often draw on experiences the listener is
           | already familiar with, and then bridge the gap. This skill is
           | orthogonal to the mastery of the subject itself, which I
           | think is the reason most descriptions sound confusing,
           | inadequate, and/or incomprehensible.
           | 
            | Often, the disseminating medium is one-sided, like a video
            | or a blog post, which doesn't help, either. A conversational
            | interaction would help the expert sense why someone outside
            | the domain finds the subject confusing ("ah, I see what you
            | mean"...), discuss common pitfalls ("you might think it's
            | like this... but no, it's more like this...") etc.
        
           | WithinReason wrote:
           | 2. It's not malice. The longer you have understood something
            | the harder it is to explain it, since you've already forgotten
            | what it was like to not understand it.
        
         | thomasahle wrote:
         | I'm someone who would love to get better at making educational
         | videos/content. 3b1b is obviously the gold standard here.
         | 
         | I'm curious what things other videos did worse compared to
         | 3b1b?
        
           | bilsbie wrote:
           | I think he had a good, intuitive understanding that he wanted
           | to communicate and he made it come through.
           | 
           | I like how he was able to avoid going into the weeds and stay
           | focused on leading you to understanding. I remember another
           | video where I got really hung up on positional encoding and I
            | felt like I couldn't continue until I understood that. Or
           | other videos that overfocus on matrix operations or softmax,
           | etc.
        
         | ur-whale wrote:
         | > Why did every other video make it so confusing!
         | 
         | In my experience, with very few notable exceptions (e.g.
          | Feynman), researchers are the worst when it comes to clearly
         | explaining to others what they're doing.
         | 
          | I'm at the point where I'm starting to believe that pedagogy and
         | research generally are mutually exclusive skills.
        
           | namaria wrote:
           | It's extraordinarily difficult to imagine how it feels not to
           | understand something. Great educators can bridge that gap. I
           | don't think it's correlated with research ability in any way.
           | It's just a very rare skill set, to be able to empathize with
           | people who don't understand what you do.
        
         | Al-Khwarizmi wrote:
          | Not sure if you mean it as a rhetorical question, but I think it's
         | an interesting question. I think there are at least three
         | factors why most people are confused about Transformers:
         | 
          | 1. The standard terminology is "meh" at best. The word
         | "attention" itself is just barely intuitive, "self-attention"
         | is worse, and don't get me started about "key" and "value".
         | 
         | 2. The key papers (Attention is All You Need, the BERT paper,
         | etc.) are badly written. This is probably an unpopular opinion.
         | But note that I'm not diminishing their merits. It's perfectly
         | compatible to write a hugely impactful, transformative paper
         | describing an amazing breakthrough, but just don't explain it
         | very well. And that's exactly what happened, IMO.
         | 
         | 3. The way in which these architectures were discovered was
          | largely by throwing things at the wall and seeing what stuck.
         | There is no reflection process that ended on a prediction that
         | such an architecture would work well, which was then
         | empirically verified. It's empirical all the way through. This
         | means that we don't have a full understanding of why it works
         | so well, all explanations are post hoc rationalizations (in
         | fact, lately there is some work implying that other
         | architectures may work equally well if tweaked enough). It's
         | hard to explain something that you don't even fully understand.
         | 
         | Everyone who is trying to explain transformers has to overcome
         | these three disadvantages... so most explanations are
         | confusing.
        
           | cmplxconjugate wrote:
           | >This is probably an unpopular opinion.
           | 
            | I wouldn't say so. Historically it's quite common. Maxwell's
            | EM papers used such convoluted notation that they are quite
            | difficult to read. It wasn't until they were reformulated in
            | vector calculus that they became infinitely more digestible.
           | 
           | I think though your third point is the most important; right
           | now people are focused on results.
        
           | maleldil wrote:
           | > This is probably an unpopular opinion
           | 
           | There's a reason The Illustrated Transformer[1] was/is so
           | popular: it made the original paper much more digestible.
           | 
           | [1] https://jalammar.github.io/illustrated-transformer/
        
         | thinkingtoilet wrote:
         | Grant has a gift of explaining complicated things very clearly.
         | There's a good reason his channel is so popular.
        
       | YossarianFrPrez wrote:
       | This video (with a slightly different title on YouTube) helped me
       | realize that the attention mechanism isn't exactly a specific
       | function so much as it is a meta-function. If I understand it
       | correctly, Attention + learned weights effectively enables a
       | Transformer to learn a semi-arbitrary function, one which
       | involves a matching mechanism (i.e., the scaled dot-product.)
        
         | hackinthebochs wrote:
         | Indeed. The power of attention is that it searches the space of
         | functions and surfaces the best function given the constraints.
         | This is why I think linear attention will never come close to
         | the ability of standard attention, the quadratic term is a
         | necessary feature of searching over all pairs of inputs and
         | outputs.
        
       | mastazi wrote:
        | That example with the "was" token at the end of a murder novel is
        | genius (at 3:58 - 4:28 in the video); really easy for a non-
        | technical person to understand.
        
         | hamburga wrote:
         | I think Ilya gets credit for that example -- I've heard him use
         | it in his interview with Jensen Huang.
        
       | spacecadet wrote:
       | Fun video. Much of my "art" lately has been dissecting models,
       | injecting or altering attention, and creating animated
       | visualizations of their inner workings. Some really fun shit.
        
         | j_bum wrote:
         | Link? Sounds fun and reminds me of this tweet [0]
         | 
         | [0] https://x.com/jaschasd/status/1756930242965606582
        
           | spacecadet wrote:
            | Nah, someone downvoted it. And yes, it looks like that + 20
           | others that are animated.
        
             | CamperBob2 wrote:
             | Downvotes == empty boats. If "Empty Boat parable" doesn't
             | ring a bell, Google it...
        
               | globalnode wrote:
               | unless an algorithm decides to block or devalue the
               | content, but yeah i looked it up, very interesting
               | parable, thanks for sharing.
        
               | spacecadet wrote:
               | anger is a gift
        
       | thomasahle wrote:
       | I like the way he uses a low-rank decomposition of the Value
       | matrix instead of Value+Output matrices. Much more intuitive!
        
         | imjonse wrote:
          | It is the first time I've heard about the Value matrix being low
          | rank, so for me this was the confusing part. Codebases I have
          | seen also have value + output matrices, so it is clearer that
          | Q, K, V are similar sizes and there's a separate projection
          | matrix that adapts to the dimensions of the next network layer.
         | UPDATE: He mentions this in the last sections of the video.
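          | 
          | For anyone confused by the same thing, a rough sketch of the
          | equivalence (toy sizes, made-up names): the per-head value and
          | output projections compose into one d_model x d_model map whose
          | rank is at most d_head, which is the low-rank view the video
          | takes.
          | 
          |     import numpy as np
          | 
          |     d_model, d_head = 512, 64
          |     W_V = np.random.randn(d_model, d_head) * 0.02  # "value" (down-projection)
          |     W_O = np.random.randn(d_head, d_model) * 0.02  # "output" (up-projection)
          | 
          |     x = np.random.randn(d_model)        # one token's embedding
          |     codebase_view = (x @ W_V) @ W_O     # value projection, then output projection
          | 
          |     W_VO = W_V @ W_O                    # combined map, rank <= d_head
          |     video_view = x @ W_VO               # the low-rank "value map" framing
          |     assert np.allclose(codebase_view, video_view)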
        
       | namelosw wrote:
        | You might also want to check out the other 3b1b videos on neural
        | networks, since there's a sort of progression from one video to
        | the next: https://www.3blue1brown.com/topics/neural-networks
        
       | rollinDyno wrote:
        | Hold on, every predicted token is only a function of the previous
        | token? I must have something wrong. This would mean the whole
        | novel has to be encoded within the embedding of "was", which is
        | of length 12,288 in this example. Is it really possible that this
        | space is so rich that a single point in it can encapsulate a
        | whole novel?
        
         | faramarz wrote:
         | it's not about a single point encapsulating a novel, but how
         | sequences of such embeddings can represent complex ideas when
         | processed by the model's layers.
         | 
         | each prediction is based on a weighted context of all previous
         | tokens, not just the immediately preceding one.
        
           | rollinDyno wrote:
            | That weighted context is the 12,288-dimensional vector, no?
            | 
            | I suppose that when each element in the vector is 16 bits,
            | the space is immense and capable of holding a novel in a
            | single point.
        
         | vanjajaja1 wrote:
          | at that point what it has is not a representation of the input,
          | it's a representation of what the next output could be. i.e. it's
          | a lossy process and you can't extract what came in the past,
          | only the details relevant to next word prediction
         | 
         | (is my understanding)
        
           | rollinDyno wrote:
            | If the point was the representation of only the next token,
            | and predicted tokens were a function of only the preceding
            | token, then the vector of the new token wouldn't have the
            | information needed to produce new tokens that kept the novel
            | going.
        
         | jgehring wrote:
         | That's what happens in the very last layer. But at that point
         | the embedding for "was" got enriched multiple times, i.e., in
         | each attention pass, with information from the whole context
         | (which is the whole novel here). So for the example, it would
         | contain the information to predict, let's say, the first token
         | of the first name of the murderer.
         | 
         | Expanding on that, you could imagine that the intent of the
         | sentence to complete (figuring out the murderer) would have to
         | be captured in the first attention passes so that other layers
         | would then be able to integrate more and more context in order
         | to extract that information from the whole context. Also, it
         | means that the forward passes for previous tokens need to have
         | extracted enough salient high-level information already since
         | you don't re-compute all attention passes for all tokens for
         | each next token to predict.
        
         | evolvingstuff wrote:
         | You are correct, that is an error in an otherwise great video.
          | The k+1 token is not merely a function of the kth vector, but
          | rather of all prior vectors (combined using attention). There is
         | nothing "special" about the kth vector.
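          | 
          | A toy illustration of that point (plain NumPy, invented sizes,
          | using the embeddings directly as queries, keys and values for
          | simplicity): with a causal mask, the output at position k is a
          | weighted mix of the vectors at positions 1..k, so the final
          | token's vector can pull in information from the entire context,
          | not just the previous token.
          | 
          |     import numpy as np
          | 
          |     def causal_self_attention(X):
          |         S, d = X.shape
          |         scores = X @ X.T / np.sqrt(d)
          |         mask = np.triu(np.ones((S, S), dtype=bool), k=1)
          |         scores[mask] = -np.inf            # no peeking at future tokens
          |         w = np.exp(scores - scores.max(axis=-1, keepdims=True))
          |         w = w / w.sum(axis=-1, keepdims=True)
          |         return w @ X                      # row k mixes rows 0..k of X
          | 
          |     X = np.random.randn(6, 16)            # 6 tokens, toy embeddings
          |     out = causal_self_attention(X)
          |     # out[-1] depends on every token in X, not only the last one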
        
       | tylerneylon wrote:
       | Awesome video. This helps to show how the Q*K matrix
       | multiplication is a bottleneck, because if you have sequence
       | (context window) length S, then you need to store an SxS size
       | matrix (the result of all queries times all keys) in memory.
       | 
       | One great way to improve on this bottleneck is a new-ish idea
       | called Ring Attention. This is a good article explaining it:
       | 
       | https://learnandburn.ai/p/how-to-build-a-10m-token-context
       | 
       | (I edited that article.)
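        | 
        | Back-of-the-envelope for the naive version (assuming fp16 scores
        | and ignoring the batch dimension):
        | 
        |     S = 8192                      # context length
        |     heads = 96                    # GPT-3-like head count
        |     bytes_per_score = 2           # fp16
        |     per_head = S * S * bytes_per_score   # ~128 MiB of scores per head
        |     total = per_head * heads             # ~12 GiB across all heads
        |     print(per_head / 2**20, total / 2**30)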
        
         | rahimnathwani wrote:
         | He lists Ring Attention and half a dozen other techniques, but
         | they're not within the scope of this video:
         | https://youtu.be/eMlx5fFNoYc?t=784
        
         | danielhanchen wrote:
          | Oh with Flash Attention, you never have to construct the (S, S)
          | matrix at all (also in the article). Since it's softmax(Q @ K^T /
          | sqrt(d)) @ V, you can form the final output in tiles.
         | 
         | In Unsloth, memory usage scales linearly (not quadratically)
         | due to Flash Attention (+ you get 2x faster finetuning, 80%
         | less VRAM use + 2x faster inference). Still O(N^2) FLOPs
         | though.
         | 
         | On that note, on long contexts, Unsloth's latest release fits
         | 4x longer contexts than HF+FA2 with +1.9% overhead. So 228K
         | context on H100.
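          | 
          | A rough NumPy sketch of the tiling idea (only the key/value
          | dimension is tiled here for brevity; the real kernel also tiles
          | queries and runs fused on the GPU):
          | 
          |     import numpy as np
          | 
          |     def tiled_attention(Q, K, V, block=64):
          |         # never materializes the full (S, S) score matrix
          |         S, d = Q.shape
          |         out = np.zeros_like(Q)
          |         m = np.full(S, -np.inf)      # running row max
          |         l = np.zeros(S)              # running softmax denominator
          |         for j in range(0, S, block):
          |             Kj, Vj = K[j:j+block], V[j:j+block]
          |             s = Q @ Kj.T / np.sqrt(d)          # (S, block) tile of scores
          |             m_new = np.maximum(m, s.max(axis=1))
          |             scale = np.exp(m - m_new)          # rescale earlier partial sums
          |             p = np.exp(s - m_new[:, None])
          |             out = out * scale[:, None] + p @ Vj
          |             l = l * scale + p.sum(axis=1)
          |             m = m_new
          |         return out / l[:, None]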
        
       | mehulashah wrote:
       | This is one of the best explanations that I've seen on the topic.
        | I wish there were more work, however, not on how Transformers work,
       | but why they work. We are still figuring it out, but I feel that
       | the exploration is not at all systematic.
        
       | seydor wrote:
       | I have found the youtube videos by CodeEmporium to be simpler to
       | follow https://www.youtube.com/watch?v=Nw_PJdmydZY
       | 
        | The Transformer is hard to describe with analogies, and TBF there
        | is no good explanation of why it works, so it may be better to
        | just present the mechanism, "leaving the interpretation to the
        | viewer". Also, it's simpler to describe dot products as vectors
        | projecting onto one another
        
         | nerdponx wrote:
         | > TBF there is no good explanation why it works
         | 
         | My mental justification for attention has always been that the
         | output of the transformer is a sequence of new token vectors
         | such that each individual output token vector incorporates
         | contextual information from the surrounding input token
         | vectors. I know it's incomplete, but it's better than nothing
         | at all.
        
           | rcarmo wrote:
           | You're effectively steering the predictions based on adjacent
           | vectors (and precursors from the prompt). That mental model
           | works fine.
        
           | eurekin wrote:
           | > TBF there is no good explanation why it works
           | 
            | I thought the general consensus was: "transformers allow
            | neural networks to have adaptive weights".
            | 
            | As opposed to previous architectures, where every edge
            | connecting two neurons always has the same weight.
           | 
           | EDIT: a good video, where it's actually explained better:
           | https://youtu.be/OFS90-FX6pg?t=750&si=A_HrX1P3TEfFvLay
        
         | mjburgess wrote:
         | The explanation is just that NNs are a stat fitting alg
         | learning a conditional probability distribution,
         | P(next_word|previous_words). Their weights are a model of this
         | distribution. LLMs are a hardware innovation: they make it
         | possible for GPUs to compute this at scale across TBs of data.
         | 
          | Why does 'mat' follow from 'the cat sat on the ...'? Because
          | 'mat' is the most frequent continuation in the dataset, and the
          | NN is a model of those frequencies.
         | 
          | Why is 'London in UK' "known" but 'London in France' isn't? Just
          | because 'UK' occurs much more frequently in the dataset.
         | 
          | The algorithm isn't doing anything other than aligning
          | computation to hardware; the computation isn't doing anything
          | interesting. The value comes from the conditional probability
          | structure in the data -- and that comes from people arranging
          | words usefully, because they're communicating information with
          | one another.
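          | 
          | To make the P(next_word|previous_words) notation concrete, a toy
          | count-based model (conditioning on just one previous word for
          | brevity; an LLM conditions on the whole context and compresses
          | rather than storing raw counts):
          | 
          |     from collections import Counter, defaultdict
          | 
          |     corpus = "the cat sat on the mat . the dog sat on the rug .".split()
          | 
          |     counts = defaultdict(Counter)
          |     for prev, nxt in zip(corpus, corpus[1:]):
          |         counts[prev][nxt] += 1       # how often nxt follows prev
          | 
          |     def p(next_word, prev_word):
          |         c = counts[prev_word]
          |         return c[next_word] / sum(c.values()) if c else 0.0
          | 
          |     print(p("mat", "the"))   # 0.25 -- 'the' is followed by cat/mat/dog/rug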
        
           | nerdponx wrote:
           | I think you're downplaying the importance of the
           | attention/transformer architecture here. If it was "just" a
           | matter of throwing compute at probabilities, then we wouldn't
           | need any special architecture at all.
           | 
           | P(next_word|previous_words) is ridiculously hard to estimate
           | in a way that is actually useful. Remember how bad text
           | generation used to be before GPT? There is innovation in
           | discovering an architecture that makes it possible to learn
           | P(next_word|previous_words), in addition to the computing
           | techniques and hardware improvements required to make it
           | work.
        
             | mjburgess wrote:
             | Yes, it's really hard -- the innovation is aligning the
             | really basic dot-product similarity mechanism to hardware.
              | You can use basically any NN structure to do the same task;
              | the issue is that they're untrainable because they aren't
              | parallelizable.
             | 
             | There is no innovation here in the sense of a brand new
             | algorithm for modelling conditional probabilities -- the
             | innovation is in adapting the algorithm for GPU training on
             | text/etc.
        
               | bruce343434 wrote:
               | I don't know why you seem to have such a bone to pick
               | with transformers but imo it's still interesting to learn
               | about it, and reading your dismissively toned drivel of
               | "just" and "simply" makes me tired. You're barking up the
               | wrong tree man, what are you on about.
        
               | mjburgess wrote:
                | No issue with transformers -- the entire field of
                | statistical learning, from decision trees to NNs, does the
                | same thing... there's no mystery here. No person with any
               | formal training in mathematical finance, applied
               | statistics, hard experimental sciences on complex
               | domains... etc. would be taken in here.
               | 
               | I'm trying my best to inform people who are interested in
               | being informed, against an entire media ecosystem being
               | played like a puppet-on-a-string by ad companies. The
                | strategy of these companies is to exploit how easy it is
               | to strap anthropomorphic interfaces over models of word
               | frequencies and have everyone lose their minds.
               | 
               | Present the same models as a statistical dashboard, and
               | few would be so adamant that their sci-fi fantasy is the
               | reality.
        
               | divan wrote:
                | Do you have a blog or anything to follow?
        
               | mjburgess wrote:
               | I may start publishing academic papers in XAI as part of
               | a PhD; if I do, I'll share somehow. The problem is the
               | asymmetry of bullshit: the size of paper necessary for
               | academics to feel that claims have been evidenced is
               | book-length for critique but 2pg for "novel
               | contributions".
        
               | eutectic wrote:
               | Different models have different inductive biases. There
               | is no way you could build GPT4 with decision trees.
        
               | jameshart wrote:
               | "There's no mystery here"
               | 
               | Nobody's claiming there's 'mystery'. Transformers are a
               | well known, publicly documented architecture. This thread
               | is about a video explaining exactly how they work - that
               | they are a highly parallelizable approach that lends
               | itself to scaling back propagation training.
               | 
               | "No person with ... formal training ... would be taken in
               | here"
               | 
               | All of a sudden you're accusing someone of perpetuating a
               | fraud - I'm not sure who though. "Ad companies"?
               | 
               | Are you seriously claiming that there hasn't been a
               | qualitative improvement in the results of language
               | generation tasks as a result of applying transformers in
               | the large language model approach? Word frequencies turn
               | out to be a powerful thing to model!
               | 
               | It's ALL just hype, none of the work being done in the
               | field has produced any value, and everyone should... use
               | 'statistical dashboards' (whatever those are)?
        
               | fellendrone wrote:
               | > models of word frequencies
               | 
               | Ironically, your best effort to inform people seems to be
               | misinformed.
               | 
               | You're talking about a Markov model, not a language model
               | with trained attention mechanisms. For a start,
               | transformers can consider the entire context (which could
               | be millions of tokens) rather than simple state to state
               | probabilities.
               | 
               | No wonder you believe people are being 'taken in' and
               | 'played by the ad companies'; your own understanding
               | seems to be fundamentally misplaced.
        
               | saeranv wrote:
               | I think they are accounting for the entire context, they
               | specifically write out:
               | 
               | >> P(next_word|previous_words)
               | 
               | So the "next_word" is conditioned on "previous_words"
               | (plural), which I took to mean the joint distribution of
               | all previous words.
               | 
               | But, I think even that's too reductive. The transformer
               | is specifically not a function acting as some incredibly
               | high-dimensional lookup table of token conditional
                | probabilities. It's learning a (relatively) small number
                | of parameters to compress those learned conditional
               | probabilities into a radically lower-dimensional
               | embedding.
               | 
               | Maybe you could describe this as a discriminative model
               | of conditional probability, but at some point, we start
               | describing that kind of information compression as
               | semantic understanding, right?
        
               | nerdponx wrote:
               | It's reductive because it obscures just how complicated
               | that `P(next_word|previous_words)` is, and it obscures
               | the fact that "previous_words" is itself a carefully-
               | constructed (tokenized & vectorized) representation of a
               | huge amount of text. One individual "state" in this
               | Markov-esque chain is on the order of an entire book, in
               | the bigger models.
        
               | mjburgess wrote:
                | It doesn't matter how big it is, its properties don't
                | change. E.g., it never says, "I like what you're wearing"
                | because it likes what I'm wearing.
               | 
                | It seems there's an entire generation of people taken in
                | by this word, "complexity" -- it's just magic sauce that
                | gets sprinkled over ad copy for big tech.
               | 
                | We know what it means to compute P(word|words), we know
                | what it means that P("the sun is hot") > P("the sun is
                | cold") ... and we know that by computing this, you aren't
                | actually modelling the temperature of the sun.
               | 
               | It's just so disheartening how everyone becomes so
               | anthropomorphically credulous here... can we not even get
               | sun worship out of tech? Is it not possible for people to
               | understand that conditional probability structures do not
               | model mental states?
               | 
                | No model of conditional probabilities over text tokens,
                | no matter how many text tokens it models, ever says, "the
                | weather is nice in august" because it means the weather
                | is nice in august. It has never been in an August, or in
                | weather; nor does it have the mental states for
                | preference or desire; nor has its text generation been
                | caused by the August weather.
               | 
                | This is extremely obvious, as in, simply reflect on why
                | the people who wrote those historical texts did so... and
               | reflect on why an LLM generates this text... and you can
               | see that even if an LLM produced word-for-word MLK's I
               | have a dream speech, it does not have a dream. It has not
               | suffered any oppression; nor organised any labour; nor
               | made demands on the moral conscience of the public.
               | 
                | This shouldn't need to be said to a crowd who can
                | presumably understand what it means to take a
                | distribution of text tokens and subset them. It doesn't
                | matter how complex the weight structure of an NN is: this
                | tells you only how compressed the conditional probability
                | distribution is over many TBs of text history.
        
               | drdeca wrote:
               | Perhaps you have misunderstood what the people you are
               | talking about, mean?
               | 
               | Or, if not, perhaps you are conflating what they mean
               | with something else?
               | 
               | Something doesn't need to have had a subjective
               | experience of the world in order to act as a model of
               | some parts of the world.
        
               | nerdponx wrote:
               | You're tilting at windmills here. Where in this thread do
                | you see anyone talking about the LLM as anything other
               | than a next-token prediction model?
               | 
               | Literally all of the pushback you're getting is because
               | you're trivializing the choice of model architecture,
               | claiming that it's all so obvious and simple and it's all
               | the same thing in the end.
               | 
               | Yes, of course, these models have to be well-suited to
               | run on our computers, in this case GPUs. And sure, it's
               | an interesting perspective that maybe they work well
               | because they are well-suited for GPUs and not because
               | they have some deep fundamental meaning. But you can't
               | act like everyone who doesn't agree with your perspective
               | is just an AI hypebeast con artist.
        
               | kordlessagain wrote:
               | Somebody's judgment weights need to be updated to include
               | emoji embeddings.
        
               | YetAnotherNick wrote:
                | No. This is blatantly false. The belief that recurrent
                | models can't be scaled is untrue. People have recently
                | trained Mamba with billions of parameters. The
                | fundamental reason why transformers changed the field is
                | that they are a lot more scalable context-length-wise, and
                | LSTM, LRU etc. don't come close.
        
               | mjburgess wrote:
               | > they are lot more scalable context length wise
               | 
               | Sure, we're agreeing. I'm just being less specific.
        
               | YetAnotherNick wrote:
               | Scalable as in loss wise scalable, not compute wise.
        
               | HarHarVeryFunny wrote:
               | Yes, but pure Mamba doesn't perform as well as a
                | transformer (and neither did LSTMs). This is why you see
               | hybrid architectures like Jamba = Mamba + transformer.
               | The ability to attend to specific tokens is really key,
               | and what is lost in recurrent models where sequence
               | history is munged into a single state.
        
               | YetAnotherNick wrote:
               | That's my point. It doesn't perform in terms of loss,
               | even though it performs well enough in terms of compute
        
               | HarHarVeryFunny wrote:
               | > Yes, it's really hard -- the innovation is aligning the
               | really basic dot-product similarity mechanism to
               | hardware. You can use basically any NN structure to do
               | the same task, the issue is that they're untrainable
               | because they arent parallizable.
               | 
               | This is only partially true. I wouldn't say you could use
               | *any* NN architecture for sequence-to-sequence
               | prediction. You either have to model them as a
               | potentially infinite sequence with an RNN of some sort
               | (e.g. LSTM), or, depending on the sequence type, model
               | them as a hierarchy of sub-sequences, using something
               | like a multi-layered convolution or transformer.
               | 
               | The transformer is certainly well suited to current
               | massively parallel hardware architectures, and this was
               | also a large part of the motivation for the design.
               | 
               | While the transformer isn't the only way to do seq-2-seq
               | with neural nets, I think the reason it is so successful
               | is more than simply being scalable and well matched to
               | the available training hardware. Other techniques just
               | don't work as well. From the mechanistic interpretability
               | work that has been done so far, it seems that learnt
               | "induction heads", utilizing the key-based attention, and
               | layered architecture, are what give transformers their
               | power.
        
             | JeremyNT wrote:
             | > _There is innovation in discovering an architecture that
             | makes it possible to learn P(next_word|previous_words), in
             | addition to the computing techniques and hardware
             | improvements required to make it work._
             | 
             | Isn't that essentially what mjburgess said in the parent
             | post?
             | 
             | > _LLMs are a hardware innovation: they make it possible
             | for GPUs to compute this at scale across TBs of data... The
             | algorithm isnt doing anything other than aligning
             | computation to hardware_
        
               | nerdponx wrote:
               | Not really, and no. Torch and CUDA align computation to
               | hardware.
               | 
               | If it were just a matter of doing that, we would be fine
               | with fully-connected MLP. And maybe that would work with
               | orders of magnitude more data and compute than we
               | currently throw at these models. But we are already
               | pushing the cutting edge of those things to get useful
               | results out of the specialized architecture.
               | 
               | Choosing the right NN architecture is like feature
               | engineering: the exact details don't matter that much,
               | but getting the right overall structure can be the
               | difference between learning a working model and failing
               | to learn a working model, _from the same source data_
               | with the same information content. Clearly our choice of
               | inductive bias matters, and the transformer architecture
               | is clearly an improvement over other designs.
               | 
               | Surely you wouldn't argue that a CNN is "just" aligning
               | computation to hardware, right? Transformers are clearly
               | showing themselves as a reliably effective model
               | architecture for text in the same way that CNNs are
               | reliably effective for images.
        
               | mjburgess wrote:
               | Err... no. MLPs are fundamentally sequential algorithms
               | (backprop weight updating). All major innovations in NN
               | design have been to find ways of designing the
               | architecture to fit GPU compute paradigms.
               | 
               | It was an innovation, in the 80s, to map image structure
                | to weight structure that underpins CNNs. That isn't what
                | made CNNs trainable though... that was AlexNet, and just
                | go read the paper... it's pretty upfront about how the NN
                | architecture is designed to fit the GPU... that's the
                | point of it.
        
           | seydor wrote:
           | People specifically would like to know what the attention
           | calculations add to this learning of the distribution
        
             | ffwd wrote:
             | Just speculating but I think attention enables
             | differentiation of semantic concepts for a word or sentence
             | within a particular context. Like for any total set of
             | training data you have a lesser number of semantic concepts
             | (like let's say you have 10000 words, then it might contain
             | 2000 semantic concepts, and those concepts are defined by
             | the sentence structure and surrounding words, which is why
             | they have a particular meaning), and then attention allows
             | to differentiate those different contexts at different
             | levels (words/etc). Also the fact you can do this attention
             | at runtime/inference means you can generate the context
             | from the prompt, which enables the flexibility of variable
             | prompt/variable output but you lose the precision of giving
             | an exact prompt and getting an exact answer
        
               | ffwd wrote:
               | I'm not one to whine about downvotes but I just have to
               | say, it's a bad feeling when I can't even respond to the
               | negative feedback because there is no accompanying
               | comment. Did I misinterpret something? Did you? Who will
               | ever know when there is no information. :L
        
           | forrestthewoods wrote:
           | I find this take super weak sauce and shallow.
           | 
           | This recent $10,000 challenge is super super interesting
           | imho.
           | https://twitter.com/VictorTaelin/status/1778100581837480178
           | 
           | State of the art models are doing more than "just" predicting
           | the probability of the next symbol.
        
             | mjburgess wrote:
             | You underestimate the properties of the sequential-
             | conditional structure of human communication.
             | 
             | Consider how a clever 6yo could fake being a physicist with
             | access to a library of physics textbooks and a shredder.
             | All the work is done for them. You'd need to be a physicist
             | to spot them faking it.
             | 
             | Of course, LLMs are in a much better position than having
             | shredded physics textbooks -- they have shreddings of all
             | books. So you actually have to try to expose this process,
             | rather than just gullibly prompt using confirmation bias.
             | It's trivial to show they work this way, both formally and
             | practically.
             | 
             | The issue is, practically, gullible people aren't trying.
        
               | astrange wrote:
               | You can program algorithms into transformer networks, up
               | to the limit of how many computations you get.
               | 
               | https://srush.github.io/raspy/
               | 
               | Are you going to do computer reductionism too and say
               | computers can't do arithmetic, they just run electricity
               | through silicon?
        
               | forrestthewoods wrote:
               | I don't find your model either convincing or useful.
        
           | nextaccountic wrote:
           | > Why does, 'mat' follow from 'the cat sat on the ...'
           | because 'mat' is the most frequent word in the dataset; and
           | the NN is a model of those frequencies.
           | 
           | What about cases that are not present in the dataset?
           | 
           | The model must be doing _something_ besides storing raw
           | probabilities to avoid overfitting and enable generalization
           | (imagine that you could have a very performant model - when
           | it works - but it sometimes would spew  "Invalid input, this
           | was not in the dataset so I don't have a conditional
           | probability and I will bail out")
        
           | albertzeyer wrote:
           | You are more speaking about n-gram models here. NNs do far
           | more than that.
           | 
           | Or if you just want to say that NNs are used as a statistical
           | model here: Well, yea, but that doesn't really tell you
           | anything. Everything can be a statistical model.
           | 
            | E.g., you could also say "this is exactly the way the human
            | brain works", but it doesn't really tell you anything about
            | how it really works.
        
             | mjburgess wrote:
             | My description is true of any statistical learning
             | algorithm.
             | 
             | The thing that people are looking to for answers, the NN
             | itself, does not have them. That's like looking to Newton's
             | compass to understand his general law of gravitation.
             | 
             | The reason that LLMs trained on the internet and every
              | ebook have the structure of human communication is because
             | the dataset has that structure. Why does the data have that
             | structure? this requires science, there is no explanation
             | "in the compass".
             | 
             | NNs are statistical models trained on data -- drawing
             | analogies to animals is a mystification that causes
              | people's ability to think clearly to jump out the
             | window. No one compares stock price models to the human
             | brain; no banking regulator says, "well your volatility
             | estimates were off because your machines had the wrong
             | thoughts". This is pseudoscience.
             | 
             | Animals are not statistical learning algorithms, so the
             | reason that's uninformative is because it's false. Animals
             | are in direct causal contact with the world and uncover its
             | structure through interventional action and counterfactual
             | reasoning. The structure of animal bodies, and the general
              | learning strategies are well-known, and have nothing to
             | do with LLMs/NNs.
             | 
             | The reason that I know "The cup is in my hand" is not
             | because P("The cup is in my hand"|HistoricalTexts) > P(not
             | "The cup is in my hand"|HistoricalTexts)
        
               | Demlolomot wrote:
                | If learning in real life over 5-20 years shows the same
                | result as an LLM being trained on billions of tokens, then
                | yes, it can be compared.
               | 
                | And there are a lot of people out there who don't do a lot
                | of reasoning.
               | 
               | After all optical illusions exist, our brain generalizes.
               | 
                | The same thing happens with words, like the riddle about
                | the doctor operating on a child where we discover that the
                | doctor is actually a woman.
               | 
                | And while LLMs only use text, we can already see how
                | multimodal models become better, architectures get better,
                | and hardware does too.
        
               | mjburgess wrote:
               | I don't know what your motivation in comparison is; mine
               | is science, ie., explanation.
               | 
                | I'm not interested that your best friend emits the same
                | words in the same order as an LLM; I'm more interested
                | that he does so because he enjoys your company whereas the
                | LLM does not.
               | 
                | Engineers overstep their mission when they assume that
                | because you can substitute one thing for another, and
                | sell a product in doing so, that this is informative. It
                | isn't. I'm not interested in whether you can replace the
                | sky with a skybox and have no one notice -- who cares?
                | What might fool an ape is _everything_, and what that
                | matters for science is _nothing_.
        
               | Demlolomot wrote:
               | My thinking is highly influenced by brain research.
               | 
                | We don't just talk about an LLM; we talk about a neural
                | network architecture.
                | 
                | There is a direct link to us (neural networks).
        
               | vineyardmike wrote:
               | > The reason that I know "The cup is in my hand" is not
               | because P("The cup is in my hand"|HistoricalTexts) >
               | P(not "The cup is in my hand"|HistoricalTexts)
               | 
               | I mostly agree with your points, but I still disagree
               | with this premise. Humans (and other animals) absolutely
               | are statistical reasoning machines. They're just advanced
               | ones which can process more than "text" - they're multi-
               | modal.
               | 
               | As a super dumb-simple set of examples: Think about the
               | origin of the phrase "Cargo Cult" and similar religious
               | activities - people will absolutely draw conclusions
               | about the world based on their learned observations.
               | Intellectual "reasoning" (science!) really just relies on
               | more probabilities or correlations.
               | 
               | The reason you know the cup is in your hand is because
               | P("I see a cup and a hand"|HistoryOfEyesight) + P("I feel
               | a cylinder shape"|HistoryOfTactileFeeling) + .... >
               | P(Inverse). You can pretend it's because humans are
               | intelligent beings with deep reasoning skills (not trying
               | to challenge your smarts here!), but humans learn through
               | trial and error just like a NN with reinforcement
               | learning.
               | 
               | Close your eyes and ask a person to randomly place either
               | a cup from your kitchen in your hand or a different
                | object. You can probably tell which one it is. Why?
                | Because you have learned what it feels like, and learned
                | from countless examples of different cups, over years
                | of passive practice. That's basically deep learning.
        
               | mjburgess wrote:
               | I mean something specific by "statistics": modelling
               | frequency associations in static ensembles of data.
               | 
               | Having a body which changes over time that interacts with
               | a world that changes over time makes animal learning not
               | statistical (call it, say, experimental). That animals
               | fall into skinner-box irrational behaviour can be
               | modelled as a kind of statistical learning, but it
                | actually isn't.
               | 
               | It's a failure of ecological salience mechanisms in
               | regulating the "experimental learning" that animals
               | engage in. Eg., with the cargo cults the reason they
               | adopted that view was because their society had a "big
               | man" value system based on material acquisition and
               | western waring powers seemed Very Big and so were
               | humiliating. In order to retain their status they adopted
               | (apparently irrational) theories of how the world worked
               | (gods, etc).
               | 
               | From the outside this process might seem statistical, but
               | it's the opposite. Their value system made material
               | wealth have a different causal salience which was useful
               | in their original ecology (a small island with small
               | resources), but it went haywire when faced with the whole
               | world.
               | 
               | Eventually these mechanisms update with this new
               | information, or the tribe dies off -- but what's going
               | wrong here is that the very very non-statistical learning
               | ends up describable that way.
               | 
                | This is, indeed, why we should be very concerned about
                | people skinner-boxing themselves with LLMs.
        
               | vineyardmike wrote:
               | > Having a body which changes over time that interacts
               | with a world that changes over time makes animal learning
               | not statistical (call it, say, experimental).
               | 
                | The "experiment" of life is what defines the statistical
                | values! Experimentation is just learning what the
               | statistical output of something is.
               | 
               | If I hand you a few dice, you'd probably be able to guess
                | the statistical probability of every number for a given
               | roll. Because you've learned that through years of
               | observation building a mental model. If I hand you a
               | weighted die, suddenly your mental model is gone, and you
               | can re-learn experimentally by rolling it a bunch. How
               | can you explain experimental learning except
               | "statistically"?
               | 
               | > they adopted (apparently irrational) theories of how
               | the world worked (gods, etc)
               | 
               | They can be wrong without being irrational. Building an
               | airport doesn't make planes show up, but planes won't
               | show up without an airport. If you're an island nation
               | with little understanding of the global geopolitical
               | environment of WWII, you'd have no idea why planes
               | started showing up on your island, but they keep showing
               | up, and only at an airport. It seems rational to assume
               | they'd continue showing up to airports.
               | 
               | > that animals fall into skinner-box irrational behaviour
               | can be modelled as a kind of statistical learning, but it
               | actually isnt
               | 
               | What is it if not statistical?
               | 
               | Also skinner boxes are, in a way, perfectly rational.
               | There's no way to understand the environment, and if
               | pushing a button feeds you, then rationally you should
                | push the button when hungry. Humans like to think we're
                | smart because we've invented deductive reasoning, and we
                | quote "correlation is not causation" as if we're not just
                | learning to predict the world around us from past
                | experiences.
        
               | mjburgess wrote:
               | For dice the ensemble average is the time-average: if you
               | roll the dice 1000 times the probability of getting a
               | different result doesn't change.
               | 
               | For almost everything in the world, action on it, changes
               | it. There are vanishingly few areas where this isn't the
               | case (most physics, most chemistry, etc.).
               | 
               | Imagine trying to do statistics but every time you
               | sampled from reality the distribution of your sample
               | changes not due to randomness, but because reality has
               | changed. Now, can you do statistics? No.
               | 
               | It makes all the difference in the world to have a body
               | and hold the thing you're studying. Statistics is trying
               | to guess the shape of the ice cube from the puddle;
               | animal learning is making ice cubes.
        
               | data_maan wrote:
               | > Having a body which changes over time that interacts
               | with a world that changes over time makes animal learning
               | not statistical (call it, say, experimental). That
               | animals fall into skinner-box irrational behaviour can be
               | modelled as a kind of statistical learning, but it
               | actually isnt.
               | 
               | RL is doing just this, simulating an environment. And we
               | can have an agent "learn" in that environment.
               | 
               | I think tying learning to a body is too restrictive. The
               | 
               | You strongly rely on the assumption that "something else"
                | generates the statistics we observe, but scientifically,
                | there is little evidence that this "something
                | else" exists (see e.g. the Bayesian brain).
        
             | cornholio wrote:
             | > "this is exactly the way the human brain works"
             | 
             | I'm always puzzled by such assertions. A cursory look at
             | the technical aspects of an iterated attention - perceptron
             | transformation clearly shows it's just a convoluted and
             | powerful way to query the training data, a "fancy" Markov
             | chain. The only rationality it can exhibit is that which is
             | already embedded in the dataset. If trained on nonsensical
             | data it would generate nonsense and if trained with a
             | partially non-sensical dataset it will generate an average
             | between truth and nonsense that maximizes some abstract
             | algorithmic goal.
             | 
             | There is no knowledge generation going on, no rational
             | examination of the dataset through the lens of an internal
             | model of reality that allows the rejection of invalid
              | premises. The intellectual food is already chewed and
              | digested in the form of the training weights, with the model
              | just mechanically extracting the nutrients, as opposed to
              | venturing into the outside world to hunt.
             | 
             | So if it works "just like the human brain", it does so in a
             | very remote sense, just like a basic neural net works "just
              | like the human brain", i.e., individual biological neurons
             | can be said to be somewhat similar.
        
               | pas wrote:
               | If a human spends the first 30 years of their life in a
               | cult they will be also speaking nonsense a lot - from our
               | point of view.
               | 
               | Sure, we have a nice inner loop, we do some pruning,
               | picking and choosing, updating, weighting things based on
               | emotions, goals, etc.
               | 
               | Who knows how complicated those things will prove to
               | model/implement...
        
           | michaelt wrote:
           | That's not really an explanation that tells people all that
           | much, though.
           | 
           | I can explain that car engines 'just' convert gasoline into
            | forward motion. But if the person hearing the explanation
           | is hoping to learn what a cam belt or a gearbox is, or why
           | cars are more reliable now than they were in the 1970s, or
           | what premium gas is for, or whether helicopter engines work
           | on the same principle - they're going to need a more detailed
           | explanation.
        
             | mjburgess wrote:
             | It explains the LLM/NN. If you want to explain why it emits
             | words in a certain order you need to explain how reality
             | generated the dataset, i.e., you need to explain how people
             | communicate (and so on).
             | 
             | There is no mystery why an NN trained on the night sky
             | would generate night-sky-like photos; the mystery is why
             | those photos have those patterns... solving that is called
             | astrophysics.
             | 
             | Why do people, in reasoning through physics problems, write
             | symbols in a certain order? Well, explain physics,
             | reasoning, mathematical notation, and so on. The ordering
             | of the symbols gives rise to a certain utility of
             | imitating that order -- but it isn't explained by that
             | order. That's circular: "LLMs generate text in the order
             | they do, because that's the order of the text they were
             | given"
        
               | michaelt wrote:
               | That leaves loads of stuff unexplained.
               | 
               | If the LLM is capable of rewording the MIT license into a
               | set of hard-hitting rap battle lyrics, but the training
               | dataset didn't contain any examples of anyone doing that,
               | is the LLM therefore capable of producing output beyond
               | the limits of its training data set?
               | 
               | Is an LLM inherently constrained to mediocrity? If an LLM
               | were writing a novel, does its design force it to produce
               | cliche characters and predictable plotlines? If applied
               | in science, are they inherently incapable of advancing
               | the boundaries of human knowledge?
               | 
               | Why transformers instead of, say, LSTMs?
               | 
               | Must attention be multi-headed? Why can't the model have
               | a simpler architecture, allowing such implementation
               | details to emerge from the training data?
               | 
               | Must they be so big that leading performance is only in
               | the hands of multi-billion-dollar corporations?
               | 
               | What's going on with language handling? Are facts learned
               | in an abstract enough way that they can cross language
               | barriers? Should a model produce different statements of
               | fact when questioned in different languages? Does France
               | need a French-language LLM?
               | 
               | Is it reasonable to expect models to perform basic
               | arithmetic accurately? What about summarising long
               | documents?
               | 
               | Why is it that I can ask questions with misspellings, but
               | get answers with largely correct spelling? If
               | misspellings were in the training data, why aren't they
               | in the output? Does the cleverness that stops LLMs from
               | learning misspellings from the training data also stop
               | them from learning other common mistakes?
               | 
               | If LLMs can be trained to be polite despite having
               | examples of impoliteness in their training data, can they
               | also be trained to not be racist, despite having examples
               | of racism in their training data?
               | 
               | Can a model learn a fact that is very rarely present in
               | the training data - like an interesting result in an
               | obscure academic paper? Or must a fact be widely known
               | and oft-repeated in order to be learned?
               | 
               | Merely saying "it predicts the next word" doesn't really
               | explain much at all.
        
               | mjburgess wrote:
               | Which conditional probability sequences can be exploited
               | for engineering utility cannot be known ahead of time;
               | nor is it explained by the NN. It's explained by
               | investigating how the data was created by people.
               | 
               | Train a NN to generate pictures of the night sky: which
               | can be used for navigation? Who knows, ahead of time. The
               | only way of knowing is to have an explanation of how the
               | solar system works and then check the pictures are
               | accurate enough.
               | 
               | The NN which generates photos of the night sky has nothing
               | in it that explains the solar system, nor does any aspect
               | of an NN model the solar system. The photos it was
               | trained on happened to have their pixels arranged in that
               | order.
               | 
               | Why those arrangements occur is explained by
               | astrophysics.
               | 
               | If you want to understand what ChatGPT can do, you need
               | to ask OpenAI for their training data and then perform
               | scientific investigations of its structure and how that
               | structure came to be.
               | 
               | Talking in terms of the NN model is propaganda and
               | pseudoscience: the NN didn't arrange the pixels, gravity
               | did. Likewise, the NN isn't arranging rap lyrics in that
               | order because it's rapping: singers are.
               | 
               | There is no actual mystery here. It's just that OpenAI
               | prevents us from accessing the data, and we struggle to
               | explain the reality which generated that data -- which
               | requires years of actual science.
        
               | pas wrote:
               | It has a lot of things already encoded regarding the
               | solar system, but it cannot really access it, it cannot -
               | as far as I know - run functions on its own internal
               | encoded data, right? If it does something like that, it's
               | because it learned that higher-level pattern based on
               | training data.
               | 
               | The problem with NN arrangements in general is that we
               | don't know if it's actually pulling out some exact
               | training data (or a useful so-far-unseen pattern from the
               | data!) or it's some distorted confabulation. (Clever Hans
               | all over again: if I ask ChatGPT to code me a nodeJS IMAP
               | backup program it does, but the package it gleefully
               | imports/require()s is made up.)
               | 
               | And while the typical artsy arts have loose rules, where
               | making up new shit based on what people wish for is
               | basically the only one, in other contexts that's a hard
               | no-no.
        
           | IanCal wrote:
           | This is wrong, or at least a simplification to the point of
           | removing any value.
           | 
           | > NNs are a stat fitting alg learning a conditional
           | probability distribution, P(next_word|previous_words).
           | 
           | They are trained to maximise this, yes.
           | 
           | > Their weights are a model of this distribution.
           | 
           | That doesn't really follow, but let's leave that.
           | 
           | > Why does, 'mat' follow from 'the cat sat on the ...'
           | because 'mat' is the most frequent word in the dataset; and
           | the NN is a model of those frequencies.
           | 
           | Here's the rub. If how you describe them is all they're doing
           | then a sequence of never-before-seen words would have _no_
           | valid response. All words would be equally likely. It would
           | mean that a single brand new word would result in absolute
           | gibberish following it as there's nothing to go on.
           | 
           | Let's try:
           | 
           | Input: I have one kjsdhlisrnj and I add another kjsdhlisrnj,
           | tell me how many kjsdhlisrnj I now have.
           | 
           | Result: You now have two kjsdhlisrnj.
           | 
           | I would wager a solid amount that kjsdhlisrnj never appears
           | in the input data. If it does, pick another one; it
           | doesn't matter.
           | 
           | So we are learning something _more general_ than the
           | frequencies of sequences of tokens.
           | 
           | I always end up pointing to this but OthelloGPT is very
           | interesting https://thegradient.pub/othello/
           | 
           | While it's _trained_ on sequences of moves, what it _does_ is
           | more than just  "sequence a,b,c is followed by d most often"
        
             | mjburgess wrote:
             | Any NN "trained on" data sampled from an abstract complete
             | outcome space (e.g., a game with formal rules; mathematical
             | sequences, etc) can often represent that space completely.
             | It comes down to whether you can form conditional
             | probability models of the rules, and that's usually
             | possible because that's what _abstract_ rules are.
             | 
             | > I have one kjsdhlisrnj and I add another kjsdhlisrnj,
             | tell me how many kjsdhlisrnj I now have.
             | 
             | 1. P(number-word|tell me how many...) > P(other-kinds-of-
             | words|tell me how many...)
             | 
             | 2. P(two|I have one ... I add another ...) > P(one|...) >
             | P(three|...) > others
             | 
             | This is trivial.
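             | 
             | (For concreteness, the kind of conditional-frequency
             | model being talked about can be sketched with plain
             | n-gram counting -- no NN involved; the toy corpus is
             | made up for illustration:)
             | 
             |     from collections import Counter, defaultdict
             | 
             |     corpus = ("the cat sat on the mat . "
             |               "the dog sat on the rug .").split()
             | 
             |     # P(next | previous two words) by raw counts
             |     counts = defaultdict(Counter)
             |     for a, b, c in zip(corpus, corpus[1:],
             |                        corpus[2:]):
             |         counts[(a, b)][c] += 1
             | 
             |     ctx = ("on", "the")
             |     total = sum(counts[ctx].values())
             |     print({w: n / total
             |            for w, n in counts[ctx].items()})
             |     # -> {'mat': 0.5, 'rug': 0.5}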
        
               | IanCal wrote:
               | Right, learning more abstract rules about how things work
               | is the goal and where the value comes in. Not all
               | algorithms are able to do this, even if they can do what
               | you describe in your first comment.
               | 
               | That's why they're interesting; OthelloGPT is interesting
               | because it builds a world model.
        
               | mjburgess wrote:
               | It builds a model of a "world" whose structure is
               | conditional probabilities; this is circular. It's like
               | saying you can use a lego model to build a model of
               | another lego model. All the papers which "show" NNs
               | building "world" models aren't using any world. It's lego
               | modelling lego.
               | 
               | The lack of a world model only matters when the data NNs
               | are trained on aren't valid measures of the world that
               | data is taken to model. All the moves of a chess game are
               | a complete model of chess. All the books ever written
               | aren't a model of, well, anything -- the structure of the
               | universe isn't the structure of text tokens.
               | 
               | The only reason _all_ statistical algorithms, including
               | NNs, appear to model the actual world is because patterns
               | in data give this appearance: P(The Sun is Hot)  > P(The
               | Sun is Cold) -- there is no model of the sun here.
               | 
               | The reason P("The Sun is Hot") seems to model the sun, is
               | because we can read the English words "sun" and "hot" --
               | it is we who think the machine which generates this text
               | does so semantically... but the people who wrote that
               | phrase in the dataset did so; the machine is just
               | generating "hot" because of that dataset.
        
               | IanCal wrote:
               | Othellogpt is fed only moves and builds a model of the
               | current board state in its activations. It never sees a
               | board.
               | 
               | > It's like saying you can use a lego model to build a
               | model of another lego model.
               | 
               | No, it's like using a description of piece placements and
               | having a picture in mind of what the current board looks
               | like.
        
               | mjburgess wrote:
               | The "board" is abstract. Any game of this sort is defined
               | by a series of conditional probabilities:
               | 
               | {P(Pawn_on_square_blah|previous_moves) ... etc.}
               | 
               | What all statistical learning algorithms model is sets of
               | conditional probabilities. So any stat alg is a model of
               | a set of these rules... that's the "clay" of these
               | models.
               | 
               | The problem is the physical world isn't anything like
               | this. The reason I say, "I liked that TV show" is because
               | I had a series of mental states caused by the TV show
               | over time (and so on). This isn't representable as a set
               | of conditional probs in the same way.
               | 
               | You could imagine, at the end of history, there being a
               | total set of all possible conditional probabilities: P(I
               | liked show|my_mental_states, time, person, location,
               | etc.) -- this would be uncomputable, but it could be
               | supposed.
               | 
               | If you had that dataset then yes, NNs would learn the
               | entire structure of the world, because _that's the
               | dataset_. The problem is that the world cannot be
               | represented in this fashion, not that NNs couldn't model
               | it if it could be. A decision tree could, too.
               | 
               | P(I liked the TV show) doesn't follow from any dataset
               | ever collected. It follows from my mental states. So no
               | NN can ever model it. They can model frequency
               | associations of these phrases in historical text
               | documents: this isn't a model of the world.
        
               | IanCal wrote:
               | > Any game of this sort is defined by a series of
               | conditional probabilities:
               | {P(Pawn_on_sqare_blah|previous_moves) ... etc.}
               | 
               | That would always be 1 or 0, but also _that data is not
               | fed into othellogpt_. That is not the dataset. It is not
               | fed board states at all.
               | 
               | It learns it, but it is _not the dataset_.
        
               | mjburgess wrote:
               | It is the dataset. When you're dealing with abstract
               | objects (i.e., mathematical spaces), they are all
               | isomorphic.
               | 
               | It doesn't matter if you "feed in" 1+1+1+1 or 2+2 or
               | sqrt(16).
               | 
               | The rules of chess are encoded either as explicit rules or
               | by contrast classes of valid/invalid games. These are
               | equivalent formulations.
               | 
               | When you're dealing with text tokens it does matter if
               | "Hot" is frequently after "The Sun is..." because reality
               | isn't an abstract space, and text tokens aren't measures of
               | it.
        
               | IanCal wrote:
               | > It is the dataset.
               | 
               | No. A series of moves alone provides strictly less
               | information than a board state or state + list of rules.
        
               | mjburgess wrote:
               | If the NN learns the game, that is itself an existence
               | proof of the opposite (by obvious information-theoretic
               | arguments).
               | 
               | Training is supervised, so you don't need bare sets of
               | moves to encode the rules; you just need a way of
               | subsetting the space into contrast classes of
               | valid/invalid.
               | 
               | It's a lie to say the "data" is the moves, the data is
               | the full outcome space: ({legal moves}, {illegal moves})
               | where the moves are indexed by the board structure
               | (necessarily, since moves are defined by the board
               | structure -- it's an _abstract_ game). So there are two
               | deceptions here: (1) supervision structures the training
               | space; and (2) the individual training rows have
               | sequential structure which maps to board structure.
               | 
               | Complete information about the game is provided to the
               | NN.
               | 
               | But let's be clear, the othellogpt still generates
               | illegal moves -- showing that it does not learn the
               | binary conditional structure of the actual game.
               | 
               | The deceptiveness of training a NN on a game whose rules
               | are conditional probability structures and then claiming
               | the very-good-quality conditional probability structures
               | it finds are "World Models" is... maddening.
               | 
               | This is all just fraud to me; frauds dressing up other
               | frauds in transparent clothing. LLMs trained on the
               | internet are being sold as approximating the actual
               | world, not 8x8 boardgames. I have nothing polite to say
               | about any of this
        
               | IanCal wrote:
               | > It's a lie to say the "data" is the moves, the data is
               | the full outcome space: ({legal moves}, {illegal moves})
               | 
               | There is nothing about illegal moves provided to
               | othellogpt as far as I'm aware.
               | 
               | > Complete information about the game is provided to the
               | NN.
               | 
               | That is not true. Where is the information that there are
               | two players provided? Or that there are two colours? Or
               | how the colours change? Where is the information about
               | invalid moves provided?
               | 
               | > But let's be clear, the othellogpt still generates
               | illegal moves -- showing that it does not learn the
               | binary conditional structure of the actual game.
               | 
               | Not perfectly, no. But that's not at all required for my
               | point, though it is relevant if you try to use the fact
               | that it learns to play the game as proof that moves
               | provide all information about legal board states.
        
               | mjburgess wrote:
               | How do you think the moves are represented?
               | 
               | All abstract games of this sort are just sequences of bit
               | patterns, each pattern related to the full legal space by
               | a conditional probability structure (or, equivalently, as
               | set ratios).
               | 
               | Strip away all the NN b/s and anthropomorphic language
               | and just represent it to yourself using bit sets.
               | 
               | Then ask: how hard is it to approximate the space from
               | which these bit sets are drawn using arbitrarily deep
               | conditional probability structures?
               | 
               | It's trivial.
               | 
               | The problem the author sets up, about causal structures
               | in the world, cannot be represented as a finite sample of
               | bit-set sequences -- and even if it could, that isn't the
               | data being used.
               | 
               | The author hasn't understood the basics of what the
               | 'world model' problem even is.
        
             | pas wrote:
             | how does it work underneath?
             | 
             | "kjsdhlisrnj" is in the context, it gets tokenized, and now
             | when the LLM is asked to predict/generate next-token
             | sequences somehow "kjsdhlisrnj" is there too. it learns
             | patterns. Okay sure, they get encoded somehow, but during
             | inference how does this lead to applying a recalled
             | pattern on the right token(s)?
             | 
             | also, can it invent new words?
        
               | empath-nirvana wrote:
               | Yes. This is one of the first things people did with
               | chatgpt:
               | 
               | can you help me invent a language for a fictional setting
               | with talking frogs. Give me a sample sentence that frogs
               | might say in their language, along with a translation.
               | 
               | ChatGPT: Creating a language for a fictional setting can
               | be a fun and imaginative task! For your setting with
               | talking frogs, we'll consider the environment and
               | characteristics that might influence their language.
               | Here's a brief outline of the linguistic features and a
               | sample sentence:
               | 
               | Linguistic Features of the Frog Language
               | 
               | Phonetics: Since frogs are known for their croaks and
               | ribbits, their language might include a lot of guttural
               | and clicking sounds, which can be represented in writing
               | with characters like "k", "g", "q", and clicks like "!"
               | or "|=".
               | 
               | Syntax: We might imagine that frogs are direct and to the
               | point, so their sentence structure could be subject-verb-
               | object (SVO), similar to English, for straightforward
               | communication.
               | 
               | Vocabulary: Words would likely relate to their daily life
               | -- water, flies, lily pads, weather changes, predators,
               | etc.
               | 
               | Grammar: Simplified grammar with minimal use of tenses
               | and conjugations to reflect an emphasis on the present
               | moment, which is typical for many animalistic languages
               | in fiction.
               | 
               | Sample Language: Froglan
               | Froglii (hello), Lok (water), Krik (eat), Zraak (fly -
               | the insect), !quu (danger), |=ogg (jump)
               | 
               | Sample Sentence in Froglan: "Froglii! Lok krik zraak
               | |=ogg."
               | 
               | Translation: "Hello! Jump eat fly water."
               | 
               | This sentence, structured in a somewhat English syntax
               | for ease of understanding, would literally mean that a
               | frog is greeting another and then mentioning that it will
               | jump towards water to eat flies. The sentence structure
               | and vocabulary can be further refined and expanded based
               | on how deeply you want to dive into the language
               | creation!
        
           | sirsinsalot wrote:
           | It isn't some kind of Markov chain situation. Attention
           | cross-links the abstract meaning of words, subtle
           | implications based on context and so on.
           | 
           | So, "mat" follows "the cat sat on the" where we understand
           | the entire worldview of the dataset used for training; not
           | just the next-word probability based on one or more previous
           | words ... it's based on all previous meaning probabilities,
           | and those meanings' probabilities, and so on.
        
           | fellendrone wrote:
           | > Why does, 'mat' follow from 'the cat sat on the ...'
           | 
           | You're confidently incorrect by oversimplifying all LLMs to a
           | base model performing a completion from a trivial context of
           | 5 words.
           | 
           | This is tantamount to a straw man. Not only do few people use
           | untuned base models, it completely ignores in-context
           | learning that allows the model to build complex semantic
           | structures from the relationships learnt from its training
           | data.
           | 
           | Unlike base models, instruct and chat fine-tuning teaches
           | models to 'reason' (or rather, perform semantic calculations
           | in abstract latent spaces) with their "conditional
           | probability structure", as you call it, to varying extents.
           | The model must learn to use its 'facts', understand
           | semantics, and perform abstractions in order to follow
           | arbitrary instructions.
           | 
           | You're also conflating the training metric of "predicting
           | tokens" with the mechanisms required to satisfy this metric
           | for complex instructions. It's like saying "animals are just
           | performing survival of the fittest". While technically
           | correct, complex behaviours evolve to satisfy this 'survival'
           | metric.
           | 
           | You could argue they're "just stitching together phrases",
           | but then you would be varying degrees of wrong:
           | 
           | For one, this assumes phrases are compressed into
           | semantically addressable units, which is already a form of
           | abstraction ripe for allowing reasoning beyond 'stochastic
           | parroting'.
           | 
           | For two, it's well known that the first layers perform basic
           | structural analysis such as grammar, and later layers perform
           | increasing levels of abstract processing.
           | 
           | For three, it shows a lack of understanding in how
           | transformers perform semantic computation in-context from the
           | relationships learnt by the feed-forward layers. If you're
           | genuinely interested in understanding the computation model
           | of transformers and how attention can perform semantic
           | computation, take a look here: https://srush.github.io/raspy/
           | 
           | For a practical example of 'understanding' (to use the term
           | loosely), give an instruct/chat tuned model the text of an
           | article and ask it something like "What questions should this
           | article answer, but doesn't?" This requires not just
           | extracting phrases from a source, but understanding the
           | context of the article on several levels, then reasoning
           | about what the context is not asserting. Even comparatively
           | simple 4x7B MoE models are able to do this effectively.
        
           | astrange wrote:
           | LLMs don't work on words, they work on sequences of subword
           | tokens. "It doesn't actually do anything" is a common
           | explanation that's clearly a form of cope, because you can't
           | even explain why it can form complete words, let alone
           | complete sentences.
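           | 
           | (A quick way to see the subword point -- a made-up word
           | still splits into known pieces; minimal sketch using the
           | tiktoken library, and the exact split depends on the
           | vocabulary:)
           | 
           |     # pip install tiktoken
           |     import tiktoken
           | 
           |     enc = tiktoken.get_encoding("cl100k_base")
           |     ids = enc.encode("kjsdhlisrnj")
           |     # decode each token id back to its text piece
           |     print([enc.decode([i]) for i in ids])
           |     # several short subword fragments, not one word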
        
           | fspeech wrote:
           | There are an infinite number of distributions that can fit
           | the training data well (e.g., one that completely memorizes
           | the data and therefore replicates the frequencies). The trick
           | is to find the distributions that generalize well, and here
           | the NN architecture is critical.
        
         | blt wrote:
         | As a computer scientist, the "differentiable hash table"
         | interpretation worked for me. The AIAYN paper alludes to it by
         | using the query/key/value names, but doesn't explicitly say the
         | words "hash table". I guess some other paper introduced them?
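         | 
         | (A minimal numpy sketch of that reading: each query
         | scores every key and reads back a softmax-weighted blend
         | of the values, i.e. a "soft" lookup rather than an exact
         | bucket; shapes are made up, scaling as in the paper:)
         | 
         |     import numpy as np
         | 
         |     def soft_lookup(Q, K, V):
         |         # match each query against every key
         |         s = Q @ K.T / np.sqrt(K.shape[-1])
         |         # softmax over keys -> blending weights
         |         w = np.exp(s - s.max(-1, keepdims=True))
         |         w /= w.sum(-1, keepdims=True)
         |         # weighted mix of values, not one slot
         |         return w @ V
         | 
         |     rng = np.random.default_rng(0)
         |     Q = rng.normal(size=(2, 8))   # 2 queries
         |     K = rng.normal(size=(5, 8))   # 5 keys
         |     V = rng.normal(size=(5, 16))  # 5 values
         |     print(soft_lookup(Q, K, V).shape)  # (2, 16)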
        
       | abotsis wrote:
       | I think what made this so digestible for me were the animations.
       | The timing, how they expand/contract and unfold while he's
       | speaking... is all very well done.
        
         | _delirium wrote:
         | That is definitely one of the things he does better than most.
         | He actually wrote his own custom animation library for math
         | animations: https://github.com/3b1b/manim
        
           | divan wrote:
           | Also check out community edition: https://www.manim.community
        
       | rayval wrote:
       | Here's a compelling visualization of the functioning of an LLM
       | when processing a simple request: https://bbycroft.net/llm
       | 
       | This complements the detailed description provided by 3blue1brown
        
         | bugthe0ry wrote:
         | When visualised this way, the scale of GPT-3 is insane. I can't
         | imagine what 4 would look like here.
        
           | spi wrote:
           | IIRC, GPT-4 would actually be a bit _smaller_ to visualize
           | than GPT3. Details are not public, but from the leaks GPT-4
           | (at least, some by-now old version of it) was a mixture of
           | experts, with every expert having around 110B parameters [1].
           | So, while the total number of parameters is bigger than GPT-3
           | (1800B vs. 175B), it is "just" 16 copies of a smaller (110B)
           | parameters model. So if you wanted to visualize it in any
           | meaningful way, the plot wouldn't grow bigger - or it would,
           | if you included all different experts, but they are just
           | copies of the same architecture with different parameters,
           | which is not all that useful for visualization purposes.
           | 
           | [1] https://medium.com/@daniellefranca96/gpt4-all-details-
           | leaked...
        
             | joaogui1 wrote:
             | Mixture of Experts is not just 16 copies of a network, it's
             | a single network where for the feed forward layers the
             | tokens are routed to different experts, but the attention
             | layers are still shared. Also there are interesting choices
             | around how the routing works and I believe the exact
             | details of what OpenAI is doing are not public. In fact I
             | believe someone making a visualization of that would
             | dispel a ton of myths around what MoEs are and how they
             | work
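             | 
             | (A minimal sketch of that routing idea: per token,
             | a router scores the experts, the top-k expert FFNs
             | run, and their outputs are blended by softmax gates.
             | Sizes and the gating details here are illustrative,
             | not what OpenAI actually does:)
             | 
             |     import numpy as np
             | 
             |     rng = np.random.default_rng(0)
             |     d, n_exp, k = 16, 8, 2
             |     experts = [rng.normal(size=(d, d))
             |                for _ in range(n_exp)]
             |     W_route = rng.normal(size=(d, n_exp))
             | 
             |     def moe_ffn(x):
             |         # score each expert for this token
             |         logits = x @ W_route
             |         top = np.argsort(logits)[-k:]
             |         g = np.exp(logits[top])
             |         g /= g.sum()  # gate weights
             |         # run only the chosen experts, blend
             |         outs = [x @ experts[i] for i in top]
             |         return sum(w * o
             |                    for w, o in zip(g, outs))
             | 
             |     x = rng.normal(size=d)
             |     print(moe_ffn(x).shape)  # (16,)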
        
         | lying4fun wrote:
         | amazing visualisation
        
       | justanotherjoe wrote:
       | It seems he brushes over the positional encoding, which for me
       | was the most puzzling part of transformers. The way I understood
       | it, positional encoding is much like dates. Just like dates,
       | there are repeating minutes, hours, days, months...etc. Each of
       | these values has a shorter 'wavelength' than the next. The values
       | are then used to identify the position of each token. Like, 'oh,
       | I'm seeing January 5th tokens. I'm January 4th. This means this is
       | after me'. Of course the real pos. encoding is much smoother and
       | doesn't have an abrupt end like dates/times, but I think this was
       | the original motivation for positional encodings.
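       | 
       | (A minimal numpy sketch of that "nested clocks" idea, using the
       | sinusoidal formula from the original transformer paper: each
       | pair of dimensions oscillates at its own wavelength, like
       | minutes/hours/days on a clock face:)
       | 
       |     import numpy as np
       | 
       |     def pos_encoding(n_pos, d_model):
       |         pos = np.arange(n_pos)[:, None]
       |         i = np.arange(d_model // 2)[None, :]
       |         # each i is a "clock hand" with its own
       |         # wavelength, growing geometrically with i
       |         ang = pos / (10000 ** (2 * i / d_model))
       |         pe = np.zeros((n_pos, d_model))
       |         pe[:, 0::2] = np.sin(ang)
       |         pe[:, 1::2] = np.cos(ang)
       |         return pe
       | 
       |     pe = pos_encoding(n_pos=50, d_model=64)
       |     print(pe.shape)  # (50, 64): one row per position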
        
         | nerdponx wrote:
         | That's one way to think about it.
         | 
         | It's a clever way to encode "position in sequence" as some kind
         | of smooth signal that can be added to each input vector. You
         | might appreciate this detailed explanation:
         | https://towardsdatascience.com/master-positional-encoding-pa...
         | 
         | Incidentally, you can encode dates (e.g. day of week) in a
         | model as sin(day of week) and cos(day of week) to ensure that
         | "day 7" is mathematically adjacent to "day 1".
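         | 
         | (That day-of-week trick in a few lines -- mapping days onto
         | a circle so day 7 lands next to day 1; purely illustrative:)
         | 
         |     import numpy as np
         | 
         |     day = np.arange(1, 8)           # Mon=1 .. Sun=7
         |     x = np.sin(2 * np.pi * day / 7)
         |     y = np.cos(2 * np.pi * day / 7)
         |     # Sun->Mon is as close as Mon->Tue
         |     print(np.hypot(x[6] - x[0], y[6] - y[0]))
         |     print(np.hypot(x[1] - x[0], y[1] - y[0]))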
        
       | bjornsing wrote:
       | This was the best explanation I've seen. I think it comes down to
       | essentially two aspects: 1) he doesn't try to hide complexity and
       | 2) he explains what he thinks is the purpose of each computation.
       | This really reduces the room for ambiguity that ruins so many
       | other attempts to explain transformers.
        
       | Xcelerate wrote:
       | As someone with a background in quantum chemistry and some types
       | of machine learning (but not neural networks so much) it was a
       | bit striking while watching this video to see the parallels
       | between the transformer model and quantum mechanics.
       | 
       | In quantum mechanics, the state of your entire physical system is
       | encoded as a very high dimensional normalized vector (i.e., a ray
       | in a Hilbert space). The evolution of this vector through time is
       | given by the time-translation operator for the system, which can
       | loosely be thought of as a unitary matrix U (i.e., a probability
       | preserving linear transformation) equal to exp(-iHt), where H is
       | the Hamiltonian matrix of the system that captures its "energy
       | dynamics".
       | 
       | From the video, the author states that the prediction of the next
       | token in the sequence is determined by computing the next
       | context-aware embedding vector from the last context-aware
       | embedding vector _alone_. Our prediction is therefore the result
       | of a linear state function applied to a high dimensional vector.
       | This seems a lot to me like we have produced a Hamiltonian of our
       | overall system (generated offline via the training data), then we
       | reparameterize our particular subsystem (the context window) to
       | put it into an appropriate basis congruent with the Hamiltonian
       | of the system, then we apply a one step time translation, and
       | finally transform the resulting vector back into its original
       | basis.
       | 
       | IDK, when your background involves research in a certain field,
       | every problem looks like a nail for that particular hammer. Does
       | anyone else see parallels here or is this a bit of a stretch?
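       | 
       | (For anyone who wants to poke at the QM side of the analogy
       | numerically: a quick check that exp(-iHt) is unitary for a
       | Hermitian H, i.e. it preserves the norm of the state vector.
       | Standard linear algebra only, nothing transformer-specific:)
       | 
       |     import numpy as np
       |     from scipy.linalg import expm
       | 
       |     rng = np.random.default_rng(0)
       |     A = (rng.normal(size=(8, 8))
       |          + 1j * rng.normal(size=(8, 8)))
       |     H = (A + A.conj().T) / 2    # Hermitian "Hamiltonian"
       |     U = expm(-1j * H * 0.1)     # one small time step
       | 
       |     psi = rng.normal(size=8) + 1j * rng.normal(size=8)
       |     psi /= np.linalg.norm(psi)  # normalized state
       |     print(np.linalg.norm(U @ psi))  # ~1.0, norm kept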
        
         | bdjsiqoocwk wrote:
         | I think you're just describing a state machine, no? The fact
         | that you encode the state in a vector and steps by matrices is
         | an implementation detail...?
        
           | Xcelerate wrote:
           | Perhaps a probabilistic FSM describes the actual
           | computational process better since we don't have a concept
           | equivalent to superposition with transformers (I think?), but
           | the framework of a FSM alone doesn't seem to capture the
           | specifics of where the model/machine comes from (what I'm
           | calling the Hamiltonian), nor how a given context window (the
           | subsystem) relates to it. The change of basis that involves
           | the attention mechanism (to achieve context-awareness) seems
           | to align better with existing concepts in QM.
           | 
           | One might model the human brain as a FSM as well, but I'm not
           | sure I'd call the predictive ability of the brain an
           | implementation detail.
        
             | BoGoToTo wrote:
             | | context window
             | 
             | I actually just asked a question on the physics stack
             | exchange that is semi relevant to this. https://physics.sta
             | ckexchange.com/questions/810429/functiona...
             | 
             | In my question I was asking about a hypothetical time-
             | evolution operator that includes an analog of a light cone
             | that you could think of as a context window. If you had a
             | quantum state that was evolved through time by this
             | operator then I think you could think of the speed of light
             | being a byproduct of the width of the context window of
             | some operator that progresses the quantum state forward by
             | some time interval.
             | 
             | Note I am very much hobbyist-tier with physics so I could
             | also be way off base and this could all be nonsense.
        
               | ricardobeat wrote:
               | I'm way out of my depth here, but wouldn't such a
               | function have to encode an amount of information/state
               | orders of magnitude larger than the definition of the
               | function itself?
               | 
               | If this turns out to be possible, we will have found the
               | solution to the Sloot mystery :D
               | 
               | https://en.m.wikipedia.org/wiki/Sloot_Digital_Coding_Syst
               | em
        
               | DaiPlusPlus wrote:
               | The article references patent "1009908C2" but I can't
               | find it in the Dutch patent site, nor Google Patent
               | search.
               | 
               | The rest of the article has "crank" written all over it;
               | almost certainly investor fraud too - it'd be
               | straightforward to fake the claimed smartcard video thing
               | to a nontechnical observer - though not quite as
               | egregious as Steorn Orbo or Theranos.
        
           | feoren wrote:
           | Not who you asked (and I don't quite understand everything)
           | but I think that's about right, except in the continuous
           | world. You pick an encoding scheme (either the Lagrangian or
           | the Hamiltonian) to go from state -> vector. You have a
           | "rules" matrix, very roughly similar to a Markov matrix, H,
           | and (stretching the limit of my knowledge here) exp(-iHt)
           | _very_ roughly  "translates" from the discrete stepwise world
           | to the continuous world. I'm sure that last part made more
           | knowledgeable people cringe, but it's roughly in the right
           | direction. The part I don't understand at all is the _-i_
           | factor: exp(-it) just circles back on itself after t=2pi, so
           | it feels like exp(-iHt) should be a periodic function?
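           | 
           | (On the periodicity point: exp(-iHt) splits into one
           | phase exp(-i*lambda*t) per eigenvalue of H, so it only
           | repeats exactly when those frequencies are commensurate.
           | A quick numerical check, nothing beyond standard linear
           | algebra:)
           | 
           |     import numpy as np
           |     from scipy.linalg import expm
           | 
           |     def U(t, H):
           |         return expm(-1j * H * t)
           | 
           |     H1 = np.diag([1.0, np.sqrt(2.0)])  # 1 : sqrt(2)
           |     H2 = np.diag([1.0, 2.0])           # 1 : 2
           |     I = np.eye(2)
           |     print(np.allclose(U(2 * np.pi, H1), I))  # False
           |     print(np.allclose(U(2 * np.pi, H2), I))  # True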
        
         | lagrange77 wrote:
         | I only understand half of it, but it sounds very interesting.
         | I've always wondered if the principle of stationary action
         | could be of any help with machine learning, e.g. provide an
         | alternative point of view / formulation.
        
         | BoGoToTo wrote:
         | I've been thinking about this a bit lately. If time is non-
         | continuous then could you model the time evolution of the
         | universe as some operator recursively applied to the quantum
         | state of the universe? If each application of the operator
         | progresses the state of the universe by a single Planck time
         | could we even observe a difference between that and a universe
         | where time is continuous?
        
           | BobbyTables2 wrote:
           | I think Wolfram made news proposing something roughly along
           | these lines.
           | 
           | Either way, I find Planck time/energy to be a very spooky
           | concept.
           | 
           | https://wolframphysics.org/
        
           | tweezy wrote:
           | So one of the most "out there" non-fiction books I've read
           | recently is called "Alien Information Theory". It's a wild
           | ride and there's a lot of flat-out crazy stuff in it but it's
           | a really engaging read. It's written by a computational
           | neuroscientist who's obsessed with DMT. The DMT parts are
           | pretty wild, but the computational neuroscience stuff is
           | intriguing.
           | 
           | In one part he talks about a thought experiment modeling the
           | universe as a multidimensional cellular automata. Where
           | fundamental particles are nothing more than the information
           | they contain. And particles colliding is a computation that
            | tells that node and the adjacent nodes how to update their
            | state.
           | 
            | Way out, and not saying there's any truth to it. But it
           | was a really interesting and fun concept to chew on.
        
             | andoando wrote:
             | I'm working on a model to do just that :) The game of life
             | is not too far off either.
        
             | Gooblebrai wrote:
             | You might enjoy his next book: Reality Switch.
        
           | pas wrote:
           | This sounds like the Bohmian pilot wave theory (which is a
            | global formulation of QM). ... Which might not be that crazy,
           | since spooky action at a distance is already a given. And in
           | cosmology (or quantum gravity) some models are describing a
            | region of space based only on its surface. So in some sense
            | the universe is much less information dense than we think.
           | 
           | https://en.m.wikipedia.org/wiki/Holographic_principle
        
         | francasso wrote:
         | I don't think the analogy holds: even if you forget all the
         | preceding non linear steps, you are still left with just a
         | linear dynamical system. It's neither complex nor unitary,
         | which are two fundamental characteristics of quantum mechanics.
        
         | cmgbhm wrote:
         | Not a direct comment on the question but I had a math PhD as an
         | intern before. One of his comments was that tons of this
         | high-dimensional linear algebra stuff was super advanced in
         | the 1900s and has plenty of room for new CS discovery.
         | 
         | Didn't make the "what was going on then in physics" connection
         | until now.
        
       | shahbazac wrote:
       | Is there a reference which describes how the current architecture
       | evolved? Perhaps from a very simple core idea to the famous
       | "all you need" paper?
       | 
       | Otherwise it feels like lots of machinery created out of nowhere.
       | Lots of calculations and very little intuition.
       | 
       | Jeremy Howard made a comment on Twitter that he had seen various
       | versions of this idea come up again and again - implying that
       | this was a natural idea. I would love to see examples of where
       | else this has come up so I can build an intuitive understanding.
        
         | ollin wrote:
         | karpathy gave a good high-level history of the transformer
         | architecture in this Stanford lecture
         | https://youtu.be/XfpMkf4rD6E?si=MDICNzZ_Mq9uzRo9&t=618
        
         | HarHarVeryFunny wrote:
         | Roughly:
         | 
         | 1) The initial seq-2-seq approach was using LSTMs - one to
         | encode the input sequence, and one to decode the output
         | sequence. It's amazing that this worked at all - encode a
         | variable length sentence into a fixed size vector, then decode
         | it back into another sequence, usually of different length
         | (e.g. translate from one language to another).
         | 
         | 2) There are two weaknesses of this RNN/LSTM approach - the
         | fixed size representation, and the corresponding lack of
         | ability to determine which parts of the input sequence to use
         | when generating specific parts of the output sequence. These
         | deficiencies were addressed by Bahdanau et al in an
         | architecture that combined encoder-decoder RNNs with an
         | attention mechanism ("Bahdanau attention") that looked at each
         | past state of the RNN, not just the final one.
         | 
         | 3) RNNs are inefficient to train, so Jakob Uszkoreit was
         | motivated to come up with an approach that better utilized
         | available massively parallel hardware, and noted that language
         | is as much hierarchical as sequential, suggesting a layered
         | architecture where at each layer the tokens of the sub-sequence
         | would be processed in parallel, while retaining a Bahdanau-type
         | attention mechanism where these tokens would attend to each
         | other ("self-attention") to predict the next layer of the
         | hierarchy. Apparently in the initial implementation the idea
         | worked, but not better than other contemporary approaches
         | (incl. convolution), but then another team member, Noam
         | Shazeer, took the idea and developed it, coming up with an
         | architecture (which I've never seen described) that worked much
         | better, which was then experimentally ablated to remove
         | unnecessary components, resulting in the original transformer.
         | I'm not sure who came up with the specific key-based form of
         | attention in this final architecture.
         | 
         | 4) The original transformer, as described in the "attention is
         | all you need paper", still had a separate encoder and decoder,
         | copying earlier RNN based approaches, and this was used in some
         | early models such as Google's BERT, but this is unnecessary for
         | language models, and OpenAI's GPT just used the decoder
         | component, which is what everyone uses today. With this
         | decoder-only transformer architecture the input sentence is
         | input into the bottom layer of the transformer, and transformed
         | one step at a time as it passes through each subsequent layer,
         | before emerging at the top. The input sequence has an end-of-
         | sequence token appended to it, which is what gets transformed
         | into the next-token (last token) of the output sequence.
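         | 
         | (A minimal numpy sketch of that decoder-only shape: embedded
         | tokens pass through stacked blocks of causal self-attention
         | plus a feed-forward layer; the final position's vector is
         | what gets mapped to next-token probabilities. Single head,
         | random weights, no layer norm or training -- just the data
         | flow:)
         | 
         |     import numpy as np
         | 
         |     rng = np.random.default_rng(0)
         |     d, n_tok, vocab = 32, 6, 100
         | 
         |     def softmax(x):
         |         e = np.exp(x - x.max(-1, keepdims=True))
         |         return e / e.sum(-1, keepdims=True)
         | 
         |     def block(x, Wq, Wk, Wv, W1, W2):
         |         # causal self-attention: no peeking ahead
         |         q, k, v = x @ Wq, x @ Wk, x @ Wv
         |         s = q @ k.T / np.sqrt(d)
         |         s += np.triu(np.full((n_tok, n_tok),
         |                              -1e9), 1)
         |         x = x + softmax(s) @ v   # residual
         |         # position-wise feed-forward, residual
         |         return x + np.maximum(x @ W1, 0) @ W2
         | 
         |     x = rng.normal(size=(n_tok, d))  # embeddings
         |     for _ in range(2):               # 2 layers
         |         Ws = [0.1 * rng.normal(size=(d, d))
         |               for _ in range(5)]
         |         x = block(x, *Ws)
         |     W_out = 0.1 * rng.normal(size=(d, vocab))
         |     p_next = softmax(x[-1] @ W_out)
         |     print(p_next.shape)  # (100,) next-token dist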
        
           | krat0sprakhar wrote:
           | Thank you for this summary! Very well explained. Any tips on
           | what resources you use to keep updated on this field?
        
             | HarHarVeryFunny wrote:
             | Thanks. Mostly just Twitter, following all the companies &
             | researchers for any new announcements, then reading any
             | interesting papers mentioned/linked. I also subscribe to
             | YouTube channels like Dwarkesh Patel (interviewer) and
             | Yannic Kilcher (AI News), and search out YouTube interviews
             | with the principals. Of course I also read any AI news here
             | on HN, and sometimes there may be interesting information
             | in the comments.
             | 
             | There's a summary of social media AI news here, that
             | sometimes surfaces something interesting.
             | 
             | https://buttondown.email/ainews/archive/
        
       | kordlessagain wrote:
       | What I'm now wondering about is how intuition to connect
       | completely separate ideas works in humans. I will have very
       | strong intuition something is true, but very little way to show
       | it directly. Of course my feedback on that may be biased, but it
       | does seem some people have "better" intuition than others.
        
       | cs702 wrote:
       | Fantastic work by Grant Sanderson, as usual.
       | 
       |  _Attention has won_.[a]
       | 
       | It deserves to be more widely understood.
       | 
       | ---
       | 
       | [a] Nothing has outperformed attention so far, not even Mamba:
       | https://arxiv.org/abs/2402.01032
        
       | stillsut wrote:
       | In training we learn a.) the embeddings and b.) the KQ/MLP-
       | weights.
       | 
       | How well do Transformers perform given learned embeddings but
       | only randomly initialized decoder weights? Do they produce word
       | soup of related concepts? Anything syntactically coherent?
       | 
       | Once a well-trained high-dimensional representation of tokens is
       | established, can they learn KQ/MLP weights significantly faster?
        
       ___________________________________________________________________
       (page generated 2024-04-15 23:00 UTC)