[HN Gopher] Visualizing Attention, a Transformer's Heart [video]
___________________________________________________________________
Visualizing Attention, a Transformer's Heart [video]
Author : rohitpaulk
Score : 824 points
Date : 2024-04-14 23:38 UTC (23 hours ago)
(HTM) web link (www.3blue1brown.com)
(TXT) w3m dump (www.3blue1brown.com)
| promiseofbeans wrote:
| His previous post 'But what is a GPT?' is also really good:
| https://www.3blue1brown.com/lessons/gpt
| nostrebored wrote:
| Working in a closely related space and this instantly became part
| of my team's onboarding docs.
|
| Worth noting that a lot of the visualization code is available
| on GitHub.
|
| https://github.com/3b1b/videos/tree/master/_2024/transformer...
| sthatipamala wrote:
| Sounds interesting; what else is part of those onboarding docs?
| jiggawatts wrote:
| It always blows my mind that Grant Sanderson can explain complex
| topics in such a clear, understandable way.
|
| I've seen several tutorials, visualisations, and blogs explaining
| Transformers, but I didn't fully understand them until this
| video.
| chrishare wrote:
| His content and impact are phenomenal
| bilsbie wrote:
| I finally understand this! Why did every other video make it so
| confusing!
| chrishare wrote:
| It is confusing, 3b1b is just that good.
| visarga wrote:
| At the same time it feels extremely simple:
|
|     attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) @ V
|
| is just half a line; the multi-head, masking and positional
| stuff are just toppings.
|
| We have many basic algorithms in CS that are more involved;
| it's amazing we get language understanding from such simple
| math.
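|
| To make that half-line concrete, here is a minimal single-head
| NumPy sketch of the same formula (no masking; the shapes are
| made up purely for illustration):
|
|     import numpy as np
|
|     def attention(Q, K, V):
|         # one similarity score per (query, key) pair -> (S, S)
|         scores = Q @ K.T / np.sqrt(K.shape[-1])
|         # softmax over the key axis turns scores into weights
|         weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
|         weights /= weights.sum(axis=-1, keepdims=True)
|         # each output row is a weighted average of the value vectors
|         return weights @ V
|
|     S, d_k = 5, 64                      # sequence length, head dim
|     rng = np.random.default_rng(0)
|     Q, K, V = (rng.standard_normal((S, d_k)) for _ in range(3))
|     print(attention(Q, K, V).shape)     # (5, 64)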
| bilsbie wrote:
| For me I never had too much trouble understanding the
| algorithm. But this is the first time I can see why it
| works.
| Solvency wrote:
| Because:
|
| 1. good communication requires an intelligence that most people
| sadly lack
|
| 2. The type of people who are smart enough to invent
| transformers have zero incentive to make them easily
| understandable.
|
| most documents are written by authors subconsciously desperate
| to mentally flex on their peers.
| penguin_booze wrote:
| Pedagogy requires empathy, to know what it's like to not know
| something. They'll often draw on experiences the listener is
| already familiar with, and then bridge the gap. This skill is
| orthogonal to the mastery of the subject itself, which I
| think is the reason most descriptions sound confusing,
| inadequate, and/or incomprehensible.
|
| Often, the disseminating medium is one-sided, like a video
| or a blog post, which doesn't help, either. A conversational
| interaction would help the expert sense why someone outside
| the domain finds the subject confusing ("ah, I see what you
| mean"...), discuss common pitfalls ("you might think it's
| like this... but no, it's more like this...") etc.
| WithinReason wrote:
| 2. It's not malice. The longer you have understood something
| the harder it is to explain it, since you already forgot what
| it was like to not understand it.
| thomasahle wrote:
| I'm someone who would love to get better at making educational
| videos/content. 3b1b is obviously the gold standard here.
|
| I'm curious what things other videos did worse compared to
| 3b1b?
| bilsbie wrote:
| I think he had a good, intuitive understanding that he wanted
| to communicate and he made it come through.
|
| I like how he was able to avoid going into the weeds and stay
| focused on leading you to understanding. I remember another
| video where I got really hung up on positional encoding and I
| felt like I couldn't continue until I understood that. Or
| other videos that overfocus on matrix operations or softmax,
| etc.
| ur-whale wrote:
| > Why did every other video make it so confusing!
|
| In my experience, with very few notable exceptions (e.g.
| Feynman), researchers are the worst when it comes to clearly
| explaining to others what they're doing.
|
| I'm at the point where I'm starting to believe that pedagogy and
| research generally are mutually exclusive skills.
| namaria wrote:
| It's extraordinarily difficult to imagine how it feels not to
| understand something. Great educators can bridge that gap. I
| don't think it's correlated with research ability in any way.
| It's just a very rare skill set, to be able to empathize with
| people who don't understand what you do.
| Al-Khwarizmi wrote:
| Not sure if you mean it as rhetorical question but I think it's
| an interesting question. I think there are at least three
| factors why most people are confused about Transformers:
|
| 1. The standard terminology is "meh" at best. The word
| "attention" itself is just barely intuitive, "self-attention"
| is worse, and don't get me started about "key" and "value".
|
| 2. The key papers (Attention is All You Need, the BERT paper,
| etc.) are badly written. This is probably an unpopular opinion.
| But note that I'm not diminishing their merits. It's perfectly
| compatible to write a hugely impactful, transformative paper
| describing an amazing breakthrough, but just don't explain it
| very well. And that's exactly what happened, IMO.
|
| 3. The way in which these architectures were discovered was
| largely by throwing things at the wall and seeing what stuck.
| There is no reflection process that ended on a prediction that
| such an architecture would work well, which was then
| empirically verified. It's empirical all the way through. This
| means that we don't have a full understanding of why it works
| so well, all explanations are post hoc rationalizations (in
| fact, lately there is some work implying that other
| architectures may work equally well if tweaked enough). It's
| hard to explain something that you don't even fully understand.
|
| Everyone who is trying to explain transformers has to overcome
| these three disadvantages... so most explanations are
| confusing.
| cmplxconjugate wrote:
| >This is probably an unpopular opinion.
|
| I wouldn't say so. Historically it's quite common. Maxwell's
| EM papers used such convoluted notation that they are quite difficult
| to read. It wasn't until they were reformulated in vector
| calculus that they became infinitely more digestible.
|
| I think though your third point is the most important; right
| now people are focused on results.
| maleldil wrote:
| > This is probably an unpopular opinion
|
| There's a reason The Illustrated Transformer[1] was/is so
| popular: it made the original paper much more digestible.
|
| [1] https://jalammar.github.io/illustrated-transformer/
| thinkingtoilet wrote:
| Grant has a gift of explaining complicated things very clearly.
| There's a good reason his channel is so popular.
| YossarianFrPrez wrote:
| This video (with a slightly different title on YouTube) helped me
| realize that the attention mechanism isn't exactly a specific
| function so much as it is a meta-function. If I understand it
| correctly, Attention + learned weights effectively enables a
| Transformer to learn a semi-arbitrary function, one which
| involves a matching mechanism (i.e., the scaled dot-product.)
| hackinthebochs wrote:
| Indeed. The power of attention is that it searches the space of
| functions and surfaces the best function given the constraints.
| This is why I think linear attention will never come close to
| the ability of standard attention, the quadratic term is a
| necessary feature of searching over all pairs of inputs and
| outputs.
| mastazi wrote:
| That example with the "was" token at the end of a murder novel
| (at 3:58 - 4:28 in the video) is genius, and really easy for a
| non-technical person to understand.
| hamburga wrote:
| I think Ilya gets credit for that example -- I've heard him use
| it in his interview with Jensen Huang.
| spacecadet wrote:
| Fun video. Much of my "art" lately has been dissecting models,
| injecting or altering attention, and creating animated
| visualizations of their inner workings. Some really fun shit.
| j_bum wrote:
| Link? Sounds fun and reminds me of this tweet [0]
|
| [0] https://x.com/jaschasd/status/1756930242965606582
| spacecadet wrote:
| Nah someone down voted it. And yes, it looks like that + 20
| others that are animated.
| CamperBob2 wrote:
| Downvotes == empty boats. If "Empty Boat parable" doesn't
| ring a bell, Google it...
| globalnode wrote:
| unless an algorithm decides to block or devalue the
| content, but yeah i looked it up, very interesting
| parable, thanks for sharing.
| spacecadet wrote:
| anger is a gift
| thomasahle wrote:
| I like the way he uses a low-rank decomposition of the Value
| matrix instead of Value+Output matrices. Much more intuitive!
| imjonse wrote:
| This is the first time I've heard about the Value matrix being
| low rank, so for me this was the confusing part. Codebases I
| have seen also have value + output matrices, so it is clearer that
| Q,K,V are similar sizes and there's a separate projection
| matrix that adapts to the dimensions of the next network layer.
| UPDATE: He mentions this in the last sections of the video.
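|
| To unpack the low-rank point (a rough sketch with made-up,
| GPT-2-ish dimensions, as I understand the video): the per-head
| "value map" drawn in the video is just the product of the usual
| value and output projections, so its rank is at most the head
| dimension.
|
|     import numpy as np
|
|     d_model, d_head = 768, 64           # illustrative sizes only
|     rng = np.random.default_rng(0)
|     W_V = rng.standard_normal((d_model, d_head)) * 0.02  # "value down"
|     W_O = rng.standard_normal((d_head, d_model)) * 0.02  # "output up"
|
|     # A token embedding x contributes (x @ W_V) @ W_O to the residual
|     # stream, i.e. x @ (W_V @ W_O): a single d_model x d_model map
|     # whose rank is at most d_head = 64 -- the combined low-rank map
|     # the video talks about.
|     x = rng.standard_normal(d_model)
|     combined = W_V @ W_O
|     assert np.allclose(x @ W_V @ W_O, x @ combined)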
| namelosw wrote:
| You might also want to check out other 3b1b videos on neural
| networks since there are sort of progressions between each video
| https://www.3blue1brown.com/topics/neural-networks
| rollinDyno wrote:
| Hold on, every predicted token is only a function of the previous
| token? I must have something wrong. This would mean that the
| whole novel is packed within the embedding of "was", which is of
| length 12,288 in this example. Is it really possible that this
| space is so rich as to have a single point in it encapsulate a
| whole novel?
| faramarz wrote:
| it's not about a single point encapsulating a novel, but how
| sequences of such embeddings can represent complex ideas when
| processed by the model's layers.
|
| each prediction is based on a weighted context of all previous
| tokens, not just the immediately preceding one.
| rollinDyno wrote:
| That weighted context is the 12,288-dimensional vector, no?
|
| I suppose that when each element in the vector is 16 bits,
| then the space is immense and capable of holding a novel in
| a single point.
| vanjajaja1 wrote:
| At that point what it has is not a representation of the input,
| it's a representation of what the next output could be. I.e.,
| it's a lossy process and you can't extract what came in the
| past, only the details relevant to next-word prediction.
|
| (That's my understanding.)
| rollinDyno wrote:
| If the point were a representation of only the next token, and
| predicted tokens were a function of only the preceding token,
| then the vector of the new token wouldn't have the
| information needed to produce further tokens that keep the
| novel going.
| jgehring wrote:
| That's what happens in the very last layer. But at that point
| the embedding for "was" got enriched multiple times, i.e., in
| each attention pass, with information from the whole context
| (which is the whole novel here). So for the example, it would
| contain the information to predict, let's say, the first token
| of the first name of the murderer.
|
| Expanding on that, you could imagine that the intent of the
| sentence to complete (figuring out the murderer) would have to
| be captured in the first attention passes so that other layers
| would then be able to integrate more and more context in order
| to extract that information from the whole context. Also, it
| means that the forward passes for previous tokens need to have
| extracted enough salient high-level information already since
| you don't re-compute all attention passes for all tokens for
| each next token to predict.
| evolvingstuff wrote:
| You are correct, that is an error in an otherwise great video.
| The k+1 token is not merely a function of the kth vector, but
| rather all prior vectors (combined using attention). There is
| nothing "special" about the kth vector.
| tylerneylon wrote:
| Awesome video. This helps to show how the Q*K matrix
| multiplication is a bottleneck, because if you have sequence
| (context window) length S, then you need to store an SxS size
| matrix (the result of all queries times all keys) in memory.
|
| One great way to improve on this bottleneck is a new-ish idea
| called Ring Attention. This is a good article explaining it:
|
| https://learnandburn.ai/p/how-to-build-a-10m-token-context
|
| (I edited that article.)
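|
| To put rough numbers on that (one head, fp16, an arbitrarily
| chosen long context):
|
|     S = 128_000                           # context length
|     bytes_per_entry = 2                   # fp16
|     print(S * S * bytes_per_entry / 1e9)  # ~32.8 GB for one S x S matrix
|
| which is why the naive approach runs out of GPU memory long
| before it runs out of compute, and why tricks like Ring
| Attention (and Flash Attention, discussed below) avoid
| materializing the full matrix.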
| rahimnathwani wrote:
| He lists Ring Attention and half a dozen other techniques, but
| they're not within the scope of this video:
| https://youtu.be/eMlx5fFNoYc?t=784
| danielhanchen wrote:
| Oh, with Flash Attention you never have to construct the (S, S)
| matrix at all (also mentioned in the article). Since it's
| softmax(Q @ K^T / sqrt(d)) @ V, you can form the final output
| in tiles.
|
| In Unsloth, memory usage scales linearly (not quadratically)
| due to Flash Attention (+ you get 2x faster finetuning, 80%
| less VRAM use + 2x faster inference). Still O(N^2) FLOPs
| though.
|
| On that note, on long contexts, Unsloth's latest release fits
| 4x longer contexts than HF+FA2 with +1.9% overhead. So 228K
| context on H100.
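|
| For anyone curious what "form the final output in tiles" means
| mechanically, here is a rough single-head NumPy sketch of the
| online-softmax idea (an illustration only, not Flash Attention's
| or Unsloth's actual kernel; block size and shapes are made up):
|
|     import numpy as np
|
|     def blockwise_attention(Q, K, V, block=64):
|         # softmax(Q @ K^T / sqrt(d)) @ V computed one key/value block
|         # at a time, keeping a running max and running denominator so
|         # the full (S, S) score matrix is never materialized.
|         S, d = Q.shape
|         out = np.zeros_like(Q)
|         m = np.full((S, 1), -np.inf)   # running row-wise max
|         l = np.zeros((S, 1))           # running softmax denominator
|         for start in range(0, S, block):
|             Kb = K[start:start + block]
|             Vb = V[start:start + block]
|             s = Q @ Kb.T / np.sqrt(d)            # (S, block) tile only
|             m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
|             p = np.exp(s - m_new)
|             scale = np.exp(m - m_new)            # rescale old partial sums
|             l = l * scale + p.sum(axis=-1, keepdims=True)
|             out = out * scale + p @ Vb
|             m = m_new
|         return out / l
|
|     # sanity check against the naive full-matrix version
|     S, d = 256, 32
|     rng = np.random.default_rng(0)
|     Q, K, V = (rng.standard_normal((S, d)) for _ in range(3))
|     scores = Q @ K.T / np.sqrt(d)
|     w = np.exp(scores - scores.max(axis=-1, keepdims=True))
|     w /= w.sum(axis=-1, keepdims=True)
|     assert np.allclose(blockwise_attention(Q, K, V), w @ V)
|
| Peak extra memory here is S x block rather than S x S, which is
| the linear-memory behaviour mentioned above; the FLOPs are still
| O(S^2).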
| mehulashah wrote:
| This is one of the best explanations that I've seen on the topic.
| I wish there were more work, however, not on how Transformers work,
| but why they work. We are still figuring it out, but I feel that
| the exploration is not at all systematic.
| seydor wrote:
| I have found the youtube videos by CodeEmporium to be simpler to
| follow https://www.youtube.com/watch?v=Nw_PJdmydZY
|
| The Transformer is hard to describe with analogies, and TBF there
| is no good explanation why it works, so it may be better to just
| present the mechanism, "leaving the interpretation to the
| viewer". Also, it's simpler to describe dot products as vectors
| projecting onto one another.
| nerdponx wrote:
| > TBF there is no good explanation why it works
|
| My mental justification for attention has always been that the
| output of the transformer is a sequence of new token vectors
| such that each individual output token vector incorporates
| contextual information from the surrounding input token
| vectors. I know it's incomplete, but it's better than nothing
| at all.
| rcarmo wrote:
| You're effectively steering the predictions based on adjacent
| vectors (and precursors from the prompt). That mental model
| works fine.
| eurekin wrote:
| > TBF there is no good explanation why it works
|
| I thought the general consensus was: "transformers allow
| neural networks to have adaptive weights".
|
| As opposed to previous architectures, where every edge
| connecting two neurons always has the same weight.
|
| EDIT: a good video, where it's actually explained better:
| https://youtu.be/OFS90-FX6pg?t=750&si=A_HrX1P3TEfFvLay
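|
| One toy way to see the "adaptive weights" framing (a sketch with
| random projections, not a claim about the linked video): in a
| plain linear layer the mixing weights are fixed after training,
| while in attention the mixing weights are recomputed from the
| current input.
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     S, d = 4, 8
|     X = rng.standard_normal((S, d))            # token embeddings
|
|     # Fixed mixing: the same S x S matrix no matter what X is.
|     A_fixed = rng.standard_normal((S, S))
|     mixed_fixed = A_fixed @ X
|
|     # Attention: the S x S mixing matrix is computed from X itself.
|     Wq, Wk = rng.standard_normal((2, d, d))    # toy query/key projections
|     scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
|     A_adaptive = np.exp(scores - scores.max(axis=-1, keepdims=True))
|     A_adaptive /= A_adaptive.sum(axis=-1, keepdims=True)
|     mixed_adaptive = A_adaptive @ X            # weights depend on the input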
| mjburgess wrote:
| The explanation is just that NNs are a stat fitting alg
| learning a conditional probability distribution,
| P(next_word|previous_words). Their weights are a model of this
| distribution. LLMs are a hardware innovation: they make it
| possible for GPUs to compute this at scale across TBs of data.
|
| Why does 'mat' follow from 'the cat sat on the ...'? Because
| 'mat' is the most frequent word in the dataset; and the NN is a
| model of those frequencies.
|
| Why is 'London in UK' "known" but 'London in France' isn't? Just
| because 'UK' occurs much more frequently in the dataset.
|
| The algorithm isn't doing anything other than aligning
| computation to hardware; the computation isn't doing anything
| interesting. The value comes from the conditional probability
| structure in the data -- and that comes from people arranging
| words usefully, because they're communicating information with
| one another.
| nerdponx wrote:
| I think you're downplaying the importance of the
| attention/transformer architecture here. If it was "just" a
| matter of throwing compute at probabilities, then we wouldn't
| need any special architecture at all.
|
| P(next_word|previous_words) is ridiculously hard to estimate
| in a way that is actually useful. Remember how bad text
| generation used to be before GPT? There is innovation in
| discovering an architecture that makes it possible to learn
| P(next_word|previous_words), in addition to the computing
| techniques and hardware improvements required to make it
| work.
| mjburgess wrote:
| Yes, it's really hard -- the innovation is aligning the
| really basic dot-product similarity mechanism to hardware.
| You can use basically any NN structure to do the same task;
| the issue is that they're untrainable because they aren't
| parallelizable.
|
| There is no innovation here in the sense of a brand new
| algorithm for modelling conditional probabilities -- the
| innovation is in adapting the algorithm for GPU training on
| text/etc.
| bruce343434 wrote:
| I don't know why you seem to have such a bone to pick
| with transformers but imo it's still interesting to learn
| about it, and reading your dismissively toned drivel of
| "just" and "simply" makes me tired. You're barking up the
| wrong tree man, what are you on about.
| mjburgess wrote:
| No issue with transformers -- the entire field of
| statistical learning, from decision trees to NNs, does the
| same thing... there's no mystery here. No person with any
| formal training in mathematical finance, applied
| statistics, hard experimental sciences on complex
| domains... etc. would be taken in here.
|
| I'm trying my best to inform people who are interested in
| being informed, against an entire media ecosystem being
| played like a puppet-on-a-string by ad companies. The
| strategy of these companies is to exploit how easy is it
| to strap anthropomorphic interfaces over models of word
| frequencies and have everyone lose their minds.
|
| Present the same models as a statistical dashboard, and
| few would be so adamant that their sci-fi fantasy is the
| reality.
| divan wrote:
| Do you have blog or anything to follow?
| mjburgess wrote:
| I may start publishing academic papers in XAI as part of
| a PhD; if I do, I'll share somehow. The problem is the
| asymmetry of bullshit: the size of paper necessary for
| academics to feel that claims have been evidenced is
| book-length for critique but 2pg for "novel
| contributions".
| eutectic wrote:
| Different models have different inductive biases. There
| is no way you could build GPT4 with decision trees.
| jameshart wrote:
| "There's no mystery here"
|
| Nobody's claiming there's 'mystery'. Transformers are a
| well known, publicly documented architecture. This thread
| is about a video explaining exactly how they work - that
| they are a highly parallelizable approach that lends
| itself to scaling back propagation training.
|
| "No person with ... formal training ... would be taken in
| here"
|
| All of a sudden you're accusing someone of perpetuating a
| fraud - I'm not sure who though. "Ad companies"?
|
| Are you seriously claiming that there hasn't been a
| qualitative improvement in the results of language
| generation tasks as a result of applying transformers in
| the large language model approach? Word frequencies turn
| out to be a powerful thing to model!
|
| It's ALL just hype, none of the work being done in the
| field has produced any value, and everyone should... use
| 'statistical dashboards' (whatever those are)?
| fellendrone wrote:
| > models of word frequencies
|
| Ironically, your best effort to inform people seems to be
| misinformed.
|
| You're talking about a Markov model, not a language model
| with trained attention mechanisms. For a start,
| transformers can consider the entire context (which could
| be millions of tokens) rather than simple state to state
| probabilities.
|
| No wonder you believe people are being 'taken in' and
| 'played by the ad companies'; your own understanding
| seems to be fundamentally misplaced.
| saeranv wrote:
| I think they are accounting for the entire context, they
| specifically write out:
|
| >> P(next_word|previous_words)
|
| So the "next_word" is conditioned on "previous_words"
| (plural), which I took to mean the joint distribution of
| all previous words.
|
| But, I think even that's too reductive. The transformer
| is specifically not a function acting as some incredibly
| high-dimensional lookup table of token conditional
| probabilities. It's learning a (relatively) small amount
| of parameters to compress those learned conditional
| probabilities into a radically lower-dimensional
| embedding.
|
| Maybe you could describe this as a discriminative model
| of conditional probability, but at some point, we start
| describing that kind of information compression as
| semantic understanding, right?
| nerdponx wrote:
| It's reductive because it obscures just how complicated
| that `P(next_word|previous_words)` is, and it obscures
| the fact that "previous_words" is itself a carefully-
| constructed (tokenized & vectorized) representation of a
| huge amount of text. One individual "state" in this
| Markov-esque chain is on the order of an entire book, in
| the bigger models.
| mjburgess wrote:
| It doesn't matter how big it is, its properties don't
| change. E.g., it never says, "I like what you're wearing"
| because it likes what I'm wearing.
|
| It seems there's an entire generation of people taken in
| by this word, "complexity", as if it's just magic sauce
| that gets sprinkled over ad copy for big tech.
|
| We know what it means to compute P(word|words), we know
| what it means that P("the sun is hot") > P("the sun is
| cold") ... and we know that by computing this, you aren't
| actually modelling the temperature of the sun.
|
| It's just so disheartening how everyone becomes so
| anthropomorphically credulous here... can we not even get
| sun worship out of tech? Is it not possible for people to
| understand that conditional probability structures do not
| model mental states?
|
| No model of conditional probabilities over text tokens,
| no matter how many text tokens it models, ever says "the
| weather is nice in August" because it means the weather
| is nice in August. It has never been in an August, or in
| weather; nor does it have the mental states for
| preference or desire; nor has its text generation been
| caused by the August weather.
|
| This is extremely obvious: simply reflect on why the
| people who wrote those historical texts did so, and
| reflect on why an LLM generates this text, and you can
| see that even if an LLM produced, word for word, MLK's I
| Have a Dream speech, it does not have a dream. It has not
| suffered any oppression; nor organised any labour; nor
| made demands on the moral conscience of the public.
|
| This shouldn't need to be said to a crowd who can
| presumably understand what it means to take a
| distribution of text tokens and subset them. It doesn't
| matter how complex the weight structure of an NN is: this
| tells you only how compressed the conditional probability
| distribution is over many TBs of all of text history.
| drdeca wrote:
| Perhaps you have misunderstood what the people you are
| talking about, mean?
|
| Or, if not, perhaps you are conflating what they mean
| with something else?
|
| Something doesn't need to have had a subjective
| experience of the world in order to act as a model of
| some parts of the world.
| nerdponx wrote:
| You're tilting at windmills here. Where in this thread do
| you see anyone talking about the LLM as anything other
| than a next-token prediction model?
|
| Literally all of the pushback you're getting is because
| you're trivializing the choice of model architecture,
| claiming that it's all so obvious and simple and it's all
| the same thing in the end.
|
| Yes, of course, these models have to be well-suited to
| run on our computers, in this case GPUs. And sure, it's
| an interesting perspective that maybe they work well
| because they are well-suited for GPUs and not because
| they have some deep fundamental meaning. But you can't
| act like everyone who doesn't agree with your perspective
| is just an AI hypebeast con artist.
| kordlessagain wrote:
| Somebody's judgment weights need to be updated to include
| emoji embeddings.
| YetAnotherNick wrote:
| No. This is blatantly false. The belief that recurrent
| models can't be scaled is untrue. People have recently
| trained Mamba with billions of parameters. The
| fundamental reason why transformers changed the field is
| that they are a lot more scalable context-length-wise,
| and LSTMs, LRUs etc. don't come close.
| mjburgess wrote:
| > they are lot more scalable context length wise
|
| Sure, we're agreeing. I'm just being less specific.
| YetAnotherNick wrote:
| Scalable as in loss wise scalable, not compute wise.
| HarHarVeryFunny wrote:
| Yes, but pure Mamba doesn't perform as well as a
| transformer (and neither did LSTMs). This is why you see
| hybrid architectures like Jamba = Mamba + transformer.
| The ability to attend to specific tokens is really key,
| and what is lost in recurrent models where sequence
| history is munged into a single state.
| YetAnotherNick wrote:
| That's my point. It doesn't perform in terms of loss,
| even though it performs well enough in terms of compute.
| HarHarVeryFunny wrote:
| > Yes, it's really hard -- the innovation is aligning the
| really basic dot-product similarity mechanism to
| hardware. You can use basically any NN structure to do
| the same task, the issue is that they're untrainable
| because they arent parallizable.
|
| This is only partially true. I wouldn't say you could use
| *any* NN architecture for sequence-to-sequence
| prediction. You either have to model them as a
| potentially infinite sequence with an RNN of some sort
| (e.g. LSTM), or, depending on the sequence type, model
| them as a hierarchy of sub-sequences, using something
| like a multi-layered convolution or transformer.
|
| The transformer is certainly well suited to current
| massively parallel hardware architectures, and this was
| also a large part of the motivation for the design.
|
| While the transformer isn't the only way to do seq-2-seq
| with neural nets, I think the reason it is so successful
| is more than simply being scalable and well matched to
| the available training hardware. Other techniques just
| don't work as well. From the mechanistic interpretability
| work that has been done so far, it seems that learnt
| "induction heads", utilizing the key-based attention, and
| layered architecture, are what give transformers their
| power.
| JeremyNT wrote:
| > _There is innovation in discovering an architecture that
| makes it possible to learn P(next_word|previous_words), in
| addition to the computing techniques and hardware
| improvements required to make it work._
|
| Isn't that essentially what mjburgess said in the parent
| post?
|
| > _LLMs are a hardware innovation: they make it possible
| for GPUs to compute this at scale across TBs of data... The
| algorithm isn't doing anything other than aligning
| computation to hardware_
| nerdponx wrote:
| Not really, and no. Torch and CUDA align computation to
| hardware.
|
| If it were just a matter of doing that, we would be fine
| with fully-connected MLP. And maybe that would work with
| orders of magnitude more data and compute than we
| currently throw at these models. But we are already
| pushing the cutting edge of those things to get useful
| results out of the specialized architecture.
|
| Choosing the right NN architecture is like feature
| engineering: the exact details don't matter that much,
| but getting the right overall structure can be the
| difference between learning a working model and failing
| to learn a working model, _from the same source data_
| with the same information content. Clearly our choice of
| inductive bias matters, and the transformer architecture
| is clearly an improvement over other designs.
|
| Surely you wouldn't argue that a CNN is "just" aligning
| computation to hardware, right? Transformers are clearly
| showing themselves as a reliably effective model
| architecture for text in the same way that CNNs are
| reliably effective for images.
| mjburgess wrote:
| Err... no. MLPs are fundamentally sequential algorithms
| (backprop weight updating). All major innovations in NN
| design have been to find ways of designing the
| architecture to fit GPU compute paradigms.
|
| It was an innovation, in the 80s, to map image structure
| to the weight structure that underpins CNNs. That isn't
| what made CNNs trainable though... that was AlexNet; just
| go read the paper... it's pretty upfront about how the NN
| architecture is designed to fit the GPU... that's the
| point of it.
| seydor wrote:
| People specifically would like to know what the attention
| calculations add to this learning of the distribution
| ffwd wrote:
| Just speculating but I think attention enables
| differentiation of semantic concepts for a word or sentence
| within a particular context. Like for any total set of
| training data you have a lesser number of semantic concepts
| (like let's say you have 10000 words, then it might contain
| 2000 semantic concepts, and those concepts are defined by
| the sentence structure and surrounding words, which is why
| they have a particular meaning), and then attention allows
| it to differentiate those different contexts at different
| levels (words/etc). Also the fact you can do this attention
| at runtime/inference means you can generate the context
| from the prompt, which enables the flexibility of variable
| prompt/variable output but you lose the precision of giving
| an exact prompt and getting an exact answer
| ffwd wrote:
| I'm not one to whine about downvotes but I just have to
| say, it's a bad feeling when I can't even respond to the
| negative feedback because there is no accompanying
| comment. Did I misinterpret something? Did you? Who will
| ever know when there is no information. :L
| forrestthewoods wrote:
| I find this take super weak sauce and shallow.
|
| This recent $10,000 challenge is super super interesting
| imho.
| https://twitter.com/VictorTaelin/status/1778100581837480178
|
| State of the art models are doing more than "just" predicting
| the probability of the next symbol.
| mjburgess wrote:
| You underestimate the properties of the sequential-
| conditional structure of human communication.
|
| Consider how a clever 6yo could fake being a physicist with
| access to a library of physics textbooks and a shredder.
| All the work is done for them. You'd need to be a physicist
| to spot them faking it.
|
| Of course, LLMs are in a much better position than having
| shredded physics textbooks -- they have shreddings of all
| books. So you actually have to try to expose this process,
| rather than just gullibly prompt using confirmation bias.
| It's trivial to show they work this way, both formally and
| practically.
|
| The issue is, practically, gullible people aren't trying.
| astrange wrote:
| You can program algorithms into transformer networks, up
| to the limit of how many computations you get.
|
| https://srush.github.io/raspy/
|
| Are you going to do computer reductionism too and say
| computers can't do arithmetic, they just run electricity
| through silicon?
| forrestthewoods wrote:
| I don't find your model either convincing or useful.
| nextaccountic wrote:
| > Why does 'mat' follow from 'the cat sat on the ...'?
| Because 'mat' is the most frequent word in the dataset; and
| the NN is a model of those frequencies.
|
| What about cases that are not present in the dataset?
|
| The model must be doing _something_ besides storing raw
| probabilities to avoid overfitting and enable generalization
| (imagine that you could have a very performant model - when
| it works - but it sometimes would spew "Invalid input, this
| was not in the dataset so I don't have a conditional
| probability and I will bail out")
| albertzeyer wrote:
| You are more speaking about n-gram models here. NNs do far
| more than that.
|
| Or if you just want to say that NNs are used as a statistical
| model here: Well, yea, but that doesn't really tell you
| anything. Everything can be a statistical model.
|
| E.g., you could also say "this is exactly the way the human
| brain works", but it doesn't really tell you anything how it
| really works.
| mjburgess wrote:
| My description is true of any statistical learning
| algorithm.
|
| The thing that people are looking to for answers, the NN
| itself, does not have them. That's like looking to Newton's
| compass to understand his general law of gravitation.
|
| The reason that LLMs trained on the internet and every
| ebook has the structure of human communication is because
| the dataset has that structure. Why does the data have that
| structure? this requires science, there is no explanation
| "in the compass".
|
| NNs are statistical models trained on data -- drawing
| analogies to animals is a mystification that causes
| people's ability to think clearly to jump out the
| window. No one compares stock price models to the human
| brain; no banking regulator says, "well your volatility
| estimates were off because your machines had the wrong
| thoughts". This is pseudoscience.
|
| Animals are not statistical learning algorithms, so the
| reason that's uninformative is because it's false. Animals
| are in direct causal contact with the world and uncover its
| structure through interventional action and counterfactual
| reasoning. The structure of animal bodies, and the general
| learning strategies are well-known, and having nothing to
| do with LLMs/NNs.
|
| The reason that I know "The cup is in my hand" is not
| because P("The cup is in my hand"|HistoricalTexts) > P(not
| "The cup is in my hand"|HistoricalTexts)
| Demlolomot wrote:
| If learning in real life over 5-20 years shows the same
| result as an LLM being trained on billions of tokens, then
| yes, it can be compared.
|
| And there are a lot of people out there who do not do a
| lot of reasoning.
|
| After all, optical illusions exist; our brain generalizes.
|
| The same thing happens with words, like the riddle about
| the doctor operating on a child where we discover that the
| doctor is actually a woman.
|
| And while llms only use text, we can already see how
| multimodal models become better, architecture gets better
| and hardware too.
| mjburgess wrote:
| I don't know what your motivation in comparison is; mine
| is science, i.e., explanation.
|
| I'm not interested that your best friend emits the same
| words in the same order as an LLM; I'm more interested
| that he does so because he enjoys your company whereas the
| LLM does not.
|
| Engineers overstep their mission when they assume that
| because you can substitute one thing for another, and
| sell a product in doing so, that this is informative. It
| isn't. I'm not interested in whether you can replace the
| sky with a skybox and have no one notice -- who cares?
| What might fool an ape is _everything_, and what that
| matters for science is _nothing_.
| Demlolomot wrote:
| My thinking is highly influenced by brain research.
|
| We don't just talk about an LLM; we talk about a neural
| network architecture.
|
| There is a direct link to us (neural networks).
| vineyardmike wrote:
| > The reason that I know "The cup is in my hand" is not
| because P("The cup is in my hand"|HistoricalTexts) >
| P(not "The cup is in my hand"|HistoricalTexts)
|
| I mostly agree with your points, but I still disagree
| with this premise. Humans (and other animals) absolutely
| are statistical reasoning machines. They're just advanced
| ones which can process more than "text" - they're multi-
| modal.
|
| As a super dumb-simple set of examples: Think about the
| origin of the phrase "Cargo Cult" and similar religious
| activities - people will absolutely draw conclusions
| about the world based on their learned observations.
| Intellectual "reasoning" (science!) really just relies on
| more probabilities or correlations.
|
| The reason you know the cup is in your hand is because
| P("I see a cup and a hand"|HistoryOfEyesight) + P("I feel
| a cylinder shape"|HistoryOfTactileFeeling) + .... >
| P(Inverse). You can pretend it's because humans are
| intelligent beings with deep reasoning skills (not trying
| to challenge your smarts here!), but humans learn through
| trial and error just like a NN with reinforcement
| learning.
|
| Close your eyes and ask a person to randomly place either
| a cup from your kitchen in your hand or a different
| object. You can probably tell which one it is. Why?
| Because you have learned what it feels like, and learned
| from countless examples of different cups, over years
| of passive practice. That's basically deep learning.
| mjburgess wrote:
| I mean something specific by "statistics": modelling
| frequency associations in static ensembles of data.
|
| Having a body which changes over time that interacts with
| a world that changes over time makes animal learning not
| statistical (call it, say, experimental). That animals
| fall into skinner-box irrational behaviour can be
| modelled as a kind of statistical learning, but it
| actually isn't.
|
| It's a failure of ecological salience mechanisms in
| regulating the "experimental learning" that animals
| engage in. E.g., with the cargo cults, the reason they
| adopted that view was because their society had a "big
| man" value system based on material acquisition and
| western warring powers seemed Very Big and so were
| humiliating. In order to retain their status they adopted
| (apparently irrational) theories of how the world worked
| (gods, etc).
|
| From the outside this process might seem statistical, but
| it's the opposite. Their value system made material
| wealth have a different causal salience which was useful
| in their original ecology (a small island with small
| resources), but it went haywire when faced with the whole
| world.
|
| Eventually these mechanisms update with this new
| information, or the tribe dies off -- but what's going
| wrong here is that the very very non-statistical learning
| ends up describable that way.
|
| This is indeed why we should be very concerned about
| people skinner-boxing themselves with LLMs.
| vineyardmike wrote:
| > Having a body which changes over time that interacts
| with a world that changes over time makes animal learning
| not statistical (call it, say, experimental).
|
| The "experiment" of life is what defines the statical
| values! Experimentation is just learning what the
| statistical output of something is.
|
| If I hand you a few dice, you'd probably be able to guess
| the statistical probability of every number for a given
| roll. Because you've learned that through years of
| observation building a mental model. If I hand you a
| weighted die, suddenly your mental model is gone, and you
| can re-learn experimentally by rolling it a bunch. How
| can you explain experimental learning except
| "statistically"?
|
| > they adopted (apparently irrational) theories of how
| the world worked (gods, etc)
|
| They can be wrong without being irrational. Building an
| airport doesn't make planes show up, but planes won't
| show up without an airport. If you're an island nation
| with little understanding of the global geopolitical
| environment of WWII, you'd have no idea why planes
| started showing up on your island, but they keep showing
| up, and only at an airport. It seems rational to assume
| they'd continue showing up to airports.
|
| > that animals fall into skinner-box irrational behaviour
| can be modelled as a kind of statistical learning, but it
| actually isnt
|
| What is it if not statistical?
|
| Also skinner boxes are, in a way, perfectly rational.
| There's no way to understand the environment, and if
| pushing a button feeds you, then rationally you should
| push the button when hungry. Humans like to think we're
| smart because we've invented deductive reasoning, and we
| quote "correlation is not causation" that we're not just
| earning to predict the world around us from past
| experiences.
| mjburgess wrote:
| For dice the ensemble average is the time-average: if you
| roll the dice 1000 times the probability of getting a
| different result doesn't change.
|
| For almost everything in the world, action on it, changes
| it. There are vanishingly few areas where this isn't the
| case (most physics, most chemistry, etc.).
|
| Imagine trying to do statistics but every time you
| sampled from reality the distribution of your sample
| changes not due to randomness, but because reality has
| changed. Now, can you do statistics? No.
|
| It makes all the difference in the world to have a body
| and hold the thing you're studying. Statistics is trying
| to guess the shape of the ice cube from the puddle;
| animal learning is making ice cubes.
| data_maan wrote:
| > Having a body which changes over time that interacts
| with a world that changes over time makes animal learning
| not statistical (call it, say, experimental). That
| animals fall into skinner-box irrational behaviour can be
| modelled as a kind of statistical learning, but it
| actually isn't.
|
| RL is doing just this, simulating an environment. And we
| can have an agent "learn" in that environment.
|
| I think tying learning to a body is too restrictive. The
|
| You strongly rely on the assumption that "something else"
| generates the statistics we observe, but scientifically,
| there exists little evidence whether that "something
| else" exists (see eg the Bayesian brain).
| cornholio wrote:
| > "this is exactly the way the human brain works"
|
| I'm always puzzled by such assertions. A cursory look at
| the technical aspects of an iterated attention - perceptron
| transformation clearly shows it's just a convoluted and
| powerful way to query the training data, a "fancy" Markov
| chain. The only rationality it can exhibit is that which is
| already embedded in the dataset. If trained on nonsensical
| data it would generate nonsense and if trained with a
| partially non-sensical dataset it will generate an average
| between truth and nonsense that maximizes some abstract
| algorithmic goal.
|
| There is no knowledge generation going on, no rational
| examination of the dataset through the lens of an internal
| model of reality that allows the rejection of invalid
| premises. The intellectual food is already chewed and
| digested in the form of the training weights, with the model
| just mechanically extracting the nutrients, as opposed to
| venturing into the outside world to hunt.
|
| So if it works "just like the human brain", it does so in a
| very remote sense, just like a basic neural net works "just
| like the human brain", i.e individual biological neurons
| can be said to be somewhat similar.
| pas wrote:
| If a human spends the first 30 years of their life in a
| cult they will be also speaking nonsense a lot - from our
| point of view.
|
| Sure, we have a nice inner loop, we do some pruning,
| picking and choosing, updating, weighting things based on
| emotions, goals, etc.
|
| Who knows how complicated those things will prove to
| model/implement...
| michaelt wrote:
| That's not really an explanation that tells people all that
| much, though.
|
| I can explain that car engines 'just' convert gasoline into
| forward motion. But if the person hearing the explanation
| is hoping to learn what a cam belt or a gearbox is, or why
| cars are more reliable now than they were in the 1970s, or
| what premium gas is for, or whether helicopter engines work
| on the same principle - they're going to need a more detailed
| explanation.
| mjburgess wrote:
| It explains the LLM/NN. If you want to explain why it emits
| words in a certain order you need to explain how reality
| generated the dataset, ie., you need to explain how people
| communicate (and so on).
|
| There is no mystery why an NN trained on the night sky
| would generate nightsky-like photos; the mystery is why
| those photos have those patterns... solving that is called
| astrophysics.
|
| Why do people, in reasoning through physics problems, write
| symbols in a certain order? Well, explain physics,
| reasoning, mathematical notation, and so on. The ordering
| of the symbols gives rise to a certain utility of
| imitating that order -- but it isn't explained by that
| order. That's circular: "LLMs generate text in the order
| they do, because that's the order of the text they were
| given"
| michaelt wrote:
| That leaves loads of stuff unexplained.
|
| If the LLM is capable of rewording the MIT license into a
| set of hard-hitting rap battle lyrics, but the training
| dataset didn't contain any examples of anyone doing that,
| is the LLM therefore capable of producing output beyond
| the limits of its training data set?
|
| Is an LLM inherently constrained to mediocrity? If an LLM
| were writing a novel, does its design force it to produce
| cliche characters and predictable plotlines? If applied
| in science, are they inherently incapable of advancing
| the boundaries of human knowledge?
|
| Why transformers instead of, say, LSTMs?
|
| Must attention be multi-headed? Why can't the model have
| a simpler architecture, allowing such implementation
| details to emerge from the training data?
|
| Must they be so big that leading performance is only in
| the hands of multi-billion-dollar corporations?
|
| What's going on with language handling? Are facts learned
| in an abstract enough way that they can cross language
| barriers? Should a model produce different statements of
| fact when questioned in different languages? Does France
| need a French-language LLM?
|
| Is it reasonable to expect models to perform basic
| arithmetic accurately? What about summarising long
| documents?
|
| Why is it that I can ask questions with misspellings, but
| get answers with largely correct spelling? If
| misspellings were in the training data, why aren't they
| in the output? Does the cleverness that stops LLMs from
| learning misspellings from the training data also stop
| them from learning other common mistakes?
|
| If LLMs can be trained to be polite despite having
| examples of impoliteness in their training data, can they
| also be trained to not be racist, despite having examples
| of racism in their training data?
|
| Can a model learn a fact that is very rarely present in
| the training data - like an interesting result in an
| obscure academic paper? Or must a fact be widely known
| and oft-repeated in order to be learned?
|
| Merely saying "it predicts the next word" doesn't really
| explain much at all.
| mjburgess wrote:
| Which conditional probability sequences can be exploited
| for engineering utility cannot be known ahead of time;
| nor is it explained by the NN. It's explained by
| investigating how the data was created by people.
|
| Train a NN to generate pictures of the nightsky: which
| can be used for navigation? Who knows, ahead of time. The
| only way of knowing is to have an explanation of how the
| solar system works and then check the pictures are
| accurate enough.
|
| The NN which generates photos of the nightsky has nothing
| in it that explains the solar system, nor does any aspect
| of an NN model the solar system. The photos it was
| trained on happened to have their pixels arranged in that
| order.
|
| Why those arrangements occur is explained by
| astrophysics.
|
| If you want to understand what ChatGPT can do, you need
| to ask OpenAI for their training data and then perform
| scientific investigations of its structure and how that
| structure came to be.
|
| Talking in terms of the NN model is propaganda and
| pseudoscience: the NN didn't arrange the pixels, gravity
| did. Likewise, the NN isn't arranging rap lyrics in that
| order because it's rapping: singers are.
|
| There is no actual mystery here. It's just that we are
| prevented from accessing the data by OpenAI, and struggle
| to explain the reality which generated that data -- which
| requires years of actual science.
| pas wrote:
| It has a lot of things already encoded regarding the
| solar system, but it cannot really access it, it cannot -
| as far as I know - run functions on its own internal
| encoded data, right? If it does something like that, it's
| because it learned that higher-level pattern based on
| training data.
|
| The problem with NN arrangements in general is that we
| don't know if it's actually pulling out some exact
| training data (or a useful so-far-unseen pattern from the
| data!) or it's some distorted confabulation. (Clever Hans
| all over again: if I ask ChatGPT to code me a nodeJS IMAP
| backup program it does, but the package it gleefully
| imports/require()s is made up.)
|
| And while the typical artsy arts have loose rules, where
| making up new shit based on what people wish for is
| basically the only one, in other contexts that's a hard
| no-no.
| IanCal wrote:
| This is wrong, or at least a simplification to the point of
| removing any value.
|
| > NNs are a stat fitting alg learning a conditional
| probability distribution, P(next_word|previous_words).
|
| They are trained to maximise this, yes.
|
| > Their weights are a model of this distribution.
|
| That doesn't really follow, but let's leave that.
|
| > Why does, 'mat' follow from 'the cat sat on the ...'
| because 'mat' is the most frequent word in the dataset; and
| the NN is a model of those frequencies.
|
| Here's the rub. If how you describe them is all they're doing
| then a sequence of never-before-seen words would have _no_
| valid response. All words would be equally likely. It would
| mean that a single brand new word would result in absolute
| gibberish following it as there's nothing to go on.
|
| Let's try:
|
| Input: I have one kjsdhlisrnj and I add another kjsdhlisrnj,
| tell me how many kjsdhlisrnj I now have.
|
| Result: You now have two kjsdhlisrnj.
|
| I would wager a solid amount that kjsdhlisrnj never appears
| in the input data. If it does pick another one, it doesn't
| matter.
|
| So we are learning something _more general_ than the
| frequencies of sequences of tokens.
|
| I always end up pointing to this but OthelloGPT is very
| interesting https://thegradient.pub/othello/
|
| While it's _trained_ on sequences of moves, what it _does_ is
| more than just "sequence a,b,c is followed by d most often"
| mjburgess wrote:
| Any NN "trained on" data sampled from an abstract complete
| outcome space (eg., a game with formal rules; mathematical
| sequences, etc) can often represent that space completely.
| It comes down to whether you can form conditional
| probability models of the rules, and that's usually
| possible because that's what _abstract_ rules are.
|
| > I have one kjsdhlisrnj and I add another kjsdhlisrnj,
| tell me how many kjsdhlisrnj I now have.
|
| 1. P(number-word|tell me how many...) > P(other-kinds-of-
| words|tell me how many...)
|
| 2. P(two|I have one ... I add another ...) > P(one|...) >
| P(three|...) > others
|
| This is trivial.
| IanCal wrote:
| Right, learning more abstract rules about how things work
| is the goal and where the value comes in. Not all
| algorithms are able to do this, even if they can do what
| you describe in your first comment.
|
| That's why they're interesting, othellogpt is interesting
| because it builds a world model.
| mjburgess wrote:
| It builds a model of a "world" whose structure is
| conditional probabilities, this is circular. It's like
| saying you can use a lego model to build a model of
| another lego model. All the papers which "show" NNs
| building "world" models arent using any world. It's lego
| modelling lego.
|
| The lack of a world model only matters when the data NNs
| are trained on aren't valid measures of the world that
| data is taken to model. All the moves of a chess game are
| a complete model of chess. All the books ever written
| aren't a model of, well, anything -- the structure of the
| universe isn't the structure of text tokens.
|
| The only reason _all_ statistical algorithms, including
| NNs, appear to model the actual world is because patterns
| in data give this appearance: P(The Sun is Hot) > P(The
| Sun is Cold) -- there is no model of the sun here.
|
| The reason P("The Sun is Hot") seems to model the sun, is
| because we can read the english words "sun" and "hot" --
| it is we who think the machine which generates this text
| does so semantically... but the people who wrote that
| phrase in the dataset did so; the machine is just
| generating "hot" because of that dataset.
| IanCal wrote:
| Othellogpt is fed only moves and builds a model of the
| current board state in its activations. It never sees a
| board.
|
| > It's like saying you can use a lego model to build a
| model of another lego model.
|
| No it's like using a description of piece placements and
| having a picture in mind about what the current model
| looks like.
| mjburgess wrote:
| The "board" is abstract. Any game of this sort is defined
| by a series of conditional probabilities:
|
| {P(Pawn_on_sqare_blah|previous_moves) ... etc.}
|
| What all statistical learning algorithms model is sets of
| conditional probabilities. So any stat alg is a model of
| a set of these rules... that's the "clay" of these
| models.
|
| The problem is the physical world isn't anything like
| this. The reason I say, "I liked that TV show" is because
| I had a series of mental states caused by the TV show
| over time (and so on). This isn't representable as a set
| of conditional probs in the same way.
|
| You could imagine, at the end of history, there being a
| total set of all possible conditional probabilities: P(I
| liked show|my_mental_states, time, person, location,
| etc.) -- this would be uncomputable, but it could be
| supposed.
|
| If you had that dataset then yes, NNs would learn the
| entire structure of the world, because _that's the
| dataset_. The problem is that the world cannot be
| represented in this fashion, not that NNs couldn't model
| it if it could be. A decision tree could, too.
|
| P(I liked the TV show) doesn't follow from any dataset
| ever collected. It follows from my mental states. So no
| NN can ever model it. They can model frequency
| associations of these phrases in historical text
| documents: this isn't a model of the world.
| IanCal wrote:
| > Any game of this sort is defined by a series of
| conditional probabilities:
| {P(Pawn_on_sqare_blah|previous_moves) ... etc.}
|
| That would always be 1 or 0, but also _that data is not
| fed into othellogpt_. That is not the dataset. It is not
| fed in board states at all.
|
| It learns it, but it is _not the dataset_.
| mjburgess wrote:
| It is the dataset. When you're dealing with abstract
| objects (ie., mathematical spaces), they are all
| isomorphic.
|
| It doesn't matter if you "feed in" 1+1+1+1 or 2+2 or
| sqrt(16).
|
| The rules of chess are encoded either as explicit rules or
| by contrast classes of valid/invalid games. These are
| equivalent formulations.
|
| When you're dealing with text tokens it does matter if
| "Hot" frequently comes after "The Sun is..." because
| reality isn't an abstract space, and text tokens aren't
| measures of it.
| IanCal wrote:
| > It is the dataset.
|
| No. A series of moves alone provides strictly less
| information than a board state or state + list of rules.
| mjburgess wrote:
| If the NN learns the game, that is itself an existence
| proof of the opposite, (by obvious information-theoretic
| arguments).
|
| Training is supervised, so you don't need bare sets of
| moves to encode the rules; you just need a way of
| subsetting the space into contrast classes of
| valid/invalid.
|
| It's a lie to say the "data" is the moves, the data is
| the full outcome space: ({legal moves}, {illegal moves})
| where the moves are indexed by the board structure
| (necessarily, since moves are defined by the board
| structure -- it's an _abstract_ game). So there are two
| deceptions here: (1) supervision structures the training
| space; and (2) the individual training rows have
| sequential structure which maps to board structure.
|
| Complete information about the game is provided to the
| NN.
|
| But let's be clear, the othellogpt still generates
| illegal moves -- showing that it does not learn the
| binary conditional structure of the actual game.
|
| The deceptiveness of training a NN on a game whose rules
| are conditional probability structures and then claiming
| the very-good-quality conditional probability structures
| it finds are "World Models" is... maddening.
|
| This is all just fraud to me; frauds dressing up other
| frauds in transparent clothing. LLMs trained on the
| internet are being sold as approximating the actual
| world, not 8x8 boardgames. I have nothing polite to say
| about any of this
| IanCal wrote:
| > It's a lie to say the "data" is the moves, the data is
| the full outcome space: ({legal moves}, {illegal moves})
|
| There is nothing about illegal moves provided to
| othellogpt as far as I'm aware.
|
| > Complete information about the game is provided to the
| NN.
|
| That is not true. Where is the information that there are
| two players provided? Or that there are two colours? Or
| how the colours change? Where is the information about
| invalid moves provided?
|
| > But let's be clear, the othellogpt still generates
| illegal moves -- showing that it does not learn the
| binary conditional structure of the actual game.
|
| Not perfectly, no. But that's not at all required for my
| point, though it is relevant if you try to use the fact it
| learns to play the game as proof that moves provide all
| information about legal board states.
| mjburgess wrote:
| How do you think the moves are represented?
|
| All abstract games of this sort are just sequences of bit
| patterns, each pattern related to the full legal space by
| a conditional probability structure (or, equivalently, as
| set ratios).
|
| Strip away all the NN b/s and anthropomorphic language
| and just represent it to yourself using bit sets.
|
| Then ask: how hard is it to approximate the space from
| which these bit sets are drawn using arbitrarily deep
| conditional probability structures?
|
| it's trivial
|
| the problem the author sets up about causal structures in
| the world cannot be represented as a finite sample of bit
| set sequences -- and even if it could, that isn't the data
| being used
|
| the author hasn't understood the basics of what the
| 'world model' problem even is
| pas wrote:
| how does it work underneath?
|
| "kjsdhlisrnj" is in the context, it gets tokenized, and now
| when the LLM is asked to predict/generate next-token
| sequences somehow "kjsdhlisrnj" is there too. it learns
| patterns. okay sure, they get encoded somehow, but during
| inference how does this lead to application of a recalled
| pattern on the right token(s)?
|
| also, can it invent new words?
| empath-nirvana wrote:
| Yes. This is one of the first things people did with
| chatgpt:
|
| can you help me invent a language for a fictional setting
| with talking frogs. Give me a sample sentence that frogs
| might say in their language, along with a translation.
|
| ChatGPT: Creating a language for a fictional setting can
| be a fun and imaginative task! For your setting with
| talking frogs, we'll consider the environment and
| characteristics that might influence their language.
| Here's a brief outline of the linguistic features and a
| sample sentence:
|
| Linguistic Features of the Frog Language
|
| Phonetics: Since frogs are known for their croaks and
| ribbits, their language might include a lot of guttural
| and clicking sounds, which can be represented in writing
| with characters like "k", "g", "q", and clicks like "!"
| or "|=".
|
| Syntax: We might imagine that frogs are direct and to the
| point, so their sentence structure could be subject-verb-
| object (SVO), similar to English, for straightforward
| communication.
|
| Vocabulary: Words would likely relate to their daily life
| --water, flies, lily pads, weather changes, predators,
| etc.
|
| Grammar: Simplified grammar with minimal use of tenses
| and conjugations to reflect an emphasis on the present
| moment, which is typical for many animalistic languages
| in fiction.
|
| Sample Language: Froglan
|   Froglii (hello)
|   Lok (water)
|   Krik (eat)
|   Zraak (fly - the insect)
|   !quu (danger)
|   |=ogg (jump)
|
| Sample Sentence in Froglan: "Froglii! Lok krik zraak
| |=ogg."
|
| Translation: "Hello! Jump eat fly water."
|
| This sentence, structured in a somewhat English syntax
| for ease of understanding, would literally mean that a
| frog is greeting another and then mentioning that it will
| jump towards water to eat flies. The sentence structure
| and vocabulary can be further refined and expanded based
| on how deeply you want to dive into the language
| creation!
| sirsinsalot wrote:
| It isn't some kind of Markov chain situation. Attention
| cross-links the abstract meaning of words, subtle
| implications based on context and so on.
|
| So, "mat" follows "the cat sat on the" where we understand
| the entire worldview of the dataset used for training; not
| just the next-word probability based on one or more previous
| words ... it's based on the probabilities of all the previous
| meanings, and the probabilities of those meanings in turn, and
| so on.
| fellendrone wrote:
| > Why does, 'mat' follow from 'the cat sat on the ...'
|
| You're confidently incorrect by oversimplifying all LLMs to a
| base model performing a completion from a trivial context of
| 5 words.
|
| This is tantamount to a straw man. Not only do few people use
| untuned base models, it completely ignores in-context
| learning that allows the model to build complex semantic
| structures from the relationships learnt from its training
| data.
|
| Unlike base models, instruct and chat fine-tuning teaches
| models to 'reason' (or rather, perform semantic calculations
| in abstract latent spaces) with their "conditional
| probability structure", as you call it, to varying extents.
| The model must learn to use its 'facts', understand
| semantics, and perform abstractions in order to follow
| arbitrary instructions.
|
| You're also conflating the training metric of "predicting
| tokens" with the mechanisms required to satisfy this metric
| for complex instructions. It's like saying "animals are just
| performing survival of the fittest". While technically
| correct, complex behaviours evolve to satisfy this 'survival'
| metric.
|
| You could argue they're "just stitching together phrases",
| but then you would be varying degrees of wrong:
|
| For one, this assumes phrases are compressed into
| semantically addressable units, which is already a form of
| abstraction ripe for allowing reasoning beyond 'stochastic
| parroting'.
|
| For two, it's well known that the first layers perform basic
| structural analysis such as grammar, and later layers perform
| increasing levels of abstract processing.
|
| For three, it shows a lack of understanding of how
| transformers perform semantic computation in-context from the
| relationships learnt by the feed-forward layers. If you're
| genuinely interested in understanding the computation model
| of transformers and how attention can perform semantic
| computation, take a look here: https://srush.github.io/raspy/
|
| For a practical example of 'understanding' (to use the term
| loosely), give an instruct/chat tuned model the text of an
| article and ask it something like "What questions should this
| article answer, but doesn't?" This requires not just
| extracting phrases from a source, but understanding the
| context of the article on several levels, then reasoning
| about what the context is not asserting. Even comparatively
| simple 4x7B MoE models are able to do this effectively.
| astrange wrote:
| LLMs don't work on words, they work on sequences of subword
| tokens. "It doesn't actually do anything" is a common
| explanation that's clearly a form of cope, because you can't
| even explain why it can form complete words, let alone
| complete sentences.
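|
| For example, with OpenAI's tiktoken package (assuming you have it
| installed) you can watch a made-up string like the one upthread
| get split into known subword pieces:
|
|   import tiktoken
|
|   enc = tiktoken.get_encoding("gpt2")
|   ids = enc.encode("kjsdhlisrnj")
|   print(ids)                             # a handful of token ids
|   print([enc.decode([t]) for t in ids])  # the subword pieces it became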
| fspeech wrote:
| There are an infinite number of distributions that can fit
| the training data well (e.g., one that completely memorizes
| the data and therefore replicates the frequencies). The trick
| is to find the distributions that generalize well, and here
| the NN architecture is critical.
| blt wrote:
| As a computer scientist, the "differentiable hash table"
| interpretation worked for me. The AIAYN paper alludes to it by
| using the query/key/value names, but doesn't explicitly say the
| words "hash table". I guess some other paper introduced them?
| abotsis wrote:
| I think what made this so digestible for me were the animations.
| The timing, how they expand/contract and unfold while he's
| speaking... is all very well done.
| _delirium wrote:
| That is definitely one of the things he does better than most.
| He actually wrote his own custom animation library for math
| animations: https://github.com/3b1b/manim
| divan wrote:
| Also check out the community edition: https://www.manim.community
| rayval wrote:
| Here's a compelling visualization of the functioning of an LLM
| when processing a simple request: https://bbycroft.net/llm
|
| This complements the detailed description provided by 3blue1brown
| bugthe0ry wrote:
| When visualised this way, the scale of GPT-3 is insane. I can't
| imagine what GPT-4 would look like here.
| spi wrote:
| IIRC, GPT-4 would actually be a bit _smaller_ to visualize
| than GPT-3. Details are not public, but from the leaks GPT-4
| (at least, some by-now old version of it) was a mixture of
| experts, with each expert having around 110B parameters [1].
| So, while the total number of parameters is bigger than GPT-3
| (1800B vs. 175B), it is "just" 16 copies of a smaller (110B)
| parameter model. So if you wanted to visualize it in any
| meaningful way, the plot wouldn't grow bigger - or it would,
| if you included all different experts, but they are just
| copies of the same architecture with different parameters,
| which is not all that useful for visualization purposes.
|
| [1] https://medium.com/@daniellefranca96/gpt4-all-details-leaked...
| joaogui1 wrote:
| Mixture of Experts is not just 16 copies of a network, it's
| a single network where for the feed forward layers the
| tokens are routed to different experts, but the attention
| layers are still shared. Also there are interesting choices
| around how the routing works and I believe the exact
| details of what OpenAI is doing are not public. In fact I
| believe someone making a visualization of that would
| dispel a ton of myths around what MoEs are and how they
| work.
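|
| Roughly, a routed feed-forward layer looks something like this (a
| toy numpy sketch with made-up sizes and a naive top-2 router; the
| real routing, load balancing etc. are more involved and, as said,
| not public for GPT-4):
|
|   import numpy as np
|
|   def softmax(x, axis=-1):
|       x = x - x.max(axis=axis, keepdims=True)
|       e = np.exp(x)
|       return e / e.sum(axis=axis, keepdims=True)
|
|   def moe_ffn(tokens, router_w, experts, top_k=2):
|       # attention layers are shared; only this per-token FFN is
|       # routed, so each token only visits top_k experts
|       gates = softmax(tokens @ router_w, axis=-1)
|       out = np.zeros_like(tokens)
|       for i, tok in enumerate(tokens):
|           top = np.argsort(gates[i])[-top_k:]
|           w = gates[i][top] / gates[i][top].sum()
|           for weight, e in zip(w, top):
|               w1, w2 = experts[e]
|               out[i] += weight * (np.maximum(tok @ w1, 0) @ w2)
|       return out
|
|   d, n_experts = 16, 8
|   experts = [(np.random.randn(d, 4 * d), np.random.randn(4 * d, d))
|              for _ in range(n_experts)]
|   router_w = np.random.randn(d, n_experts)
|   print(moe_ffn(np.random.randn(5, d), router_w, experts).shape)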
| lying4fun wrote:
| amazing visualisation
| justanotherjoe wrote:
| It seems he brushes over the positional encoding, which for me
| was the most puzzling part of transformers. The way I understood
| it, positional encoding is much like dates. Just like dates,
| there are repeating minutes, hours, days, months...etc. Each of
| these values has a shorter 'wavelength' than the next. The values
| are then used to identify the position of each token. Like, 'oh,
| I'm seeing January 5th tokens. I'm January 4th. This means this is
| after me'. Of course the real pos. encoding is much smoother and
| doesn't have an abrupt end like dates/times, but I think this was
| the original motivation for positional encodings.
| nerdponx wrote:
| That's one way to think about it.
|
| It's a clever way to encode "position in sequence" as some kind
| of smooth signal that can be added to each input vector. You
| might appreciate this detailed explanation:
| https://towardsdatascience.com/master-positional-encoding-pa...
|
| Incidentally, you can encode dates (e.g. day of week) in a
| model as sin(day of week) and cos(day of week) to ensure that
| "day 7" is mathematically adjacent to "day 1".
| bjornsing wrote:
| This was the best explanation I've seen. I think it comes down to
| essentially two aspects: 1) he doesn't try to hide complexity and
| 2) he explains what he thinks is the purpose of each computation.
| This really reduces the room for ambiguity that ruins so many
| other attempts to explain transformers.
| Xcelerate wrote:
| As someone with a background in quantum chemistry and some types
| of machine learning (but not neural networks so much) it was a
| bit striking while watching this video to see the parallels
| between the transformer model and quantum mechanics.
|
| In quantum mechanics, the state of your entire physical system is
| encoded as a very high dimensional normalized vector (i.e., a ray
| in a Hilbert space). The evolution of this vector through time is
| given by the time-translation operator for the system, which can
| loosely be thought of as a unitary matrix U (i.e., a probability
| preserving linear transformation) equal to exp(-iHt), where H is
| the Hamiltonian matrix of the system that captures its "energy
| dynamics".
|
| From the video, the author states that the prediction of the next
| token in the sequence is determined by computing the next
| context-aware embedding vector from the last context-aware
| embedding vector _alone_. Our prediction is therefore the result
| of a linear state function applied to a high dimensional vector.
| This seems a lot to me like we have produced a Hamiltonian of our
| overall system (generated offline via the training data), then we
| reparameterize our particular subsystem (the context window) to
| put it into an appropriate basis congruent with the Hamiltonian
| of the system, then we apply a one step time translation, and
| finally transform the resulting vector back into its original
| basis.
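|
| (For the transformer half of that analogy, the final step really
| is that small. Roughly, with made-up shapes and ignoring the
| final layer norm:)
|
|   import numpy as np
|
|   d_model, vocab = 768, 50257
|   W_U = np.random.randn(d_model, vocab) * 0.02  # learned unembedding
|   h_last = np.random.randn(d_model)  # context-aware vector, last position
|
|   logits = h_last @ W_U              # linear map on one high-dim vector
|   p_next = np.exp(logits - logits.max())
|   p_next /= p_next.sum()             # softmax -> next-token distribution
|   print(p_next.shape, p_next.sum())  # (50257,) ~1.0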
|
| IDK, when your background involves research in a certain field,
| every problem looks like a nail for that particular hammer. Does
| anyone else see parallels here or is this a bit of a stretch?
| bdjsiqoocwk wrote:
| I think you're just describing a state machine, no? The fact
| that you encode the state in a vector and steps by matrices is
| an implementation detail...?
| Xcelerate wrote:
| Perhaps a probabilistic FSM describes the actual
| computational process better since we don't have a concept
| equivalent to superposition with transformers (I think?), but
| the framework of a FSM alone doesn't seem to capture the
| specifics of where the model/machine comes from (what I'm
| calling the Hamiltonian), nor how a given context window (the
| subsystem) relates to it. The change of basis that involves
| the attention mechanism (to achieve context-awareness) seems
| to align better with existing concepts in QM.
|
| One might model the human brain as a FSM as well, but I'm not
| sure I'd call the predictive ability of the brain an
| implementation detail.
| BoGoToTo wrote:
| | context window
|
| I actually just asked a question on the physics stack
| exchange that is semi relevant to this.
| https://physics.stackexchange.com/questions/810429/functiona...
|
| In my question I was asking about a hypothetical time-
| evolution operator that includes an analog of a light cone
| that you could think of as a context window. If you had a
| quantum state that was evolved through time by this
| operator then I think you could think of the speed of light
| being a byproduct of the width of the context window of
| some operator that progresses the quantum state forward by
| some time interval.
|
| Note I am very much hobbyist-tier with physics so I could
| also be way off base and this could all be nonsense.
| ricardobeat wrote:
| I'm way out of my depth here, but wouldn't such a
| function have to encode an amount of information/state
| orders of magnitude larger than the definition of the
| function itself?
|
| If this turns out to be possible, we will have found the
| solution to the Sloot mystery :D
|
| https://en.m.wikipedia.org/wiki/Sloot_Digital_Coding_System
| DaiPlusPlus wrote:
| The article references patent "1009908C2" but I can't
| find it in the Dutch patent site, nor Google Patent
| search.
|
| The rest of the article has "crank" written all over it;
| almost certainly investor fraud too - it'd be
| straightforward to fake the claimed smartcard video thing
| to a nontechnical observer - though not quite as
| egregious as Steorn Orbo or Theranos.
| feoren wrote:
| Not who you asked (and I don't quite understand everything)
| but I think that's about right, except in the continuous
| world. You pick an encoding scheme (either the Lagrangian or
| the Hamiltonian) to go from state -> vector. You have a
| "rules" matrix, very roughly similar to a Markov matrix, H,
| and (stretching the limit of my knowledge here) exp(-iHt)
| _very_ roughly "translates" from the discrete stepwise world
| to the continuous world. I'm sure that last part made more
| knowledgeable people cringe, but it's roughly in the right
| direction. The part I don't understand at all is the _-i_
| factor: exp(-it) just circles back on itself after t=2pi, so
| it feels like exp(-iHt) should be a periodic function?
| lagrange77 wrote:
| I only understand half of it, but it sounds very interesting.
| I've always wondered if the principle of stationary action
| could be of any help with machine learning, e.g. provide an
| alternative point of view / formulation.
| BoGoToTo wrote:
| I've been thinking about this a bit lately. If time is non-
| continuous then could you model the time evolution of the
| universe as some operator recursively applied to the quantum
| state of the universe? If each application of the operator
| progresses the state of the universe by a single Planck time,
| could we even observe a difference between that and a universe
| where time is continuous?
| BobbyTables2 wrote:
| I think Wolfram made news proposing something roughly along
| these lines.
|
| Either way, I find Planck time/energy to be a very spooky
| concept.
|
| https://wolframphysics.org/
| tweezy wrote:
| So one of the most "out there" non-fiction books I've read
| recently is called "Alien Information Theory". It's a wild
| ride and there's a lot of flat-out crazy stuff in it but it's
| a really engaging read. It's written by a computational
| neuroscientist who's obsessed with DMT. The DMT parts are
| pretty wild, but the computational neuroscience stuff is
| intriguing.
|
| In one part he talks about a thought experiment modeling the
| universe as a multidimensional cellular automaton. Where
| fundamental particles are nothing more than the information
| they contain. And particles colliding is a computation that
| tells that node and the adjacent nodes how to update their
| state.
|
| Way out there, and not saying there's any truth to it. But it
| was a really interesting and fun concept to chew on.
| andoando wrote:
| I'm working on a model to do just that :) The game of life
| is not too far off either.
| Gooblebrai wrote:
| You might enjoy his next book: Reality Switch.
| pas wrote:
| This sounds like the Bohmian pilot wave theory (which is a
| global formulation of QM). ... Which might not be that crazy,
| since spooky action at a distance is already a given. And in
| cosmology (or quantum gravity) some models describe a
| region of space based only on its surface. So in some sense the
| universe is much less information-dense than we think.
|
| https://en.m.wikipedia.org/wiki/Holographic_principle
| francasso wrote:
| I don't think the analogy holds: even if you forget all the
| preceding nonlinear steps, you are still left with just a
| linear dynamical system. It's neither complex nor unitary,
| which are two fundamental characteristics of quantum mechanics.
| cmgbhm wrote:
| Not a direct comment on the question but I had a math PhD as an
| intern before. One of his comments was that all this high-
| dimensional linear algebra stuff was super-advanced math in the
| 1900s and has plenty of room for new CS discovery.
|
| Didn't make the "what was going on then in physics " connection
| until now.
| shahbazac wrote:
| Is there a reference which describes how the current architecture
| evolved? Perhaps from a very simple core idea to the famous "all
| you need" paper?
|
| Otherwise it feels like lots of machinery created out of nowhere.
| Lots of calculations and very little intuition.
|
| Jeremy Howard made a comment on Twitter that he had seen various
| versions of this idea come up again and again - implying that
| this was a natural idea. I would love to see examples of where
| else this has come up so I can build an intuitive understanding.
| ollin wrote:
| karpathy gave a good high-level history of the transformer
| architecture in this Stanford lecture
| https://youtu.be/XfpMkf4rD6E?si=MDICNzZ_Mq9uzRo9&t=618
| HarHarVeryFunny wrote:
| Roughly:
|
| 1) The initial seq-2-seq approach used LSTMs - one to
| encode the input sequence, and one to decode the output
| sequence. It's amazing that this worked at all - encode a
| variable length sentence into a fixed size vector, then decode
| it back into another sequence, usually of different length
| (e.g. translate from one language to another).
|
| 2) There are two weaknesses of this RNN/LSTM approach - the
| fixed size representation, and the corresponding lack of
| ability to determine which parts of the input sequence to use
| when generating specific parts of the output sequence. These
| deficiencies were addressed by Bahdanau et al in an
| architecture that combined encoder-decoder RNNs with an
| attention mechanism ("Bahdanau attention") that looked at each
| past state of the RNN, not just the final one.
|
| 3) RNNs are inefficient to train, so Jakob Uszkoreit was
| motivated to come up with an approach that better utilized
| available massively parallel hardware, and noted that language
| is as much hierarchical as sequential, suggesting a layered
| architecture where at each layer the tokens of the sub-sequence
| would be processed in parallel, while retaining a Bahdanau-type
| attention mechanism where these tokens would attend to each
| other ("self-attention") to predict the next layer of the
| hierarchy. Apparently in the initial implementation the idea
| worked, but not better than other contemporary approaches
| (incl. convolution), but then another team member, Noam
| Shazeer, took the idea and developed it, coming up with an
| architecture (which I've never seen described) that worked much
| better, which was then experimentally ablated to remove
| unnecessary components, resulting in the original transformer.
| I'm not sure who came up with the specific key-based form of
| attention in this final architecture.
|
| 4) The original transformer, as described in the "attention is
| all you need" paper, still had a separate encoder and decoder,
| copying earlier RNN-based approaches, but the full encoder-
| decoder is unnecessary for language models: Google's BERT kept
| just the encoder, while OpenAI's GPT kept just the decoder
| component, which is what everyone uses today. With this
| decoder-only transformer architecture the input sentence is
| input into the bottom layer of the transformer, and transformed
| one step at a time as it passes through each subsequent layer,
| before emerging at the top. The representation at the last
| position of the input is what ends up getting transformed
| into the prediction of the next token.
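|
| To make the lineage concrete, here's roughly the difference
| between Bahdanau-style additive attention and the dot-product
| self-attention the transformer ended up with (a toy numpy sketch
| with made-up shapes, not either paper's exact formulation):
|
|   import numpy as np
|
|   def softmax(x, axis=-1):
|       x = x - x.max(axis=axis, keepdims=True)
|       e = np.exp(x)
|       return e / e.sum(axis=axis, keepdims=True)
|
|   # Bahdanau (2014): the decoder state scores each encoder state
|   # with a small MLP, then takes a weighted sum as "context"
|   def additive_attention(dec_state, enc_states, W1, W2, v):
|       scores = np.tanh(enc_states @ W1 + dec_state @ W2) @ v
|       return softmax(scores) @ enc_states
|
|   # Transformer (2017): every token scores every token with dot
|   # products, so a whole layer can be computed in parallel
|   def self_attention(X, Wq, Wk, Wv):
|       Q, K, V = X @ Wq, X @ Wk, X @ Wv
|       return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V
|
|   d = 8
|   enc, dec = np.random.randn(5, d), np.random.randn(d)
|   Ws = [np.random.randn(d, d) for _ in range(3)]
|   print(additive_attention(dec, enc, Ws[0], Ws[1],
|                            np.random.randn(d)).shape)       # (8,)
|   print(self_attention(np.random.randn(5, d), *Ws).shape)   # (5, 8)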
| krat0sprakhar wrote:
| Thank you for this summary! Very well explained. Any tips on
| what resources you use to keep updated on this field?
| HarHarVeryFunny wrote:
| Thanks. Mostly just Twitter, following all the companies &
| researchers for any new announcements, then reading any
| interesting papers mentioned/linked. I also subscribe to
| YouTube channels like Dwarkesh Patel (interviewer) and
| Yannic Kilcher (AI News), and search out YouTube interviews
| with the principals. Of course I also read any AI news here
| on HN, and sometimes there may be interesting information
| in the comments.
|
| There's a summary of social media AI news here, that
| sometimes surfaces something interesting.
|
| https://buttondown.email/ainews/archive/
| kordlessagain wrote:
| What I'm now wondering about is how intuition to connect
| completely separate ideas works in humans. I will have very
| strong intuition that something is true, but very little way to show
| it directly. Of course my feedback on that may be biased, but it
| does seem some people have "better" intuition than others.
| cs702 wrote:
| Fantastic work by Grant Sanderson, as usual.
|
| _Attention has won_.[a]
|
| It deserves to be more widely understood.
|
| ---
|
| [a] Nothing has outperformed attention so far, not even Mamba:
| https://arxiv.org/abs/2402.01032
| stillsut wrote:
| In training we learn (a) the embeddings and (b) the KQ/MLP
| weights.
|
| How well do Transformers perform given learned embeddings but
| only randomly initialized decoder weights? Do they produce word
| soup of related concepts? Anything syntactically coherent?
|
| Once a well-trained, high-dimensional representation of tokens is
| established, can they learn the KQ/MLP weights significantly faster?
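|
| One way to poke at that empirically (a hypothetical PyTorch
| sketch; the 'pretrained' embeddings here are just random
| stand-ins for weights lifted from a trained model):
|
|   import torch
|   import torch.nn as nn
|
|   vocab, d_model, n_layers = 50257, 768, 12
|
|   class TinyLM(nn.Module):
|       def __init__(self, pretrained_emb=None):
|           super().__init__()
|           # (a) embeddings: frozen pretrained vs learned from scratch
|           self.emb = (nn.Embedding.from_pretrained(pretrained_emb,
|                                                    freeze=True)
|                       if pretrained_emb is not None
|                       else nn.Embedding(vocab, d_model))
|           # (b) the KQ/MLP weights: randomly initialized blocks
|           layer = nn.TransformerEncoderLayer(d_model, nhead=12,
|                                              batch_first=True)
|           self.blocks = nn.TransformerEncoder(layer, n_layers)
|           self.unembed = nn.Linear(d_model, vocab, bias=False)
|
|       def forward(self, ids):
|           mask = nn.Transformer.generate_square_subsequent_mask(
|               ids.size(1))
|           return self.unembed(self.blocks(self.emb(ids), mask=mask))
|
|   # same data and steps with/without pretrained_emb, then compare
|   # loss curves and what sampling from the untrained blocks emits
|   model = TinyLM(pretrained_emb=torch.randn(vocab, d_model))
|   opt = torch.optim.AdamW((p for p in model.parameters()
|                            if p.requires_grad), lr=3e-4)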
___________________________________________________________________
(page generated 2024-04-15 23:00 UTC)