[HN Gopher] LLM Visualization
       ___________________________________________________________________
        
       LLM Visualization
        
       Author : plibither8
       Score  : 47 points
       Date   : 2023-12-03 06:08 UTC (16 hours ago)
        
 (HTM) web link (bbycroft.net)
 (TXT) w3m dump (bbycroft.net)
        
       | gryfft wrote:
        | Damn, this looks _phenomenal._ I've been wanting to do a deep
       | dive like this for a while-- the 3D model is a spectacular
       | pedagogic device.
        
         | quickthrower2 wrote:
         | Andrej Karpathy twisting his hands as he explains it is also a
          | great device. Not being sarcastic: when he explains it I
          | understand it for a good minute or two. Then I need to rewatch
          | as I forget (but that is just me)!
        
       | airesQ wrote:
       | Incredible work.
       | 
        | So much depth; initially I thought it was "just" a 3D model. The
       | animations are amazing.
        
       | reqo wrote:
       | I feel like visualizations like this are what is missing from
        | university curricula. Now imagine a professor going through each
        | animation, describing exactly what is happening; I am pretty sure
       | students would get a much more in-depth understanding!
        
         | Arson9416 wrote:
         | Isn't it amazing that a random person on the internet can
         | produce free educational content that trumps university
         | courses? With all the resources and expertise that universities
         | have, why do they get shown up all the time? Do they just not
         | know how to educate?
        
           | cmiller1 wrote:
           | Have you considered the considerably greater breadth of
           | content required for a full course, as well as the other
            | responsibilities of the people teaching them, such as testing,
            | public speaking, etc.?
        
             | dannyw wrote:
             | It probably would be massively beneficial to society and
             | progress if teaching professors could spend more time and
             | attention on teaching.
        
           | syntaxers wrote:
           | It's an incentives problem. At research universities,
           | promotion is contingent on research output, and teaching is
           | often seen as a distraction. At so-called teaching
           | universities, promotion (or even survival) is mainly
           | contingent on throughput, and not measures of pedagogical
           | outcome.
           | 
           | If you are a teaching faculty at a university, it is against
           | your own interests to invest time to develop novel teaching
           | materials. The exception might be writing textbooks, which
           | can be monetized, but typically are a net-negative endeavor.
        
             | tiffanyg wrote:
             | Unfortunately, this is a problem throughout our various
             | economies, at this point.
             | 
              | Professionalization of management, with its easy over-
              | reliance on the simplification of the quantitative - on
              | "metrics" - along with the scale this allows and the way
              | fundamental issues get obscured, tends to produce these
              | types of results. This is, of course, well known in
              | business schools, and efforts are generally made to ensure
              | graduates are aware of some of the downsides of "the
              | quantitative." Unsurprisingly, over time, there is a kind
              | of "forcing" that tends to drive these systems towards
              | results like the ones you describe.
             | 
             | It's usually the case that imposition of metrics,
             | optimization, etc. - "mathematical methods" - is quite
             | beneficial at first, but once systems are improved in
             | sensible ways based on insights gained through this, less
             | desirable behavior begins to occur. Multiple factors
             | including basic human psychology factor into this ... which
             | I think is getting beyond the scope of what's reasonable to
             | include in this comment.
        
           | 29athrowaway wrote:
           | Except for this man: Professor Robert Ghrist
           | 
           | https://www.youtube.com/c/ProfGhristMath
           | 
           | This person is amazing.
        
           | sgnelson wrote:
            | Because faculty are generally paid very poorly, have many
           | courses to teach, and what takes up more and more of their
           | time are the broken bureaucratic systems they have to deal
           | with.
           | 
           | Add that at research universities, they have to do research.
           | 
           | Also add in that at many schools, way too many students are
           | just there to clock in, clock out and get a piece of paper
           | that says they did it. Way too few are there to actually get
            | an education. This has very real consequences on the morale of
           | the instructors. When your students don't care, it's very
           | hard for you to care. If your students aren't willing to work
           | hard, why are you willing to work hard? Because you're paid
           | so well?
           | 
           | I know plenty of instructors who would love to do things like
            | this, but when are they going to? When are they going to
            | find the time to learn the skills necessary to build an
            | interactive web app? You think everyone outside of comp sci
            | and like disciplines just naturally knows how to build these
           | types of apps?
           | 
            | I could go on, but the tl;dr of it is: educators are
            | overworked, underpaid, and don't have enough time in the day.
        
           | erosenbe0 wrote:
           | I second the reply about incentives. Funding curriculum
           | materials and professional curriculum development is often
           | seen as more of a K-12 thing. There is not even enough at the
           | vocational level.
           | 
           | If big competitive grants and competitive salaries went to
           | people with demonstrated ability like the engineer of this
            | viz, there would be fewer STEM dropouts in colleges and more
            | summer learning! Also, in technical trades like green
            | construction, solar, HVAC, building retrofits, data center
            | operations, and the like, people would get farther and it
           | would be a more diverse bunch.
        
           | lusus_naturae wrote:
           | The person who made this went to university for maths
           | education (if I found the right profile).
        
           | webnrrd2k wrote:
           | This is the result of a single person on the internet, who
            | was not chosen randomly. It's not a fair characterization to
           | call this the product of some random person on the internet.
           | You can't just choose anyone on the internet at random and
           | get results this good.
           | 
           | Also, according to his home page, Mr. Bycroft earned a BSc in
           | Mathematics and Physics from the University of Canterbury in
           | 2012. It's true that this page isn't the direct result of a
           | professor or a university course, and it's also true that
           | it's not a completely separate thing. It seems clear that his
           | university education had a big role to play in creating this.
        
           | wrsh07 wrote:
           | You're betting on the hundreds of top university cs
           | professors to produce better content than the hundreds of
           | thousands of industry veterans or hobbyists...
           | 
           | Why does YouTube sometimes have better content than
           | professionally produced media? It's a really long tail of
           | creators and educators
        
             | physPop wrote:
             | This isn't new. Textbooks exist for the same reason, so we
             | don't need to duplicate effort creating teaching materials
              | and can have a kind of accepted core curriculum.
        
         | jwilber wrote:
         | If you enjoy interactive visuals of machine learning
         | algorithms, do check out https://mlu-explain.github.io/
        
       | smy20011 wrote:
       | Really cool!
        
       | thefourthchime wrote:
       | First off, this is fabulous work. I went through it for the Nano,
       | but is there a way to do the step-by-step for the other LLMs?
        
         | Heidaradar wrote:
          | Below the title, there are a few others you can choose from
         | (GPT-2 small and XL and GPT-3)
        
       | valdect wrote:
        | Super cool! It's always nice to look at something concrete.
        
       | warkanlock wrote:
        | This is an excellent tool for seeing how an LLM actually works
       | from the ground up!
       | 
       | For those reading it and going through each step, if by chance
        | you get stuck on why there are 48 elements in the first array,
        | please refer to model.py in minGPT [1].
       | 
        | It's an architectural decision that would be great to mention
        | in the article, since people without much context might get
        | lost.
       | 
       | [1]
       | https://github.com/karpathy/minGPT/blob/master/mingpt/model....
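
          A minimal sketch of where the 48 comes from, assuming minGPT's
          gpt-nano preset and its GPT.get_default_config() interface (the
          vocab and context sizes below are toy values chosen only for
          illustration):

              from mingpt.model import GPT

              config = GPT.get_default_config()
              config.model_type = 'gpt-nano'  # preset: n_layer=3, n_head=3, n_embd=48
              config.vocab_size = 3           # toy vocabulary (assumption)
              config.block_size = 11          # toy context length (assumption)
              model = GPT(config)
              # each input token is embedded into a vector of n_embd = 48 elements,
              # which is the first array the visualization walks through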
        
         | namocat wrote:
         | Yes, thank you - It was unexplained, so I got stuck on "Why
         | 48?", thinking I'd missing something right out of the gate.
        
           | zombiwoof wrote:
           | I was thinking 42 ;-)
        
       | tsunamifury wrote:
        | This shows how the individual weights and vectors work, but
        | unless I'm missing something it doesn't quite illustrate yet how
        | higher-order vectors are created at the sentence and paragraph
        | level. This might be an emergent property within this system,
        | though, so it's hard to "illustrate". How all of this ends up
        | with a world simulation needs to be understood better, and I
        | hope this advances further.
        
         | visarga wrote:
          | The deeper you go, the higher the order. That's what attention
          | does at each layer: it makes information circulate.
        
           | tsunamifury wrote:
           | Thanks, I assumed that was the case, but they didn't make
            | that explicit. Then the question is: is the world simulation
            | running at the highest-order attention layer, or is it an
            | emergent property of the interaction cycle between the
            | attention layers?
        
       | 29athrowaway wrote:
        | Big kudos to the author of this.
       | 
        | Not only is it a visualization, but it's interactive, has
        | explanations for each item, has excellent performance, and is open
       | source: https://github.com/bbycroft/llm-viz/blob/main/src/llm
       | 
        | Another interesting visualization-related thing:
       | https://github.com/shap/shap
        
       | gbertasius wrote:
       | This is AMAZING! I'm about to go into Uni and this will be useful
       | for my ML classes.
        
       | baq wrote:
       | Could as well be titled 'dissecting magic into matmuls and dot
       | products for dummies'. Great stuff. Went away even more amazed
       | that LLMs work as well as they do.
        
       | Simon_ORourke wrote:
       | This is brilliant work, thanks for sharing.
        
         | reexpressionist wrote:
         | Ditto. This is the most sophisticated viz of parameters I've
         | seen...and it's also an interactive, step-through tutorial!
        
       | hmate9 wrote:
       | This is a phenomenal visualisation. I wish I saw this when I was
       | trying to wrap my head around transformers a while ago. This
       | would have made it so much easier.
        
       | wills_forward wrote:
        | My jaw dropped to see algorithmic complexity laid out so clearly
        | in 3D space like that. I wish I were smart enough to know if it's
       | accurate or not.
        
         | block_dagger wrote:
         | To know, you must perform intellectual work, not merely be
         | smart. I bet you are smart enough.
        
       | drdg wrote:
        | Very cool. The explanations of what each part is doing are really
       | insightful. And I especially like how the scale jumps when you
       | move from e.g. Nano all the way to GPT-3 ....
        
       | atgctg wrote:
        | A lot of transformer explanations fail to mention what makes
        | self-attention so powerful.
       | 
       | Unlike traditional neural networks with fixed weights, self-
       | attention layers adaptively weight connections between inputs
       | based on context. This allows transformers to accomplish in a
       | single layer what would take traditional networks multiple
       | layers.
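
          A minimal single-head sketch of what "adaptively weight" means
          here (numpy; the toy sizes are assumptions for illustration). The
          [seq, seq] mixing matrix A is recomputed from the input on every
          forward pass, whereas a plain linear layer mixes with the same
          fixed matrix regardless of context:

              import numpy as np

              def softmax(x, axis=-1):
                  e = np.exp(x - x.max(axis=axis, keepdims=True))
                  return e / e.sum(axis=axis, keepdims=True)

              def self_attention(X, W_q, W_k, W_v):
                  Q, K, V = X @ W_q, X @ W_k, X @ W_v       # project the inputs
                  scores = Q @ K.T / np.sqrt(K.shape[-1])   # token-to-token affinities
                  A = softmax(scores, axis=-1)              # context-dependent weights, [seq, seq]
                  return A @ V                              # weighted mix of the value vectors

              rng = np.random.default_rng(0)
              seq, d = 4, 8                                 # toy sizes
              X = rng.standard_normal((seq, d))
              W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
              out = self_attention(X, W_q, W_k, W_v)        # [4, 8]; A changes whenever X changes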
        
         | kmeisthax wrote:
         | None of this seems obvious just reading the original _Attention
         | is all you need_ paper. Is there a more in-depth explanation of
         | how this adaptive weighting works?
        
           | WhitneyLand wrote:
           | It's definitely not obvious no matter how smart you are! The
           | common metaphor used is it's like a conversation.
           | 
           | Imagine you read one comment in some forum, posted in a long
           | conversation thread. It wouldn't be obvious what's going on
           | unless you read more of the thread right?
           | 
           | A single paper is like a single comment, in a thread that
           | goes on for years and years.
           | 
           | For example, why don't papers explain what
           | tokens/vectors/embedding layers are? Well, they did already,
            | except that comment in the thread came in 2013 with the
            | word2vec paper!
            | 
            | You might think, wth? To keep up with this, someone would
            | have to spend a huge part of their time just reading papers.
            | So yeah, that's kind of what researchers do.
           | 
           | The alternative is to try to find where people have distilled
           | down the important information or summarized it. That's where
           | books/blogs/youtube etc come in.
        
             | andai wrote:
             | Is there a way of finding interesting "chains" of such
             | papers, short of scanning the references / "cited by" page?
             | 
             | (For example, Google Scholar lists 98797 citations for
             | Attention is all you need!)
        
               | WhitneyLand wrote:
               | As a prerequisite to the attention paper? One to check
               | out is:
               | 
               | A Survey on Contextual Embeddings
               | https://arxiv.org/abs/2003.07278
               | 
               | Embeddings are sort of what all this stuff is built on so
               | it should help demystify the newer papers (it's actually
               | newer than the attention paper but a better overview than
               | starting with the older word2vec paper).
               | 
               | Then after the attention paper an important one is:
               | 
               | Language Models are Few-Shot Learners
               | https://arxiv.org/abs/2005.14165
               | 
               | I'm intentionally trying to not give a big list because
               | they're so time-consuming. I'm sure you'll quickly branch
               | out based on your interests.
        
           | pizza wrote:
            | softmax(QK^T) gives you a probability matrix of shape [seq,
           | seq]. Think of this like an adjacency matrix with edges with
           | flow weights that are probabilities. Hence semantic routing
           | of V.
           | 
           | where
           | 
           | - Q = X @ W_Q
           | 
           | - K = X @ W_K
           | 
            | - V = X @ W_V
           | 
           | hence
           | 
            | attn_head_i = softmax(Q @ K^T / normalizing term) @ V
           | 
           | Each head corresponds to a different concurrent routing
           | system
           | 
           | The transformer just adds normalization and mlp feature
           | learning parts around that.
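
              A rough sketch of "adds normalization and mlp feature learning
              parts around that" (PyTorch; a pre-norm, GPT-2-style block is
              assumed, and the causal mask is omitted for brevity):

                  import torch
                  import torch.nn as nn

                  class Block(nn.Module):
                      def __init__(self, d_model=48, n_heads=3):
                          super().__init__()
                          self.ln1 = nn.LayerNorm(d_model)
                          self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
                          self.ln2 = nn.LayerNorm(d_model)
                          self.mlp = nn.Sequential(            # per-token feature learning
                              nn.Linear(d_model, 4 * d_model),
                              nn.GELU(),
                              nn.Linear(4 * d_model, d_model),
                          )

                      def forward(self, x):
                          h = self.ln1(x)
                          a, _ = self.attn(h, h, h, need_weights=False)
                          x = x + a                            # residual around attention
                          x = x + self.mlp(self.ln2(x))        # residual around the MLP
                          return x

                  x = torch.randn(1, 6, 48)                    # [batch, seq, d_model]
                  y = Block()(x)                               # same shape out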
        
           | kaimac wrote:
           | I found these notes very useful. They also contain a nice
           | summary of how LLMs/transformers work. It doesn't help that
           | people can't seem to help taking a concept that has been
           | around for decades (kernel smoothing) and giving it a fancy
           | new name (attention).
           | 
           | http://bactra.org/notebooks/nn-attention-and-
           | transformers.ht...
        
             | CyberDildonics wrote:
              | It's just as bad as "convolutional neural networks" instead
              | of "images being scaled down".
        
           | albertzeyer wrote:
            | The audience of this paper is other researchers who already
            | know the concept of attention, which was already very well
            | known in the field. In such research papers, such things
            | are never explained again, as all the researchers already
            | know this or can read the other sources that are cited; the
            | papers focus on the actual research questions. In this case,
            | the research question was simply: can we get away with just
            | using attention and not using the LSTM anymore? Before that,
            | everyone was using both together.
           | 
            | I think learning it by following this historical
            | development can be helpful. E.g. in this case here, learn the
           | concept of attention, specifically cross attention first. And
           | that is this paper: Bahdanau, Cho, Bengio, "Neural Machine
           | Translation by Jointly Learning to Align and Translate",
           | 2014, https://arxiv.org/abs/1409.0473
           | 
           | That paper introduces it. But even that is maybe quite dense,
           | and to really grasp it, it helps to reimplement those things.
           | 
           | It's always dense, because those papers already have space
           | constraints given by the conferences, max 9 pages or so. To
            | get a more detailed overview, you can study the authors'
           | code, or other resources. There is a lot now about those
           | topics, whole books, etc.
        
             | BOOSTERHIDROGEN wrote:
              | What books cover this topic exclusively? Thanks.
        
               | CardenB wrote:
               | I doubt any.
        
               | albertzeyer wrote:
               | This is frequently a topic here on HN. E.g.:
               | 
               | https://udlbook.github.io/udlbook/
               | (https://news.ycombinator.com/item?id=38424939)
               | 
               | https://fleuret.org/francois/lbdl.html
               | (https://news.ycombinator.com/item?id=35767789)
               | 
               | https://www.fast.ai/
               | (https://news.ycombinator.com/item?id=24237207)
               | 
               | https://d2l.ai/
               | (https://news.ycombinator.com/item?id=38428225)
               | 
               | Some more:
               | 
               | https://news.ycombinator.com/item?id=35543774
               | 
               | There is a lot more. Just google for "deep learning", and
               | you'll find a lot of content. And most of that will cover
               | attention, as it is a really basic concept now.
        
           | andai wrote:
           | Turns out Attention is all you need isn't all you need!
           | 
           | (I'm sorry)
        
           | gtoubassi wrote:
           | I struggled to get an intuition for this, but on another HN
            | thread earlier this year I saw a recommendation for Sebastian
           | Raschka's series. Starting with this video:
           | https://www.youtube.com/watch?v=mDZil99CtSU and maybe the
           | next three or four. It was really helpful to get a sense of
           | the original 2014 concept of attention which is easier to
           | understand but less powerful
           | (https://arxiv.org/abs/1409.0473), and then how it gets
           | powerful with the more modern notion of attention. So if you
           | have a reasonable intuition for "regular" ANNs I think this
           | is a great place to start.
        
         | WhitneyLand wrote:
         | In case it's confusing for anyone to see "weight" as a verb and
         | a noun so close together, there are indeed two different things
         | going on:
         | 
         | 1. There are the model weights, aka the parameters. These are
         | what get adjusted during training to do the learning part. They
         | always exist.
         | 
         | 2. There are attention weights. These are part of the
         | transformer architecture and they "weight" the context of the
         | input. They are ephemeral. Used and discarded. Don't always
         | exist.
         | 
         | They are both typically 32-bit floats in case you're curious
         | but still different concepts.
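
            A small sketch of the distinction (PyTorch; the layer sizes are
            arbitrary and not tied to any particular model):

                import torch
                import torch.nn as nn
                import torch.nn.functional as F

                proj_q = nn.Linear(16, 16)   # (1) model weights: proj_q.weight persists, lives in
                proj_k = nn.Linear(16, 16)   #     the checkpoint, and is what training adjusts

                x = torch.randn(6, 16)       # 6 tokens, 16-dim embeddings
                attn = F.softmax(proj_q(x) @ proj_k(x).T / 16 ** 0.5, dim=-1)
                # (2) attention weights: `attn` is recomputed for every input and then
                #     discarded; it is never saved with the model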
        
           | kirill5pol wrote:
           | I think a good way of explaining #2 is "weight" in the sense
           | of a weighted average
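
              A tiny numeric illustration of "weight" in that sense (numpy;
              the numbers are made up):

                  import numpy as np

                  values = np.array([[1.0, 0.0],    # value vector for token 1
                                     [0.0, 1.0],    # token 2
                                     [1.0, 1.0]])   # token 3
                  weights = np.array([0.7, 0.2, 0.1])  # one row of the attention matrix; sums to 1
                  output = weights @ values            # [0.8, 0.3] -- a weighted average of the values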
        
           | airstrike wrote:
           | I always thought the verb was "weigh" not "weight", but
           | apparently the latter is also in the dictionary as a verb.
           | 
           | Oh well... it seems like it's more confusing than I thought
           | https://www.merriam-webster.com/wordplay/when-to-use-
           | weigh-a...
        
             | bobbylarrybobby wrote:
             | "To weight" is to _assign_ a weight (e.g., to weight
             | variables differently in a model), whereas "to weigh" is to
             | _observe_ and /or _record_ a weight (as a scale does).
        
               | DavidSJ wrote:
               | A few other cases of this sort of thing:
               | 
               | affect (n). an emotion or feeling. "She has a positive
               | affect."
               | 
               | effect (n). a result or change due to some event. "The
               | effect of her affect is to make people like her."
               | 
               | affect (v). to change or modify [X], have an effect upon
               | [X]. "The weather affects my affect."
               | 
               | effect (v). to bring about [X] or cause [X] to happen.
               | "Our protests are designed to effect change."
               | 
               | Also:
               | 
                | cost (v). to require a payment or loss of [X]. "That
                | apple will cost $5." Past tense cost: "That apple cost
                | $5."
               | 
               | cost (v). to estimate the price of [X]. "The accounting
               | department will cost the construction project at $5
               | million." Past tense costed. "The accounting department
               | costed the construction project at $5 million."
        
         | lchengify wrote:
         | Just to add on, a good way to learn these terms is to look at
         | the history of neural networks rather than looking at
          | transformer architecture in a vacuum.
         | 
         | This [1] post from 2021 goes over attention mechanisms as
         | applied to RNN / LSTM networks. It's visual and goes into a bit
         | more detail, and I've personally found RNN / LSTM networks
         | easier to understand intuitively.
         | 
         | [1] https://medium.com/swlh/a-simple-overview-of-rnn-lstm-and-
         | at...
        
       | skadamat wrote:
       | If folks want a lower dimensional version of this for their own
       | models, I'm a big fan of the Netron library for model
       | architecture visualization.
       | 
       | Wrote about it here: https://about.xethub.com/blog/visualizing-
       | ml-models-github-n...
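
          A minimal usage sketch, assuming the netron Python package is
          installed (pip install netron) and you have a saved model file
          such as model.onnx on disk:

              import netron

              # serves an interactive, zoomable graph of the model's layers in the browser
              netron.start("model.onnx")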
        
       | arikrak wrote:
       | This looks pretty cool! Anyone know of visualizations for simpler
       | neural networks? I'm aware of tensorflow playground but that's
        | just for a toy example; is there anything for visualizing a real
        | example (e.g. handwriting recognition)?
        
         | Logge wrote:
         | https://okdalto.github.io/VisualizeMNIST_web/
        
         | atonalfreerider wrote:
         | We made a VR visualization back in 2017
         | https://youtu.be/x6y14yAJ9rY
        
       | rvz wrote:
        | Rather than looking at the visuals of this network, it is
        | better to focus on the actual problem with these LLMs, which the
        | author has already shown:
        | 
        | Within the transformer section:
       | 
       | > As is common in deep learning, it's hard to say exactly what
       | each of these layers is doing, but we have some general ideas:
       | the earlier layers tend to focus on learning lower-level features
       | and patterns, while the later layers learn to recognize and
       | understand higher-level abstractions and relationships.
       | 
        | That is the problem, and yet these black boxes are about as
        | explainable as a magic scroll.
        
         | nlh wrote:
         | I find this problem fascinating.
         | 
          | For decades we've puzzled over how the inner workings of the
          | brain work, and though we've learned a lot, we still don't
          | fully understand it. So, we figure, we'll just make an
         | artificial brain and THEN we'll be able to figure it out.
         | 
         | And here we are, finally a big step closer to an artificial
         | brain and once again, we don't know how it works :)
         | 
         | (Although to be fair we're spending all of our efforts making
          | the models better and better and not on learning their low-
          | level behaviors. Thankfully, when we decide to study them it'll
          | be a wee bit less invasive and actually doable, in theory.)
        
       | shaburn wrote:
        | Visualization never seems to get the credit it's due in software
       | development. This is amazing.
        
       | flockonus wrote:
       | Twitter thread by the author sharing some extra context on this
       | work:
       | https://twitter.com/BrendanBycroft/status/173104295714982714...
        
       | russellbeattie wrote:
        | I've wondered for a while if, as LLM usage matures, there will be
        | an effort to optimize hotspots, like what happened with VMs, or
        | to auto-index them, like in relational DBs. I'm sure there are common
       | data paths which get more usage, which could somehow be
       | prioritized, either through pre-processing or dynamically,
       | helping speed up inference.
        
       | holtkam2 wrote:
       | The visualization I've been looking for for months. I would have
       | happily paid serious money for this... the fact that it's free is
       | such a gift and I don't take it for granted.
        
       | Solvency wrote:
       | Wish it were mobile friendly.
        
       | physPop wrote:
        | Honestly, reading the PyTorch implementation of minGPT is a lot
        | more informative than an inscrutable 3D rendering. It's a well-
        | commented and pedagogical implementation. I applaud the
       | intention, and it looks slick, but I'm not sure it really conveys
       | information in an efficient way.
        
       ___________________________________________________________________
       (page generated 2023-12-03 23:00 UTC)