[HN Gopher] LLM Visualization
___________________________________________________________________
LLM Visualization
Author : plibither8
Score : 47 points
Date : 2023-12-03 06:08 UTC (16 hours ago)
(HTM) web link (bbycroft.net)
(TXT) w3m dump (bbycroft.net)
| gryfft wrote:
| Damn, this looks _phenomenal._ I've been wanting to do a deep
| dive like this for a while -- the 3D model is a spectacular
| pedagogic device.
| quickthrower2 wrote:
| Andrej Karpathy twisting his hands as he explains it is also a
| great device. Not being sarcastic -- when he explains it, I
| understand it for a good minute or two. Then I need to rewatch
| as I forget (but that is just me)!
| airesQ wrote:
| Incredible work.
|
| So much depth; initially I thought it was "just" a 3D model.
| The animations are amazing.
| reqo wrote:
| I feel like visualizations like this are what is missing from
| university curricula. Now imagine a professor going through
| each animation, describing exactly what is happening -- I am
| pretty sure students would get a much more in-depth
| understanding!
| Arson9416 wrote:
| Isn't it amazing that a random person on the internet can
| produce free educational content that trumps university
| courses? With all the resources and expertise that universities
| have, why do they get shown up all the time? Do they just not
| know how to educate?
| cmiller1 wrote:
| Have you considered the considerably greater breadth of
| content required for a full course, as well as the other
| responsibilities of the people teaching them, such as testing,
| public speaking, etc.?
| dannyw wrote:
| It probably would be massively beneficial to society and
| progress if teaching professors could spend more time and
| attention on teaching.
| syntaxers wrote:
| It's an incentives problem. At research universities,
| promotion is contingent on research output, and teaching is
| often seen as a distraction. At so-called teaching
| universities, promotion (or even survival) is mainly
| contingent on throughput, and not measures of pedagogical
| outcome.
|
| If you are a teaching faculty at a university, it is against
| your own interests to invest time to develop novel teaching
| materials. The exception might be writing textbooks, which
| can be monetized, but typically are a net-negative endeavor.
| tiffanyg wrote:
| Unfortunately, this is a problem throughout our various
| economies, at this point.
|
| Professionalization of management, with its easy over-
| reliance on simplified quantitative "metrics" -- along with
| the scale this allows and the way fundamental issues get
| obscured -- tends to produce these kinds of results. This is,
| of course, well known in business schools, and efforts are
| generally made to ensure graduates are aware of some of the
| downsides of "the quantitative." Unsurprisingly, over time,
| there is a kind of "forcing" that tends to drive these systems
| towards results like the ones you describe.
|
| It's usually the case that the imposition of metrics,
| optimization, etc. -- "mathematical methods" -- is quite
| beneficial at first, but once systems have been improved in
| sensible ways based on the insights gained, less desirable
| behavior begins to occur. Multiple factors, including basic
| human psychology, play into this ... which I think is getting
| beyond the scope of what's reasonable to include in this
| comment.
| 29athrowaway wrote:
| Except for this man: Professor Robert Ghrist
|
| https://www.youtube.com/c/ProfGhristMath
|
| This person is amazing.
| sgnelson wrote:
| Because Faculty are generally paid very poorly, have many
| courses to teach, and what takes up more and more of their
| time are the broken bureaucratic systems they have to deal
| with.
|
| Add that at research universities, they have to do research.
|
| Also add in that at many schools, way too many students are
| just there to clock in, clock out, and get a piece of paper
| that says they did it. Way too few are there to actually get
| an education. This has very real consequences on the morale of
| the instructors. When your students don't care, it's very
| hard for you to care. If your students aren't willing to work
| hard, why are you willing to work hard? Because you're paid so
| well?
|
| I know plenty of instructors who would love to do things like
| this, but when are they going to? When are they going to
| find the time to learn the skills necessary to build an
| interactive web app? You think everyone outside of comp sci
| and similar disciplines just naturally knows how to build
| these types of apps?
|
| I could go on, but the tl;dr of it is: educators are
| overworked, underpaid, and don't have enough time in the day.
| erosenbe0 wrote:
| I second the reply about incentives. Funding curriculum
| materials and professional curriculum development is often
| seen as more of a K-12 thing. There is not even enough at the
| vocational level.
|
| If big competitive grants and competitive salaries went to
| people with demonstrated ability, like the engineer of this
| viz, there would be fewer STEM dropouts in colleges and more
| summer learning! Also, in technical trades like green
| construction, solar, HVAC, building retrofits, data center
| operations, and the like, people would get farther and it
| would be a more diverse bunch.
| lusus_naturae wrote:
| The person who made this went to university for maths
| education (if I found the right profile).
| webnrrd2k wrote:
| This is the result of a single person on the internet who
| was not chosen randomly. It's not a fair characterization to
| call this the product of some random person on the internet.
| You can't just choose anyone on the internet at random and
| get results this good.
|
| Also, according to his home page, Mr. Bycroft earned a BSc in
| Mathematics and Physics from the University of Canterbury in
| 2012. It's true that this page isn't the direct result of a
| professor or a university course, and it's also true that
| it's not a completely separate thing. It seems clear that his
| university education had a big role to play in creating this.
| wrsh07 wrote:
| You're betting on the hundreds of top university cs
| professors to produce better content than the hundreds of
| thousands of industry veterans or hobbyists...
|
| Why does YouTube sometimes have better content than
| professionally produced media? It's a really long tail of
| creators and educators
| physPop wrote:
| This isn't new. Textbooks exist for the same reason, so we
| don't need to duplicate effort creating teaching materials
| and can have a kind of accepted core curriculum.
| jwilber wrote:
| If you enjoy interactive visuals of machine learning
| algorithms, do check out https://mlu-explain.github.io/
| smy20011 wrote:
| Really cool!
| thefourthchime wrote:
| First off, this is fabulous work. I went through it for the Nano,
| but is there a way to do the step-by-step for the other LLMs?
| Heidaradar wrote:
| Below the title, there are a few others you can choose from
| (GPT-2 small and XL, and GPT-3).
| valdect wrote:
| Super cool! It's always nice to look at something concrete.
| warkanlock wrote:
| This is an excellent tool for understanding how an LLM
| actually works from the ground up!
|
| For those reading it and going through each step: if by chance
| you get stuck on why there are 48 elements in the first array,
| please refer to model.py in minGPT [1] (see the sketch below).
|
| It's an architectural decision that would be great to mention
| in the article, since people without much context might get
| lost.
|
| [1]
| https://github.com/karpathy/minGPT/blob/master/mingpt/model....
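|
| (For reference, the 48 is the embedding width of the smallest
| preset. A rough sketch of the kind of preset table defined in
| minGPT's model.py -- names and values recalled from memory, so
| check the repo for the exact lines:)
|
|     # hypothetical reconstruction of minGPT's model-size presets
|     config_args = {
|         'gpt2':     dict(n_layer=12, n_head=12, n_embd=768),
|         'gpt-mini': dict(n_layer=6,  n_head=6,  n_embd=192),
|         'gpt-nano': dict(n_layer=3,  n_head=3,  n_embd=48),   # the "48" in the viz
|     }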
| namocat wrote:
| Yes, thank you -- it was unexplained, so I got stuck on "Why
| 48?", thinking I'd missed something right out of the gate.
| zombiwoof wrote:
| I was thinking 42 ;-)
| tsunamifury wrote:
| This shows how the individual weights and vectors work, but
| unless I'm missing something it doesn't quite illustrate how
| higher-order vectors are created at the sentence and paragraph
| level. This might be an emergent property within this system,
| though, so it's hard to "illustrate". How all of this ends up
| with a world simulation needs to be understood better, and I
| hope this advances further.
| visarga wrote:
| The deeper you go, the higher the order. It's what attention
| does at each layer: it makes information circulate.
| tsunamifury wrote:
| Thanks, I assumed that was the case, but they didn't make
| that explicit. Then the question is: is the world simulation
| running at the highest-order attention layer, or is it an
| emergent property of the interaction cycle between the
| attention layers?
| 29athrowaway wrote:
| Big kudos to the author of this.
|
| It's not only a visualization; it's interactive, has
| explanations for each item, performs well, and is open
| source: https://github.com/bbycroft/llm-viz/blob/main/src/llm
|
| Another interesting visualization related thing:
| https://github.com/shap/shap
| gbertasius wrote:
| This is AMAZING! I'm about to go into Uni and this will be useful
| for my ML classes.
| baq wrote:
| Could just as well be titled 'dissecting magic into matmuls
| and dot products for dummies'. Great stuff. Went away even
| more amazed that LLMs work as well as they do.
| Simon_ORourke wrote:
| This is brilliant work, thanks for sharing.
| reexpressionist wrote:
| Ditto. This is the most sophisticated viz of parameters I've
| seen...and it's also an interactive, step-through tutorial!
| hmate9 wrote:
| This is a phenomenal visualisation. I wish I had seen this
| when I was trying to wrap my head around transformers a while
| ago. This would have made it so much easier.
| wills_forward wrote:
| My jaw dropped seeing algorithmic complexity laid out so
| clearly in a 3D space like that. I wish I was smart enough to
| know whether it's accurate or not.
| block_dagger wrote:
| To know, you must perform intellectual work, not merely be
| smart. I bet you are smart enough.
| drdg wrote:
| Very cool. The explanations of what each part is doing are
| really insightful. And I especially like how the scale jumps
| when you move from e.g. Nano all the way to GPT-3...
| atgctg wrote:
| A lot of transformer explanations fail to mention what makes self
| attention so powerful.
|
| Unlike traditional neural networks with fixed weights, self-
| attention layers adaptively weight connections between inputs
| based on context. This allows transformers to accomplish in a
| single layer what would take traditional networks multiple
| layers.
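|
| (A minimal numpy sketch of that contrast, with made-up shapes
| and the V projection omitted for brevity:)
|
|     import numpy as np
|
|     def softmax(z):
|         z = z - z.max(axis=-1, keepdims=True)
|         e = np.exp(z)
|         return e / e.sum(axis=-1, keepdims=True)
|
|     seq, d = 4, 8
|     X = np.random.randn(seq, d)       # one sequence of token vectors
|
|     # Fixed weights: the same (seq x seq) mixing matrix is applied to
|     # every input, so how tokens combine never depends on their content.
|     W_mix = np.random.randn(seq, seq)
|     fixed_out = W_mix @ X
|
|     # Self-attention: the mixing matrix is computed from X itself, so
|     # each new input produces a different set of mixing weights.
|     W_Q, W_K = np.random.randn(d, d), np.random.randn(d, d)
|     A = softmax((X @ W_Q) @ (X @ W_K).T / np.sqrt(d))   # [seq, seq]
|     attn_out = A @ X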
| kmeisthax wrote:
| None of this seems obvious just reading the original _Attention
| is all you need_ paper. Is there a more in-depth explanation of
| how this adaptive weighting works?
| WhitneyLand wrote:
| It's definitely not obvious no matter how smart you are! The
| common metaphor used is it's like a conversation.
|
| Imagine you read one comment in some forum, posted in a long
| conversation thread. It wouldn't be obvious what's going on
| unless you read more of the thread right?
|
| A single paper is like a single comment, in a thread that
| goes on for years and years.
|
| For example, why don't papers explain what
| tokens/vectors/embedding layers are? Well, they did already,
| except that comment in the thread came in 2013 with the
| word2vec paper!
|
| You might think, wth? To keep up with this, someone would have
| to spend a huge part of their time just reading papers. So
| yeah, that's kind of what researchers do.
|
| The alternative is to try to find where people have distilled
| down the important information or summarized it. That's where
| books/blogs/youtube etc come in.
| andai wrote:
| Is there a way of finding interesting "chains" of such
| papers, short of scanning the references / "cited by" page?
|
| (For example, Google Scholar lists 98797 citations for
| Attention is all you need!)
| WhitneyLand wrote:
| As a prerequisite to the attention paper? One to check
| out is:
|
| A Survey on Contextual Embeddings
| https://arxiv.org/abs/2003.07278
|
| Embeddings are sort of what all this stuff is built on so
| it should help demystify the newer papers (it's actually
| newer than the attention paper but a better overview than
| starting with the older word2vec paper).
|
| Then after the attention paper an important one is:
|
| Language Models are Few-Shot Learners
| https://arxiv.org/abs/2005.14165
|
| I'm intentionally trying to not give a big list because
| they're so time-consuming. I'm sure you'll quickly branch
| out based on your interests.
| pizza wrote:
| softmax(QK^T) gives you a probability matrix of shape [seq,
| seq]. Think of this like an adjacency matrix whose edges carry
| flow weights that are probabilities. Hence semantic routing
| of V.
|
| where
|
| - Q = X @ W_Q
|
| - K = X @ W_K
|
| - V = X @ W_V
|
| hence
|
| attn_head_i = softmax(Q @ K^T / normalizing term) @ V
|
| Each head corresponds to a different concurrent routing
| system
|
| The transformer just adds normalization and mlp feature
| learning parts around that.
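|
| (A runnable sketch of the single-head case above, with made-up
| dimensions:)
|
|     import numpy as np
|
|     def softmax(z):
|         z = z - z.max(axis=-1, keepdims=True)
|         e = np.exp(z)
|         return e / e.sum(axis=-1, keepdims=True)
|
|     seq, d_model, d_k = 5, 16, 16
|     X = np.random.randn(seq, d_model)
|     W_Q, W_K, W_V = (np.random.randn(d_model, d_k) for _ in range(3))
|
|     Q, K, V = X @ W_Q, X @ W_K, X @ W_V
|
|     # routing probabilities: each row sums to 1, like edge weights
|     # out of one token in an adjacency matrix
|     A = softmax(Q @ K.T / np.sqrt(d_k))   # shape [seq, seq]
|     attn_head = A @ V                     # V routed according to A
|
|     assert np.allclose(A.sum(axis=-1), 1.0)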
| kaimac wrote:
| I found these notes very useful. They also contain a nice
| summary of how LLMs/transformers work. It doesn't help that
| people can't seem to help taking a concept that has been
| around for decades (kernel smoothing) and giving it a fancy
| new name (attention).
|
| http://bactra.org/notebooks/nn-attention-and-
| transformers.ht...
| CyberDildonics wrote:
| It's just as bad as "convolutional neural networks" instead
| of "images being scaled down".
| albertzeyer wrote:
| The audience of this paper is other researchers who already
| know the concept of attention, which was already very well
| known in the field. In such research papers, such things are
| never explained again -- all the researchers already know them
| or can read the other sources, which are cited -- and the
| paper focuses on the actual research question. In this case,
| the research question was simply: can we get away with just
| using attention and not using the LSTM anymore? Before that,
| everyone was using both together.
|
| I think learning it by following this historical development
| can be helpful. E.g. in this case, learn the concept of
| attention, specifically cross attention, first. And that is
| this paper: Bahdanau, Cho, Bengio, "Neural Machine
| Translation by Jointly Learning to Align and Translate",
| 2014, https://arxiv.org/abs/1409.0473
|
| That paper introduces it. But even that is maybe quite dense,
| and to really grasp it, it helps to reimplement those things.
|
| It's always dense, because those papers already have space
| constraints given by the conferences, max 9 pages or so. To
| get a better detailed overview, you can study the authors
| code, or other resources. There is a lot now about those
| topics, whole books, etc.
| BOOSTERHIDROGEN wrote:
| What books cover this topic exclusively? Thanks
| CardenB wrote:
| I doubt any.
| albertzeyer wrote:
| This is frequently a topic here on HN. E.g.:
|
| https://udlbook.github.io/udlbook/
| (https://news.ycombinator.com/item?id=38424939)
|
| https://fleuret.org/francois/lbdl.html
| (https://news.ycombinator.com/item?id=35767789)
|
| https://www.fast.ai/
| (https://news.ycombinator.com/item?id=24237207)
|
| https://d2l.ai/
| (https://news.ycombinator.com/item?id=38428225)
|
| Some more:
|
| https://news.ycombinator.com/item?id=35543774
|
| There is a lot more. Just google for "deep learning", and
| you'll find a lot of content. And most of that will cover
| attention, as it is a really basic concept now.
| andai wrote:
| Turns out Attention is all you need isn't all you need!
|
| (I'm sorry)
| gtoubassi wrote:
| I struggled to get an intuition for this, but on another HN
| thread earlier this year I saw a recommendation for Sebastian
| Raschka's series, starting with this video:
| https://www.youtube.com/watch?v=mDZil99CtSU and maybe the
| next three or four. It was really helpful to get a sense of
| the original 2014 concept of attention which is easier to
| understand but less powerful
| (https://arxiv.org/abs/1409.0473), and then how it gets
| powerful with the more modern notion of attention. So if you
| have a reasonable intuition for "regular" ANNs I think this
| is a great place to start.
| WhitneyLand wrote:
| In case it's confusing for anyone to see "weight" as a verb and
| a noun so close together, there are indeed two different things
| going on:
|
| 1. There are the model weights, aka the parameters. These are
| what get adjusted during training to do the learning part. They
| always exist.
|
| 2. There are attention weights. These are part of the
| transformer architecture and they "weight" the context of the
| input. They are ephemeral. Used and discarded. Don't always
| exist.
|
| They are both typically 32-bit floats, in case you're curious,
| but they are still different concepts.
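|
| (A tiny PyTorch sketch of the two kinds of "weight", using a
| hypothetical single attention head:)
|
|     import torch
|     import torch.nn.functional as F
|     from torch import nn
|
|     class TinyAttention(nn.Module):
|         def __init__(self, d=16):
|             super().__init__()
|             # (1) model weights: persistent parameters, adjusted by training
|             self.q = nn.Linear(d, d)
|             self.k = nn.Linear(d, d)
|             self.v = nn.Linear(d, d)
|
|         def forward(self, x):  # x: [seq, d]
|             # (2) attention weights: recomputed for every input, used to
|             # weight the context, then discarded
|             attn = F.softmax(
|                 self.q(x) @ self.k(x).T / x.shape[-1] ** 0.5, dim=-1)
|             return attn @ self.v(x)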
| kirill5pol wrote:
| I think a good way of explaining #2 is "weight" in the sense
| of a weighted average
| airstrike wrote:
| I always thought the verb was "weigh" not "weight", but
| apparently the latter is also in the dictionary as a verb.
|
| Oh well... it seems like it's more confusing than I thought
| https://www.merriam-webster.com/wordplay/when-to-use-
| weigh-a...
| bobbylarrybobby wrote:
| "To weight" is to _assign_ a weight (e.g., to weight
| variables differently in a model), whereas "to weigh" is to
| _observe_ and /or _record_ a weight (as a scale does).
| DavidSJ wrote:
| A few other cases of this sort of thing:
|
| affect (n). an emotion or feeling. "She has a positive
| affect."
|
| effect (n). a result or change due to some event. "The
| effect of her affect is to make people like her."
|
| affect (v). to change or modify [X], have an effect upon
| [X]. "The weather affects my affect."
|
| effect (v). to bring about [X] or cause [X] to happen.
| "Our protests are designed to effect change."
|
| Also:
|
| cost (v). to require a payment or loss of [X]. "That
| Apple will cost $5." Past tense cost: "That Apple cost
| $5."
|
| cost (v). to estimate the price of [X]. "The accounting
| department will cost the construction project at $5
| million." Past tense costed. "The accounting department
| costed the construction project at $5 million."
| lchengify wrote:
| Just to add on, a good way to learn these terms is to look at
| the history of neural networks rather than looking at the
| transformer architecture in a vacuum.
|
| This [1] post from 2021 goes over attention mechanisms as
| applied to RNN / LSTM networks. It's visual and goes into a bit
| more detail, and I've personally found RNN / LSTM networks
| easier to understand intuitively.
|
| [1] https://medium.com/swlh/a-simple-overview-of-rnn-lstm-and-
| at...
| skadamat wrote:
| If folks want a lower-dimensional version of this for their
| own models, I'm a big fan of the Netron library for model
| architecture visualization.
|
| Wrote about it here: https://about.xethub.com/blog/visualizing-
| ml-models-github-n...
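|
| (For anyone curious, a minimal sketch of invoking it from
| Python -- assuming the netron pip package and a hypothetical
| model file named model.onnx:)
|
|     # pip install netron
|     import netron
|
|     # serves an interactive graph of the model architecture in the browser
|     netron.start("model.onnx")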
| arikrak wrote:
| This looks pretty cool! Anyone know of visualizations for
| simpler neural networks? I'm aware of TensorFlow Playground,
| but that's just for a toy example; is there anything for
| visualizing a real example (e.g. handwriting recognition)?
| Logge wrote:
| https://okdalto.github.io/VisualizeMNIST_web/
| atonalfreerider wrote:
| We made a VR visualization back in 2017
| https://youtu.be/x6y14yAJ9rY
| rvz wrote:
| Rather than looking at the visuals of this network, it is
| better to focus on the actual problem with these LLMs, which
| the author has already shown.
|
| Within the transformer section:
|
| > As is common in deep learning, it's hard to say exactly what
| each of these layers is doing, but we have some general ideas:
| the earlier layers tend to focus on learning lower-level features
| and patterns, while the later layers learn to recognize and
| understand higher-level abstractions and relationships.
|
| That is the problem, and yet these black boxes are about as
| explainable as a magic scroll.
| nlh wrote:
| I find this problem fascinating.
|
| For decades we've puzzled over how the inner workings of the
| brain work, and though we've learned a lot, we still don't
| fully understand it. So, we figure, we'll just make an
| artificial brain and THEN we'll be able to figure it out.
|
| And here we are, finally a big step closer to an artificial
| brain and once again, we don't know how it works :)
|
| (Although to be fair, we're spending all of our effort making
| the models better and better, and not on learning their low-
| level behaviors. Thankfully, when we decide to study them,
| it'll be a wee bit less invasive and actually doable, in
| theory.)
| shaburn wrote:
| Visualization never seems to get the credit it's due in
| software development. This is amazing.
| flockonus wrote:
| Twitter thread by the author sharing some extra context on this
| work:
| https://twitter.com/BrendanBycroft/status/173104295714982714...
| russellbeattie wrote:
| I've wondered for a while whether, as LLM usage matures, there
| will be an effort to optimize hotspots like what happened with
| VMs, or auto-indexing like in relational DBs. I'm sure there
| are common data paths which get more usage, which could
| somehow be prioritized, either through pre-processing or
| dynamically, helping speed up inference.
| holtkam2 wrote:
| The visualization I've been looking for for months. I would have
| happily paid serious money for this... the fact that it's free is
| such a gift and I don't take it for granted.
| Solvency wrote:
| Wish it were mobile friendly.
| physPop wrote:
| Honestly, reading the PyTorch implementation of minGPT is a
| lot more informative than an inscrutable 3D rendering. It's a
| well-commented and pedagogical implementation. I applaud the
| intention, and it looks slick, but I'm not sure it really
| conveys information in an efficient way.
___________________________________________________________________
(page generated 2023-12-03 23:00 UTC)