[HN Gopher] 3Blue1Brown: But what is a GPT? [video]
___________________________________________________________________
3Blue1Brown: But what is a GPT? [video]
Author : huhhuh
Score : 184 points
Date : 2024-04-01 19:37 UTC (3 hours ago)
(HTM) web link (www.youtube.com)
(TXT) w3m dump (www.youtube.com)
| lucidrains wrote:
| I can't think of anyone better to teach the attention
| mechanism to the masses. This is a dream come true.
| acchow wrote:
| Incredible. This 3B1B series was started 6 years ago and keeps
| going today with chapter 5.
|
| If you haven't seen the first few chapters, I cannot recommend
| them enough.
| user_7832 wrote:
| Would you be able to compare them to Andrew Ng's course?
| sk11001 wrote:
| They're not really comparable - if you're wondering if you
| should do one or the other, you should do both.
| ctrw wrote:
| The way you compare a technical drawing of a steam engine
| to The Fighting Temeraire oil painting.
| yinser wrote:
| What an unbelievable salve for all the April Fool's content. Pipe
| this directly into my veins.
| user_7832 wrote:
| I've just started this video, but I already have a question if
| anyone's familiar with GPT workings - I thought these models
| choose the next word based on whichever word is most likely.
| But if they instead choose "one of the likely" words, couldn't
| that (in general) lead to a situation where the predictions for
| the following word are all much less likely? Evaluating
| possibilities for "two words together", then, would be more
| beneficial if computationally feasible (and so on for 3, 4 and
| n words). Does this exist?
|
| (I realize that choosing the most likely word wouldn't
| necessarily solve the issue, but choosing the most likely phrase
| possibly might.)
|
| Edit, after seeing the video and comments: it's beam search,
| along with temperature, that controls these things.
| mvsin wrote:
| Something like this does exist; production systems rarely use
| pure greedy search and instead rely on broader search and
| sampling strategies.
|
| An example is beam search:
| https://www.width.ai/post/what-is-beam-search
|
| Essentially, we keep several of the most probable candidate
| sequences in play at each step to improve the final quality of
| the output.
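|
| A minimal sketch of the idea in Python (the `log_probs(seq)`
| function here is a hypothetical stand-in for a real model
| call, so this is illustrative rather than a production
| decoder):
|
|     import heapq
|
|     def beam_search(log_probs, start_token, steps, beam_width=3):
|         # Each beam entry is (cumulative log-probability, token sequence).
|         beams = [(0.0, [start_token])]
|         for _ in range(steps):
|             candidates = []
|             for score, seq in beams:
|                 # log_probs(seq) is assumed to return a dict mapping
|                 # each possible next token to its log-probability.
|                 for tok, lp in log_probs(seq).items():
|                     candidates.append((score + lp, seq + [tok]))
|             # Keep only the highest-scoring partial sequences.
|             beams = heapq.nlargest(beam_width, candidates,
|                                    key=lambda c: c[0])
|         # Return the single best sequence found.
|         return max(beams, key=lambda b: b[0])[1]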
| user_7832 wrote:
| Thanks, that's exactly what I was looking for! Any idea if
| it's possible to use beam search with local models like
| Mistral? It sounds like the choice of beam search vs. say
| top-p or top-k should live in the inference software rather
| than in the model itself, right?
| yunohn wrote:
| This is actually a great question for which I found an
| interesting attempt:
| https://andys.page/posts/llm_sampling_strategies/
|
| (No affiliation)
| activatedgeek wrote:
| If you use HuggingFace models, then a few simpler decoding
| algorithms are already implemented in the `generate` method of
| all supported models.
|
| Here is a blog post that describes it:
| https://huggingface.co/blog/how-to-generate.
|
| I will warn you, though, that beam search is typically what
| you do NOT want. Beam search approximately optimizes for the
| "most likely sequence at the token level." This is rarely what
| you need in practice with open-ended generation (e.g. a
| question-answering chat bot). In practice, you want the "most
| likely semantic sequence," which is a much harder problem.
|
| Of course, various approximations for semantic alignment exist
| in the literature, but it remains a wide-open problem.
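|
| Roughly what switching strategies looks like with `generate`
| (using "gpt2" here only as a small placeholder checkpoint; the
| same keyword arguments work for local models such as Mistral):
|
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|
|     tok = AutoTokenizer.from_pretrained("gpt2")
|     model = AutoModelForCausalLM.from_pretrained("gpt2")
|     inputs = tok("The meaning of life is", return_tensors="pt")
|
|     # Beam search: approximately maximizes token-level likelihood.
|     beams = model.generate(**inputs, num_beams=4, do_sample=False,
|                            max_new_tokens=30)
|
|     # Sampling with temperature / top-p: usually nicer for chat.
|     sampled = model.generate(**inputs, do_sample=True,
|                              temperature=0.8, top_p=0.9,
|                              max_new_tokens=30)
|
|     print(tok.decode(beams[0], skip_special_tokens=True))
|     print(tok.decode(sampled[0], skip_special_tokens=True))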
| doctoboggan wrote:
| The temperature setting controls how unlikely a chosen next
| token can be. If set to 0, the top of the likelihood list is
| always chosen; if set greater than 0, some lower-probability
| tokens may be chosen.
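|
| A toy sketch of that (not any particular library's
| implementation):
|
|     import numpy as np
|
|     def sample_token(logits, temperature=1.0):
|         # Temperature 0 degenerates to greedy decoding (argmax).
|         if temperature == 0:
|             return int(np.argmax(logits))
|         # Dividing logits by the temperature sharpens (<1) or
|         # flattens (>1) the distribution before the softmax.
|         scaled = np.asarray(logits, dtype=float) / temperature
|         probs = np.exp(scaled - scaled.max())
|         probs /= probs.sum()
|         # Draw one token index at random from that distribution.
|         return int(np.random.choice(len(probs), p=probs))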
| user_7832 wrote:
| Thanks, learnt something new today!
| not_a_dane wrote:
| Temperature is applied as part of the softmax layer (the
| logits are divided by it before the softmax), though not in
| every implementation.
| davekeck wrote:
| > then some lower probability tokens may be chosen
|
| Can you explain how it chooses one of the lower-probability
| tokens? Is it just random?
| lxe wrote:
| There's a whole bunch of different normalization and sampling
| techniques you can apply that alter the quality or
| expressiveness of the output, e.g.
| https://docs.sillytavern.app/usage/common-settings/#sampler-...
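|
| For instance, top-k and top-p (nucleus) sampling just prune
| the distribution before drawing from it; a rough illustrative
| sketch, not SillyTavern's actual code:
|
|     import numpy as np
|
|     def top_k_top_p_filter(probs, top_k=50, top_p=0.9):
|         # Keep only the top-k most probable tokens, and within
|         # those only the smallest set whose cumulative probability
|         # reaches top_p, then renormalize the kept probabilities.
|         probs = np.asarray(probs, dtype=float)
|         order = np.argsort(probs)[::-1]  # most probable first
|         keep = np.zeros_like(probs, dtype=bool)
|         cumulative = 0.0
|         for rank, idx in enumerate(order):
|             if rank >= top_k or cumulative >= top_p:
|                 break
|             keep[idx] = True
|             cumulative += probs[idx]
|         filtered = np.where(keep, probs, 0.0)
|         return filtered / filtered.sum()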
| Vespasian wrote:
| If you liked that, Andrej Karpathy has a few interesting
| videos on his channel explaining neural networks and their
| inner workings, aimed at people who know how to program.
| jtonz wrote:
| As a reasonably experienced programmer who has watched
| Andrej's videos, the one thing I would recommend is that they
| not be used as a starting point to learn neural networks, but
| as reinforcement or enhancement once you know the
| fundamentals.
|
| I was ignorant enough to try and jump straight into his videos
| and, despite him recommending I watch his preceding videos, I
| incorrectly assumed I could figure it out as I went. There is
| terminology in there that you simply must know to get the most
| out of it. After giving up, going away, and filling in the
| gaps through some other learning, I went back and his videos
| became (understandably) massively more valuable to me.
|
| I would strongly recommend that anyone else wanting to learn
| neural networks learn from my mistake.
| __loam wrote:
| 3B1B is one of the best STEM educators on YouTube.
| lxe wrote:
| Can't wait for the next videos. I think I'll finally be able to
| internalize and understand how these things work.
| throwawayk7h wrote:
| The next token is taken by sampling from the logits in the
| final column after unembedding. But isn't that just the last
| token again? Or is the matrix resized to N+1 at some step?
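|
| (My best guess in code, where `model` is a stand-in that
| returns one row of logits per input position - is it just this
| append loop?)
|
|     def generate(model, tokens, n_new):
|         # model(tokens) is assumed to return a (len(tokens), vocab)
|         # array of logits; row i scores the token at position i+1.
|         for _ in range(n_new):
|             logits = model(tokens)
|             next_token = int(logits[-1].argmax())  # greedy, for simplicity
|             tokens = tokens + [next_token]         # sequence grows to N+1
|         return tokens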
___________________________________________________________________
(page generated 2024-04-01 23:00 UTC)