[HN Gopher] 3Blue1Brown: But what is a GPT? [video]
       ___________________________________________________________________
        
       3Blue1Brown: But what is a GPT? [video]
        
       Author : huhhuh
       Score  : 184 points
       Date   : 2024-04-01 19:37 UTC (3 hours ago)
        
 (HTM) web link (www.youtube.com)
 (TXT) w3m dump (www.youtube.com)
        
       | lucidrains wrote:
        | I can't think of anyone better to teach the attention mechanism
        | to the masses. This is a dream come true.
        
         | acchow wrote:
         | Incredible. This 3B1B series was started 6 years ago and keeps
         | going today with chapter 5.
         | 
          | If you haven't seen the first few chapters, I cannot recommend
          | them enough.
        
           | user_7832 wrote:
           | Would you be able to compare them to Andrew Ng's course?
        
             | sk11001 wrote:
             | They're not really comparable - if you're wondering if you
             | should do one or the other, you should do both.
        
             | ctrw wrote:
             | The way you compare a technical drawing of a steam engine
             | to The Fighting Temeraire oil painting.
        
       | yinser wrote:
       | What an unbelievable salve for all the April Fool's content. Pipe
       | this directly into my veins.
        
       | user_7832 wrote:
       | I've just started this video, but already have a question if
       | anyone's familiar with GPT workings - I thought that these models
       | chose the next word based on what's most likely. But if they
       | choose based on "one of the likely" words, could (in general)
       | that not lead to a situation where the list of predictions for
        | the next word is much less likely? Running possibilities of "two
       | words together", then, would be more beneficial if
       | computationally possible (and so on for 3, 4 and n words). Does
       | this exist?
       | 
       | (I realize that choosing the most likely word wouldn't
       | necessarily solve the issue, but choosing the most likely phrase
       | possibly might.)
       | 
        | Edit, after seeing the video and comments: it's beam search,
        | along with temperature, that controls these things.
        
         | mvsin wrote:
          | Something like this does exist; production systems rarely use
          | greedy search and instead rely on more holistic search
          | algorithms.
          | 
          | An example is beam search:
          | https://www.width.ai/post/what-is-beam-search
          | 
          | Essentially, we keep a window of the most probable candidate
          | sequences rather than committing to a single token at each
          | step, which improves the final quality of the output.
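          | 
          | Roughly, the idea looks something like this (a toy next-token
          | table standing in for a real model, just to show how keeping
          | several candidates can beat greedy decoding):
          | 
          |   import math
          |   
          |   # Toy next-token distribution: maps a prefix (tuple of
          |   # tokens) to candidate continuations with probabilities.
          |   # A real system would query the model here.
          |   def next_token_probs(prefix):
          |       table = {
          |           (): {"the": 0.5, "a": 0.4, "an": 0.1},
          |           ("the",): {"cat": 0.3, "dog": 0.3, "end": 0.4},
          |           ("a",): {"cat": 0.9, "end": 0.1},
          |       }
          |       return table.get(prefix, {"end": 1.0})
          |   
          |   def beam_search(beam_width=2, max_len=3):
          |       # Each beam is (tokens, cumulative log-probability).
          |       beams = [((), 0.0)]
          |       for _ in range(max_len):
          |           candidates = []
          |           for tokens, score in beams:
          |               for tok, p in next_token_probs(tokens).items():
          |                   candidates.append(
          |                       (tokens + (tok,), score + math.log(p)))
          |           # Keep only the top `beam_width` partial sequences.
          |           candidates.sort(key=lambda c: c[1], reverse=True)
          |           beams = candidates[:beam_width]
          |       return beams
          |   
          |   print(beam_search())
          | 
          | With those toy numbers, greedy decoding commits to "the" (the
          | single most likely first word) and ends up with a lower-
          | probability sequence, while the beam keeps "a" around and
          | finds "a cat", which is exactly the "two words together"
          | effect asked about upthread.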
        
           | user_7832 wrote:
            | Thanks, that's exactly what I was looking for! Any idea if
            | it's possible to use beam search on local models like
            | Mistral? It sounds like the choice of beam search vs, say,
            | top-p or top-k should live in the inference software rather
            | than being baked into the model, right?
        
             | yunohn wrote:
             | This is actually a great question for which I found an
             | interesting attempt:
             | https://andys.page/posts/llm_sampling_strategies/
             | 
             | (No affiliation)
        
             | activatedgeek wrote:
              | If you use HuggingFace models, then a few simpler decoding
              | algorithms are already implemented in the `generate` method
              | of all supported models.
             | 
             | Here is a blog post that describes it:
             | https://huggingface.co/blog/how-to-generate.
             | 
              | I will warn you, though, that beam search is typically what
              | you do NOT want. Beam search approximately optimizes for
              | the "most likely sequence at the token level." This is
              | rarely what you need in practice with open-ended
              | generation (e.g. a question-answering chatbot). In
              | practice, you want the "most likely semantic sequence,"
              | which is a much harder problem.
              | 
              | Of course, various approximations for semantic alignment
              | exist in the literature, but this is still a wide-open
              | problem.
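              | 
              | With the transformers library, switching between decoding
              | strategies is mostly a matter of arguments to `generate`
              | ("gpt2" below is just a small stand-in model; the blog
              | post above walks through these options in more detail):
              | 
              |   from transformers import AutoModelForCausalLM
              |   from transformers import AutoTokenizer
              |   
              |   tokenizer = AutoTokenizer.from_pretrained("gpt2")
              |   model = AutoModelForCausalLM.from_pretrained("gpt2")
              |   prompt = "The attention mechanism"
              |   inputs = tokenizer(prompt, return_tensors="pt")
              |   
              |   # Beam search: approximately maximizes sequence
              |   # likelihood at the token level.
              |   beams = model.generate(
              |       **inputs, num_beams=4, max_new_tokens=20)
              |   
              |   # Sampling with temperature / top-p: what open-ended
              |   # chat-style generation usually uses instead.
              |   sampled = model.generate(
              |       **inputs, do_sample=True, temperature=0.8,
              |       top_p=0.95, max_new_tokens=20)
              |   
              |   print(tokenizer.decode(
              |       beams[0], skip_special_tokens=True))
              |   print(tokenizer.decode(
              |       sampled[0], skip_special_tokens=True))
              | 
              | Same model, same prompt; only the decoding strategy
              | changes.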
        
         | doctoboggan wrote:
          | The temperature setting controls how rare a chosen next token
          | can be. If it's set to 0, the top of the likelihood list is
          | always chosen; if it's set greater than 0, then some lower-
          | probability tokens may be chosen as well.
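          | 
          | In code it's roughly a rescaling of the logits before the
          | softmax (made-up logits for three tokens, just to show the
          | effect):
          | 
          |   import numpy as np
          |   
          |   def sample_with_temperature(logits, temperature):
          |       # temperature == 0: greedy, always the top token.
          |       if temperature == 0:
          |           return int(np.argmax(logits))
          |       # Higher temperature flattens the distribution, so
          |       # lower-probability tokens get picked more often.
          |       scaled = np.asarray(logits) / temperature
          |       probs = np.exp(scaled - scaled.max())  # stable softmax
          |       probs /= probs.sum()
          |       rng = np.random.default_rng()
          |       return int(rng.choice(len(probs), p=probs))
          |   
          |   logits = [2.0, 1.0, 0.1]
          |   print(sample_with_temperature(logits, 0))    # always 0
          |   print(sample_with_temperature(logits, 1.0))  # usually 0
          |   print(sample_with_temperature(logits, 2.0))  # more varied
          | 
          | So when a lower-probability token does get picked, it's a
          | weighted random draw from that rescaled distribution, not a
          | uniform coin flip.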
        
           | user_7832 wrote:
           | Thanks, learnt something new today!
        
           | not_a_dane wrote:
            | It's applied as part of the softmax step (the logits are
            | divided by the temperature before the softmax), but not
            | always.
        
           | davekeck wrote:
           | > then some lower probability tokens may be chosen
           | 
           | Can you explain how it chooses one of the lower-probability
           | tokens? Is it just random?
        
         | lxe wrote:
          | There's a whole bunch of different normalization and sampling
          | techniques you can apply that alter the quality or
          | expressiveness of the model's output, e.g.
          | https://docs.sillytavern.app/usage/common-settings/#sampler-...
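          | 
          | As one example, here's a minimal sketch of a top-p ("nucleus")
          | filter, one of the standard samplers (made-up token
          | probabilities, just to show the cutoff):
          | 
          |   import numpy as np
          |   
          |   def top_p_filter(probs, p=0.9):
          |       # Keep the smallest set of tokens whose cumulative
          |       # probability reaches p, then renormalize.
          |       order = np.argsort(probs)[::-1]   # most likely first
          |       cumulative = np.cumsum(probs[order])
          |       keep = order[:np.searchsorted(cumulative, p) + 1]
          |       kept = np.zeros_like(probs)
          |       kept[keep] = probs[keep]
          |       return kept / kept.sum()
          |   
          |   probs = np.array([0.5, 0.3, 0.15, 0.05])
          |   filtered = top_p_filter(probs, p=0.9)  # drops the 0.05
          |   rng = np.random.default_rng()
          |   token = rng.choice(len(probs), p=filtered)
          | 
          | Top-k is the same idea with a fixed token count instead of a
          | probability-mass cutoff.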
        
       | Vespasian wrote:
        | If you liked that, Andrej Karpathy has a few interesting videos
        | on his channels explaining neural networks and their inner
        | workings, which are aimed at people who know how to program.
        
         | jtonz wrote:
          | As a reasonably experienced programmer who has watched
          | Andrej's videos, the one thing I would recommend is that they
          | not be used as a starting point for learning neural networks,
          | but as reinforcement or enhancement once you know the
          | fundamentals.
          | 
          | I was ignorant enough to try to jump straight into his videos,
          | and despite him recommending I watch his preceding videos, I
          | incorrectly assumed I could figure it out as I went. There is
          | terminology in there that you simply must know to get the
          | most out of them. After giving up, going away, and filling in
          | the gaps through some other learning, I went back, and his
          | videos became (understandably) massively more valuable to me.
          | 
          | I would strongly recommend that anyone else wanting to learn
          | neural networks learn from my mistake.
        
       | __loam wrote:
        | 3B1B is one of the best STEM educators on YouTube.
        
       | lxe wrote:
       | Can't wait for the next videos. I think I'll finally be able to
       | internalize and understand how these things work.
        
       | throwawayk7h wrote:
       | The next token is taken by sampling the logits in the final
       | column after unembedding. But isn't that just the last token
       | again? Or is the matrix resized to N+1 at some step?
        
       ___________________________________________________________________
       (page generated 2024-04-01 23:00 UTC)