[HN Gopher] The Illustrated DeepSeek-R1
       ___________________________________________________________________
        
       The Illustrated DeepSeek-R1
        
       Author : amrrs
       Score  : 142 points
       Date   : 2025-01-27 20:51 UTC (2 hours ago)
        
 (HTM) web link (newsletter.languagemodels.co)
 (TXT) w3m dump (newsletter.languagemodels.co)
        
       | caithrin wrote:
       | This is fantastic work, thank you!
        
       | jasonjmcghee wrote:
        | For the uninitiated, this is the same author behind the many
        | other "The Illustrated..." blog posts.
       | 
       | A particularly popular one:
       | https://jalammar.github.io/illustrated-transformer/
       | 
       | Always very high quality.
        
         | punkspider wrote:
         | Thanks so much for mentioning this. His name carries a lot of
         | weight for me as well.
        
       | blackeyeblitzar wrote:
       | The thing I still don't understand is how DeepSeek built the base
        | model cheaply, and why their models seem to think they are GPT-4
       | when asked. This article says the base model is from their
       | previous paper, but that paper also doesn't make clear what they
       | trained on. The earlier paper is mostly a description of
       | optimization techniques they applied. It does mention pretraining
       | on 14.8T tokens with 2.7M H800 GPU hours to produce the base
       | DeepSeek-V3. But what were those tokens? The paper describes the
       | corpus only in vague ways.
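        | 
        | For scale, the headline training-cost number is essentially just
        | those GPU hours times an assumed rental rate (the V3 paper
        | assumes roughly $2 per H800 GPU-hour), and excludes research,
        | data, and infrastructure costs. A back-of-envelope sketch:
        | 
        |     # back-of-envelope: GPU-hours x assumed H800 rental rate
        |     gpu_hours = 2.7e6        # H800 GPU hours cited for pretraining
        |     usd_per_gpu_hour = 2.0   # rental price assumed in the V3 paper
        |     print(f"~${gpu_hours * usd_per_gpu_hour / 1e6:.1f}M")  # ~$5.4M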
        
         | moralestapia wrote:
          | A friend just sent me a screenshot in which he asks DeepSeek if
          | it has a Mac app, and it replies that there's a ChatGPT app
          | from OpenAI, lol.
         | 
         | I 100% believe they distilled GPT-4, hence the low "training"
         | cost.
        
           | Philpax wrote:
           | Er, how would that reduce the cost? You still need to train
           | the model, which is the expensive bit.
           | 
           | Also, the base model for V3 and the only-RL-tuned R1-Zero are
           | available, and they behave like base models, which seems
           | unlikely if they used data from OpenAI as their primary data
           | source.
           | 
           | It's much more likely that they've consumed the background
           | radiation of the web, where OpenAI contamination is dominant.
        
         | moritonal wrote:
          | I imagine it's one of two things: either they used ChatGPT as
          | an oracle to get training data, or it's the radiocarbon issue,
          | where the Internet now carries so much ChatGPT-generated text
          | that other models get confused.
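          | 
          | Concretely, the "oracle" route would look something like the
          | sketch below: query a teacher model and save the (prompt,
          | completion) pairs as supervised training data. The model
          | name, prompts, and file name here are placeholders, not
          | anything DeepSeek has confirmed:
          | 
          |     # hypothetical sketch: synthetic SFT data from a teacher model
          |     import json
          |     from openai import OpenAI
          | 
          |     client = OpenAI()  # reads OPENAI_API_KEY from the env
          |     prompts = ["Explain KV caching in one paragraph.",
          |                "Prove that sqrt(2) is irrational."]
          | 
          |     with open("synthetic_sft.jsonl", "w") as f:
          |         for p in prompts:
          |             resp = client.chat.completions.create(
          |                 model="gpt-4o",
          |                 messages=[{"role": "user", "content": p}],
          |             )
          |             # each line becomes one (prompt, completion) example
          |             f.write(json.dumps({
          |                 "prompt": p,
          |                 "completion": resp.choices[0].message.content,
          |             }) + "\n")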
        
       | whoistraitor wrote:
       | It's remarkable we've hit a threshold where so much can be done
       | with synthetic data. The reasoning race seems an utterly solvable
       | problem now (thanks mostly to the verifiability of results). I
       | guess the challenge then becomes non-reasoning domains, where
       | qualitative and truly creative results are desired.
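        | 
        | The "verifiability" part is concrete: for math- or code-style
        | tasks the answer can be scored with a rule, no learned reward
        | model needed. A minimal sketch, in the spirit of the rule-based
        | accuracy rewards described for R1-Zero (the exact <answer> tag
        | format here is illustrative):
        | 
        |     import re
        | 
        |     def accuracy_reward(output: str, truth: str) -> float:
        |         """1.0 if the extracted final answer matches, else 0.0."""
        |         m = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
        |         if not m:
        |             return 0.0      # unparseable output earns no reward
        |         return 1.0 if m.group(1).strip() == truth else 0.0
        | 
        |     # accuracy_reward("... <answer>42</answer>", "42") -> 1.0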
        
         | kenjackson wrote:
         | It seems like we need an evaluation model for creativity. I'm
         | curious, is there research on this -- for example, can one
         | score a random painting and output how creative/good a given
         | population is likely to find it?
        
       ___________________________________________________________________
       (page generated 2025-01-27 23:00 UTC)