[HN Gopher] Implementation of Google's Griffin Architecture - RN...
       ___________________________________________________________________
        
       Implementation of Google's Griffin Architecture - RNN LLM
        
       Author : milliondreams
       Score  : 118 points
       Date   : 2024-04-10 17:47 UTC (5 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | VHRanger wrote:
        | Like RWKV and Mamba, this mixes in some RNN properties to
        | avoid the issues transformers have.
       | 
       | However I'm curious about their scaling claims. They have a plot
       | that shows how the model scales in training with the FLOPs you
       | throw at it.
       | 
        | But the issue we should really be concerned with is the
        | wall-clock time of training on a fixed amount of hardware.
        | 
        | Back in 2018 we could train medium-sized RNNs; the issues were
        | training wall time and training stability.
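        | 
        | To illustrate the wall time point (a toy sketch of my own, not
        | from the paper): a classic RNN forward pass is a loop over time
        | steps, so within a single sequence step t can't start until
        | step t-1 is done, no matter how much hardware you have, whereas
        | attention processes all positions of the sequence at once
        | during training.
        | 
        |     import numpy as np
        | 
        |     def rnn_forward(x, W_h, W_x):
        |         # x: (T, d) input sequence; W_h, W_x: (d, d) weights.
        |         # The loop below is inherently sequential: wall time
        |         # grows with T even with unlimited accelerators.
        |         T, d = x.shape
        |         h = np.zeros(d)
        |         hs = []
        |         for t in range(T):
        |             h = np.tanh(W_h @ h + W_x @ x[t])
        |             hs.append(h)
        |         return np.stack(hs)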
        
          | whimsicalism wrote:
          | transformers were also just better at the LM task than 2018
          | RNNs for an equal amount of training FLOPs
        
           | VHRanger wrote:
           | Yeah, that's just the training stability part to my knowledge
        
             | whimsicalism wrote:
             | they're also just less capable models. like just adding
             | attention on top of an RNN made them a lot better
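              | 
              | (What "attention on top of an RNN" buys you, roughly; a
              | toy sketch of my own, not any specific 2018 model: the
              | decoder can look at every encoder state directly instead
              | of a single compressed hidden vector.)
              | 
              |     import numpy as np
              | 
              |     def attend(query, enc_states):
              |         # query: (d,) decoder state; enc_states: (T, d)
              |         # RNN outputs. Dot-product attention returns a
              |         # weighted mix of all encoder steps.
              |         scores = enc_states @ query        # (T,)
              |         w = np.exp(scores - scores.max())
              |         w /= w.sum()                       # softmax
              |         return w @ enc_states              # context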
        
         | foota wrote:
         | Do you know the downside with RWKV? Based on how they present
         | it, it seems like the best thing since sliced bread, but I
         | would have assumed that it would have been widely adopted if
         | that were the case.
        
           | VHRanger wrote:
           | It seems only OK as a model? Looking at the LLM chat
           | leaderboard it's 71st and the 14B version is worse than a lot
           | of 7B models:
           | 
           | https://huggingface.co/spaces/lmsys/chatbot-arena-
           | leaderboar...
           | 
           | Also, llama.cpp makes inference accessible for a lot of
           | people, and it's not available for RWKV.
           | 
            | Not to knock the model, I'm sure it's good. I also like
            | that it's a successful example of citizen science.
           | 
           | It's just not popular enough to have the inference
           | infrastructure transformers have, not established enough to
           | attract enough money to get 60B+ models trained, and so on.
        
             | whimsicalism wrote:
             | i believe it is undertrained, at minimum
        
             | WanderPanda wrote:
              | This leaderboard is not the best for comparing model
              | architectures; the dataset and finetuning have too much
              | influence. I think perplexity on a particular dataset
              | would be a better way to compare.
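              | 
              | Something like this (my sketch, assuming a next-token LM
              | evaluated on a fixed held-out set):
              | 
              |     import math
              | 
              |     def perplexity(nll_per_token):
              |         # nll_per_token: negative log-likelihood (nats)
              |         # of each token in the eval set under the model.
              |         # exp(mean NLL); lower is better, and it doesn't
              |         # depend on chat finetuning or prompt formatting.
              |         return math.exp(sum(nll_per_token) /
              |                         len(nll_per_token))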
        
           | jimmyl02 wrote:
            | From what I know about RWKV, it's mostly a one-man effort and
           | doesn't have the same data pipeline / resources as most major
           | labs. It's a bit unfortunate but I'm curious about the
           | performance given the same training corpus as OpenAI's GPTs.
           | Maybe some labs have tried internally but haven't released
           | results? On the other hand it makes sense to invest more
           | money into transformer training runs as they have been proven
           | to work.
           | 
            | They really burst onto the scene and brought back RNNs in the
            | world of transformers. The claim that RWKV isn't
            | parallelizable during training also seems to be refuted in
            | their readme. I'd guess the real issue is generalizable
            | performance, as there is a difference between doing well on
            | benchmarks and being usable. Personally, I tried running the
            | weights a long time ago when they were first released and the
            | results weren't usable, but I'm sure there has been
            | considerable progress since then.
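            | 
            | On the parallel training point: the generic trick (my
            | sketch, not RWKV's exact WKV formulation) is that a linear
            | recurrence h_t = a_t * h_{t-1} + b_t has an associative
            | combine, so the prefix scan can run in O(log T) parallel
            | steps on a GPU instead of T strictly sequential ones:
            | 
            |     import numpy as np
            | 
            |     def combine(e1, e2):
            |         # associative operator for h_t = a_t*h_{t-1} + b_t
            |         a1, b1 = e1
            |         a2, b2 = e2
            |         return a1 * a2, a2 * b1 + b2
            | 
            |     def linear_recurrence(a, b):
            |         # a, b: (T, d). Done sequentially here for clarity;
            |         # because `combine` is associative, a parallel scan
            |         # produces the same h_t values in log depth.
            |         acc = (np.ones_like(a[0]), np.zeros_like(b[0]))
            |         out = []
            |         for t in range(len(a)):
            |             acc = combine(acc, (a[t], b[t]))
            |             out.append(acc[1])   # h_t, assuming h_0 = 0
            |         return np.stack(out)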
        
         | GaggiX wrote:
          | The paper shows that the speed is comparable to transformer
          | models, and faster at "long" sequence lengths like 8k.
        
       | riku_iki wrote:
        | I didn't get one detail: they selected a 6B transformer as the
        | baseline and compared it to a 7B Griffin.
        | 
        | Why wouldn't they select equal-sized models?
        
         | szundi wrote:
         | They probably had them for some reason and it was cheaper not
         | to retrain one of them again
        
           | riku_iki wrote:
            | It's just that the performance comparison is misleading then;
            | they report marginal improvements, which are expected just
            | because of the model size difference.
        
             | GaggiX wrote:
              | It also performs better at every other size.
        
               | riku_iki wrote:
                | They have a baseline transformer of at most 6B in the
                | tables. The other models are trained on very different
                | data, and probably trained differently.
        
               | GaggiX wrote:
                | All the MQA transformers, Hawk, and Griffin are trained
                | on the same MassiveText dataset, so no.
        
               | riku_iki wrote:
                | Yes, but the MQA baseline is limited to 6B, while the
                | "other" larger non-RNN models in the table (Llama-2) are
                | not trained on the same dataset, and Hawk and Griffin are
                | 7B. Sorry, I don't understand your point.
        
               | GaggiX wrote:
                | The point is that it also beats the baseline at every
                | other size (1B and 3B). So it wouldn't be surprising to
                | see it beat a 7B transformer just as it beats the 6B one.
                | Note 2 on page 5 probably explains why the sizes are
                | different.
        
       | spxneo wrote:
       | im not smart enough to know the significance of this...is Griffin
       | like MAMBA?
        
         | VHRanger wrote:
          | Yes, like RWKV and Mamba, this is a new generation of models
          | that are more like big RNNs than the pure transformers we have
          | now.
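          | 
          | The core idea, very roughly (my sketch of a gated diagonal
          | linear recurrence in the spirit of these models, not the
          | exact RG-LRU block from the Griffin paper): the state is a
          | fixed-size vector updated once per token, instead of a KV
          | cache that keeps growing with context length.
          | 
          |     import numpy as np
          | 
          |     def sigmoid(z):
          |         return 1.0 / (1.0 + np.exp(-z))
          | 
          |     def gated_recurrence(x, W_a, W_i):
          |         # x: (T, d) token features; W_a, W_i: (d, d) weights.
          |         # The state h stays size d, so inference memory per
          |         # layer is O(d), not O(T) like a KV cache.
          |         T, d = x.shape
          |         h = np.zeros(d)
          |         out = []
          |         for t in range(T):
          |             a = sigmoid(W_a @ x[t])       # forget gate
          |             i = sigmoid(W_i @ x[t])       # input gate
          |             h = a * h + (1.0 - a) * (i * x[t])
          |             out.append(h)
          |         return np.stack(out)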
        
       ___________________________________________________________________
       (page generated 2024-04-10 23:00 UTC)