[HN Gopher] Implementation of Google's Griffin Architecture - RN...
___________________________________________________________________
Implementation of Google's Griffin Architecture - RNN LLM
Author : milliondreams
Score : 118 points
Date : 2024-04-10 17:47 UTC (5 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| VHRanger wrote:
| Like RWKV and Mamba, this mixes some RNN properties back in to
| avoid the issues transformers have with long contexts (quadratic
| attention cost and a growing KV cache at inference); there's a
| rough sketch of the recurrence idea at the end of this comment.
|
| However I'm curious about their scaling claims. They have a plot
| that shows how the model scales in training with the FLOPs you
| throw at it.
|
| But the issue we should really be concerned with is the wall-clock
| time of training on a fixed amount of hardware.
|
| Back in 2018 we could train medium-sized RNNs; the issues were
| training wall-clock time and training stability.
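|
| For intuition, here's a minimal sketch (mine, not from the repo or
| the paper, and the names are made up) of the kind of gated diagonal
| linear recurrence these architectures build on; the state is a
| fixed-size vector updated per token, so inference memory stays
| constant instead of growing like a KV cache:
|
|     import numpy as np
|
|     # Toy gated diagonal (elementwise) linear recurrence.
|     def gated_linear_recurrence(x, a, b):
|         """x: (T, D) inputs; a, b: (T, D) gates in (0, 1)."""
|         h = np.zeros(x.shape[1])
|         out = np.empty_like(x)
|         for t in range(x.shape[0]):
|             h = a[t] * h + b[t] * x[t]  # per-channel state update
|             out[t] = h
|         return out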
| whimsicalism wrote:
| transformers were also just better at the LM task than 2018
| RNNs for an equal amount of training FLOPs
| VHRanger wrote:
| Yeah, that's just the training stability part to my knowledge
| whimsicalism wrote:
| they're also just less capable models. like just adding
| attention on top of an RNN made them a lot better
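|
| For context, a toy sketch of that pattern (dot-product attention
| over an RNN's hidden states; names made up, not any specific
| paper's formulation): the decoder gets to look back over the whole
| sequence instead of squeezing everything into one state vector.
|
|     import numpy as np
|
|     def attend_over_rnn_states(hidden_states, query):
|         """hidden_states: (T, D) RNN outputs; query: (D,)."""
|         scores = hidden_states @ query       # (T,) similarities
|         weights = np.exp(scores - scores.max())
|         weights /= weights.sum()             # softmax over time
|         return weights @ hidden_states       # (D,) context vector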
| foota wrote:
| Do you know the downsides of RWKV? Based on how they present
| it, it seems like the best thing since sliced bread, but I
| would have assumed it would have been widely adopted if that
| were the case.
| VHRanger wrote:
| It seems only OK as a model? Looking at the LLM chat
| leaderboard it's 71st and the 14B version is worse than a lot
| of 7B models:
|
| https://huggingface.co/spaces/lmsys/chatbot-arena-
| leaderboar...
|
| Also, llama.cpp makes inference accessible for a lot of
| people, and it's not available for RWKV.
|
| Not to knock the model, I'm sure it's good. I also like
| that it's a successful example of citizen science.
|
| It's just not popular enough to have the inference
| infrastructure transformers have, not established enough to
| attract enough money to get 60B+ models trained, and so on.
| whimsicalism wrote:
| i believe it is undertrained, at minimum
| WanderPanda wrote:
| This leaderboard is not the best for comparing model
| architectures; the dataset and finetuning have too much
| influence. I think perplexity on a particular dataset would
| be a better way to compare.
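|
| For example, something along these lines (a sketch; the log-probs
| would come from whichever model you're evaluating) depends only on
| the raw language model, not on chat finetuning or voter taste:
|
|     import numpy as np
|
|     def perplexity(token_log_probs):
|         """log p(token | context) for each held-out token."""
|         return float(np.exp(-np.mean(token_log_probs)))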
| jimmyl02 wrote:
| From what I know about RWKV, it's mostly a one-man effort and
| doesn't have the same data pipeline / resources as most major
| labs. It's a bit unfortunate, but I'm curious what the
| performance would be given the same training corpus as OpenAI's
| GPTs. Maybe some labs have tried internally but haven't released
| results? On the other hand, it makes sense to invest more money
| into transformer training runs, as they have been proven to
| work.
|
| They really burst onto the scene and brought back RNNs in the
| world of transformers. The claim that RWKV isn't parallelizable
| during training also seems to be refuted in their readme (see
| the sketch at the end of this comment for why a linear
| recurrence can be parallelized). I'd guess the issue is
| generalizable performance, as there is a difference between
| doing well on benchmarks and being usable. Personally, I tried
| running the weights a long time ago, when they were first
| released, and the results weren't usable, but I'm sure there
| has been considerable progress since then.
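|
| (Sketch of the trick, in case it's useful: a linear recurrence
| h_t = a_t * h_{t-1} + x_t composes associatively, so training
| frameworks can evaluate it with a parallel prefix scan, e.g.
| jax.lax.associative_scan, in O(log T) depth rather than a serial
| loop. This is the general idea, not RWKV's actual kernels.)
|
|     import numpy as np
|
|     # One step viewed as an affine map h -> a*h + x; composing two
|     # such maps is associative, which is what a parallel scan needs.
|     def combine(seg1, seg2):
|         a1, x1 = seg1
|         a2, x2 = seg2
|         return a1 * a2, a2 * x1 + x2
|
|     # Sequential reference (assumes h_0 = 0); a parallel scan
|     # applies `combine` in a tree over the T segments instead.
|     def recurrence_via_scan(a, x):
|         acc = (a[0], x[0])
|         out = [acc[1]]
|         for t in range(1, len(x)):
|             acc = combine(acc, (a[t], x[t]))
|             out.append(acc[1])
|         return np.array(out)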
| GaggiX wrote:
| The paper shows that the speed is comparable to transformer
| models, and faster for smaller models and "long" sequence
| lengths like 8k.
| riku_iki wrote:
| I didn't get one detail: they selected a 6B transformer as the
| baseline and compared it to the 7B Griffin.
|
| Why wouldn't they select equal-sized models?..
| szundi wrote:
| They probably had them already for some reason and it was
| cheaper not to retrain one of them.
| riku_iki wrote:
| It's just that the performance comparison is misleading then;
| they report marginal improvements, which are expected just
| because of the difference in model size.
| GaggiX wrote:
| It also performs better at every other size.
| riku_iki wrote:
| They have a baseline transformer of max size 6B in the
| tables. The other models are trained on very different data
| and probably trained differently.
| GaggiX wrote:
| All the MQA transformers, Hawk, and Griffin are trained on
| the same MassiveText dataset, so no.
| riku_iki wrote:
| Yes, but the MQA baseline is limited to 6B, while the "other"
| larger non-RNN models in the table (Llama-2) are not trained
| on the same dataset, and Hawk and Griffin are 7B. Sorry, I
| don't understand your point.
| GaggiX wrote:
| The point is that it also beats the baseline at every other
| size (1B and 3B), so it wouldn't be surprising to see it
| beat a 7B transformer the same way it beats the 6B one.
| Note 2 on page 5 probably explains why the sizes are
| different.
| spxneo wrote:
| I'm not smart enough to know the significance of this... is
| Griffin like Mamba?
| VHRanger wrote:
| Yes, like RWKV and Mamba, this is a new generation of models
| that are more like big RNNs than the pure transformers we
| have now.
___________________________________________________________________
(page generated 2024-04-10 23:00 UTC)