[HN Gopher] The RWKV language model: An RNN with the advantages ...
___________________________________________________________________
The RWKV language model: An RNN with the advantages of a
transformer
Author : T-A
Score : 174 points
Date : 2023-03-30 10:03 UTC (12 hours ago)
(HTM) web link (johanwind.github.io)
(TXT) w3m dump (johanwind.github.io)
| Straw wrote:
| Unfortunately it's not very good at longer context lengths, which
| sort of defeats the point of efficient scaling with context. See
| https://twitter.com/arankomatsuzaki/status/16390003799784038...
|
| It's also not really an RNN. The best way to describe the key time
| mixing operation is a normalized exponentially weighted moving
| average (EMA), with no non-linearity. Once viewed this way, it's
| not surprising that it struggles at longer contexts: everything
| decays, and it has limited space to put things. Of course, it
| does have some clever tricks, and can choose to remember things
| for a while by upweighting them, but not forever.
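|
| A minimal sketch of that EMA view, assuming per-channel decay
| rates and exp(k) upweighting (my own simplification, not the
| exact RWKV formulas, which also add a bonus term for the current
| token):
|
|     import numpy as np
|
|     def ewma_mix(values, keys, decay):
|         """values, keys: (T, C); decay: (C,) positive per-channel decay."""
|         T, C = values.shape
|         num = np.zeros(C)   # decayed running sum of exp(k) * v
|         den = np.zeros(C)   # decayed running sum of exp(k) (normalizer)
|         out = np.empty((T, C))
|         for t in range(T):
|             num = np.exp(-decay) * num + np.exp(keys[t]) * values[t]
|             den = np.exp(-decay) * den + np.exp(keys[t])
|             out[t] = num / (den + 1e-9)  # normalized EMA: everything decays
|         return out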
| Der_Einzige wrote:
| Yup. This is why GPT-4 long context length version is going to
| be such a god damn gamechanger.
|
| The current memory techniques we have outside of ultra long
| context lengths are lossy and imperfect. I wish the langchain
| spammers (yes it's a good tool) would acknowledge this when
| they keep posting everywhere about the "memory module".
| zaptrem wrote:
| It already is a game changer, Bing Chat's Creative mode
| appears to be using >8k token context (though I'm not sure if
| it's 16k or 32k).
| lysecret wrote:
| Yes, I am so curious how they did it. I know about flash
| attention, but there is no way that alone gets us all the way
| there.
| sebzim4500 wrote:
| Why not? They charge 8x more for a 32k context than for an
| 8k context (note that the prices are normally presented per
| token; here I'm talking about absolute cost). Naive scaling
| on the self-attention component would suggest 16x compute
| and 4x memory, while the rest of it (feed-forward layers,
| embeddings, activation functions, etc.) would all go up 4x
| in both compute and memory.
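|
| A back-of-envelope check of that argument (mirroring the split
| above; the 4x attention memory assumes only the KV cache grows,
| not a materialized attention matrix):
|
|     n_ratio = 32_000 / 8_000              # 4x longer context
|     attention_compute = n_ratio ** 2      # ~N^2 -> 16x
|     attention_memory = n_ratio            # KV cache ~N -> 4x
|     rest_compute = rest_memory = n_ratio  # feed-forward etc. ~N -> 4x
|     print(attention_compute, attention_memory, rest_compute)  # 16.0 4.0 4.0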
| imranq wrote:
| The stats shown are pretty compelling that this transformer + RNN
| approach works. Looking back at the Llama paper
| (https://arxiv.org/pdf/2302.13971.pdf), many of the results are
| comparable between Llama and RWKV, e.g. Llama 13B scores 80.1 on
| PIQA and RWKV scores 77.5.
|
| I wonder if RLHF would boost performance for this architecture in
| the same way
| titaniumtown wrote:
| I personally think the way these chat assistants are created
| using these stateless models is extremely flawed. I'm glad that
| the technique used in RWKV seems successful; I wonder if this
| transformer + RNN combination will be adopted elsewhere.
| VHRanger wrote:
| Transformers scale in training mainly because they are
| stateless!
|
| They have a long lookback window (30,000 tokens in GPT-4!) so
| the issues are smoothed over.
|
| LSTMs end up being very finicky to train, and their
| sequential nature bottlenecks parallelism when scaling
| models up.
| gliptic wrote:
| RWKV training can be parallelized just like transformers.
| It's not an LSTM.
| VHRanger wrote:
| Right, I meant that the reason transformers overtook LSTM-
| based LMs in 2018 was those training issues, which
| transformers solved.
|
| I'm excited about the next iteration of models focusing
| on inference. RWKV seems promising on this end. Also love
| the citizen-science aspect of how RWKV came about.
| leumassuehtam wrote:
| You can currently play around with the 7B [0] and 14B [1]
| parameter models hosted on Hugging Face Spaces.
|
| [0] https://huggingface.co/spaces/BlinkDL/Raven-RWKV-7B
|
| [1] https://huggingface.co/spaces/BlinkDL/ChatRWKV-gradio
| joshspankit wrote:
| This gave me a good chuckle. Thought I would share. (From
| ChatRWKV)
|
| Prompt:
|
| A website can be built in 10 simple steps
|
| Output:
|
| 1. Research
|
| 2. Research
|
| 3. Research
|
| 4. Research
|
| 5. Research
|
| 6. Research
|
| 7. Design
|
| 8. Design
|
| 9. Design
|
| 10. Design
| aww_dang wrote:
| Also here:
|
| https://bellard.org/ts_server/
| tarruda wrote:
| Somehow not surprised Bellard is also going to make an impact
| in AI/LLMs
| mdp2021 wrote:
| > _Somehow not surprised Bellard is also going to make an
| impact in AI [...]_
|
| He also wrote an engine: "LibNC: C Library for Tensor
| Manipulation"
|
| https://bellard.org/libnc/
|
| > _[...] /LLMs_
|
| That could be a wish come true.
|
| Edit: in fact, his mark can already be seen:
| <<Larger models work optimally on lower cost GPUs (e.g. RTX
| 3090, RTX A6000) thanks to efficient quantization>>
| synctext wrote:
| > going to make an impact in AI/LLMs
|
| This seems to be made for mass parallelism.
|
| Question: Would this scale to 100 million collaborating
| 4G/5G smartphones? We have operational federated learning
| code and decentralised learning code (trivial stochastic
| gradient descent). Application: simple learning-to-rank
| based on click logs like it's 2005 [1]. Then you have a true
| decentralised Google.
|
| Would love to link RWKV to other purely decentralised tech.
| In the past we built the first self-compiling Android
| app and the first Android-to-Android P2P overlay network,
| without any helper peers for carrier-grade NAT puncturing.
| My university systems lab lacks the size to keep up with
| the recent pace of innovation.
|
| [1] https://grouplens.org/beyond2005/full/pouwelse.pdf
| ducktective wrote:
| Man fears AGI
|
| AGI fears Fabrice Bellard
| muttled wrote:
| You're not lying. When I saw the site I thought "isn't that
| the dude who managed to emulate an entire PC in Javascript
| and boot Windows in a browser? Yep that's him."
| lucidrains wrote:
| While this topic has visibility, there is also another relatively
| unexplored research direction: fine-tuning pretrained
| transformers into RNNs.
|
| https://arxiv.org/abs/2210.04243
|
| https://aclanthology.org/2021.emnlp-main.830.pdf
| [deleted]
| ipsum2 wrote:
| > though in practice, the model might have a hard time
| generalizing to much longer context lengths than it saw during
| training
|
| What's the benefit of RWKV then? The results are practically
| identical to transformers.
| sebzim4500 wrote:
| It is way cheaper to run inference on commodity hardware,
| because you don't have to keep track of the activations of all
| the previous tokens (only the 3 previous ones, IIRC).
| og_kalu wrote:
| You can easily fine-tune for more context and indeed he does.
| There are versions with at least 8192 ctx out now (it started
| with 1024), and he plans to go all the way to at least 16k.
| vagabund wrote:
| - from O(N^2) to O(N) complexity during training wrt context
| length
|
| - only need the hidden state at position t to compute t+1, i.e.
| you can run inference locally on edge devices (see the sketch
| below)
|
| - fine-tunable to longer context lengths than seen in pre-
| training
|
| If it continues to scale as well as it has so far, it's pretty
| huge.
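|
| A sketch of what that second point buys you: generation with a
| constant-size state, no matter how long the context gets. The
| `model.forward(token, state)` call is a hypothetical API in the
| spirit of ChatRWKV, not the exact signature:
|
|     def generate(model, tokenizer, prompt, n_tokens):
|         state = None                                     # assumes a non-empty prompt
|         for token in tokenizer.encode(prompt):
|             logits, state = model.forward(token, state)  # absorb the prompt
|         out = []
|         for _ in range(n_tokens):
|             token = int(logits.argmax())                 # greedy pick, for brevity
|             out.append(token)
|             logits, state = model.forward(token, state)  # O(1) work per token
|         return tokenizer.decode(out)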
| PartiallyTyped wrote:
| > RWKV combines the best features of RNNs and transformers.
| During training, we use the transformer type formulation of the
| architecture, which allows massive parallelization (with a sort
| of attention which scales linearly with the number of tokens).
| For inference, we use an equivalent formulation which works
| like an RNN with a state. This allows us to get the best of
| both worlds.
|
| > So we basically have a model which trains like a transformer,
| except that long context length is not expensive. And during
| inference, we need substantially less memory and can implicitly
| handle "infinite" context length (though in practice, the model
| might have a hard time generalizing to much longer context
| lengths than it saw during training).
|
| The tl;dr: linear attention avoids the quadratic token scaling
| and is equivalent to specific instances of RNNs.
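|
| A minimal sketch of that equivalence, using generic linear
| attention with a feature map phi (RWKV's actual update adds
| per-channel time decay, but the constant-size-state idea is the
| same):
|
|     import numpy as np
|
|     def phi(x):
|         return np.exp(x)                        # any positive feature map
|
|     def parallel_form(Q, K, V):
|         """Whole-sequence form, easy to parallelize during training."""
|         A = np.tril(phi(Q) @ phi(K).T)          # causal, unnormalized weights
|         return (A @ V) / (A.sum(axis=1, keepdims=True) + 1e-9)
|
|     def recurrent_form(Q, K, V):
|         """Token-by-token form: the state (S, z) has a fixed size."""
|         S = np.zeros((K.shape[1], V.shape[1]))  # running sum of phi(k) v^T
|         z = np.zeros(K.shape[1])                # running sum of phi(k)
|         out = []
|         for q, k, v in zip(Q, K, V):
|             S = S + np.outer(phi(k), v)
|             z = z + phi(k)
|             out.append(phi(q) @ S / (phi(q) @ z + 1e-9))
|         return np.stack(out)
|
|     T, d = 5, 4
|     Q, K, V = (np.random.randn(T, d) * 0.1 for _ in range(3))
|     assert np.allclose(parallel_form(Q, K, V), recurrent_form(Q, K, V))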
| choeger wrote:
| Can someone ELI5 how linear attention works? Isn't that
| impossible in principle?
| sebzim4500 wrote:
| It's not really attention, it's 'just' an exponential moving
| average over time where different channels have different decay
| rates. This is a simplification; in the actual architecture
| there are also convolution layers.
| numeri wrote:
| I've seen this hyped a lot on Reddit, and its main contributing
| base seems to be on a Discord channel, but have yet to see any
| scholarly discussion behind it.
|
| Is there a reason the creator hasn't written or tried to publish
| a paper on it? I'd love to see a peer-reviewed discussion
| detailing how it works, maybe studying how it behaves internally,
| evaluating its performance more rigorously than posting a few
| examples, maybe in comparison to Transformers with similar
| parameter counts, or training/inference costs in FLOPs. Perhaps
| that's not the creator's main priority, though.
| adeon wrote:
| I think I've seen this same sentiment on many of the Reddit
| threads on RWKV. I think the author prefers to spend their time
| on coding it up and explaining it on the GitHub rather than
| taking the time to write a paper on it.
|
| I kinda respect that; you can't force someone to explain their
| ways if they prefer hacking to writing. I would also love to see
| more rigorous measurements and discussion, though.
|
| The creator once replied "Thank you :) Too busy for that at this
| moment, but I will get a paper out later this year." on a
| reddit post asking exactly this.
|
| https://old.reddit.com/r/MachineLearning/comments/1135aew/r_...
| codewithcheese wrote:
| If the context size is unbounded, how does the time complexity
| scale with the size of the context, and what are the limiting
| factors that affect performance as the context size grows larger?
| sebzim4500 wrote:
| The time and space complexity of inference is constant wrt.
| context size. You will probably need more parameters to match
| the performance of a transformer though, so whether it scales
| better in practice is an open question.
| freeqaz wrote:
| When I've played around with this on my box, I haven't been able
| to get the output to be very decent. Maybe I'm just used to
| ChatGPT though.
|
| Does anybody have some examples of output that it generates to
| explain what this is capable of?
| imranq wrote:
| Check out the web demo for the 14B parameter model:
| https://huggingface.co/spaces/BlinkDL/ChatRWKV-gradio
|
| I tried a few queries. It works okay for basic coding prompts,
| but can give pretty good results with more context / prompt
| engineering.
| vagabund wrote:
| If you're expecting ChatGPT-esque outputs, I'd suggest this 7B
| demo that's fine-tuned on the Alpaca dataset:
|
| https://huggingface.co/spaces/BlinkDL/Raven-RWKV-7B
| amrb wrote:
| Happy to see this getting attention, lots of great open models
| are being worked on and I can't wait to see something people at
| home with a 3090 could use.
| tarruda wrote:
| You should be able to run Llama 30B q4 on an RTX 3090. Check the
| table here: https://bellard.org/ts_server/
|
| 30B/q4 requires 20GB of RAM, while the 3090 has 24GB.
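|
| Rough arithmetic behind those numbers (my own back-of-envelope;
| the exact footprint depends on the runtime):
|
|     params = 30e9
|     weight_bytes = params * 4 / 8      # 4-bit weights -> ~15 GB
|     print(weight_bytes / 1e9)          # 15.0; the rest of the ~20GB is
|                                        # KV cache, activations and overhead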
| mromanuk wrote:
| Which is the next step up in performance/price/goodness
| after the RTX 3090 for running a mini AI homelab?
| espadrine wrote:
| To increase performance, you need bigger foundation models,
| like LLaMA 65B. You will always be RAM-bound for that
| (~40GB for LLaMA 65B INT4). So the next step is to upgrade
| your motherboard to have multiple GPU slots. If you really
| wanted to stick to a single-GPU setup, you could upgrade to an
| RTX A6000, but it is more expensive than two RTX 3090s while
| holding the same amount of VRAM.
| yieldcrv wrote:
| It's a dual-slot GPU.
|
| What benefit does that offer over 2 GPUs?
| espadrine wrote:
| The two GPUs will need to exchange information in order
| to complete inference, with one GPU holding half of the
| network weights and the other holding the other half.
| That transmission of information will be limited by the
| PCIe bandwidth; for instance, 30 GB/s with v4x16.
|
| Meanwhile the A6000 VRAM bandwidth is 768 GB/s (= 16 Gb/s
| (GDDR6) x 384 bit-width / 8 bits per byte).
| coolspot wrote:
| Two 3090s can be connected using an NVLink bridge, which is
| much faster than PCIe.
| espadrine wrote:
| I couldn't find precise information on what bandwidth
| you'd get with NVLink on the 3090. To be fair, though, if
| all we do is inference, using Hugging Face pipeline
| parallelism, the amount of data transferred is pretty
| small: 8192 x 2 x n_tokens bytes; for most uses, with a
| recent PCIe setup, that bottleneck will take less than
| 0.1 ms per token generated, which may not be the dominant
| latency.
|
| (Also, the RTX 3090 has faster VRAM, >900 GB/s, than the
| A6000, because it is GDDR6X.)
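|
| A quick back-of-envelope check of that per-token estimate
| (assuming a hidden size of 8192, fp16 activations, and ~30 GB/s
| effective PCIe 4.0 x16 bandwidth):
|
|     hidden_size = 8192
|     bytes_per_token = hidden_size * 2       # fp16 activations crossing GPUs
|     pcie_bandwidth = 30e9                   # bytes/s
|     print(bytes_per_token / pcie_bandwidth * 1e3)  # ~0.0005 ms per token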
| azeirah wrote:
| Not entirely certain about this, but I believe that because
| Apple M-series GPUs use system memory, you could possibly
| run the larger models on accessible (albeit expensive)
| consumer hardware.
|
| Need 64GB for 30B
|
| And 128GB for the 65B
|
| I'm not sure about the performance, but I think it should
| be ok? Especially given how much Apple has been investing
| in what -- I believe they call -- neural cores?
|
| Here's some more context:
| https://news.ycombinator.com/item?id=35105364
| michannne wrote:
| Seconding this.
|
| Llama 30B 4-bit has amazing performance, comparable to GPT-3
| quality for my search and novel-generation use cases, and it
| fits on a single 3090. In tandem with third-party projects
| such as LlamaIndex and the Alpaca LoRA, GPT-3 (and
| potentially GPT-4) has already been democratized in my eyes.
___________________________________________________________________
(page generated 2023-03-30 23:01 UTC)