[HN Gopher] The RWKV language model: An RNN with the advantages ...
       ___________________________________________________________________
        
       The RWKV language model: An RNN with the advantages of a
       transformer
        
       Author : T-A
       Score  : 174 points
       Date   : 2023-03-30 10:03 UTC (12 hours ago)
        
 (HTM) web link (johanwind.github.io)
 (TXT) w3m dump (johanwind.github.io)
        
       | Straw wrote:
        | Unfortunately it's not very good at longer context lengths,
        | which sort of defeats the point of efficient scaling with
        | context. See
        | https://twitter.com/arankomatsuzaki/status/16390003799784038...
       | 
        | It's also not really an RNN. The best way to describe the key
        | time-mixing operation is a normalized exponentially weighted
        | moving average (EMA): no non-linearity. Once viewed this way,
        | it's not surprising that it struggles at longer contexts:
        | everything decays, and it has limited space to put things. Of
        | course, it does have some clever tricks, and can choose to
        | remember things for a while by upweighting them, but not
        | forever.
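        | 
        | A minimal sketch of that view (my own simplification, not the
        | exact RWKV formulation; as I understand it, real RWKV also has
        | token shift, a bonus for the current token, and channel-mixing
        | blocks), assuming per-channel decays in (0, 1):
        | 
        |   import numpy as np
        | 
        |   def ewma_time_mix(keys, values, decay):
        |       # keys, values: (T, C); decay: (C,) per-channel.
        |       # Output t is a weighted average of values[0..t]:
        |       # token i gets weight exp(keys[i]) discounted by
        |       # decay**(t - i).
        |       T, C = values.shape
        |       num = np.zeros(C)   # running weighted sum of values
        |       den = np.zeros(C)   # running sum of weights
        |       out = np.zeros((T, C))
        |       for t in range(T):
        |           w = np.exp(keys[t])
        |           num = decay * num + w * values[t]
        |           den = decay * den + w
        |           out[t] = num / den
        |       return out
        | 
        | A large key "upweights" a token so it dominates the average
        | for a while, but its weight still decays geometrically, which
        | matches the "nothing is remembered forever" behavior above.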
        
         | Der_Einzige wrote:
          | Yup. This is why the long-context version of GPT-4 is going
          | to be such a goddamn game changer.
          | 
          | The current memory techniques we have, outside of ultra-long
          | context lengths, are lossy and imperfect. I wish the
          | LangChain spammers (yes, it's a good tool) would acknowledge
          | this when they keep posting everywhere about the "memory
          | module".
        
           | zaptrem wrote:
            | It already is a game changer: Bing Chat's Creative mode
            | appears to be using a >8k-token context (though I'm not
            | sure if it's 16k or 32k).
        
           | lysecret wrote:
            | Yes, I am so curious how they did it. I know about
            | FlashAttention, but there is no way that gets us all the
            | way there.
        
             | sebzim4500 wrote:
              | Why not? They charge 8x more for a 32k context than for
              | an 8k context (note that the prices are normally
              | presented per token; here I'm talking about absolute
              | cost). Naive scaling of the self-attention component
              | would suggest 16x compute and 4x memory, while the rest
              | of it (feed-forward layers, embeddings, activation
              | functions, etc.) would all go up 4x in both compute and
              | memory.
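              | 
              | Back-of-the-envelope version of that argument (a sketch
              | assuming a fixed model size and naive O(n^2) attention;
              | real deployments use FlashAttention-style kernels, so
              | treat these as rough upper bounds):
              | 
              |   ratio = 32_000 / 8_000     # 4x longer context
              | 
              |   attn_compute = ratio ** 2  # n x n scores -> 16x
              |   attn_memory = ratio        # KV cache grows
              |                              # per token -> 4x
              |   rest_compute = ratio       # MLPs, embeddings -> 4x
              |   rest_memory = ratio        # activations -> 4x
              | 
              |   print(attn_compute, attn_memory,
              |         rest_compute, rest_memory)
              |   # 16.0 4.0 4.0 4.0 -- an 8x price bump sits
              |   # between those extremes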
        
       | imranq wrote:
       | The stats shown are pretty compelling that this transformer + RNN
       | approach works. Looking back at the Llama paper
        | (https://arxiv.org/pdf/2302.13971.pdf), many of the results
        | are comparable between LLaMA and RWKV, e.g. LLaMA 13B scores
        | 80.1 on PIQA and RWKV scores 77.5.
       | 
       | I wonder if RLHF would boost performance for this architecture in
       | the same way
        
         | titaniumtown wrote:
          | I personally think the way these chat assistants are created
          | using these stateless models is extremely flawed. I'm glad
          | that the technique used in RWKV seems successful, and I
          | wonder if this transformer + RNN combination will be adopted
          | elsewhere.
        
           | VHRanger wrote:
           | Transformers scale in training mainly because they are
           | stateless!
           | 
            | They have a long lookback window (30,000 tokens in
            | GPT-4!), so the issues are smoothed over.
           | 
            | LSTMs end up being very finicky to train, and their
            | sequential nature bottlenecks scaling because it limits
            | parallelism.
        
             | gliptic wrote:
              | RWKV training can be parallelized just like
              | transformers. It's not an LSTM.
        
               | VHRanger wrote:
                | Right, I meant that transformers exploded over LSTM-
                | based LMs in 2018 because of those training issues
                | that transformers solved.
                | 
                | I'm excited about the next iteration of models
                | focusing on inference. RWKV seems promising on this
                | front. I also love the citizen-science aspect of how
                | RWKV came about.
        
       | leumassuehtam wrote:
        | You can currently play around with the 7B [0] and 14B [1]
        | parameter models hosted on Hugging Face Spaces.
       | 
       | [0] https://huggingface.co/spaces/BlinkDL/Raven-RWKV-7B
       | 
       | [1] https://huggingface.co/spaces/BlinkDL/ChatRWKV-gradio
        
         | joshspankit wrote:
         | This gave me a good chuckle. Thought I would share. (From
         | ChatRWKV)
         | 
         | Prompt:
         | 
         | A website can be built in 10 simple steps
         | 
         | Output:
         | 
         | 1. Research
         | 
         | 2. Research
         | 
         | 3. Research
         | 
         | 4. Research
         | 
         | 5. Research
         | 
         | 6. Research
         | 
         | 7. Design
         | 
         | 8. Design
         | 
         | 9. Design
         | 
         | 10. Design
        
         | aww_dang wrote:
         | Also here:
         | 
         | https://bellard.org/ts_server/
        
           | tarruda wrote:
           | Somehow not surprised Bellard is also going to make an impact
           | in AI/LLMs
        
             | mdp2021 wrote:
              | > _Somehow not surprised Bellard is also going to make
              | an impact in AI [...]_
             | 
             | He also wrote an engine: "LibNC: C Library for Tensor
             | Manipulation"
             | 
             | https://bellard.org/libnc/
             | 
              | > _[...]/LLMs_
             | 
             | That could be a wish come true.
             | 
              | Edit: in fact, one of his marks can already be seen:
              | <<Larger models work optimally on lower-cost GPUs (e.g.
              | RTX 3090, RTX A6000) thanks to efficient quantization>>
        
             | synctext wrote:
             | > going to make an impact in AI/LLMs
             | 
             | This seems to be made for mass parallelism.
             | 
              | Question: Would this scale to 100 million collaborating
              | 4G/5G smartphones? We have operational federated
              | learning code and decentralised learning code (trivial
              | stochastic gradient descent). Application: simple
              | learning-to-rank based on click logs, like it's 2005
              | [1]. Then you have a truly decentralised Google.
             | 
              | Would love to link RWKV to other purely decentralised
              | tech. In the past we built the first self-compiling
              | Android app and the first Android-to-Android P2P overlay
              | network, without any helper peers for carrier-grade NAT
              | puncturing. My university systems lab lacks the size to
              | keep up with the recent pace of innovation.
             | 
             | [1] https://grouplens.org/beyond2005/full/pouwelse.pdf
        
           | ducktective wrote:
           | Man fears AGI
           | 
           | AGI fears Fabrice Bellard
        
             | muttled wrote:
              | You're not lying. When I saw the site I thought "isn't
              | that the dude who managed to emulate an entire PC in
              | JavaScript and boot Windows in a browser? Yep, that's
              | him."
        
       | lucidrains wrote:
        | while this topic has visibility, there is also another
        | relatively unexplored research direction: fine-tuning
        | pretrained transformers into RNNs
       | 
       | https://arxiv.org/abs/2210.04243
       | 
       | https://aclanthology.org/2021.emnlp-main.830.pdf
        
         | [deleted]
        
       | ipsum2 wrote:
       | > though in practice, the model might have a hard time
       | generalizing to much longer context lengths than it saw during
       | training
       | 
       | What's the benefit of RWKV then? The results are practically
       | identical to transformers.
        
         | sebzim4500 wrote:
          | It is way cheaper to run inference on commodity hardware,
          | because you don't have to keep track of the activations of
          | all the previous tokens (only the 3 previous ones IIRC).
        
         | og_kalu wrote:
          | You can easily fine-tune for more context, and indeed he
          | does. There are versions with at least 8192 ctx out now (it
          | started with 1024), and he plans to go all the way to at
          | least 16k.
        
         | vagabund wrote:
          | - from O(N^2) to O(N) complexity during training wrt
          | context length
         | 
         | - only need the hidden state at position t to compute t+1, i.e.
         | run inference locally on edge devices
         | 
         | - fine-tunable to longer context lengths than seen in pre-
         | training
         | 
         | If it continues to scale as well as it has so far it's pretty
         | huge.
        
         | PartiallyTyped wrote:
         | > RWKV combines the best features of RNNs and transformers.
         | During training, we use the transformer type formulation of the
         | architecture, which allows massive parallelization (with a sort
         | of attention which scales linearly with the number of tokens).
         | For inference, we use an equivalent formulation which works
         | like an RNN with a state. This allows us to get the best of
         | both worlds.
         | 
         | > So we basically have a model which trains like a transformer,
         | except that long context length is not expensive. And during
         | inference, we need substantially less memory and can implicitly
         | handle "infinite" context length (though in practice, the model
         | might have a hard time generalizing to much longer context
         | lengths than it saw during training).
         | 
          | The tl;dr: linear attention avoids the quadratic scaling in
          | the number of tokens and is equivalent to specific instances
          | of RNNs.
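          | 
          | A toy illustration of that dual formulation (a simplified
          | linear-attention recurrence with one channel and a fixed
          | decay, not the full RWKV WKV operator): the transformer-
          | style form looks at all past tokens for each position, the
          | RNN-style form carries a two-number state, and they produce
          | identical outputs.
          | 
          |   import numpy as np
          | 
          |   def transformer_style(k, v, decay):
          |       # For each t, weight every token i <= t by
          |       # decay**(t - i) * exp(k[i]) and average.
          |       out = []
          |       for t in range(len(v)):
          |           idx = np.arange(t + 1)
          |           w = decay ** (t - idx) * np.exp(k[:t + 1])
          |           out.append((w * v[:t + 1]).sum() / w.sum())
          |       return np.array(out)
          | 
          |   def rnn_style(k, v, decay):
          |       # Same result with a constant-size state.
          |       num = den = 0.0
          |       out = []
          |       for kt, vt in zip(k, v):
          |           num = decay * num + np.exp(kt) * vt
          |           den = decay * den + np.exp(kt)
          |           out.append(num / den)
          |       return np.array(out)
          | 
          |   k, v = np.random.randn(16), np.random.randn(16)
          |   assert np.allclose(transformer_style(k, v, 0.9),
          |                      rnn_style(k, v, 0.9))
          | 
          | In training, the first form can be evaluated for all
          | positions with parallel matrix operations; at inference,
          | the second form only needs the previous state, which is
          | where the memory savings come from.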
        
       | choeger wrote:
       | Can someone ELI5 how linear attention works? Isn't that
       | impossible in principle?
        
         | sebzim4500 wrote:
          | It's not really attention, it's 'just' an exponential
          | moving average over time where different channels have
          | different decay rates. This is a simplification; in the
          | actual architecture there are also convolution layers.
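          | 
          | To make the different decay rates concrete, here is a tiny
          | sketch (the decay values are made up for illustration): a
          | channel's decay sets how many tokens it takes for a
          | contribution to halve.
          | 
          |   import math
          | 
          |   for d in (0.9, 0.99, 0.999):
          |       half_life = math.log(0.5) / math.log(d)
          |       print(f"decay {d}: contribution halves "
          |             f"every {half_life:.0f} tokens")
          |   # fast channels track local context; slow
          |   # channels keep a longer, but still fading,
          |   # summary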
        
       | numeri wrote:
        | I've seen this hyped a lot on Reddit, and its main
        | contributing base seems to be on a Discord channel, but I have
        | yet to see any scholarly discussion behind it.
       | 
       | Is there a reason the creator hasn't written or tried to publish
        | a paper on it? I'd love to see a peer-reviewed discussion
       | detailing how it works, maybe studying how it behaves internally,
       | evaluating its performance more rigorously than posting a few
       | examples, maybe in comparison to Transformers with similar
       | parameter counts, or training/inference costs in FLOPs. Perhaps
       | that's not the creator's main priority, though.
        
         | adeon wrote:
         | I think I've seen this same sentiment on many of the Reddit
         | threads on RWKV. I think the author prefers to spend their time
         | on coding it up and explaining it on the GitHub rather than
         | taking the time to write a paper on it.
         | 
          | I kinda respect that; you can't force someone to explain
          | their ways if they prefer hacking to writing. I would also
          | love to see more rigorous measurements and discussion,
          | though.
         | 
          | The creator once replied "Thank you :) Too busy for that at
          | this moment, but I will get a paper out later this year." to
          | a Reddit post asking exactly this.
         | 
         | https://old.reddit.com/r/MachineLearning/comments/1135aew/r_...
        
       | codewithcheese wrote:
       | If the context size is unbounded, how does the time complexity
       | scale with the size of the context, and what are the limiting
       | factors that affect performance as the context size grows larger?
        
         | sebzim4500 wrote:
         | The time and space complexity of inference is constant wrt.
         | context size. You will probably need more parameters to match
         | the performance of a transformer though, so whether it scales
         | better in practice is an open question.
        
       | freeqaz wrote:
       | When I've played around with this on my box, I haven't been able
       | to get the output to be very decent. Maybe I'm just used to
       | ChatGPT though.
       | 
       | Does anybody have some examples of output that it generates to
       | explain what this is capable of?
        
         | imranq wrote:
         | Check out the web demo for the 14B parameter model:
         | https://huggingface.co/spaces/BlinkDL/ChatRWKV-gradio
         | 
          | I tried a few queries. It works okay on basic coding prompts
          | but can give pretty good results with more context / prompt
          | engineering.
        
         | vagabund wrote:
          | If you're expecting ChatGPT-esque outputs, I'd suggest this
          | 7B demo that's fine-tuned on the Alpaca dataset.
         | 
         | https://huggingface.co/spaces/BlinkDL/Raven-RWKV-7B
        
       | amrb wrote:
        | Happy to see this getting attention. Lots of great open
        | models are being worked on, and I can't wait to see something
        | people at home with a 3090 could use.
        
         | tarruda wrote:
          | You should be able to run Llama 30B q4 on an RTX 3090. Check
          | the table here: https://bellard.org/ts_server/
         | 
          | 30B/q4 requires 20GB of RAM, while the 3090 has 24GB.
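          | 
          | Rough arithmetic behind those numbers (a sketch; exact
          | figures depend on the quantization format and runtime
          | overhead):
          | 
          |   params = 30e9            # Llama 30B
          |   bits_per_weight = 4      # q4 quantization
          |   gb = params * bits_per_weight / 8 / 1e9
          |   print(gb)                # 15.0 GB of weights
          |   # Activations, KV cache and format overhead
          |   # push real usage to ~20 GB, which fits in
          |   # the 3090's 24 GB of VRAM.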
        
           | mromanuk wrote:
            | Which one is the next step up in
            | performance/price/goodness after the RTX 3090 for running
            | a mini AI homelab?
        
             | espadrine wrote:
             | To increase performance, you need bigger foundation models,
             | like LLaMA 65B. You will always be RAM-bound for that
             | (~40GB for LLaMA 65B INT4). So the next step is to upgrade
             | your motherboard to have multiple GPU slots. If you really
              | wanted to stick to a single-GPU setup, you could upgrade
              | to an RTX A6000, but it is more expensive than two RTX
              | 3090s while holding the same amount of VRAM.
        
               | yieldcrv wrote:
                | It's a dual-slot GPU.
                | 
                | What benefit does that offer over 2 GPUs?
        
               | espadrine wrote:
                | The two GPUs will need to exchange information in
                | order to complete inference, with one GPU holding half
                | of the network weights and the other holding the other
                | half. That transmission of information will be limited
                | by the PCIe bandwidth; for instance, ~30 GB/s with
                | PCIe 4.0 x16.
               | 
               | Meanwhile the A6000 VRAM bandwidth is 768 GB/s (= 16 Gb/s
               | (GDDR6) x 384 bit-width / 8 bits per byte).
        
               | coolspot wrote:
                | Two 3090s can be connected using an NVLink bridge,
                | which is much faster than PCIe.
        
               | espadrine wrote:
                | I couldn't find precise information on what bandwidth
                | you'd get with NVLink on the 3090. To be fair, though,
                | if all we do is inference, using Hugging Face pipeline
                | parallelism, the amount of data transferred is pretty
                | small: 8192 x 2 x n_tokens bytes. For most uses, with
                | a recent PCIe setup, that bottleneck will take less
                | than 0.1 ms per token generated, which may not be the
                | dominant latency.
               | 
               | (Also, the RTX 3090 has faster VRAM, >900 GB/s, than the
               | A6000, because it is GDDR6X.)
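                | 
                | Rough numbers behind the per-token estimate (a sketch
                | assuming a hidden size of 8192, fp16 activations, and
                | ~30 GB/s of PCIe 4.0 x16 bandwidth):
                | 
                |   hidden = 8192        # activation width
                |   bytes_per_val = 2    # fp16
                |   pcie_bw = 30e9       # bytes/s
                | 
                |   per_token = hidden * bytes_per_val  # 16 KiB
                |   ms = per_token / pcie_bw * 1e3
                |   print(ms)   # ~0.0005 ms per token, well
                |               # under the 0.1 ms bound above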
        
             | azeirah wrote:
              | Not entirely certain about this, but I believe that
              | because Apple M-series GPUs use system memory, you could
              | possibly run the larger models on accessible (albeit
              | expensive) consumer hardware.
             | 
             | Need 64GB for 30B
             | 
             | And 128GB for the 65B
             | 
              | I'm not sure about the performance, but I think it
              | should be OK? Especially given how much Apple has been
              | investing in what I believe they call neural cores.
             | 
             | Here's some more context:
             | https://news.ycombinator.com/item?id=35105364
        
           | michannne wrote:
           | Seconding this.
           | 
            | Llama 30B 4-bit has amazing performance, comparable to
            | GPT-3 quality for my search and novel-generation use
            | cases, and fits on a single 3090. In tandem with third-
            | party applications such as LlamaIndex and the Alpaca LoRA,
            | GPT-3 (and potentially GPT-4) has already been
            | democratized in my eyes.
        
       ___________________________________________________________________
       (page generated 2023-03-30 23:01 UTC)