[HN Gopher] RWKV Language Model
       ___________________________________________________________________
        
       RWKV Language Model
        
       Author : simonpure
       Score  : 156 points
       Date   : 2024-12-30 22:19 UTC (3 days ago)
        
 (HTM) web link (www.rwkv.com)
 (TXT) w3m dump (www.rwkv.com)
        
       | nullc wrote:
        | Anyone ever look at doing a MoE-like composition with RWKV and
        | a transformer?
        
         | pico_creator wrote:
          | Not an MoE, but we have already done hybrid models, and found
          | them to be highly performant (relative to the training budget).
         | 
         | https://arxiv.org/abs/2407.12077
        
       | intalentive wrote:
       | Idea for killer app for recurrent models: low latency, low memory
       | LLM / TTS coupling. Start decoding / generating speech as soon as
       | new tokens are generated. When the LLM is cranking out token t,
       | the TTS is already working on token t-1. It doesn't have to wait.
       | Then when the LLM is finished, the TTS is nearly finished too.
        | With the two models colocated, you save another network call
        | as well.
       | 
       | Recurrent models with constant hidden state are naturally suited
       | to streaming data, potentially opening the door to unexplored new
       | use cases.
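        | 
        | A minimal sketch of that pipelining (llm.stream_tokens and
        | tts.feed here are hypothetical placeholders, not a real API):
        | 
        |     import queue, threading
        | 
        |     token_q = queue.Queue()  # hands tokens from the LLM to TTS
        | 
        |     def llm_worker(prompt):
        |         # Push each token the moment it is decoded.
        |         for tok in llm.stream_tokens(prompt):  # hypothetical
        |             token_q.put(tok)
        |         token_q.put(None)  # sentinel: generation finished
        | 
        |     def tts_worker():
        |         # Synthesise token t-1 while the LLM decodes token t.
        |         while (tok := token_q.get()) is not None:
        |             tts.feed(tok)  # hypothetical incremental TTS call
        |         tts.flush()
        | 
        |     threading.Thread(target=llm_worker, args=("Hi",)).start()
        |     tts_worker()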
        
         | computerex wrote:
         | New multimodal models take raw speech input and provide raw
         | speech output, no tts in the middle.
        
           | benob wrote:
           | A relatively detailed description of such systems:
           | https://arxiv.org/abs/2410.00037
        
           | Closi wrote:
           | Seems like the future - so much meaning and context is lost
           | otherwise.
        
           | intalentive wrote:
           | Very cool. Logical next step. Would be interested to know
           | what the dataset looks like.
        
             | moffkalast wrote:
             | Youtube. Youtube is the dataset.
        
         | cootsnuck wrote:
          | This can already be done today using a streaming-capable LLM
          | with a streaming input/output TTS model.
        
           | lostmsu wrote:
           | Any LLM is "streaming capable".
        
             | Xmd5a wrote:
             | https://github.com/mit-han-lab/streaming-llm
             | 
              | On a side note, and that's what led me to the link above, I
              | wonder if it would be possible to chain N streaming LLMs in
              | an agent workflow and get a final output stream almost
              | instantaneously, without waiting for the first N-1 LLMs to
              | complete their replies.
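              | 
              | A rough sketch of that wiring (stream here is a
              | hypothetical generator method yielding each LLM's
              | tokens as they arrive; whether the semantics work out
              | is exactly the open question):
              | 
              |     def chain(prompt, llms):
              |         # Each stage consumes the previous stage's
              |         # token stream and can start emitting before
              |         # the upstream model has finished.
              |         stream = iter([prompt])
              |         for llm in llms:
              |             stream = llm.stream(stream)  # hypothetical
              |         return stream
              | 
              |     for tok in chain("question", [llm_a, llm_b, llm_c]):
              |         print(tok, end="", flush=True)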
        
         | pico_creator wrote:
          | This is actually the hypothesis for Cartesia (the state space
          | team), and hence their deep focus specifically on voice
          | models: taking full advantage of recurrent models' constant-
          | time compute for low latencies.
          | 
          | The RWKV team's focus, however, is still first on the multi-
          | lingual text space, then the multi-modal space in the future.
        
           | swyx wrote:
           | Karan from Cartesia explains SSMs+voice really well:
           | https://www.youtube.com/watch?v=U9DPRZ0lSIQ
           | 
           | its one of those retrospectively obvious/genius insights that
           | i wish i understood when i first met him
        
         | yshui wrote:
         | Any autoregressive model can do what you are describing.
          | Transformers generate one token at a time too, not all at
          | once.
        
           | whimsicalism wrote:
           | yes but transformers are much slower than state space models
        
           | intalentive wrote:
           | True but memory requirements grow with sequence length. For
           | recurrent models the memory requirement is constant. This is
           | why I qualified with "low memory".
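            | 
            | A back-of-the-envelope illustration (the layer/head sizes
            | below are made-up round numbers, not any specific model):
            | 
            |     # A transformer's KV cache grows with sequence length
            |     # T; a recurrent model's state does not.
            |     layers, heads, head_dim, bytes_ = 32, 32, 128, 2  # fp16
            | 
            |     def kv_cache_bytes(T):
            |         # 2 = one K and one V tensor per layer
            |         return 2 * layers * heads * head_dim * T * bytes_
            | 
            |     for T in (1_000, 100_000):
            |         print(T, kv_cache_bytes(T) / 1e9, "GB")
            |     # ~0.5 GB at 1k tokens, ~52 GB at 100k tokens, while a
            |     # fixed-size recurrent state stays constant throughout.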
        
       | sushidev wrote:
        | Interesting. Very cryptic for a simple user like me. I wonder
        | if it's useful today, and for what purposes.
        
         | pico_creator wrote:
         | Currently the strongest RWKV model is 32B in size:
         | https://substack.recursal.ai/p/q-rwkv-6-32b-instruct-preview
         | 
          | This is a full drop-in replacement for any transformer model
          | use case at model sizes 32B and under, as it has performance
          | equal to existing open 32B models in most benchmarks.
          | 
          | We are working on a 70B, which will be a full drop-in
          | replacement for most text use cases.
        
           | swyx wrote:
           | how about finetuning your 32B to be R1QWQKV?
        
             | pico_creator wrote:
              | There is currently a lack of "O1 style" reasoning datasets
              | in the open-source space. QwQ did not release their
              | dataset, so that would take some time for the community to
              | prepare.
             | 
             | It's definitely something we are tracking to do as well =)
        
           | lostmsu wrote:
            | Why aren't you on the lmarena (formerly Chatbot Arena)
            | leaderboard?
        
             | pico_creator wrote:
              | Kinda on the todo list - the model is open source on HF for
              | anyone who is willing to make it work with lmarena.
        
       | swyx wrote:
       | for those who want a more conversational intro, we've been
       | covering RWKV for a bit!
       | 
       | 2023: https://latent.space/p/rwkv
       | 
       | 2024: https://www.youtube.com/watch?v=LPe6iC73lrc <- offers a bit
       | of professional compare and contrast vs state space models.
       | 
        | i think its cool that both RNN and LSTM (with xLSTM) now have
       | modern attention-inspired variants that solve the previous
       | issues. I wonder if 1) its possible to overcome the "hardware
       | lottery" that transformers have now won, and 2) if
       | recurrent/selective state can do the kind of proper lookback on
       | extremely long context that we will want it to do to compete with
       | full attention (easy to say no, harder to propose what to do
       | about it).
       | 
       | there's also Liquid AI, whatever it is that they do.
        
         | inciampati wrote:
         | the recurrent model needs a mechanism to replay past context.
         | no need to go quadratic to access all of it. they could replay
         | multiple times to get effects similar to attention.
         | 
         | the hardware lottery, well... imo it's really about leveraging
         | fully parallel training to learn how to use a memory. attention
         | is quadratic but it can be computed in parallel. it's an end to
         | end learned memory. getting that kind of pattern into RNNs
         | won't be easy but it's going to be crucial before we boil the
         | ocean.
        
           | pico_creator wrote:
            | RWKV already solves the parallel compute problem for GPUs,
            | based on the changes it has made - so it is a recurrent model
            | that can scale to thousands++ of GPUs, no issue.
            | 
            | The same arguably holds for other recurrent architectures
            | (state space, etc.) with very different design
            | implementations. The issue with the old recurrent designs was
            | just the way the LSTM was designed, not the recurrent nature.
        
         | shawntan wrote:
         | Although marketed as such, RWKV isn't really an RNN.
         | 
          | In the recent RWKV7 incarnation, you could argue it's a type of
          | linear RNN, but past versions had the issue of taking their
          | previous state from a lower layer, which allows for parallelism
          | but makes them closer to a convolution than a recurrent
          | computation.
         | 
          | As for 1), I'd like to believe so, but it's hard to get people
          | away from the addictive drug that is the easily parallelised
          | transformer. As for 2), (actual) RNNs and attention mechanisms
          | seem fairly powerful to me (expressivity-wise) and perhaps the
          | most acceptable to the community.
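          | 
          | Roughly the distinction being drawn, as a toy sketch (not
          | RWKV's actual update rules):
          | 
          |     import numpy as np
          | 
          |     def classic_rnn_step(h, x, W, U):
          |         # Nonlinear recurrence: h_t depends on h_{t-1}
          |         # through a nonlinearity, so the steps must be
          |         # computed one after another.
          |         return np.tanh(W @ h + U @ x)
          | 
          |     def linear_rnn_step(h, x, a, b):
          |         # Elementwise linear recurrence:
          |         # h_t = a * h_{t-1} + b * x_t. Because it is linear
          |         # in h, the whole sequence can be evaluated with a
          |         # parallel scan.
          |         return a * h + b * x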
        
           | bravura wrote:
            | Recent work by Feng et al from Bengio's lab focuses on how
           | attention can be formulated as an RNN ("Attention as RNN":
           | https://arxiv.org/pdf/2405.13956) and how minimal versions of
           | GRUs and LSTMs can be trained in parallel by removing some
           | parameters ("Were RNNs All We Needed?"
           | https://arxiv.org/pdf/2410.01201).
           | 
            | It's possible we will start seeing more blended versions of
            | RNN/attention architectures exploring different LLM
            | properties.
           | 
           | In particular, Aaren architecture in the former paper "can
           | not only (i) be trained in parallel (like Transformers) but
           | also (ii) be updated efficiently with new tokens, requiring
           | only constant memory for inferences (like traditional RNNs)."
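            | 
            | Roughly the minGRU idea from the second paper, as I read
            | it (a sketch; see the paper for the exact parametrisation):
            | 
            |     import torch, torch.nn as nn
            | 
            |     class MinGRUCell(nn.Module):
            |         # The gate and candidate depend only on x_t, not on
            |         # h_{t-1}, so h_t = (1-z)*h_{t-1} + z*h_tilde is
            |         # linear in h and can be trained with a parallel
            |         # scan instead of step-by-step backprop through time.
            |         def __init__(self, d_in, d_hidden):
            |             super().__init__()
            |             self.z = nn.Linear(d_in, d_hidden)
            |             self.h = nn.Linear(d_in, d_hidden)
            | 
            |         def step(self, x_t, h_prev):
            |             z_t = torch.sigmoid(self.z(x_t))
            |             h_tilde = self.h(x_t)
            |             return (1 - z_t) * h_prev + z_t * h_tilde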
        
             | shawntan wrote:
              | The formulations in Attention-as-RNN have similar issues to
              | RWKV. Fundamentally, it's a question of what we call an RNN.
             | 
             | Personally I think it's important not to call some of these
             | recent architectures RNNs because they have theoretical
             | properties that do not match (read: they're worse) what
             | we've "classically" called RNNs.
             | 
             | Ref: https://arxiv.org/abs/2404.08819
             | 
             | As a rule of thumb: you generally don't get parallelism for
             | free, you pay for it with poorer expressivity.
        
         | HarHarVeryFunny wrote:
         | The Transformer was specifically conceived to take advantage of
         | pre-existing massively parallel hardware, so it's a bit
         | backwards to say it "won the hardware lottery". Where the
         | Transformer did "win the lottery" is that the key-value form of
         | self-attention (invented by Noam Shazeer) needed to make
         | parallel processing work seems to have accidentally unlocked
         | capabilities like "induction heads" that make this type of
         | architecture extremely well suited to language prediction.
         | 
         | Given limits on clock speed, massive parallelism is always
         | going to be the way to approach brain-like levels of parallel
         | computation, so any model architecture aspiring to human level
         | AGI needs to be able to take advantage of that.
        
           | swyx wrote:
           | you are correct of course but i meant hardware lottery in the
           | sense of dedicated silicon companies like Etched and MatX
           | that have now emerged to make chips that only run
           | transformers (not exactly true for matx but hey i am
           | simplifying. would be cool if matx ran other arch's but its
           | not a priority)
        
       | pico_creator wrote:
        | Hey there, I'm Eugene / PicoCreator - co-leading the RWKV
        | project - feel free to AMA =)
        
         | low_tech_punk wrote:
          | Thanks! The 0.1B version looks perfect for embedded systems.
          | What is the key benefit of an attention-free architecture?
        
           | pico_creator wrote:
            | Lower compute cost, especially over longer sequence lengths.
            | Depending on context length, it's 10x, 100x, or even 1000x+
            | cheaper (quadratic vs linear cost difference).
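            | 
            | The rough scaling behind those multipliers (a toy
            | comparison that ignores constants and the MLP blocks, so
            | real-world ratios are smaller):
            | 
            |     for T in (1_000, 10_000, 100_000):
            |         attn_ops = T * T  # pairwise token interactions
            |         rnn_ops = T       # one state update per token
            |         print(T, attn_ops / rnn_ops)  # gap grows with T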
        
         | Ey7NFZ3P0nzAe wrote:
          | Has there been progress towards making RWKV multimodal? Can we
          | use projector layers to send images to RWKV?
        
           | pico_creator wrote:
            | There is work done on Vision RWKV and Audio RWKV; an
            | example paper is here: https://arxiv.org/abs/2403.02308
            | 
            | It's the same principle as open transformer models, where an
            | adapter is used to generate the embeddings.
            | 
            | However, the core team's current focus is on scaling the core
            | text model, as this would be the key performance driver,
            | before adapting to multi-modal.
            | 
            | The tech is there; the base model needs to be better.
         | Ey7NFZ3P0nzAe wrote:
          | I'm quite interested in repeng [0] (representation engineering)
          | for steerability of (so far transformer-based) LLMs and was
          | wondering if anyone had tried such methods on RWKV (or Mamba,
          | for that matter). Maybe there is some low-hanging fruit there.
         | 
         | [0] https://github.com/vgel/repeng/issues
        
           | pico_creator wrote:
            | One of the interesting "new directions" for RWKV and Mamba
            | (or any recurrent model) is the monitoring and manipulation
            | of the state in between tokens - for steerability, alignment,
            | etc. =)
            | 
            | Not saying it's a good or bad idea, but pointing out that
            | having a fixed-size state in between tokens has interesting
            | applications in this space.
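            | 
            | A sketch of what state steering could look like (model,
            | get_state, set_state, control_vec and alpha are all
            | hypothetical names, not an existing RWKV API):
            | 
            |     # control_vec: a steering direction obtained elsewhere,
            |     # e.g. from contrastive prompts as in repeng.
            |     alpha = 0.5
            |     state = model.get_state()            # fixed-size state
            |     state = state + alpha * control_vec  # nudge the state
            |     model.set_state(state)
            |     next_token = model.generate_next()   # keep decoding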
        
         | theLiminator wrote:
          | Do you have an in-depth comparison between RWKV and models
          | like Mamba or S4?
        
           | pico_creator wrote:
            | Not sure how in-depth you want it to be, but we did do a co-
            | presentation with one of the co-authors of Mamba at Latent
            | Space: https://www.youtube.com/watch?v=LPe6iC73lrc
        
         | jharohit wrote:
         | congrats and great work on RWKV and Recursal.ai
        
         | Ey7NFZ3P0nzAe wrote:
          | I noticed the lack of support for RWKV from ollama and
          | llama.cpp. As those are (to my eyes) very strong drivers of
          | experimentation (i.e. supporting them means vastly more
          | outreach), I was wondering whether you were considering taking
          | this into your own hands by contributing code to them? Or is
          | the fact that you are not (AFAIK) doing it because you lack the
          | bandwidth in terms of manpower, or for some other reason?
        
         | littlestymaar wrote:
          | Have there been any plans to build a "reasoning" LLM using
          | RWKV? With the increase in inference token count caused by such
          | methods, the much lower footprint of a recurrent architecture
          | could really make a difference for such a use case.
        
         | bratao wrote:
          | What would be the most performant way to run inference using
          | RWKV? Do you have any speed comparison to a similar-sized
          | transformer?
          | 
          | I have a task (OCR cleaning) for which I'm evaluating faster
          | options, and it looks like RWKV would be a nice alternative.
        
         | nickpsecurity wrote:
          | It's really interesting work. I'm glad you've kept at it. I'd
         | like to ask you about two issues.
         | 
         | I keep seeing papers like "Repeat After Me" claiming serious
         | weaknesses of state space vs transformer models. What are the
         | current weaknesses of RWKV vs transformers? Have you mitigated
         | them? If so, how?
         | 
         | The other issue is that file sharing being illegal, Wikipedia
         | requiring derivatives to be copyleft, etc means I can't train
         | models with most data legally. Pre-1920's works in Project
         | Gutenberg are totally public domain. Both the model and the
         | training data would be 100% legal for reproducible research.
         | Would your team be willing to train a 3B-7B model on only
         | Gutenberg and release it to the public domain?
         | 
         | (Note: The Stack without GitHub Issues can be used for
         | permissive code. However, there could be contamination issues
         | like incorrect licenses, PII, etc. So, maybe at least one, 100%
         | legal model. Maybe a second with Gutenberg and The Stack for
         | coding research.)
         | 
         | Example use of Gutenberg:
         | 
         | https://www.tensorflow.org/datasets/catalog/pg19
        
           | anon373839 wrote:
           | > The other issue is that file sharing being illegal,
           | Wikipedia requiring derivatives to be copyleft, etc means I
           | can't train models with most data legally.
           | 
           | That really depends on whether LLM pretraining ends up held
           | as an infringing use. (Of course, it'll take a while for the
           | cases to work through the courts and for a body of
           | jurisprudence to be developed on this subject.)
        
             | nickpsecurity wrote:
              | There are two legal issues: sharing copyrighted data, and
              | training on it. It's the latter that's ambiguous. My
              | problem is the former.
             | 
             | Making copies of and sharing copyrighted works without the
             | authors' permission is already illegal as proven in
             | countless, file-sharing cases. The AI trainers do this with
             | data sets like Common Crawl, The Pile, and RefinedWeb. Just
             | sharing them is illegal for most of the content in them.
             | 
             | I got ideas for how to deal with that in countries with TDM
             | exceptions, like Singapore. For now, the only things we can
             | share with others for model training are (a) public domain
             | works and (b) content licensed for permissive use and
             | sharing. Gutenberg entries before a certain year should be
             | pretty risk-free.
        
       | smusamashah wrote:
        | How does it compare with other LLMs in terms of performance?
        | Is it near GPT-3 or Llama, or what?
        
       | upghost wrote:
       | Seems really cool. Does anyone have any sample code to link to?
       | Do RNN models use the same pytorch/hugging face Python stuff or
       | is it completely different...?
        
       | ianand wrote:
       | Reminder that Microsoft ships RWKV with Windows (~1.5 billion
        | devices), making it probably the most widely deployed non-
        | transformer model out there. Amazing work!
       | https://blog.rwkv.com/p/rwkvcpp-shipping-to-half-a-billion
       | 
       | ps Eugene you should brag about that on the homepage of RWKV.
        
       | Fischgericht wrote:
       | "RWKV (pronounced RwaKuv)" - love it. How does the corw make?
       | Rwa! Rwa! Rwa!
        
         | bbor wrote:
         | Thank god I'm not the only one stunned by that. I don't need
         | IPA, but this isn't even vaguely pronounceable!
        
       ___________________________________________________________________
       (page generated 2025-01-02 23:01 UTC)