[HN Gopher] RWKV Language Model
___________________________________________________________________
RWKV Language Model
Author : simonpure
Score : 156 points
Date : 2024-12-30 22:19 UTC (3 days ago)
(HTM) web link (www.rwkv.com)
(TXT) w3m dump (www.rwkv.com)
| nullc wrote:
| Anyone ever look at doing a MoE like composition with RWKV and a
| transformer?
| pico_creator wrote:
| Not an MoE, but we have already done hybrid models, and found
| them to be highly performant (for the training budget):
|
| https://arxiv.org/abs/2407.12077
| intalentive wrote:
| Idea for a killer app for recurrent models: low-latency,
| low-memory LLM / TTS coupling. Start synthesizing speech as soon
| as new tokens are generated. While the LLM is cranking out token
| t, the TTS is already working on token t-1; it doesn't have to
| wait. Then when the LLM is finished, the TTS is nearly finished
| too. And with the two models colocated, you save another network
| call as well.
|
| Recurrent models with constant hidden state are naturally suited
| to streaming data, potentially opening the door to unexplored new
| use cases.
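|
| A minimal sketch of that pipelining in Python (generate_tokens()
| and synthesize() below are placeholders, not any real API): the
| TTS thread consumes token t-1 while the LLM produces token t.
|
|     import queue, threading
|
|     def generate_tokens():      # stand-in for the LLM decode loop
|         yield from ["Hello", " world", "!"]
|
|     def synthesize(text):       # stand-in for incremental TTS
|         print(f"speaking: {text!r}")
|
|     def pipeline():
|         q = queue.Queue()
|
|         def producer():
|             for tok in generate_tokens():
|                 q.put(tok)      # hand each token off immediately
|             q.put(None)         # sentinel: generation finished
|
|         threading.Thread(target=producer, daemon=True).start()
|         while (tok := q.get()) is not None:
|             synthesize(tok)     # TTS runs while the LLM keeps decoding
|
|     pipeline()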
| computerex wrote:
| New multimodal models take raw speech input and provide raw
| speech output, with no TTS in the middle.
| benob wrote:
| A relatively detailed description of such systems:
| https://arxiv.org/abs/2410.00037
| Closi wrote:
| Seems like the future - so much meaning and context is lost
| otherwise.
| intalentive wrote:
| Very cool. Logical next step. Would be interested to know
| what the dataset looks like.
| moffkalast wrote:
| YouTube. YouTube is the dataset.
| cootsnuck wrote:
| This can already be done today using a streaming-capable LLM
| with a streaming input/output TTS model.
| lostmsu wrote:
| Any LLM is "streaming-capable".
| Xmd5a wrote:
| https://github.com/mit-han-lab/streaming-llm
|
| On a side note, and that's what led me to the link above, I
| wonder if it would be possible to chain N streaming LLMs in an
| agent workflow and get a final output stream almost
| instantaneously, without waiting for the preceding N-1 LLMs to
| complete their replies.
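|
| For what it's worth, a rough sketch of that chaining idea with
| plain Python generators (the stage() function is a placeholder
| transform, not a real LLM call): each stage starts emitting
| before its input stream is complete.
|
|     def stage(name, upstream):
|         for tok in upstream:        # consume tokens as they arrive
|             yield f"{name}({tok})"  # emit immediately, no buffering
|
|     def source():
|         yield from ["a", "b", "c"]
|
|     pipeline = source()
|     for i in range(3):              # chain N = 3 streaming stages
|         pipeline = stage(f"llm{i}", pipeline)
|
|     for out in pipeline:            # first outputs appear right away
|         print(out)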
| pico_creator wrote:
| This is actually the hypothesis behind Cartesia (the state
| space team), and hence their deep focus on voice models
| specifically: taking full advantage of recurrent models'
| constant-time compute for low latency.
|
| The RWKV team's focus, however, is first on the multilingual
| text space, then on the multi-modal space in the future.
| swyx wrote:
| Karan from Cartesia explains SSMs+voice really well:
| https://www.youtube.com/watch?v=U9DPRZ0lSIQ
|
| it's one of those retrospectively obvious/genius insights that
| i wish i'd understood when i first met him
| yshui wrote:
| Any autoregressive model can do what you are describing.
| Transformers generate one token at a time too, not all at
| once.
| whimsicalism wrote:
| yes but transformers are much slower than state space models
| intalentive wrote:
| True, but for transformers the memory requirement grows with
| sequence length, while for recurrent models it is constant.
| This is why I qualified with "low memory".
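|
| A back-of-the-envelope illustration in Python (the layer/head
| sizes are made up, purely for scale): a transformer's KV cache
| grows linearly with sequence length, while a recurrent model's
| state stays fixed.
|
|     layers, heads, head_dim, bytes_fp16 = 32, 32, 128, 2
|
|     def kv_cache_bytes(seq_len):
|         # keys + values, per layer, per head, per token
|         return 2 * layers * heads * head_dim * bytes_fp16 * seq_len
|
|     def recurrent_state_bytes():
|         # one fixed-size state per layer, independent of length
|         return layers * heads * head_dim * bytes_fp16
|
|     for t in (1_000, 100_000):
|         print(t, kv_cache_bytes(t) / 1e9, "GB vs",
|               recurrent_state_bytes() / 1e6, "MB")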
| sushidev wrote:
| Interesting, though very cryptic for a simple user like me. I
| wonder if it's useful today, and for what purposes.
| pico_creator wrote:
| Currently the strongest RWKV model is 32B in size:
| https://substack.recursal.ai/p/q-rwkv-6-32b-instruct-preview
|
| This is a full drop-in replacement for any transformer model
| use case at model sizes 32B and under, as it has performance
| equal to existing open 32B models on most benchmarks.
|
| We are working on a 70B, which will be a full drop-in
| replacement for most text use cases.
| swyx wrote:
| how about finetuning your 32B to be R1QWQKV?
| pico_creator wrote:
| There is currently a lack of "O1-style" reasoning datasets in
| the open source space. QwQ did not release their dataset, so
| that would take some time for the community to prepare.
|
| It's definitely something we are tracking to do as well =)
| lostmsu wrote:
| Why aren't you on the lmarena (formerly Chatbot Arena)
| leaderboard?
| pico_creator wrote:
| Kinda on the todo list; the model is open source on HF for
| anyone who is willing to make it work with lmarena.
| swyx wrote:
| for those who want a more conversational intro, we've been
| covering RWKV for a bit!
|
| 2023: https://latent.space/p/rwkv
|
| 2024: https://www.youtube.com/watch?v=LPe6iC73lrc <- offers a bit
| of professional compare and contrast vs state space models.
|
| i think it's cool that both RNNs and LSTMs (with xLSTM) now have
| modern attention-inspired variants that solve the previous
| issues. I wonder 1) if it's possible to overcome the "hardware
| lottery" that transformers have now won, and 2) if
| recurrent/selective state can do the kind of proper lookback on
| extremely long context that we will want it to do to compete with
| full attention (easy to say no, harder to propose what to do
| about it).
|
| there's also Liquid AI, whatever it is that they do.
| inciampati wrote:
| the recurrent model needs a mechanism to replay past context.
| no need to go quadratic to access all of it. they could replay
| multiple times to get effects similar to attention.
|
| the hardware lottery, well... imo it's really about leveraging
| fully parallel training to learn how to use a memory. attention
| is quadratic but it can be computed in parallel. it's an end to
| end learned memory. getting that kind of pattern into RNNs
| won't be easy but it's going to be crucial before we boil the
| ocean.
| pico_creator wrote:
| RWKV already solves the parallel compute problem for GPUs,
| thanks to the changes it has made - so it is a recurrent model
| that can scale to thousands of GPUs with no issue.
|
| The same arguably holds for other recurrent architectures
| (state space models, etc.) with very different design
| implementations. The issue with the old recurrent designs was
| just the way the LSTM was designed, not the recurrent nature
| itself.
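|
| (As a toy illustration of why a linear recurrence, unlike an
| LSTM's, can be parallelised: an update of the form
| h_t = a_t * h_{t-1} + b_t is associative under composition, so
| it can be evaluated as a prefix scan. The sketch below checks
| the sequential loop against the pairwise combine; it is a
| generic scan demo, not RWKV's actual kernel.)
|
|     import torch
|
|     T = 8
|     a, b = torch.rand(T), torch.rand(T)
|
|     # sequential reference: h_t = a_t * h_{t-1} + b_t, h_{-1} = 0
|     h, hs = torch.tensor(0.0), []
|     for t in range(T):
|         h = a[t] * h + b[t]
|         hs.append(h)
|
|     # same recurrence via the associative combine
|     # (a2, b2) o (a1, b1) = (a2*a1, a2*b1 + b2), which a parallel
|     # scan could evaluate as a tree instead of this running loop
|     out = [(a[0], b[0])]
|     for t in range(1, T):
|         a_prev, b_prev = out[-1]
|         out.append((a[t] * a_prev, a[t] * b_prev + b[t]))
|
|     assert all(torch.allclose(hs[t], out[t][1]) for t in range(T))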
| shawntan wrote:
| Although marketed as such, RWKV isn't really an RNN.
|
| In the recent RWKV7 incarnation, you could argue it's a type of
| Linear RNN, but past versions had the issue of taking their
| previous state from a lower layer, which allows for parallelism
| but makes them closer to a convolution than a recurrent
| computation.
|
| As for 1), I'd like to believe so, but it's hard to get people
| away from the addictive drug that is the easily parallelised
| transformer. For 2), (actual) RNNs and attention mechanisms
| seem to me fairly powerful (expressivity-wise) and perhaps the
| most acceptable to the community.
| bravura wrote:
| Recent work by Feng et al. from Bengio's lab focuses on how
| attention can be formulated as an RNN ("Attention as RNN":
| https://arxiv.org/pdf/2405.13956) and how minimal versions of
| GRUs and LSTMs can be trained in parallel by removing some
| parameters ("Were RNNs All We Needed?":
| https://arxiv.org/pdf/2410.01201).
|
| It's possible we start seeing more blended versions of
| RNN/attention architectures exploring different LLM
| properties.
|
| In particular, the Aaren architecture in the former paper "can
| not only (i) be trained in parallel (like Transformers) but
| also (ii) be updated efficiently with new tokens, requiring
| only constant memory for inferences (like traditional RNNs)."
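|
| As a concrete illustration, a minimal sketch of the "minGRU"
| recurrence from the second paper, written sequentially in
| PyTorch for clarity (dimensions and init are illustrative).
| Because the gate and candidate depend only on x_t, not on
| h_{t-1}, the same recurrence can also be computed with a
| parallel scan at training time.
|
|     import torch, torch.nn as nn
|
|     class MinGRU(nn.Module):
|         def __init__(self, dim):
|             super().__init__()
|             self.to_z = nn.Linear(dim, dim)  # gate from x_t only
|             self.to_h = nn.Linear(dim, dim)  # candidate from x_t only
|
|         def forward(self, x):                # x: (batch, seq, dim)
|             h = torch.zeros(x.size(0), x.size(2))
|             outs = []
|             for t in range(x.size(1)):
|                 z = torch.sigmoid(self.to_z(x[:, t]))
|                 h_tilde = self.to_h(x[:, t])
|                 h = (1 - z) * h + z * h_tilde  # linear in h_{t-1}
|                 outs.append(h)
|             return torch.stack(outs, dim=1)
|
|     y = MinGRU(16)(torch.randn(2, 8, 16))    # -> (2, 8, 16)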
| shawntan wrote:
| The formulations in "Attention as RNN" have similar issues to
| RWKV. Fundamentally it's a question of what we call an RNN.
|
| Personally I think it's important not to call some of these
| recent architectures RNNs, because they have theoretical
| properties that do not match (read: they're worse than) those
| of what we've "classically" called RNNs.
|
| Ref: https://arxiv.org/abs/2404.08819
|
| As a rule of thumb: you generally don't get parallelism for
| free; you pay for it with poorer expressivity.
| HarHarVeryFunny wrote:
| The Transformer was specifically conceived to take advantage of
| pre-existing massively parallel hardware, so it's a bit
| backwards to say it "won the hardware lottery". Where the
| Transformer did "win the lottery" is that the key-value form of
| self-attention (invented by Noam Shazeer) needed to make
| parallel processing work seems to have accidentally unlocked
| capabilities like "induction heads" that make this type of
| architecture extremely well suited to language prediction.
|
| Given limits on clock speed, massive parallelism is always
| going to be the way to approach brain-like levels of parallel
| computation, so any model architecture aspiring to human level
| AGI needs to be able to take advantage of that.
| swyx wrote:
| you are correct of course, but i meant hardware lottery in the
| sense of dedicated silicon companies like Etched and MatX that
| have now emerged to make chips that only run transformers (not
| exactly true for MatX but hey, i am simplifying. would be cool
| if MatX ran other archs, but it's not a priority)
| pico_creator wrote:
| Hey there, I'm Eugene / PicoCreator - co-leading the RWKV
| project - feel free to AMA =)
| low_tech_punk wrote:
| Thanks! The 0.1B version looks perfect for embedded systems.
| What is the key benefit of an attention-free architecture?
| pico_creator wrote:
| Lower compute cost, especially over longer sequence lengths.
| Depending on context length, it's 10x, 100x, or even 1000x+
| cheaper (quadratic vs linear cost difference).
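|
| A rough back-of-the-envelope in Python (the hidden size is made
| up, and only the attention vs state-update terms are counted):
| per layer, self-attention work grows with T^2 * d while a
| recurrent update grows with T * d^2, so the ratio T / d widens
| as the context grows.
|
|     d = 4096                          # hidden size (illustrative)
|
|     def attention_flops(T):
|         return T * T * d              # QK^T and AV terms dominate
|
|     def recurrent_flops(T):
|         return T * d * d              # fixed work per token
|
|     for T in (40_000, 400_000, 4_000_000):
|         print(T, attention_flops(T) / recurrent_flops(T))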
| Ey7NFZ3P0nzAe wrote:
| Has there been progress towards making RWKV multimodal? Can we
| use projector layers to send images to RWKV?
| pico_creator wrote:
| There is work on Vision RWKV and audio RWKV; an example paper
| is here: https://arxiv.org/abs/2403.02308
|
| It's the same principle as open transformer models, where an
| adapter is used to generate the embeddings.
|
| However, the core team's current focus is on scaling the core
| text model, as this will be the key performance driver, before
| adapting to multi-modal.
|
| The tech is there; the base model needs to be better.
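|
| For reference, a generic sketch of the adapter / projector idea
| in PyTorch (not the actual Vision RWKV code; dimensions are
| illustrative): a small trainable projection maps frozen
| vision-encoder features into the language model's embedding
| space, and the projected "image tokens" are prepended to the
| text embeddings.
|
|     import torch, torch.nn as nn
|
|     d_vision, d_model = 1024, 2048       # illustrative dims
|
|     projector = nn.Sequential(           # the trainable adapter
|         nn.Linear(d_vision, d_model),
|         nn.GELU(),
|         nn.Linear(d_model, d_model),
|     )
|
|     image_feats = torch.randn(1, 196, d_vision)  # e.g. ViT patches
|     text_embeds = torch.randn(1, 32, d_model)    # LM token embeddings
|
|     image_tokens = projector(image_feats)        # (1, 196, d_model)
|     inputs = torch.cat([image_tokens, text_embeds], dim=1)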
| Ey7NFZ3P0nzAe wrote:
| I'm quite interested in repeng [0] (representation engineering)
| for steerability of (so far transformer-based) LLMs and was
| wondering if anyone had tried such methods on RWKV (or Mamba,
| for that matter). Maybe there is some low-hanging fruit there.
|
| [0] https://github.com/vgel/repeng/issues
| pico_creator wrote:
| One of the interesting "new directions" for RWKV and Mamba (or
| any recurrent model) is the monitoring and manipulation of the
| state between tokens - for steerability, alignment, etc. =)
|
| Not saying it's a good or bad idea, but pointing out that
| having a fixed-size state between tokens has interesting
| applications in this space.
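|
| A toy sketch of that state-nudging idea (using a plain GRUCell
| as a stand-in for an RWKV/Mamba block; the steering vector here
| is random, purely for illustration - in practice it would be a
| learned direction for some concept):
|
|     import torch, torch.nn as nn
|
|     dim = 64
|     cell = nn.GRUCell(dim, dim)
|     h = torch.zeros(1, dim)
|
|     prompt = [torch.randn(1, dim) for _ in range(5)]
|     for x in prompt:
|         h = cell(x, h)            # h is the model's entire "memory"
|
|     steering = 0.1 * torch.randn(1, dim)
|     h = h + steering              # nudge the state before generating
|
|     next_input = torch.randn(1, dim)
|     h = cell(next_input, h)       # generation continues from steered state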
| theLiminator wrote:
| Do you have an in-depth comparison between RWKV and models like
| Mamba or S4?
| pico_creator wrote:
| Not sure how in-depth you want it to be, but we did do a co-
| presentation with one of the coauthors of Mamba at Latent
| Space: https://www.youtube.com/watch?v=LPe6iC73lrc
| jharohit wrote:
| congrats and great work on RWKV and Recursal.ai
| Ey7NFZ3P0nzAe wrote:
| I noticed the lack of support for RWKV in ollama and llama.cpp.
| As those are (to my eyes) very strong drivers of
| experimentation (i.e. supporting them means vastly more
| outreach), I was wondering whether you were considering taking
| this into your own hands by contributing code to them. Or is
| the fact that you are not (AFAIK) doing so due to a lack of
| bandwidth in terms of manpower, or some other reason?
| littlestymaar wrote:
| Have there been any plans to build a "reasoning" LLM using
| RWKV? With the increase in inference token count caused by such
| methods, the much lower footprint of a recurrent architecture
| could really make a difference for such a use case.
| bratao wrote:
| What would be the most performant way to run inference using
| RWKV? Do you have any speed comparison to a similarly sized
| transformer?
|
| I have a task (OCR cleaning) for which I'm evaluating faster
| options, and it looks like RWKV would be a nice alternative.
| nickpsecurity wrote:
| It's really interesting work. I'm glad you've kept at it. I'd
| like to ask you about two issues.
|
| I keep seeing papers like "Repeat After Me" claiming serious
| weaknesses of state space vs transformer models. What are the
| current weaknesses of RWKV vs transformers? Have you mitigated
| them? If so, how?
|
| The other issue is that file sharing being illegal, Wikipedia
| requiring derivatives to be copyleft, etc means I can't train
| models with most data legally. Pre-1920s works in Project
| Gutenberg are totally public domain. Both the model and the
| training data would be 100% legal for reproducible research.
| Would your team be willing to train a 3B-7B model on only
| Gutenberg and release it to the public domain?
|
| (Note: The Stack without GitHub Issues can be used for
| permissive code. However, there could be contamination issues
| like incorrect licenses, PII, etc. So, maybe at least one, 100%
| legal model. Maybe a second with Gutenberg and The Stack for
| coding research.)
|
| Example use of Gutenberg:
|
| https://www.tensorflow.org/datasets/catalog/pg19
| anon373839 wrote:
| > The other issue is that file sharing being illegal,
| Wikipedia requiring derivatives to be copyleft, etc means I
| can't train models with most data legally.
|
| That really depends on whether LLM pretraining ends up held
| as an infringing use. (Of course, it'll take a while for the
| cases to work through the courts and for a body of
| jurisprudence to be developed on this subject.)
| nickpsecurity wrote:
| There are two legal issues: sharing copyrighted data and
| training on it. It's the latter that's ambiguous. My
| problem is the former.
|
| Making copies of and sharing copyrighted works without the
| authors' permission is already illegal, as proven in
| countless file-sharing cases. The AI trainers do this with
| data sets like Common Crawl, The Pile, and RefinedWeb. Just
| sharing them is illegal for most of the content in them.
|
| I have ideas for how to deal with that in countries with TDM
| exceptions, like Singapore. For now, the only things we can
| share with others for model training are (a) public domain
| works and (b) content licensed for permissive use and
| sharing. Gutenberg entries before a certain year should be
| pretty risk-free.
| smusamashah wrote:
| How does it compare with other LLMs in terms of performance?
| Is it near GPT-3 or Llama, or what?
| upghost wrote:
| Seems really cool. Does anyone have any sample code to link to?
| Do RNN models use the same PyTorch / Hugging Face Python stuff,
| or is it completely different...?
| ianand wrote:
| Reminder that Microsoft ships RWKV with Windows (~1.5 billion
| devices), making it probably the most widely deployed
| non-transformer model out there. Amazing work!
| https://blog.rwkv.com/p/rwkvcpp-shipping-to-half-a-billion
|
| ps Eugene you should brag about that on the homepage of RWKV.
| Fischgericht wrote:
| "RWKV (pronounced RwaKuv)" - love it. How does the corw make?
| Rwa! Rwa! Rwa!
| bbor wrote:
| Thank god I'm not the only one stunned by that. I don't need
| IPA, but this isn't even vaguely pronounceable!
___________________________________________________________________
(page generated 2025-01-02 23:01 UTC)