[HN Gopher] LWM - Open LLM with 1M Tokens Context Window
___________________________________________________________________
LWM - Open LLM with 1M Tokens Context Window
Author : amrrs
Score : 136 points
Date : 2024-02-16 15:54 UTC (7 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| Havoc wrote:
| That's exciting!
|
| Any views on the license? GitHub says Apache 2 for the
| weights... but Hugging Face says Llama license.
| machinekob wrote:
| If you don't have a GPU/TPU farm this isn't as exciting as it
| seems; it needs around 2 TiB with 8-bit quantization.
| azinman2 wrote:
| We've seen this before. Anyone know what the VRAM requirements
| are, whether it's been quantized, and if/when llama.cpp might
| support it?
| visarga wrote:
| Probably never gonna run on normal boxes; it uses
| RingAttention, which only makes sense when you have many GPUs.
| It's a datacenter-only model.
| azinman2 wrote:
| I'd still like to know the requirements. You can rent A100s or
| have access through work, etc.
| ramoz wrote:
| > Scaling Inference We additionally scale our inference
| code to support million-length sequences by implementing
| RingAttention for decoding. Inference for such long
| sequences requires a minimum of v4-128 with a TPU mesh
| sharding of 32 tensor parallelism, and 4 sequence
| parallelism (ring dimension). We perform inference in pure
| single precision, where additional improvements can be made
| through techniques in scalability such as quantization.
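|
| For anyone trying to picture that layout, a minimal JAX sketch
| of a 32 x 4 device mesh (the axis names are mine, not
| necessarily what the repo uses, and it assumes a runtime that
| actually sees 128 devices):
|
|     import jax
|     import numpy as np
|     from jax.sharding import Mesh
|
|     # 128 TPU v4 chips: 32-way tensor parallel x 4-way
|     # sequence parallel (the "ring" dimension)
|     devices = np.array(jax.devices()).reshape(32, 4)
|     mesh = Mesh(devices, axis_names=("tp", "sp"))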
| Filligree wrote:
| It might even fit in a max-sized MacBook Pro.
| machinekob wrote:
| Maybe in a few years with 4-bit quantisation (around 1 TiB is
| needed for that).
| 5kg wrote:
| From https://arxiv.org/pdf/2402.08268.pdf
|
| > We trained our models using TPUv4-1024, which is
| approximately equivalent to 450 A100s
|
| > Inference for such long sequences requires a minimum of
| v4-128
|
| So you'll need ~60 A100s for inference.
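|
| Rough arithmetic behind the ~60 figure:
|
|     # paper: TPUv4-1024 ~= 450 A100s; inference needs a
|     # v4-128 slice, so scale proportionally
|     print(128 * 450 / 1024)   # ~56, call it ~60 A100s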
| K0balt wrote:
| There are models with shorter attention sizes that probably
| are much smaller in vram needs. The model itself is only
| about 32g, so in 5K-M quant it wouldn't be that bad, and
| the smaller attention versions might actually be workable
| in higher end workstations with cleverness like llama.cpp.
| Back of the napkin guess tells me maybe around 60g for the
| 32k context version
| woadwarrior01 wrote:
| From the paper[1]:
|
| > We additionally scale our inference code to support million-
| length sequences by implementing RingAttention for decoding.
| Inference for such long sequences requires a minimum of v4-128
| with a TPU mesh sharding of 32 tensor parallelism, and 4
| sequence parallelism (ring dimension).
|
| Each TPU-v4 has 32 GiB of HBM memory, so about 4 TiB (128 x
| 32GiB) of memory in fp16, without quantization.
|
| [1]: https://arxiv.org/abs/2402.08268
| Mathnerd314 wrote:
| The model itself is only 13.6 GB though:
| https://huggingface.co/LargeWorldModel/LWM-Chat-1M-Jax/tree/...
| woadwarrior01 wrote:
| Indeed. Almost all of the inference memory overhead comes
| from attention matrices and the KV cache.
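|
| For a rough sense of scale, assuming LLaMA-7B-like dims (32
| layers, 4096 hidden; my assumption, not from the paper):
|
|     # fp16 KV cache for a 1M-token context
|     n_layers, d_model, seq_len = 32, 4096, 1_000_000
|     kv_bytes_per_tok = 2 * n_layers * d_model * 2  # K+V, fp16
|     print(kv_bytes_per_tok * seq_len / 2**40)  # ~0.48 TiB
|     # roughly doubles in fp32; the blockwise attention score
|     # matrices and activations add more on top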
| moffkalast wrote:
| So uh, they thought "we have this datacenter full of GPUs
| lying around, you know what would be best, if we made a model
| that needs the entire thing just for inference".
|
| Brute forcing a quadratic complexity problem seems like
| wastefulness at its worst.
| 3abiton wrote:
| Sounds like they're loading the internet into those vrams.
| faizshah wrote:
| So what approach do these large-context models take to caching
| the context, or do they recompute it each time?
| bluelightning2k wrote:
| Incredible achievement. Awkward timing with Gemini announcing the
| same.
| lxe wrote:
| Why JAX over PyTorch for a video/text model?
| benpacker wrote:
| Easier to scale to a distributed TPU cluster, I think?
| lacoolj wrote:
| Surprised this beat a GPT-4 equivalent when Google already
| announced. Maybe they were too busy with Sora to care about 1M
| context right now.
| sebzim4500 wrote:
| Presumably they had Sora almost ready to go, and the LLM team
| didn't have anything they could publish with a few hours'
| notice.
| thefourthchime wrote:
| Was it just yesterday that Gemini 1.5 Pro announced the 1M
| token context? Jesus, stuff is moving fast.
| derac wrote:
| This paper was released Feb 13th.
| guluarte wrote:
| 10M by end of week, 1 billion by end of month.
| m3kw9 wrote:
| How is the needle-in-a-haystack test on this thing?
| derac wrote:
| https://github.com/LargeWorldModel/LWM#lwm-capabilities
| canadiantim wrote:
| Wow, damn, that's crazy impressive.
| blainm wrote:
| I would be curious to know if anyone has tried a hybrid
| approach where you have a Mamba-like architecture for
| longer-term recall combined with a transformer for short-term
| memory.
| enonimal wrote:
| Maybe a fun Karpathy video here...
| logicchains wrote:
| Yep, https://arxiv.org/abs/2402.04248 tried a Mambaformer,
| which seemed to perform well.
| dang wrote:
| Recent and related:
|
| _World model on million-length video and language with
| RingAttention_ - https://news.ycombinator.com/item?id=39367141 -
| Feb 2024 (58 comments)
| xiphias2 wrote:
| They are using RingAttention, which scales self-attention
| computation linearly with the number of devices by passing KV
| blocks around the ring:
|
| https://arxiv.org/abs/2310.01889 (Submitted on 3 Oct 2023)
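|
| A toy single-host sketch of the idea (exact softmax
| accumulated blockwise while KV blocks rotate; no 1/sqrt(d)
| scaling or causal mask, and not the real multi-device
| implementation):
|
|     import numpy as np
|
|     def ring_attention(q_blocks, k_blocks, v_blocks):
|         # Each "device" i holds one query block and starts
|         # with KV block i; KV blocks rotate around the ring
|         # so every device eventually sees every KV block,
|         # while the softmax is accumulated in streaming
|         # (log-sum-exp) form.
|         n = len(q_blocks)
|         m = [np.full(q.shape[0], -np.inf) for q in q_blocks]
|         l = [np.zeros(q.shape[0]) for q in q_blocks]
|         o = [np.zeros_like(q, dtype=float) for q in q_blocks]
|         k_ring, v_ring = list(k_blocks), list(v_blocks)
|         for _ in range(n):          # n ring steps
|             for i in range(n):      # "devices", in parallel
|                 s = q_blocks[i] @ k_ring[i].T
|                 m_new = np.maximum(m[i], s.max(axis=-1))
|                 p = np.exp(s - m_new[:, None])
|                 scale = np.exp(m[i] - m_new)
|                 l[i] = l[i] * scale + p.sum(axis=-1)
|                 o[i] = o[i] * scale[:, None] + p @ v_ring[i]
|                 m[i] = m_new
|             # pass KV blocks one step around the ring
|             k_ring = k_ring[1:] + k_ring[:1]
|             v_ring = v_ring[1:] + v_ring[:1]
|         return [x / z[:, None] for x, z in zip(o, l)]
|
| (Splitting Q/K/V into n blocks and comparing against plain
| softmax attention is an easy sanity check.)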
| canadiantim wrote:
| Can you feed PDFs into this? Seems like it handles images like
| a champ, and those needle-in-a-haystack benchmarks are wild.
| And it's open?! Wow, very, very impressive!!
___________________________________________________________________
(page generated 2024-02-16 23:01 UTC)