[HN Gopher] LWM - Open LLM with 1M Tokens Context Window
       ___________________________________________________________________
        
       LWM - Open LLM with 1M Tokens Context Window
        
       Author : amrrs
       Score  : 136 points
       Date   : 2024-02-16 15:54 UTC (7 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | Havoc wrote:
       | That's exciting!
       | 
        | Any views on the license? GitHub says Apache 2 for the
        | weights... but Hugging Face says LLaMA license.
        
         | machinekob wrote:
          | If you don't have a GPU/TPU farm this isn't as exciting as it
          | seems; it needs around 2 TiB with 8-bit quantization.
        
       | azinman2 wrote:
        | We saw this before. Anyone know what the VRAM requirements are,
        | if it's been quantized, and if/when llama.cpp might support it?
        
         | visarga wrote:
          | Probably never gonna run on normal boxes; it uses
          | RingAttention, which only makes sense when you have many GPUs.
          | It's a datacenter-only model.
        
           | azinman2 wrote:
            | I'd still like to know the requirements. You can rent A100s
            | or have access through work, etc.
        
             | ramoz wrote:
              | > _Scaling Inference._ We additionally scale our inference
             | code to support million-length sequences by implementing
             | RingAttention for decoding. Inference for such long
             | sequences requires a minimum of v4-128 with a TPU mesh
             | sharding of 32 tensor parallelism, and 4 sequence
             | parallelism (ring dimension). We perform inference in pure
             | single precision, where additional improvements can be made
             | through techniques in scalability such as quantization.
        
             | Filligree wrote:
              | It might even fit in a max-sized MacBook Pro.
        
               | machinekob wrote:
                | Maybe in a few years with 4-bit quantisation (around
                | 1 TiB is needed for that).
        
             | 5kg wrote:
             | From https://arxiv.org/pdf/2402.08268.pdf
             | 
             | > We trained our models using TPUv4-1024, which is
             | approximately equivalent to 450 A100s
             | 
             | > Inference for such long sequences requires a minimum of
             | v4-128
             | 
             | So you'll need ~60 A100 for inference.
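              | 
              | A rough sanity check on that arithmetic in Python (the chip
              | counts are from the paper; the ~450 A100 equivalence is the
              | paper's own estimate):
              | 
              |     # v4-1024 ~= 450 A100s (paper's estimate), so scale the
              |     # v4-128 inference minimum down proportionally.
              |     a100_per_chip = 450 / 1024   # ~0.44 A100 per TPU v4 chip
              |     chips = 32 * 4               # 32-way tensor x 4-way sequence parallel = 128
              |     print(round(chips * a100_per_chip))  # -> 56, i.e. ~60 A100s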
        
             | K0balt wrote:
              | There are versions with shorter context lengths that
              | probably need much less VRAM. The model itself is only
              | about 32 GB, so in a 5K-M quant it wouldn't be that bad,
              | and the shorter-context versions might actually be workable
              | on higher-end workstations with cleverness like llama.cpp.
              | A back-of-the-napkin guess tells me maybe around 60 GB for
              | the 32k context version.
        
         | woadwarrior01 wrote:
         | From the paper[1]:
         | 
         | > We additionally scale our inference code to support million-
         | length sequences by implementing RingAttention for decoding.
         | Inference for such long sequences requires a minimum of v4-128
         | with a TPU mesh sharding of 32 tensor parallelism, and 4
         | sequence parallelism (ring dimension).
         | 
          | Each TPU v4 chip has 32 GiB of HBM, so that's about 4 TiB (128
          | x 32 GiB) of accelerator memory, without quantization (the
          | paper says inference is done in pure single precision).
         | 
         | [1]: https://arxiv.org/abs/2402.08268
        
           | Mathnerd314 wrote:
            | The model itself is only 13.6GB though:
            | https://huggingface.co/LargeWorldModel/LWM-Chat-1M-Jax/tree/...
        
             | woadwarrior01 wrote:
             | Indeed. Almost all of the inference memory overhead comes
             | from attention matrices and the KV cache.
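              | 
              | For scale, a minimal KV-cache estimate in Python. The
              | LLaMA-2-7B-style config (32 layers, 32 KV heads, head dim
              | 128) is my assumption, not something stated in the repo:
              | 
              |     def kv_cache_gib(tokens, layers=32, kv_heads=32,
              |                      head_dim=128, bytes_per_val=2):
              |         # K and V caches across all layers, in GiB.
              |         per_token = 2 * layers * kv_heads * head_dim * bytes_per_val
              |         return tokens * per_token / 2**30
              | 
              |     print(kv_cache_gib(1_000_000))                   # ~488 GiB in fp16
              |     print(kv_cache_gib(1_000_000, bytes_per_val=4))  # ~977 GiB in fp32
              | 
              | That's before weights, activations and the blockwise
              | attention buffers, which is roughly how you get to the
              | multi-TiB figures above.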
        
           | moffkalast wrote:
            | So uh, they thought "we have this datacenter full of GPUs
            | lying around, you know what would be best? If we made a
            | model that needs the entire thing just for inference."
            | 
            | Brute-forcing a quadratic-complexity problem seems like
            | wastefulness at its worst.
        
           | 3abiton wrote:
           | Sounds like they're loading the internet into those vrams.
        
       | faizshah wrote:
        | So what approach do these large-context models take to caching
        | the context, or do they recompute it each time?
        
       | bluelightning2k wrote:
       | Incredible achievement. Awkward timing with Gemini announcing the
       | same.
        
       | lxe wrote:
        | Why Jax over PyTorch for a video/text model?
        
         | benpacker wrote:
          | Easier to scale to a distributed TPU cluster, I think?
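          | 
          | For what it's worth, the SPMD story is usually the argument
          | people make. A toy example (not LWM's actual code): pmap runs
          | the same compiled step on every local device with one
          | transform, which maps nicely onto big TPU meshes.
          | 
          |     import jax
          |     import jax.numpy as jnp
          | 
          |     n = jax.local_device_count()            # TPU cores / GPUs, or 1 on CPU
          |     xs = jnp.arange(n * 4.0).reshape(n, 4)  # one shard per device
          |     step = jax.pmap(lambda x: jax.nn.relu(x) * 2)  # same step on every device
          |     print(step(xs))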
        
       | lacoolj wrote:
        | Surprised this beat a GPT-4 equivalent when Google already
        | announced. Maybe they were too busy with Sora to care about 1
        | million ctx right now.
        
         | sebzim4500 wrote:
          | Presumably they had Sora almost ready to go, and the LLM team
          | didn't have anything they could publish on a few hours'
          | notice.
        
       | thefourthchime wrote:
        | Was it yesterday that Gemini 1.5 Pro announced the 1M token
        | context? Jesus, stuff is moving fast.
        
         | derac wrote:
         | This paper was released Feb 13th.
        
         | guluarte wrote:
          | 10M by end of week, 1 billion by end of month.
        
       | m3kw9 wrote:
        | How is the needle-in-a-haystack test on this thing?
        
         | derac wrote:
         | https://github.com/LargeWorldModel/LWM#lwm-capabilities
        
           | canadiantim wrote:
            | Wow, damn, that's crazy impressive.
        
       | blainm wrote:
        | I would be curious to know if anyone has tried a hybrid approach
        | where a Mamba-like architecture handles longer-term recall,
        | combined with a transformer for short-term memory.
        
         | enonimal wrote:
          | Maybe a fun Karpathy video here...
        
         | logicchains wrote:
          | Yep, https://arxiv.org/abs/2402.04248 tried a MambaFormer,
          | which seemed to perform well.
        
       | dang wrote:
       | Recent and related:
       | 
       |  _World model on million-length video and language with
       | RingAttention_ - https://news.ycombinator.com/item?id=39367141 -
       | Feb 2024 (58 comments)
        
       | xiphias2 wrote:
        | They are using RingAttention, which scales self-attention
        | computation linearly with the number of devices by passing KV
        | blocks around a ring:
       | 
       | https://arxiv.org/abs/2310.01889 (Submitted on 3 Oct 2023)
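        | 
        | A single-host toy of that communication pattern in Python (plain
        | NumPy, no causal masking and none of the numerically stable
        | blockwise softmax the real implementation uses; it just shows
        | why each device only ever holds one K/V block at a time):
        | 
        |     import numpy as np
        | 
        |     def ring_attention_sim(q_blocks, k_blocks, v_blocks):
        |         # One (q, k, v) block per simulated device, each [block_len, d].
        |         n_dev, d = len(q_blocks), q_blocks[0].shape[-1]
        |         num = [np.zeros_like(q) for q in q_blocks]           # softmax numerators
        |         den = [np.zeros((q.shape[0], 1)) for q in q_blocks]  # softmax denominators
        |         k_cur, v_cur = list(k_blocks), list(v_blocks)
        |         for _ in range(n_dev):
        |             for i in range(n_dev):
        |                 # each device attends to the K/V block it currently holds
        |                 w = np.exp(q_blocks[i] @ k_cur[i].T / np.sqrt(d))
        |                 num[i] += w @ v_cur[i]
        |                 den[i] += w.sum(axis=-1, keepdims=True)
        |             # pass K/V blocks one hop around the ring
        |             k_cur = k_cur[-1:] + k_cur[:-1]
        |             v_cur = v_cur[-1:] + v_cur[:-1]
        |         return [n / z for n, z in zip(num, den)]
        | 
        | After n_dev hops every query block has seen every K/V block
        | exactly once, so this matches full (non-causal) attention while
        | per-device memory stays at a single block of keys and values.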
        
       | canadiantim wrote:
        | Can you feed PDFs into this? It seems to handle images like a
        | champ, and those needle-in-a-haystack benchmarks are wild. And
        | it's open?! Wow, very, very impressive!!
        
       ___________________________________________________________________
       (page generated 2024-02-16 23:01 UTC)