[HN Gopher] Megalodon: Efficient LLM Pretraining and Inference w...
       ___________________________________________________________________
        
       Megalodon: Efficient LLM Pretraining and Inference with Unlimited
       Context Length
        
       Author : amichail
       Score  : 91 points
       Date   : 2024-04-16 17:40 UTC (5 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | zer00eyz wrote:
       | 1. Open paper
       | 
       | 2. Find GitHub
       | 
       | 3. Read Source
       | 
       | https://github.com/XuezheMax/megalodon (dead link)
       | 
       | I have stopped reading the papers; I only care about working
       | code that can be used. Arxiv LLM papers in general have reached
       | the level of academic masturbation.
       | 
       | Fix the bad link and then we have something to talk about.
        
         | abathur wrote:
         | Looks like probably https://github.com/dumpmemory/megalodon
        
           | bitvoid wrote:
           | I think that is a fork before the actual repo was made
           | private.
           | 
           | XuezheMax's GitHub profile shows that his contributions in
           | April were all to private repositories.
        
             | abathur wrote:
             | Seems like a reasonable reading.
             | 
             | It's only got 3 commits and 2 are from 18 hours ago--one of
             | which fiddled the repo-made-public date and added the arxiv
             | link--so perhaps there's a private working repo and they
             | decided to make a ~clean public repo at the last minute
             | (i.e., maybe the authors are self-conscious about the
             | commit log).
        
             | dartos wrote:
             | I wonder what happened
        
             | elvircrn wrote:
             | Perhaps it was made private as part of a conference
             | submission process?
        
               | amitport wrote:
               | Unlikely. Having code somewhere is OK, you just don't
               | refer to it in the submitted version.
        
         | lumost wrote:
         | This happened to WizardLM2 yesterday as well
         | https://wizardlm.github.io/WizardLM2/
        
           | viksit wrote:
            | They released an announcement on this.
           | 
           | > We are sorry for that.
           | 
           | > It's been a while since we've released a model months ago,
           | so we're unfamiliar with the new release process now: We
           | accidentally missed an item required in the model release
           | process - toxicity testing.
           | 
           | > We are currently completing this test quickly and then will
           | re-release our model as soon as possible.
           | 
           | https://x.com/wizardlm_ai/status/1780101465950105775?s=46
        
             | zer00eyz wrote:
             | Well that's interesting.
             | 
             | It took a bunch of detective work for what should have
             | just been a note in the README on the repo.
             | 
             | Is this why "science communicators" are needed?
             | 
             | https://xkcd.com/1254/ <<< very relevant.
        
             | kristjansson wrote:
             | Very happy that my first step with any model with big
             | claims is 'huggingface-cli download ...'
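             | 
             | (The Python equivalent, if anyone prefers it; the repo id
             | below is a placeholder, since no Megalodon weights appear
             | to be published anywhere yet:)
             | 
             |     from huggingface_hub import snapshot_download
             | 
             |     # placeholder repo id; this raises RepositoryNotFoundError
             |     # if the weights were never actually uploaded
             |     path = snapshot_download(repo_id="some-org/megalodon-7b")
             |     print(path)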
        
       | cs702 wrote:
       | The two issues I see with this work:
       | 
       | * There's no mention of model performance on _recall tasks_.
       | Models with attention do well on recall tasks, but models without
       | it, like this one, tend to do poorly.[a] What is the performance
       | of this model on recall tasks? (A sketch of a typical recall
       | probe is below.)
       | 
       | * As others here have pointed out, the GitHub link is dead. In
       | addition, the pretrained weights don't seem to be available on HF
       | or anywhere else. It's hard to check any claims if we have
       | neither code nor weights!
       | 
       | ---
       | 
       | [a] https://arxiv.org/abs/2402.01032
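       | 
       | For anyone unfamiliar with the term, here is a minimal sketch of
       | a synthetic associative-recall probe (my own construction, not
       | the setup used in [a]):
       | 
       |     import random
       |     import string
       | 
       |     def recall_prompt(n_pairs=50, seed=0):
       |         # list key/value pairs, then ask for the value of one
       |         # randomly chosen key buried in the context
       |         rng = random.Random(seed)
       |         keys = [''.join(rng.choices(string.ascii_lowercase, k=5))
       |                 for _ in range(n_pairs)]
       |         vals = [str(rng.randint(100, 999)) for _ in range(n_pairs)]
       |         query = rng.choice(keys)
       |         pairs = '\n'.join(f'{k}: {v}' for k, v in zip(keys, vals))
       |         answer = vals[keys.index(query)]
       |         prompt = f'{pairs}\n\nWhat value is paired with {query}?'
       |         return prompt, answer
       | 
       | Attention layers can look the key back up directly; fixed-state
       | models have to have kept it in their state, which is why they
       | tend to struggle as n_pairs grows.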
        
         | phh wrote:
         | > * There's no mention of model performance on recall tasks.
         | Models with attention do well on recall tasks, but models
         | without it, like this one, tend to do poorly.[a] What is the
         | performance of this model on recall tasks?
         | 
         | For what it's worth, the website for RWKV (another
         | "transformer-less" model) mentions that yes, it's bad on
         | recall, but that for the vast majority of tasks you can just
         | ask the question *before* the content and it'll handle the
         | task just fine. (I'm just reporting; I haven't spent time
         | trying it myself.)
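         | 
         | Roughly, the ordering they suggest, as a toy illustration (not
         | RWKV's actual prompt format):
         | 
         |     document = "...long report text..."
         |     question = "What was Q3 revenue?"
         | 
         |     # usual order: the model must carry the whole document in
         |     # its fixed-size state before it knows what to look for
         |     question_last = f"{document}\n\nQuestion: {question}"
         | 
         |     # recurrent-friendly order: state the question first so
         |     # the model knows what to retain while it streams through
         |     # the document
         |     question_first = f"Question: {question}\n\n{document}"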
        
           | a2128 wrote:
           | I thought the recommendation for long contexts with RWKV was
           | to put the question after the content, otherwise it can
           | forget the question
        
             | ronsor wrote:
             | This is no longer the case for RWKV-5/6
        
         | refulgentis wrote:
         | Section 4.3 addresses this: it runs 3 benchmarks. tl;dr:
         | Megalodon-7B roundly beats LLaMA 2 7B and almost matches
         | LLaMA 2 7B-L, which got an extra 500K of training tokens
         | specifically at long context length.
         | 
         | * Section 4.3 in both the paper you linked and the paper we're
         | commenting on. Why did I notice that? It took me several
         | minutes to understand "how the paper changed": it was 2 papers
         | all along, I just switched tabs without realizing. And it's
         | only Tuesday.
        
       | YetAnotherNick wrote:
       | This model has attention; the sequence is just broken into
       | chunks of length 4096 and attention is only applied within each
       | chunk. Llama 2 was trained on chunks of length 4096, so this
       | model has the same quadratic complexity for any sequence that
       | fits within Llama 2's context size.
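       | 
       | A rough sketch of that pattern (plain PyTorch, my own naming; no
       | causal mask and none of the paper's CEMA/gating machinery, just
       | the chunking idea):
       | 
       |     import torch
       | 
       |     def chunked_attention(q, k, v, chunk=4096):
       |         # q, k, v: (batch, seq_len, dim); seq_len assumed to be
       |         # a multiple of chunk for brevity
       |         b, n, d = q.shape
       |         qc = q.view(b, n // chunk, chunk, d)
       |         kc = k.view(b, n // chunk, chunk, d)
       |         vc = v.view(b, n // chunk, chunk, d)
       |         # attention only inside each chunk:
       |         # cost is (n / chunk) * chunk^2 instead of n^2
       |         scores = torch.einsum('bcqd,bckd->bcqk', qc, kc) / d ** 0.5
       |         probs = scores.softmax(dim=-1)
       |         out = torch.einsum('bcqk,bckd->bcqd', probs, vc)
       |         return out.reshape(b, n, d)
       | 
       | So within a single 4096-token chunk the cost is the same as full
       | attention at Llama 2's context size; the savings only show up
       | once the sequence spans many chunks.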
        
       | patrickhogan1 wrote:
       | Show me a working request/response that does better than state of
       | the art.
        
       | qwertox wrote:
       | I was just chatting with ChatGPT about unlimited context length,
       | and even if you could theoretically achieve a personal assistant
       | this way, one which would know all your chat history, an
       | unlimited context length doesn't seem efficient enough.
       | 
       | It would make more sense to create a new context every day and
       | integrate it into the model at night, or each day a new context
       | of the aggregated last several days: give it time to sleep on it
       | every day, so it can use that knowledge the next day without it
       | needing to be passed in as context again.
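       | 
       | A very rough sketch of that loop, purely hypothetical (the
       | consolidation step is left abstract; in practice it might be
       | something like a nightly LoRA pass over the day's transcript):
       | 
       |     def end_of_day_consolidation(model, todays_messages,
       |                                  finetune_on):
       |         # daytime: serve with a short context of only today's
       |         # chat history
       |         transcript = "\n".join(todays_messages)
       |         # nighttime: fold the day's transcript into the weights
       |         # so tomorrow's context window can start (nearly) empty;
       |         # finetune_on is a stand-in for whatever consolidation
       |         # step is actually used
       |         return finetune_on(model, transcript)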
        
       ___________________________________________________________________
       (page generated 2024-04-16 23:00 UTC)