[HN Gopher] Megalodon: Efficient LLM Pretraining and Inference w...
___________________________________________________________________
Megalodon: Efficient LLM Pretraining and Inference with Unlimited
Context Length
Author : amichail
Score : 91 points
Date : 2024-04-16 17:40 UTC (5 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| zer00eyz wrote:
| 1. Open paper
|
| 2. Find GitHub
|
| 3. Read Source
|
| https://github.com/XuezheMax/megalodon Dead link
|
| I have stopped reading the papers; I only care about working
| code that can be used. arXiv LLM papers have reached the level
| of academic masturbation in general.
|
| Fix the bad link and then we have something to talk about.
| abathur wrote:
| Looks like probably https://github.com/dumpmemory/megalodon
| bitvoid wrote:
| I think that is a fork before the actual repo was made
| private.
|
| XuezheMax's GitHub profile shows that his contributions in
| April were all to private repositories.
| abathur wrote:
| Seems like a reasonable reading.
|
| It's only got 3 commits and 2 are from 18 hours ago--one of
| which fiddled the repo-made-public date and added the arxiv
| link--so perhaps there's a private working repo and they
| decided to make a ~clean public repo at the last minute
| (i.e., maybe the authors are self-conscious about the
| commit log).
| dartos wrote:
| I wonder what happened
| elvircrn wrote:
| Perhaps it was made private as part of a conference
| submission process?
| amitport wrote:
| Unlikely. Having code somewhere is OK, you just don't
| refer to it in the submitted version.
| lumost wrote:
| This happened to WizardLM2 yesterday as well
| https://wizardlm.github.io/WizardLM2/
| viksit wrote:
| they released an announcement on this.
|
| > We are sorry for that.
|
| > It's been a while since we've released a model months ago,
| so we're unfamiliar with the new release process now: We
| accidentally missed an item required in the model release
| process - toxicity testing.
|
| > We are currently completing this test quickly and then will
| re-release our model as soon as possible.
|
| https://x.com/wizardlm_ai/status/1780101465950105775?s=46
| zer00eyz wrote:
| Well that's interesting.
|
| It took a bunch of detective work for what should have
| just been a NOTE in the README of the repo.
|
| Is this why "science communicators" are needed?
|
| https://xkcd.com/1254/ <<< very relevant.
| kristjansson wrote:
| very happy my first step with any model with big claims is
| 'huggingface-cli download ...'
| cs702 wrote:
| The two issues I see with this work:
|
| * There's no mention of model performance on _recall tasks_.
| Models with attention do well on recall tasks, but models without
| it, like this one, tend to do poorly.[a] What is the performance
| of this model on recall tasks?
|
| * As others here have pointed out, the github link is dead. In
| addition, the pretrained weights don't seem to be available on HF
| or anywhere else. It's hard to check any claims if we have
| neither code nor weights!
|
| ---
|
| [a] https://arxiv.org/abs/2402.01032
| phh wrote:
| > * There's no mention of model performance on recall tasks.
| Models with attention do well on recall tasks, but models
| without it, like this one, tend to do poorly.[a] What is the
| performance of this model on recall tasks?
|
| For what it's worth, the website for RWKV (another
| "transformer-less" architecture) mentions that yes, it's bad at
| recall, but for the vast majority of tasks you can just ask the
| question *before* the content, and it'll handle the task just
| fine. (I'm just reporting; I haven't spent time trying it
| myself.)
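|
| As a tiny (hypothetical) illustration of that ordering, the
| prompt is simply built question-first rather than content-first;
| the question text and variable names here are made up:
|
|     # Question-first prompt: a recurrent model then knows what
|     # to look for while it reads the content.
|     question = "What dosage does the label recommend?"
|     document = "...long document text..."
|     prompt = f"Question: {question}\n\nDocument:\n{document}\n\nAnswer:"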
| a2128 wrote:
| I thought the recommendation for long contexts with RWKV was
| to put the question after the content, otherwise it can
| forget the question
| ronsor wrote:
| This is no longer the case for RWKV-5/6
| refulgentis wrote:
| Section 4.3 addresses this and runs 3 benchmarks. tl;dr: the 7B
| model roundly beats LLaMA 2 7B and has almost the same
| performance as LLaMA 2 7B-L, which got an extra 500K of
| training tokens specifically at long context length.
|
| * Section 4.3 in both the paper you linked and the paper we're
| commenting on. Why did I notice that? It took me several
| minutes to understand "how the paper changed": it was 2 papers
| all along, I had just switched tabs without realizing. And it's
| only Tuesday.
| YetAnotherNick wrote:
| This model has attention; it's just that the sequence is broken
| into chunks of length 4096 and attention is only applied within
| each chunk. Llama 2 was trained on sequences of length 4096, so
| this model has the same quadratic complexity for any sequence
| that fits within the Llama 2 context size.
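|
| A minimal sketch of that chunk-wise idea (my own PyTorch
| illustration, not the authors' code; the function name and the
| use of scaled_dot_product_attention are assumptions):
|
|     import torch
|     import torch.nn.functional as F
|
|     def chunked_attention(q, k, v, chunk_size=4096):
|         # Attend only within fixed-size chunks, so cost grows as
|         # O(n * chunk_size) in sequence length n instead of O(n^2).
|         # chunk_size=4096 matches the chunk length described above.
|         n = q.shape[-2]
|         outs = []
|         for start in range(0, n, chunk_size):
|             end = min(start + chunk_size, n)
|             outs.append(F.scaled_dot_product_attention(
|                 q[..., start:end, :],
|                 k[..., start:end, :],
|                 v[..., start:end, :],
|                 is_causal=True))
|         return torch.cat(outs, dim=-2)
|
| For sequences of 4096 tokens or fewer there is only one chunk,
| so the cost is exactly as quadratic as Llama 2's attention.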
| patrickhogan1 wrote:
| Show me a working request/response that does better than state of
| the art.
| qwertox wrote:
| I was just chatting with ChatGPT about unlimited context
| length, and even if you could theoretically achieve a personal
| assistant this way, one which would know all your chat history,
| an unlimited context length doesn't seem efficient enough.
|
| It would make more sense to create a new context every day and
| integrate it into the model at night, or to aggregate the last
| several days into a new context each day. That would give it
| time to sleep on things every day and let it use them the next
| day without needing to pass them in the context again.
___________________________________________________________________
(page generated 2024-04-16 23:00 UTC)