[HN Gopher] Qwen2 LLM Released
___________________________________________________________________
Qwen2 LLM Released
Author : bratao
Score : 177 points
Date : 2024-06-06 16:01 UTC (6 hours ago)
(HTM) web link (qwenlm.github.io)
(TXT) w3m dump (qwenlm.github.io)
| minimaxir wrote:
| A 0.5B parameter model with a 32k context length that also makes
| good use of that full window?! That's _very_ interesting.
|
| The academic benchmarks on that particular model relative to
| 1.5B-2B models are what you would expect, but it would make for
| an excellent base for finetuning/embedding generation.
| rgbrgb wrote:
| haven't done anything with a model so small. what kind of stuff
| would you do with it?
| minimaxir wrote:
| A properly finetuned model can perform better for a use case,
| but even with PEFT/LoRAs, finetuning and managing "smaller"
| open-source LLMs (7B-param models like Llama3) is annoying.
| That's partially why the even-smaller ~2B Phi series of
| models took off.
|
| A 0.5B model may not be that great out of the box, but
| there's a lot of opportunity if it's responsive to finetuning.
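|
| Roughly the kind of thing I mean; a minimal LoRA sketch with
| peft (the model id and hyperparameters here are illustrative
| assumptions, not a tested recipe):
|
|     # Hedged sketch: LoRA finetuning of a small causal LM with peft.
|     # Model id and hyperparameters are assumptions, not a recipe.
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|     from peft import LoraConfig, get_peft_model
|
|     model_id = "Qwen/Qwen2-0.5B"  # hypothetical target model
|     tokenizer = AutoTokenizer.from_pretrained(model_id)
|     model = AutoModelForCausalLM.from_pretrained(model_id)
|
|     # Train only small adapter matrices on the attention projections.
|     lora = LoraConfig(r=8, lora_alpha=16,
|                       target_modules=["q_proj", "v_proj"],
|                       task_type="CAUSAL_LM")
|     model = get_peft_model(model, lora)
|     model.print_trainable_parameters()  # adapters are a tiny fraction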
| GaggiX wrote:
| You can use it for speculative decoding for example, this
| would increase the speed of larger models.
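|
| In transformers this is exposed as assisted generation; a
| minimal sketch (model ids are just examples, and the draft and
| target models need to share a tokenizer):
|
|     # Hedged sketch of assisted (speculative-style) decoding:
|     # the small model drafts tokens, the large model verifies them.
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|
|     tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")
|     target = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B-Instruct")
|     draft = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
|
|     inputs = tok("Explain speculative decoding briefly.",
|                  return_tensors="pt")
|     out = target.generate(**inputs, assistant_model=draft,
|                           max_new_tokens=64)
|     print(tok.decode(out[0], skip_special_tokens=True))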
| 3abiton wrote:
| I wonder if the 0.5B model would be usable for ML tasks like
| summarization, classification, or embeddings, replacing small
| models like Spacy usually used for embeddings.
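|
| For the embeddings case, mean-pooling the hidden states might
| be a starting point; a rough sketch (model id and pooling
| choice are assumptions, untested):
|
|     # Hedged sketch: sentence embeddings from a small causal LM by
|     # mean-pooling its last hidden states over non-padding tokens.
|     import torch
|     from transformers import AutoModel, AutoTokenizer
|
|     tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")
|     model = AutoModel.from_pretrained("Qwen/Qwen2-0.5B")
|     if tok.pad_token is None:
|         tok.pad_token = tok.eos_token
|
|     texts = ["open-weight models", "open-source models"]
|     batch = tok(texts, padding=True, return_tensors="pt")
|     with torch.no_grad():
|         hidden = model(**batch).last_hidden_state   # (batch, seq, dim)
|     mask = batch["attention_mask"].unsqueeze(-1)    # ignore padding
|     emb = (hidden * mask).sum(1) / mask.sum(1)      # mean pooling
|     print(torch.nn.functional.cosine_similarity(emb[:1], emb[1:]))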
| ai_what wrote:
| It won't. Amazon kind of went that angle with MistralLite[1]
| (a 7B finetune), and it was barely passable as an effective
| summarizer. 0.5B models are pretty much useless for that.
|
| [1]: https://huggingface.co/amazon/MistralLite
| coder543 wrote:
| The official Mistral-7B-v0.2 model added support for 32k
| context, and I think it's far better than MistralLite.
| Third-party finetunes are rarely amazing at the best of
| times.
|
| Now, we have Mistral-7B-v0.3, which is supposedly an even
| better model:
|
| https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3
| GaggiX wrote:
| >that also makes good use of that full window?!
|
| To be honest, the "Needle In A Haystack" test is the most
| trivial test for a model that relies on full attention; it's
| expected to be easy to pass if the model was trained correctly.
| ai_what wrote:
| I agree. I personally don't have high hopes for the 0.5B
| model.
|
| Phi-2 was 2.7B and it was already regularly outputting
| complete nonsense.
|
| I ran the 0.5B model of the previous Qwen version (1.5) and
| it reminded me of one of those lorem ipsum word generators.
|
| The other new Qwen models (7B and up) look good though.
| refulgentis wrote:
| Phi-2 wasn't instruct/chat finetuned, and it was very
| upfront about this. "I tried Phi-2 and it was bad" is a
| dilettante filter.
| coder543 wrote:
| Qwen1.5-0.5B supposedly supported up to 32k context as well,
| but I can't even get it to summarize a ~2k token input with any
| level of coherence.
|
| I'm always excited to try a new model, so I'm looking forward
| to trying Qwen2-0.5B... but I wouldn't get your hopes up this
| much. These super tiny models seem far more experimental than
| the larger LLMs.
|
| Phi-3-mini (3.8B) supports a 128k context, and it is actually a
| reasonably useful model in my tests. Gemma-2B-1.1-it is a 2B
| model that only supports 8k context, but it also does fairly
| well for summarization.
| algo_trader wrote:
| what would you recommend as a maximum-2B model for the
| coding/reasoning domain?
| coder543 wrote:
| For reasoning tasks and coding tasks where you're chatting
| with the model, there are no 2B models that I would
| recommend at this point.
| noman-land wrote:
| Phi-3-mini has been really surprising me. It's quite good!
| sitkack wrote:
| Summarization is one of the most difficult tasks for any LLM,
| and over that whole context window it's crazy to think it
| could do it.
|
| That context window is useful if you have a smaller data
| extraction task, like dates, times, place names, etc. And
| even for that, it might need to be fine-tuned. These small
| models are a _feedstock_.
| TechDebtDevin wrote:
| What tasks do you consider a 3.8B model to be useful for?
| Chat applications on lesser hardware? I'm still finding it
| difficult to parse what the real-world application would ever
| be. However, I do understand that the goal is to make the
| smallest, most efficient model to compete with larger model
| capabilities one day, and that you can't get there without
| making these. But do these types of models have any value for
| any sort of product or real-world project?
| coder543 wrote:
| I think most of the interesting applications for these
| small models are in the form of developer-driven
| automations, not chat interfaces.
|
| A common example that keeps popping up is a voice recorder
| app that can provide not just a transcription of the
| recording (which you don't need an LLM for), but also a
| summary of the transcription, including key topics, key
| findings, and action items that were discussed in a
| meeting. With speaker diarization (assigning portions of
| the transcript to different speakers automatically), it's
| even possible to use an LLM to assign names to each of the
| speakers in the transcript, if they ever identified
| themselves in the meeting, and then the LLM could take that
| and also know who is supposed to be handling each action
| item, if that was discussed in the meeting. That's just
| scratching the surface of what should be possible using
| small LLMs (or SLMs, as Microsoft likes to call them).
|
| An on-device LLM could summarize notifications if you have
| a lot of catching up to do, or it could create a title for
| a note automatically once you finish typing the note, or it
| could be used to automatically suggest tags/categories for
| notes. That LLM could be used to provide "completions",
| like if the user is writing a list of things in a note, the
| user could click a button to have that LLM generate several
| more items following the same theme. That LLM can be used
| to suggest contextually-relevant quick replies for
| conversations. In a tightly-integrated system, you could
| imagine receiving a work phone call, and that LLM could
| automatically summarize your recent interactions with that
| person (across sms, email, calendar, and slack/teams) for
| you on the call screen, which could remind you why they're
| calling you.
|
| LLMs can also be used for data extraction, where they can
| be given unstructured text, and fill in a data structure
| with the desired values. As an example, one could imagine
| browsing a job posting... the browser could use an LLM to
| detect that the primary purpose of this webpage is a job
| posting, and then it could pass the text of the page
| through the LLM and ask the LLM to fill in common values
| like the job title, company name, salary range, and job
| requirements, and then the browser could offer a condensed
| interface with this information, as well as the option to
| save this information (along with the URL to the job
| posting) to your "job search" board with one click.
|
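| As a concrete (hedged) sketch of that job-posting idea, a
| small instruct model can be asked to fill a fixed JSON schema;
| the model id, prompt wording, and field names below are just
| illustrative assumptions:
|
|     # Hedged sketch: structured data extraction with a small LLM.
|     # Model id, prompt wording, and field names are assumptions.
|     import json
|     from transformers import pipeline
|
|     extract = pipeline("text-generation",
|                        model="Qwen/Qwen2-1.5B-Instruct")
|
|     page_text = "Acme Corp is hiring a Senior Rust Engineer..."
|     prompt = ("Extract the job title, company, and salary range "
|               "from the text below. Reply with JSON only, using "
|               "the keys title, company, salary.\n\n" + page_text)
|     out = extract(prompt, max_new_tokens=128, do_sample=False,
|                   return_full_text=False)[0]["generated_text"]
|     # Small models often emit malformed JSON; validate and retry.
|     try:
|         record = json.loads(out.strip())
|     except json.JSONDecodeError:
|         record = None
|     print(record)
|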
| Now, it might be a little much to ask a browser to have
| special cases for just job postings, when there are so many
| similar things a user might want to save for later, so you
| could even let the user define new "boards" where they
| describe to a (hopefully larger) LLM the purpose of the
| board and the kinds of information you're looking for, and
| it would generate the search parameters and data extraction
| tasks that a smaller LLM would then do in the background as
| you browse, letting the browser present that information
| when it is available so that you can choose whether to save
| it to your board. The larger LLM could still potentially be
| on-device, but a more powerful LLM that occupies most of
| the RAM and processing on your device is something you'd
| only want to use for a foreground task, not eating up
| resources in the background.
|
| LLMs are interesting because they make it possible to do
| things that traditional programming could not do in any
| practical sense. If something can be done _without_ an LLM,
| then absolutely... _do that_. LLMs are very computationally
| intensive, and their accuracy is more like a human than a
| computer. There are plenty of drawbacks to LLMs, if you
| have another valid option.
| TechDebtDevin wrote:
| Thanks for the response. I have been genuinely curious
| about use cases for these little guys.
| moffkalast wrote:
| It might be fine-tunable for specific tasks BERT-style, but I
| really doubt it's coherent enough for anything out of the box.
| c4pt0r wrote:
| Really glad to see Qwen2 uses Apache 2.0 License.
| soulofmischief wrote:
| Unfortunately, the 72B model still uses the old license, but
| it's still a great step in the right direction.
| ilaksh wrote:
| Right but that license doesn't seem too bad as long as you
| don't get to 100 million users.
| idle_zealot wrote:
| Do we actually know if companies even have the right to license
| these things? Last I heard there was no consensus on whether
| model weights even qualify for copyright protection.
| ronsor wrote:
| No, we do not know, and unless the emperor steps out, we will
| not know if he has on any clothes.
| msoad wrote:
| Are we plateauing with those LLM benchmarks?
| esafak wrote:
| I doubt it; they could be a whole lot smarter. We need to solve
| alignment in the meantime.
| minimaxir wrote:
| No, but getting better benchmarks tends to require more
| shenanigans (e.g. mixture-of-experts).
|
| Qwen2 72B doesn't score that high on the leaderboard relative
| to brute-forced finetunes: https://huggingface.co/spaces/open-
| llm-leaderboard/open_llm_...
| YetAnotherNick wrote:
| We are plateauing with respect to compute. Unreleased llama 3
| 400b has significantly better benchmarks. Also Zuckerberg said
| that llama 3 continued to improve even after 15T tokens.
| causal wrote:
| Maybe, but I don't see how this release would support that
| conclusion: their 72B model surpasses Llama 3 70B on so many
| metrics by such a wide margin that I find it a little hard to
| believe.
| segmondy wrote:
| that benchmark is by them; the community has to evaluate and
| report back. I never believe self-reported benchmarks.
| moffkalast wrote:
| Yeah the arena leaderboard will show where it really goes in
| a week or two.
| GaggiX wrote:
| Going from 79.5% up to 84.2%, for example, cuts the error rate
| from 20.5% to 15.8%, roughly a 23% relative reduction; it's
| quite a huge difference.
| tosh wrote:
| better benchmarks with higher ceilings are also needed to be
| able to tell how much better the top models are compared to
| the others
| jimmySixDOF wrote:
| They actually used some of the newest benchmarks, including
| MixEval, which seems to be in line with LMSYS's crowdsourced
| Elo scores and is super efficient to run.
| behnamoh wrote:
| Please don't release new models unless there is an arch change.
| All these new LLMs are essentially the same technique applied to
| almost the same dataset.
| causal wrote:
| Even if that were true (it's not; arch and dataset differ
| considerably from model to model) the release of new foundation
| models has a lot of other important factors to consider (e.g.
| performance, license, size).
| behnamoh wrote:
| No it doesn't. It's like the Linux market. Most effort is
| being wasted re-inventing the wheel, forking other repos, and
| creating new distros. Except now distro = LLM.
| causal wrote:
| Qwen 2 uses GQA, has a much longer context length, and
| covers many new languages. You can't do that with the same
| architecture and dataset.
| soulofmischief wrote:
| The variety of distributions in the Linux ecosystem is a
| strength, not a weakness.
|
| No one is making you pick up a shovel to build alongside
| them; instead you choose to rest on your laurels and
| complain about other peoples' hard work and dedication to
| providing people with choices.
| behnamoh wrote:
| The fact that less than 5% of desktop users and 0% of
| mobile users [0] use Linux speaks for itself. All the
| effort that went into creating distros could have been
| focused on a few high-quality ones.
|
| [0]: Android uses the Linux kernel which is almost the
| same across distros, but isn't per se a Linux OS. I'm
| talking about real Linux running on a mobile phone.
| pixelpoet wrote:
| Most mobile users can't even spell PC (let alone use
| one), and blame autocorrect for that. To expect them to
| use a Linux distro is pretty unrealistic when they can't
| even use Windows:
| https://www.theverge.com/22684730/students-file-folder-
| direc...
| balder1991 wrote:
| This argument makes no sense considering the diversity of
| tastes and desires in the community. It's incredibly
| difficult to get more people to work together and achieve
| superior results, especially in something decentralized and
| voluntary. You can just as easily end up with more chaos and
| internal conflicts instead.
| soulofmischief wrote:
| Why are you so hellbent on dictating how volunteers spend
| their time?
|
| Have you ever considered that these people are satisfied
| with their interests and truly could not care less about
| your opinion? Or that your opinion is just that-- yours?
| Not some absolute truth?
|
| Anyway, it's beside the point, as there are multiple high
| quality Linux distributions to choose from, thanks to a
| large de-duplication of efforts through libraries.
| rfoo wrote:
| Honest question, just curious: what's your definition of
| "different arch"? Is a model with GQA=8 different from one
| with GQA=16 in your mind? I guess not?
|
| How about GQA vs MHA, or GQA vs MLA?
|
| If anything attention-like is the same in your mind, are S5
| and RWKV different archs, given that both are some kind of
| linear RNN?
| jwitthuhn wrote:
| I'm not aware of any other openly-licensed model of comparable
| size to 57B. That seems like a worthwhile addition to what is
| already available, imo.
|
| The closest is mixtral 8x7b but that one only uses a fraction
| of its parameters for each pass. This one should produce better
| but slower results at roughly the same memory requirement.
| soulofmischief wrote:
| Do you understand at all how science and replication work?
|
| Let these researchers do what they want, they didn't release it
| for you specifically.
| behnamoh wrote:
| This is not science. It's engineering at this point. Many
| startups create their own models for various purposes, mainly
| for VC investment.
| soulofmischief wrote:
| I'm sorry, but this is absolutely scientific research.
| There is no amount of gaslighting or reframing which will
| deny it that status. The source of funding is irrelevant.
| Most scientific research is funded with the hopes of seeing
| a return on investment.
|
| You simply don't know what you're talking about. Your
| overly cynical take is against Hacker News guidelines.
| dareal wrote:
| I'm not sure where this arrogant attitude comes from. But
| please keep that condescending tone out of this
| community.
| soulofmischief wrote:
| My comments are far from arrogant; they are directly
| decrying the arrogance of another user, and labeling this
| kind of pushback as arrogant or condescending is
| disingenuous, bordering on ironic. Let's stay on topic.
| Der_Einzige wrote:
| Yes, you're allowed to be wrong on the internet, but not THIS
| wrong!
| achrono wrote:
| > We have opensourced the models in Hugging Face and ModelScope
| to you
|
| We are unfortunately now in a place where this falsehood has
| travelled the world while the truth is probably still half-asleep
| in its underwear.
|
| It is a shame that people who are working on what is probably the
| pinnacle of computing can so blatantly disregard the real
| meaning.
|
| Imagine if Microsoft started announcing everywhere that Windows,
| because all its EXE and DLLs are right there for you to see, is
| actually open-source!
|
| I suppose all we can do now is to keep asking "is it open-source
| or like _true_ open-source? ".
| diggan wrote:
| The model seems to be Apache 2.0 yet the training data remains
| private, so even if you had the resources, knowledge and
| ability to train it yourself, you wouldn't be able to.
|
| So no, Qwen 2 isn't open source, but they happen to release the
| models publicly. Guess "pseudo-open source" might make sense as
| a label.
|
| I agree, I'm not a super fan of people/organizations using
| "open source" as a marketing term which seems popular in the ML
| field right now.
| okaram wrote:
| I think we need to agree on terminology, but to me, this
| seems clear.
|
| The _model_ is open-source (or open-content, if you prefer).
| The _input data_ isn't.
| Zambyte wrote:
| The input data is the source (literally "where it comes
| from"). If the source is not open, it is not open source.
|
| The model is open weight, despite an OSI approved license
| sitting in the same directory as the binary blob.
| loudmax wrote:
| I like the term open-weights.
|
| The "source", analogous to the source code for a program,
| should include the training data. In this case that isn't
| open. The resulting weights are open, insofar as they can
| be redistributed and fine-tuned and so on.
| patrick-fitz wrote:
| I think there should be a distinction between open-source and
| open-weight LLMs.
| danielbln wrote:
| I like this terminology, I'm going to start using it.
| sitkack wrote:
| The problem is, all the Open Weight models are already
| calling themselves Open Source, so a new name that
| disambiguates existing names should be chosen.
| okaram wrote:
| Why is this false? The _model_ is open source, Apache 2.
| andy_xor_andrew wrote:
| Given the restrictions on GPUs to China, I'm curious what their
| training cluster looks like.
|
| (not saying this out of any support or non-support for such a GPU
| blockade; I'm just genuinely curious)
| ComputerGuru wrote:
| They probably buy them through intermediaries.
| blackoil wrote:
| They can train it in a data center in Singapore or Taiwan!!
| 0cf8612b2e1e wrote:
| The restrictions are fairly recent. Unless a Stuxnet disables
| all GPUs in the country, there is probably a lot of existing
| infrastructure.
| bx376 wrote:
| I believe besides the legacy Nvidia cards, Alibaba has built
| training clusters from domestic AI accelerator vendors like
| https://e.huawei.com/en/products/computing/ascend.
|
| But that means a lot of extra in-house engineering work to
| utilize that niche hardware.
| rfoo wrote:
| A100s, (old) A800s, H100s or (old) H800s.
|
| How did they get them? Who knows, but it's there.
| halJordan wrote:
| Baidu specifically claims they have orchestration technology to
| train on heterogeneous gpu clusters.
| davidcollantes wrote:
| As one of my tests, I asked about Tiananmen Square. It
| consistently returns an error.
| LeoPanthera wrote:
| Microsoft Copilot is so censored that it won't even tell you
| who won the 2020 US election.
| user_7832 wrote:
| I don't know if it's censoring or something even weirder,
| because it does this even for non-partisan political
| things. I tried asking when the previous Lok Sabha
| elections were (2019), and it'd just terminate the chat.
| It was bizarre??? It's not even something controversial.
| refulgentis wrote:
| So the other day I'm reading a story about how the Russian
| gov't is criticizing Yandex because its LLM isn't able to
| repeat / stick to the narratives at length. And I felt bad
| for them because it's probably virtually impossible, at
| scale, to rewrite the entirety of human knowledge so it's
| consistent with propaganda tripe.
|
| I wonder if that would also apply here, and you're better
| off just not touching on politics than trying to prevent it
| from ever saying anything the Party wouldn't like
| dvngnt_ wrote:
| ChatGPT 4o
|
| > Joe Biden won the 2020 U.S. presidential election,
| defeating the incumbent president, Donald Trump. Biden, the
| Democratic candidate, received 306 electoral votes, while
| Trump, the Republican candidate, received 232 electoral
| votes. Biden also won the popular vote, receiving over 81
| million votes to Trump's 74 million.
| WhitneyLand wrote:
| Is this supposed to be a joke or hyperbole?
| LeoPanthera wrote:
| I took a screenshot in case anyone doesn't believe me:
|
| https://ibb.co/RP5VTFX
| gracegreg wrote:
| Qwen2-72B claims to be better than Llama3-70B; I just found
| another Llama3 model that has comparable performance:
|
|     |            | Qwen2-72B | Higgs-Llama-3-70B | Llama3-70B-Instruct |
|     | MMLU       | 82.3      | 80.8              | 80.2                |
|     | MMLU-Pro   | 64.4      | 63.2              | 56.2                |
|     | Arena-Hard | 48.1      | 49.6              | 41.1                |
|     | GPQA       | 42.4      | 42.1              | 41.9                |
|
| - https://huggingface.co/bosonai/Higgs-Llama-3-70B
| segmondy wrote:
| all new models claim to be better than the top SOTA model.
| Since llama3 dropped, every new model released has claimed to
| be better than it.
| refulgentis wrote:
| [DISREGARD]
| philipkglass wrote:
| Do you mean 400B? I thought that 70B was released some time
| ago:
|
| https://huggingface.co/meta-llama/Meta-Llama-3-70B
| refulgentis wrote:
| Jesus, I'm losing it, thank you, you saved me from
| looking foolish (at least, continuing to :) )
| azinman2 wrote:
| You're thinking of 400B. 70B is out
| refulgentis wrote:
| Weird, every time I try asking what happened at Tiananmen Square,
| or why Xi is an outlier with 3 terms as party secretary, it
| errors. "All hail Glorious Xi :)" works though.
| https://huggingface.co/spaces/Qwen/Qwen2-72B-Instruct
| riku_iki wrote:
| I am wondering if such a moderated model can introduce a
| significant security risk, for example by generating
| exploitable code, or by trying to trigger some action based on
| some specific input.
| lIIllIIllIIllII wrote:
| I set the system prompt to try to avoid censorship by
| substituting etc etc; it didn't listen - it started generating
| a response and got as far as this before suddenly, [ERROR] :-)
|
| >I'm sorry for any confusion, but it seems there might be some
| mix-up in your query. The year 1979 does not have a widely
| recognized event matching your description, and "TS" and
| "Cydonia" do not correspond to any known geopolitical locations
| or events in historical records. "Cydonia" is often associated
| with a region on Mars that has been subject to various
| conspiracy theories, but it doesn't relate to any political
| events or protests.
|
| >If you're referring to a specific historical event, it could
| possibly be the Tian
| aubanel wrote:
| This model has:
|
| 1. On par or better performance than Llama-3-70B-Instruct
|
| 2. A much more comfortable context length of 128k (vs the tiny
| 8k that really hinders Llama-3)
|
| These 2 feats together will probably make it the first serious
| open-source rival to GPT-4!
| Havoc wrote:
| Somewhat odd sizing. 7B -> 57B is quite a jump... and it
| crucially misses basically the entire spectrum of consumer
| GPUs wholesale.
| galaxyLogic wrote:
| Is it common practice when training LLMs to give different
| weights to different training data sources?
|
| For instance, I might want to say that all training data that
| comes from my in-house emails takes precedence over anything
| that comes from the internet.
| sebzim4500 wrote:
| Yes, it is. IIRC, back when OpenAI was open and they published
| the breakdown, they were significantly overweighting Wikipedia.
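|
| As a rough sketch of how that source weighting often looks in
| practice with Hugging Face datasets (the file names and
| probabilities here are made up):
|
|     # Hedged sketch: oversample one data source relative to another.
|     from datasets import load_dataset, interleave_datasets
|
|     emails = load_dataset("json", data_files="inhouse_emails.jsonl",
|                           split="train")
|     web = load_dataset("json", data_files="web_crawl.jsonl",
|                        split="train")
|
|     # Sample 60% of examples from in-house emails, 40% from web text.
|     mixed = interleave_datasets([emails, web],
|                                 probabilities=[0.6, 0.4], seed=42)
|     print(mixed[0])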
___________________________________________________________________
(page generated 2024-06-06 23:01 UTC)