[HN Gopher] Qwen2 LLM Released
       ___________________________________________________________________
        
       Qwen2 LLM Released
        
       Author : bratao
       Score  : 177 points
       Date   : 2024-06-06 16:01 UTC (6 hours ago)
        
 (HTM) web link (qwenlm.github.io)
 (TXT) w3m dump (qwenlm.github.io)
        
       | minimaxir wrote:
       | A 0.5B parameter model with a 32k context length that also makes
       | good use of that full window?! That's _very_ interesting.
       | 
       | The academic benchmarks on that particular model relative to
       | 1.5B-2B models are what you would expect, but it would make for
       | an excellent base for finetuning/embedding generation.
        
         | rgbrgb wrote:
         | haven't done anything with a model so small. what kind of stuff
         | would you do with it?
        
           | minimaxir wrote:
            | A properly finetuned model can perform better for a
            | given use case, but even with PEFT/LoRAs, finetuning
            | and managing "smaller" open-source LLMs like Llama3 (7B
            | params) is annoying. That's partially why the even-
            | smaller ~2B Phi series of models took off.
            | 
            | A 0.5B model may not be that great out of the box, but
            | there's a lot of opportunity if it's responsive to
            | finetuning.
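            | 
            | A minimal sketch of that finetuning route with the peft
            | library (model name and hyperparameters here are
            | illustrative, not a recommendation):
            | 
            |     from transformers import AutoModelForCausalLM
            |     from peft import LoraConfig, get_peft_model
            | 
            |     model = AutoModelForCausalLM.from_pretrained(
            |         "Qwen/Qwen2-0.5B")
            |     config = LoraConfig(
            |         r=8,            # low-rank adapter dimension
            |         lora_alpha=16,
            |         target_modules=["q_proj", "v_proj"],
            |         task_type="CAUSAL_LM",
            |     )
            |     model = get_peft_model(model, config)
            |     # typically <1% of weights end up trainable
            |     model.print_trainable_parameters()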
        
           | GaggiX wrote:
            | You can use it for speculative decoding, for example;
            | this would increase the speed of larger models.
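            | 
            | A rough sketch with transformers' assisted generation
            | (model names are illustrative; the draft and target
            | must share a tokenizer, which the Qwen2 family does):
            | 
            |     from transformers import (AutoModelForCausalLM,
            |                               AutoTokenizer)
            | 
            |     tok = AutoTokenizer.from_pretrained(
            |         "Qwen/Qwen2-72B-Instruct")
            |     target = AutoModelForCausalLM.from_pretrained(
            |         "Qwen/Qwen2-72B-Instruct")
            |     draft = AutoModelForCausalLM.from_pretrained(
            |         "Qwen/Qwen2-0.5B-Instruct")
            | 
            |     inputs = tok("Explain speculative decoding:",
            |                  return_tensors="pt")
            |     # draft proposes tokens; target verifies in batch
            |     out = target.generate(**inputs,
            |                           assistant_model=draft,
            |                           max_new_tokens=100)
            |     print(tok.decode(out[0],
            |                      skip_special_tokens=True))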
        
         | 3abiton wrote:
          | I wonder if the 0.5B model would be usable for ML tasks
          | like summarization, classification, or embeddings,
          | replacing small models like spaCy that are usually used
          | for embeddings.
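          | 
          | For the embeddings case, one cheap recipe would be mean-
          | pooling the hidden states; a sketch (whether this beats
          | spaCy vectors is exactly the open question):
          | 
          |     import torch
          |     from transformers import AutoModel, AutoTokenizer
          | 
          |     tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")
          |     model = AutoModel.from_pretrained("Qwen/Qwen2-0.5B")
          | 
          |     def embed(text):
          |         enc = tok(text, return_tensors="pt")
          |         with torch.no_grad():
          |             # (1, seq_len, hidden) final-layer states
          |             h = model(**enc).last_hidden_state
          |         mask = enc["attention_mask"].unsqueeze(-1)
          |         return (h * mask).sum(1) / mask.sum(1)
          | 
          |     vec = embed("a sentence to embed")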
        
           | ai_what wrote:
            | It won't. Amazon kind of went that angle with
            | MistralLite[1] (a 7B finetune), and it was barely
            | passing in terms of being an effective summarizer. 0.5B
            | models are pretty much useless.
           | 
           | https://huggingface.co/amazon/MistralLite
        
             | coder543 wrote:
             | The official Mistral-7B-v0.2 model added support for 32k
             | context, and I think it's far better than MistralLite.
             | Third-party finetunes are rarely amazing at the best of
             | times.
             | 
             | Now, we have Mistral-7B-v0.3, which is supposedly an even
             | better model:
             | 
             | https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3
        
         | GaggiX wrote:
         | >that also makes good use of that full window?!
         | 
          | To be honest, the "Needle In A Haystack" test is the most
          | trivial test for a model that relies on full attention;
          | it's expected to be easy to pass if the model was trained
          | correctly.
        
           | ai_what wrote:
           | I agree. I personally don't have high hopes for the 0.5B
           | model.
           | 
           | Phi-2 was 2.7B and it was already regularly outputting
           | complete nonsense.
           | 
            | I ran the 0.5B model of the previous Qwen version (1.5)
            | and it reminded me of one of those lorem ipsum word
            | generators.
           | 
           | The other new Qwen models (7B and up) look good though.
        
             | refulgentis wrote:
              | Phi-2 wasn't instruct/chat finetuned, and it was very
              | upfront about this. "I tried Phi-2 and it was bad" is
              | a dilettante filter.
        
         | coder543 wrote:
         | Qwen1.5-0.5B supposedly supported up to 32k context as well,
         | but I can't even get it to summarize a ~2k token input with any
         | level of coherence.
         | 
         | I'm always excited to try a new model, so I'm looking forward
         | to trying Qwen2-0.5B... but I wouldn't get your hopes up this
         | much. These super tiny models seem far more experimental than
         | the larger LLMs.
         | 
          | Phi-3-mini (3.8B) supports a 128k context, and it is
          | actually a reasonably useful model in my tests.
          | Gemma-1.1-2B-it is a 2B model that only supports 8k
          | context, but it also does fairly well for summarization.
        
           | algo_trader wrote:
            | What would you recommend as a maximum-2B model for the
            | coding/reasoning domain?
        
             | coder543 wrote:
             | For reasoning tasks and coding tasks where you're chatting
             | with the model, there are no 2B models that I would
             | recommend at this point.
        
           | noman-land wrote:
           | Phi-3-mini has been really surprising me. It's quite good!
        
           | sitkack wrote:
            | Summarization is one of the most difficult tasks for
            | any LLM, and over that context window it's crazy to
            | think it could do it.
            | 
            | That context window is useful if you have a smaller
            | data extraction task, like dates, times, place names,
            | etc. And even for that it might need to be fine-tuned.
            | These small models are a _feedstock_.
        
           | TechDebtDevin wrote:
            | What tasks do you consider a 3.8B model to be useful
            | for? Chat applications on lesser hardware? I'm still
            | finding it difficult to parse what the real-world
            | application would ever be. However, I do understand
            | that the goal is to make the smallest, most efficient
            | model to compete with larger model capabilities one
            | day, and you can't get there without making these. But
            | do these types of models have any value for any sort of
            | product or real-world project?
        
             | coder543 wrote:
             | I think most of the interesting applications for these
             | small models are in the form of developer-driven
             | automations, not chat interfaces.
             | 
             | A common example that keeps popping up is a voice recorder
             | app that can provide not just a transcription of the
             | recording (which you don't need an LLM for), but also a
             | summary of the transcription, including key topics, key
             | findings, and action items that were discussed in a
             | meeting. With speaker diarization (assigning portions of
             | the transcript to different speakers automatically), it's
             | even possible to use an LLM to assign names to each of the
             | speakers in the transcript, if they ever identified
             | themselves in the meeting, and then the LLM could take that
             | and also know who is supposed to be handling each action
             | item, if that was discussed in the meeting. That's just
             | scratching the surface of what should be possible using
             | small LLMs (or SLMs, as Microsoft likes to call them).
             | 
             | An on-device LLM could summarize notifications if you have
             | a lot of catching up to do, or it could create a title for
             | a note automatically once you finish typing the note, or it
             | could be used to automatically suggest tags/categories for
             | notes. That LLM could be used to provide "completions",
             | like if the user is writing a list of things in a note, the
             | user could click a button to have that LLM generate several
             | more items following the same theme. That LLM can be used
             | to suggest contextually-relevant quick replies for
             | conversations. In a tightly-integrated system, you could
             | imagine receiving a work phone call, and that LLM could
             | automatically summarize your recent interactions with that
             | person (across sms, email, calendar, and slack/teams) for
             | you on the call screen, which could remind you why they're
             | calling you.
             | 
             | LLMs can also be used for data extraction, where they can
             | be given unstructured text, and fill in a data structure
             | with the desired values. As an example, one could imagine
             | browsing a job posting... the browser could use an LLM to
             | detect that the primary purpose of this webpage is a job
             | posting, and then it could pass the text of the page
             | through the LLM and ask the LLM to fill in common values
             | like the job title, company name, salary range, and job
             | requirements, and then the browser could offer a condensed
             | interface with this information, as well as the option to
             | save this information (along with the URL to the job
             | posting) to your "job search" board with one click.
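              | 
              | A sketch of that extraction step (the prompt and
              | model here are hypothetical; a real system would
              | validate the JSON and retry on failure):
              | 
              |     import json
              |     from transformers import pipeline
              | 
              |     llm = pipeline("text-generation",
              |                    model="Qwen/Qwen2-1.5B-Instruct")
              | 
              |     page_text = "..."  # visible text of the posting
              |     prompt = ("Extract title, company, salary_range "
              |               "and requirements from this job "
              |               "posting as JSON. Use null for missing "
              |               "fields.\n\n" + page_text)
              |     out = llm(prompt, max_new_tokens=200,
              |               return_full_text=False)
              |     # naive parse; validate for real use
              |     job = json.loads(out[0]["generated_text"])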
             | 
             | Now, it might be a little much to ask a browser to have
             | special cases for just job postings, when there are so many
             | similar things a user might want to save for later, so you
             | could even let the user define new "boards" where they
             | describe to a (hopefully larger) LLM the purpose of the
             | board and the kinds of information you're looking for, and
             | it would generate the search parameters and data extraction
             | tasks that a smaller LLM would then do in the background as
             | you browse, letting the browser present that information
             | when it is available so that you can choose whether to save
             | it to your board. The larger LLM could still potentially be
             | on-device, but a more powerful LLM that occupies most of
             | the RAM and processing on your device is something you'd
             | only want to use for a foreground task, not eating up
             | resources in the background.
             | 
             | LLMs are interesting because they make it possible to do
             | things that traditional programming could not do in any
             | practical sense. If something can be done _without_ an LLM,
             | then absolutely... _do that_. LLMs are very computationally
             | intensive, and their accuracy is more like a human than a
             | computer. There are plenty of drawbacks to LLMs, if you
             | have another valid option.
        
               | TechDebtDevin wrote:
                | Thanks for the response; I have been genuinely
                | curious about use cases for these little guys.
        
         | moffkalast wrote:
          | It might be fine-tunable for specific tasks BERT-style; I
          | really doubt it's coherent enough for anything out of the
          | box.
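          | 
          | E.g., a sketch of the BERT-style route, bolting a
          | classification head onto it (assuming the HF sequence-
          | classification wrapper, which supports Qwen2 like most
          | causal LMs):
          | 
          |     from transformers import (
          |         AutoModelForSequenceClassification,
          |         AutoTokenizer)
          | 
          |     tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")
          |     model = AutoModelForSequenceClassification.from_pretrained(
          |         "Qwen/Qwen2-0.5B", num_labels=2)  # e.g. spam/ham
          |     model.config.pad_token_id = tok.pad_token_id
          |     # then finetune on labeled text with the usual
          |     # Trainer loop, as with any BERT classifier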
        
       | c4pt0r wrote:
       | Really glad to see Qwen2 uses Apache 2.0 License.
        
         | soulofmischief wrote:
          | Unfortunately, the 72B model still uses the old license,
          | but it's still a great step in the right direction.
        
           | ilaksh wrote:
            | Right, but that license doesn't seem too bad as long as
            | you don't get to 100 million users.
        
         | idle_zealot wrote:
         | Do we actually know if companies even have the right to license
         | these things? Last I heard there was no consensus on whether
         | model weights even qualify for copyright protection.
        
           | ronsor wrote:
           | No, we do not know, and unless the emperor steps out, we will
           | not know if he has on any clothes.
        
       | msoad wrote:
       | Are we plateauing with those LLM benchmarks?
        
         | esafak wrote:
         | I doubt it; they could be a whole lot smarter. We need to solve
         | alignment in the meantime.
        
         | minimaxir wrote:
         | No, but getting better benchmarks tends to require more
         | shenanigans (e.g. mixture-of-experts).
         | 
         | Qwen2 72B doesn't score that high on the leaderboard relative
         | to brute-forced finetunes: https://huggingface.co/spaces/open-
         | llm-leaderboard/open_llm_...
        
         | YetAnotherNick wrote:
          | We are plateauing with respect to compute. The unreleased
          | Llama 3 400B has significantly better benchmarks. Also,
          | Zuckerberg said that Llama 3 continued to improve even
          | after 15T tokens.
        
         | causal wrote:
          | Maybe, but I don't see how this release would support
          | that conclusion: their 72B model surpasses Llama 3 70B on
          | so many metrics by such a wide margin that I find it a
          | little hard to believe.
        
           | segmondy wrote:
            | that benchmark is by them; the community has to
            | evaluate and report back. I never believe self-reported
            | benchmarks.
        
           | moffkalast wrote:
           | Yeah the arena leaderboard will show where it really goes in
           | a week or two.
        
         | GaggiX wrote:
          | Going from 79.5% up to 84.2% is a ~23% reduction in error
          | rate, for example; it's quite a big difference.
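          | 
          | (Worked out: errors drop from 100 - 79.5 = 20.5% to
          | 100 - 84.2 = 15.8%, and (20.5 - 15.8) / 20.5 ~ 0.23, so
          | about 23% of the previous errors are eliminated.)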
        
         | tosh wrote:
          | Better benchmarks with higher ceilings are also needed to
          | be able to tell how much better the best models are
          | compared to the others.
        
         | jimmySixDOF wrote:
          | They actually used some of the newest benchmarks,
          | including MixEval, which seems to be in line with LMSYS's
          | crowdsourced Elo scores and is super efficient to run.
        
       | behnamoh wrote:
       | Please don't release new models unless there is an arch change.
       | All these new LLMs are essentially the same technique applied to
       | almost the same dataset.
        
         | causal wrote:
          | Even if that were true (it's not; arch and dataset differ
          | considerably from model to model), the release of new
          | foundation models has a lot of other important factors to
          | consider (e.g. performance, license, size).
        
           | behnamoh wrote:
           | No it doesn't. It's like the Linux market. Most effort is
           | being wasted re-inventing the wheel, forking other repos, and
           | creating new distros. Except now distro = LLM.
        
             | causal wrote:
             | Qwen 2 uses GQA, has a much longer context length, and
             | covers many new languages. You can't do that with the same
             | architecture and dataset.
        
             | soulofmischief wrote:
             | The variety of distributions in the Linux ecosystem is a
             | strength, not a weakness.
             | 
             | No one is making you pick up a shovel to build alongside
             | them; instead you choose to rest on your laurels and
             | complain about other peoples' hard work and dedication to
             | providing people with choices.
        
               | behnamoh wrote:
               | The fact that less than 5% of desktop users and 0% of
               | mobile users [0] use Linux speaks for itself. All the
                | effort that went into creating distros could have
                | been focused on a few high-quality ones.
               | 
               | [0]: Android uses the Linux kernel which is almost the
               | same across distros, but isn't per se a Linux OS. I'm
               | talking about real Linux running on a mobile phone.
        
               | pixelpoet wrote:
               | Most mobile users can't even spell PC (let alone use
               | one), and blame autocorrect for that. To expect them to
               | use a Linux distro is pretty unrealistic when they can't
               | even use Windows:
               | https://www.theverge.com/22684730/students-file-folder-
               | direc...
        
               | balder1991 wrote:
                | This argument makes no sense considering the
                | diversity of tastes and desires in the community.
                | It's incredibly difficult to get more people to
                | work together and achieve superior results,
                | especially in something decentralized and
                | voluntary. You can easily end up with more chaos
                | and internal conflicts instead.
        
               | soulofmischief wrote:
               | Why are you so hellbent on dictating how volunteers spend
               | their time?
               | 
               | Have you ever considered that these people are satisfied
               | with their interests and truly could not care less about
               | your opinion? Or that your opinion is just that-- yours?
               | Not some absolute truth?
               | 
               | Anyway, it's beside the point, as there are multiple high
               | quality Linux distributions to choose from, thanks to a
               | large de-duplication of efforts through libraries.
        
             | rfoo wrote:
              | Honest question, just curious: what's your definition
              | of "different arch"? Is a model with GQA=8 different
              | from one with GQA=16 in your mind? I guess not?
              | 
              | How about GQA vs MHA, or GQA vs MLA?
              | 
              | If anything attention-like is the same in your mind,
              | are S5 and RWKV different archs, given that both are
              | some kind of linear RNN?
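              | 
              | For concreteness, a toy shape demo of what GQA
              | changes (MHA is the case n_kv_heads == n_heads, MQA
              | is n_kv_heads == 1):
              | 
              |     import torch
              | 
              |     n_heads, n_kv_heads, d = 16, 4, 64
              |     q = torch.randn(1, n_heads, 128, d)
              |     k = torch.randn(1, n_kv_heads, 128, d)
              |     v = torch.randn(1, n_kv_heads, 128, d)
              | 
              |     g = n_heads // n_kv_heads  # 4 q-heads per KV head
              |     k = k.repeat_interleave(g, dim=1)  # share KV
              |     v = v.repeat_interleave(g, dim=1)  # across group
              |     a = (q @ k.transpose(-2, -1) / d**0.5).softmax(-1)
              |     out = a @ v  # same output shape as plain MHA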
        
         | jwitthuhn wrote:
          | I'm not aware of any other openly-licensed model of
          | comparable size to 57B. That seems like a worthwhile
          | addition to what is already available, imo.
         | 
          | The closest is Mixtral 8x7B, but that one only uses a
          | fraction of its parameters for each pass. This one should
          | produce better but slower results at roughly the same
          | memory requirement.
        
         | soulofmischief wrote:
         | Do you understand at all how science and replication work?
         | 
         | Let these researchers do what they want, they didn't release it
         | for you specifically.
        
           | behnamoh wrote:
           | This is not science. It's engineering at this point. Many
           | startups create their own models for various purposes, mainly
           | for VC investment.
        
             | soulofmischief wrote:
             | I'm sorry, but this is absolutely scientific research.
             | There is no amount of gaslighting or reframing which will
             | deny it that status. The source of funding is irrelevant.
              | Most scientific research is funded with the hope of
              | seeing a return on investment.
             | 
             | You simply don't know what you're talking about. Your
             | overly cynical take is against Hacker News guidelines.
        
               | dareal wrote:
               | I'm not sure where this arrogant attitude comes from. But
               | please keep that condescending tone out of this
               | community.
        
               | soulofmischief wrote:
               | My comments are far from arrogant; they are directly
               | decrying the arrogance of another user, and labeling this
               | kind of pushback as arrogant or condescending is
               | disingenuous, bordering on ironic. Let's stay on topic.
        
         | Der_Einzige wrote:
         | Yes, you're allowed to be wrong on the internet, but not THIS
         | wrong!
        
       | achrono wrote:
       | > We have opensourced the models in Hugging Face and ModelScope
       | to you
       | 
       | We are unfortunately now in a place where this falsehood has
       | travelled the world while the truth is probably still half-asleep
       | in its underwear.
       | 
       | It is a shame that people who are working on what is probably the
       | pinnacle of computing can so blatantly disregard the real
       | meaning.
       | 
        | Imagine if Microsoft started announcing everywhere that
        | Windows, because all its EXEs and DLLs are right there for
        | you to see, is actually open-source!
       | 
       | I suppose all we can do now is to keep asking "is it open-source
       | or like _true_ open-source? ".
        
         | diggan wrote:
         | The model seems to be Apache 2.0 yet the training data remains
         | private, so even if you had the resources, knowledge and
         | ability to train it yourself, you wouldn't be able to.
         | 
         | So no, Qwen 2 isn't open source, but they happen to release the
         | models publicly. Guess "pseudo-open source" might make sense as
         | a label.
         | 
          | I agree; I'm not a super fan of people/organizations
          | using "open source" as a marketing term, which seems
          | popular in the ML field right now.
        
           | okaram wrote:
           | I think we need to agree on terminology, but to me, this
           | seems clear.
           | 
            | The _model_ is open-source (or open-content, if you
            | prefer). The _input data_ isn't.
        
             | Zambyte wrote:
             | The input data is the source (literally "where it comes
             | from"). If the source is not open, it is not open source.
             | 
             | The model is open weight, despite an OSI approved license
             | sitting in the same directory as the binary blob.
        
             | loudmax wrote:
             | I like the term open-weights.
             | 
             | The "source", analogous to the source code for a program,
             | should include the training data. In this case that isn't
             | open. The resulting weights are open, insofar as they can
             | be redistributed and fine-tuned and so on.
        
         | patrick-fitz wrote:
          | I think there should be a distinction between open-source
          | and open-weight LLMs.
        
           | danielbln wrote:
           | I like this terminology, I'm going to start using it.
        
             | sitkack wrote:
             | The problem is, all the Open Weight models are already
             | calling themselves Open Source, so a new name that
             | disambiguates existing names should be chosen.
        
         | okaram wrote:
         | Why is this false? The _model_ is open source, Apache 2.
        
       | andy_xor_andrew wrote:
       | Given the restrictions on GPUs to China, I'm curious what their
       | training cluster looks like.
       | 
       | (not saying this out of any support or non-support for such a GPU
       | blockade; I'm just genuinely curious)
        
         | ComputerGuru wrote:
         | They probably buy them through intermediaries.
        
         | blackoil wrote:
         | They can train it in a data center in Singapore or Taiwan!!
        
         | 0cf8612b2e1e wrote:
         | The restrictions are fairly recent. Unless a Stuxnet disables
         | all GPUs in the country, there is probably a lot of existing
         | infrastructure.
        
         | bx376 wrote:
          | I believe besides the legacy Nvidia cards, Alibaba has
          | built training clusters from domestic AI accelerator
          | vendors like
          | https://e.huawei.com/en/products/computing/ascend.
          | 
          | But that means a lot of extra in-house engineering work
          | to utilize that niche hardware.
        
         | rfoo wrote:
         | A100s, (old) A800s, H100s or (old) H800s.
         | 
         | How did they get them? Who knows, but it's there.
        
         | halJordan wrote:
          | Baidu specifically claims they have orchestration
          | technology to train on heterogeneous GPU clusters.
        
       | davidcollantes wrote:
        | As one of my tests, I asked about Tiananmen Square. It will
        | consistently render an error.
        
         | LeoPanthera wrote:
         | Microsoft Copilot is so censored that it won't even tell you
         | who won the 2020 US election.
        
           | user_7832 wrote:
            | I don't know if it's censoring or something even
            | weirder, because it does this even for non-partisan
            | political things. I tried asking when the previous Lok
            | Sabha elections were (2019) and it'd just terminate the
            | chat. It was bizarre. It's not even something
            | controversial.
        
             | refulgentis wrote:
              | So the other day I was reading a story about how the
              | Russian gov't is criticizing Yandex because its LLM
              | isn't able to repeat / stick to the narratives at
              | length. And I felt bad for them because it's probably
              | virtually impossible, at scale, to rewrite the
              | entirety of human knowledge so it's consistent with
              | propaganda tripe.
              | 
              | I wonder if that would also apply here, and whether
              | you're better off just not touching politics than
              | trying to prevent it from ever saying anything the
              | Party wouldn't like.
        
           | dvngnt_ wrote:
           | ChatGPT 4o
           | 
           | > Joe Biden won the 2020 U.S. presidential election,
           | defeating the incumbent president, Donald Trump. Biden, the
           | Democratic candidate, received 306 electoral votes, while
           | Trump, the Republican candidate, received 232 electoral
           | votes. Biden also won the popular vote, receiving over 81
           | million votes to Trump's 74 million.
        
           | WhitneyLand wrote:
           | Is this supposed to be a joke or hyperbole?
        
             | LeoPanthera wrote:
             | I took a screenshot in case anyone doesn't believe me:
             | 
             | https://ibb.co/RP5VTFX
        
       | gracegreg wrote:
        | Qwen2-72B claims to be better than Llama3-70B. I just found
        | another Llama3 model that has comparable performance:
        | 
        |   |            | Qwen2-72B | Higgs-Llama-3-70B | Llama3-70B-Instruct |
        |   | ---------- | --------- | ----------------- | ------------------- |
        |   | MMLU       | 82.3      | 80.8              | 80.2                |
        |   | MMLU-Pro   | 64.4      | 63.2              | 56.2                |
        |   | Arena-Hard | 48.1      | 49.6              | 41.1                |
        |   | GPQA       | 42.4      | 42.1              | 41.9                |
        | 
        | - https://huggingface.co/bosonai/Higgs-Llama-3-70B
        
         | segmondy wrote:
          | All new models claim to be better than the top SOTA
          | model. Since Llama 3 dropped, every new model released
          | has claimed to be better than it.
        
           | refulgentis wrote:
           | [DISREGARD]
        
             | philipkglass wrote:
             | Do you mean 400B? I thought that 70B was released some time
             | ago:
             | 
             | https://huggingface.co/meta-llama/Meta-Llama-3-70B
        
               | refulgentis wrote:
               | Jesus, I'm losing it, thank you, you saved me from
               | looking foolish (at least, continuing to :) )
        
             | azinman2 wrote:
             | You're thinking of 400B. 70B is out
        
       | refulgentis wrote:
       | Weird, every time I try asking what happened at Tiananmen Square,
       | or why Xi is an outlier with 3 terms as party secretary, it
       | errors. "All hail Glorious Xi :)" works though.
       | https://huggingface.co/spaces/Qwen/Qwen2-72B-Instruct
        
         | riku_iki wrote:
          | I am wondering if such a moderated model could pose a
          | significant security risk; for example, it could generate
          | exploitable code, or try to trigger some action based on
          | some specific input.
        
         | lIIllIIllIIllII wrote:
          | I set the system prompt to try to avoid censorship by
          | substituting etc etc; it didn't listen - started
          | generating a response and got as far as this before
          | suddenly, [ERROR] :-)
         | 
         | >I'm sorry for any confusion, but it seems there might be some
         | mix-up in your query. The year 1979 does not have a widely
         | recognized event matching your description, and "TS" and
         | "Cydonia" do not correspond to any known geopolitical locations
         | or events in historical records. "Cydonia" is often associated
         | with a region on Mars that has been subject to various
         | conspiracy theories, but it doesn't relate to any political
         | events or protests.
         | 
         | >If you're referring to a specific historical event, it could
         | possibly be the Tian
        
       | aubanel wrote:
        | This model has:
        | 
        | 1. On par or better performance than Llama-3-70B-Instruct
        | 
        | 2. A much more comfortable context length of 128k (vs the
        | tiny 8k that really hinders Llama-3)
        | 
        | These 2 feats together will probably make it the first
        | serious open-source rival to GPT-4!
        
       | Havoc wrote:
        | Somewhat odd sizing. 7B -> 57B is quite a jump... and
        | crucially misses basically the entire spectrum of consumer
        | GPUs wholesale.
        
       | galaxyLogic wrote:
        | Is it common practice in LLM training to give different
        | weights to different training data sources?
        | 
        | For instance, I might want to say that all training data
        | that comes from my in-house emails takes precedence over
        | anything that comes from the internet.
        
         | sebzim4500 wrote:
          | Yes, it is. IIRC back when OpenAI was open and they
          | published the breakdown, they were significantly
          | overweighting Wikipedia.
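          | 
          | A sketch of one common recipe, sampling sources with
          | fixed probabilities via the HF datasets library (file
          | names are illustrative):
          | 
          |     from datasets import (load_dataset,
          |                           interleave_datasets)
          | 
          |     emails = load_dataset(
          |         "json", data_files="inhouse_emails.jsonl",
          |         split="train")
          |     web = load_dataset(
          |         "json", data_files="web_crawl.jsonl",
          |         split="train")
          | 
          |     # 60% of examples drawn from emails, regardless of
          |     # each corpus's raw size
          |     mixed = interleave_datasets([emails, web],
          |                                 probabilities=[0.6, 0.4],
          |                                 seed=42)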
        
       ___________________________________________________________________
       (page generated 2024-06-06 23:01 UTC)