[HN Gopher] Qwen2.5: A Party of Foundation Models
       ___________________________________________________________________
        
       Qwen2.5: A Party of Foundation Models
        
       Author : apsec112
       Score  : 103 points
       Date   : 2024-09-18 17:42 UTC (5 hours ago)
        
 (HTM) web link (qwenlm.github.io)
 (TXT) w3m dump (qwenlm.github.io)
        
       | GaggiX wrote:
       | >our latest large-scale dataset, encompassing up to 18 trillion
       | tokens
       | 
       | I remember when GPT-3 was trained on 300B tokens.
        
         | imjonse wrote:
         | and was considered too dangerous to be released publicly.
        
           | GaggiX wrote:
           | The larger GPT-2s were also considered too dangerous to
           | release publicly at first.
        
             | Workaccount2 wrote:
              | I remember being very understanding of it too after seeing
              | the incredible (but absolutely terrible in retrospect)
              | outputs.
        
               | GaggiX wrote:
                | I wasn't really compelled at the time, and nothing has
                | changed.
        
           | baq wrote:
           | they are dangerous... for folks who need to scrape the web
           | for low background tokens to train their transformers.
        
           | abc-1 wrote:
           | Nobody ever really believed this, the truth is rarely in
           | vogue.
        
       | ekojs wrote:
       | Actually really impressive. They went up from 7T tokens to 18T
       | tokens. Curious to see how they perform after finetuning.
        
       | Flux159 wrote:
        | It would be nice to have comparisons to Claude 3.5 for the coder
        | model; only comparing to open-source models isn't super helpful,
        | because I'd want to compare against the model I'm currently using
        | for development work.
        
         | imjonse wrote:
         | Aider will probably have some numbers at
         | https://aider.chat/docs/leaderboards/
        
           | Deathmax wrote:
            | They've posted their own run of the Aider benchmark [1] if
            | you want to compare; it achieved 57.1%.
           | 
           | [1]: https://qianwen-res.oss-cn-
           | beijing.aliyuncs.com/Qwen2.5/Qwen...
        
         | diggan wrote:
         | Here is a comparison of the prompt "I want to create a basic
         | Flight simulator in Bevy and Rust. Help me figure out the core
         | properties I need for take off, in air flight and landing"
         | between Claude Sonnet 3.5 and Qwen2.5-14B-Instruct-Q4_K_M.gguf:
         | 
         | https://gist.github.com/victorb/7749e76f7c27674f3ae36d791e20...
         | 
          | AFAIK, there aren't any (micro)benchmark comparisons out yet.
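          | 
          | For reference, a minimal sketch of how such a local run can
          | look with the llama-cpp-python bindings (the model path,
          | context size and sampling settings here are assumptions, not
          | necessarily what the gist used):
          | 
          |     # Minimal sketch: chatting with a local Qwen2.5-14B-Instruct
          |     # Q4_K_M GGUF via llama-cpp-python. Values are illustrative.
          |     from llama_cpp import Llama
          | 
          |     llm = Llama(
          |         model_path="Qwen2.5-14B-Instruct-Q4_K_M.gguf",  # ~9 GB file
          |         n_ctx=8192,       # context window to allocate
          |         n_gpu_layers=-1,  # offload all layers to the GPU if possible
          |     )
          |     out = llm.create_chat_completion(
          |         messages=[{"role": "user", "content": (
          |             "I want to create a basic Flight simulator in Bevy and "
          |             "Rust. Help me figure out the core properties I need for "
          |             "take off, in air flight and landing")}],
          |         max_tokens=1024,
          |     )
          |     print(out["choices"][0]["message"]["content"])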
        
           | yourMadness wrote:
           | 14B with Q4_K_M quantization is about 9 GB.
           | 
           | Remarkable that it is at all comparable to Sonnet 3.5
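            | 
            | Back-of-the-envelope (the parameter count and effective
            | bits-per-weight below are approximations):
            | 
            |     # Rough size estimate for the Q4_K_M GGUF.
            |     params = 14.7e9          # Qwen2.5-14B is ~14.7B parameters
            |     bits_per_weight = 4.85   # Q4_K_M averages just under 5 bits
            |     print(params * bits_per_weight / 8 / 1e9)  # ~8.9 GB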
        
             | diggan wrote:
              | Comparable, I guess, but the result is a lot worse than
              | Sonnet's for sure. Parts of the example code don't make
              | much sense, while Sonnet seems to take the latest Bevy API
              | into account and mostly makes sense.
        
         | Sn0wCoder wrote:
         | This might be what you are asking for...
         | https://qwenlm.github.io/blog/qwen2.5-coder/
         | 
          | Ctrl+F for "Code Reasoning".
        
       | freeqaz wrote:
       | 32B is a nice size for 2x 3090s. That comfortably fits on the GPU
       | with minimal quantization and still leaves extra memory for the
       | long context length.
       | 
       | 70B is just a littttle rough trying to run without offloading
       | some layers to the CPU.
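        | 
        | A rough budget sketch (the model-config numbers and
        | bits-per-weight below are assumptions for illustration, not
        | exact figures):
        | 
        |     # Rough VRAM budget for a 32B model on 2x RTX 3090 (48 GB).
        |     def kv_cache_gb(layers, kv_heads, head_dim, seq_len, bytes_per=2):
        |         # K and V tensors per layer, fp16 cache by default
        |         return 2 * layers * kv_heads * head_dim * seq_len * bytes_per / 1e9
        | 
        |     weights_gb = 32e9 * 5 / 8 / 1e9  # ~32B params at ~5 bits/weight: ~20 GB
        |     cache_gb = kv_cache_gb(layers=64, kv_heads=8, head_dim=128,
        |                            seq_len=32768)  # assumed GQA config: ~8.6 GB
        |     print(weights_gb, cache_gb)
        |     # ~20 GB + ~8.6 GB leaves headroom on 48 GB; ~70B at the same
        |     # precision is ~44 GB of weights alone, hence the CPU offloading.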
        
         | a_wild_dandan wrote:
         | 70B+ models typically run great with my MacBook's 96GB of
         | (V)RAM. I want a Mac Studio to run e.g. llama-405B, but I can't
         | justify the marginal model quality ROI for like $7k or
         | whatever. (But I waaant iiit!)
        
       | cateye wrote:
       | _> we are inspired by the recent advancements in reinforcement
       | learning (e.g., o1)_
       | 
        | It will be interesting to see what the future brings as models
        | incorporate chain-of-thought approaches, and whether o1 gets
        | outperformed by open-source models.
        
       | covi wrote:
        | For anyone looking to play with a self-hosted version, here's a
        | quick guide to deploying to your own Kubernetes cluster or VPC:
       | https://github.com/skypilot-org/skypilot/tree/master/llm/qwe...
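        | 
        | Once it's up, any OpenAI-compatible client should work against
        | the endpoint; a minimal sketch (the URL, API key and model name
        | are placeholders for whatever your deployment exposes):
        | 
        |     # Querying a self-hosted, OpenAI-compatible Qwen2.5 endpoint
        |     # (e.g. served by vLLM). Values below are placeholders.
        |     from openai import OpenAI
        | 
        |     client = OpenAI(base_url="http://<your-endpoint>:8000/v1",
        |                     api_key="EMPTY")
        |     resp = client.chat.completions.create(
        |         model="Qwen/Qwen2.5-72B-Instruct",
        |         messages=[{"role": "user",
        |                    "content": "Summarize the Qwen2.5 release."}],
        |     )
        |     print(resp.choices[0].message.content)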
        
       | jcoc611 wrote:
       | Probably an ignorant question, but could someone explain why the
       | Context Length is much larger than the Generation Length?
        
         | dacox wrote:
         | When doing inference for an LLM, there are two stages.
         | 
         | The first phase is referred to as "prefill", where the input is
         | processed to create the KV Cache.
         | 
          | After that phase, the "decode" phase runs auto-regressively:
          | each decode step yields one new token.
         | 
         | This post on [Inference Memory
         | Requirements](https://huggingface.co/blog/llama31#inference-
         | memory-require...) is quite good.
         | 
          | These two phases have pretty different performance
          | characteristics - prefill can really maximize GPU memory. For
          | long contexts, it can be nigh impossible to do it all in a
          | single pass - frameworks like vLLM use a technique called
          | "chunked prefill".
         | 
         | The decode phase is compute intensive, but tends not to
         | maximize GPU memory.
         | 
          | If you are serving these models, you really want to be able to
          | use larger batch sizes during inference, which can only really
          | come with scale - for a smaller app, you won't want to make the
          | user wait that long.
         | 
         | So, long contexts only have to be processed _once_ per
         | inference, which is basically a scheduling problem.
         | 
          | But the number of decode passes scales linearly with the output
          | length. If it were unlimited, you could get some requests just
          | _always_ present in an inference batch, reducing throughput for
          | everyone.
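          | 
          | A minimal sketch of the two phases with Hugging Face
          | transformers (the model name is just a small stand-in; real
          | serving frameworks implement this with batching and paged KV
          | caches rather than a naive loop):
          | 
          |     # Prefill vs. decode with an explicit KV cache.
          |     import torch
          |     from transformers import AutoModelForCausalLM, AutoTokenizer
          | 
          |     name = "Qwen/Qwen2.5-0.5B-Instruct"  # small model for illustration
          |     tok = AutoTokenizer.from_pretrained(name)
          |     model = AutoModelForCausalLM.from_pretrained(name)
          |     input_ids = tok("A potentially very long context ...",
          |                     return_tensors="pt").input_ids
          | 
          |     # Prefill: one pass over the whole prompt builds the KV cache.
          |     with torch.no_grad():
          |         out = model(input_ids, use_cache=True)
          |     past = out.past_key_values
          |     next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
          | 
          |     # Decode: one forward pass per new token, reusing the cache,
          |     # so cost grows linearly with the generation length.
          |     generated = [next_id]
          |     for _ in range(32):
          |         with torch.no_grad():
          |             out = model(next_id, past_key_values=past, use_cache=True)
          |         past = out.past_key_values
          |         next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
          |         generated.append(next_id)
          | 
          |     print(tok.decode(torch.cat(generated, dim=-1)[0]))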
        
           | jcoc611 wrote:
           | That's a great explanation, thank you!
        
       ___________________________________________________________________
       (page generated 2024-09-18 23:00 UTC)