[HN Gopher] Qwen2.5: A Party of Foundation Models
___________________________________________________________________
Qwen2.5: A Party of Foundation Models
Author : apsec112
Score : 103 points
Date : 2024-09-18 17:42 UTC (5 hours ago)
(HTM) web link (qwenlm.github.io)
(TXT) w3m dump (qwenlm.github.io)
| GaggiX wrote:
| >our latest large-scale dataset, encompassing up to 18 trillion
| tokens
|
| I remember when GPT-3 was trained on 300B tokens.
| imjonse wrote:
| and was considered too dangerous to be released publicly.
| GaggiX wrote:
| The larger GPT-2s were also considered too dangerous to
| release publicly at first.
| Workaccount2 wrote:
| I remember being very understanding of it too after seeing the
| incredible (but absolutely terrible in retrospect) outputs.
| GaggiX wrote:
| I wasn't really compelled at the time; nothing has
| changed.
| baq wrote:
| they are dangerous... for folks who need to scrape the web
| for low background tokens to train their transformers.
| abc-1 wrote:
| Nobody ever really believed this, the truth is rarely in
| vogue.
| ekojs wrote:
| Actually really impressive. They went up from 7T tokens to 18T
| tokens. Curious to see how they perform after finetuning.
| Flux159 wrote:
| It would be nice to have comparisons to Claude 3.5 for the coder
| model, only comparing to open source models isn't super helpful
| because I would want to compare to the model I'm currently using
| for development work.
| imjonse wrote:
| Aider will probably have some numbers at
| https://aider.chat/docs/leaderboards/
| Deathmax wrote:
| They've posted their own run of the Aider benchmark [1] if
| you want to compare, it achieved 57.1%.
|
| [1]: https://qianwen-res.oss-cn-
| beijing.aliyuncs.com/Qwen2.5/Qwen...
| diggan wrote:
| Here is a comparison of the prompt "I want to create a basic
| Flight simulator in Bevy and Rust. Help me figure out the core
| properties I need for take off, in air flight and landing"
| between Claude Sonnet 3.5 and Qwen2.5-14B-Instruct-Q4_K_M.gguf:
|
| https://gist.github.com/victorb/7749e76f7c27674f3ae36d791e20...
|
| AFAIK, there aren't any (micro)benchmark comparisons out yet.
| yourMadness wrote:
| 14B with Q4_K_M quantization is about 9 GB.
|
| Remarkable that it is at all comparable to Sonnet 3.5
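|
| Rough back-of-the-envelope (just a sketch; the parameter count
| and the effective bits/weight for Q4_K_M are approximations):
|
|     params = 14.8e9   # Qwen2.5-14B, roughly
|     bpw = 4.85        # Q4_K_M averages ~4.8-4.9 bits/weight
|     print(params * bpw / 8 / 1e9)   # ~9.0 GB of weights,
|                                     # before KV cache/overhead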
| diggan wrote:
| Comparable, I guess. But the result is a lot worse compared
| to Sonnet for sure. Parts of the example code don't make
| much sense. Meanwhile Sonnet seems to take the latest Bevy
| API into account, and mostly makes sense.
| Sn0wCoder wrote:
| This might be what you are asking for...
| https://qwenlm.github.io/blog/qwen2.5-coder/
|
| Ctrl+F "Code Reasoning"
| freeqaz wrote:
| 32B is a nice size for 2x 3090s. That comfortably fits on the GPU
| with minimal quantization and still leaves extra memory for the
| long context length.
|
| 70B is just a littttle rough trying to run without offloading
| some layers to the CPU.
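|
| Rough budget sketch (all numbers approximate, ignoring
| activation overhead):
|
|     vram = 2 * 24                    # GB across two 3090s
|     w_q8 = 32e9 * 8.5 / 8 / 1e9      # ~34 GB at ~8-bit
|     w_q4 = 32e9 * 4.85 / 8 / 1e9     # ~19 GB at Q4_K_M-ish
|     print(vram - w_q8, vram - w_q4)  # ~14 GB vs ~29 GB left
|                                      # for KV cache / context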
| a_wild_dandan wrote:
| 70B+ models typically run great with my MacBook's 96GB of
| (V)RAM. I want a Mac Studio to run e.g. llama-405B, but I can't
| justify the marginal model quality ROI for like $7k or
| whatever. (But I waaant iiit!)
| cateye wrote:
| _> we are inspired by the recent advancements in reinforcement
| learning (e.g., o1)_
|
| It will be interesting to see what the future brings as more
| models incorporate chain-of-thought approaches, and whether
| o1 gets outperformed by open-source models.
| covi wrote:
| For anyone looking to play with a self-hosted version, here's a
| quick guide to deploying it on your own Kubernetes cluster or VPC:
| https://github.com/skypilot-org/skypilot/tree/master/llm/qwe...
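|
| Once it's up, querying it is just the standard client, assuming
| the recipe serves an OpenAI-compatible endpoint (the vLLM-based
| SkyPilot examples do); the endpoint address and model name below
| are placeholders:
|
|     from openai import OpenAI
|
|     client = OpenAI(base_url="http://<your-endpoint>:8000/v1",
|                     api_key="EMPTY")
|     resp = client.chat.completions.create(
|         model="Qwen/Qwen2.5-72B-Instruct",
|         messages=[{"role": "user", "content": "Hello, Qwen!"}],
|     )
|     print(resp.choices[0].message.content)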
| jcoc611 wrote:
| Probably an ignorant question, but could someone explain why the
| Context Length is much larger than the Generation Length?
| dacox wrote:
| When doing inference for an LLM, there are two stages.
|
| The first phase is referred to as "prefill", where the input is
| processed to create the KV Cache.
|
| After that phase, the "decode" phase runs auto-regressively:
| each decode step yields one new token.
|
| This post on [Inference Memory
| Requirements](https://huggingface.co/blog/llama31#inference-
| memory-require...) is quite good.
|
| These two phases have pretty different performance
| characteristics - prefill can really maximize GPU memory. For
| long contexts, it can be nigh impossible to do it all in a
| single pass - frameworks like vLLM use a technique called
| "chunked prefill".
|
| The decode phase is compute intensive, but tends not to
| maximize GPU memory.
|
| If you are serving these models, you really want to be able to
| have larger batch sizes during inference, which can only really
| come with scale - for a smaller app, you won't want to make the
| user wait that long.
|
| So, long contexts only have to be processed _once_ per
| inference, which is basically a scheduling problem.
|
| But the number of decode passes scales linearly with the output
| length. If it were unlimited, you could end up with some requests
| _always_ present in an inference batch, reducing throughput for
| everyone.
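|
| A minimal sketch of the two phases with Hugging Face
| transformers (greedy decoding; the small model is picked purely
| for illustration):
|
|     import torch
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|
|     name = "Qwen/Qwen2.5-0.5B-Instruct"
|     tok = AutoTokenizer.from_pretrained(name)
|     model = AutoModelForCausalLM.from_pretrained(
|         name, torch_dtype=torch.float16)
|
|     ids = tok("Explain KV caching:", return_tensors="pt").input_ids
|
|     # Prefill: one pass over the whole prompt builds the KV cache.
|     with torch.no_grad():
|         out = model(ids, use_cache=True)
|     past = out.past_key_values
|     next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
|
|     # Decode: each step feeds ONE new token and reuses the cache,
|     # so the number of passes scales with the output length.
|     new_tokens = [next_id]
|     for _ in range(32):
|         with torch.no_grad():
|             out = model(next_id, past_key_values=past,
|                         use_cache=True)
|         past = out.past_key_values
|         next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
|         new_tokens.append(next_id)
|
|     print(tok.decode(torch.cat(new_tokens, dim=-1)[0]))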
| jcoc611 wrote:
| That's a great explanation, thank you!
___________________________________________________________________
(page generated 2024-09-18 23:00 UTC)