[HN Gopher] Qwen2.5-1M: Deploy your own Qwen with context length...
___________________________________________________________________
Qwen2.5-1M: Deploy your own Qwen with context length up to 1M
tokens
Author : meetpateltech
Score : 80 points
Date : 2025-01-26 17:24 UTC (5 hours ago)
(HTM) web link (qwenlm.github.io)
(TXT) w3m dump (qwenlm.github.io)
| simonw wrote:
| I'm really interested in hearing from anyone who _does_ manage to
| successfully run a long prompt through one of these on a Mac (using
| one of the GGUF versions, or through other means).
| rcarmo wrote:
| I'm not sure what `files-to-prompt` does. I have an M3 Max and
| can take a stab at it, although I'm currently fussing with a
| few quantized -r1 models...
| simonw wrote:
| It's this tool: https://pypi.org/project/files-to-prompt/
| rcarmo wrote:
| You might want to use the file markers that the model
| outputs while being loaded by ollama:
|     llm_load_print_meta: general.name     = Qwen2.5 7B Instruct 1M
|     llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
|     llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
|     llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
|     llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
|     llm_load_print_meta: LF token         = 148848 'AI'
|     llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
|     llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
|     llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
|     llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
|     llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
|     llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
|     llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
|     llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
|     llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
|     llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
|     llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
|     llm_load_print_meta: max token length = 256
| terhechte wrote:
| I bought an M4 Max with 128GB of RAM just for these use cases.
| Currently downloading the 7B.
| tmcdonald wrote:
| Ollama has a num_ctx parameter that controls the context window
| length - it defaults to 2048. At a guess you will need to set
| that.
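|
| A rough sketch of overriding it (the model tag here is just a
| placeholder for whatever you pulled, and the value should fit
| your RAM):
|
|     # inside an interactive session
|     ollama run your-qwen2.5-1m-tag
|     >>> /set parameter num_ctx 80000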
| simonw wrote:
| Huh! I had incorrectly assumed that was for output, not input.
| Thanks!
|
| YES that was it:
|
|     files-to-prompt ~/Dropbox/Development/llm -e py -c | \
|       llm -m q1m 'describe this codebase in detail' \
|       -o num_ctx 80000
|
| I was watching my memory usage and it quickly maxed out my 64GB
| so I hit Ctrl+C before my Mac crashed.
| amrrs wrote:
| This has been the problem with a lot of long context use
| cases. It's not just about the model supporting it, but also
| about having sufficient compute and acceptable inference time.
| This is exactly why I was excited for Mamba and now possibly
| Lightning attention.
|
| That said, the new DCA (dual chunk attention) these models use
| to provide long context could be an interesting area to watch.
| jmorgan wrote:
| Sorry this isn't more obvious. Ideally VRAM usage for the
| context window (the KV cache) becomes dynamic, starting small
| and growing with token usage, whereas right now Ollama
| defaults to a size of 2K which can be overridden at runtime.
| A great example of this is vLLM's PagedAttention
| implementation [1] or Microsoft's vAttention [2] which is
| CUDA-specific (and there are quite a few others).
|
| 1M tokens will definitely require a lot of KV cache memory.
| One way to reduce the memory footprint is KV cache
| quantization, which was recently added behind a flag [3] and
| cuts the memory footprint to roughly a quarter when 4-bit KV
| cache quantization is used (OLLAMA_KV_CACHE_TYPE=q4_0 ollama
| serve).
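|
| A minimal sketch of putting those together (the model name is a
| placeholder, and flash attention needs to be on for the
| quantized cache to take effect):
|
|     # serve with a 4-bit quantized KV cache
|     OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q4_0 ollama serve
|
|     # then request a large context window per call
|     curl http://localhost:11434/api/generate -d '{
|       "model": "your-qwen2.5-1m-tag",
|       "prompt": "Describe this codebase in detail: ...",
|       "options": {"num_ctx": 131072}
|     }'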
|
| [1] https://arxiv.org/pdf/2309.06180
|
| [2] https://github.com/microsoft/vattention
|
| [3] https://smcleod.net/2024/12/bringing-k/v-context-
| quantisatio...
| rahimnathwani wrote:
| Yup, and this parameter is supported by the plugin he's using:
|
| https://github.com/taketwo/llm-ollama/blob/4ccd5181c099af963...
| thot_experiment wrote:
| Ollama is a "easymode" LLM runtime and as such has all the
| problems that every easymode thing has. It will assume things
| and the moment you want to do anything interesting those
| assumptions will shoot you in the foot, though I've found
| ollama plays so fast and loose even first party things that
| "should just work" do not. For example if you run R1 (at least
| as of 2 days ago when i tried this) using the default `ollama
| run deepseek-r1:7b` you will get different context size, top_p
| and temperature vs what Deepseek recommends in their release
| post.
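|
| A quick way to check and fix that yourself (just a sketch; the
| recommended values are from memory, so double-check DeepSeek's
| model card):
|
|     # see what defaults the Modelfile bakes in
|     ollama show deepseek-r1:7b --parameters
|
|     # override them for a session
|     ollama run deepseek-r1:7b
|     >>> /set parameter temperature 0.6
|     >>> /set parameter num_ctx 32768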
| iamnotagenius wrote:
| Requires an obscene amount of memory for context.
| hmottestad wrote:
| More than other models? I thought that context used a lot of
| memory on all models.
|
| And I'd hardly call it obscene. You can buy a Mac Studio with
| 192GB of memory, that should allow you to max out the context
| window of the 7B model. Probably not going to be very fast
| though.
| varispeed wrote:
| Not attainable for the working class, though. _can_ is doing a
| lot of heavy lifting here. It seems like after a brief period
| where technology was essentially class-agnostic, now only the
| wealthy can enjoy being part of development and everyone else
| can just be a consumer.
| sbarre wrote:
| I mean... when has this not been the case?
|
| Technology has never been class-agnostic or universally
| accessible.
|
| Even saying that, I would argue that there is more, not
| less, technology that is accessible to more people today
| than there ever has been.
| hmottestad wrote:
| Not sure what you mean. Cutting edge computing has never
| been cheap. And a Mac Studio is definitely within the
| budget of a software developer in Norway. Not going to feel
| like a cheap investment, but definitely something that
| would be doable. Unlike a cluster of H100 GPUs, which would
| cost as much as a small apartment in Oslo.
|
| And you can easily get a dev job in Norway without having
| to run an LLM locally on your computer.
| sgt wrote:
| Agreed - it's probably not unreasonable. So are the M4
| Macs becoming the de facto solution for running an LLM
| locally, thanks to the insane 800 GB/s memory bandwidth
| of Apple Silicon at its best?
| manmal wrote:
| No, they lack the compute power to be great at
| inference.
| simonw wrote:
| Can you back that up?
| simonw wrote:
| The advantage the Macs have is that they can share RAM
| between GPU and CPU, and GPU-accessible RAM is everything
| when you want to run a decent sized LLM.
|
| The problem is that most ML models are released for
| NVIDIA CUDA. Getting them to work on macOS requires
| translating them, usually to either GGUF (the llama.cpp
| format) or MLX (using Apple's own MLX array framework).
|
| As such, as a Mac user I remain envious of people with
| NVIDIA/CUDA rigs with decent amounts of VRAM.
|
| The NVIDIA "Digits" product may change things when it
| ships: https://www.theverge.com/2025/1/6/24337530/nvidia-
| ces-digits... - it may become the new cheapest convenient
| way to get 128GB of GPU-accessible RAM for running
| models.
| manmal wrote:
| The money would be better invested in a 2-4x RTX 3090 x86
| build than in a Mac Studio. While the Macs have a
| fantastic performance-per-watt ratio and decent memory
| (both bus width and size), they are not great on raw
| compute. A multi-RTX-3090 build totally smokes a Mac at
| the same price point in inference speed.
| cma wrote:
| Not much more than something like a used jetski, but
| possibly depreciates even faster.
| woadwarrior01 wrote:
| It's on the model's huggingface README[1].
|
| > For processing 1 million-token sequences:
|
| > Qwen2.5-7B-Instruct-1M: At least 120GB VRAM (total across
| GPUs).
|
| > Qwen2.5-14B-Instruct-1M: At least 320GB VRAM (total across
| GPUs).
|
| [1]: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-1M
| bloomingkales wrote:
| I've heard rumblings about native context length. I don't know too
| much about it, but is this natively a 1M context length?
|
| Even models like Llama 3 8B say they have a larger context, but
| they really don't in practice. I have a hard time getting past 8K
| on 16GB VRAM (you can definitely set the context length higher,
| but the quality and speed degradation is obvious).
|
| I'm curious how people are doing this on modest hardware.
| segmondy wrote:
| You can't on modest hardware. VRAM usage is a function of model
| size plus the KV cache, which depends on context length and the
| quantization of both the model and the K/V cache. 16GB isn't
| much, really. You need more VRAM; the easiest way for most folks
| is to buy a MacBook with unified memory. You can get a 128GB
| Mac, but it's not cheap. If you are handy and resourceful you
| can build a GPU cluster.
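|
| Back-of-envelope for the KV cache alone (a sketch assuming the
| 7B config is 28 layers, 4 KV heads, head dim 128 with an fp16
| cache; check config.json for the real numbers, and weights plus
| activations come on top):
|
|     # bytes per token = 2 (K and V) * layers * kv_heads
|     #                   * head_dim * 2 bytes (fp16)
|     echo $((2 * 28 * 4 * 128 * 2))            # ~57 KB/token
|     echo $((2*28*4*128*2 * 1000000 / 2**30))  # ~53 GiB at 1M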
| elorant wrote:
| You need a model that has specifically been extended for larger
| context windows. For Llama-3 there's Llama3-gradient with up to
| 1M tokens. You can find it at ollama.com
| buyucu wrote:
| First, this is amazing!
|
| Second, how does one increase the context window without
| requiring obscene amounts of RAM? We're really hitting the
| limitations of the transformer architecture's quadratic
| scaling...
| 35mm wrote:
| Chain of agents seems to be a promising approach for splitting
| up tasks into smaller parts and then synthesising the
| results[1]
|
| [1] https://research.google/blog/chain-of-agents-large-
| language-...
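|
| A crude map-reduce version of that idea with the llm CLI (just
| a sketch, not the actual Chain-of-Agents algorithm; the chunk
| size and the q1m alias from upthread are arbitrary):
|
|     # split the input, extract notes per chunk, then synthesize
|     split -b 200000 big_input.txt chunk_
|     for f in chunk_*; do
|       llm -m q1m 'Extract the key facts from this section' \
|         < "$f" >> notes.txt
|     done
|     llm -m q1m 'Combine these notes into one answer' < notes.txt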
| mmaunder wrote:
| Just want to confirm: so this is the first locally runnable model
| with a context length of greater than 128K and it's gone straight
| to 1M, correct?
| terhechte wrote:
| Yes. It requires a lot of RAM, and even on an M4 with a lot of
| RAM, if you give it 1M tokens the prompt processing alone
| (that is, before you get the first response token) will
| probably take ~30 minutes or more. However, I'm looking forward
| to checking whether I can indeed give it a whole codebase and
| ask questions about it.
| segmondy wrote:
| No, this is not the first local model with a context length
| greater than 128K; there have been such models before, for
| example:
|
| https://huggingface.co/ai21labs/AI21-Jamba-1.5-Mini 256k
| https://huggingface.co/THUDM/glm-4-9b-chat-1m 1M
|
| and many others that supposedly extend traditional models via
| fine-tuning/RoPE scaling.
| mmaunder wrote:
| Thanks.
| jkbbwr wrote:
| Everyone keeps making the context windows bigger, which is nice.
|
| But what about output? I want to generate a few thousand lines of
| code, anyone got any tips?
| anotheryou wrote:
| Isn't that the same, limit-wise?
|
| Now you just need to convince it to output that much :)
| mmaunder wrote:
| Repeatedly ask it for more, providing the previous output as
| context. (Back to context length as a limitation.)
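|
| With the llm CLI that loop is roughly (a sketch; -c continues
| the most recent conversation, so the previous output goes back
| in as context):
|
|     llm -m q1m 'Write part 1 of the module' > part1.py
|     llm -c 'Continue exactly where you left off' > part2.py
|     llm -c 'Continue exactly where you left off' > part3.py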
| AyyEye wrote:
| These things already produce embarrassing output. If you make
| it longer it's just going to get worse.
| bugglebeetle wrote:
| Context size actually helps with this, given how LLMs are
| actually deployed as applications. For example, if you look at
| how the "continue" option in the DeepSeek web app works for
| code gen, what they're likely doing is reinserting the prior
| messages (in some form) into a new request to prompt further
| completion. The more context a model has and can manage
| successfully, the better it will likely be at generating longer
| code blocks.
| gpualerts wrote:
| You tried to run it on CPU? I can't imagine how long that would
| take you. I'm tempted to try it out on a half-TB RAM server.
| mmaunder wrote:
| This API-only model with a 1M context window was released back in
| November. Just for some historical context.
|
| https://qwenlm.github.io/blog/qwen2.5-turbo/
| simonw wrote:
| That's a different model - the 2.5 Turbo one. Today's release
| is something different.
| simonw wrote:
| Here are tips for running it on macOS using MLX:
| https://twitter.com/awnihannun/status/1883611098081099914 - using
| https://huggingface.co/mlx-community/Qwen2.5-7B-Instruct-1M-...
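|
| The command shape looks roughly like this (a sketch; the model
| name assumes a 4-bit mlx-community conversion, so use whatever
| the link above actually points to, and flags can differ between
| mlx-lm versions):
|
|     pip install mlx-lm
|     mlx_lm.generate \
|       --model mlx-community/Qwen2.5-7B-Instruct-1M-4bit \
|       --prompt "Summarize the following: $(cat notes.md)" \
|       --max-tokens 1000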
| woadwarrior01 wrote:
| MLX does not yet support the dual chunk attention[1] that these
| models use for long contexts.
|
| [1]: https://arxiv.org/abs/2402.17463
| ilaksh wrote:
| What's the SOTA for memory-centric computing? I feel like maybe
| we need a new paradigm or something to bring the price of AI
| memory down.
|
| Maybe they can take some of those hundreds of billions and invest
| in new approaches.
|
| Because racks of H100s are not sustainable. But it's clear that
| increasing the amount of memory available is key to getting more
| intelligence or capabilities.
|
| Maybe there is a way to connect DRAM with photonic interconnects
| that doesn't require much data ordering for AI if the neural
| network software model changes somewhat.
|
| Is there something that has the same capabilities of a
| transformer but doesn't operate on sequences?
|
| If I was a little smarter and had any math ability I feel like I
| could contribute.
|
| But I am smart enough to know that just building bigger and
| bigger data centers is not the ideal path forward.
| mkroman wrote:
| The AI hardware race is still going strong, but with so many
| rapid changes to the fundamental architectures, it doesn't make
| sense to bet everything on specialized hardware just yet. It's
| happening, but it's expensive and slow.
|
| There's just not enough capacity to build memory fast enough
| right now. Everyone needs the biggest and fastest modules they
| can get, since it directly impacts the performance of the
| models.
|
| There's still a lot happening to improve memory, like the
| latest Titans paper: https://arxiv.org/abs/2501.00663
|
| So I think until a breakthrough happens or the fabs catch up,
| it'll be this painful race to build more datacenters.
| rfoo wrote:
| > Because racks of H100s are not sustainable.
|
| Huh? Racks of H100s are the most sustainable thing we can have
| for LLMs for now.
| dang wrote:
| Related: https://simonwillison.net/2025/Jan/26/qwen25-1m/
|
| (via https://news.ycombinator.com/item?id=42832838, but we merged
| that thread hither)
| refulgentis wrote:
| People are getting pretty...clever?...with long context retrieval
| benchmarking in papers.
|
| Here, the prose says "nearly perfect", the graph is all green
| except for a little yellow section, and you have to parse a
| 96-cell table, with familiarity with several models and
| technical techniques, to get the real number (84.4%, and that
| tops out at 128K, nowhere near the claimed 1M).
|
| I don't bring this up to denigrate, but rather to highlight that
| "nearly perfect" is still quite far off. Don't rely on long
| context for anything you build.
| anotherpaulg wrote:
| In my experience with AI coding, very large context windows
| aren't useful in practice. Every model seems to get confused when
| you feed it more than ~25-30K tokens. The models stop obeying
| their system prompts, can't correctly find/transcribe pieces of
| code in the context, etc.
|
| Developing aider, I've seen this problem with gpt-4o, Sonnet,
| DeepSeek, etc. Many aider users report this too. It's perhaps the
| #1 problem users have, so I created a dedicated help page [0].
|
| Very large context may be useful for certain tasks with lots of
| "low value" context. But for coding, it seems to lure users into
| a problematic regime.
|
| [0] https://aider.chat/docs/troubleshooting/edit-
| errors.html#don...
| lifty wrote:
| Thanks for aider! It has become an integral part of my
| workflow. Looking forward to trying DeepSeek in architect mode
| with Sonnet as the driver. Curious whether it will be a
| noticeable improvement compared to using Sonnet by itself.
| cma wrote:
| Claude works incredibly well for me when asking for code
| changes to projects filling up 80% of the context (160K
| tokens). It's very expensive with the API, but reasonable
| through the web interface with Pro.
| seunosewa wrote:
| The behaviour you described is what happens when you have small
| context windows. Perhaps you're feeding the models with more
| tokens than you think you are. I have enjoyed loading large
| codebases into AI Studio and getting very satisfying and
| accurate answers because the models have 1M to 2M token context
| windows.
| adamgordonbell wrote:
| Aider is great, but you need specific formats from the LLM.
| That might be where the challenge is.
|
| I've used the giant context in Gemini to dump a code base and
| say: describe the major data structures and data flows.
|
| Things like that, overview documents, work great. It's amazing
| for orienting in an unfamiliar codebase.
___________________________________________________________________
(page generated 2025-01-26 23:00 UTC)