[HN Gopher] Qwen2.5-1M: Deploy your own Qwen with context length...
       ___________________________________________________________________
        
       Qwen2.5-1M: Deploy your own Qwen with context length up to 1M
       tokens
        
       Author : meetpateltech
       Score  : 80 points
       Date   : 2025-01-26 17:24 UTC (5 hours ago)
        
 (HTM) web link (qwenlm.github.io)
 (TXT) w3m dump (qwenlm.github.io)
        
       | simonw wrote:
        | I'm really interested in hearing from anyone who _does_ manage to
        | successfully run a long prompt through one of these on a Mac
        | (using one of the GGUF versions, or through other means).
        
         | rcarmo wrote:
         | I'm missing what `files-to-prompt` does. I have an M3 Max and
         | can take a stab at it, although I'm currently fussing with a
         | few quantized -r1 models...
        
           | simonw wrote:
           | It's this tool: https://pypi.org/project/files-to-prompt/
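            | 
            | Roughly: it walks a directory and concatenates the files into
            | a single prompt, with per-file markers, so you can pipe a
            | whole codebase into an LLM. A minimal sketch (the -e and -c
            | flags are the same ones I use elsewhere in this thread; the
            | PyPI page has the full list):
            | 
            |     # collect every .py file under a repo as one Claude-style
            |     # XML-ish blob
            |     files-to-prompt path/to/repo -e py -c > prompt.txt
            | 
            |     # or pipe it straight into the llm CLI (q1m is just what
            |     # I have this model registered as locally, via llm-ollama)
            |     files-to-prompt path/to/repo -e py -c | \
            |       llm -m q1m 'describe this codebase in detail'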
        
             | rcarmo wrote:
             | You might want to use the file markers that the model
             | outputs while being loaded by ollama:
              | llm_load_print_meta: general.name     = Qwen2.5 7B Instruct 1M
              | llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
              | llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
              | llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
              | llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
              | llm_load_print_meta: LF token         = 148848 'AI'
              | llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
              | llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
              | llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
              | llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
              | llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
              | llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
              | llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
              | llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
              | llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
              | llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
              | llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
              | llm_load_print_meta: max token length = 256
        
         | terhechte wrote:
          | I bought an M4 Max with 128GB of RAM just for these use cases.
          | Currently downloading the 7B.
        
       | tmcdonald wrote:
       | Ollama has a num_ctx parameter that controls the context window
       | length - it defaults to 2048. At a guess you will need to set
       | that.
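        | 
        | For example (a sketch; the model tag below is a placeholder for
        | whatever you pulled the 1M model as):
        | 
        |     # interactively, inside `ollama run <model>`:
        |     #   /set parameter num_ctx 32768
        | 
        |     # or per-request via the REST API:
        |     curl http://localhost:11434/api/generate -d '{
        |       "model": "YOUR-QWEN-1M-TAG",
        |       "prompt": "hello",
        |       "options": {"num_ctx": 32768}
        |     }'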
        
         | simonw wrote:
         | Huh! I had incorrectly assumed that was for output, not input.
         | Thanks!
         | 
          | YES that was it:
          | 
          |     files-to-prompt ~/Dropbox/Development/llm \
          |       -e py -c | \
          |       llm -m q1m 'describe this codebase in detail' \
          |       -o num_ctx 80000
         | 
         | I was watching my memory usage and it quickly maxed out my 64GB
         | so I hit Ctrl+C before my Mac crashed.
        
           | amrrs wrote:
            | This has been the problem with a lot of long context use
            | cases: it's not just whether the model supports it, but also
            | whether you have enough compute and acceptable inference
            | time. This is exactly why I was excited for Mamba and now
            | possibly Lightning attention.
            | 
            | That said, the DCA (dual chunk attention) these models use to
            | get long context could be an interesting area to watch.
        
           | jmorgan wrote:
           | Sorry this isn't more obvious. Ideally VRAM usage for the
           | context window (the KV cache) becomes dynamic, starting small
           | and growing with token usage, whereas right now Ollama
           | defaults to a size of 2K which can be overridden at runtime.
           | A great example of this is vLLM's PagedAttention
           | implementation [1] or Microsoft's vAttention [2] which is
           | CUDA-specific (and there are quite a few others).
           | 
            | 1M tokens will definitely require a lot of KV cache memory.
            | One way to reduce the footprint is KV cache quantization,
            | which was recently added behind a flag [3]; 4-bit KV cache
            | quantization roughly quarters the memory used
            | (OLLAMA_KV_CACHE_TYPE=q4_0 ollama serve).
           | 
           | [1] https://arxiv.org/pdf/2309.06180
           | 
           | [2] https://github.com/microsoft/vattention
           | 
           | [3] https://smcleod.net/2024/12/bringing-k/v-context-
           | quantisatio...
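            | 
            | For a rough sense of scale, a back-of-envelope sketch (the
            | layer/head numbers are my read of the 7B's shape and worth
            | double-checking against its config.json):
            | 
            |     # KV cache bytes ~= 2 (K+V) x layers x kv_heads x head_dim
            |     #                   x context_tokens x bytes_per_element
            |     # assumed Qwen2.5-7B shape: 28 layers, 4 KV heads, head_dim 128
            |     echo "f16:  $(( 2 * 28 * 4 * 128 * 1000000 * 2 / 1024**3 )) GiB"
            |     echo "q4_0: $(( 2 * 28 * 4 * 128 * 1000000 / 2 / 1024**3 )) GiB"
            |     # -> roughly 53 GiB vs 13 GiB of KV cache alone at 1M
            |     # tokens (ignoring block-scale overhead and everything
            |     # else in VRAM)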
        
         | rahimnathwani wrote:
         | Yup, and this parameter is supported by the plugin he's using:
         | 
         | https://github.com/taketwo/llm-ollama/blob/4ccd5181c099af963...
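          | 
          | i.e., roughly (the model name below is a placeholder for
          | whatever Ollama tag or alias you use):
          | 
          |     llm install llm-ollama
          |     llm models    # the Ollama models you've pulled now show up
          |     llm -m <ollama-model-name> 'hello' -o num_ctx 8192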
        
         | thot_experiment wrote:
          | Ollama is an "easymode" LLM runtime and as such has all the
          | problems that every easymode thing has. It will assume things,
          | and the moment you want to do anything interesting those
          | assumptions will shoot you in the foot. I've found Ollama plays
          | so fast and loose that even first-party things that "should
          | just work" do not. For example, if you run R1 (at least as of 2
          | days ago when I tried this) using the default `ollama run
          | deepseek-r1:7b`, you will get a different context size, top_p
          | and temperature than what DeepSeek recommends in their release
          | post.
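          | 
          | If you care, a Modelfile is the least painful way to pin those
          | settings yourself (the numbers below are placeholders; take the
          | real ones from DeepSeek's release notes):
          | 
          |     # Modelfile
          |     FROM deepseek-r1:7b
          |     PARAMETER num_ctx 16384
          |     PARAMETER temperature 0.6
          |     PARAMETER top_p 0.95
          | 
          |     # then build and run it:
          |     ollama create r1-tuned -f Modelfile
          |     ollama run r1-tuned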
        
       | iamnotagenius wrote:
        | Requires an obscene amount of memory for context.
        
         | hmottestad wrote:
         | More than other models? I thought that context used a lot of
         | memory on all models.
         | 
          | And I'd hardly call it obscene. You can buy a Mac Studio with
          | 192GB of memory, which should allow you to max out the context
          | window of the 7B model. Probably not going to be very fast,
          | though.
        
           | varispeed wrote:
            | Not attainable for the working class, though. _can_ is doing
            | a lot of heavy lifting here. It seems like, after a brief
            | period where technology was essentially class-agnostic, only
            | the wealthy now get to take part in development and everyone
            | else can just be a consumer.
        
             | sbarre wrote:
             | I mean... when has this not been the case?
             | 
             | Technology has never been class-agnostic or universally
             | accessible.
             | 
             | Even saying that, I would argue that there is more, not
             | less, technology that is accessible to more people today
             | than there ever has been.
        
             | hmottestad wrote:
             | Not sure what you mean. Cutting edge computing has never
             | been cheap. And a Mac Studio is definitely within the
             | budget of a software developer in Norway. Not going to feel
             | like a cheap investment, but definitely something that
             | would be doable. Unlike a cluster of H100 GPUs, which would
             | cost as much as a small apartment in Oslo.
             | 
             | And you can easily get a dev job in Norway without having
             | to run an LLM locally on your computer.
        
               | sgt wrote:
                | Agreed - it's probably not unreasonable. So are the M4
                | Macs becoming the de facto solution for running an LLM
                | locally, thanks to the insane 800 GB/sec memory bandwidth
                | of Apple Silicon at its best?
        
               | manmal wrote:
                | No, they lack the compute power to be great at
                | inference.
        
               | simonw wrote:
               | Can you back that up?
        
               | simonw wrote:
               | The advantage the Macs have is that they can share RAM
               | between GPU and CPU, and GPU-accessible RAM is everything
               | when you want to run a decent sized LLM.
               | 
               | The problem is that most ML models are released for
               | NVIDIA CUDA. Getting them to work on macOS requires
               | translating them, usually to either GGUF (the llama.cpp
               | format) or MLX (using Apple's own MLX array framework).
               | 
               | As such, as a Mac user I remain envious of people with
               | NVIDIA/CUDA rigs with decent amounts of VRAM.
               | 
               | The NVIDIA "Digits" product may change things when it
               | ships: https://www.theverge.com/2025/1/6/24337530/nvidia-
               | ces-digits... - it may become the new cheapest convenient
               | way to get 128GB of GPU-accessible RAM for running
               | models.
        
               | manmal wrote:
                | The money would be better invested in a 2-4x 3090 x86
                | build than in a Mac Studio. While the Macs have a
                | fantastic performance-per-watt ratio and decent memory
                | support (both bus width and memory size), they are not
                | great on compute. A multi-RTX-3090 build totally smokes a
                | Mac at the same price point in inference speed.
        
             | cma wrote:
             | Not much more than something like a used jetski, but
             | possibly depreciates even faster.
        
         | woadwarrior01 wrote:
         | It's on the model's huggingface README[1].
         | 
         | > For processing 1 million-token sequences:
         | 
         | > Qwen2.5-7B-Instruct-1M: At least 120GB VRAM (total across
         | GPUs).
         | 
         | > Qwen2.5-14B-Instruct-1M: At least 320GB VRAM (total across
         | GPUs).
         | 
         | [1]: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-1M
        
       | bloomingkales wrote:
        | I've heard rumblings about native context length. I don't know
        | too much about it, but is this natively a 1M context length?
        | 
        | Even models like Llama 3 8B say they have a larger context, but
        | they really don't in practice. I have a hard time getting past 8k
        | on 16GB of VRAM (you can definitely set the context length
        | higher, but the quality and speed degradation is obvious).
        | 
        | I'm curious how people are doing this on modest hardware.
        
         | segmondy wrote:
          | You can't on modest hardware. VRAM usage is a function of model
          | size plus the KV cache, which depends on context length and on
          | the quantization of both the model and the K/V cache. 16GB
          | isn't much, really. You need more VRAM; the easiest way for
          | most folks is to buy a MacBook with unified memory. You can get
          | a 128GB Mac, but it's not cheap. If you are handy and
          | resourceful you can build a GPU cluster.
        
         | elorant wrote:
         | You need a model that has specifically been extended for larger
         | context windows. For Llama-3 there's Llama3-gradient with up to
         | 1M tokens. You can find it at ollama.com
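          | 
          | Something like the following should work (a sketch; check the
          | model page on ollama.com for the exact tag and the recommended
          | context setting):
          | 
          |     ollama run llama3-gradient
          |     # then, inside the REPL, raise the context window, e.g.:
          |     #   /set parameter num_ctx 256000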
        
       | buyucu wrote:
       | first, this is amazing!
       | 
       | second, how does one increase the context window without
       | requiring obscene amounts of RAM? we're really hitting the
       | limitations of the transformer architecture's quadratic
       | scaling...
        
         | 35mm wrote:
         | Chain of agents seems to be a promising approach for splitting
         | up tasks into smaller parts and then synthesising the
         | results[1]
         | 
         | [1] https://research.google/blog/chain-of-agents-large-
         | language-...
        
       | mmaunder wrote:
       | Just want to confirm: so this is the first locally runnable model
       | with a context length of greater than 128K and it's gone straight
       | to 1M, correct?
        
         | terhechte wrote:
          | Yes. It requires a lot of RAM, and even on an M4 with a lot of
          | RAM, if you give it 1M tokens the prompt processing alone
          | (that is, before you get the first response token) will
          | probably take ~30 minutes or more. However, I'm looking forward
          | to checking whether I can indeed give it a whole codebase and
          | ask questions about it.
        
         | segmondy wrote:
          | No, this is not the first local model with a context length
          | greater than 128k. There have been such models before, for
          | example:
          | 
          | https://huggingface.co/ai21labs/AI21-Jamba-1.5-Mini (256k)
          | https://huggingface.co/THUDM/glm-4-9b-chat-1m (1M)
          | 
          | and many others that supposedly extended traditional models via
          | finetuning/RoPE scaling.
        
           | mmaunder wrote:
           | Thanks.
        
       | jkbbwr wrote:
       | Everyone keeps making the context windows bigger, which is nice.
       | 
        | But what about output? I want to generate a few thousand lines of
        | code. Anyone got any tips?
        
         | anotheryou wrote:
          | Isn't that the same, limit-wise?
         | 
         | Now you just need to convince it to output that much :)
        
         | mmaunder wrote:
          | Repeatedly ask it for more, providing the previous output as
          | context. (Back to context length as a limitation.)
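          | 
          | With simonw's llm CLI (used elsewhere in this thread) that loop
          | is roughly (a sketch; q1m being whatever model name or alias
          | you use):
          | 
          |     llm -m q1m 'write the first module of ...'
          |     llm -c 'keep going, next module'
          |     llm -c 'keep going'
          |     # -c continues the most recent conversation, so the prior
          |     # output rides along as context each time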
        
         | AyyEye wrote:
         | These things already produce embarrassing output. If you make
         | it longer it's just going to get worse.
        
         | bugglebeetle wrote:
          | Context size actually helps with this, relative to how LLMs
          | are actually deployed as applications. For example, if you look
          | at how the "continue" option in the DeepSeek web app works for
          | code gen, what they're likely doing is reinserting the prior
          | messages (in some form) into a new prompt to get further
          | completion. The more context a model has and can manage
          | successfully, the better it will likely be at generating longer
          | code blocks.
        
       | gpualerts wrote:
        | You tried to run it on CPU? I can't imagine how long that would
        | take you. I'm tempted to try it out on a half-TB RAM server.
        
       | mmaunder wrote:
        | This API-only model with a 1M context window was released back in
        | November. Just for some historical context.
       | 
       | https://qwenlm.github.io/blog/qwen2.5-turbo/
        
         | simonw wrote:
          | That's a different model - the 2.5 Turbo one. Today's release
          | is something new.
        
       | simonw wrote:
       | Here are tips for running it on macOS using MLX:
       | https://twitter.com/awnihannun/status/1883611098081099914 - using
       | https://huggingface.co/mlx-community/Qwen2.5-7B-Instruct-1M-...
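        | 
        | The short version (a sketch; substitute the exact mlx-community
        | repo name from the link above, and note the flag names are from
        | recent mlx-lm releases):
        | 
        |     pip install --upgrade mlx-lm
        |     mlx_lm.generate \
        |       --model mlx-community/Qwen2.5-7B-Instruct-1M-<quant> \
        |       --prompt 'describe this codebase' \
        |       --max-tokens 500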
        
         | woadwarrior01 wrote:
          | MLX does not yet support the dual chunk attention[1] that these
          | models use for long contexts.
         | 
         | [1]: https://arxiv.org/abs/2402.17463
        
       | ilaksh wrote:
       | What's the SOTA for memory-centric computing? I feel like maybe
       | we need a new paradigm or something to bring the price of AI
       | memory down.
       | 
       | Maybe they can take some of those hundreds of billions and invest
       | in new approaches.
       | 
       | Because racks of H100s are not sustainable. But it's clear that
       | increasing the amount of memory available is key to getting more
       | intelligence or capabilities.
       | 
       | Maybe there is a way to connect DRAM with photonic interconnects
       | that doesn't require much data ordering for AI if the neural
       | network software model changes somewhat.
       | 
       | Is there something that has the same capabilities of a
       | transformer but doesn't operate on sequences?
       | 
        | If I were a little smarter and had any math ability, I feel like
        | I could contribute.
       | 
       | But I am smart enough to know that just building bigger and
       | bigger data centers is not the ideal path forward.
        
         | mkroman wrote:
          | The AI hardware race is still going strong, but with so many
          | rapid changes to the fundamental architectures, it doesn't make
          | sense to bet everything on specialized hardware just yet. It's
          | happening, but it's expensive and slow.
         | 
         | There's just not enough capacity to build memory fast enough
         | right now. Everyone needs the biggest and fastest modules they
         | can get, since it directly impacts the performance of the
         | models.
         | 
          | There's still a lot happening to improve memory, like the
          | latest Titans paper: https://arxiv.org/abs/2501.00663
         | 
         | So I think until a breakthrough happens or the fabs catch up,
         | it'll be this painful race to build more datacenters.
        
         | rfoo wrote:
         | > Because racks of H100s are not sustainable.
         | 
         | Huh? Racks of H100s are the most sustainable thing we can have
         | for LLMs for now.
        
       | dang wrote:
       | Related: https://simonwillison.net/2025/Jan/26/qwen25-1m/
       | 
       | (via https://news.ycombinator.com/item?id=42832838, but we merged
       | that thread hither)
        
       | refulgentis wrote:
       | People are getting pretty...clever?...with long context retrieval
       | benchmarking in papers.
       | 
        | Here, the prose says "nearly perfect", the graph is all green
        | except for a little yellow section, and you have to parse a
        | 96-cell table, with familiarity with several models and
        | techniques, to get the real number (84.4%, and that tops out at
        | 128K, nowhere near the claimed 1M).
        | 
        | I don't bring this up to denigrate, but rather to highlight that
        | "nearly perfect" is still quite far off. Don't rely on long
        | context for anything you build.
        
       | anotherpaulg wrote:
       | In my experience with AI coding, very large context windows
       | aren't useful in practice. Every model seems to get confused when
       | you feed them more than ~25-30k tokens. The models stop obeying
       | their system prompts, can't correctly find/transcribe pieces of
       | code in the context, etc.
       | 
       | Developing aider, I've seen this problem with gpt-4o, Sonnet,
       | DeepSeek, etc. Many aider users report this too. It's perhaps the
       | #1 problem users have, so I created a dedicated help page [0].
       | 
       | Very large context may be useful for certain tasks with lots of
       | "low value" context. But for coding, it seems to lure users into
       | a problematic regime.
       | 
       | [0] https://aider.chat/docs/troubleshooting/edit-
       | errors.html#don...
        
         | lifty wrote:
          | Thanks for aider! It has become an integral part of my
          | workflow. Looking forward to trying DeepSeek in architect mode
          | with Sonnet as the driver. Curious whether it will be a
          | noticeable improvement compared to using Sonnet by itself.
        
         | cma wrote:
          | Claude works incredibly well for me when asking for code
          | changes to projects filling up 80% of context (160K tokens).
          | It's very expensive with the API, but reasonable through the
          | web interface with Pro.
        
         | seunosewa wrote:
          | The behaviour you described is what happens when you have small
          | context windows. Perhaps you're feeding the models more tokens
          | than you think you are. I have enjoyed loading large codebases
          | into AI Studio and getting very satisfying and accurate answers
          | because the models have 1M to 2M token context windows.
        
         | adamgordonbell wrote:
          | Aider is great, but you need specific formats from the LLM.
          | That might be where the challenge is.
         | 
         | I've used the giant context in Gemini to dump a code base and
         | say: describe the major data structures and data flows.
         | 
         | Things like that, overview documents, work great. It's amazing
         | for orienting in an unfamiliar codebase.
        
       ___________________________________________________________________
       (page generated 2025-01-26 23:00 UTC)