[HN Gopher] Qwen2.5-VL-32B: Smarter and Lighter
       ___________________________________________________________________
        
       Qwen2.5-VL-32B: Smarter and Lighter
        
       Author : tosh
       Score  : 258 points
       Date   : 2025-03-24 18:35 UTC (4 hours ago)
        
 (HTM) web link (qwenlm.github.io)
 (TXT) w3m dump (qwenlm.github.io)
        
       | jauntywundrkind wrote:
       | Wish I had a better way to estimate what size video card one
       | needs. The HuggingFace link says this is bfloat16, so at least
       | 64GB?
       | 
       | I guess the -7B might run on my 16GB AMD card?
        
         | wgd wrote:
         | You can run a 4-bit quantized version at a small (though
         | nonzero) cost to output quality, so you would only need 16GB
         | for that.
         | 
         | Also, it's entirely possible to run a model that doesn't fit in
         | available GPU memory; it will just be slower.
        
         | xiphias2 wrote:
         | I wish they would start producing graphs with the performance
         | of quantized versions as well. What matters is RAM/bandwidth vs
         | performance, not number of parameters.
        
         | clear_view wrote:
         | deepseek-r1:14b/mistral-small:24b/qwen2.5-coder:14b fit in 16GB
         | of VRAM with fast generation. 32b versions bleed into RAM and
         | take a serious performance hit, but are still usable.
        
         | zamadatix wrote:
         | https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calcul...
         | 
         | That will help you quickly calculate the model VRAM usage as
         | well as the VRAM usage of the context length you want to use.
         | You can put "Qwen/Qwen2.5-VL-32B-Instruct" in the "Model
         | (unquantized)" field. Funnily enough the calculator lacks the
         | option to see without quantizing the model, usually because
         | nobody worried about VRAM bothers running >8 bit quants.
        
           | azinman2 wrote:
           | Except when it comes to deepseek
        
       | simonw wrote:
       | Big day for open source Chinese model releases - DeepSeek-v3-0324
       | came out today too, an updated version of DeepSeek v3 now under
       | an MIT license (previously it was a custom DeepSeek license).
       | https://simonwillison.net/2025/Mar/24/deepseek/
        
         | chaosprint wrote:
         | it seems that this free version "may use your prompts and
         | completions to train new models"
         | 
         | https://openrouter.ai/deepseek/deepseek-chat-v3-0324:free
         | 
         | do you think this needs attention?
        
           | wgd wrote:
           | That's typical of the free options on OpenRouter; if you
           | don't want your inputs used for training, use the paid
           | one: https://openrouter.ai/deepseek/deepseek-chat-v3-0324
        
             | overfeed wrote:
             | Is OpenRouter planning on distilling models off the prompts
             | and responses from frontier models? That's smart - a little
             | gross - but smart.
        
           | huijzer wrote:
           | Since we are on HN here, I can highly recommend open-webui
           | with some OpenAI-compatible provider. I've been running with
           | Deep Infra for more than a year now and am very happy. New
           | models are usually available within one or two days after
           | release. I also have some friends who use the service almost
           | daily.
        
             | unquietwiki wrote:
             | I'm using open-webui at home with a couple of different
             | models. gemma2-9b fits in VRAM on an NV 3060 card and
             | performs nicely.
        
               | zakki wrote:
               | What is the memory of your NV3060? 8GB?
        
               | ngvjmfgb wrote:
               | 12GB (edit: that is what mine is)
        
             | l72 wrote:
             | I too run openweb-ui locally and use deepinfra.com as my
             | backend. It has been working very well, and I am quite
             | happy with deepinfra's pricing and privacy policy.
             | 
             | I have set up the same thing at work for my colleagues, and
             | they find it better than openai for their tasks.
        
           | TechDebtDevin wrote:
           | That's because it's a 3rd-party API someone is hosting and
           | trying to arb the infra cost or mine training data, or maybe
           | something even more sinister. I stay away from OpenRouter
           | APIs that aren't served by reputable, well-known companies,
           | and even then...
        
           | behnamoh wrote:
           | good grief! people are okay with it when OpenAI and Google do
           | it, but as soon as open source providers do it, people get
           | defensive about it...
        
             | chaosprint wrote:
             | No, it has nothing to do with DeepSeek. It's OpenRouter and
             | the providers there.
        
             | londons_explore wrote:
             | I trust big companies far more with my data than small
             | ones.
             | 
             | Big companies have so much data they won't be having a
             | human look at mine specifically. Some small place probably
             | has the engineer looking at my logs as user #4.
             | 
             | Also, big companies have security teams whose job is
               | securing the data, and it won't be going over some
               | unencrypted link to Cloudflare because OP was too lazy to
               | set up HTTPS certs.
        
               | henry2023 wrote:
               | Equifax.
        
         | echelon wrote:
         | Pretty soon I won't be using any American models. It'll be a
         | 100% Chinese open source stack.
         | 
         | The foundation model companies are screwed. Only shovel makers
         | (Nvidia, infra companies) and product companies are going to
         | win.
        
           | jsheard wrote:
           | I still don't get where the money for new open source models
           | is going to come from once setting investor dollars on fire
           | is no longer a viable business model. Does anyone seriously
           | expect companies to keep buying and running thousands of
           | ungodly expensive GPUs, plus whatever they spend on human
           | workers to do labelling/tuning, and then giving away the
           | spoils for free, forever?
        
             | finnjohnsen2 wrote:
             | Ads again, somehow. It's like a law of nature.
        
               | api wrote:
               | If nationalist propaganda counts as ads, that might
               | already be supporting Chinese models. Ask them about
               | Tiananmen Square.
               | 
               | Any kind of media with zero or near zero
               | copying/distribution costs becomes a deflationary race to
               | the bottom. Someone will eventually release something
               | that's free, and at that point nothing can compete with
               | free unless it's some kind of very specialized offering.
               | Then you run into the problem the OP described: how do
               | you fund free? Answer: ads. Now the customer is the
               | advertiser, not the user/consumer, which is why most
               | media converges on trash.
        
               | Imustaskforhelp wrote:
               | These ads can also have ad blockers though.
               | 
               | Perplexity released a version of DeepSeek R1 (1331? I am
               | not sure, I forgot) that basically removes the Chinese
               | censorship / yes, you can ask it about Tiananmen Square.
               | 
               | I think the next iteration of these AI model ads will be
               | sneakier, which might make them hard to remove.
               | 
               | Though it's funny you comment about Chinese censorship
               | yet American censorship is fine lol
        
               | eMPee584 wrote:
               | XAI to the rescue!!1!
               | 
               | ... (no, not the unintelligible one - the xplainable one)
        
             | ada1981 wrote:
             | Money from the Chinese defense budget?
             | 
             | Everyone using these models undercuts US companies.
             | 
             | Eventually China wins.
        
               | Imustaskforhelp wrote:
               | And we, the end users, get open source models.
               | 
               | Also, China doesn't have access to that many GPUs because
               | of the chips act.
               | 
               | And I hate it, I hate it when America sounds more
               | communist than China, which open sources its stuff
               | because of free markets.
               | 
               | I actually think that more countries, and not just
               | companies chasing profit, need to invest in AI.
               | 
               | This could be the decision that impacts the next century.
        
               | greenavocado wrote:
               | If only you knew how many terawatt hours were burned on
               | biasing models to prevent them from becoming racist
        
               | Imustaskforhelp wrote:
               | To be honest, maybe I am going off topic, but I wish the
               | energy industry had the level of innovation we see in the
               | AI industry.
               | 
               | As an outsider, it feels like very little progress is
               | being made on the energy issue. I genuinely think AI
               | could be accelerated so much more if energy were cheaper
               | / greener.
        
               | bee_rider wrote:
               | China has allowed quite a bit of market liberalism, so it
               | isn't that surprising if their AI stuff is responding to
               | the market.
               | 
               | But, I don't really see the connection on the flip side.
               | Why should proprietary AI be associated with communism?
               | If anything I guess a communist handling of AI would also
               | be to share the model.
        
             | mitthrowaway2 wrote:
             | Maybe from NVIDIA? "Commoditize your product's complement".
             | 
             | https://www.joelonsoftware.com/2002/06/12/strategy-letter-v/
        
               | akra wrote:
               | This is the reason IMO. Fundamentally China right now is
               | better at manufacturing (e.g. robotics). AI is the
               | complement to this - AI increases the demand for tech
               | manufactured goods. Whereas America is in the opposite
               | position w.r.t which side is their advantage (i.e. the
               | software). AI for China is an enabler into a potentially
               | bigger market which is robots/manufacturing/etc.
               | 
               | Commoditizing the AI/intelligence part means that the
               | main advantage isn't the bits - its the atoms. Physical
               | dexterity, social skills and manufacturing skills will
               | gain more of a comparative advantage vs intelligence work
               | in the future as a result - AI makes the old economy new
               | again in the long term. It also lowers the value of AI
               | investments, in that they can no longer command first-
               | mover/monopoly-like pricing for what is a very large
               | capex cost, undermining US investment in what is its
               | advantage. As long as it is strategic, it doesn't
               | necessarily need to be economic on its own.
        
             | theptip wrote:
             | Yeah, this is the obvious objection to the doom. Someone
             | has to pay to train the model that all the small ones
             | distill from.
             | 
             | Companies will have to detect and police distilling if they
             | want to keep their moat. Maybe you have to have an
             | enterprise agreement (and arms control waiver) to get
             | GPT-6-large API access.
        
             | Imustaskforhelp wrote:
             | I think it's market leadership, which is free word-of-
             | mouth advertising that can then lead to a consulting
             | business, or maybe they can sneak some ads into the LLM
             | directly, oh boy you don't know.
             | 
             | Also, I have seen that once an open source LLM is released
             | to the public, though you can access it on any website
             | hosting it, most people would still prefer to use it from
             | the company that created the model.
             | 
             | DeepSeek released its revenue model and it's crazy good.
             | 
             | And no, they didn't have full racks of H100s.
             | 
             | Also, one more thing: open source has always had an issue
             | with funding.
             | 
             | Also, they are not completely open source, they are just
             | open weights. Yes, you can fine tune them, but from my
             | limited knowledge there are some limitations to fine
             | tuning, so keeping the training data proprietary also
             | helps fund my previous idea of consulting for other AI.
             | 
             | It's not a hugely profitable venture, IMO it's just a
             | decently profitable venture, but the current hype around
             | AI is making it lucrative for companies.
             | 
             | Also, I think this might be a winner-takes-all market,
             | which increases competition, but in a healthy way.
             | 
             | What DeepSeek did by releasing the open source model and
             | then going out of their way to release some other open
             | source projects, which themselves could've been worth a
             | few millions (bycloud said it), helps innovate AI in
             | general.
        
             | zamadatix wrote:
             | Once setting investment dollars on fire is no longer
             | viable, it'll probably be because scaling died anyway, so
             | what's the rush to have a dozen new frontier models each
             | year?
        
             | colechristensen wrote:
             | I think the only people who will ever make money are the
             | shovel makers. The models will always be free, because
             | you'll just get open source models chasing the paid ones
             | and never being all that far behind, especially when this
             | S-curve growth phase slows down.
        
             | natch wrote:
             | Many sources, Chinese government could be one.
        
             | pants2 wrote:
             | There are lots of open-source projects that took many
             | millions of dollars to create. Kubernetes, React, Postgres,
             | Chromium, etc. etc.
             | 
             | This has clearly been part of a viable business model for a
             | long time. Why should LLM models be any different?
        
           | Imustaskforhelp wrote:
           | Yes, I believe the same, though of the Western models I only
           | believe in Grok, Gemini, or Claude.
           | 
           | Gemini isn't too special, it's actually just comparable to
           | DeepSeek / less than DeepSeek, but it is damn fast, so maybe
           | forget Gemini for serious tasks.
           | 
           | Grok / Gemini can be used as a deep research model, which I
           | think I like? Grok seems to have just taken the DeepSeek
           | approach but scaled it with their hyper-massive GPU cluster;
           | I am not sure, I think Grok can also be replaced.
           | 
           | What I truly believe in is Claude.
           | 
           | I am not sure, but Claude really feels good for coding
           | especially.
           | 
           | For anything else I might use something like DeepSeek /
           | Chinese models.
           | 
           | I used cerebras.ai and holy moly they are so fast. I used
           | the DeepSeek 70B model, and it is still incredibly fast; my
           | time matters, so I really like the open source way, which
           | lets companies like Cerebras focus on what they do best.
           | 
           | I am not sure about Nvidia though. Nvidia seems so connected
           | to Western AI that DeepSeek improvements impact Nvidia.
           | 
           | I do hope that Nvidia cheapens the price of GPUs, though I
           | don't think they have much incentive.
        
           | refulgentis wrote:
           | I've been waiting since November for 1, just 1*, model other
           | than Claude that can reliably do agentic tool call loops. As
           | long as the Chinese open models are chasing reasoning and
           | benchmark maxxing vs. mid-2024 US private models, I'm very
           | comfortable with somewhat ignoring these models.
           | 
           | (this isn't idle prognostication hinging on my personal hobby
           | horse. I got skin in the game, I'm virtually certain I have
           | the only AI client that is able to reliably do tool calls
           | with open models in an agentic setting. llama.cpp got a
           | massive contribution to make this happen and the big boys who
           | bother, like ollama, are still using a dated json-schema-
           | forcing method that doesn't comport with recent local model
           | releases that can do tool calls. IMHO we're comfortably past
           | a point where products using these models can afford to focus
           | on conversational chatbots; that's cute, but a commodity to
           | give away per standard 2010s SV thinking)
           | 
           | * OpenAI's can but are a little less...grounded?...situated?
           | i.e. it can't handle "read this file and edit it to do $X".
           | Same-ish for Gemini, though, sometimes I feel like the only
           | person in the world who actually waits for the experimental
           | models to go GA, as per letter of the law, I shouldn't deploy
           | them until then
        
             | throwawaymaths wrote:
             | Is there some reason you can't train a 1B model to just do
             | agentic stuff?
        
               | refulgentis wrote:
               | They're real squished for space, more than I expected :/
               | good illustration here, Qwen2.5-1.5B trained to reason,
               | i.e. the name it is released under is "DeepSeek R1 1.5B".
               | https://imgur.com/a/F3w5ymp 1st prompt was "What is
               | 1048576^0.05", it answered, then I said "Hi",
               | then...well...
               | 
               | Fwiw, Claude Sonnet 3.5 100% had some sort of agentic
               | loop x precise file editing trained into it. Wasn't
               | obvious to me until I added a MCP file server to my
               | client, and still isn't well-understood outside a few.
               | 
               | I'm not sure on-device models will be able to handle it
               | any time soon because it relies on just letting it read
               | the whole effing file.
               | 
               | Separately...
               | 
               | I say I don't understand why no other model is close, but
               | it makes sense. OpenAI has been focused on reasoning,
               | Mistral, I assume is GPU-starved, and Google...well, I
               | used to work there, so I have to stop myself from going
               | on and on. Let's just say I assume that there wouldn't be
               | enough Consensus Built(tm) to do something "scary" and
               | "experimental" like train that stuff in.
               | 
               | This also isn't going so hot for Sonnet IMHO.
               | 
               | There's vague displeasure and assumptions it "changed"
               | the last week, but, AFAICT the real problem is that the
               | reasoning stuff isn't as "trained in" as, say, OpenAI's.
               | 
               | This'd be a good thing except you see all kinds of whacky
               | behavior.
               | 
               | One of my simple "read file and edit" queries yesterday
               | did about 60 pages worth of thinking, and the thinking
               | contained 130+ separate tool calls that weren't actually
               | called, so it was just wandering around in the
               | wilderness, reacting to hallucinated responses it never
               | actually got.
               | 
               | Which plays into another one of my hobbyhorses, chat is a
               | "hack" on top of an LLM. Great. So is reasoning,
               | especially in the way Anthropic implemented it. At what
               | point are the abstractions too much, so much that it's
               | unreliable? 3.7 Sonnet may be answering that, because
               | when it fails, all that thinking looks like the agentic
               | loop cooked into Sonnet 3.5. So maybe it's altogether too
               | much to have chat, reasoning, and fully reliable agentic
               | loops...
        
               | anon373839 wrote:
               | The Berkeley Function Calling Leaderboard [1] might be of
               | interest to you. As of now, it looks like Hammer2.1-3b is
               | the strongest model under 7 billion parameters. Its
               | overall score is ~82% of GPT-4o's. There is also
               | Hammer2.1-1.5b at 1.5 billion parameters that is ~76% of
               | GPT-4o.
               | 
               | [1] https://gorilla.cs.berkeley.edu/leaderboard.html
        
               | refulgentis wrote:
               | Worth noting:
               | 
               | - Those'll be 1-turn scores: at multiturn, 4o is 3x as
               | good as the 3b
               | 
               | - BFCL is generally about turning natural language into
               | an API call; multiturn then involves making another API
               | call.
               | 
               | - I hope to inspire work towards an open model that can
               | eat the paid models sooner rather than later
               | 
               | - such a model would be trained quite specifically on an
               | agent loop with tools read_files and edit_file (you'll
               | also probably want at least read_directory and
               | get_shared_directories; search_filenames and
               | search_files_text are good too), bonus points for
               | cli_command
               | 
               | - IMHO, this is much lower-hanging fruit than e.g.
               | training an open computer-vision model, so I beseech
               | thee, intrepid ML-understander, to fill this gap and hear
               | your name resound throughout the age
        
         | ilrwbwrkhv wrote:
         | Both of them are better than any American models. Both for
         | reasoning, agentic use, fine-tuning, etc.
        
         | th0ma5 wrote:
         | Why do you keep promoting your blog on every LLM post?
        
           | rudedogg wrote:
           | I think they didn't want to rewrite their post. It's more
           | substantial and researched than any comment here, and all
           | their posts are full of information. I think they should get
           | a pass, and calling it self-promotion is a stretch.
        
           | simonw wrote:
           | Because I want people to read it. I only promote it if I
           | think it's useful and relevant.
        
       | simonw wrote:
       | 32B is one of my favourite model sizes at this point - large
       | enough to be extremely capable (generally equivalent to GPT-4
       | March 2023 level performance, which is when LLMs first got really
       | useful) but small enough you can run them on a single GPU or a
       | reasonably well specced Mac laptop (32GB or more).
        
         | clear_view wrote:
         | 32B models don't fully fit in 16GB of VRAM. Still fine for
         | higher quality answers; worth the extra wait in some cases.
        
           | abraxas wrote:
           | Would a 40GB A6000 fully accommodate a 32B model? I assume an
           | fp16 quantization is still necessary?
        
             | manmal wrote:
             | At FP16 you'd need 64GB just for the weights, and it'd be
             | 2x as slow as a Q8 version, likely with little improvement.
             | You'll also need space for attention and context etc, so
             | 80-100GB (or even more) VRAM would be better.
             | 
             | Many people ,,just" use 4x consumer GPUs like the 3090
             | (24GB each) which scales well. They'd probably buy a mining
             | rig, EPYC CPU, Mainboard with sufficient PCIe lanes, PCIe
             | risers, 1600W PSU (might need to limit the GPUs to 300W),
             | and 128GB RAM. Depending what you pay for the GPUs that'll
             | be 3.5-4.5k
        
               | postalrat wrote:
               | I haven't found a good case/risers/etc I really like.
                | Most of the miner stuff wasn't made for PCIe x16.
        
               | manmal wrote:
               | Is that a problem? According to this, the GPUs don't
               | communicate that much once the weights are loaded:
               | https://github.com/turboderp/exllama/discussions/16#discussi...
               | 
               | > So at FP16 precision that's a grand total of 16 kB
               | you're transmitting over the PCIe bus, once per token. If
               | you multiply by, say, 20 tokens per second, then you're
               | still only using like 0.1% of your PCIe bandwidth.
               | 
               | Intra-GPU memory bandwidth is very important, but I've
               | seen lots of people use just an x4 link and they didn't
               | complain much.
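               | 
               | For illustration, a rough back-of-the-envelope sketch of
               | that claim in Python (the hidden size of 8192 and the x4
               | link speed are assumed ballpark figures, not values taken
               | from the Qwen config):
               | 
               |     # rough inter-GPU traffic, pipeline-split inference
               |     hidden_size = 8192      # assumed model dimension
               |     bytes_per_act = 2       # fp16 activations
               |     tok_per_s = 20
               |     pcie4_x4_bytes_s = 8e9  # ~8 GB/s, PCIe 4.0 x4 link
               | 
               |     per_token = hidden_size * bytes_per_act   # ~16 kB
               |     frac = per_token * tok_per_s / pcie4_x4_bytes_s
               |     print(f"{per_token / 1024:.0f} kB/token, "
               |           f"{frac:.6%} of the link bandwidth")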
        
               | abraxas wrote:
               | would it be better for energy efficiency and overall
               | performance to use workstation cards like A5000 or A4000?
               | Those can be found on eBay.
        
               | manmal wrote:
               | Looks like the A4000 has low memory bandwidth (50% of a
               | 4090?) which is the limiting factor for inference
               | usually. But they are efficient - if you can get them for
               | cheap, probably a good entry setup? If you like running
               | models that need a lot of VRAM, you'll likely run out of
               | PCIe slots before you are done upgrading.
        
             | elorant wrote:
             | You don't need 16-bit quantization. The difference in
             | accuracy from 8-bit in most models is less than 5%.
        
         | redrove wrote:
         | Or quantized on a 4090!
        
         | YetAnotherNick wrote:
          | I don't think these models are GPT-4 level. Yes, they seem to
          | be on benchmarks, but it is well known that models
          | increasingly use A/B testing in dataset curation and
          | synthesis (using GPT-4 level models) to optimize not just the
          | benchmarks but things which could be benchmarked, like
          | academics.
        
           | simonw wrote:
           | I'm not talking about GPT-4o here - every benchmark I've seen
           | has had the new models from the past ~12 months out-perform
           | the March 2023 GPT-4 model.
           | 
           | To pick just the most popular one,
           | https://lmarena.ai/?leaderboard= has GPT-4-0314 ranked 83rd
           | now.
        
             | th0ma5 wrote:
             | How have you been able to tie benchmark results to better
             | results?
        
               | simonw wrote:
               | Vibes and intuition. Not much more than that.
        
         | wetwater wrote:
         | I've only recently started looking into running these models
         | locally on my system. I have limited knowledge regarding LLMs
         | and even more limited when it comes to building my own PC.
         | 
          | Are there any good sources I can read up on for estimating
          | the hardware specs required for 7B, 13B, 32B, etc. sizes if I
          | need to run them locally?
        
           | TechDebtDevin wrote:
           | VRAM Required = Number of Parameters (in billions) x Number
           | of Bytes per Parameter x Overhead[0].
           | 
            | [0]: https://twm.me/posts/calculate-vram-requirements-local-llms/
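            | 
            | For illustration, a minimal Python sketch of that formula
            | (the 1.2x overhead factor is just an assumed ballpark, not
            | a number from the linked post):
            | 
            |     # VRAM ~= params (B) x bytes per param x overhead
            |     def vram_gb(params_b, bits_per_param, overhead=1.2):
            |         return params_b * (bits_per_param / 8) * overhead
            | 
            |     for bits in (16, 8, 4):
            |         print(f"32B @ {bits}-bit: "
            |               f"~{vram_gb(32, bits):.0f} GB")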
        
             | wetwater wrote:
              | That's neat! Thanks.
        
             | manmal wrote:
             | Don't forget to add a lot of extra space if you want a
             | usable context size.
        
               | TechDebtDevin wrote:
                | Wouldn't that be your overhead var?
        
         | faizshah wrote:
          | I just started self hosting as well on my local machine; been
          | using https://lmstudio.ai/ locally for now.
          | 
          | I think the 32b models are actually good enough that I might
          | stop paying for ChatGPT Plus and Claude.
          | 
          | I get around 20 tok/second on my M3, and I can get 100
          | tok/second on smaller or quantized models. 80-100 tok/second
          | is the best for interactive usage; if you go above that you
          | basically can't read as fast as it generates.
          | 
          | I also really like the QwQ reasoning model. I haven't gotten
          | around to trying locally hosted models for agents and RAG;
          | coding agents especially are what I'm interested in. I feel
          | like 20 tok/second is fine if it's just running in the
          | background.
          | 
          | Anyways, would love to know others' experiences; that was
          | mine this weekend. The way it's going I really don't see a
          | point in paying. I think on-device is the near future, and
          | they should just charge a licensing fee, like DB providers
          | do, for enterprise support and updates.
          | 
          | If you were paying $20/mo for ChatGPT 1 year ago, the 32b
          | models are basically at that level, slightly slower and
          | slightly lower quality, but useful enough to consider
          | cancelling your subscriptions at this point.
        
           | wetwater wrote:
            | Are there any good sources I can read up on for estimating
            | the hardware specs required for 7B, 13B, 32B, etc. sizes if
            | I need to run them locally? I am a grad student on a
            | budget, but I want to host one locally and am trying to
            | build a PC that could run one of these models.
        
             | disgruntledphd2 wrote:
              | A MacBook with 64GB of RAM will probably be the easiest.
              | As a bonus, you can train PyTorch models on the built-in
              | GPU.
             | 
             | It's really frustrating that I can't just write off Apple
             | as evil monopolists when they put out hardware like this.
        
             | coder543 wrote:
             | "B" just means "billion". A 7B model has 7 billion
             | parameters. _Most_ models are trained in fp16, so each
             | parameter takes two bytes at full precision. Therefore, 7B
             | = 14GB of memory. You can easily quantize models to 8 bits
             | per parameter with very little quality loss, so then 7B =
             | 7GB of memory. With more quality loss (making the model
             | dumber), you can quantize to 4 bits per parameter, so 7B =
             | 3.5GB of memory. There are ways to quantize at other levels
             | too, anywhere from under 2 bits per parameter up to 6 bits
             | per parameter are common.
             | 
             | There is additional memory used for context / KV cache. So,
             | if you use a large context window for a model, you will
             | need to factor in several additional gigabytes for that,
             | but it is much harder to provide a rule of thumb for that
             | overhead. Most of the time, the overhead is significantly
             | less than the size of the model, so not 2x or anything.
             | (The size of the context window is related to the amount of
             | text/images that you can have in a conversation before the
             | LLM begins forgetting the earlier parts of the
             | conversation.)
             | 
             | The most important thing for local LLM performance is
             | typically memory bandwidth. This is why GPUs are so much
             | faster for LLM inference than CPUs, since GPU VRAM is many
             | times the speed of CPU RAM. Apple Silicon offers rather
             | decent memory bandwidth, which makes the performance fit
             | somewhere between a typical Intel/AMD CPU and a typical
             | GPU. Apple Silicon is definitely _not_ as fast as a
             | discrete GPU with the same amount of VRAM.
             | 
             | That's about all you need to know to get started. There are
             | obviously nuances and exceptions that apply in certain
             | situations.
             | 
             | A 32B model at 5 bits per parameter will comfortably fit
             | onto a 24GB GPU and provide decent speed, as long as the
             | context window isn't set to a huge value.
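              | 
              | To make the arithmetic concrete, a tiny Python sketch of
              | the weights-only footprint (context/KV cache not
              | included; exact sizes also vary a bit by quantization
              | format):
              | 
              |     params = 32e9                     # 32B parameters
              |     for bits in (16, 8, 5, 4):
              |         gb = params * bits / 8 / 1e9  # bytes -> GB
              |         print(f"{bits}-bit: ~{gb:.0f} GB of weights")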
        
               | epolanski wrote:
                | So, in essence, all AMD needs to do to launch a
                | successful GPU in the inference space is load it with
                | RAM?
        
             | faizshah wrote:
              | Go to r/LocalLLaMA; they have the most info. There are
              | also lots of good YouTube channels that have done
              | benchmarks on Mac minis for this (another good-value
              | option with the student discount).
             | 
             | Since you're a student most of the providers/clouds offer
             | student credits and you can also get loads of credits from
             | hackathons.
        
             | p_l wrote:
             | Generally, unquantized - double the number and that's the
             | amount of VRAM in GB you need + some extra, because most
             | models use fp16 weights so it's 2 bytes per parameter ->
             | 32B parameters = 64GB
             | 
              | Typical quantization to 4-bit will cut a 32B model down
              | to 16GB of weights plus some runtime data, which makes it
              | possibly usable (if slow) on a 16GB GPU. You can
              | sometimes viably use smaller quantizations, which will
              | reduce memory use even more.
        
             | randomNumber7 wrote:
             | Yes. You multiply the number of parameters with the number
             | of bytes per parameter and compare it with the amount of
             | GPU memory (or CPU RAM) you have.
        
           | pixelHD wrote:
           | what spec is your local mac?
        
       | gatienboquet wrote:
       | So today is Qwen. Tomorrow a new SOTA model from Google
       | apparently, R2 next week.
       | 
       | We haven't hit the wall yet.
        
         | zamadatix wrote:
         | Qwen 3 is coming imminently as well
         | https://github.com/huggingface/transformers/pull/36878 and it
         | feels like Llama 4 should be coming in the next month or so.
         | 
          | That said, none of the recent string of releases has done much
          | yet to "smash a wall"; they've just met the larger proprietary
         | models where they already were. I'm hoping R2 or the like
         | really changes that by showing ChatGPT 3->3.5 or 3.5->4 level
         | generational jumps are still possible beyond the current state
         | of the art, not just beyond current models of a given size.
        
         | tomdekan wrote:
         | Any more info on the new Google model?
        
         | behnamoh wrote:
         | Google's announcements are mostly vaporware anyway. Btw, where
         | is Gemini Ultra _1_? how about Gemini Ultra 2?
        
           | karmasimida wrote:
            | It is already on the LLM arena, right? Codename Nebula? But
            | you are right, they can fuck up their releases royally.
        
         | OsrsNeedsf2P wrote:
         | > We haven't hit the wall yet.
         | 
         | The models are iterative improvements, but I haven't seen night
         | and day differences since GPT3 and 3.5
        
           | anon373839 wrote:
           | Yeah. Scaling up pretraining and huge models appears to be
           | done. But I think we're still advancing the frontier in the
           | other direction -- i.e., how much capability and knowledge
           | can we cram into smaller and smaller models?
        
         | nwienert wrote:
         | We've slid into the upper S curve though.
        
       | Arcuru wrote:
       | Does anyone know how making the models multimodal impacts their
       | text capabilities? The article is claiming this achieves good
       | performance on pure text as well, but I'm curious if there is any
       | analysis on how much impact it usually has.
       | 
       | I've seen some people claim it should make the models better at
       | text, but I find that a little difficult to believe without data.
        
       | lysace wrote:
       | To clarify: Qwen is made by Alibaba Cloud.
       | 
       | (It's not mentioned anywhere in the blog post.)
        
       | nodesocket wrote:
        | What is the recommended model for RAG over PDF text documents?
        | I've seen some recommendations for Mistral:7b. Looking to run
        | on a pedestrian consumer home PC (ollama) with an Nvidia 4060
        | Ti and a Ryzen 5700x.
        
         | unquietwiki wrote:
         | Apparently there are two versions of the 4060Ti, with 8GB and
         | 16GB of VRAM respectively. I've got an 8GB 3060 that runs
         | gemma2:9b nicely, and that will parse PDF files; gemma3:4b also
         | seems to analyze PDFs decently.
        
           | nodesocket wrote:
           | I also have the 8GB 4060ti variant. Want to upgrade to a 4070
           | super, but prices on them are still ridiculous. Could be had
            | for $599 a handful of months ago; now on eBay they're going
            | for $750 plus.
           | 
           | Thanks for the recommendations. I'll give gemma3:12b a try
              | and if needed go down to gemma3:4b.
        
             | manmal wrote:
             | May I ask why you don't get a used 3090 with 24GB VRAM?
        
       | hmottestad wrote:
        | Just don't ask it about the Tiananmen Square massacre or you'll
        | get a security warning, even if you rephrase it.
       | 
       | It'll happily talk about Bloody Sunday.
       | 
       | Probably a great model, but it worries me that it has such
       | restrictions.
       | 
        | Sure, OpenAI also has lots of restrictions, but this feels more
        | like straight-up censorship, since it'll happily go on about bad
        | things the governments of the West have done.
        
         | generalizations wrote:
         | Nah, it's great for things that Western models are censored on.
         | The True Hacker will keep an Eastern and Western model
         | available, depending on what they need information on.
        
           | hmottestad wrote:
           | I tried to ask it about Java exploits that would allow me to
           | gain RCE, but it refused just as most western models do.
           | 
           | That was the only thing I could think to ask really. Do you
           | have a better example maybe?
        
           | theturtletalks wrote:
           | Wouldn't they just run R1 locally and not have any censorship
            | at all? The model isn't censored at its core; it's censored
            | through the system prompt. Perplexity and Hugging Face have
            | their own versions of R1 that are not censored.
        
             | hmottestad wrote:
             | I tried R1 through Kagi and it's similarly censored. Even
             | the distill of llama running on Groq is censored.
        
               | theturtletalks wrote:
               | Kagi may be using the official DeepSeek API and not
               | hosting the model itself. There is work being done to
               | make it completely uncensored:
               | 
               | https://github.com/huggingface/open-r1
               | 
               | https://ollama.com/huihui_ai/deepseek-r1-abliterated
               | 
               | I was mistaken though, it is more than just a system
               | prompt causing the censorship.
        
           | rsoto2 wrote:
           | a lot of the safety around models seems to be implemented in
           | the browser. Underneath the models seem pretty easy to
           | fool/jailbreak.
        
         | jampa wrote:
         | DeepSeek's website seems to be using two models. The one that
         | censors only does so in the online version. Are you saying that
         | censoring happens with this model, even in the offline version?
        
           | hmottestad wrote:
           | I tried the R1 distill of llama 8B, which did refuse direct
           | questions about the massacre.
           | 
           | Haven't tried this new model locally, but I agree with you
           | that it looks like there is a secondary censorship going on.
           | If I ask it to list the 10 worst catastrophes of recent
           | Chinese history with Thinking enabled then it'll actually
           | think about the massacre. Gets blocked very quickly, but it
           | doesn't look like the thinking is particularly censored.
        
         | asadm wrote:
          | a) Nobody, in production, asks those questions. b) ChatGPT is
          | similarly biased on the Israel/Palestine issue. Try making it
          | agree that there is a genocide ongoing or that Palestinians
          | have a right to defend themselves.
        
         | BoorishBears wrote:
          | Daily reminder that all commercial LLMs are going to align
          | with the governments their corporations exist under.
         | 
         | https://imgur.com/a/censorship-much-CBxXOgt
         | 
         | It's not even nefarious: they don't want the model spewing out
         | content that will get them in trouble in the most general
         | sense. It just so happens most governments have things that
         | will get you in trouble.
         | 
         | The US is very obsessed with voter manipulation these days, so
         | OpenAI and Anthropic's models are extra sensitive if the
         | wording implies they're being used for that.
         | 
         | China doesn't like talking about past or ongoing human rights
         | violations, so their models will be extra sensitive about that.
        
       | 101008 wrote:
        | Silly question: how can OpenAI, Claude and all have valuations
        | so large considering all the open source models? Not saying they
        | (the closed models) will disappear or be tiny, but why so, so
        | valuable?
        
         | barbarr wrote:
         | It's user base and brand. Just like with Pepsi and Coca Cola.
         | There's a reason OpenAI ran a Super Bowl ad.
        
           | 101008 wrote:
            | Yeah, but with cheaper alternatives (and open source and
            | local ones) it would be super easy for most customers to
            | migrate to a different provider. I am not saying they don't
            | provide any value, but it's like paid software vs an open
            | source alternative. The open source alternative ends up
            | prevailing, especially among tech people.
        
           | TechDebtDevin wrote:
           | Most "normies" I know only recognize ChatGPT with AI, so for
           | sure, brand recognition is the only thing that matters.
        
         | seydor wrote:
         | People cannot normally invest in their competitors.
         | 
          | It's not unlikely that Chinese products may be banned /
          | tariffed.
        
           | FreakyT wrote:
            | There are non-Chinese open LLMs (Mistral, Llama, etc.), so I
            | don't think that explains it.
        
         | neither_color wrote:
         | ChatGPT's o1 is still really good and the free options are not
         | compelling enough to switch if you've been using it for a
         | while. They've positioned themselves to be a good mainstream
         | default.
        
         | rafaelmn wrote:
          | Because what would seem like a tiny difference in those
          | benchmark graphs is the difference between worth paying for
          | and a complete waste of time in practice.
        
         | elorant wrote:
         | The average user won't self-host a model.
        
           | epolanski wrote:
           | ...yet
        
       | admiralrohan wrote:
       | Any update on the privacy concerns?
        
       | ggregoire wrote:
        | We were using Llama 3.2 Vision a few months back and were very
        | frustrated with it (both in terms of speed and result quality).
        | One day we were looking for alternatives on Hugging Face and
       | eventually stumbled upon Qwen. The difference in accuracy and
       | speed absolutely blew our mind. We ask it to find something in an
       | image and we get a response in like half a second with a 4090 and
       | it's most of the time correct. What's even more mind blowing is
       | that when we ask it to extract any entity name from the image,
       | and the entity name is truncated, it gives us the complete name
       | without even having to ask for it (e.g. "Coca-C" is barely
       | visible in the background, it will return "Coca-Cola" on its
       | own). And it does it with entities not as well known as Coca-
       | Cola, and with entities only known in some very specific regions
       | too. Haven't looked back to Llama or any other vision models
       | since we tried Qwen.
        
       | tomcam wrote:
       | I wish I were smarter and lighter
       | 
       | tbh I'd settle for just lighter
        
       | wg0 wrote:
        | For now, the motivation is to win the AI race. Not sure how, or
        | who, would keep these models updated in the future.
       | 
       | I mean all of them.
        
       | i_love_retros wrote:
       | Any security risks running these Chinese LLMs on my local
       | computer?
        
       | simonw wrote:
       | This model is available for MLX now, in various different sizes.
       | 
        | I'm trying to run
        | https://huggingface.co/mlx-community/Qwen2.5-VL-32B-Instruct...
        | like this:
        | 
        |     uv run --with 'numpy<2' --with mlx-vlm \
        |       python -m mlx_vlm.generate \
        |       --model mlx-community/Qwen2.5-VL-32B-Instruct-4bit \
        |       --max-tokens 1000 \
        |       --temperature 0.0 \
        |       --prompt "Describe this image." \
        |       --image Mpaboundrycdfw-1.png
       | 
       | That downloaded an ~18GB model and gave me a VERY impressive
       | result, shown at the bottom here:
       | https://simonwillison.net/2025/Mar/24/qwen25-vl-32b/
        
       | michaelt wrote:
       | Has anyone successfully run a quantized version of any of the
       | Qwen2.5-VL series of models?
       | 
       | I've run the smallest model in non-quantized format, but when
        | I've tried to run an AWQ version of one of the bigger models I've
       | struggled to find a combination of libraries that works right -
       | even though it _should_ fit on my GPU.
        
       | slics wrote:
        | Warning: this is a lengthy comment. I deployed the model locally
        | via Ollama and its reasoning is not accurate.
        | 
        | This is lengthy, but I thought I'd share its reasoning. At some
        | point it just stopped working, then it started responding in
        | Chinese.
       | 
       | Query input:
       | 
       | "1+5 = 7, and that is the correct answer"
       | 
       | Reasoning response:
       | 
       | The equation "1 + 5 = 7" isn't about regular arithmetic addition.
       | Instead, it refers to the number of segments used to display the
       | digits 1 and 5 on a seven-segment display (like on digital clocks
       | or calculators). Here's the breakdown: * Digit 1 uses 2 segments
       | (the vertical lines on the right). * Digit 5 uses 5 segments
       | (top, upper-left, middle, lower-left, and bottom segments).
       | Adding these segments: 2 (for 1) + 5 (for 5) = 7 segments total.
        | This is a clever visual riddle, not a math problem!
        | 
        | Second query input:
        | 
        | "it was a math problem, you are just wrong"
       | 
       | Okay, the user is insisting that "1 + 5 = 7 is a math problem"
       | and that my previous explanation was wrong. Hmm, I need to
       | address this carefully. First, in standard arithmetic, 1 + 5 is
       | indeed 6. So if it's a straightforward math problem, the correct
       | answer should be 6. The user might be mistaken, or perhaps they
       | have a different context in mind. Wait, in my previous response,
       | I considered the seven-segment display explanation, which is a
       | common riddle. The user is now rejecting that and stating it's a
       | math problem, implying that their answer of 7 is correct in some
       | mathematical context. Maybe they're referring to a different
       | base? Let me revisit the base idea. If the equation is in a
       | different base, maybe base 9? Let's see: In base b, the equation
       | "1 + 5 = 11" would be 1 + 5 = 6 in decimal, which isn't helpful.
       | Wait, maybe the result is written in a different base? For
       | example, if we compute 1 + 5 in decimal (which is 6) and then
       | represent that in base 7? 6 in base 7 is still "6", not 7. Hmm,
       | not helpful. Alternatively, maybe modulo arithmetic? Like (1 + 5)
       | mod something equals 7? Let's see: 1 +5 = 6. To get 7, maybe
       | modulo something where 6 [?]7 mod (negative number?), which
       | doesn't make sense. Alternatively, maybe the user is adding in a
       | cyclic group? Like mod 6? 6 mod 6 is 0, not 7. Alternatively,
       | could it be a different operation? Like 1 plus 5 in some
       | operation other than addition? Maybe concatenation? 1
       | concatenated with 5 is 15, but that's not 7. Wait, the user said
       | "another one 1+5 =7". Maybe they're referring to a previous
       | problem where the answer was different, and this is a follow-up?
       | But I don't have context on previous interactions. Alternatively,
       | perhaps the user is joking or trolling, but I should take their
       | statement at face value. Since they insist it's a math problem
       | and that 7 is correct, I need to explore if there's any
       | mathematical framework where this holds. Wait, let's consider if
       | it's a sum in a different base where the result is written in
       | that base. Suppose the equation is in base 8. Let's see: In base
       | 8, "7" is still 7 in decimal, so 1 (base8) +5(base8) = 6 (base8),
       | which is 6 in decimal. Not 7. Alternatively, if the answer is
       | written in base 9: 1 +5 =6 in decimal, which is still 6 in base9.
       | Still not 7. Wait a different angle: Maybe the '+' is not
       | addition, but some other operation. Like bitwise XOR? 1 in binary
       | is 01, 5 is 101. XOR gives 110 (6). Not 7. Or maybe adding digits
       | of the sum? Like 1+5=6, then 6 is a single digit, so adding
       | digits 6, but that's not helpful. Wait, the user might have made
       | a mistake, but they're insistent. Since they called it a math
       | problem, perhaps they're using a different definition of "+"?
       | Like 1 +5 in some algebraic structure where the result is 7?
       | That's possible, but without more context, it's hard to define.
       | Alternatively, maybe the numbers are not in base 10. Suppose the
       | numbers are in base 12. Wait, 1+5 is still 6. Hmm. Wait, maybe
       | the user is considering the numbers as Roman numerals? I+V=VI
       | which is 6. Still not 7. Alternatively, maybe it's a sequence
       | where 1+5 is the 7th term? Unlikely. Alternatively, it could be a
       | trick with letters. Like 1 is the first letter (A), 5 is the
       | fifth letter (E), so AE... but that doesn't lead to 7.
       | Alternatively, adding their positions: 1 +5 =6, which is F. Still
       | not 7. Alternatively, maybe counting letters in spelled-out
       | words? "One" has 3 letters, "five" has 4, total 7? That could be
       | a different riddle. Let's see: 1 (one) +5 (five) = 3 letters +4
       | letters =7. That's a possible interpretation! If the problem is
       | "1+5" where each number is spelled out in English and you add the
       | number of letters, then 3+4=7. So maybe that's the case here?
       | Alternatively, if it's counting the strokes to write the numbers.
       | For example, writing '1' takes one stroke, '5' takes two strokes
       | (depending on how you write it)? Not sure. Alternatively, the
       | user might have meant 1+5 in a non-standard math problem, like in
       | a clock arithmetic? For example, mod 12. 1+5=6 mod12 is still 6.
       | Hmm, I need to consider that maybe the user intended a different
       | approach that I haven't thought of. Since they are insistent that
       | 7 is correct, perhaps they meant a different operation or
       | context. Alternatively, maybe the problem is written vertically:
       | 1 + 5 ---- 7 But that still doesn't make sense unless there's a
       | carryover. But 1+5 is 6, so even with carryover, you can't get 7.
       | Unless adding digits with some twist. Wait, maybe they're adding
       | the numbers' positions in a different sequence? Like 1st prime
       | plus 5th prime? The first prime is 2, fifth prime is 11. 2+11=13,
       | nope. Alternatively, if you consider 1 and 5 as bits in a binary
       | number, like 1 and 5 (binary 101) concatenated? Not helpful.
       | Alternatively, the problem might be in a different numeral
       | system, like in base 1? But that's not standard. Alternatively,
       | maybe it's a simple mistake, but the user is confident. Since I
       | need to respond, perhaps I should outline the possible
       | interpretations where 7 could be the answer, like the spelled-out
       | letters interpretation. Alternatively, the user could have a typo
       | and meant 2+5=7, but wrote 1+5=7. Alternatively, in some contexts
       | like in programming, maybe 1+5=7 due to a specific coding (like 1
       | is assigned to a=1, so a+f=6? Not sure. Alternatively, it's a
       | lateral thinking puzzle where the answer is 7 because of some
       | visual or language trick. Wait, another angle: If you add 1 and 5
       | on a calculator upside down, but that might not help.
       | Alternatively, in temperature scales? For example, converting
       | from Fahrenheit to Celsius or vice versa? 1degF to Celsius is
       | negative, 5degC is 41degF. Not adding to 7. Hmm. Given the user's
       | assertion that 1+5=7 is correct in a math problem, I might need
       | to explain that typically 1+5=6, but there are alternative
       | contexts where it might be considered 7, like counting letters in
       | their names (1 is "one" (3 letters), 5 is "five" (4 letters),
       | 3+4=7). That's a possible interpretation. Alternatively, in some
       | base where the addition is represented differently.
       | Alternatively, a trick question where the numbers are not in base
       | 10 but in another base where the representation is different.
       | Alternatively, the user might have seen a problem where adding 1
       | and 5 gives 7 through some function, like a + b = a + b +1, but
       | that's arbitrary.
        
       ___________________________________________________________________
       (page generated 2025-03-24 23:00 UTC)