[HN Gopher] Qwen2.5-VL-32B: Smarter and Lighter
___________________________________________________________________
Qwen2.5-VL-32B: Smarter and Lighter
Author : tosh
Score : 258 points
Date : 2025-03-24 18:35 UTC (4 hours ago)
(HTM) web link (qwenlm.github.io)
(TXT) w3m dump (qwenlm.github.io)
| jauntywundrkind wrote:
| Wish I knew better how to estimate what sized video card one
| needs. HuggingFace link says this is bfloat16, so at least 64GB?
|
| I guess the -7B might run on my 16GB AMD card?
| wgd wrote:
  | You can run a 4-bit quantized version at a small (though
  | nonzero) cost to output quality, so you would only need
  | about 16GB for that.
|
| Also it's entirely possible to run a model that doesn't fit in
| available GPU memory, it will just be slower.
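  |
  | Rough back-of-envelope for the weights alone (just a
  | sketch; it ignores KV cache and runtime overhead):
  |
  |     params = 32e9          # 32B parameters
  |     bits = 4               # 4-bit quant
  |     print(params * bits / 8 / 1e9)  # ~16 GB of weights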
| xiphias2 wrote:
| I wish they would start producing graphs with quantized version
| performances as well. What matters is RAM/bandwidth vs
| performance, not number of parameters.
| clear_view wrote:
  | deepseek-r1:14b/mistral-small:24b/qwen2.5-coder:14b fit in
  | 16GB of VRAM with fast generation. 32b versions bleed into
  | RAM and take a serious performance hit, but are still
  | usable.
| zamadatix wrote:
| https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calcul...
|
| That will help you quickly calculate the model VRAM usage as
| well as the VRAM usage of the context length you want to use.
| You can put "Qwen/Qwen2.5-VL-32B-Instruct" in the "Model
| (unquantized)" field. Funnily enough the calculator lacks the
| option to see without quantizing the model, usually because
| nobody worried about VRAM bothers running >8 bit quants.
| azinman2 wrote:
| Except when it comes to deepseek
| simonw wrote:
| Big day for open source Chinese model releases - DeepSeek-v3-0324
| came out today too, an updated version of DeepSeek v3 now under
| an MIT license (previously it was a custom DeepSeek license).
| https://simonwillison.net/2025/Mar/24/deepseek/
| chaosprint wrote:
| it seems that this free version "may use your prompts and
| completions to train new models"
|
| https://openrouter.ai/deepseek/deepseek-chat-v3-0324:free
|
| do you think this needs attention?
| wgd wrote:
| That's typical of the free options on OpenRouter, if you
| don't want your inputs used for training you use the paid
| one: https://openrouter.ai/deepseek/deepseek-chat-v3-0324
| overfeed wrote:
| Is OpenRouter planning on distilling models off the prompts
| and responses from frontier models? That's smart - a little
| gross - but smart.
| huijzer wrote:
    | Since we are on HN here, I can highly recommend open-webui
    | with some OpenAI-compatible provider. I've been running with
    | Deep Infra for more than a year now and am very happy. New
    | models are usually available within one or two days after
    | release. I also have some friends who use the service almost
    | daily.
| unquietwiki wrote:
| I'm using open-webui at home with a couple of different
| models. gemma2-9b fits in VRAM on a NV 3060 card + performs
| nicely.
| zakki wrote:
| What is the memory of your NV3060? 8GB?
| ngvjmfgb wrote:
| 12GB (edit: that is what mine is)
| l72 wrote:
| I too run openweb-ui locally and use deepinfra.com as my
| backend. It has been working very well, and I am quite
| happy with deepinfra's pricing and privacy policy.
|
| I have set up the same thing at work for my colleagues, and
| they find it better than openai for their tasks.
| TechDebtDevin wrote:
    | That's because it's a 3rd-party API someone is hosting,
    | trying to arb the infra cost or mine training data, or maybe
    | something even more sinister. I stay away from OpenRouter
    | APIs that aren't served by reputable, well-known companies,
    | and even then...
| behnamoh wrote:
| good grief! people are okay with it when OpenAI and Google do
| it, but as soon as open source providers do it, people get
| defensive about it...
| chaosprint wrote:
      | no, it has nothing to do with deepseek. it's openrouter
      | and the providers there
| londons_explore wrote:
| I trust big companies far more with my data than small
| ones.
|
| Big companies have so much data they won't be having a
| human look at mine specifically. Some small place probably
| has the engineer looking at my logs as user #4.
|
        | Also, big companies have security teams whose job is
        | securing the data, and it won't be going over some
        | unencrypted link to Cloudflare because OP was too lazy
        | to set up HTTPS certs.
| henry2023 wrote:
| Equifax.
| echelon wrote:
| Pretty soon I won't be using any American models. It'll be a
| 100% Chinese open source stack.
|
| The foundation model companies are screwed. Only shovel makers
| (Nvidia, infra companies) and product companies are going to
| win.
| jsheard wrote:
| I still don't get where the money for new open source models
| is going to come from once setting investor dollars on fire
| is no longer a viable business model. Does anyone seriously
| expect companies to keep buying and running thousands of
| ungodly expensive GPUs, plus whatever they spend on human
| workers to do labelling/tuning, and then giving away the
| spoils for free, forever?
| finnjohnsen2 wrote:
      | ads again. somehow. it's like a law of nature.
| api wrote:
| If nationalist propaganda counts as ads, that might
| already be supporting Chinese models. Ask them about
| Tiananmen Square.
|
| Any kind of media with zero or near zero
| copying/distribution costs becomes a deflationary race to
| the bottom. Someone will eventually release something
| that's free, and at that point nothing can compete with
| free unless it's some kind of very specialized offering.
        | Then you run into the problem the OP described: how do
| you fund free? Answer: ads. Now the customer is the
| advertiser, not the user/consumer, which is why most
| media converges on trash.
| Imustaskforhelp wrote:
          | These ads can also be blocked by ad blockers
          | though.
          |
          | Perplexity released DeepSeek R1 1776 (I think
          | that's the name). It basically removes the Chinese
          | censorship / yes you can ask it about Tiananmen
          | Square.
          |
          | I think the next iteration of these ai model ads
          | would be sneakier, which might make them hard to
          | remove.
          |
          | Though it's funny you comment about chinese
          | censorship yet american censorship is fine lol
| eMPee584 wrote:
| XAI to the rescue!!1!
|
| ... (no, not the unintelligible one - the xplainable one)
| ada1981 wrote:
| Money from the Chinese defense budget?
|
| Everyone using these models undercuts US companies.
|
| Eventually China wins.
| Imustaskforhelp wrote:
        | And we, the end users, get open source models.
        |
        | Also china doesn't have access to that many gpus
        | because of the chips act.
        |
        | And I hate it, I hate it when america sounds more
        | communist than china, who open sources their stuff
        | because of free markets.
        |
        | I actually think that more countries need to invest
        | in AI, not just companies wanting profit.
        |
        | This could be the decision that impacts the next
        | century.
| greenavocado wrote:
| If only you knew how many terawatt hours were burned on
| biasing models to prevent them from becoming racist
| Imustaskforhelp wrote:
            | To be honest, maybe I am going off topic, but
            | I wish the energy industry had the level of
            | innovation we see in the AI industry.
            |
            | As an outsider, it feels like very little
            | progress is being made on the energy issue. I
            | genuinely think AI could be accelerated so
            | much more if energy were cheaper / greener.
| bee_rider wrote:
| China has allowed quite a bit of market liberalism, so it
| isn't that surprising if their AI stuff is responding to
| the market.
|
| But, I don't really see the connection on the flip side.
| Why should proprietary AI be associated with communism?
| If anything I guess a communist handling of AI would also
| be to share the model.
| mitthrowaway2 wrote:
| Maybe from NVIDIA? "Commoditize your product's complement".
|
      | https://www.joelonsoftware.com/2002/06/12/strategy-letter-v/
| akra wrote:
| This is the reason IMO. Fundamentally China right now is
| better at manufacturing (e.g. robotics). AI is the
| complement to this - AI increases the demand for tech
| manufactured goods. Whereas America is in the opposite
| position w.r.t which side is their advantage (i.e. the
| software). AI for China is an enabler into a potentially
| bigger market which is robots/manufacturing/etc.
|
| Commoditizing the AI/intelligence part means that the
| main advantage isn't the bits - its the atoms. Physical
| dexterity, social skills and manufacturing skills will
| gain more of a comparative advantage vs intelligence work
| in the future as a result - AI makes the old economy new
        | again in the long term. It also lowers the value of AI
        | investments, in that they can no longer command first
        | mover/monopoly-like pricing for what is a very large
        | capex cost, undermining US investment in what is their
        | advantage. As long as it is strategic, it doesn't
        | necessarily need to be economic on its own.
| theptip wrote:
| Yeah, this is the obvious objection to the doom. Someone
| has to pay to train the model that all the small ones
| distill from.
|
| Companies will have to detect and police distilling if they
| want to keep their moat. Maybe you have to have an
| enterprise agreement (and arms control waiver) to get
| GPT-6-large API access.
| Imustaskforhelp wrote:
      | I think it's market leadership, which is just free word of
      | mouth advertising, which can then lead to a consulting
      | business, or maybe they can sneak some ads into the LLM
      | directly, oh boy you don't know.
      |
      | Also I have seen that once an open source LLM is released
      | to the public, though you can access it on any website
      | hosting it, most people would still prefer to use it from
      | the company that created the model.
      |
      | Deepseek released its revenue numbers and they're crazy
      | good.
      |
      | And no, they didn't have full racks of H100s.
      |
      | Also one more thing: open source has always had an issue
      | of funding.
      |
      | Also they are not completely open source, they are just
      | open weights. Yes you can fine tune them, but from my
      | limited knowledge there are some limitations to fine
      | tuning, so keeping the training data proprietary also
      | helps fund my previous idea of consulting.
      |
      | Yes it's not a hugely profitable venture, imo it's just a
      | decently profitable venture, but the current hype around
      | AI is making it lucrative for companies.
      |
      | Also I think this might be a winner-takes-all market,
      | which increases competition but in a healthy way.
      |
      | What Deepseek did with releasing the open source model,
      | and then going out of their way to release some other open
      | source projects which themselves could've been worth a few
      | millions (bycloud said it), helps innovate AI in general.
| zamadatix wrote:
      | Once setting investment dollars on fire is no longer
      | viable, it'll probably be because scaling died anyway, so
      | what's the rush to have a dozen new frontier models each
      | year?
| colechristensen wrote:
| I think the only people who will ever make money are the
| shovel makers, the models will always be free because
| you'll just get open source models chasing the paid ones
| and never being all that far behind, especially when this S
| curve growth phase slows down.
| natch wrote:
| Many sources, Chinese government could be one.
| pants2 wrote:
| There are lots of open-source projects that took many
| millions of dollars to create. Kubernetes, React, Postgres,
| Chromium, etc. etc.
|
      | This has clearly been part of a viable business model for a
      | long time. Why should LLMs be any different?
| Imustaskforhelp wrote:
    | Yes, I also believe the same, though of the western world I
    | only believe in either Grok, Gemini, or Claude.
    |
    | Gemini isn't too special, it's actually just comparable to
    | Deepseek / less than Deepseek, but it is damn fast, so maybe
    | forget Gemini for serious tasks.
    |
    | Grok / Gemini can be used as a deep research model, which I
    | think I like? Grok seems to have just taken the Deepseek
    | approach and scaled it with their hyper massive GPU cluster;
    | I am not sure, I think Grok can also be replaced.
    |
    | What I truly believe in is Claude.
    |
    | I am not sure, but Claude really feels good for coding
    | especially.
    |
    | For any other thing I might use something like Deepseek /
    | Chinese models.
    |
    | I used cerebras.ai and holy moly they are so fast. I used
    | the Deepseek 70B model, and it is still incredibly fast, and
    | my time matters, so I really like the open source way, so
    | that companies like Cerebras can focus on what they do best.
    |
    | I am not sure about Nvidia though. Nvidia seems so connected
    | to western AI that Deepseek improvements impact Nvidia.
    |
    | I do hope that Nvidia cheapens the price of GPUs, though I
    | don't think they have much incentive.
| refulgentis wrote:
    | I've been waiting since November for 1, just 1*, model other
    | than Claude that can reliably do agentic tool call loops. As
| long as the Chinese open models are chasing reasoning and
| benchmark maxxing vs. mid-2024 US private models, I'm very
| comfortable with somewhat ignoring these models.
|
| (this isn't idle prognostication hinging on my personal hobby
| horse. I got skin in the game, I'm virtually certain I have
| the only AI client that is able to reliably do tool calls
| with open models in an agentic setting. llama.cpp got a
| massive contribution to make this happen and the big boys who
| bother, like ollama, are still using a dated json-schema-
| forcing method that doesn't comport with recent local model
| releases that can do tool calls. IMHO we're comfortably past
| a point where products using these models can afford to focus
    | on conversational chatbots, that's cute but a commodity to
| give away per standard 2010s SV thinking)
|
| * OpenAI's can but are a little less...grounded?...situated?
| i.e. it can't handle "read this file and edit it to do $X".
| Same-ish for Gemini, though, sometimes I feel like the only
| person in the world who actually waits for the experimental
| models to go GA, as per letter of the law, I shouldn't deploy
| them until then
| throwawaymaths wrote:
      | is there some reason you can't train a 1b model to just do
      | agentic stuff?
| refulgentis wrote:
| They're real squished for space, more than I expected :/
| good illustration here, Qwen2.5-1.5B trained to reason,
| i.e. the name it is released under is "DeepSeek R1 1.5B".
| https://imgur.com/a/F3w5ymp 1st prompt was "What is
| 1048576^0.05", it answered, then I said "Hi",
| then...well...
|
| Fwiw, Claude Sonnet 3.5 100% had some sort of agentic
| loop x precise file editing trained into it. Wasn't
| obvious to me until I added a MCP file server to my
| client, and still isn't well-understood outside a few.
|
| I'm not sure on-device models will be able to handle it
| any time soon because it relies on just letting it read
| the whole effing file.
|
        | Separately...
|
| I say I don't understand why no other model is close, but
| it makes sense. OpenAI has been focused on reasoning,
| Mistral, I assume is GPU-starved, and Google...well, I
| used to work there, so I have to stop myself from going
| on and on. Let's just say I assume that there wouldn't be
| enough Consensus Built(tm) to do something "scary" and
| "experimental" like train that stuff in.
|
| This also isn't going so hot for Sonnet IMHO.
|
| There's vague displeasure and assumptions it "changed"
| the last week, but, AFAICT the real problem is that the
| reasoning stuff isn't as "trained in" as, say, OpenAI's.
|
| This'd be a good thing except you see all kinds of whacky
| behavior.
|
| One of my simple "read file and edit" queries yesterday
| did about 60 pages worth of thinking, and the thinking
| contained 130+ separate tool calls that weren't actually
| called, so it was just wandering around in the
| wilderness, reacting to hallucinated responses it never
| actually got.
|
| Which plays into another one of my hobbyhorses, chat is a
| "hack" on top of an LLM. Great. So is reasoning,
| especially in the way Anthropic implemented it. At what
| point are the abstractions too much, so much that it's
| unreliable? 3.7 Sonnet may be answering that, because
| when it fails, all that thinking looks like the agentic
| loop cooked into Sonnet 3.5. So maybe it's altogether too
| much to have chat, reasoning, and fully reliable agentic
| loops...
| anon373839 wrote:
| The Berkeley Function Calling Leaderboard [1] might be of
| interest to you. As of now, it looks like Hammer2.1-3b is
| the strongest model under 7 billion parameters. Its
| overall score is ~82% of GPT-4o's. There is also
| Hammer2.1-1.5b at 1.5 billion parameters that is ~76% of
| GPT-4o.
|
| [1] https://gorilla.cs.berkeley.edu/leaderboard.html
| refulgentis wrote:
| Worth noting:
|
| - That'll be 1 turn scores: at multiturn, 4o is 3x as
| good as the 3b
|
            | - BFCL is generally about turning natural
            | language into an API call; multiturn just
            | involves making another API call.
|
| - I hope to inspire work towards an open model that can
| eat the paid models sooner rather than later
|
| - trained quite specifically on an agent loop with tools
| read_files and edit_file (you'll also probably do at
| least read_directory and get_shared_directories,
| search_filenames and search_files_text are good too),
| bonus points for cli_command
|
| - IMHO, this is much lower hanging-fruit than ex.
| training an open computer-vision model, so I beseech
| thee, intrepid ML-understander, to fill this gap and hear
| your name resound throughout the age
| ilrwbwrkhv wrote:
| Both of them are better than any American models. Both for
| reasoning, agentic, fine tuning etc.
| th0ma5 wrote:
| Why do you keep promoting your blog on every LLM post?
| rudedogg wrote:
| I think they didn't want to rewrite their post. It's more
| substantial and researched than any comment here, and all
| their posts are full of information. I think they should get
| a pass, and calling it self-promotion is a stretch.
| simonw wrote:
| Because I want people to read it. I only promote it if I
| think it's useful and relevant.
| simonw wrote:
| 32B is one of my favourite model sizes at this point - large
| enough to be extremely capable (generally equivalent to GPT-4
| March 2023 level performance, which is when LLMs first got really
| useful) but small enough you can run them on a single GPU or a
| reasonably well specced Mac laptop (32GB or more).
| clear_view wrote:
  | 32B models don't fully fit in 16GB of VRAM. Still fine for
  | higher quality answers, worth the extra wait in some cases.
| abraxas wrote:
    | Would a 48GB A6000 fully accommodate a 32B model? I assume
    | quantization below fp16 would still be necessary?
| manmal wrote:
| At FP16 you'd need 64GB just for the weights, and it'd be
| 2x as slow as a Q8 version, likely with little improvement.
| You'll also need space for attention and context etc, so
| 80-100GB (or even more) VRAM would be better.
|
      | Many people "just" use 4x consumer GPUs like the 3090
      | (24GB each), which scales well. They'd probably buy a
      | mining rig, an EPYC CPU, a mainboard with sufficient PCIe
      | lanes, PCIe risers, a 1600W PSU (might need to limit the
      | GPUs to 300W), and 128GB RAM. Depending on what you pay
      | for the GPUs, that'll be 3.5-4.5k.
| postalrat wrote:
        | I haven't found a good case/risers/etc I really like.
        | Most of the miner stuff wasn't made for PCIe 16x.
| manmal wrote:
| Is that a problem? According to this, the GPUs don't
          | communicate that much once the weights are loaded:
          | https://github.com/turboderp/exllama/discussions/16#discussi...
|
| > So at FP16 precision that's a grand total of 16 kB
| you're transmitting over the PCIe bus, once per token. If
| you multiply by, say, 20 tokens per second, then you're
| still only using like 0.1% of your PCIe bandwidth.
|
| Intra GPU memory bandwidth is very important, but I've
| seen lots of people use just a x4 lane and they didn't
| complain much.
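          |
          | Quick sketch of the arithmetic behind that quote
          | (the 8192 hidden size is an assumption; it's
          | whatever activation vector crosses the GPU
          | boundary per token):
          |
          |     hidden = 8192          # assumed hidden size
          |     per_tok = hidden * 2   # fp16 bytes per token
          |     tok_s = 20
          |     print(per_tok / 1024)         # ~16 kB/token
          |     print(per_tok * tok_s / 1e3)  # ~328 kB/s
          |     # vs tens of GB/s even on a narrow PCIe link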
| abraxas wrote:
| would it be better for energy efficiency and overall
| performance to use workstation cards like A5000 or A4000?
| Those can be found on eBay.
| manmal wrote:
| Looks like the A4000 has low memory bandwidth (50% of a
| 4090?) which is the limiting factor for inference
| usually. But they are efficient - if you can get them for
| cheap, probably a good entry setup? If you like running
| models that need a lot of VRAM, you'll likely run out of
| PCIe slots before you are done upgrading.
| elorant wrote:
      | You don't need 16-bit precision. The difference in
      | accuracy from 8-bit in most models is less than 5%.
| redrove wrote:
| Or quantized on a 4090!
| YetAnotherNick wrote:
  | I don't think these models are GPT-4 level. Yes, they seem to
  | be on benchmarks, but it's well known that models increasingly
  | use A/B testing in dataset curation and synthesis (using GPT-4
  | level models) to optimize not just the benchmarks themselves
  | but anything that could be benchmarked, like academic tasks.
| simonw wrote:
| I'm not talking about GPT-4o here - every benchmark I've seen
| has had the new models from the past ~12 months out-perform
| the March 2023 GPT-4 model.
|
| To pick just the most popular one,
| https://lmarena.ai/?leaderboard= has GPT-4-0314 ranked 83rd
| now.
| th0ma5 wrote:
| How have you been able to tie benchmark results to better
| results?
| simonw wrote:
| Vibes and intuition. Not much more than that.
| wetwater wrote:
| I've only recently started looking into running these models
| locally on my system. I have limited knowledge regarding LLMs
| and even more limited when it comes to building my own PC.
|
  | Are there any good sources I can read up on for estimating
  | what hardware specs would be required for 7B, 13B, 32B, etc.
  | sizes if I want to run them locally?
| TechDebtDevin wrote:
| VRAM Required = Number of Parameters (in billions) x Number
| of Bytes per Parameter x Overhead[0].
|
    | [0]: https://twm.me/posts/calculate-vram-requirements-local-llms/
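    |
    | A minimal sketch of that formula (the 1.2 overhead factor
    | here is just an assumed ballpark; the linked post goes into
    | more detail):
    |
    |     def vram_gb(params_b, bytes_per_param, overhead=1.2):
    |         # params_b in billions, e.g. 32 for a 32B model
    |         return params_b * bytes_per_param * overhead
    |
    |     print(vram_gb(32, 2))    # fp16:  ~77 GB
    |     print(vram_gb(32, 1))    # 8-bit: ~38 GB
    |     print(vram_gb(32, 0.5))  # 4-bit: ~19 GB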
| wetwater wrote:
      | That's neat! Thanks.
| manmal wrote:
| Don't forget to add a lot of extra space if you want a
| usable context size.
| TechDebtDevin wrote:
| Wouldn't that be your overhead var
| faizshah wrote:
    | I just started self hosting as well on my local machine,
    | been using https://lmstudio.ai/ locally for now.
|
| I think the 32b models are actually good enough that I might
| stop paying for ChatGPT plus and Claude.
|
    | I get around 20 tok/second on my M3, and I can get 100
    | tok/second on smaller or quantized models. 80-100 tok/second
    | is the best for interactive usage; if you go above that you
    | basically can't read as fast as it generates.
|
    | I also really like the QwQ reasoning model. I haven't gotten
    | around to trying out locally hosted models for agents and
    | RAG; coding agents especially are what I'm interested in. I
    | feel like 20 tok/second is fine if it's just running in the
    | background.
|
    | Anyways, would love to know others' experiences; that was
    | mine this weekend. The way it's going I really don't see a
    | point in paying. I think on-device is the near future, and
    | they should just charge a licensing fee, like DB providers
    | do, for enterprise support and updates.
|
    | If you were paying $20/mo for ChatGPT a year ago, the 32b
    | models are basically at that level, slightly slower and
    | slightly lower quality, but useful enough to consider
    | cancelling your subscriptions at this point.
| wetwater wrote:
      | Are there any good sources I can read up on for estimating
      | what hardware specs would be required for 7B, 13B, 32B,
      | etc. sizes if I want to run them locally? I am a grad
      | student on a budget, but I want to host one locally and am
      | trying to build a PC that could run one of these models.
| disgruntledphd2 wrote:
        | A MacBook with 64GB RAM will probably be the easiest.
        | As a bonus, you can train pytorch models on the built-in
        | GPU.
        |
        | It's really frustrating that I can't just write off
        | Apple as evil monopolists when they put out hardware
        | like this.
| coder543 wrote:
| "B" just means "billion". A 7B model has 7 billion
| parameters. _Most_ models are trained in fp16, so each
| parameter takes two bytes at full precision. Therefore, 7B
| = 14GB of memory. You can easily quantize models to 8 bits
| per parameter with very little quality loss, so then 7B =
| 7GB of memory. With more quality loss (making the model
| dumber), you can quantize to 4 bits per parameter, so 7B =
| 3.5GB of memory. There are ways to quantize at other levels
| too, anywhere from under 2 bits per parameter up to 6 bits
| per parameter are common.
|
| There is additional memory used for context / KV cache. So,
| if you use a large context window for a model, you will
| need to factor in several additional gigabytes for that,
| but it is much harder to provide a rule of thumb for that
| overhead. Most of the time, the overhead is significantly
| less than the size of the model, so not 2x or anything.
| (The size of the context window is related to the amount of
| text/images that you can have in a conversation before the
| LLM begins forgetting the earlier parts of the
| conversation.)
|
| The most important thing for local LLM performance is
| typically memory bandwidth. This is why GPUs are so much
| faster for LLM inference than CPUs, since GPU VRAM is many
| times the speed of CPU RAM. Apple Silicon offers rather
| decent memory bandwidth, which makes the performance fit
| somewhere between a typical Intel/AMD CPU and a typical
| GPU. Apple Silicon is definitely _not_ as fast as a
| discrete GPU with the same amount of VRAM.
|
| That's about all you need to know to get started. There are
| obviously nuances and exceptions that apply in certain
| situations.
|
| A 32B model at 5 bits per parameter will comfortably fit
| onto a 24GB GPU and provide decent speed, as long as the
| context window isn't set to a huge value.
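        |
        | If you want a rough feel for that context overhead,
        | here's a sketch of the usual KV-cache estimate (the
        | layer/head numbers below are assumptions in the
        | ballpark of a 32B-class model, not exact figures
        | for Qwen):
        |
        |     layers, kv_heads, head_dim = 64, 8, 128
        |     ctx = 32768        # context length, tokens
        |     bytes_per = 2      # fp16 K/V entries
        |     kv = 2 * layers * kv_heads * head_dim
        |     kv = kv * ctx * bytes_per
        |     print(kv / 1e9)    # ~8.6 GB for 32k context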
| epolanski wrote:
          | So, in essence, all AMD needs to do to launch a
          | successful GPU in the inference space is load it
          | with RAM?
| faizshah wrote:
        | Go to r/LocalLLaMA, they have the most info. There are
        | also lots of good YouTube channels that have done
        | benchmarks on Mac minis for this (another good value
        | option with a student discount).
|
| Since you're a student most of the providers/clouds offer
| student credits and you can also get loads of credits from
| hackathons.
| p_l wrote:
| Generally, unquantized - double the number and that's the
| amount of VRAM in GB you need + some extra, because most
| models use fp16 weights so it's 2 bytes per parameter ->
| 32B parameters = 64GB
|
        | Typical quantization to 4-bit will cut a 32B model down
        | to 16GB of weights plus some of the runtime data, which
        | makes it possibly usable (if slow) on a 16GB GPU. You
        | can sometimes viably use smaller quantizations, which
        | will reduce memory use even more.
| randomNumber7 wrote:
| Yes. You multiply the number of parameters with the number
| of bytes per parameter and compare it with the amount of
| GPU memory (or CPU RAM) you have.
| pixelHD wrote:
| what spec is your local mac?
| gatienboquet wrote:
| So today is Qwen. Tomorrow a new SOTA model from Google
| apparently, R2 next week.
|
| We haven't hit the wall yet.
| zamadatix wrote:
| Qwen 3 is coming imminently as well
| https://github.com/huggingface/transformers/pull/36878 and it
| feels like Llama 4 should be coming in the next month or so.
|
| That said none of the recent string of releases has done much
| yet to "smash a wall", they've just met the larger proprietary
| models where they already were. I'm hoping R2 or the like
| really changes that by showing ChatGPT 3->3.5 or 3.5->4 level
| generational jumps are still possible beyond the current state
| of the art, not just beyond current models of a given size.
| tomdekan wrote:
| Any more info on the new Google model?
| behnamoh wrote:
| Google's announcements are mostly vaporware anyway. Btw, where
| is Gemini Ultra _1_? how about Gemini Ultra 2?
| karmasimida wrote:
    | It is already on the LLM arena, right? Codename Nebula. But
    | you are right, they can fuck up their releases royally.
| OsrsNeedsf2P wrote:
| > We haven't hit the wall yet.
|
| The models are iterative improvements, but I haven't seen night
| and day differences since GPT3 and 3.5
| anon373839 wrote:
| Yeah. Scaling up pretraining and huge models appears to be
| done. But I think we're still advancing the frontier in the
| other direction -- i.e., how much capability and knowledge
| can we cram into smaller and smaller models?
| nwienert wrote:
| We've slid into the upper S curve though.
| Arcuru wrote:
| Does anyone know how making the models multimodal impacts their
| text capabilities? The article is claiming this achieves good
| performance on pure text as well, but I'm curious if there is any
| analysis on how much impact it usually has.
|
| I've seen some people claim it should make the models better at
| text, but I find that a little difficult to believe without data.
| lysace wrote:
| To clarify: Qwen is made by Alibaba Cloud.
|
| (It's not mentioned anywhere in the blog post.)
| nodesocket wrote:
| What is the recommended model for RAG over PDF text documents?
| I've seen some recommendations for Mistral:7b. Looking to run on
| a consumer pedestrian home PC (ollama) with an Nvidia 4060ti and
| Ryzen 5700x.
| unquietwiki wrote:
| Apparently there are two versions of the 4060Ti, with 8GB and
| 16GB of VRAM respectively. I've got an 8GB 3060 that runs
| gemma2:9b nicely, and that will parse PDF files; gemma3:4b also
| seems to analyze PDFs decently.
| nodesocket wrote:
| I also have the 8GB 4060ti variant. Want to upgrade to a 4070
| super, but prices on them are still ridiculous. Could be had
| for $599 a handful of months ago, now on ebay going for $750
| plus.
|
    | Thanks for the recommendations. I'll give gemma3:12b a try
    | and if needed go down to gemma3:4b.
| manmal wrote:
| May I ask why you don't get a used 3090 with 24GB VRAM?
| hmottestad wrote:
| Just don't ask it about the Tiananmen Square massacre or you'll
| get a security warning. Even if you rephrase it.
|
| It'll happily talk about Bloody Sunday.
|
| Probably a great model, but it worries me that it has such
| restrictions.
|
| Sure OpenAI also has lots of restrictions, but this feels more
| like straight up censorship since it'll happily go on about bad
| things the governments of the west have done.
| generalizations wrote:
| Nah, it's great for things that Western models are censored on.
| The True Hacker will keep an Eastern and Western model
| available, depending on what they need information on.
| hmottestad wrote:
| I tried to ask it about Java exploits that would allow me to
| gain RCE, but it refused just as most western models do.
|
| That was the only thing I could think to ask really. Do you
| have a better example maybe?
| theturtletalks wrote:
| Wouldn't they just run R1 locally and not have any censorship
| at all? The model isn't censored at its core, it's censored
| through the system prompt. Perplexity and Huggingface have
    | their own versions of R1 that are not censored.
| hmottestad wrote:
| I tried R1 through Kagi and it's similarly censored. Even
| the distill of llama running on Groq is censored.
| theturtletalks wrote:
| Kagi may be using the official DeepSeek API and not
| hosting the model itself. There is work being done to
| make it completely uncensored:
|
| https://github.com/huggingface/open-r1
|
| https://ollama.com/huihui_ai/deepseek-r1-abliterated
|
| I was mistaken though, it is more than just a system
| prompt causing the censorship.
| rsoto2 wrote:
| a lot of the safety around models seems to be implemented in
| the browser. Underneath the models seem pretty easy to
| fool/jailbreak.
| jampa wrote:
| DeepSeek's website seems to be using two models. The one that
| censors only does so in the online version. Are you saying that
| censoring happens with this model, even in the offline version?
| hmottestad wrote:
| I tried the R1 distill of llama 8B, which did refuse direct
| questions about the massacre.
|
| Haven't tried this new model locally, but I agree with you
| that it looks like there is a secondary censorship going on.
| If I ask it to list the 10 worst catastrophes of recent
| Chinese history with Thinking enabled then it'll actually
| think about the massacre. Gets blocked very quickly, but it
| doesn't look like the thinking is particularly censored.
| asadm wrote:
  | a) nobody, in production, asks those questions b) chatgpt is
  | similarly biased on the israel/palestine issue. Try making it
  | agree that there is a genocide ongoing, or that Palestinians
  | have the right to defend themselves.
| BoorishBears wrote:
  | Daily reminder that all commercial LLMs are going to align
  | with the governments their corporations exist under.
|
| https://imgur.com/a/censorship-much-CBxXOgt
|
| It's not even nefarious: they don't want the model spewing out
| content that will get them in trouble in the most general
| sense. It just so happens most governments have things that
| will get you in trouble.
|
| The US is very obsessed with voter manipulation these days, so
| OpenAI and Anthropic's models are extra sensitive if the
| wording implies they're being used for that.
|
| China doesn't like talking about past or ongoing human rights
| violations, so their models will be extra sensitive about that.
| 101008 wrote:
| Silly question: how can OpenAI, Claude, and the rest have such
| large valuations considering all the open source models? Not
| saying they (the closed models) will disappear or be tiny, but
| why so, so valuable?
| barbarr wrote:
| It's user base and brand. Just like with Pepsi and Coca Cola.
| There's a reason OpenAI ran a Super Bowl ad.
| 101008 wrote:
    | Yeah, but with cheaper alternatives (including open source
    | and local ones) it would be super easy for most customers to
    | migrate to a different provider. I am not saying they don't
    | provide any value, but it's like paid software vs. an open
    | source alternative. The open source alternative ends up
    | winning out, especially among tech people.
| TechDebtDevin wrote:
| Most "normies" I know only recognize ChatGPT with AI, so for
| sure, brand recognition is the only thing that matters.
| seydor wrote:
| People cannot normally invest in their competitors.
|
  | It's not unlikely that Chinese products may be banned /
  | tariff'd.
| FreakyT wrote:
| There are non-Chinese open LLMs (Mistral, LLama, etc), so I
| don't think that explains it.
| neither_color wrote:
| ChatGPT's o1 is still really good and the free options are not
| compelling enough to switch if you've been using it for a
| while. They've positioned themselves to be a good mainstream
| default.
| rafaelmn wrote:
| Because what would seem like a tiny difference in those
| benchmark graphs is the difference between worth paying for and
| complete waste of time in practice
| elorant wrote:
| The average user won't self-host a model.
| epolanski wrote:
| ...yet
| admiralrohan wrote:
| Any update on the privacy concerns?
| ggregoire wrote:
| We were using Llama vision 3.2 a few months back and were very
| frustrated with it (both in terms of speed and results quality).
| One day we were looking for alternatives on Hugging Face and
| eventually stumbled upon Qwen. The difference in accuracy and
| speed absolutely blew our mind. We ask it to find something in an
| image and we get a response in like half a second with a 4090 and
| it's most of the time correct. What's even more mind blowing is
| that when we ask it to extract any entity name from the image,
| and the entity name is truncated, it gives us the complete name
| without even having to ask for it (e.g. "Coca-C" is barely
| visible in the background, it will return "Coca-Cola" on its
| own). And it does it with entities not as well known as Coca-
| Cola, and with entities only known in some very specific regions
| too. Haven't looked back to Llama or any other vision models
| since we tried Qwen.
| tomcam wrote:
| I wish I were smarter and lighter
|
| tbh I'd settle for just lighter
| wg0 wrote:
| For now, the motivation is to win the AI race. Not sure how, or
| who, would keep these models updated in the future.
|
| I mean all of them.
| i_love_retros wrote:
| Any security risks running these Chinese LLMs on my local
| computer?
| simonw wrote:
| This model is available for MLX now, in various different sizes.
|
| I'm trying to run
| https://huggingface.co/mlx-community/Qwen2.5-VL-32B-Instruct...
| like this:
|
|     uv run --with 'numpy<2' --with mlx-vlm \
|       python -m mlx_vlm.generate \
|       --model mlx-community/Qwen2.5-VL-32B-Instruct-4bit \
|       --max-tokens 1000 \
|       --temperature 0.0 \
|       --prompt "Describe this image." \
|       --image Mpaboundrycdfw-1.png
|
| That downloaded an ~18GB model and gave me a VERY impressive
| result, shown at the bottom here:
| https://simonwillison.net/2025/Mar/24/qwen25-vl-32b/
| michaelt wrote:
| Has anyone successfully run a quantized version of any of the
| Qwen2.5-VL series of models?
|
| I've run the smallest model in non-quantized format, but when
| I've tried to run an AWQ version of one of the bigger models I've
| struggled to find a combination of libraries that works right -
| even though it _should_ fit on my GPU.
| slics wrote:
| Warning: this is a lengthy comment. I deployed the model locally
| via Ollama and its reasoning is not accurate.
|
| This is lengthy, but I thought I'd share its reasoning. At some
| point it just stopped working, then it started responding in
| Chinese.
|
| Query input:
|
| "1+5 = 7, and that is the correct answer"
|
| Reasoning response:
|
| The equation "1 + 5 = 7" isn't about regular arithmetic addition.
| Instead, it refers to the number of segments used to display the
| digits 1 and 5 on a seven-segment display (like on digital clocks
| or calculators). Here's the breakdown: * Digit 1 uses 2 segments
| (the vertical lines on the right). * Digit 5 uses 5 segments
| (top, upper-left, middle, lower-left, and bottom segments).
| Adding these segments: 2 (for 1) + 5 (for 5) = 7 segments total.
| This is a clever visual riddle, not a math problem! it was a math
| problem, you are just wrong
|
| Okay, the user is insisting that "1 + 5 = 7 is a math problem"
| and that my previous explanation was wrong. Hmm, I need to
| address this carefully. First, in standard arithmetic, 1 + 5 is
| indeed 6. So if it's a straightforward math problem, the correct
| answer should be 6. The user might be mistaken, or perhaps they
| have a different context in mind. Wait, in my previous response,
| I considered the seven-segment display explanation, which is a
| common riddle. The user is now rejecting that and stating it's a
| math problem, implying that their answer of 7 is correct in some
| mathematical context. Maybe they're referring to a different
| base? Let me revisit the base idea. If the equation is in a
| different base, maybe base 9? Let's see: In base b, the equation
| "1 + 5 = 11" would be 1 + 5 = 6 in decimal, which isn't helpful.
| Wait, maybe the result is written in a different base? For
| example, if we compute 1 + 5 in decimal (which is 6) and then
| represent that in base 7? 6 in base 7 is still "6", not 7. Hmm,
| not helpful. Alternatively, maybe modulo arithmetic? Like (1 + 5)
| mod something equals 7? Let's see: 1 +5 = 6. To get 7, maybe
| modulo something where 6 [?]7 mod (negative number?), which
| doesn't make sense. Alternatively, maybe the user is adding in a
| cyclic group? Like mod 6? 6 mod 6 is 0, not 7. Alternatively,
| could it be a different operation? Like 1 plus 5 in some
| operation other than addition? Maybe concatenation? 1
| concatenated with 5 is 15, but that's not 7. Wait, the user said
| "another one 1+5 =7". Maybe they're referring to a previous
| problem where the answer was different, and this is a follow-up?
| But I don't have context on previous interactions. Alternatively,
| perhaps the user is joking or trolling, but I should take their
| statement at face value. Since they insist it's a math problem
| and that 7 is correct, I need to explore if there's any
| mathematical framework where this holds. Wait, let's consider if
| it's a sum in a different base where the result is written in
| that base. Suppose the equation is in base 8. Let's see: In base
| 8, "7" is still 7 in decimal, so 1 (base8) +5(base8) = 6 (base8),
| which is 6 in decimal. Not 7. Alternatively, if the answer is
| written in base 9: 1 +5 =6 in decimal, which is still 6 in base9.
| Still not 7. Wait a different angle: Maybe the '+' is not
| addition, but some other operation. Like bitwise XOR? 1 in binary
| is 01, 5 is 101. XOR gives 110 (6). Not 7. Or maybe adding digits
| of the sum? Like 1+5=6, then 6 is a single digit, so adding
| digits 6, but that's not helpful. Wait, the user might have made
| a mistake, but they're insistent. Since they called it a math
| problem, perhaps they're using a different definition of "+"?
| Like 1 +5 in some algebraic structure where the result is 7?
| That's possible, but without more context, it's hard to define.
| Alternatively, maybe the numbers are not in base 10. Suppose the
| numbers are in base 12. Wait, 1+5 is still 6. Hmm. Wait, maybe
| the user is considering the numbers as Roman numerals? I+V=VI
| which is 6. Still not 7. Alternatively, maybe it's a sequence
| where 1+5 is the 7th term? Unlikely. Alternatively, it could be a
| trick with letters. Like 1 is the first letter (A), 5 is the
| fifth letter (E), so AE... but that doesn't lead to 7.
| Alternatively, adding their positions: 1 +5 =6, which is F. Still
| not 7. Alternatively, maybe counting letters in spelled-out
| words? "One" has 3 letters, "five" has 4, total 7? That could be
| a different riddle. Let's see: 1 (one) +5 (five) = 3 letters +4
| letters =7. That's a possible interpretation! If the problem is
| "1+5" where each number is spelled out in English and you add the
| number of letters, then 3+4=7. So maybe that's the case here?
| Alternatively, if it's counting the strokes to write the numbers.
| For example, writing '1' takes one stroke, '5' takes two strokes
| (depending on how you write it)? Not sure. Alternatively, the
| user might have meant 1+5 in a non-standard math problem, like in
| a clock arithmetic? For example, mod 12. 1+5=6 mod12 is still 6.
| Hmm, I need to consider that maybe the user intended a different
| approach that I haven't thought of. Since they are insistent that
| 7 is correct, perhaps they meant a different operation or
| context. Alternatively, maybe the problem is written vertically:
| 1 + 5 ---- 7 But that still doesn't make sense unless there's a
| carryover. But 1+5 is 6, so even with carryover, you can't get 7.
| Unless adding digits with some twist. Wait, maybe they're adding
| the numbers' positions in a different sequence? Like 1st prime
| plus 5th prime? The first prime is 2, fifth prime is 11. 2+11=13,
| nope. Alternatively, if you consider 1 and 5 as bits in a binary
| number, like 1 and 5 (binary 101) concatenated? Not helpful.
| Alternatively, the problem might be in a different numeral
| system, like in base 1? But that's not standard. Alternatively,
| maybe it's a simple mistake, but the user is confident. Since I
| need to respond, perhaps I should outline the possible
| interpretations where 7 could be the answer, like the spelled-out
| letters interpretation. Alternatively, the user could have a typo
| and meant 2+5=7, but wrote 1+5=7. Alternatively, in some contexts
| like in programming, maybe 1+5=7 due to a specific coding (like 1
| is assigned to a=1, so a+f=6? Not sure. Alternatively, it's a
| lateral thinking puzzle where the answer is 7 because of some
| visual or language trick. Wait, another angle: If you add 1 and 5
| on a calculator upside down, but that might not help.
| Alternatively, in temperature scales? For example, converting
| from Fahrenheit to Celsius or vice versa? 1degF to Celsius is
| negative, 5degC is 41degF. Not adding to 7. Hmm. Given the user's
| assertion that 1+5=7 is correct in a math problem, I might need
| to explain that typically 1+5=6, but there are alternative
| contexts where it might be considered 7, like counting letters in
| their names (1 is "one" (3 letters), 5 is "five" (4 letters),
| 3+4=7). That's a possible interpretation. Alternatively, in some
| base where the addition is represented differently.
| Alternatively, a trick question where the numbers are not in base
| 10 but in another base where the representation is different.
| Alternatively, the user might have seen a problem where adding 1
| and 5 gives 7 through some function, like a + b = a + b +1, but
| that's arbitrary.
___________________________________________________________________
(page generated 2025-03-24 23:00 UTC)