[HN Gopher] Qwen VLo: From "Understanding" the World to "Depicti...
___________________________________________________________________
Qwen VLo: From "Understanding" the World to "Depicting" It
Author : lnyan
Score : 149 points
Date : 2025-06-27 14:35 UTC (8 hours ago)
(HTM) web link (qwenlm.github.io)
(TXT) w3m dump (qwenlm.github.io)
| rushingcreek wrote:
| It doesn't seem to have open weights, which is unfortunate. One
| of Qwen's strengths historically has been their open-weights
| strategy, and it would have been great to have a true open-
| weights competitor to 4o's autoregressive image gen. There are so
| many interesting research directions that are only possible if we
| can get access to the weights.
|
| If Qwen is concerned about recouping its development costs, I
| suggest looking at BFL's Flux Kontext Dev release from the other
| day as a model: let researchers and individuals get the weights
| for free and let startups pay for a reasonably-priced license for
| commercial use.
| Jackson__ wrote:
| It's also very clearly trained on OAI outputs, which you can
| tell from the orange tint to the images[0]. Did they even
| attempt to come up with their own data?
|
| So it is trained off OAI, as closed off as OAI and most
| importantly: worse than OAI. What a bizarre strategy to gate-
| keep this behind an API.
|
| [0]
|
| https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VLo/cas...
|
| https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VLo/cas...
|
| https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VLo/cas...
| echelon wrote:
| The way they win is to be open. I don't get why China is
| shutting down open source. It was a knife at the jugular of
| US tech dominance.
|
| Both Alibaba and Tencent championed open source (Qwen family
| of models, Hunyuan family of models), but now they've shut
| off the releases.
|
| There's totally a play where models become loss-leader for
| SaaS/PaaS/IaaS and where they extinguish your closed
| competition.
|
| Imagine spreading your model so widely then making the terms:
| "do not use in conjunction with closed source models".
| diggan wrote:
| > I don't get why China is shutting down open source [...]
| now they've shut off the releases
|
| What are you talking about? Feels like a very strong claim
| considering there are ongoing weight releases. Wasn't there
| one just today or yesterday from a Chinese company?
| yorwba wrote:
| The problem with giving away weights for free while also
| offering a hosted API is that once the weights are out
| there, anyone else can also offer it as a hosted API with
| similar operating costs, but only the releasing company had
| the initial capital outlay of training the model. So
| everyone else is more profitable! That's not a good
| business strategy.
|
| New entrants may keep releasing weights as a marketing
| strategy to gain name recognition, but once they have
| established themselves (and investors start getting antsy
| about ROI) making subsequent releases closed is the logical
| next step.
| vachina wrote:
| Huh, so orange tint = OpenAI output? Maybe their training
| process ended up causing the model to prefer that color
| balance.
| Jackson__ wrote:
| Here's an extreme example that shows how it continually
| adds more orange:
| https://old.reddit.com/r/ChatGPT/comments/1kawcng/i_went_wit...
|
| It's really too close to be anything but a model trained on
| these outputs; the whole vibe just screams OAI.
| acheong08 wrote:
| That form of collapse might just be inherent to the
| methodology. Releasing the weights would be nice so
| people can figure out why
| VladVladikoff wrote:
| What would be the approximate cost of doing this? How many
| million API requests must be made? How many tokens in total?
| refulgentis wrote:
| The most pedantically correct answer is "mu", because both
| answers are derivable quantitatively from "How many images
| do you want to train on?", which in turn hangs on a
| qualitative question that doesn't admit numbers ("How high
| quality do you want it to be?").
|
| Let's say it's 100 images because you're doing a quick
| LoRA. That'd be about $5.00 at medium quality
| (~$0.05/image) or $1 at low (~$0.01/image).
|
| Let's say you're training a standalone image model. The
| OOM of input images is ~1B, so $10M at low and $50M at
| medium.
|
| At ~250 tokens/image for low and ~1000 for medium, that
| gets us to:
|
| Fastest LoRA? $1-$5, 25,000-100,000 tokens output. _All_
| the training data for a new image model? $10M-$50M,
| 250B-1T tokens out.
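|
| A rough back-of-envelope sketch of that arithmetic in
| Python, using the per-image prices and token counts assumed
| above (none of these are official figures):
|
|     # Back-of-envelope only; prices and token counts are the
|     # assumptions above, not published pricing.
|     def estimate(n_images, price, tokens_per_image):
|         return n_images * price, n_images * tokens_per_image
|
|     scenarios = [("quick LoRA", 100), ("new image model", 10**9)]
|     tiers = [("low", 0.01, 250), ("medium", 0.05, 1000)]
|     for label, n in scenarios:
|         for tier, price, toks in tiers:
|             cost, total = estimate(n, price, toks)
|             print(f"{label} @ {tier}: ${cost:,.0f}, {total:,} tokens")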
| diggan wrote:
| > One of Qwen's strengths historically has been their open-
| weights strategy [...] let researchers and individuals get the
| weights for free and let startups pay for a reasonably-priced
| license for commercial use.
|
| But if you're suggesting they should do open weights, doesn't
| that mean people should be able to use it freely?
|
| You're effectively suggesting "trial-weights", "shareware-
| weights", "academic-weights" or something like that rather than
| "open weights", which to me would make it seem like you can use
| them for whatever you want, just like with "open source"
| software. But if it misses a large part of what makes "open
| source" open source, like "use it for whatever you want", then
| it kind of gives the wrong idea.
| rushingcreek wrote:
| I am personally in favor of true open source (e.g. Apache 2
| license), but the reality is that these models are expensive
| to develop and many developers are choosing not to release
| their model weights at all.
|
| I think that releasing the weights openly but with this type
| of dual-license (hence open weights, but not true open
| source) is an acceptable tradeoff to get more model
| developers to release models openly.
| diggan wrote:
| > but the reality is that these models are expensive to
| develop and many developers are choosing not to release
| their model weights at all.
|
| But isn't that true for software too? Software is expensive
| to develop, and lots of developers/companies are choosing
| not to make their code public for free. Does that mean you
| also feel like it would be OK to call software "open
| source" although it doesn't allow usage for any purpose?
| That would then lead to more "open source" software being
| released, at least for individuals and researchers?
| rushingcreek wrote:
| Yes, I think the same analogy applies. Given a binary
| choice of a developer not releasing any code at all or
| releasing code under this type of binary "open-code"
| license, I'd always take the latter.
| diggan wrote:
| > Given a binary choice of a developer not releasing any
| code at all
|
| I mean it wasn't binary earlier, it was "to get more
| model developers to release", so not a binary choice, but
| a gradient I suppose. Would you still make the same call
| for software as you do for ML models and weights?
| hmottestad wrote:
| I wouldn't equate model weights with source code. You can
| run software on your own machine without source code, but
| you can't run an LLM on your own machine without model
| weights.
|
| Though, you could still sell the model weights for local
| use. Not sure we're at the point yet where I myself could
| buy model weights, but of course if you are a very big
| company or a very big country then I guess most AI
| companies would consider selling you their model weights
| so you can run them on your own machine.
| echelon wrote:
| The era of open weights from China appears to be over, for
| some reason. It happened all of a sudden and seems to be
| coordinated.
|
| Alibaba just shut off the Qwen releases
|
| Tencent just shut off the Hunyuan releases
|
| Bytedance just released Seedream, but it's closed
|
| It seems like it's over.
|
| They're still clearly training on Western outputs, though.
|
| I still suspect that the strategic thing to do would be to
| become 100% open and sell infra/service.
| pxc wrote:
| Why? And can we really say that already? Wasn't the Qwen3
| release still very recent?
| natrys wrote:
| > Alibaba just shut off the Qwen releases
|
| Alibaba has from the beginning had some series of models
| that are always closed-weights (*-max, *-plus, *-turbo
| etc., but also QvQ). It's not a new development, nor does
| it preclude their open models. And the VL models are opened
| up after 2-3 months of GA in the API.
|
| > Tencent just shut off the Hunyuan releases
|
| Literally released one today:
| https://huggingface.co/tencent/Hunyuan-A13B-Instruct
| echelon wrote:
| Hunyuan Image 2.0, which is of Flux quality but has ~20
| milliseconds of inference time, is being withheld.
|
| Hunyuan 3D 2.5, which is an order of magnitude better than
| Hunyuan 3D 2.1, is also being withheld.
|
| I suspect that now that they feel these models are superior
| to Western releases in several categories, they no longer
| have a need to release these weights.
| natrys wrote:
| > I suspect that now that they feel these models are
| superior to Western releases in several categories, they
| no longer have a need to release these weights.
|
| Yes that I can totally believe. Standard corporation
| behaviour (Chinese or otherwise).
|
| I do think DeepSeek would be an exception to this though.
| But they lack diversity in focus (not even multimodal
| yet).
| logicchains wrote:
| What do you mean Tencent just shut off the Hunyuan releases?
| There was another open weights release just today:
| https://huggingface.co/tencent/Hunyuan-A13B-Instruct . And
| the latest Qwen and DeepSeek open weight releases were under
| 2 months ago, there hasn't been enough time for them to
| finish a new version since then.
| echelon wrote:
| Hunyuan Image 2.0 and Hunyuan 3D 2.5 are not being
| released. They're being put into a closed source web-based
| offering.
| jacooper wrote:
| DeepSeek R1 0528, the flagship Chinese model, is open source.
| Qwen3 is open source. HiDream models are also open source.
| dheera wrote:
| > One of Qwen's strengths historically has been their open-
| weights strategy
|
| > let researchers and individuals get the weights for free and
| let startups pay for a reasonably-priced license for commercial
| use
|
| I'm personally doubtful companies can recoup tens of millions
| of dollars in investment, GPU hours, and engineering salaries
| from image generation fees.
| aredox wrote:
| I don't think these words mean what they think they do...
| frotaur wrote:
| Does anybody know if there is a technical report for this, or for
| other models that generate images in a similar way? I'd really
| like to understand the architecture behind 4o-like image gen.
| rickydroll wrote:
| To my eyes, all these images hit the uncanny valley. All the
| colors and the shadows are just off.
| poly2it wrote:
| They are all really sloppy. I don't really see the use case for
| this sort of output outside of research.
| skybrian wrote:
| I tried the obligatory pelican riding a bicycle (as an image, not
| SVG) and some accordion images. It has a bit of trouble with
| fingers and with getting the black keys right. It's fairly fast.
|
| https://chat.qwen.ai/s/0f9d558c-2108-4350-98fb-6ee87065d587?...
| hexmiles wrote:
| While looking at the examples of editing the bear image, I
| noticed that the model seemed to change more things than were
| strictly asked.
|
| As an example, when asked to change the background, it also
| completely changed the bear (it has the same shirt but the fur
| and face are clearly different). Also, when it turned the bear
| into a balloon, it changed the background (removing the
| pavement) and lost the left seed in the watermelon.
|
| Is it something that can be fixed with better prompting, or is
| it a limitation of the model/architecture?
| godelski wrote:
| > Is it something that can be fixed with better prompting, or
| is it a limitation of the model/architecture?
|
| Both. You can get better results through better prompting but
| the root cause of this is a limitation of the architecture and
| training methods (which are coupled).
| djaychela wrote:
| How do you stop the auto read-out? Why can't websites just sit
| there and wait until I ask them to do something? It full-screen
| auto-played a video on watch and then just started reading?
|
| Firefox on iOS, FTR.
| b0a04gl wrote:
| The image gets compressed into 256 tokens before the language
| model sees it. Ask it to add a hat and it redraws the whole
| face, because objects aren't stored as separate things. There's
| no persistent bear in memory; it all lives inside one fused
| latent soup, and outputs are fresh samples under new
| constraints. Every prompt tweak rebalances the whole embedding,
| which is why even small changes ripple across the image. I
| think of it as single-shot scene synthesis, which is good for
| different use cases.
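|
| A toy sketch of that idea in Python (illustrative only; this
| is not Qwen VLo's actual tokenizer, and the codebook here is
| just random): patchify a 256x256 image into a 16x16 grid and
| snap each patch to its nearest codebook entry, so the whole
| image becomes 256 discrete token ids.
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     image = rng.random((256, 256, 3))  # stand-in for a real image
|     # Hypothetical codebook of 1024 entries (random here).
|     codebook = rng.random((1024, 16 * 16 * 3))
|
|     # Split into a 16x16 grid of 16x16-pixel patches, flatten each.
|     patches = (image.reshape(16, 16, 16, 16, 3)
|                     .transpose(0, 2, 1, 3, 4)
|                     .reshape(256, -1))
|
|     # Squared distance to every codebook entry; keep the nearest id.
|     d2 = ((patches ** 2).sum(1, keepdims=True)
|           - 2 * patches @ codebook.T
|           + (codebook ** 2).sum(1))
|     tokens = d2.argmin(axis=1)
|     print(tokens.shape)  # (256,) -- one token id per patch
|
| Everything downstream only ever sees those 256 ids plus the
| prompt tokens, which is consistent with edits quietly redrawing
| parts of the image you never mentioned.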
| leodriesch wrote:
| That's what I really like about Flux Kontext: it has similar
| editing capabilities to the multimodal models, but doesn't mess
| up the details. Editing with gpt-image-1 only really works for
| complete style changes like "make this Ghibli", not for adding
| glasses to a photorealistic image and having it retain all the
| details.
| vunderba wrote:
| Agreed. Kontext's ability to basically do the equivalent of
| img2img inpainting is hugely impressive.
|
| Even when used to add new details, it sticks very strongly to
| the existing image's overall aesthetic.
|
| https://specularrealms.com/ai-transcripts/experiments-with-f...
| veltas wrote:
| Rather I think machine learning has made a lot more progress
| 'depicting' the world than 'understanding' it.
| ivape wrote:
| Why do you think humans understand the world any better? We
| have _emotions_ about the world, but emotions do not grant you
| understanding, where "understanding" is something you would
| still need to define.
|
| "I get it" is actually just some arbitrary personal
| benchmark.
| godelski wrote:
| As an ML researcher and a degree-holding physicist, I'm really
| hesitant to use the words "understanding" and "describing"
| (much less hesitant about the latter) around these models. I
| don't find the language helpful and think it's mostly harmful,
| tbh.
|
| The reason we use math in physics is because of its specificity.
| The same reason coding is so hard [0,1]. I think people aren't
| giving themselves enough credit here for how much they (you)
| understand about things. It is the nuances that really matter.
| There's so much detail here and we often forget how important
| they are because it is just normal to us. It's like forgetting
| about the ground you walk upon.
|
| I think something everyone should read about is Asimov's
| "Relativity of Wrong"[2]. This is what we want to see in these
| systems if we want to start claiming they understand things. We
| want to see them do deduction and abduction. To be able to refine
| concepts and ideas. To be able to _discover_ things that are more
| than just a combination of things they've ingested. What's
| really difficult here is that we train these things on all human
| knowledge and just reciting back that knowledge doesn't
| demonstrate intelligence. It's very unlikely that they losslessly
| compress that knowledge into these model sizes, but without very
| deep investigation into that data and probing at this knowledge
| it is very hard to understand what it knows and what it
| memorizes. Really, this is a very poor way to go about trying to
| make intelligence[3], or at least to make intelligence and end up
| knowing it is intelligent.
|
| To really "understand" things we need to be able to propose
| counterfactuals[4]. Every physics statement is a counterfactual
| statement. Take F=ma as a trivial example. We can modify the mass
| or the acceleration to our heart's content and still determine
| the force. We can observe a specific mass moving at a specific
| acceleration and then ask the counterfactual "what if it was
| twice as heavy?" (twice the mass). * _We can answer that!*_ In
| fact, your mental model of the world does this too! Yo may not be
| describing it with math (maybe you are ;) but you are able to
| propose counterfactuals and do a pretty good job a lot of the
| time. Doesn 't mean you always need to be right though. But the
| way our heads work is through these types of systems. You
| daydream these things, you imagine them while you play, and all
| sorts of things. This, I can say, with high confidence, is not
| something modern ML (AI) systems do. == Edit ==
|
| A good example of lack of understanding is the image OP uses. Not
| only does the right hand have the wrong number of fingers, but
| look at the keys on the keyboard. It does not take much
| understanding to recognize that you shouldn't have repeated
| keys... the
| configuration is all wonky too, like one of those dreams you can
| immediately tell is a dream[5]. I'd also be willing to bet that
| the number of keys doesn't align to the number of markers and
| definitely the sizing looks off. The more you look at it the
| worse it gets, and that's really common among these systems. Nice
| at a quick glance but _DEEP_ in the uncanny valley at more than a
| glance and deeper the more you look.
|
| [0] https://youtube.com/watch?v=cDA3_5982h8
|
| [1] Code is math. There's an isomorphism between Turing complete
| languages and computable mathematics. You can look more into my
| namesake, Church, and Turing if you want to get more formal or
| wait for the comment that corrects a nuanced mistake here (yes,
| it exists). Also, note that physics and math are not the same
| thing, but mathematics is unreasonably effective (yes, this is a
| reference).
|
| [2] https://hermiene.net/essays-trans/relativity_of_wrong.html
|
| [3] This is a very different statement than "making something
| useful." Without a doubt these systems are useful. Do not
| conflate the two.
|
| [4] https://en.wikipedia.org/wiki/Counterfactual_thinking
|
| [5] Yes, you can read in dreams. I do it frequently. Though on
| occasion I have lucid dreamed because I read something and
| noticed that it changed when I looked away and looked back.
| BoorishBears wrote:
| As a person who builds stuff, I'm tired of these strawmen.
|
| It is helpful that they chose words that are widely understood
| to represent input vs output.
|
| They even used scare quotes to signal they're not making some
| overly grand claim in terms of the long tail implications of
| the terms.
|
| -
|
| A person reading the release would learn that previously Qwen had
| a VLM that could understand/see/perceive/whateverwordyouwanttouse
| and now it can generate images, which could be called
| depicting/drawing/portraying/whateverotherwordyouwanttouse.
|
| We don't have to invent a crisis past that.
| godelski wrote:
| > As a person who builds stuff, I'm tired of these strawmen.
|
| Who says I don't build stuff?[0]
|
| Allow me to quote Knuth. I think we can agree he built a lot
| of stuff:
|
|     If you find that you're spending almost all your time on
|     theory, start turning some attention to practical things;
|     it will improve your theories. If you find that you're
|     spending almost all your time on practice, start turning
|     some attention to theoretical things; it will improve
|     your practice.
|
| This is important. I don't know you and your beliefs, but
| some people truly believe theory is useless. But it's the
| foundation of everything we do.
|
| > We don't have to invent a crisis past that.
|
| You're right. But I'm not. Qwen isn't the only one here in
| the larger conversation. Look around the comments and see who
| can't tell the difference. Look at the announcements
| companies make. PhD level intelligence? lol. So I suggest
| taking your own advice. I've made no strawman...
|
| [0] In my undergrad I did experimental physics, not theory. I
| then worked as an aerospace engineer for years. I built a
| literal rocket engine. I built advanced radiation shielding
| that NASA uses. Then I came back to school and my PhD is in
| CS. I build things. Don't assume that the fact that I want to
| understand things interferes with that. Truth is I'm good at
| building things _because_ I spend time with theory. See Knuth.
| afro88 wrote:
| Strangely, the image change examples (edits, style transfer etc.)
| have that slight yellow tint that GPT Image 1 (ChatGPT 4o's
| latest image model) has. Why is that? Flux Kontext doesn't seem
| to do that.
___________________________________________________________________
(page generated 2025-06-27 23:00 UTC)