[HN Gopher] Qwen VLo: From "Understanding" the World to "Depicti...
       ___________________________________________________________________
        
       Qwen VLo: From "Understanding" the World to "Depicting" It
        
       Author : lnyan
       Score  : 149 points
       Date   : 2025-06-27 14:35 UTC (8 hours ago)
        
 (HTM) web link (qwenlm.github.io)
 (TXT) w3m dump (qwenlm.github.io)
        
       | rushingcreek wrote:
       | It doesn't seem to have open weights, which is unfortunate. One
       | of Qwen's strengths historically has been their open-weights
       | strategy, and it would have been great to have a true open-
       | weights competitor to 4o's autoregressive image gen. There are so
       | many interesting research directions that are only possible if we
       | can get access to the weights.
       | 
       | If Qwen is concerned about recouping its development costs, I
       | suggest looking at BFL's Flux Kontext Dev release from the other
       | day as a model: let researchers and individuals get the weights
       | for free and let startups pay for a reasonably-priced license for
       | commercial use.
        
         | Jackson__ wrote:
         | It's also very clearly trained on OAI outputs, which you can
         | tell from the orange tint to the images[0]. Did they even
         | attempt to come up with their own data?
         | 
          | So it's trained on OAI outputs, as closed off as OAI, and
          | most importantly, worse than OAI. What a bizarre strategy to
          | gate-keep this behind an API.
         | 
         | [0]
         | 
         | https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VLo/cas...
         | 
         | https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VLo/cas...
         | 
         | https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VLo/cas...
        
           | echelon wrote:
           | The way they win is to be open. I don't get why China is
           | shutting down open source. It was a knife at the jugular of
           | US tech dominance.
           | 
           | Both Alibaba and Tencent championed open source (Qwen family
           | of models, Hunyuan family of models), but now they've shut
           | off the releases.
           | 
            | There's totally a play where models become a loss-leader
            | for SaaS/PaaS/IaaS and where they extinguish your closed
            | competition.
           | 
            | Imagine spreading your model so widely and then making
            | the terms: "do not use in conjunction with closed source
            | models".
        
             | diggan wrote:
             | > I don't get why China is shutting down open source [...]
             | now they've shut off the releases
             | 
              | What are you talking about? That feels like a very
              | strong claim considering there are ongoing weight
              | releases; wasn't there one just today or yesterday from
              | a Chinese company?
        
             | yorwba wrote:
             | The problem with giving away weights for free while also
             | offering a hosted API is that once the weights are out
             | there, anyone else can also offer it as a hosted API with
             | similar operating costs, but only the releasing company had
             | the initial capital outlay of training the model. So
             | everyone else is more profitable! That's not a good
             | business strategy.
             | 
             | New entrants may keep releasing weights as a marketing
             | strategy to gain name recognition, but once they have
             | established themselves (and investors start getting antsy
             | about ROI) making subsequent releases closed is the logical
             | next step.
        
           | vachina wrote:
            | Huh, so orange tint = OpenAI output? Maybe their training
            | process ended up causing the model to prefer that color
            | balance.
        
             | Jackson__ wrote:
              | Here's an extreme example that shows how it continually
              | adds more orange:
              | https://old.reddit.com/r/ChatGPT/comments/1kawcng/i_went_wit...
             | 
              | It's really too close to be anything but a model trained
              | on these outputs; the whole vibe just screams OAI.
        
               | acheong08 wrote:
                | That form of collapse might just be inherent to the
                | methodology. Releasing the weights would be nice so
                | people could figure out why.
        
           | VladVladikoff wrote:
           | What would be the approximate cost of doing this? How many
           | million API requests must be made? How many tokens in total?
        
             | refulgentis wrote:
              | The most pedantically correct answer is "mu", because
              | both answers are quantitatively derivable from "How many
              | images do you want to train on?", which is itself
              | answered by a qualitative question that doesn't admit
              | numbers ("How high quality do you want it to be?").
              | 
              | Let's say it's 100 images because you're doing a quick
              | LoRA. That'd be about $5.00 at medium quality
              | (~$0.05/image) or $1 at low (~$0.01/image).
              | 
              | Let's say you're training a standalone image model. OOM
              | of input images is ~1B, so $10M at low and $50M at
              | medium.
              | 
              | 250 tokens / image at low, ~1000 at medium, which gets
              | us to:
              | 
              | Fastest LoRA? $1-$5, 25,000 - 100,000 tokens output.
              | _All_ the training data for a new image model? $10M-
              | $50M, 250B - 1T tokens out.
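              | 
              | As a sanity check, here's that arithmetic as a tiny
              | Python sketch (the per-image prices and token counts are
              | the assumptions above, not published figures):
              | 
              |     # Back-of-the-envelope distillation cost model.
              |     # All constants are assumed, per the estimate above.
              |     PRICE_PER_IMAGE = {"low": 0.01, "medium": 0.05}  # $
              |     TOKENS_PER_IMAGE = {"low": 250, "medium": 1000}
              | 
              |     def distill_cost(n_images, quality):
              |         """Dollars and output tokens for n_images."""
              |         return (n_images * PRICE_PER_IMAGE[quality],
              |                 n_images * TOKENS_PER_IMAGE[quality])
              | 
              |     print(distill_cost(100, "low"))      # (1.0, 25000)
              |     print(distill_cost(10**9, "medium")) # $50M, 1T tok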
        
         | diggan wrote:
         | > One of Qwen's strengths historically has been their open-
         | weights strategy [...] let researchers and individuals get the
         | weights for free and let startups pay for a reasonably-priced
         | license for commercial use.
         | 
         | But if you're suggesting they should do open weights, doesn't
         | that mean people should be able to use it freely?
         | 
          | You're effectively suggesting "trial-weights", "shareware-
          | weights", "academic-weights" or something like that, rather
          | than "open weights". "Open weights", to me, makes it seem
          | like you can use them for whatever you want, just like with
          | "open source" software. But if it misses a large part of
          | what makes "open source" open source, like "use it for
          | whatever you want", then it kind of gives the wrong idea.
        
           | rushingcreek wrote:
            | I am personally in favor of true open source (e.g. Apache
            | 2 license), but the reality is that these models are
            | expensive to develop and many developers are choosing not
            | to release their model weights at all.
           | 
           | I think that releasing the weights openly but with this type
           | of dual-license (hence open weights, but not true open
           | source) is an acceptable tradeoff to get more model
           | developers to release models openly.
        
             | diggan wrote:
              | > but the reality is that these models are expensive to
              | develop and many developers are choosing not to release
              | their model weights at all.
              | 
              | But isn't that true for software too? Software is
              | expensive to develop, and lots of developers/companies
              | are choosing not to make their code public for free.
              | Does that mean you'd also feel it's OK to call software
              | "open source" even though it doesn't allow usage for any
              | purpose? Wouldn't that then lead to more "open source"
              | software being released, at least for individuals and
              | researchers?
        
               | rushingcreek wrote:
                | Yes, I think the same analogy applies. Given a binary
                | choice between a developer not releasing any code at
                | all and releasing code under this type of dual "open-
                | code" license, I'd always take the latter.
        
               | diggan wrote:
               | > Given a binary choice of a developer not releasing any
               | code at all
               | 
               | I mean it wasn't binary earlier, it was "to get more
               | model developers to release", so not a binary choice, but
               | a gradient I suppose. Would you still make the same call
               | for software as you do for ML models and weights?
        
               | hmottestad wrote:
               | I wouldn't equate model weights with source code. You can
               | run software on your own machine without source code, but
               | you can't run an LLM on your own machine without model
               | weights.
               | 
                | Though you could still sell the model weights for
                | local use. I'm not sure we're at the point yet where I
                | myself could buy model weights, but of course if you
                | are a very big company or a very big country, then I
                | guess most AI companies would consider selling you
                | their model weights so you can run them on your own
                | machines.
        
         | echelon wrote:
          | The era of open weights from China appears to be over for
          | some reason. It's all of a sudden, and it seems to be
          | coordinated.
          | 
          | Alibaba just shut off the Qwen releases.
          | 
          | Tencent just shut off the Hunyuan releases.
          | 
          | Bytedance just released Seedream, but it's closed.
          | 
          | It seems like it's over.
         | 
         | They're still clearly training on Western outputs, though.
         | 
         | I still suspect that the strategic thing to do would be to
         | become 100% open and sell infra/service.
        
           | pxc wrote:
           | Why? And can we really say that already? Wasn't the Qwen3
           | release still very recent?
        
           | natrys wrote:
           | > Alibaba just shut off the Qwen releases
           | 
            | Alibaba has from the beginning had some series of models
            | that are always closed-weights (*-max, *-plus, *-turbo
            | etc., but also QvQ). It's not a new development, nor does
            | it prevent their open models. And the VL models are opened
            | up after 2-3 months of general availability in the API.
           | 
           | > Tencent just shut off the Hunyuan releases
           | 
           | Literally released one today:
           | https://huggingface.co/tencent/Hunyuan-A13B-Instruct
        
             | echelon wrote:
             | Hunyuan Image 2.0, which is of Flux quality but has ~20
             | milliseconds of inference time, is being withheld.
             | 
             | Hunyuan 3D 2.5, which is an order of magnitude better than
             | Hunyuan 3D 2.1, is also being withheld.
             | 
             | I suspect that now that they feel these models are superior
             | to Western releases in several categories, they no longer
             | have a need to release these weights.
        
               | natrys wrote:
               | > I suspect that now that they feel these models are
               | superior to Western releases in several categories, they
               | no longer have a need to release these weights.
               | 
                | Yes, that I can totally believe. Standard corporation
                | behaviour (Chinese or otherwise).
               | 
               | I do think DeepSeek would be an exception to this though.
               | But they lack diversity in focus (not even multimodal
               | yet).
        
           | logicchains wrote:
           | What do you mean Tencent just shut off the Hunyuan releases?
           | There was another open weights release just today:
           | https://huggingface.co/tencent/Hunyuan-A13B-Instruct . And
           | the latest Qwen and DeepSeek open weight releases were under
           | 2 months ago, there hasn't been enough time for them to
           | finish a new version since then.
        
             | echelon wrote:
             | Hunyuan Image 2.0 and Hunyuan 3D 2.5 are not being
             | released. They're being put into a closed source web-based
             | offering.
        
           | jacooper wrote:
            | DeepSeek R1 0528, the flagship Chinese model, is open
            | source. Qwen3 is open source. HiDream models are also
            | open source.
        
         | dheera wrote:
         | > One of Qwen's strengths historically has been their open-
         | weights strategy
         | 
         | > let researchers and individuals get the weights for free and
         | let startups pay for a reasonably-priced license for commercial
         | use
         | 
         | I'm personally doubtful companies can recoup tens of millions
         | of dollars in investment, GPU hours, and engineering salaries
         | from image generation fees.
        
       | aredox wrote:
        | I don't think these words mean what they think they do...
        
       | frotaur wrote:
        | Does anybody know if there is a technical report for this, or
        | for other models that generate images in a similar way? I'd
        | really like to understand the architecture behind 4o-like
        | image gen.
        
       | rickydroll wrote:
       | To my eyes, all these images hit the uncanny valley. All the
       | colors and the shadows are just off.
        
         | poly2it wrote:
         | They are all really sloppy. I don't really see the use case for
         | this sort of output outside of research.
        
       | skybrian wrote:
        | I tried the obligatory pelican riding a bicycle (as an image,
        | not SVG) and some accordion images. It has a bit of trouble
        | with fingers and with getting the black keys right. It's
        | fairly fast.
       | 
       | https://chat.qwen.ai/s/0f9d558c-2108-4350-98fb-6ee87065d587?...
        
       | hexmiles wrote:
        | While looking at the examples of editing the bear image, I
        | noticed that the model seemed to change more things than it
        | was strictly asked to.
        | 
        | As an example, when asked to change the background, it also
        | completely changed the bear (it has the same shirt, but the
        | fur and face are clearly different). And when it turned the
        | bear into a balloon, it changed the background (removing the
        | pavement) and lost the left seed in the watermelon.
        | 
        | Is it something that can be fixed with better prompting, or
        | is it a limitation of the model/architecture?
        
         | godelski wrote:
          | > Is it something that can be fixed with better prompting,
          | or is it a limitation of the model/architecture?
         | 
         | Both. You can get better results through better prompting but
         | the root cause of this is a limitation of the architecture and
         | training methods (which are coupled).
        
       | djaychela wrote:
        | How do you stop the auto reading out? Why can't websites just
        | sit there and wait until I ask them to do something? It full-
        | screen auto-played a video on watch and then just started
        | reading?
        | 
        | Firefox on iOS, FTR.
        
       | b0a04gl wrote:
        | The image gets compressed into 256 tokens before the language
        | model sees it. Ask it to add a hat and it redraws the whole
        | face, because objects aren't stored as separate things.
        | There's no persistent bear in memory; it all lives inside one
        | fused latent soup, and outputs are fresh samples under new
        | constraints. Every prompt tweak rebalances the whole
        | embedding, which is why even small changes ripple across the
        | image. I see it as single-shot scene synthesis, which is good
        | for different use cases.
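        | 
        | A toy sketch of why edits ripple (purely illustrative; this
        | is not Qwen's actual architecture, and generate() is a stand-
        | in, not a real API): every output token is re-sampled from
        | one fused context of image tokens plus prompt, so a one-word
        | edit can change nearly every token.
        | 
        |     # Toy single-shot generator: all 256 "image tokens" are
        |     # re-drawn from one fused context. No per-object state.
        |     import hashlib
        | 
        |     def generate(image_tokens, prompt):
        |         ctx = ",".join(map(str, image_tokens)) + "|" + prompt
        |         out = []
        |         for i in range(256):
        |             prev = out[-1] if out else -1
        |             h = hashlib.sha256(f"{ctx}|{i}|{prev}".encode())
        |             out.append(h.digest()[0])  # "sample" next token
        |         return out
        | 
        |     a = generate(range(256), "bear eating watermelon")
        |     b = generate(range(256), "bear eating watermelon, hat")
        |     print(sum(x != y for x, y in zip(a, b)), "/256 changed")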
        
         | leodriesch wrote:
          | That's what I really like about Flux Kontext: it has
          | editing capabilities similar to the multimodal models, but
          | doesn't mess up the details. Editing with gpt-image-1 only
          | really works for complete style changes like "make this
          | Ghibli", but not for adding glasses to a photorealistic
          | image and having it retain all the details.
        
           | vunderba wrote:
           | Agreed. Kontext's ability to basically do the equivalent of
           | img2img inpainting is hugely impressive.
           | 
            | Even when used to add new details, it sticks very
            | strongly to the existing image's overall aesthetic.
            | 
            | https://specularrealms.com/ai-transcripts/experiments-with-f...
        
       | veltas wrote:
        | Rather, I think machine learning has made a lot more progress
        | 'depicting' the world than 'understanding' it.
        
         | ivape wrote:
          | Why do you think humans understand the world any better? We
          | have _emotion_ about the world, but emotions do not grant
          | you understanding, where "understanding" is something you
          | would still need to define.
          | 
          | "I get it" is actually just some arbitrary personal
          | benchmark.
        
       | godelski wrote:
        | As an ML researcher and a degree-holding physicist, I'm
        | really hesitant to use the words "understanding" and
        | "describing" (much less hesitant about the latter) around
        | these models. I don't find the language helpful and think
        | it's mostly harmful, tbh.
       | 
       | The reason we use math in physics is because of its specificity.
       | The same reason coding is so hard [0,1]. I think people aren't
       | giving themselves enough credit here for how much they (you)
       | understand about things. It is the nuances that really matter.
       | There's so much detail here and we often forget how important
       | they are because it is just normal to us. It's like forgetting
       | about the ground you walk upon.
       | 
        | I think something everyone should read is Asimov's
        | "Relativity of Wrong"[2]. This is what we want to see in
        | these systems if we want to start claiming they understand
        | things. We want to see them do deduction and abduction. To be
        | able to refine concepts and ideas. To be able to _discover_
        | things that are more than just a combination of things
        | they've ingested. What's really difficult here is that we
        | train these things on all human knowledge, and just reciting
        | that knowledge back doesn't demonstrate intelligence. It's
        | very unlikely that they losslessly compress that knowledge
        | into these model sizes, but without very deep investigation
        | into that data and probing at this knowledge, it is very hard
        | to tell what a model knows and what it memorizes. Really,
        | this is a very poor way to go about trying to make
        | intelligence[3], or at least to make intelligence and end up
        | knowing it is intelligent.
       | 
        | To really "understand" things we need to be able to propose
        | counterfactuals[4]. Every physics statement is a
        | counterfactual statement. Take F=ma as a trivial example. We
        | can modify the mass or the acceleration to our heart's
        | content and still determine the force. We can observe a
        | specific mass moving at a specific acceleration and then ask
        | the counterfactual "what if it was twice as heavy?" (twice
        | the mass). _We can answer that!_ In fact, your mental model
        | of the world does this too! You may not be describing it with
        | math (maybe you are ;) but you are able to propose
        | counterfactuals and do a pretty good job a lot of the time.
        | That doesn't mean you always need to be right, though. But
        | the way our heads work is through these types of systems. You
        | daydream these things, you imagine them while you play, and
        | all sorts of things. This, I can say with high confidence, is
        | not something modern ML (AI) systems do.
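        | 
        | That counterfactual is trivial to write down as executable
        | math; a minimal illustration:
        | 
        |     # F = m*a as a counterfactual query: hold the observed
        |     # acceleration fixed, intervene on the mass, read off F.
        |     def force(mass, acceleration):
        |         return mass * acceleration
        | 
        |     observed = force(2.0, 3.0)      # m=2 kg, a=3 m/s^2 -> 6 N
        |     what_if  = force(2.0 * 2, 3.0)  # "what if twice as heavy?"
        |     print(observed, what_if)        # 6.0 12.0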
       | 
        | == Edit ==
        | 
        | A good example of the lack of understanding is the image OP
        | uses. Not only does the right hand have the wrong number of
        | fingers, but look at the keys on the keyboard. It does not
        | take much understanding to recognize that you shouldn't have
        | repeated keys... the configuration is all wonky too, like one
        | of those dreams you can immediately tell is a dream[5]. I'd
        | also be willing to bet that the number of keys doesn't align
        | with the number of markers, and the sizing definitely looks
        | off. The more you look at it, the worse it gets, and that's
        | really common with these systems: nice at a quick glance, but
        | _DEEP_ in the uncanny valley at more than a glance, and
        | deeper the more you look.
       | 
       | [0] https://youtube.com/watch?v=cDA3_5982h8
       | 
        | [1] Code is math. There's an isomorphism between Turing-
        | complete languages and computable mathematics. You can look
        | into my namesake, Church, and Turing if you want to get more
        | formal, or wait for the comment that corrects a nuanced
        | mistake here (yes, it exists). Also, note that physics and
        | math are not the same thing, but mathematics is unreasonably
        | effective (yes, this is a reference).
       | 
       | [2] https://hermiene.net/essays-trans/relativity_of_wrong.html
       | 
        | [3] This is a very different statement from "making something
        | useful." Without a doubt these systems are useful. Do not
        | conflate the two.
       | 
       | [4] https://en.wikipedia.org/wiki/Counterfactual_thinking
       | 
       | [5] Yes, you can read in dreams. I do it frequently. Though on
       | occasion I have lucid dreamed because I read something and
       | noticed that it changed when I looked away and looked back.
        
         | BoorishBears wrote:
         | As a person who builds stuff, I'm tired of these strawmen.
         | 
         | It is helpful that they chose words that are widely understood
         | to represent input vs output.
         | 
         | They even used scare quotes to signal they're not making some
         | overly grand claim in terms of the long tail implications of
         | the terms.
         | 
         | -
         | 
          | A person reading the release would learn that previously
          | Qwen had a VLM that could understand/see/perceive/
          | whateverwordyouwanttouse, and now it can generate images,
          | which could be called depicting/drawing/portraying/
          | whateverotherwordyouwanttouse.
         | 
         | We don't have to invent a crisis past that.
        
           | godelski wrote:
           | > As a person who builds stuff, I'm tired of these strawmen.
           | 
           | Who says I don't build stuff?[0]
           | 
            | Allow me to quote Knuth. I think we can agree he built a
            | lot of stuff:
            | 
            |     If you find that you're spending almost all your time
            |     on theory, start turning some attention to practical
            |     things; it will improve your theories. If you find
            |     that you're spending almost all your time on
            |     practice, start turning some attention to theoretical
            |     things; it will improve your practice.
           | 
            | This is important. I don't know you and your beliefs, but
            | some people truly believe theory is useless. But it's the
            | foundation of everything we do.
            | 
            | > We don't have to invent a crisis past that.
            | 
            | You're right. But I'm not inventing one. Qwen isn't the
            | only one here in the larger conversation. Look around the
            | comments and see who can't tell the difference. Look at
            | the announcements companies make. PhD-level intelligence?
            | lol. So I suggest taking your own advice. I've made no
            | strawman...
           | 
            | [0] In my undergrad I did experimental physics, not
            | theory. I then worked as an aerospace engineer for years.
            | I built a literal rocket engine. I built advanced
            | radiation shielding that NASA uses. Then I came back to
            | school, and my PhD is in CS. I build things. Don't assume
            | that wanting to understand things interferes with that.
            | The truth is I'm good at building things _because_ I
            | spend time with theory. See Knuth.
        
       | afro88 wrote:
        | Strangely, the image change examples (edits, style transfer,
        | etc.) have the slight yellow tint that GPT Image 1 (ChatGPT
        | 4o's latest image model) has. Why is that? Flux Kontext
        | doesn't seem to do that.
        
       ___________________________________________________________________
       (page generated 2025-06-27 23:00 UTC)