[HN Gopher] Qwen-Image: Crafting with native text rendering
       ___________________________________________________________________
        
       Qwen-Image: Crafting with native text rendering
        
       https://huggingface.co/Qwen/Qwen-Image  https://qianwen-res.oss-cn-
       beijing.aliyuncs.com/Qwen-Image/Q...
        
       Author : meetpateltech
       Score  : 233 points
       Date   : 2025-08-04 15:56 UTC (7 hours ago)
        
 (HTM) web link (qwenlm.github.io)
 (TXT) w3m dump (qwenlm.github.io)
        
       | djoldman wrote:
        | Check out section 3.2, Data Filtering:
       | 
       | https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/Q...
        
         | numpad0 wrote:
          | It's also kind of interesting that no languages other than
          | English and Chinese are named or shown...
        
       | nickandbro wrote:
       | The fact that it doesn't change the images like 4o image gen is
       | incredible. Often when I try to tweak someone's clothing using
        | 4o, it also tweaks their face. This seems to apply those
        | recognizable AI artifacts only to the elements that need to be
        | edited.
        
         | herval wrote:
         | You can select the area you want edited on 4o, and it'll keep
         | the rest unchanged
        
           | barefootford wrote:
           | gpt doesn't respect masks
        
             | icelancer wrote:
             | Correct. Have tried this without much success despite
             | OpenAI's claims.
        
         | vunderba wrote:
         | That's why Flux Kontext was such a huge deal - it gave you the
         | power of img2img inpainting without needing to manually mask
         | the content.
         | 
         | https://mordenstar.com/blog/edits-with-kontext
        
       | artninja1988 wrote:
       | Insane how many good Chinese open source models they've been
       | releasing. This really gives me hope
        
       | anon191928 wrote:
        | It will take years for people to use these, but Adobe is not
        | alone.
        
         | herval wrote:
         | Adobe has never been alone. Photoshop's AI stuff is
         | consistently behind OSS models and workflows. It's just way
         | more convenient
        
           | dvt wrote:
           | I think Adobe is also very careful with copyrighted content
           | not being a part of their models, which inherently makes them
           | of lower quality.
        
             | herval wrote:
             | They have a much better and cleaner dataset than Stable
             | Diffusion & others, so I'd expect it to be better with some
             | kinds of images (photos in particular)
        
             | doctorpangloss wrote:
             | as long as you don't consider the part of the model which
             | understands text as part of the model, and as long as you
             | don't consider copyrighted text content copyrighted :)
        
       | yjftsjthsd-h wrote:
       | Wow, the text/writing is amazing! Also the editing in general,
       | but the text really stands out
        
       | rushingcreek wrote:
       | Not sure why this isn't a bigger deal --- it seems like this is
       | the first open-source model to beat gpt-image-1 in all respects
       | while also beating Flux Kontext in terms of editing ability. This
       | seems huge.
        
         | zamadatix wrote:
         | It's only been a few hours and the demo is constantly erroring
         | out, people need more time to actually play with it before
         | getting excited. Some quantized GGUFs + various comfy workflows
         | will also likely be a big factor for this one since people will
         | want to run it locally but it's pretty large compared to other
          | models. Funnily enough, the main comparison to draw might be
          | between Alibaba and Alibaba: using Wan 2.2 for image
          | generation has been an extremely popular choice, so most will
          | want to know how big a leap Qwen-Image is from that rather
          | than from Flux.
         | 
         | The best time to judge how good a new image model actually is
         | seems to be about a week from launch. That's when enough pieces
         | have fallen into place that people have had a chance to really
         | mess with it and come out with 3rd party pros/cons of the
         | models. Looking hopeful for this one though!
        
           | rushingcreek wrote:
           | I spun up an H100 on Voltage Park to give it a try in an
           | isolated environment. It's really, really good. The only area
           | where it seems less strong than gpt-image-1 is in generating
           | images of UI (e.g. make me a landing page for Product Hunt in
           | the style of Studio Ghibli), but other than that, I am
           | impressed.
        
         | hleszek wrote:
         | It's not clear from their page but the editing model is not
         | released yet: https://github.com/QwenLM/Qwen-
         | Image/issues/3#issuecomment-3...
        
         | tetraodonpuffer wrote:
         | I think the fact that, as far as I understand, it takes 40GB of
         | VRAM to run, is probably dampening some of the enthusiasm.
         | 
          | As an aside, I am not sure why the technology to spread a
          | model across multiple cards is quite mature for LLMs, while
          | for image models, despite also using GGUFs, this has not been
          | the case. Maybe as image models become bigger there will be
          | more of a push to implement it.
        
           | TacticalCoder wrote:
           | > I think the fact that, as far as I understand, it takes
           | 40GB of VRAM to run, is probably dampening some of the
           | enthusiasm.
           | 
            | 40 GB of VRAM? So two GPUs with 24 GB each? That's pretty
            | reasonable compared to the kind of machine needed to run the
            | latest Qwen coder models (which, by the way, are close to
            | SOTA: they also beat proprietary models on several
            | benchmarks).
        
             | cellis wrote:
              | A 3090 + 2x Titan XP? Technically I have 48 GB, but I
              | don't think you can "split it" over multiple cards. At
              | least with Flux, it would OOM the Titans and allocate the
              | full 3090.
        
           | cma wrote:
           | If 40GB you can lightly quantize and fit it on a 5090.
        
           | reissbaker wrote:
           | 40GB is small IMO: you can run it on a mid-tier Macbook
           | Pro... or the smallest M3 Ultra Mac Studio! You don't need
           | Nvidia if you're doing at-home inference, Nvidia only becomes
           | economical at very high throughput: i.e. dedicated inference
           | companies. Apple Silicon is much more cost effective for
            | single-user use with small-to-medium-sized models. The M3
            | Ultra is roughly on par with a 4090 in terms of memory
           | bandwidth, so it won't be much slower, although it won't
           | match a 5090.
           | 
           | Also for a 20B model, you only really need 20GB of VRAM: FP8
           | is near-identical to FP16, it's only below FP8 that you start
           | to see dramatic drop-offs in quality. So literally any Mac
           | Studio available for purchase will do, and even a fairly low-
           | end Macbook Pro would work as well. And a 5090 should be able
           | to handle it with room to spare as well.
        
             | RossBencina wrote:
             | Does M3 Ultra or later have hardware FP8 support on the CPU
             | cores?
        
         | jug wrote:
         | I think it does way more than gpt-image-1 too?
         | 
         | Besides style transfer, object additions and removals, text
         | editing, manipulation of human poses, it also supports object
         | detection, semantic segmentation, depth/edge estimation, super-
         | resolution and novel view synthesis (NVS) i.e. synthesizing new
         | perspectives from a base image. It's quite a smorgasbord!
         | 
          | Early results indicate to me that gpt-image-1 has a bit
          | better sharpness and clarity, but I'm honestly not sure
          | whether OpenAI simply applies a basic unsharp mask or
          | something similar as a post-processing step. I've always felt
          | suspicious about that, because the sharpness seems oddly
          | uniform even in out-of-focus areas, and sometimes a bit much.
         | 
         | Otherwise, yeah this one looks about as good.
         | 
         | Which is impressive! I thought OpenAI had a lead here from
         | their unique image generation solution that'd last them this
         | year at least.
         | 
          | Oh, and Flux Krea has lasted only four days since its
          | announcement, assuming this one truly is similar in quality
          | to gpt-image-1!
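
          A minimal sketch of the basic unsharp-mask post-processing
          speculated above, using Pillow; the file names and the
          radius/percent/threshold values are arbitrary assumptions:

              from PIL import Image, ImageFilter

              img = Image.open("generated.png")
              # Sharpen the whole frame uniformly, which would explain
              # sharpness that looks the same even in out-of-focus areas
              sharpened = img.filter(ImageFilter.UnsharpMask(
                  radius=2, percent=150, threshold=3))
              sharpened.save("generated_sharpened.png")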
        
           | jacooper wrote:
           | Not to mention, flux models are for non-commercial use only.
        
             | doctorpangloss wrote:
             | the license for flux models is $1,000/mo, hardly an
             | obstacle to any serious commercial usage
        
               | liuliu wrote:
                | Per 100k images. And it is additionally $0.01 per
                | image. Considering an H100 is $1.5 per hour and you can
                | get 1 image per 5s, we are talking about a bare-metal
                | cost of ~$0.002 per image + $0.01 license cost.
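
                The arithmetic quoted above as a quick sketch (all
                figures are the ones from the comment, not measurements):

                    h100_per_hour = 1.50     # USD, rented H100
                    seconds_per_image = 5
                    images_per_hour = 3600 / seconds_per_image    # 720
                    compute_cost = h100_per_hour / images_per_hour
                    license_cost = 0.01      # per-image Flux license fee
                    total_cost = compute_cost + license_cost
                    print(f"~${total_cost:.3f} per image")  # ~$0.012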
        
         | minimaxir wrote:
         | With the notable exception of gpt-image-1, discussion about AI
         | image generation has become much less popular. I suspect it's a
         | function of a) AI discourse being dominated by AI agents/vibe
         | coding and b) the increasing social stigma of AI image
         | generation.
         | 
         | Flux Kontext was a gamechanger release for image editing and it
          | can do some _absurd_ things, but it's still relatively
         | unknown. Qwen-Image, with its more permissive license, could
         | lead to much more innovation once the editing model is
         | released.
        
           | doctorpangloss wrote:
           | gpt-image-1 is the League of Legends of image generation. It
           | is a tool in front of like 30 million DAUs...
        
           | ACCount36 wrote:
           | Social stigma? Only if you listen to mentally ill Twitter
           | users.
           | 
           | It's more that the novelty just wore off. Mainstream image
           | generation in online services is "good enough" for most
           | casual users - and power users are few, and already knee deep
           | in custom workflows. They aren't about to switch to the shiny
           | new thing unless they see a lot of benefits to it.
        
       | rwmj wrote:
       | This may be obvious to people who do this regularly, but what
       | kind of machine is required to run this? I downloaded & tried it
       | on my Linux machine that has a 16GB GPU and 64GB of RAM. This
       | machine can run SD easily. But Qwen-image ran out of space both
       | when I tried it on the GPU and on the CPU, so that's obviously
       | not enough. But am I off by a factor of two? An order of
       | magnitude? Do I need some crazy hardware?
        
         | zippothrowaway wrote:
         | You're probably going to have to wait a couple of days for 4
         | bit quantized versions to pop up. It's 20B parameters.
        
            | pollinations wrote:
            | # Configure NF4 quantization (imports added; model_name and
            | # device are assumed, e.g. "Qwen/Qwen-Image" and "cuda")
            | import torch
            | from diffusers import DiffusionPipeline
            | from diffusers.quantizers import PipelineQuantizationConfig
            | 
            | quant_config = PipelineQuantizationConfig(
            |     quant_backend="bitsandbytes_4bit",
            |     quant_kwargs={"load_in_4bit": True,
            |                   "bnb_4bit_quant_type": "nf4",
            |                   "bnb_4bit_compute_dtype": torch.bfloat16},
            |     components_to_quantize=["transformer", "text_encoder"],
            | )
            | 
            | # Load the pipeline with NF4 quantization
            | pipe = DiffusionPipeline.from_pretrained(
            |     model_name,
            |     quantization_config=quant_config,
            |     torch_dtype=torch.bfloat16,
            |     use_safetensors=True,
            |     low_cpu_mem_usage=True,
            | ).to(device)
            | 
            | Seems to use 17 GB of VRAM like this.
            | 
            | Update: doesn't work well. This approach seems to be
            | recommended instead:
            | https://github.com/QwenLM/Qwen-Image/pull/6/files
        
         | mortsnort wrote:
          | I believe it needs roughly as much VRAM as the size of the
          | model files. If you look in the transformer folder you can
          | see there are around nine 5 GB files, so I would expect you
          | need ~45 GB of VRAM on your GPU. Usually quantized versions
          | of models are eventually released/created that can run with
          | much less VRAM but with some quality loss.
        
           | foobarqux wrote:
           | Why doesn't huggingface list the aggregate model size?
        
             | matcha-video wrote:
             | Huggingface is just a git hosting service, like github. You
             | can add up the sizes of all the files in the directory
             | yourself
        
             | simonw wrote:
             | I've been bugging them about this for a while. There are
             | repos that contain multiple model weights in a single repo
             | which means adding up the file sizes won't work
             | universally, but I'd still find it useful to have a "repo
             | size" indicator somewhere.
             | 
             | I ended up building my own tool for that:
             | https://tools.simonwillison.net/huggingface-storage
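
              For a single-model repo, a sketch of summing the file sizes
              with huggingface_hub (the repo id is the one from this
              thread; files_metadata=True is needed so that each file
              entry carries its size):

                  from huggingface_hub import HfApi

                  info = HfApi().model_info("Qwen/Qwen-Image",
                                            files_metadata=True)
                  total = sum(f.size or 0 for f in info.siblings)
                  gb = total / 1e9
                  print(f"{gb:.1f} GB in {len(info.siblings)} files")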
        
           | halJordan wrote:
            | Model size in GB roughly equals the parameter count in
            | billions at FP8, so 40-ish GB if this was released at FP16
            | and 10-ish GB if it's quantized to FP4.
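
            A back-of-envelope sketch of that rule for a 20B-parameter
            model (weights only; real usage adds the text encoder, VAE,
            activations, and framework overhead on top):

                params = 20e9
                for label, bytes_per_param in [("fp16", 2), ("fp8", 1),
                                               ("fp4", 0.5)]:
                    gb = params * bytes_per_param / 1e9
                    print(f"{label}: ~{gb:.0f} GB")
                # fp16: ~40 GB, fp8: ~20 GB, fp4: ~10 GB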
        
         | TacticalCoder wrote:
         | > I think the fact that, as far as I understand, it takes 40GB
         | of VRAM to run, is probably dampening some of the enthusiasm.
         | 
          | For PCs, I take it you need one that has two PCIe 4.0 x16 (or
          | more recent) slots? As in: quite a few consumer motherboards.
          | You then put in two GPUs with 24 GB of VRAM each.
          | 
          | A friend runs such a setup (I don't know if they've tried
          | Qwen-Image yet): it's not an "out of this world" machine.
        
         | icelancer wrote:
         | > This may be obvious to people who do this regularly
         | 
         | This is not that obvious. Calculating VRAM usage for VLMs/LLMs
         | is something of an arcane art. There are about 10 calculators
          | online you can use and none of them work. Quantization, KV
          | caching, activations, layers, etc. all play a role. It's
         | annoying.
         | 
         | But anyway, for this model, you need 40+ GB of VRAM. System RAM
         | isn't going to cut it unless it's unified RAM on Apple Silicon,
         | and even then, memory bandwidth is shot, so inference is much
         | much slower than GPU/TPU.
        
           | cellis wrote:
            | Also I think you need a 40 GB "card", not just 40 GB of
            | VRAM. I wrote about this upthread: you're probably going to
            | need one card; I'd be surprised if you could chain several
            | GPUs together.
        
             | rapfaria wrote:
              | Not sure what you mean, or if you're new to LLMs, but two
              | RTX 3090s will work for this, and even lower-end cards
              | (RTX 3060) will once it's GGUF'd.
        
               | karolist wrote:
                | Do you mean
                | https://github.com/pollockjj/ComfyUI-MultiGPU? One GPU
                | would do the computation, but others could pool in for
                | VRAM expansion, right? (I've not used this node)
        
         | liuliu wrote:
         | 16GiB RAM with 8-bit quantization.
         | 
         | This is a slightly scaled up SD3 Large model (38 layers -> 60
         | layers).
        
         | philipkiely wrote:
         | For prod inference, 1xH100 is working well.
        
       | oceanplexian wrote:
       | Does anyone know how they actually trained text rendering into
       | these models?
       | 
       | To me they all seem to suffer from the same artifacts, that the
       | text looks sort of unnatural and doesn't have the correct
       | shadows/reflections as the rest of the image. This applies to all
       | the models I have tried, from OpenAI to Flux. Presumably they are
       | all using the same trick?
        
         | yorwba wrote:
         | It's on page 14 of the technical report. They generate
         | synthetic data by putting text on top of an image, apparently
         | without taking the original lighting into account. So that's
         | the look the model reproduces. Garbage in, garbage out.
         | 
         | Maybe in the future someone will come up with a method for
         | putting realistic text into images so that they can generate
         | data to train a model for putting realistic text into images.
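
          A minimal sketch of the kind of synthetic sample described
          above: pasting text on top of an existing image without
          modeling its lighting (the font, file names, and text are
          assumptions for illustration):

              from PIL import Image, ImageDraw, ImageFont

              img = Image.open("background.jpg").convert("RGB")
              draw = ImageDraw.Draw(img)
              font = ImageFont.truetype("DejaVuSans.ttf", 48)
              # Flat white text with no shadows or reflections, which is
              # exactly the look the trained model ends up reproducing
              draw.text((50, 50), "Qwen-Image", font=font,
                        fill=(255, 255, 255))
              img.save("synthetic_text_sample.jpg")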
        
           | doctorpangloss wrote:
           | i'm not sure if that's such garbage as you suggest, surely it
           | is helpful for generalization yes? kind of the point of self-
           | supervised models
        
           | halJordan wrote:
           | If you think diffusing legible, precise text from pure noise
            | is garbage, then wtf are you doing here? The arrogance of
            | the IT crowd can be staggering at times.
        
           | Maken wrote:
           | Wouldn't it make sense to use rendered images for that?
        
       | sampton wrote:
        | Short Canva.
        
       | esafak wrote:
       | Team Qwen: Please stop ripping off Studio Ghibli to demo your
       | product.
        
       | Destiner wrote:
       | The text rendering is impressive, but I don't understand the
       | value -- wouldn't it be easier to add any text that you like in
       | Figma?
        
         | doctorpangloss wrote:
         | the value is: the absence of text where you expect it, and the
         | presence of garbled text, are dead giveaways of AI generation.
         | i'm not sure why you are being downvoted, compositing text
         | seems like a legitimate alternative.
        
           | sipjca wrote:
           | it seems like the value is that you don't need another tool
           | to composite the text. especially for users who aren't aware
           | of figma/photoshop nor how to use them (many many many
           | people)
        
       | Uehreka wrote:
       | I'm interested to see what this model can do, but also kinda
       | annoyed at the use of a Studio Ghibli style image as one of the
       | first examples. Miyazaki has said over and over that he hates AI
       | image generation. Is it really so much to ask that people not
       | deliberately train LoRAs and finetunes specifically on his work
       | and use them in official documentation?
       | 
       | It reminds me of how CivitAI is full of "sexy Emma Watson" LoRAs,
       | presumably because she very notably has said she doesn't want to
       | be portrayed in ways that objectify her body. There's a really
       | rotten vein of "anti-consent" pulsing through this community,
       | where people deliberately seek out people who have asked to be
       | left out of this and go "Oh yeah? Well there's nothing you can do
       | to stop us, here's several terabytes of exactly what you didn't
       | want to happen".
        
         | aabhay wrote:
         | Seems a bit drastic to compare Ghibli style transfer to revenge
         | porn, but you do you I guess.
        
           | Uehreka wrote:
           | It's the anti-consent thing that ties them together. The idea
           | of "You asked us to leave you alone, which is why we're
           | targeting you."
        
           | littlestymaar wrote:
           | Why are you talking about revenge porn here?
        
         | topato wrote:
         | I mean, did you really expect anything more from the internet?
         | Maybe I'm wrong, but hentai, erotic roleplay, and nudify
         | applications seem to still represent a massive portion of AI
         | use cases. At least in the case of ero RP, perhaps the
         | exploitation of people for pornography might be lessened....
        
           | Uehreka wrote:
           | I get that if you can imagine something, it exists, and also
           | there is porn of it.
           | 
           | What disappoints me is how aligned the whole community is
           | with its worst exponents. That someone went "Heh heh, I'm
           | gonna spend hours of my day and hundreds/thousands of dollars
           | in compute just to make Miyazaki sad." and then influencers
           | in the AI art space saw this happen and went "Hell yeah let's
           | go" and promoted the shit out of it making it one of the few
           | finetunes to actually get used by normies in the mainstream,
           | and then leaders in this field like the Qwen team went "Yeah
           | sure let's ride the wave" and made a Studio Ghibli style
           | image their first example.
           | 
           | I get that there was no way to physically stop a Studio
           | Ghibli LoRA from existing. I still think the community's
           | gleeful reaction to it has been gross.
        
         | Zopieux wrote:
         | Welcome to the internet, which is for porn (and cat pictures).
        
       | artninja1988 wrote:
       | How censored is it?
        
         | Zopieux wrote:
          | I love that this is the only thing the community wants to
          | know at every announcement of a new model, but no
          | organization wants to face the crude reality of human nature.
          | 
          | That, and the weird prudishness of most American people and
          | companies.
        
       | vunderba wrote:
        | Good release! I've added it to the GenAI Showdown site. Overall
        | a pretty good model, scoring around 40%, and it definitely
        | represents SOTA for something that could be reasonably hosted on
        | consumer GPU hardware (even more so when it's quantized).
       | 
       | That being said, it still lags pretty far behind OpenAI's gpt-
       | image-1 strictly in terms of prompt adherence for txt2img
       | prompting. However as has already been mentioned elsewhere in the
       | thread, this model can do a lot more around editing, etc.
       | 
       | https://genai-showdown.specr.net
        
       | masfuerte wrote:
       | > In this case, the paper is less than one-tenth of the entire
       | image, and the paragraph of text is relatively long, but the
       | model still accurately generates the text on the paper.
       | 
       | Nope. The text includes the line "That dawn will bloom" but the
       | render reads "That down will bloom", which is meaningless.
        
       ___________________________________________________________________
       (page generated 2025-08-04 23:00 UTC)