[HN Gopher] Qwen-Image: Crafting with native text rendering
___________________________________________________________________
Qwen-Image: Crafting with native text rendering
https://huggingface.co/Qwen/Qwen-Image https://qianwen-res.oss-cn-
beijing.aliyuncs.com/Qwen-Image/Q...
Author : meetpateltech
Score : 233 points
Date : 2025-08-04 15:56 UTC (7 hours ago)
(HTM) web link (qwenlm.github.io)
(TXT) w3m dump (qwenlm.github.io)
| djoldman wrote:
| Check out section 3.2, Data Filtering:
|
| https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/Q...
| numpad0 wrote:
| It's also kind of interesting that no languages other than
| English and Chinese are named or shown...
| nickandbro wrote:
| The fact that it doesn't change the images the way 4o image gen
| does is incredible. Often when I try to tweak someone's clothing
| using 4o, it also tweaks their face. This one seems to apply
| those recognizable AI artifacts only to the elements that
| actually need to be edited.
| herval wrote:
| You can select the area you want edited on 4o, and it'll keep
| the rest unchanged
| barefootford wrote:
| gpt doesn't respect masks
| icelancer wrote:
| Correct. Have tried this without much success despite
| OpenAI's claims.
| vunderba wrote:
| That's why Flux Kontext was such a huge deal - it gave you the
| power of img2img inpainting without needing to manually mask
| the content.
|
| https://mordenstar.com/blog/edits-with-kontext
| artninja1988 wrote:
| Insane how many good open-source models Chinese labs have been
| releasing. This really gives me hope
| anon191928 wrote:
| It will take years for most people to adopt these, but at least
| Adobe is no longer the only option.
| herval wrote:
| Adobe has never been alone. Photoshop's AI stuff is
| consistently behind OSS models and workflows. It's just way
| more convenient
| dvt wrote:
| I think Adobe is also very careful about keeping copyrighted
| content out of their models, which inherently makes them lower
| quality.
| herval wrote:
| They have a much better and cleaner dataset than Stable
| Diffusion & others, so I'd expect it to be better with some
| kinds of images (photos in particular)
| doctorpangloss wrote:
| as long as you don't consider the part of the model which
| understands text as part of the model, and as long as you
| don't consider copyrighted text content copyrighted :)
| yjftsjthsd-h wrote:
| Wow, the text/writing is amazing! Also the editing in general,
| but the text really stands out
| rushingcreek wrote:
| Not sure why this isn't a bigger deal --- it seems like this is
| the first open-source model to beat gpt-image-1 in all respects
| while also beating Flux Kontext in terms of editing ability. This
| seems huge.
| zamadatix wrote:
| It's only been a few hours and the demo is constantly erroring
| out, people need more time to actually play with it before
| getting excited. Some quantized GGUFs + various comfy workflows
| will also likely be a big factor for this one since people will
| want to run it locally but it's pretty large compared to other
| models. Funnily enough, the main comparison to draw might be
| between Alibaba and Alibaba: using Wan 2.2 for image generation
| has been an extremely popular choice, so most will want to know
| how big a leap Qwen-Image is from that rather than from Flux.
|
| The best time to judge how good a new image model actually is
| seems to be about a week after launch. That's when enough pieces
| have fallen into place that people have had a chance to really
| mess with it and come up with third-party pros/cons of the
| model. Looking hopeful for this one, though!
| rushingcreek wrote:
| I spun up an H100 on Voltage Park to give it a try in an
| isolated environment. It's really, really good. The only area
| where it seems less strong than gpt-image-1 is in generating
| images of UI (e.g. make me a landing page for Product Hunt in
| the style of Studio Ghibli), but other than that, I am
| impressed.
| hleszek wrote:
| It's not clear from their page but the editing model is not
| released yet: https://github.com/QwenLM/Qwen-
| Image/issues/3#issuecomment-3...
| tetraodonpuffer wrote:
| I think the fact that, as far as I understand, it takes 40GB of
| VRAM to run, is probably dampening some of the enthusiasm.
|
| As an aside, I am not sure why the technology for splitting a
| model across multiple cards is quite mature for LLMs, while for
| image models, despite also using GGUFs, this has not been the
| case. Maybe as image models become bigger there will be more of
| a push to implement it.
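|
| For what it's worth, diffusers can at least spread a pipeline's
| components (transformer, text encoder, VAE) across several GPUs
| via device_map, though that's whole-component placement rather
| than true tensor parallelism, and I don't know whether the
| Qwen-Image pipeline supports it yet. A minimal sketch, assuming
| a recent diffusers build:
|
|     import torch
|     from diffusers import DiffusionPipeline
|
|     # "balanced" places whole components on different GPUs to
|     # pool VRAM; each component still has to fit on one card.
|     pipe = DiffusionPipeline.from_pretrained(
|         "Qwen/Qwen-Image",
|         torch_dtype=torch.bfloat16,
|         device_map="balanced",
|     )
|     image = pipe("a corgi holding a sign that says Hello").images[0]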
| TacticalCoder wrote:
| > I think the fact that, as far as I understand, it takes
| 40GB of VRAM to run, is probably dampening some of the
| enthusiasm.
|
| 40 GB of VRAM? So two GPUs with 24 GB each? That's pretty
| reasonable compared to the kind of machine needed to run the
| latest Qwen coder models (which, btw, are close to SOTA: they
| also beat proprietary models on several benchmarks).
| cellis wrote:
| A 3090 + 2x Titan XP? Technically I have 48 GB, but I don't
| think you can "split it" over multiple cards. At least with
| Flux, it would OOM the Titans and allocate the full 3090.
| cma wrote:
| If 40GB you can lightly quantize and fit it on a 5090.
| reissbaker wrote:
| 40GB is small IMO: you can run it on a mid-tier MacBook
| Pro... or the smallest M3 Ultra Mac Studio! You don't need
| Nvidia if you're doing at-home inference; Nvidia only becomes
| economical at very high throughput, i.e. dedicated inference
| companies. Apple Silicon is much more cost-effective for
| single-user use with small-to-medium-sized models. The M3
| Ultra is roughly on par with a 4090 in terms of memory
| bandwidth, so it won't be much slower, although it won't
| match a 5090.
|
| Also, for a 20B model you only really need 20GB of VRAM: FP8
| is near-identical to FP16, and it's only below FP8 that you
| start to see dramatic drop-offs in quality. So literally any
| Mac Studio available for purchase will do, and even a fairly
| low-end MacBook Pro would work. A 5090 should be able to
| handle it with room to spare as well.
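|
| As a rough sanity check of that arithmetic (weights only; the
| text encoder, VAE and activations add overhead on top):
|
|     # Back-of-the-envelope weight memory: params x bytes per param
|     params = 20e9
|     for name, bytes_per_param in [("fp16", 2), ("fp8", 1), ("nf4", 0.5)]:
|         print(f"{name}: ~{params * bytes_per_param / 1e9:.0f} GB")
|     # fp16: ~40 GB, fp8: ~20 GB, nf4: ~10 GB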
| RossBencina wrote:
| Does M3 Ultra or later have hardware FP8 support on the CPU
| cores?
| jug wrote:
| I think it does way more than gpt-image-1 too?
|
| Besides style transfer, object addition and removal, text
| editing, and manipulation of human poses, it also supports object
| detection, semantic segmentation, depth/edge estimation, super-
| resolution and novel view synthesis (NVS), i.e. synthesizing new
| perspectives from a base image. It's quite a smorgasbord!
|
| Early results indicate to me that gpt-image-1 has a bit better
| sharpness and clarity, but I'm honestly not sure OpenAI isn't
| simply applying a basic unsharp mask or something as a
| post-processing step. I've always been suspicious of that,
| because the sharpness seems oddly uniform even in out-of-focus
| areas, and sometimes a bit much.
|
| Otherwise, yeah this one looks about as good.
|
| Which is impressive! I thought OpenAI had a lead here from
| their unique image generation solution that'd last them this
| year at least.
|
| Oh, and Flux Krea's moment has lasted all of four days since its
| announcement, if this one truly is similar in quality to
| gpt-image-1.
| jacooper wrote:
| Not to mention, flux models are for non-commercial use only.
| doctorpangloss wrote:
| the license for flux models is $1,000/mo, hardly an
| obstacle to any serious commercial usage
| liuliu wrote:
| That's per 100k images, and it's an additional $0.01 per image.
| Considering an H100 is ~$1.50 per hour and you can get one image
| every ~5 s (about 720 images per hour), we're talking about a
| bare-metal cost of ~$0.002 per image plus the $0.01 license cost.
| minimaxir wrote:
| With the notable exception of gpt-image-1, discussion about AI
| image generation has become much less popular. I suspect it's a
| function of a) AI discourse being dominated by AI agents/vibe
| coding and b) the increasing social stigma of AI image
| generation.
|
| Flux Kontext was a gamechanger release for image editing and it
| can do some _absurd_ things, but it's still relatively
| unknown. Qwen-Image, with its more permissive license, could
| lead to much more innovation once the editing model is
| released.
| doctorpangloss wrote:
| gpt-image-1 is the League of Legends of image generation. It
| is a tool in front of like 30 million DAUs...
| ACCount36 wrote:
| Social stigma? Only if you listen to mentally ill Twitter
| users.
|
| It's more that the novelty just wore off. Mainstream image
| generation in online services is "good enough" for most
| casual users - and power users are few, and already knee deep
| in custom workflows. They aren't about to switch to the shiny
| new thing unless they see a lot of benefits to it.
| rwmj wrote:
| This may be obvious to people who do this regularly, but what
| kind of machine is required to run this? I downloaded & tried it
| on my Linux machine that has a 16GB GPU and 64GB of RAM. This
| machine can run SD easily. But Qwen-Image ran out of memory both
| when I tried it on the GPU and on the CPU, so that's obviously
| not enough. Am I off by a factor of two? An order of magnitude?
| Do I need some crazy hardware?
| zippothrowaway wrote:
| You're probably going to have to wait a couple of days for 4
| bit quantized versions to pop up. It's 20B parameters.
| pollinations wrote:
|     import torch
|     from diffusers import DiffusionPipeline
|     from diffusers.quantizers import PipelineQuantizationConfig
|
|     model_name = "Qwen/Qwen-Image"
|     device = "cuda"
|
|     # Configure NF4 quantization for the heavy components
|     quant_config = PipelineQuantizationConfig(
|         quant_backend="bitsandbytes_4bit",
|         quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4",
|                       "bnb_4bit_compute_dtype": torch.bfloat16},
|         components_to_quantize=["transformer", "text_encoder"],
|     )
|
|     # Load the pipeline with NF4 quantization
|     pipe = DiffusionPipeline.from_pretrained(
|         model_name, quantization_config=quant_config,
|         torch_dtype=torch.bfloat16, use_safetensors=True,
|         low_cpu_mem_usage=True,
|     ).to(device)
|
| Seems to use ~17 GB of VRAM like this.
|
| Update: this doesn't work well. The approach in
| https://github.com/QwenLM/Qwen-Image/pull/6/files seems to be
| recommended instead.
| mortsnort wrote:
| I believe it needs roughly as much VRAM as the model files take
| up on disk. If you look in the transformer folder you can see
| there are around nine 5 GB files, so I would expect you need
| ~45 GB of VRAM on your GPU. Quantized versions that can run on
| much less VRAM, with some quality loss, are usually
| released/created eventually.
| foobarqux wrote:
| Why doesn't huggingface list the aggregate model size?
| matcha-video wrote:
| Hugging Face is just a git hosting service, like GitHub. You
| can add up the sizes of all the files in the repo yourself.
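|
| For example, a minimal sketch with huggingface_hub (it sums
| every file in the repo, not just one set of weights):
|
|     from huggingface_hub import HfApi
|
|     info = HfApi().model_info("Qwen/Qwen-Image", files_metadata=True)
|     total_bytes = sum(f.size or 0 for f in info.siblings)
|     print(f"{total_bytes / 1e9:.1f} GB")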
| simonw wrote:
| I've been bugging them about this for a while. There are
| repos that contain multiple model weights in a single repo
| which means adding up the file sizes won't work
| universally, but I'd still find it useful to have a "repo
| size" indicator somewhere.
|
| I ended up building my own tool for that:
| https://tools.simonwillison.net/huggingface-storage
| halJordan wrote:
| Rule of thumb: model size in GB is about the parameter count in
| billions at FP8, so if this was released at FP16 it's 40-ish GB;
| quantized down to FP4, it's 10-ish GB.
| TacticalCoder wrote:
| > I think the fact that, as far as I understand, it takes 40GB
| of VRAM to run, is probably dampening some of the enthusiasm.
|
| For PCs, I take it you'd want one with two PCIe 4.0 x16 (or more
| recent) slots? As in: quite a few consumer motherboards. You
| then put in two GPUs with 24 GB of VRAM each.
|
| A friend runs such a setup (I don't know if he has tried
| Qwen-Image yet): it's not an "out of this world" machine.
| icelancer wrote:
| > This may be obvious to people who do this regularly
|
| This is not that obvious. Calculating VRAM usage for VLMs/LLMs
| is something of an arcane art. There are about 10 calculators
| online you can use and none of them work. Quantization, KV
| caching, activations, layers, etc. all play a role. It's
| annoying.
|
| But anyway, for this model, you need 40+ GB of VRAM. System RAM
| isn't going to cut it unless it's unified RAM on Apple Silicon,
| and even then, memory bandwidth is the bottleneck, so inference
| is much, much slower than on a GPU/TPU.
| cellis wrote:
| Also, I think you need a 40 GB "card", not just 40 GB of VRAM in
| total. I wrote about this upthread: you're probably going to
| need a single card; I'd be surprised if you could chain several
| GPUs together.
| rapfaria wrote:
| Not sure what you mean, or if you're new to LLMs, but two RTX
| 3090s will work for this, and even lower-end cards (RTX 3060)
| will once it's GGUF'd.
| karolist wrote:
| Do you mean https://github.com/pollockjj/ComfyUI-MultiGPU? One
| GPU would do the computation, but the others could pool in for
| VRAM expansion, right? (I've not used this node.)
| liuliu wrote:
| 16GiB RAM with 8-bit quantization.
|
| This is a slightly scaled up SD3 Large model (38 layers -> 60
| layers).
| philipkiely wrote:
| For prod inference, 1xH100 is working well.
| oceanplexian wrote:
| Does anyone know how they actually trained text rendering into
| these models?
|
| To me they all seem to suffer from the same artifact: the text
| looks sort of unnatural and doesn't pick up the shadows and
| reflections of the rest of the image. This applies to all the
| models I have tried, from OpenAI to Flux. Presumably they are
| all using the same trick?
| yorwba wrote:
| It's on page 14 of the technical report. They generate
| synthetic data by putting text on top of an image, apparently
| without taking the original lighting into account. So that's
| the look the model reproduces. Garbage in, garbage out.
|
| Maybe in the future someone will come up with a method for
| putting realistic text into images so that they can generate
| data to train a model for putting realistic text into images.
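|
| For illustration, the naive compositing the report describes
| presumably looks something like this (file names and font are
| made up):
|
|     from PIL import Image, ImageDraw, ImageFont
|
|     # Paste text straight onto a photo: the glyphs ignore the
|     # scene's lighting, shadows and perspective, which is exactly
|     # the look the model ends up reproducing.
|     img = Image.open("background.jpg").convert("RGB")
|     draw = ImageDraw.Draw(img)
|     font = ImageFont.truetype("DejaVuSans.ttf", 72)
|     draw.text((40, 40), "OPEN 24 HOURS", font=font, fill="white",
|               stroke_width=3, stroke_fill="black")
|     img.save("synthetic_sample.jpg")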
| doctorpangloss wrote:
| I'm not sure it's as much garbage as you suggest; surely it's
| helpful for generalization, yes? That's kind of the point of
| self-supervised models.
| halJordan wrote:
| If you think diffusing legible, precise text from pure noise
| is garbage, then wtf are you doing here? The arrogance of the
| IT crowd can be staggering at times.
| Maken wrote:
| Wouldn't it make sense to use rendered images for that?
| sampton wrote:
| Short Canva.
| esafak wrote:
| Team Qwen: Please stop ripping off Studio Ghibli to demo your
| product.
| Destiner wrote:
| The text rendering is impressive, but I don't understand the
| value -- wouldn't it be easier to add any text that you like in
| Figma?
| doctorpangloss wrote:
| The value is that the absence of text where you expect it, or
| the presence of garbled text, is a dead giveaway of AI
| generation. I'm not sure why you're being downvoted; compositing
| text separately seems like a legitimate alternative.
| sipjca wrote:
| It seems like the value is that you don't need another tool to
| composite the text, especially for users who aren't aware of
| Figma/Photoshop or how to use them (many, many people).
| Uehreka wrote:
| I'm interested to see what this model can do, but also kinda
| annoyed at the use of a Studio Ghibli style image as one of the
| first examples. Miyazaki has said over and over that he hates AI
| image generation. Is it really so much to ask that people not
| deliberately train LoRAs and finetunes specifically on his work
| and use them in official documentation?
|
| It reminds me of how CivitAI is full of "sexy Emma Watson" LoRAs,
| presumably because she very notably has said she doesn't want to
| be portrayed in ways that objectify her body. There's a really
| rotten vein of "anti-consent" pulsing through this community,
| where people deliberately seek out people who have asked to be
| left out of this and go "Oh yeah? Well there's nothing you can do
| to stop us, here's several terabytes of exactly what you didn't
| want to happen".
| aabhay wrote:
| Seems a bit drastic to compare Ghibli style transfer to revenge
| porn, but you do you I guess.
| Uehreka wrote:
| It's the anti-consent thing that ties them together. The idea
| of "You asked us to leave you alone, which is why we're
| targeting you."
| littlestymaar wrote:
| Why are you talking about revenge porn here?
| topato wrote:
| I mean, did you really expect anything more from the internet?
| Maybe I'm wrong, but hentai, erotic roleplay, and nudify
| applications seem to still represent a massive portion of AI
| use cases. At least in the case of ero RP, perhaps the
| exploitation of people for pornography might be lessened....
| Uehreka wrote:
| I get that if you can imagine something, it exists, and also
| there is porn of it.
|
| What disappoints me is how aligned the whole community is
| with its worst exponents. That someone went "Heh heh, I'm
| gonna spend hours of my day and hundreds/thousands of dollars
| in compute just to make Miyazaki sad." and then influencers
| in the AI art space saw this happen and went "Hell yeah let's
| go" and promoted the shit out of it making it one of the few
| finetunes to actually get used by normies in the mainstream,
| and then leaders in this field like the Qwen team went "Yeah
| sure let's ride the wave" and made a Studio Ghibli style
| image their first example.
|
| I get that there was no way to physically stop a Studio
| Ghibli LoRA from existing. I still think the community's
| gleeful reaction to it has been gross.
| Zopieux wrote:
| Welcome to the internet, which is for porn (and cat pictures).
| artninja1988 wrote:
| How censored is it?
| Zopieux wrote:
| I love that this is the only thing the community wants to know
| at every announcement of a new model, but no organization wants
| to face the crude reality of human nature.
|
| That, and the weird prudishness of most American people and
| companies.
| vunderba wrote:
| Good release! I've added it to the GenAI Showdown site. Overall
| it's a pretty good model, scoring around 40%, and it definitely
| represents SOTA for something that can reasonably be hosted on
| consumer GPU hardware (even more so once it's quantized).
|
| That being said, it still lags pretty far behind OpenAI's gpt-
| image-1 strictly in terms of prompt adherence for txt2img
| prompting. However, as has already been mentioned elsewhere in
| the thread, this model can do a lot more around editing, etc.
|
| https://genai-showdown.specr.net
| masfuerte wrote:
| > In this case, the paper is less than one-tenth of the entire
| image, and the paragraph of text is relatively long, but the
| model still accurately generates the text on the paper.
|
| Nope. The text includes the line "That dawn will bloom" but the
| render reads "That down will bloom", which is meaningless.
___________________________________________________________________
(page generated 2025-08-04 23:00 UTC)