[HN Gopher] 4o Image Generation
       ___________________________________________________________________
        
       4o Image Generation
        
       Author : meetpateltech
       Score  : 405 points
       Date   : 2025-03-25 18:06 UTC (4 hours ago)
        
 (HTM) web link (openai.com)
 (TXT) w3m dump (openai.com)
        
       | minimaxir wrote:
       | OpenAI's livestream of GPT-4o Image Generation shows that it is
        | slowwwwwwwwww (maybe 30 seconds per image, which Sam Altman had
        | to spin as "it's slow, but the generated images are worth it").
       | Instead of using a diffusion approach, it appears to be
       | generating the image tokens and decoding them akin to the
       | original DALL-E (https://openai.com/index/dall-e/), which allows
       | for streaming partial generations from top to bottom. In
       | contrast, Google's Gemini can generate images and make edits in
       | seconds.
       | 
       | No API yet, and given the slowness I imagine it will cost much
       | more than the $0.03+/image of competitors.
        
         | kevmo314 wrote:
         | Maybe this is the dialup of the era.
        
           | ijidak wrote:
           | Ha. That's a good analogy.
           | 
           | When I first read the parent comment, I thought, maybe this
           | is a long-term architecture concern...
           | 
           | But your message reminded me that we've been here before.
        
           | asadm wrote:
            | Especially with the slow loading effect it has.
        
         | cubefox wrote:
          | LLMs are autoregressive, so they can't be (multimodally)
         | integrated with diffusion image models, only with
         | autoregressive image models (which generate an image via image
         | tokens). Historically those had lower image fidelity than
         | diffusion models. OpenAI now seems to have solved this problem
         | somehow. More than that, they appear far ahead of any available
         | diffusion model, including Midjourney and Imagen 3.
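          | 
          | A toy sketch of that image-token idea (dummy stand-ins for the
          | transformer and the VQ decoder, nothing like OpenAI's actual
          | system):
          | 
          |     import numpy as np
          |     
          |     rng = np.random.default_rng(0)
          |     VOCAB, GRID = 512, 16  # 512 image tokens, 16x16 token grid
          |     
          |     def toy_transformer(tokens):
          |         # stand-in for the LLM: returns fake logits
          |         return rng.random(VOCAB)
          |     
          |     def toy_vq_decoder(tokens):
          |         # stand-in for a VQ decoder: token ids -> gray pixels
          |         grid = np.array(tokens).reshape(GRID, GRID)
          |         return (grid * 255 // VOCAB).astype(np.uint8)
          |     
          |     tokens = []  # prompt text tokens would precede these
          |     for _ in range(GRID * GRID):  # one token at a time, row by row
          |         logits = toy_transformer(tokens)
          |         tokens.append(int(logits.argmax()))
          |     image = toy_vq_decoder(tokens)
          |     # decoding partial rows gives the top-to-bottom streaming effect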
         | 
         | Gemini "integrates" Imagen 3 (a diffusion model) only via a
         | tool that Gemini calls internally with the relevant prompt. So
         | it's not a true multimodal integration, as it doesn't benefit
         | from the advanced prompt understanding of the LLM.
         | 
         | Edit: Apparently Gemini also has an experimental native image
         | generation ability.
        
           | argsnd wrote:
           | Is this the same for their gemini-2.0-flash-exp-image-
           | generation model?
        
             | cubefox wrote:
              | No, that does indeed seem to be a native part of the
              | multimodal Gemini model. I didn't know this existed; it's
              | not available in the normal Gemini interface.
        
               | lxgr wrote:
               | This is a pretty good example of the current state of
               | Google LLMs:
               | 
               | The (no longer, I guess) industry-leading features people
               | actually want are hidden away in some obscure "AI studio"
               | with horrible usability, while the headline Gemini app
               | still often refuses to do anything useful for me.
                | (Disclaimer: I last checked a couple of months ago, after
                | several more rounds of mild amusement/great frustration.)
        
               | tough wrote:
               | hey at least now they bought ai.dev and redirected it to
               | their bad ux
        
           | echelon wrote:
           | I expect the Chinese to have an open source answer for this
           | soon.
           | 
           | They haven't been focusing attention on images because the
           | most used image models have been open source. Now they might
           | have a target to beat.
        
             | rfoo wrote:
             | ByteDance has been working on autoregressive image
             | generation for a while (see VAR, NeurIPS 2024 best paper).
             | Traditionally they weren't in the open-source gang though.
        
               | cubefox wrote:
               | The VAR paper is very impressive. I wonder if OpenAI did
               | something similar. But the main contribution in the new
               | GPT-4o feature doesn't seem to be just image quality
               | (which VAR seems to focus on), but also massively
               | enhanced prompt understanding.
        
           | summerlight wrote:
            | Your understanding seems outdated; I think people are
            | referring to Gemini's native image generation.
        
           | SweetSoftPillow wrote:
            | Gemini added their multimodal Flash model to Google AI Studio
            | some time ago. It does not use Imagen via a tool; it uses
            | native capabilities to manipulate images, and it's free to
            | try.
        
           | johntb86 wrote:
           | Meta has experimented with a hybrid mode, where the LLM uses
           | autoregressive mode for text, but within a set of delimiters
           | will switch to diffusion mode to generate images. In
           | principle it's the best of both worlds.
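            | 
            | A toy sketch of that delimiter-switching idea (all names
            | here are made-up stubs, not Meta's actual interfaces):
            | 
            |     def toy_next_token(context):
            |         # stand-in autoregressive LM: replays a fixed script
            |         script = ["Here", "is", "a", "cat:", "<img>", "</img>"]
            |         if len(context) < len(script):
            |             return script[len(context)]
            |         return None
            |     
            |     def toy_diffusion(conditioning):
            |         # stand-in diffusion sampler conditioned on text so far
            |         return "[pixels for: " + " ".join(conditioning) + "]"
            |     
            |     context, output = [], []
            |     while (tok := toy_next_token(context)) is not None:
            |         context.append(tok)
            |         if tok == "<img>":
            |             output.append(toy_diffusion(context))  # image mode
            |         elif tok != "</img>":
            |             output.append(tok)  # plain autoregressive text
            |     print(" ".join(output))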
        
         | infecto wrote:
         | As a user, images feel slightly slower but comparable to the
         | previous generation. Given the significant quality improvement,
         | it's a fair trade-off. Overall, it feels snappy, and the value
         | justifies a higher price.
        
       | rvz wrote:
       | > ChatGPT's new image generation in GPT-4o rolls out starting
       | today to Plus, Pro, Team, and Free users as the default image
       | generator in ChatGPT, with access coming soon to Enterprise and
       | Edu. For those who hold a special place in their hearts for
        | DALL·E, it can still be accessed through a dedicated DALL·E GPT.
       | 
       | > Developers will soon be able to generate images with GPT-4o via
       | the API, with access rolling out in the next few weeks.
       | 
        | That's it, folks. Tens of thousands of so-called "AI" image
        | generator startups have been obliterated, taking digital artists
        | with them, all reduced to near zero.
       | 
       | Now you have a widely accessible meme generator with the name
       | "ChatGPT".
       | 
        | The last task is for an open-weight model that competes against
        | this, is faster, and is all for free.
        
         | afro88 wrote:
          | Yep. The coherence and text quality are insanely good. Keen to
          | play with it to find its "mangled hands"-style deficiencies,
          | because of course they cherry-picked the best examples.
        
         | dragonwriter wrote:
         | > Tens of thousands of so-called "AI" image generator startups
         | have been obliterated and taking digital artists with them all
         | reduced to near zero. Now you have a widely accessible meme
         | generator with the name "ChatGPT".
         | 
          | ChatGPT has already had that via DALL-E. If it didn't kill
          | those startups when that happened, this doesn't fundamentally
          | change anything. Now it's got a new image gen model, which --
          | like DALL-E 3 when it came out -- is competitive with or ahead
          | of other SotA base models using just text prompts (the
          | simplest generation workflow), but both more expensive and
          | less adaptable to more involved workflows than the tools
          | anyone more than a casual user (whether using local tools or
          | hosted services) is using. This is station-keeping for OpenAI,
          | not a meaningful change in the landscape.
        
           | og_kalu wrote:
            | There are several examples here, especially in the videos,
            | that no existing image-gen model can do, and that would
            | require tedious workflows and/or training regimens to
            | replicate, maybe.
           | 
            | It's not 'just' a new model a la Imagen 3. This is 'what if
           | GPT could transform images nearly as well as text?' and that
           | opens up a lot of possibilities. It's definitely a meaningful
           | change.
        
       | occamschainsaw wrote:
       | Did they time it with the Gemini 2.5 launch?
       | https://news.ycombinator.com/item?id=43473489
       | 
       | Was it public information when Google was going to launch their
       | new models? Interesting timing.
        
         | qoez wrote:
         | "Interesting timing" It's like the 4th time by my counting
         | they've done this
        
         | aabhay wrote:
         | OpenAI was started with the express goal of undermining
         | Google's potential lead in AI. The fact that they time launches
         | to Google launches to me indicates they still see this as a
         | meaningful risk. And with this launch in particular I find
         | their fears more well-founded than ever.
        
       | qoez wrote:
        | Looks about like what you'd get with FLUX plus some language
        | model to enhance your prompt with e.g. more text.
        
         | afro88 wrote:
          | Flux doesn't do text that well.
        
         | echelon wrote:
         | Exactly. OpenAI isn't going to win image and video.
         | 
         | Sora is one of the _worst_ video generators. The Chinese have
         | really taken the lead in video with Kling, Hailuo, and the open
         | source Wan and Hunyuan.
         | 
         | Wan with LoRAs will enable real creative work. Motion control,
         | character consistency. There's no place for an OpenAI Sora type
         | product other than as a cheap LLM add-in.
        
         | minimaxir wrote:
         | Flux 1.1 Pro has good prompt adherence, but some of these
          | (admittedly cherry-picked) GPT-4o generated image demos are
         | beyond what you would get with Flux without a lot of iteration,
         | particularly the large paragraphs of text.
         | 
         | I'm excited to see what a Flux 2 can do if it can actually use
         | a modern text encoder.
        
           | echelon wrote:
           | Structural editing and control nets are much more powerful
           | than text prompting alone.
           | 
           | The image generators used by creatives will not be text-
           | first.
           | 
           | "Dragon with brown leathery scales with an elephant texture
           | and 10% reflectivity positioned three degrees under the
           | mountain, which is approximately 250 meters taller than the
           | next peak, ..." is not how you design.
           | 
           | Creative work is not 100% dice rolling in a crude and
           | inadequate language. Encoding spatial and qualitative details
           | is impossible. "A picture is worth a thousand words" is an
           | understatement.
        
             | jjmarr wrote:
             | Yeah, but then it no longer replaces human artists.
             | 
             | Controlnet has been the obvious future of image-generation
             | for a while now.
        
               | echelon wrote:
               | We're not trying to replace human artists. We're trying
               | to make them more efficient.
               | 
               | We might find that the entire "studio system" is a gross
               | inefficiency and that individual artists and directors
               | can self-publish like on Steam or YouTube.
        
               | dragonwriter wrote:
               | > Yeah, but then it no longer replaces human artists.
               | 
               | Automation tools are always more powerful as a force
               | multiplier for skilled users than a complete replacement.
               | (Which is still a replacement on any given task scope,
               | since it reduces the number of human labor hours -- and,
               | given any elapsed time constraints, human laborers --
               | needed.)
        
             | minimaxir wrote:
             | Prompt adherence and additional tricks such as
             | ControlNet/ComfyUI pipelines are not mutually exclusive.
             | Both are very important to get good image generation
             | results.
        
               | Der_Einzige wrote:
               | It is when it's kept behind an API. You cannot use
               | Controlnet/ComfyUI and especially not the best stuff like
               | regional prompting with this model. You can't do it with
               | Gemini, and that's by design because otherwise coomers
               | are going to generate 999999 anime waifus like they do on
               | Civit.ai.
        
               | Y_Y wrote:
               | That just elicits a cheeky refusal I'm afraid:
               | 
               | """
               | 
               | That's a fun idea--but generating an image with 999,999
               | anime waifus in it isn't technically possible due to
               | visual and processing limits. But we can get creative.
               | 
               | Want me to generate:
               | 
               | 1. A massive crowd of anime waifus (like a big collage or
               | crowd scene)?
               | 
               | 2. A stylized representation of "999999 anime waifus"
               | (maybe with a few in focus and the rest as silhouettes or
               | a sea of colors)?
               | 
               | 3. A single waifu with a visual reference to the number
               | 999999 (like a title, emblem, or digital counter in the
               | background)?
               | 
               | Let me know your vibe--epic, funny, serious, chaotic?
               | 
               | """
        
             | voxic11 wrote:
             | It can do in-context learning from images you upload. So
             | you can just upload a depth map or mark up an image with
             | the locations of edits you want and it should be able to
             | handle that. I guess my point is that since its the same
             | model that understands how to see images and how to
             | generate them you aren't restricted from interacting with
             | it via text only.
        
       | shaky-carrousel wrote:
       | Tried it, the "compise armporressed" and "Pros: made bord
       | reqotons" didn't impress me in the slightest.
        
         | BoorishBears wrote:
         | Are you sure you were even using the model from the post?
        
           | shaky-carrousel wrote:
           | Pressed the "Try in ChatGPT", pasted the first prompt, became
           | thoroughly unimpressed.
        
       | pton_xd wrote:
       | Can you specify the output dimensions?
       | 
       | EDIT: Seems not, "The smallest image size I can generate is
       | 1024x1024. Would you like me to proceed with that, or would you
       | like a different approach?"
        
         | minimaxir wrote:
         | I suspect you can prompt aspect ratios.
        
       | resource_waste wrote:
        | LPT: while the benchmarks don't show it, ChatGPT-4 > 4o. It
        | amazes me that people use 4o at all. But hey, it's the brand name
        | and it's free.
        | 
        | ofc 4.5 is best, but it's slow and I am afraid I'm going to hit
        | limits.
        
         | minimaxir wrote:
          | OpenAI themselves discourage using GPT-4 outside of legacy
         | applications, in favor of GPT-4o instead (they are shutting
         | down the large output gpt-4-32k variants in a few months).
         | GPT-4 is also an order of magnitude more expensive/slower.
        
           | zamadatix wrote:
           | I think both of these points are what sow doubt in some
           | people in the first place because both could be true if GPT-4
           | was just less profitable to run, not if it was worse in
           | quality. Of course it is actually worse in quality than 4o by
           | any reasonable metric... but I guess not everyone sees it
           | that way.
        
       | xnx wrote:
       | Will be interesting to see how this ranks against Google Imagen
       | and Reve. https://huggingface.co/spaces/ArtificialAnalysis/Text-
       | to-Ima...
        
       | jfoster wrote:
       | The character consistency and UI capabilities seem like they open
       | up a lot of new use cases.
        
         | JTyQZSnP3cQGa8B wrote:
         | I'd like to know which use case because 2 years ago no one
         | cared about pictures or generating stuff, and now it's vital
         | for humanity but people can't explain why.
        
           | Alupis wrote:
           | For creating believable fake images...
           | 
           | We're largely past the days of 7 fingered hands - text
           | remains one of the tell-tale signs.
        
           | jfoster wrote:
           | Well I definitely wouldn't say it's vital for humanity. Has
           | anyone actually said that?
           | 
           | Character consistency means that these models could now
           | theoretically illustrate books, as one example.
           | 
           | Generating UIs seems like it would be very helpful for any
           | app design or prototyping.
        
           | olalonde wrote:
           | Never heard about professional photographers, stock
           | photography, graphic artists, etc.?
        
           | colesantiago wrote:
           | It's vital for grifting and not paying those cheeky expensive
           | artists a dime.
           | 
           | That is a great right, as long as it's not programmers.
        
             | BoorishBears wrote:
             | I work on a product for generating interactive fanfiction
             | using an LLM, and I've put a lot of work into post-training
             | to improve writing quality to match or exceed typical human
             | levels.
             | 
             | I'm excited about this for adding images to those
             | interactive stories.
             | 
             | It has nothing to do with circumventing the cost of artists
             | or writers: regardless of cost, no one can put out a story
             | and then rewrite it based on whatever idea pops into every
             | reader's mind for their own personal main character.
             | 
             | It's a novel experience that only a "writer" that scales by
             | paying for an inanimate object to crunch numbers can
             | enable.
             | 
             | Similarly no artist can put out a piece of art for that
             | story and then go and put out new art bespoke to every
             | reader's newly written story.
             | 
             | -
             | 
              | I think there's this weird obsession with framing these
              | tools as being built just to replace the people currently
              | doing similar things. Just speaking objectively: the market
             | for replacing "cheeky expensive artists" would not justify
             | building these tools.
             | 
              | The most interesting applications of this technology are
              | being able to do things that are simply not possible today,
              | even if you have all the money in the world.
             | 
             | And for the record, I'll be _ecstatic_ for the day an AI
              | can reach my level of competency in building software.
              | I've been doing it since I was a child because I love it,
             | it's the one skill I've ever been paid for, and I'd still
             | be over the moon because it'd let me explore so many more
             | ideas than I alone can ever hope to build.
        
             | bufferoverflow wrote:
              | > _That is a great right, as long as it's not
             | programmers._
             | 
             | You realize that almost weekly we have new AI models coming
             | out that are better and better at programming? It just
              | happened that image generation is an easier problem
             | than programming. But make no mistake, AI is coming for us
             | too.
             | 
             | That's the price of automating everything.
        
       | bbor wrote:
       | Whelp. That's terrifying.
        
       | gs17 wrote:
       | This is really impressive, but the "Best of 8" tag on a lot of
       | them really makes me want to see how cherry-picked they are. My
       | three free images had two impressive outputs and one failure.
        
         | do_not_redeem wrote:
         | The high five looks extremely unnatural. Their wrists are
         | aligned, but their fingers aren't, somehow?
         | 
         | If that's best of 8, I'd love to see the outtakes.
        
           | tiahura wrote:
           | Agreed. It seems totally unnatural that a couple of nerds
           | high-five awkwardly.
        
             | do_not_redeem wrote:
             | Not awkward. Anatomically uncanny and physically
             | impossible.
        
               | skydhash wrote:
               | While drawing hands is difficult (because the surface
               | morphs in a variety of ways), the shapes and relative
               | proportions are quite simple. That's how you can have
               | tools like Metahuman[0]
               | 
               | [0]: https://www.unrealengine.com/en-US/metahuman
        
       | aantix wrote:
       | Still seems to have problems with transparent backgrounds.
        
         | minimaxir wrote:
          | That's expected with any image generation model, because such
          | models aren't trained with an alpha channel.
         | 
         | It's more pragmatic to pipeline the results to a background
         | removal model.
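          | 
          | A minimal sketch of that pipelining idea, assuming the
          | third-party rembg package and a hypothetical "generated.png"
          | file produced by the image model:
          | 
          |     # pip install rembg pillow
          |     from rembg import remove
          |     from PIL import Image
          |     
          |     img = Image.open("generated.png")  # model output, no alpha
          |     cutout = remove(img)  # returns an RGBA image with the
          |                           # background masked out
          |     cutout.save("generated_transparent.png")  # PNG keeps alpha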
         | 
          | EDIT: It appears GPT-4o is different, as there is a video demo
          | dedicated to transparency.
        
           | BoorishBears wrote:
           | There's an entire video in the post dedicated to how well it
           | does transparency:
           | https://openai.com/index/introducing-4o-image-
           | generation/?vi...
           | 
            | I suspect we're getting a flood of comments from people who
            | are using DALL-E.
        
             | minimaxir wrote:
             | Huh, I missed that. I'm skeptical of the results in
             | practice, though.
        
             | aantix wrote:
             | The video was helpful. I started with the prompt "Generate
             | a transparent image. "
             | 
             | And that created the isolated image on a transparent
             | background.
             | 
             | Thank-you.
        
           | throwaway314155 wrote:
           | This one however explicitly advertises good transparency
           | support.
        
           | Der_Einzige wrote:
           | There's a mod for stable diffusion webui
           | forge/automatic1111/ComfyUI which enables this for all
           | diffusion models (except these closed source ones).
        
       | sergiotapia wrote:
        | Am I dumb, or is it that every time they release something I can
        | never find out how to actually use it and so forget about it?
        | Take this for instance: I wanted to try out their Newton "an
        | infographic explaining Newton's prism experiment in great detail"
        | example, but it generated a very bad result. Maybe it's because
        | I'm not using the right model? Every release of theirs is not
        | really a release, it's like a trailer. Right?
        
         | throwaway314155 wrote:
         | You're not dumb. They do this for nearly every single major
         | release. I can't really understand why considering it generates
         | negative sentiment about the release, but it's something to be
         | expected from OpenAI at this point.
        
         | swalsh wrote:
          | This is what's so wild about Anthropic. When they release, it
          | seems like it's rolled out to all users and API customers
          | immediately. OpenAI has MONTHS between announcement and
          | rollout, or if they do roll out right away, it's usually just
          | influencers who get an "early look". It's pretty frustrating.
        
         | guzik wrote:
         | This is hilarious. I'm also confused about whether they
         | released it or not because the results are underwhelming.
         | 
         | EDIT: Ok it works in Sora, and my jaw dropped
        
       | carbocation wrote:
       | This works great for many purposes.
       | 
       | One area where it does not work well at all is modifying
       | photographs of people's faces.* Completely fumbles if you take a
       | selfie and ask it to modify your shirt, for example.
       | 
       | * = unless the people are in the training set
        
         | cess11 wrote:
         | That's to be expected, no? It's a usian product so it will be a
         | disappointment in all areas where things could get lewd.
        
           | briandear wrote:
           | What is usian? Never heard of that.
        
             | jakelazaroff wrote:
             | US-ian, as in from the United States.
        
               | briandear wrote:
               | So should we be using Eusians for citizens of the Estados
               | Unidos Mexicanos?
        
         | BoorishBears wrote:
         | > We're aware of a bug where the model struggles with
         | maintaining consistency of edits to faces from user uploads but
         | expect this to be fixed within the week.
         | 
         | Sounds like it may be a safety thing that's still getting
         | figured out
        
           | carbocation wrote:
           | Thanks, I had not seen that caveat!
        
         | ilaksh wrote:
         | It just doesn't have that kind of image editing capability.
         | Maybe people just assume it does because Google's similar model
         | has it. But did OpenAI claim it could edit images?
        
           | BoorishBears wrote:
           | Yes it does, and that's one of the most important parts of it
           | being multi-modal: just like it can make targeted edits at a
           | piece of text, it can now make similarly nuanced edits to an
           | image. The character consistency and restyling they mention
           | are all rooted in the same concepts.
        
       | alach11 wrote:
       | It's incredible that this took 316 days to be released since it
       | was initially announced. I do appreciate the emphasis in the
       | presentation on how this can be useful beyond just being a
       | cool/fun toy, as it seems most image generation tools have
       | functioned.
       | 
       | Was anyone else surprised how slow the images were to generate in
       | the livestream? This seems notably slower than DALLE.
        
       | byearthithatius wrote:
       | I remember literally just two or three years back getting good
       | text was INSANE. We were all amazed when SD started making pretty
       | good text.
        
       | afro88 wrote:
       | Edit: Please ignore. They hadn't rolled the new model out to my
       | account yet. The announcement blog post is a bit misleading
       | saying you can try it today.
       | 
       | --
       | 
       | Comparison with Leonardo.Ai.
       | 
       | ChatGPT:
       | https://chatgpt.com/share/67e2fb21-a06c-8008-b297-07681dddee...
       | 
       | ChatGPT again (direct one shot):
       | https://chatgpt.com/share/67e2fc44-ecc8-8008-a40f-e1368d306e...
       | 
       | ChatGPT again (using word "photorealistic instead of "photo"):
       | https://chatgpt.com/share/67e2fce4-369c-8008-b69e-c2cbe0dd61...
       | 
       | Leonardo.Ai Phoenix 1.0 model:
       | https://cdn.leonardo.ai/users/1f263899-3b36-4336-b2a5-d8bc25...
        
         | tetris11 wrote:
         | Is the "2D animation style" part you put at the beginning and
          | then changed an attempt to see how well the AI responds to
          | gaslighting?
        
           | afro88 wrote:
           | My bad, I was trying the conversational aspect, but that's
            | not an apples-to-apples comparison. I have put a direct
            | one-shot example in the original post as well.
        
         | wodenokoto wrote:
         | In all fairness you _did_ say 2D animation style
        
           | afro88 wrote:
           | True. I had that conversation before deciding to compare to
           | others. I have updated the post with other fairer examples.
           | Nowhere near Leonardo Phoenix or Flux for this simple image
           | at least.
        
         | elicash wrote:
          | What did the prompt look like for Leonardo.Ai?
          | 
          | I'm curious if you said 2D animation style for both or just for
          | ChatGPT.
          | 
          | Edit: Your second version of ChatGPT doesn't say
          | photorealistic. Can you share the Leonardo.Ai prompt?
        
           | afro88 wrote:
           | Added photorealistic, which made it worse.
           | 
           | Leonardo prompt: A golden cocker spaniel with floppy ears and
           | a collar that says "Sunny" on it
           | 
           | Model: Phoenix 1.0 Style: Pro color photography
        
             | afro88 wrote:
             | Saying "pro color photography" to ChatGPT doesn't get it
             | any better either unfortunately: https://chatgpt.com/share/
             | 67e2fd91-8d24-8008-b144-92c832ed0b...
        
         | drew-y wrote:
         | The ChatGPT examples don't look like the new Image Gen model
         | yet. The text on the dog collar isn't very good.
        
           | afro88 wrote:
           | Apparently it rolls out today to Plus (which I have). I
           | followed the "Try in ChatGPT" link at the top of the post
        
             | og_kalu wrote:
              | It's rolling out to everyone starting today, but I'm not
              | sure if everyone has it yet. Does it generate top-down for
              | you (picture goes from mostly blurry to clear starting from
              | the top) like in their presentation?
        
               | afro88 wrote:
               | No it didn't generate like that. Thanks for clarifying. I
               | have updated my original post.
        
             | yed wrote:
             | On mine I tried it "natively" and in DALL-E mode and the
             | results were basically identical, I think they haven't
             | actually rolled it out to everyone yet.
        
         | spaceman_2020 wrote:
          | Yeah, it's just not good enough. The big labs are way behind
          | what the image-focused labs are putting out. Flux and
          | Midjourney are running laps around these guys.
        
       | KrazyButTrue wrote:
       | Is it live yet? Have been trying it out and am still getting poor
       | results on text generation.
        
         | moffkalast wrote:
         | You're supposed to generate images, stupid /s
        
         | Maxatar wrote:
         | I don't think it's available to everyone yet on 4o. Just like
         | you I am getting the same "cartoony" styling and poor text
         | generation.
         | 
         | Might take a day or two before it's available in general.
        
         | virtualcharles wrote:
         | So far it seems to be the same for me.
         | 
         | It seems like an odd way to name/announce it, there's nothing
         | obvious to distinguish it from what was already there (i.e. 4o
         | making images) so I have no idea if there is a UI change to
         | look for, or just keep trying stuff until it seems better?
        
         | throwaway314155 wrote:
         | This is OpenAI's bread and butter - announce something as
         | though it's being launched and then proceed to slowly roll it
         | out after a couple of days.
         | 
         | Truly infuriating, especially when it's something like this
         | that makes it tough to tell if the feature is even enabled.
        
       | user3939382 wrote:
        | I'll just be happy with not everything having that oversaturated
        | CG/cartoon style that you can't prompt your way out of.
        
         | jjeaff wrote:
         | Is that an artifact of the training data? Where are all these
         | original images with that cartoony look that it was trained on?
        
           | minimaxir wrote:
           | Ever since Midjourney popularized it, image generation models
            | are often post-trained on more "aesthetic" subsets of images
            | to give them a more fantasy look. It also helps obscure some
           | of the imperfections of the AI.
        
           | jl6 wrote:
           | Wild speculation: video game engines. You want your model to
           | understand what a car looks like from all angles, but it's
           | expensive to get photos of real cars from all angles, so
           | instead you render a car model in UE5, generating hundreds of
           | pictures of it, from many different angles, in many different
           | colors and styles.
        
           | wongarsu wrote:
           | A large part of deviantart.com would fit that description.
           | There are also a lot of cartoony or CG images in communities
           | dedicated to fanart. Another component in there is probably
           | the overly polished and clean look of stock images, like the
           | front page results of shutterstock.
           | 
           | "Typical" AI images are this blend of the popular image
           | styles of the internet. You always have a bit of digital
           | drawing + cartoon image + oversaturated stock image + 3d
           | render mixed in. Models trained on just one of these work
           | quite well, but for a generalist model this blend of styles
            | is an issue.
        
           | ToValueFunfetti wrote:
           | I've heard this is downstream of human feedback. If you ask
           | someone which picture is better, they'll tend to pick the
           | more saturated option. If you're doing post-training with
           | humans, you'll bake that bias into your model.
        
         | alana314 wrote:
         | I was relying on that to determine if images were AI though
        
         | LeoPanthera wrote:
          | Frustratingly, the DALL-E API actually has an option for this:
          | you can switch it from "vivid" to "natural".
         | 
          | This option is not exposed in ChatGPT; it only uses "vivid".
        
         | richardfulop wrote:
         | you really have to NOT try to end up with that result in MJ.
        
       | coherentpony wrote:
       | > we've built our most advanced image generator yet into GPT-4o.
       | The result--image generation that is not only beautiful, but
       | useful.
       | 
       | Sorry, but how are these useful? None of the examples demonstrate
       | any use beyond being cool to look at.
       | 
        | The article vaguely mentions 'providing inspiration' as a
        | possible definition of 'useful'. I suppose.
        
       | kh_hk wrote:
       | > Introducing 4o Image Generation: [...] our most advanced image
       | generator yet
       | 
       | Then google:
       | 
       | > Gemini 2.5: Our most intelligent AI model
       | 
       | > Introducing Gemini 2.0 | Our most capable AI model yet
       | 
        | I could go on forever. I hope this trend dies and Apple starts
       | using something effective so all the other companies can start
       | copying a new lexicon.
        
         | hombre_fatal wrote:
         | Maybe it's not useless. 1) it's only comparing it to their own
         | products and 2) it's useful to know that the product is the
         | current best in their offering as opposed to a new product that
         | might offer new functionality but isn't actually their most
         | advanced.
         | 
         | Which is especially relevant when it's not obvious which
         | product is the latest and best just looking at the names. Lots
         | of tech naming fails this test from Xbox (Series X vs S) to
         | OpenAI model names (4o vs o1-pro).
         | 
         | Here they claim 4o is their most capable _image generator_
         | which is useful info. Especially when multiple models in their
         | dropdown list will generate images for you.
        
         | Kiro wrote:
         | What's the problem?
        
           | kh_hk wrote:
           | It's a nitpick about the repetitive phrasing for
           | announcements
           | 
           | <Product name>: Our most <superlative> <thing> yet|ever.
        
             | echelon wrote:
             | I hate modern marketing trends.
             | 
             | This one isn't even my biggest gripe. If I could eliminate
             | any word from the English language forever, it would be
             | "effortlessly".
        
               | kh_hk wrote:
               | If you could _effortlessly_ eliminate any word you mean?
        
               | mhurron wrote:
               | Modern? Everything has been 'new and improved' since the
               | 60's
        
               | xboxnolifes wrote:
               | Idk, right now I think I'd eliminate "blazingly fast"
               | from software engineering vocabulary.
        
             | rachofsunshine wrote:
             | Speaking as someone who'd love to not speak that way in my
             | own marketing - it's an unfortunate necessity in a world
             | where people will give you literal milliseconds of their
             | time. Marketing isn't there to tell you about the thing,
             | it's there to get you to want to know more about the thing.
        
               | skydhash wrote:
               | A term for people giving only milliseconds of their
               | attention is: uninterested people. If I'm not looking for
               | a project planner, or interested in the space, there's no
               | wording that can make me stay on an announcement for one.
               | If I am, you can be sure I'm going to read the whole
               | feature page.
        
               | adammarples wrote:
               | Idealistic and wrong, marketing does work in a lot of
               | cases and that's why everybody does it
        
         | sigmoid10 wrote:
         | Has post-Jobs Apple ever come up with anything that would
         | warrant this hope?
        
           | internetter wrote:
           | Every iPhone is their best iPhone yet
        
             | brianshaler wrote:
             | Even the 18 Pro Max Ultra with Apple Intelligence?
             | 
             | Obligatory Jobs monologue on marketing people:
             | 
             | https://www.youtube.com/watch?v=P4VBqTViEx4
        
             | layer8 wrote:
             | Only the September ones. ;)
        
           | kh_hk wrote:
           | No, but I think they stopped with "our most" (since all other
           | brainless corps adopted it) and just connect adjectives with
           | dots.
           | 
           | Hotwheels: Fast. Furious. Spectacular.
        
             | sigmoid10 wrote:
             | Maybe people also caught up to the fact that the "our most
             | X product" for Apple usually means someone else already did
             | X a long time ago and Apple is merely jumping on the wagon.
        
         | nyczomg wrote:
         | https://www.youtube.com/watch?v=CUPDRnUWeBA
        
         | Buttons840 wrote:
         | Every step of gradient descent is the best model yet!
        
         | sionisrecur wrote:
         | Maybe they used AI to come up with the tag line.
        
         | roenxi wrote:
         | We're in the middle of a massive and unprecedented boom in AI
         | capabilities. It is hard to be upset about this phrasing - it
         | is literally true and extremely accurate.
        
       | TheAceOfHearts wrote:
       | I wanted to use this to generate funny images of myself. Recently
       | I was playing around with Gemini Image Generation to dress myself
       | up as different things. Gemini Image Generation is surprisingly
       | good, although the image quality quickly degrades as you add more
       | changes. Nothing harmful, just silly things like dressing me up
       | as a wizard or other typical RPG roles.
       | 
       | Trying out 4o image generation... It doesn't seem to support this
       | use-case at all? I gave it an image of myself and asked to turn
        | me into a wizard, and it generated something that doesn't look
        | like me in the slightest. On a second attempt, I asked to add a
       | wizard hat and it just used python to add a triangle in the
       | middle of my image. I looked at the examples and saw they had a
       | direct image modification where they say "Give this cat a
       | detective hat and a monocle", so I tried that with my own image
       | "Give this human a detective hat and a monocle" and it just gave
       | me this error:
       | 
       | > I wasn't able to generate the modified image because the
       | request didn't follow our content policy. However, I can try
       | another approach--either by applying a filter to stylize the
       | image or guiding you on how to edit it using software like
       | Photoshop or GIMP. Let me know what you'd like to do!
       | 
       | Overall, a very disappointing experience. As another point of
       | comparison, Grok also added image generation capabilities and
       | while the ability to edit existing images is a bit limited and
       | janky, it still manages to overlay the requested transformation
       | on top of the existing image.
        
         | og_kalu wrote:
         | It's not actually out for everyone yet. You can tell by the
         | generation style. 4o generates top down (picture goes from
         | mostly blurry to clear starting from the top).
        
       | planb wrote:
       | To quote myself from a comment on sora:
       | 
       | Iterations are the missing link. With ChatGPT, you can
       | iteratively improve text (e.g., "make it shorter," "mention
       | xyz"). However, for pictures (and video), this functionality is
       | not yet available. If you could prompt iteratively (e.g.,
       | "generate a red car in the sunset," "make it a muscle car,"
       | "place it on a hill," "show it from the side so the sun shines
       | through the windshield"), the tools would become exponentially
       | more useful.
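        | 
        | A toy sketch of that loop (edit_image is a made-up stand-in, not
        | a real API):
        | 
        |     def edit_image(image, instruction):
        |         # a real model would return new pixels; this stub just
        |         # records the accumulated instructions
        |         return image + [instruction]
        |     
        |     image = edit_image([], "generate a red car in the sunset")
        |     for step in ["make it a muscle car",
        |                  "place it on a hill",
        |                  "show it from the side so the sun shines "
        |                  "through the windshield"]:
        |         image = edit_image(image, step)  # condition on last result
        |     print(image)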
       | 
        | I'm looking forward to trying this out and seeing if I was right.
       | Unfortunately it's not yet available for me.
        
         | Telemakhos wrote:
         | Reading other comments in other threads on HN has left me with
         | the impression that iterative improvement within a single chat
         | is not a good idea.
         | 
         | For example, https://news.ycombinator.com/item?id=43388114
        
           | planb wrote:
            | You're right. I'm actually doing this quite often when
            | coding: starting with a few iterative prompts to get a
            | general outline of what I want, and when that's OK, copying
            | the outline to a new chat and fleshing out the details. But
            | that's still iterative work; I'm just throwing away the
            | intermediate results that I think confuse the LLM sometimes.
        
         | Workaccount2 wrote:
         | You can do that with Gemini's image model, flash 2.0 (image
         | generation) exp.[1] It's not perfect but it does mostly
         | maintain likeness between generations.
         | 
         | [1]https://aistudio.google.com/prompts/new_chat
        
           | camel_Snake wrote:
           | Whisk I think is possibly the best at it. No idea what it
           | uses under the hood though.
           | 
           | https://labs.google/fx/tools/whisk
        
       | jashephe wrote:
       | The periodic table poster under "High binding problems" is billed
       | as evidence of model limitations, but I wonder if it just
       | suggests that 4o is a fan of "Look Around You".
        
       | lxgr wrote:
       | Is there any way to see whether a given prompt was serviced by 4o
       | or Dall-E?
       | 
       | Currently, my prompts seem to be going to the latter still, based
       | on e.g. my source image being very obviously looped through a
       | verbal image description and back to an image, compared to
       | gemini-2.0-flash-exp-image-generation. A friend with a Plus plan
       | has been getting responses from either.
       | 
       | The long-term plan seems to be to move to 4o completely and move
       | Dall-E to its own tab, though, so maybe that problem will resolve
       | itself before too long.
        
         | og_kalu wrote:
         | 4o generates top down (picture goes from mostly blurry to clear
         | starting from the top). If it's not generating like that for
         | you then you don't have it yet.
        
           | lxgr wrote:
           | That's useful, thank you! But it also highlights my point:
           | Why do I have to observe minor details about how the result
           | is being presented to me to know which model was used?
           | 
           | I get the intent to abstract it all behind a chat interface,
           | but this seems a bit too much.
        
             | og_kalu wrote:
              | Oh, I agree 100%. OpenAI rollouts leave much to be desired.
              | Sometimes there isn't even a clear difference like
             | there is for this.
        
         | n2d4 wrote:
          | If you don't have access to it on ChatGPT yet, you can try
          | Sora, where I already have access.
        
         | tethys wrote:
         | I've generated (and downloaded) a couple of images. All
          | filenames start with `DALL·E`, so I guess that's a safe way to
         | tell how the images were generated.
        
       | blixt wrote:
       | What's important about this new type of image generation that's
        | happening with tokens rather than with diffusion is that this is
       | effectively reasoning in pixel space.
       | 
        | Example: Ask it to draw a notepad with an empty tic-tac-toe
        | board, then tell it to make the first move, then you make a
        | move, and so on.
       | 
       | You can also do very impressive information-conserving
       | translations, such as changing the drawing style, but also stuff
       | like "change day to night", or "put a hat on him", and so forth.
       | 
       | I get the feeling these models are quite restricted in
       | resolution, and that more work in this space will let us do
       | really wild things such as ask a model to create an app step by
       | step first completely in images, essentially designing the whole
       | app with text and all, then writing the code to reproduce it. And
       | it also means that a model can take over from a really good
       | diffusion model, so even if the original generations are not
       | good, it can continue "reasoning" on an external image.
       | 
       | Finally, once these models become faster, you can imagine a truly
       | generative UI, where the model produces the next frame of the app
       | you are using based on events sent to the LLM (which can do all
       | the normal things like using tools, thinking, etc). However, I
       | also believe that diffusion models can do some of this, in a much
       | faster way.
        
         | Mond_ wrote:
         | Pretty sure the modern Gemini image models can already do token
         | based image generation/editing and are significantly better and
         | faster.
        
           | og_kalu wrote:
           | It's faster but it's definitely not better than what's being
            | showcased here. The quality of Flash 2 Image gens is
           | generally pretty meh.
        
           | blixt wrote:
           | Yeah Gemini has had this for a few weeks, but much lower
           | resolution. Not saying 4o is perfect, but my first few images
           | with it are much more impressive than my first few images
           | with Gemini.
        
             | yieldcrv wrote:
              | _weeks_, y'all, weeks!
        
         | jjbinx007 wrote:
         | It still can't generate a full glass of wine. Even in follow up
         | questions it failed to manipulate the image correctly.
        
           | yusufozkan wrote:
           | Are you sure you are using the new 4o image generation?
           | 
           | https://imgur.com/a/wGkBa0v
        
             | minimaxir wrote:
             | That is an unexpectedly literal definition of "full glass".
        
               | numpad0 wrote:
                | Except this is correct in this context. None of the
                | existing diffusion models could, apparently.
        
               | yusufozkan wrote:
               | Generating an image of a completely full glass of wine
               | has been one of the popular limitations of image
               | generators, the reason being neural networks struggling
               | to generalise outside of their training data (there are
               | almost no pictures on the internet of a glass "full" of
               | wine). It seems they implemented some reasoning over
               | images to overcome that.
        
               | kube-system wrote:
               | I wonder if that has changed recently since this has
               | become a litmus test.
               | 
               | Searching in my favorite search engine for "full glass of
               | wine", without even scrolling, three of the images are of
               | wine glasses filled to the brim.
        
               | Loeffelmann wrote:
                | That's the point. The old models all failed to produce a
                | wine glass that is completely full to the brim, because
                | you can't find that a lot in the data they used for
                | training.
        
               | colecut wrote:
               | Imagine if they just actually trained the model on a
               | bunch of photographs of a full glass of wine, knowing of
               | this litmus test
        
               | HelloImSteven wrote:
               | Even if they did, I'd assume the association of "full"
               | and this correct representation would benefit other areas
               | of the model. I.e., there could (/should?) be general
               | improvement for prompts where objects have unusual
               | adjectives.
               | 
               | So maybe training for litmus tests isn't the worst
               | strategy in the absence of another entire internet of
               | training data...
        
               | nefarious_ends wrote:
               | imagine!
        
               | gorkish wrote:
               | I obviously have no idea if they added real or synthetic
               | data to the training set specifically regarding the full-
               | to-the-brim wineglass test, but I fully expect that this
               | prompt is now compromised in the sense that because it is
                | being discussed in the public sphere, it has inherently
               | become part of the test suite.
               | 
               | Remember the old internet adage that the fastest way to
               | get a correct answer online is to post an incorrect one?
               | I'm not entirely convinced this type of iterative gap
               | finding and filling is really much different than natural
               | human learning behavior.
        
               | orbital-decay wrote:
               | A lot of other things are rare in datasets, let alone
               | correctly labeled. Overturned cars (showing the
               | underside), views from under the table, people walking on
               | the ceiling with plausible upside down hair, clothes, and
               | facial features etc etc
        
               | jorvi wrote:
                | The old models were doing it correctly as well.
                | 
                | There is no one correct way to interpret 'full'. If you
                | go to a wine bar and ask for a full glass of wine,
                | they'll probably interpret that as a double. But you
                | could also interpret it the way a friend would at home,
                | which is about 2-3 cm from the rim.
                | 
                | Personally, I would call a glass of wine filled to the
                | brim 'overfilled', not 'full'.
        
               | drdeca wrote:
               | [delayed]
        
               | yusufozkan wrote:
               | This is another cool example from their blog
               | 
               | https://imgur.com/a/Svfuuf5
        
             | Imustaskforhelp wrote:
              | Looks amazing. Can you please also create an unconventional
              | image, like a clock showing 2:35? I tried something like
              | this with Gemini when some redditor asked for it, and it
              | failed, so I'm wondering if 4o can do it.
        
               | Workaccount2 wrote:
               | I tried and while the clock it generated was very well
               | done and high quality, it showed the time as the analog
               | clock default of 10:10.
        
               | lyu07282 wrote:
                | The problem now is we don't know if people mistake DALL-E
                | output for the new multimodal GPT-4o output; they really
                | should've made that clearer.
        
               | CSMastermind wrote:
               | I tried and it failed repeatedly (like actual error
               | messages):
               | 
               | > It looks like there was an error when trying to
               | generate the updated image of the clock showing 5:03. I
               | wasn't able to create it. If you'd like, you can try
               | again by rephrasing or repeating the request.
               | 
               | A few times it did generate an image but it never showed
               | the right time. It would frequently show 10:10 for
               | instance.
        
               | coder543 wrote:
               | If it tried and failed repeatedly, then it was prompting
               | DALL-E, looking at the results, then prompting DALL-E
               | again, not doing direct image generation.
        
             | stevesearer wrote:
             | Can you do this with the prompt of a cow jumping over the
             | moon?
             | 
             | I can't ever seem to get it to make the cow appear to be
             | above the moon. Always literally covering it or to the side
             | etc.
        
               | michaelt wrote:
               | https://chatgpt.com/share/67e31a31-3d44-8011-994e-b7f8af7
               | 694... got it on the second try.
        
               | coder543 wrote:
               | To be clear, that is DALL-E, not 4o image generation.
               | (You can see the prompt that 4o generated to give to
               | DALL-E.)
        
           | blixt wrote:
           | Yeah, it seems like somewhere in the semantic space (which
           | then gets turned into a high resolution image using a
           | specialized model probably) there is not enough space to hold
           | all this kind of information. It becomes really obvious when
           | you try to meaningfully modify a photo of yourself, it will
           | lose your identity.
           | 
           | For Gemini it seems to me there's some kind of "retain old
           | pixels" support in these models since simple image edits just
           | look like a passthrough, in which case they _do_ maintain
           | your identity.
        
           | jasonjmcghee wrote:
           | I don't buy the meme (or whatever it is) that they can't produce
           | an image with a full glass of wine. It just takes a little
           | prompt engineering.
           | 
           | Using DALL-E / the old model, without too much effort (I'd call
           | this "full"):
           | 
           | https://imgur.com/a/J2bCwYh
        
           | sfjailbird wrote:
           | They're glass-half-full type models.
        
           | meeton wrote:
           | https://i.imgur.com/xsFKqsI.png
           | 
           | "Draw a picture of a full glass of wine, ie a wine glass
           | which is full to the brim with red wine and almost at the
           | point of spilling over... Zoom out to show the full wine
           | glass, and add a caption to the top which says "HELL YEAH".
           | Keep the wine level of the glass exactly the same."
        
             | Stevvo wrote:
             | Can't replicate. Maybe the rollout is staggered? Using Plus
             | from Europe, it's consistently giving me a half full glass.
        
               | sionisrecur wrote:
               | Maybe it's half empty.
        
               | coder543 wrote:
               | Is it drawing the image from top to bottom very slowly
               | over the course of at least 30 seconds? If not, then
               | you're using DALL-E, not 4o image generation.
        
             | cruffle_duffle wrote:
             | Maybe the "HELL YEAH" added a "party implication" which
             | shifted it's "thinking" into just correct enough latent
             | space that it was able to actually hunt down some image
             | somewhere in its training data of a truly full glass of
             | wine.
             | 
             | I almost wonder if prompting it "similar to a full glass of
             | beer" would get it shifted just enough.
        
           | iagooar wrote:
           | The question remains: why would you generate a full glass of
           | wine? Is that really such a common request?
        
             | minimaxir wrote:
             | It's a type of QA question that can identify peculiarities in
             | models (e.g. counting the "r"s in "strawberry"), which is the
             | best we have given the black-box nature of LLMs.
        
         | rafram wrote:
         | > Finally, once these models become faster, you can imagine a
         | truly generative UI, where the model produces the next frame of
         | the app you are using based on events sent to the LLM
         | 
         | With current GPU technology, this system would need its own
         | Dyson sphere.
        
         | xg15 wrote:
         | > _What 's important about this new type of image generation
         | that's happening with tokens rather than with diffusion_
         | 
         | That sounds really interesting. Are there any write-ups how
         | exactly this works?
        
           | lyu07282 wrote:
           | Would be interested to know as well. As far as I know there
           | is no public information about how this works exactly. This
           | is all I could find:
           | 
           | > The system uses an autoregressive approach -- generating
           | images sequentially from left to right and top to bottom,
           | similar to how text is written -- rather than the diffusion
           | model technique used by most image generators (like DALL-E)
           | that create the entire image at once. Goh speculates that
           | this technical difference could be what gives Images in
           | ChatGPT better text rendering and binding capabilities.
           | 
           | https://www.theverge.com/openai/635118/chatgpt-sora-ai-
           | image...
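           | 
           | Purely as an illustration of that description (not OpenAI's
           | actual architecture, which isn't public), autoregressive image
           | generation boils down to sampling one image token at a time in
           | raster order and then decoding the token grid into pixels. The
           | model, codebook size, and decoder below are made-up stand-ins:
           | 
           |   import torch
           | 
           |   VOCAB, GRID = 1024, 16  # hypothetical codebook size, 16x16 token grid
           |   # stand-in next-token predictor; the real system is a large
           |   # multimodal transformer conditioned on the text prompt too
           |   model = torch.nn.LSTM(input_size=VOCAB, hidden_size=VOCAB,
           |                         batch_first=True)
           | 
           |   tokens = [0]  # hypothetical start-of-image token
           |   for _ in range(GRID * GRID):
           |       x = torch.nn.functional.one_hot(
           |           torch.tensor([tokens]), VOCAB).float()
           |       logits, _ = model(x)  # condition on all tokens emitted so far
           |       probs = torch.softmax(logits[0, -1], dim=-1)
           |       tokens.append(torch.multinomial(probs, 1).item())
           | 
           |   # raster order (left to right, top to bottom) is why partial
           |   # images stream in from the top; a learned VQ decoder would
           |   # turn this grid into pixels
           |   grid = torch.tensor(tokens[1:]).reshape(GRID, GRID)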
        
             | treis wrote:
             | I wonder how it'd work if the layers were more physically
             | based, in other words something like rough 3D shape -> details
             | -> color -> perspective -> lighting.
             | 
             | I also wonder if you'd get better results by generating
             | something like Blender files and using its engine to render the
             | result.
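             | 
             | A minimal sketch of that Blender loop might look like the
             | following; it assumes the model has emitted a Python scene
             | script (here just read from a hypothetical generated_scene.py)
             | and uses Blender's bundled bpy module to render it:
             | 
             |   import bpy
             | 
             |   # hypothetical model-generated script that builds the meshes,
             |   # materials, lights and camera
             |   with open("generated_scene.py") as f:
             |       exec(f.read())
             | 
             |   # let Blender's engine handle perspective, lighting, shading
             |   bpy.context.scene.render.engine = 'CYCLES'
             |   bpy.context.scene.render.filepath = "/tmp/render.png"
             |   bpy.ops.render.render(write_still=True)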
        
         | abossy wrote:
         | That's very interesting. I would have assumed that 4o is
         | internally using a single seed for the entire conversation, or
         | something analogous to that, to control randomness across image
         | generation requests. Can you share the technical name for this
         | reasoning process so I could look up research about it?
        
           | SpaceManNabs wrote:
           | multimodal chain of thought / generation of thought
           | 
           | Nobody has really decided on a name.
           | 
           | Also, chain of thought is somewhat different from chain-of-
           | thought reasoning, so maybe throw in "multimodal chain-of-thought
           | reasoning" as well.
        
         | nine_k wrote:
         | It also would mean that the model can correctly split the image
         | into layers, or segments, matching the entities described. The
         | low-res layers can then be fed to other image-processing
         | models, which would enhance them and fill in missing small
         | details. The result could be a good-quality animation, for
         | instance, and the "character" layers can even potentially be
         | reusable.
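         | 
         | A rough sketch of what that pipeline could look like, with trivial
         | stand-in functions where the segmentation and enhancement models
         | would go (nothing here reflects how 4o actually works):
         | 
         |   from PIL import Image, ImageFilter
         | 
         |   def segment_layers(lowres, n_entities):
         |       # stand-in: a real segmentation model would cut out one RGBA
         |       # layer per described entity
         |       return [lowres.convert("RGBA") for _ in range(n_entities)]
         | 
         |   def enhance(layer, scale=4):
         |       # stand-in: a real super-resolution model would fill in the
         |       # missing fine detail
         |       return layer.resize((layer.width * scale,
         |                            layer.height * scale)).filter(
         |                                ImageFilter.SMOOTH)
         | 
         |   def layered_upscale(lowres, n_entities):
         |       canvas = Image.new("RGBA", (lowres.width * 4,
         |                                   lowres.height * 4), (0, 0, 0, 0))
         |       for layer in segment_layers(lowres, n_entities):
         |           # reusable per-entity layers, composited back together
         |           canvas.alpha_composite(enhance(layer))
         |       return canvas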
        
         | SamBam wrote:
         | Hmmm, I wanted to do that tic tac toe example, and it failed to
         | create a 3x3 grid, instead creating a 5x5 (?) grid with two
         | first moves marked.
         | 
         | https://chatgpt.com/share/67e32d47-eac0-8011-9118-51b81756ec...
        
           | nerder92 wrote:
           | I tried to play it, and while the conversation is right, the
           | image is just all wrong.
        
         | Taek wrote:
         | > What's important about this new type of image generation
         | that's happening with tokens rather than with diffusion, is
         | that this is effectively reasoning in pixel space.
         | 
         | I do not think that this is correct. Prior to this release, 4o
         | would generate images by calling out to a fully external model
         | (DALL-E). After this release, 4o generates images by calling
         | out to a multi-modal model that was trained alongside it. It's
         | the same thing as LLaVA.
         | 
         | You can ask 4o about this yourself. Here's what it said to me:
         | 
         | "So while I'm deeply multimodal in cognition (understanding and
         | coordinating text + image), image generation is handled by a
         | linked latent diffusion model, not an end-to-end token-unified
         | architecture."
        
       | krackers wrote:
       | So what's the lore behind why this took over a _year_ to launch
       | from the first announcement? It seems fairly clear that their hand
       | was forced by Google quietly releasing this exact feature a few
       | weeks back, though.
        
       | Lerc wrote:
       | I think the biggest problem I still see is the model's awareness of
       | the images it generates itself.
       | 
       | The glaring issue with the older image generators is how they would
       | proudly proclaim to have presented an image with a description that
       | has almost no relation to the image actually provided.
       | 
       | I'm not sure if this update improves on this aspect. It may
       | create the illusion of awareness of the picture by having better
       | prompt adherence.
        
       | nprateem wrote:
       | The garbled text on these things always just makes them basically
       | useless, especially since it often adds text without being told to,
       | like previous models did.
        
         | chairdoor wrote:
         | You are being served the old model.
        
       | mclau156 wrote:
       | I would love to see advancement in the pixel art space: specifying
       | 64x64 pixels and attempting to make game-ready pixel art and even
       | animations, or taking a reference image and creating a 64x64
       | version.
        
       | bbstats wrote:
       | that "best of 8" is doing a lot of work. i put in the same input
       | and the image is awful.
        
       | nycdatasci wrote:
       | Here's an example of iterative editing with the new model:
       | https://chatgpt.com/share/67e30f62-12f0-800f-b1d7-b3a9c61e99...
       | 
       | It's much better than prior models, but still generates hands
       | with too many fingers, bodies with too many arms, etc.
        
         | dmd wrote:
         | You know the images themselves don't get shared in links like
         | that, right? (It even tells you so when you make the link.)
        
           | rahimnathwani wrote:
           | I created a shared link just now, was not presented with any
           | such warning, and have the same problem with the image not
           | showing up:
           | 
           | https://chatgpt.com/share/67e319dd-
           | bd08-8013-8f9b-6f5140137f...
        
             | dmd wrote:
             | Interesting. I see this: https://imgur.com/a/QNWeEoZ
        
               | rahimnathwani wrote:
               | Aha! I see different messages in the Android app vs. web
               | app.
               | 
               | In the web app I see:
               | 
               | Your name, custom instructions, and any messages you add
               | after sharing stay private. Learn more
        
             | andai wrote:
             | The image shows for me.
        
         | rahimnathwani wrote:
         | For some reason, I can't see the images in that chat, whether
         | I'm signed in or in incognito mode.
         | 
         | I see errors like this in the console:
         | 
         | ewwsdwx05evtcc3e.js:96 Error: Could not fetch file with ID
         | file_0000000028185230aa1870740fa3887b?shared_conversation_id=
         | 67e30f62-12f0-800f-b1d7-b3a9c61e99d6 from file service
         |     at iehdyv0kxtwne4ww.js:1:671
         |     at async w (iehdyv0kxtwne4ww.js:1:600)
         |     at async queryFn (iehdyv0kxtwne4ww.js:1:458)
         | Caused by: ClientRequestMismatchedAuthError: No access token when
         | trying to use AuthHeader
        
       | ravedave5 wrote:
       | Everyone should try running their own prompts and see how overhyped
       | this is. The results I get are comparatively terrible.
        
         | nycdatasci wrote:
         | I don't think the new model is rolled out to all users yet.
        
       | DrNosferatu wrote:
       | Could they have switched to *both* image and text generation via
       | diffusion, without tokens?
        
       | M4v3R wrote:
       | I've just tried it and, oh wow, it's really good. I managed to
       | create a birthday invitation card for my daughter in basically one
       | shot; it nailed exactly the elements and style I wanted. Then I
       | asked it to retain everything but tweak the text to add more
       | details about the date, venue, etc. And it did. I'm in shock.
       | Previous models would not be even halfway there.
        
         | swyx wrote:
         | share prompt minus identifying details?
        
           | M4v3R wrote:
           | > Draw a birthday invitation for a 4 year old girl [name
           | here]. It should be whimsical, look like its hand-drawn with
           | little drawings on the sides of stuff like dinosaurs,
           | flowers, hearts, cats. The background should be light and the
           | foreground elements should be red, pink, orange and blue.
           | 
           | Then I asked for some changes:
           | 
           | > That's almost perfect! Retain this style and the elements,
           | but adjust the text to read:
           | 
           | > [refined text]
           | 
           | > And then below it should add the location and date details:
           | 
           | > [location details]
        
       | tantaman wrote:
       | Attention to every detail, even the awkward nerd high-five.
        
       | danhds wrote:
       | To avoid confusion, why not always use a general AI model
       | upfront, then depending on the user's prompt, redirect it to a
       | specific model?
        
         | n2d4 wrote:
         | The models are noticeably different -- for example, o1 and o3 have
         | reasoning, and some users (e.g. me) want to tell the model when to
         | use reasoning and when not to.
         | 
         | As to why they don't automatically detect when reasoning would be
         | appropriate and then switch to o3, I don't know, but I'd assume
         | it's about cost (and for most users the difference in output
         | quality is negligible). 4o can do everything; it's just not great
         | at "logic".
        
       | polotics wrote:
       | well it failed on me, after many tries:
       | 
       | ...Once the wait time is up, I can generate the corrected version
       | with exactly eight characters: five mice, one elephant, one polar
       | bear, and one giraffe in a green turtleneck. Let me know if you'd
       | like me to try again later!
        
       | n2d4 wrote:
       | For those who are still getting the old DALL-E images inside
       | ChatGPT, you can access the new model on Sora:
       | https://sora.com/explore/images
        
       | ibzsy wrote:
       | Anyone else frightened by this? Seeing meant believing, and now
       | that isn't the case anymore...
        
         | layer8 wrote:
         | Look closer at the fingers. These models still don't have a
         | firm handle on them. The right elbow on the second picture also
         | doesn't quite look anatomically possible.
        
         | wepple wrote:
         | This specifically? No. We've been on this path a while now.
         | 
         | The general idea of indistinguishable real/fake images; yeah
        
         | quectophoton wrote:
         | Nah, I'll maybe start taking them seriously when they can draw
         | someone grating cheese, but holding the cheese and the grater
         | as if they were playing violin.
        
       | kylehotchkiss wrote:
       | They still all have a somewhat cold and sterile look to them.
       | That's probably the 1% the next decade will be spent working out.
        
       | ashvardanian wrote:
       | The pre-recorded short videos are a much better form of
       | presentation than live-streamed announcements!
        
       | freeopinion wrote:
       | It bothers me to see links to content that requires a login. I
       | don't expect openai or anyone else to give their services away
       | for free. But I feel like "news" posts that require one to set up
       | an account with a vendor are made in bad faith.
       | 
       | If the subject matter is paywalled, I feel that the post should
       | include some explanation of what is newsworthy behind the link.
        
       | macleginn wrote:
       | A real improvement, but it still drew me a door with a handle where
       | there should be one and an extra knob on the side where the hinges
       | are.
        
       | trekkie1024 wrote:
       | Interesting that in the second image the text on the whiteboard
       | (top left) changes.
        
       | BigParm wrote:
       | They say it must be an important OpenAI announcement when they
       | bring out the twink.
        
       | akomtu wrote:
       | The real test for image generators is the image -> text -> image
       | round trip. In other words, the model should be able to describe an
       | image in words and then use those words to recreate the original
       | image with high accuracy.
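       | 
       | A rough sketch of that round trip using the OpenAI API, with DALL-E
       | standing in for the generator since the new 4o image generation has
       | no API yet; scoring the similarity between the original and the
       | regenerated image is left to the reader:
       | 
       |   from openai import OpenAI
       | 
       |   client = OpenAI()
       | 
       |   def roundtrip(image_url):
       |       # 1. image -> text: describe the image exhaustively
       |       caption = client.chat.completions.create(
       |           model="gpt-4o",
       |           messages=[{
       |               "role": "user",
       |               "content": [
       |                   {"type": "text", "text": "Describe this image in "
       |                                            "enough detail to recreate it."},
       |                   {"type": "image_url", "image_url": {"url": image_url}},
       |               ],
       |           }],
       |       ).choices[0].message.content
       | 
       |       # 2. text -> image: regenerate from the description alone
       |       result = client.images.generate(model="dall-e-3", prompt=caption,
       |                                       size="1024x1024")
       |       return result.data[0].url  # compare against the original image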
        
       ___________________________________________________________________
       (page generated 2025-03-25 23:00 UTC)