[HN Gopher] Launch HN: Design Arena (YC S25) - Head-to-head AI benchmark for aesthetics
       ___________________________________________________________________
        
       Launch HN: Design Arena (YC S25) - Head-to-head AI benchmark for
       aesthetics
        
       Hi HN, I'm Grace from Design Arena (https://www.designarena.ai/) -
       we're building a crowdsourced benchmark for AI-generated visuals
       (websites, images, video, and more). We put AI models and builder
       tools in head-to-head comparisons that get voted on by real users
       from around the world. Think "Hot or Not" for the AI era :)  (Btw,
       when we say real users we mean _real_ users, so you may get a
       captcha on the site. Sorry, but we have to use every bot protection
       available! We only want human ratings, for obvious reasons.)
       Here's a demo video: https://www.youtube.com/watch?v=vPyEQnuVgeI
       We didn't set out to build this - we were actually working on an AI
       game engine. But we found that models sucked at look-and-feel. Even
       when the output code was usually functional, most visual aspects
       lacked the soul that makes great graphics feel alive.  So we built
       a this-or-that game, just for ourselves, to figure out which
       generated outputs had the best graphics. To our surprise, that
       turned out to be more exciting than the original idea; apparently
       this is a widespread problem! We did a Show HN a month ago
       (https://news.ycombinator.com/item?id=44542578) and that was partly
       what convinced us to make this benchmark thing our actual product.
       State-of-the-art models might be winning IMO gold, but they are
       still putting white text on a white background. There needs to be
       _some_ measurement of what's good and what isn't (yes, there is
       such a thing as good design!), and it sure isn't going to come from
       LLMs.  We come from engineering backgrounds (Apple and Nvidia) with
       a love for design; we know when we like or dislike something, even
       when we can't say why. This-or-that / hot-or-not games are made for
       domains like this: Design Arena's goal is to make everything
       stupidly simple so humans can just do the easy part:
       like-vs.-dislike. That also turns out to be the valuable part,
       because what's easiest for humans is exactly the part that AIs
       can't currently do.  Since our Show HN, we've extended our
       initial set of
       ~25 LLMs to 54 LLMs, 12 image models, 4 video models, 22 audio
       models, and 22 vibe-coding tools (like Lovable, Bolt, v0,
       Firebase Studio, and more). In this last category, we've been
       surprised to find that agentic tools not specifically marketed
       as vibe-coders, like Devin, performed exceedingly well,
       outperforming dedicated builder tools like Lovable, v0, and
       Bolt.  Our users are mostly devs who want to spin
       up a frontend, or designers who want to spin up design variants
       faster. In both cases, Design Arena provides a quick way to find
       out which options are better than others. The dev or designer
       still needs to make the final calls, because there's no
       substitute for good judgment. But this kind of head-to-head
       filtering can really help.  We plan to
       make money by offering version testing as a service to companies
       that need to quantify improvements in their product between builds.
       This is the first time we've ever worked on something like this!
       We'd love to learn from you all and look forward to your feedback.
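       (A note on scoring: Design Arena hasn't published its exact
       ranking math, but head-to-head arenas of this kind are commonly
       scored with an Elo-style update over pairwise votes. A minimal
       sketch, with hypothetical model names:)

```python
# Illustrative Elo-style scoring for pairwise "this-or-that" votes.
# This is an assumption about how such a leaderboard *could* work,
# not Design Arena's actual method.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Apply one vote: the winner gains exactly what the loser drops."""
    e_w = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - e_w)
    ratings[winner] += delta
    ratings[loser] -= delta

# Two evenly rated (hypothetical) models; one vote moves each by k/2.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
update(ratings, winner="model_a", loser="model_b")
```

       (An upset win against a higher-rated model moves ratings more
       than an expected one, which is what lets many small crowd votes
       converge to a stable ranking.)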
        
       Author : grace77
       Score  : 51 points
       Date   : 2025-08-12 16:10 UTC (6 hours ago)
        
       | transformi wrote:
       | Cool - do you train a model that acts as a proxy for people's
       | votes?
        
         | grace77 wrote:
         | we're not training models or proxying human votes with models
        
       | ryhanshannon wrote:
       | Is this an area that is not yet covered by other user rating
       | benchmark sites like LLMarena?
        
         | grace77 wrote:
         | yes! LMArena recently started pushing "webdev" arena, but there
         | was no explicit emphasis on design or aesthetics, just web-
         | based content
        
       | KaoruAoiShiho wrote:
       | Curious if you guys got into YC for this idea or something else?
        
         | neonate wrote:
         | Post says they were making an AI game engine, so that's
         | probably what they got in with.
        
         | j_da wrote:
         | We started out building a platform to one-shot games (single-
         | player and multi-player), but realized that the model you used
         | under the hood really made a difference in functionality and
         | graphics. We then built the benchmark as an internal tool to
         | see which model was best, and found that benchmarking models
         | on visual "taste" was something people were broadly
         | interested in.
        
       | andrewstuart wrote:
       | AI is terrible at producing nice-looking design layouts with
       | good font selections.
       | 
       | Sure, it can make great-looking images, but nothing can make a
       | nice-looking poster or basic page layout.
       | 
       | I'm waiting for someone to solve this. I'm not even sure it
       | takes AI; it might just be programmatic.
        
         | grace77 wrote:
         | yes - we're trying to figure out why that is
        
         | rovmut wrote:
         | You've perfectly articulated the gap in the market. The
         | solution isn't just a better image generator. I built a tool
         | called LayoutCraft to solve this exact problem. It focuses
         | entirely on creating a great layout with good font choices
         | automatically. It uses AI to understand the request, but then
         | applies a structured, programmatic 'blueprint' to build the
         | layout. This is how it handles fonts and spacing properly,
         | resulting in a clean design, not a chaotic image.
        
       | refrigerator wrote:
       | Great concept -- definitely needed and will hopefully push the
       | labs to improve design abilities of models!
        
         | j_da wrote:
         | Yes, exactly. We want to be a forcing function for better
         | design models and agents.
        
       | Michelangelo11 wrote:
       | This is interesting but, speaking frankly, I see many seemingly
       | insurmountable issues. Here are some:
       | 
       | - Contests will often be won not by the entry that best
       | adhered to the prompt, but by the best-looking one. This
       | happened in the contest "Build a brutalist website to a
       | typeface maker," which I got as a recent example. The winning
       | entry had
       | megawatt-bright magenta and yellow, which shouldn't appear
       | anywhere near brutalism, and in other design aspects had almost
       | no connection to brutalism either -- but it _was_ the most
       | attractive of the bunch.
       | 
       | - The approach only gets you to a local maximum. Current LLMs
       | aren't very good designers, as you say, so contests will involve
       | picking between mostly middling entries. You'd want a design
       | that's, say, a 9 or a 10 on a 10-point scale -- but some 95% of
       | the entry distribution will probably be between 5.5 and 7.5 or
       | so, and that's what users will get to pick from.
        
         | j_da wrote:
         | All great points. A limitation with human feedback is that once
         | you start asking for more than binary preferences (e.g.
         | multiple rankings or written feedback), the quality of the
         | feedback does decrease. For instance, humans can often give
         | a quick answer on preference, but when asked _why_ they
         | prefer one thing over another, they might not be able to
         | fully explain it in language. How to collect and incorporate
         | the most useful kinds of feedback is very much an open area
         | of research.
         | 
         | I definitely agree with your second point. One idea we're
         | experimenting with is adding a human baseline, in which the
         | models are benchmarked against human generated designs as well.
        
         | grace77 wrote:
         | yes! to the second point, someone in our Show HN proposed
         | encouraging human designers to compete in submissions as
         | well. We tried implementing this and found that, at least
         | right now, LLMs are still so bad at design that it's trivial
         | for a human to beat them. Our plan is to focus more on this
         | once it becomes more of a challenge and therefore hopefully
         | more interesting/entertaining.
        
       | henriquegodoy wrote:
       | This is actually really needed. Current AI design tools are so
       | predictable and formulaic: every output feels like the same
       | purple gradients with rounded corners and that one specific
       | sans-serif font every model seems obsessed with. It's gotten
       | to the point where you can spot AI-generated designs from a
       | mile away, because they all have this weird sterile aesthetic
       | that screams "made by a model".
        
         | grace77 wrote:
         | Exactly - we think the ai design tools are in the equivalent of
         | the 'uncanny valley' territory that a lot of the diffusion
         | models were stuck in just 1-2 months ago; most average
         | diffusion models are still in this local optimum, but the best
         | of the best seem to have escaped it.
        
         | BoorishBears wrote:
         | I don't think this works right now tbh.
         | 
         | It has the same problem as LMArena (which already had
         | webarena): better aesthetics are so far out of distribution you
         | can't even train on the feedback you get here.
         | 
         | You just get a new form of turbo-slop as some hidden preference
         | takes over. With text output that ended up being extensive
         | markdown and emojis. Here that might be people accidentally
         | associating frosted surfaces with relatively better aesthetics,
         | for example.
         | 
         | The problem is so bad that LMArena maintains a separate
         | ranking where they strip away styling entirely.
        
       | ChrisArchitect wrote:
       | So this went from a Show HN: to a Launch HN in a month?
       | 
       | (Show HN: https://news.ycombinator.com/item?id=44542578)
        
         | nextworddev wrote:
         | Vibe launch
        
         | grace77 wrote:
         | yes - the changes since Show HN have been builders
         | (https://www.designarena.ai/builder), audio
         | (https://www.designarena.ai/audio), video
         | (https://www.designarena.ai/diffusion), and compare
         | (https://www.designarena.ai/studio). We also had a feed at some
         | point, but ripped that out because it looked messy
        
       | doctorpangloss wrote:
       | Can you write what you imagine is a good "game dev" prompt?
        
         | grace77 wrote:
         | We keep our system prompts across the board as bare bones as
         | possible: https://www.designarena.ai/system-prompts
         | 
         | As for good game dev prompts, here's one from a user that made
         | a pretty fun game: Make asteroids with 2 computers playing
         | against each other on one screen. There should be asteroids
         | flying and 2 ships being controlled by 2 computers. Pay
         | attention to thoroughly implementing the logic to make the
         | ships avoid asteroids at all costs. Absolutely no user input
         | should be necessary, no click to start, no click to restart.
         | The game starts automatically on load and automatically
         | restarts when either computer is dead. The ships should survive
         | as long as possible. The ships should fly around, avoid
         | asteroids as a priority, but also shoot asteroids and each
         | other. Make ships and asteroids positions random each time.
         | Asteroids should split when shot. The goal is to create a
         | robust algorithm for ships so they can survive as long as
         | possible. The game should be playable at 500x500 screen
         | resolution.
        
       ___________________________________________________________________
       (page generated 2025-08-12 23:00 UTC)