[HN Gopher] Pelicans on a bicycle
       ___________________________________________________________________
        
       Pelicans on a bicycle
        
       Author : colejohnson66
       Score  : 60 points
       Date   : 2024-12-16 18:10 UTC (4 hours ago)
        
 (HTM) web link (simonwillison.net)
 (TXT) w3m dump (simonwillison.net)
        
       | flippyhead wrote:
       | Once again, Claude wins.
        
       | twiss wrote:
       | I think it's interesting to consider how humans would go about
       | this task ("Generate an SVG of a pelican riding a bicycle"), and
       | how well they would do if they had to output the SVG into a text
       | box without any other tools. Considering that, I think Claude 3.5
       | Sonnet and GPT-4o did incredibly well, and even the others might
       | be commended for making a valid SVG at all.
        
         | nemomarx wrote:
         | depends on the human, right? I imagine an artist who
         | specializes in SVG would do pretty well and might make it a
         | professional logo?
        
           | egypturnash wrote:
           | Pro artist who specializes in vector work here: I would use
           | Adobe Illustrator to draw it, while looking at actual photos
           | of pelicans and bicycles, and export an SVG. If it needed to
           | have a lot of named parts I could make that happen.
           | 
           | If I had the latest version of Illustrator then I would
           | consider seeing how well its image generation does, but I do
           | not because it has a lot of exciting new bugs that break my
           | normal workflow. I believe that under the hood that works by
           | feeding your text prompt to a bitmap image generator and
           | running the same old autotrace on it, which results in some
           | pretty messy and hard-to-edit shapes.
        
       | danielcorin wrote:
       | More recently, `gemini-exp-1206` did quite well [1].
       | 
       | [1]: https://github.com/simonw/pelican-bicycle/blob/main/README.m...
        
         | toxik wrote:
         | I feel like "quite well" is overselling it a bit. It did maybe
         | better.
        
         | ttul wrote:
         | Gemini 1206 is the new hotness in my books. I've moved my day
         | to day LLM needs over to Google's tab for the first time. I'm
         | not sure what they changed, but it deserves a good look. Claude
         | 3.5-Sonnet (New) is fantastic as well, but the 2M token context
         | window offered by Google allows you to suck in an entire code
         | repository and reason effectively across the whole thing.
         | Google is catching up...
        
       | simonw wrote:
       | I've been using this dumb benchmark for a few months now. More
       | posts about it here:
       | https://simonwillison.net/tags/pelican-riding-a-bicycle/
        
         | behnamoh wrote:
         | [flagged]
        
           | simonw wrote:
           | Sure, they're the wrong tool for drawing a pelican - but
           | testing their SVG output is a useful way to get a feel for
           | how good they are at step by step reasoning, coordinate
           | systems, spatial awareness and generating valid SVG/XML.
           | 
           | There are genuinely useful applications of SVG-generation
           | from LLMs - outputting simple infographics or charts for
           | example.
           | 
           | I use LLMs to write HTML all the time, of which SVG is a
           | useful optional component.
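
The "generating valid SVG/XML" part of that test can be checked mechanically. A minimal sketch, assuming the model's reply is already available as a string; the helper name `is_wellformed_svg` is hypothetical, and the check only covers XML well-formedness and the root element, not whether the drawing looks right:

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

def is_wellformed_svg(text: str) -> bool:
    """Return True if text parses as XML and its root element is <svg>."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False
    # ElementTree reports namespaced tags as "{namespace}localname".
    return root.tag in ("svg", f"{{{SVG_NS}}}svg")

# One well-formed document and one with an unclosed <circle> element.
good = ('<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">'
        '<circle cx="50" cy="50" r="20"/></svg>')
bad = '<svg xmlns="http://www.w3.org/2000/svg"><circle cx="50"></svg>'

print(is_wellformed_svg(good), is_wellformed_svg(bad))
```

Spatial quality still needs a human (or a renderer plus a vision model) to judge; this only filters out replies that would not parse at all.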
        
             | KMnO4 wrote:
             | This benchmark is interesting, because it sidesteps the
             | reasoning and process that humans would excel at.
             | 
             | For example, if I asked you to assemble a bookshelf with
             | some wood, nails, and cement, you might first make a hammer
             | with the cement before trying to assemble the bookshelf.
             | 
             | You can get a much better image by first asking the
             | (multimodal) LLM to draw an image of a pelican on a
             | bicycle, and then generate an SVG using the referenced
             | image.
             | 
             | https://chatgpt.com/share/67609300-9abc-800d-9b26-95074f2149...
        
           | gffrd wrote:
           | > these models aren't pelican painters or anything like that,
           | they're LANGUAGE models
           | 
           | Tools are defined by what people use them for, not by how
           | they were intended--or designed--to be used. (Just ask
           | Nvidia)
           | 
           | adding: so I think someone comparing how various tools
           | perform at a task that's valuable to them--and probably
           | others--is just fine, even if it's different from what the
           | creator of the tool intended?
        
         | sfink wrote:
         | I hope you have other private benchmarks running that you don't
         | talk about or publish, just in case a model maker intentionally
         | targets one of your benchmarks, or some fuzzy "find things
         | people have mentioned as potential LLM benchmarks" process
         | scoops up your ideas and/or outputs.
        
           | zamadatix wrote:
           | If we ever get to the point LLMs are already optimized to
           | answer every question you can think of then there isn't
           | really a need to have a secret question in the first place.
        
             | sfink wrote:
             | Not any that you can think of. Just the ones you've
             | published something about.
             | 
             | Plus, simonw isn't exactly a meaningless nobody in this
             | space, and his writeups are more detailed and actionable,
             | and therefore identifiable, than some random "hey a great
             | LLM benchmark would be creating an SVG of a walrus twerking
             | in front of a jelly bean store" throwaway comment.
             | 
             | Proof: I asked ChatGPT 4o the question "What are some users
             | who post ad hoc LLM benchmarks to technical discussion
             | sites, and what benchmarks have they proposed?" simonw is
             | in the list, 1 of 7 individual people it suggested. (The
             | proposed benchmarks listed for him were more general than
             | the specific one here: "Testing LLMs' capabilities with
             | code generation, particularly in niche languages or against
             | real-world API schemas." But it's easy to imagine followup
             | queries bringing this one up.)
        
               | zamadatix wrote:
               | I'm in agreement LLMs get contaminated with test data,
               | particularly from simonw. What I'm referring to is nobody
               | needs to worry about hoarding secret questions from the
               | public eye to avoid that problem. It is a valid
               | approach... but a bit of a sad path considering.
               | 
               | Don't run unpublished private benchmarks or worry about
               | keeping a counted hoard of secret questions. Do rotate
               | your questions every few months to whatever comes to mind
               | at the time. When nothing comes to mind there is no point
               | in running a question benchmark anymore as it already
               | answers every possible question you can think of (and the
               | only way it gets there in your lifespan is by reasoning
               | rather than memorization). You can always run the new
               | question retroactively on old models for comparison
               | purposes so that's not a concern either.
               | 
               | The important thing here being "rotate questions without
               | concern of having things lined up for it" rather than
               | "fear what happens when you discuss your question".
        
       | eminence32 wrote:
       | I like how one of them is clearly (to my eyes, at least) a person
       | holding a gun
        
       | MarkusWandel wrote:
       | I'm focusing on the bike part here because, as a bike geek, I
       | could draw one from memory that's correct in all details. But to
       | a non-bikie that's more difficult than you'd think. I can't find
       | the picture gallery right now but an article about it, which
       | links another article:
       | 
       | https://web.archive.org/web/20240419001426/https://www.wired...
       | 
       | So the fact that the AI models screw this up so badly is
       | understandable. Sure, they screw up in ways that humans wouldn't,
       | such as the beak backwards in one of the pictures (pointy end
       | toward the bird!) because they don't know or care about something
       | every human would know: What a beak is for and what it looks like
       | in general. Or for that matter the biodynamics of how a pelican's
       | long, spindly legs could, in fact, work a pair of pedals. But ask me
       | to draw a pelican from memory, and have a good laugh (if you're
       | better at it than me) because to me, they're just kind of a
       | peripheral vision, pink abstraction, not something I focus on
       | understanding. And that's what they are to the AI model too.
        
         | parpfish wrote:
         | > ... they're just kind of a peripheral vision, pink
         | abstraction, not something I focus on understanding
         | 
         | are there pink pelicans, or are you thinking of flamingoes?
        
           | MarkusWandel wrote:
           | Ha, see, even missed that part! Honestly.
        
         | larubbio wrote:
         | This is the artist you are thinking of.
         | 
         | https://www.gianlucagimini.it/portfolio-item/velocipedia/
        
           | alwa wrote:
           | This is incredible. I wonder if anybody has set out to build
           | some of these bikes as sculptures.
        
         | Ylpertnodi wrote:
         | >as a bike geek, I could draw one from memory that's correct in
         | all details.
         | 
         | No link, sorry, but on YouTube, GCN asked pro riders to draw a
         | bicycle...none could.
        
           | MarkusWandel wrote:
           | Well, as a bike geek - who wrenches on them, changes old
           | bikes to different configurations etc - I can visualize every
           | part because I've dealt with all of them. I can tell you, for
           | example, that ancient Shimano downtube shifters are held in
           | place by an M4.5 bolt. M4.5? Try to find something to fit
           | that at your local hardware store (when changing said bike to
           | handlebar-mounted shifters). Or which way the opposite sides
           | of a BSA bottom bracket are threaded (from memory! Which side
           | has the backwards threads?) Or the whole stack of bits and
           | pieces that make up a headset (both threaded and threadless).
           | 
           | Whereas a pro rider can probably tell you all about the
           | biomechanics of how to optimally interact with the bike, the
           | right foods to eat and how much to sleep and when. But the
           | actual wrenching around with them? That's the pro mechanic's
           | job.
        
             | olddustytrail wrote:
             | Why don't you share some of your videos where you draw
             | these bikes?
        
             | bsammon wrote:
             | I'd have to agree here that success at this drawing
             | test/challenge is strongly correlated with experience
             | repairing/maintaining (one or more) bicycles, a lot more
             | than it is correlated with riding them.
             | 
             | I also suspect it strongly correlates with knowing the term
             | "diamond-frame". In addition to bicycle-repairers probably
             | knowing the term, it's also used among people who like/know
             | other frame styles--in my case recumbent bicycles.
        
             | stevage wrote:
             | Ha, I'm pretty into bikes, to the point where at least I
             | understand the questions here but the most complex things I
             | have ever done were changing a normal BB and set of cranks,
             | and replacing some STI cables.
        
           | vbarrielle wrote:
           | Lots of pro riders do not take care of their bikes
           | themselves. They're used to having bike mechanics adjust
           | everything for every race, and a lot of them don't consider
           | it part of the job to take care of their training bikes. Some
           | of them don't even do the cleaning.
        
       | eitally wrote:
       | Those output examples are absolutely horrifically bad compared to
       | what I get with a cursory request to Gemini 2.0 using Imagen3 via
       | gemini.google.com.
        
         | toxik wrote:
         | This is generating SVG data, not using an image generator.
        
         | alwa wrote:
         | Was yours an SVG? I think that's what makes Simon feel that
         | this test is useful: the LLM has to generate functioning SVG
         | code describing these shapes.
        
           | simonw wrote:
           | Yeah, this test is to see if a pure LLM can output SVG that
           | renders well. It's effectively a test of their "spatial
           | reasoning" capabilities.
        
       | oatsandsugar wrote:
       | It looks like no LLM has seen a pelican before at all.
        
       | pjs_ wrote:
       | I wanna see this running in a feedback loop - show the model its
       | output and get it to make corrections.
       | 
       | Remember that these are basically one-shot. Very different to how
       | you or I would solve the problem (get a circle up on the screen,
       | have a look at it, make some changes, add some wings, tweak the
       | dimensions, etc.). We would go through hundreds or thousands of
       | feedback cycles before we got something half-decent -- in this
       | situation the model only gets one attempt.
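
The feedback loop described above could be sketched roughly as follows. This is an illustration under stated assumptions, not a tested pipeline: `generate` and `critique` are hypothetical callables standing in for model calls (for example, one that writes SVG and one that looks at a rendering and comments on it), and `max_rounds` is an arbitrary cap.

```python
from typing import Callable, Optional

def refine(prompt: str,
           generate: Callable[[str, Optional[str], Optional[str]], str],
           critique: Callable[[str, str], Optional[str]],
           max_rounds: int = 5) -> str:
    """Generate, inspect, and regenerate until the critic has no complaints.

    generate(prompt, previous_svg, feedback) returns a new SVG string.
    critique(prompt, svg) returns feedback text, or None when satisfied.
    Both are hypothetical stand-ins for real model calls.
    """
    # First attempt is one-shot: no previous output and no feedback yet.
    svg = generate(prompt, None, None)
    for _ in range(max_rounds):
        feedback = critique(prompt, svg)
        if feedback is None:
            break  # the critic is satisfied, stop early
        # Show the model its own output plus the feedback and try again.
        svg = generate(prompt, svg, feedback)
    return svg
```

With `max_rounds=0` this degenerates to the one-shot setup the benchmark currently uses, which makes it easy to compare the two.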
        
       | cadamsau wrote:
       | Careful not to shout too loud about it or they'll start training
       | for it!
       | 
       | Latest Claude does a suspiciously good job...
        
       | marcodiego wrote:
       | Most humans can't correctly draw a bicycle from memory:
       | https://www.wired.com/2016/04/can-draw-bikes-memory-definite...
        
       | fuegoio wrote:
       | Mistral Large proposition:
       | https://chat.mistral.ai/chat/4da427b1-e033-454d-b134-c5d1f6e...
        
       | Mistletoe wrote:
       | If only it didn't have to be svg, so glorious.
       | 
       | https://g.co/gemini/share/56e5dfa1a598
        
       | skissane wrote:
       | Just thinking out loud: a lot of models generate pretty poor SVG
       | output, but part of that is because they don't get any visual
       | feedback.
       | 
       | But, what about this workflow: given prompt, LLM generates two
       | SVG outputs. Both are rendered by an SVG renderer, and then we
       | combine the two into one image, one on the left and the other on
       | the right. We then ask a visual LLM (could be the same LLM or
       | could be a different one) to tell us whether the left half or
       | right half of the image is a better response to the prompt. Now
       | we've got preferences which can be used to fine-tune the LLM
       | using DPO. And you could iteratively repeat the process - as the
       | LLM is fine-tuned it may produce even better outputs which then
       | produces new preferences for further fine-tuning.
       | 
       | Would be interesting to see what kinds of results it might
       | produce in practice.
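
The preference-collection step of that workflow might look something like the sketch below. Several things here are assumptions rather than part of the proposal: `judge` is a hypothetical callable that hides the rendering, side-by-side compositing, and visual-LLM call; the `prompt`/`chosen`/`rejected` field names are modeled on common preference-dataset layouts; and the left/right shuffle is an added precaution against position bias in the judge.

```python
import random
from typing import Callable, Dict

def preference_pair(prompt: str,
                    svg_a: str,
                    svg_b: str,
                    judge: Callable[[str, str, str], str],
                    rng=random) -> Dict[str, str]:
    """Turn two candidate SVGs into one preference record.

    judge(prompt, left_svg, right_svg) must return "left" or "right".
    It is a hypothetical stand-in for: render both, paste them side by
    side, and ask a vision model which half answers the prompt better.
    """
    # Randomize sides so a judge that favors one side does not
    # systematically favor the first candidate.
    if rng.random() < 0.5:
        left, right = svg_a, svg_b
    else:
        left, right = svg_b, svg_a

    verdict = judge(prompt, left, right)
    if verdict not in ("left", "right"):
        raise ValueError(f"unexpected verdict: {verdict!r}")

    chosen, rejected = (left, right) if verdict == "left" else (right, left)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

Records of this shape could then be accumulated over many prompts and handed to whatever DPO training setup is in use; the iterative part of the idea is just repeating the collection with the newly fine-tuned model.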
        
       | e3a8 wrote:
       | "Generate room with no elephant" https://imgur.com/BI61S1T
        
       | boredhedgehog wrote:
       | Some of the drawings, like the one from Amazon Nova Pro, are
       | quite fascinating as abstract artworks. It's like the idea of a
       | bicycle without its physicality.
        
       | notatoad wrote:
       | assuming that LLMs are trained to generate human-like output, i
       | think Claude and GPT4o both aced this.
       | 
       | no, they didn't get it right, but the output approximates what
       | most humans can do.
        
       | tessellated wrote:
       | > aren't any pelican on a bicycle SVG files floating around (yet)
       | 
       |  _maintains website collecting SVG files of pelicans on bicycles_
        
       ___________________________________________________________________
       (page generated 2024-12-16 23:01 UTC)