[HN Gopher] Pelicans on a bicycle
___________________________________________________________________
Pelicans on a bicycle
Author : colejohnson66
Score : 60 points
Date : 2024-12-16 18:10 UTC (4 hours ago)
(HTM) web link (simonwillison.net)
(TXT) w3m dump (simonwillison.net)
| flippyhead wrote:
| Once again, Claude wins.
| twiss wrote:
| I think it's interesting to consider how humans would go about
| this task ("Generate an SVG of a pelican riding a bicycle"), and
| how well they would do if they had to output the SVG into a text
| box without any other tools. Considering that, I think Claude 3.5
| Sonnet and GPT-4o did incredibly well, and even the others might
| be commended for making a valid SVG at all.
| nemomarx wrote:
| depends on the human, right? I imagine an artist who
| specializes in SVG would do pretty well and might make it a
| professional logo?
| egypturnash wrote:
| Pro artist who specializes in vector work here: I would use
| Adobe Illustrator to draw it, while looking at actual photos
| of pelicans and bicycles, and export an SVG. If it needed to
| have a lot of named parts I could make that happen.
|
| If I had the latest version of Illustrator then I would
| consider seeing how well its image generation does, but I do
| not because it has a lot of exciting new bugs that break my
| normal workflow. I believe that under the hood that works by
| feeding your text prompt to a bitmap image generator and
| running the same old autotrace on it, which results in some
| pretty messy and hard-to-edit shapes.
| danielcorin wrote:
| More recently, `gemini-exp-1206` did quite well [1].
|
| [1]: https://github.com/simonw/pelican-
| bicycle/blob/main/README.m...
| toxik wrote:
| I feel like "quite well" is overselling it a bit. It did
| maybe a bit better.
| ttul wrote:
| Gemini 1206 is the new hotness in my books. I've moved my day
| to day LLM needs over to Google's tab for the first time. I'm
| not sure what they changed, but it deserves a good look. Claude
| 3.5-Sonnet (New) is fantastic as well, but the 2M token context
| window offered by Google allows you to suck in an entire code
| repository and reason effectively across the whole thing.
| Google is catching up...
| simonw wrote:
| I've been using this dumb benchmark for a few months now. More
| posts about it here: https://simonwillison.net/tags/pelican-
| riding-a-bicycle/
| behnamoh wrote:
| [flagged]
| simonw wrote:
| Sure, they're the wrong tool for drawing a pelican - but
| testing their SVG output is a useful way to get a feel for
| how good they are at step by step reasoning, coordinate
| systems, spatial awareness and generating valid SVG/XML.
|
| There are genuinely useful applications of SVG-generation
| from LLMs - outputting simple infographics or charts for
| example.
|
| I use LLMs to write HTML all the time, of which SVG is a
| useful optional component.
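The "generating valid SVG/XML" part of the test is easy to check mechanically. A minimal sketch using only the Python standard library (the `is_valid_svg` helper name is mine, not part of the benchmark):

```python
# Minimal sketch: check that an LLM's SVG output is at least well-formed
# XML with an <svg> root, before judging the drawing itself.
import xml.etree.ElementTree as ET

def is_valid_svg(text: str) -> bool:
    """Return True if `text` parses as XML and its root element is <svg>."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False
    # SVG elements are usually namespaced; accept bare and namespaced tags.
    return root.tag.split("}")[-1] == "svg"

pelican = ('<svg xmlns="http://www.w3.org/2000/svg">'
           '<circle cx="50" cy="50" r="20"/></svg>')
print(is_valid_svg(pelican))            # True
print(is_valid_svg("<svg><unclosed>"))  # False
```

Passing this check says nothing about whether the drawing looks like a pelican; that part still needs a human (or vision model) looking at the render.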
| KMnO4 wrote:
| This benchmark is interesting, because it sidesteps the
| reasoning and process that humans would excel at.
|
| For example, if I asked you to assemble a bookshelf with
| some wood, nails, and cement, you might first make a hammer
| with the cement before trying to assemble the bookshelf.
|
| You can get a much better image by first asking the
| (multimodal) LLM to draw an image of a pelican on a
| bicycle, and then generate an SVG using the referenced
| image.
|
| https://chatgpt.com/share/67609300-9abc-800d-9b26-95074f214
| 9...
| gffrd wrote:
| > these models aren't pelican painters or anything like that,
| they're LANGUAGE models
|
| Tools are defined by what people use them for, not by how
| they were intended--or designed--to be used. (Just ask
| Nvidia)
|
| adding: so I think someone comparing how various tools
| perform at a task that's valuable to them--and probably
| others--is just fine, even if it's different from what the
| creator of the tool intended?
| sfink wrote:
| I hope you have other private benchmarks running that you don't
| talk about or publish, just in case a model maker intentionally
| targets one of your benchmarks, or some fuzzy "find things
| people have mentioned as potential LLM benchmarks" process
| scoops up your ideas and/or outputs.
| zamadatix wrote:
| If we ever get to the point where LLMs are already
| optimized to answer every question you can think of, then
| there isn't really a need to have a secret question in the
| first place.
| sfink wrote:
| Not any that you can think of. Just the ones you've
| published something about.
|
| Plus, simonw isn't exactly a meaningless nobody in this
| space, and his writeups are more detailed and actionable,
| and therefore identifiable, than some random "hey a great
| LLM benchmark would be creating an SVG of a walrus twerking
| in front of a jelly bean store" throwaway comment.
|
| Proof: I asked ChatGPT 4o the question "What are some users
| who post ad hoc LLM benchmarks to technical discussion
| sites, and what benchmarks have they proposed?" simonw is
| in the list, 1 of 7 individual people it suggested. (The
| proposed benchmarks listed for him were more general than
| the specific one here: "Testing LLMs' capabilities with
| code generation, particularly in niche languages or against
| real-world API schemas." But it's easy to imagine followup
| queries bringing this one up.)
| zamadatix wrote:
| I'm in agreement LLMs get contaminated with test data,
| particularly from simonw. What I'm referring to is that
| nobody needs to worry about hoarding secret questions from
| the public eye to avoid that problem. It is a valid
| approach... but a bit of a sad path, all things considered.
|
| Don't run unpublished private benchmarks or worry about
| keeping a counted hoard of secret questions. Do rotate
| your questions every few months to whatever comes to mind
| at the time. When nothing comes to mind, there is no point
| in running a question benchmark anymore, as it already
| answers every possible question you can think of (and the
| only way it gets there in your lifespan is by reasoning
| rather than memorization). You can always run the new
| question retroactively on old models for comparison
| purposes, so that's not a concern either.
|
| The important thing here being "rotate questions without
| concern of having things lined up for it" rather than
| "fear what happens when you discuss your question".
| eminence32 wrote:
| I like how one of them is clearly (to my eyes, at least) a person
| holding a gun
| MarkusWandel wrote:
| I'm focusing on the bike part here because, as a bike geek, I
| could draw one from memory that's correct in all details. But to
| a non-bikie that's more difficult than you'd think. I can't find
| the picture gallery right now but an article about it, which
| links another article:
|
| https://web.archive.org/web/20240419001426/https://www.wired...
|
| So the fact that the AI models screw this up so badly is
| understandable. Sure, they screw up in ways that humans wouldn't,
| such as the beak backwards in one of the pictures (pointy end
| toward the bird!) because they don't know or care about something
| every human would know: What a beak is for and what it looks like
| in general. Or for that matter the biodynamics of how a pelican's
| long, spindly legs could, in fact, work a pair of pedals.
| But ask me
| to draw a pelican from memory, and have a good laugh (if you're
| better at it than me) because to me, they're just kind of a
| peripheral vision, pink abstraction, not something I focus on
| understanding. And that's what they are to the AI model too.
| parpfish wrote:
| > ... they're just kind of a peripheral vision, pink
| abstraction, not something I focus on understanding
|
| are there pink pelicans, or are you thinking of flamingoes?
| MarkusWandel wrote:
| Ha, see, even missed that part! Honestly.
| larubbio wrote:
| This is the artist you are thinking of.
|
| https://www.gianlucagimini.it/portfolio-item/velocipedia/
| alwa wrote:
| This is incredible. I wonder if anybody has set out to build
| some of these bikes as sculptures.
| Ylpertnodi wrote:
| >as a bike geek, I could draw one from memory that's correct in
| all details.
|
| No link, sorry, but on youtube, GCN asked pro riders to draw a
| bicycle...none could.
| MarkusWandel wrote:
| Well, as a bike geek - who wrenches on them, changes old
| bikes to different configurations etc - I can visualize every
| part because I've dealt with all of them. I can tell you, for
| example, that ancient Shimano downtube shifters are held in
| place by an M4.5 bolt. M4.5? Try to find something to fit
| that at your local hardware store (when changing said bike to
| handlebar-mounted shifters). Or which way the opposite sides
| of a BSA bottom bracket are threaded (from memory! Which side
| has the backwards threads?) Or the whole stack of bits and
| pieces that make up a headset (both threaded and threadless).
|
| Whereas a pro rider can probably tell you all about the
| biomechanics of how to optimally interact with the bike, the
| right foods to eat and how much to sleep and when. But the
| actual wrenching around with them? That's the pro mechanic's
| job.
| olddustytrail wrote:
| Why don't you share some of your videos where you draw
| these bikes?
| bsammon wrote:
| I'd have to agree here that success at this drawing
| test/challenge is strongly correlated with experience
| repairing/maintaining (one or more) bicycles, a lot more
| than it is correlated with riding them.
|
| I also suspect it strongly correlates with knowing the term
| "diamond-frame". In addition to bicycle-repairers probably
| knowing the term, it's also used among people who like/know
| other frame styles--in my case recumbent bicycles.
| stevage wrote:
| Ha, I'm pretty into bikes, to the point where at least I
| understand the questions here but the most complex things I
| have ever done were changing a normal BB and set of cranks,
| and replacing some STI cables.
| vbarrielle wrote:
| Lots of pro riders do not take care of their bikes
| themselves. They're used to having bike mechanics adjust
| everything for every race, and a lot of them don't consider
| it part of the job to take care of their training bikes. Some
| of them don't even do the cleaning.
| eitally wrote:
| Those output examples are absolutely horrifically bad compared to
| what I get with a cursory request to Gemini 2.0 using Imagen3 via
| gemini.google.com.
| toxik wrote:
| This is generating SVG data, not using an image generator.
| alwa wrote:
| Was yours an SVG? I think that's what makes Simon feel that
| this test is useful: the LLM has to generate functioning SVG
| code describing these shapes.
| simonw wrote:
| Yeah, this test is to see if a pure LLM can output SVG that
| renders well. It's effectively a test of their "spatial
| reasoning" capabilities.
| oatsandsugar wrote:
| It looks like no LLM has seen a pelican before at all.
| pjs_ wrote:
| I wanna see this running in a feedback loop - show the model its
| output and get it to make corrections.
|
| Remember that these are basically one-shot. Very different to how
| you or I would solve the problem (get a circle up on the screen,
| have a look at it, make some changes, add some wings, tweak the
| dimensions, etc.). We would go through hundreds or thousands of
| feedback cycles before we got something half-decent -- in this
| situation the model only gets one attempt.
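The loop being proposed can be sketched in a few lines. The `generate` and `critique` callables are hypothetical stand-ins for an LLM call and a vision-model (or human) look at the rendered output:

```python
# Sketch of the feedback loop described above: inspect the model's SVG,
# feed criticism back, and ask for a revision, until the critic is happy.
def refine_svg(prompt, generate, critique, max_rounds=5):
    svg = generate(prompt, previous=None, feedback=None)
    for _ in range(max_rounds):
        feedback = critique(svg)  # e.g. a vision model's notes on a render
        if feedback is None:      # critic is satisfied; stop iterating
            break
        svg = generate(prompt, previous=svg, feedback=feedback)
    return svg

# Toy stand-ins: the "critic" complains until a beak appears.
def toy_generate(prompt, previous=None, feedback=None):
    if feedback:
        return "<svg><!-- pelican with beak --></svg>"
    return "<svg><!-- pelican --></svg>"

def toy_critique(svg):
    return None if "beak" in svg else "The pelican has no beak."

result = refine_svg("pelican riding a bicycle", toy_generate, toy_critique)
print(result)
```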
| cadamsau wrote:
| Careful not to shout too loud about it or they'll start training
| for it!
|
| Latest Claude does a suspiciously good job...
| marcodiego wrote:
| Most humans can't correctly draw a bicycle from memory:
| https://www.wired.com/2016/04/can-draw-bikes-memory-definite...
| fuegoio wrote:
| Mistral Large proposition:
| https://chat.mistral.ai/chat/4da427b1-e033-454d-b134-c5d1f6e...
| Mistletoe wrote:
| If only it didn't have to be svg, so glorious.
|
| https://g.co/gemini/share/56e5dfa1a598
| skissane wrote:
| Just thinking out loud: a lot of models generate pretty poor SVG
| output, but part of that is because they don't get any visual
| feedback.
|
| But, what about this workflow: given prompt, LLM generates two
| SVG outputs. Both are rendered by an SVG renderer, and then we
| combine the two into one image, one on the left and the other on
| the right. We then ask a visual LLM (could be the same LLM or
| could be a different one) to tell us whether the left half or
| right half of the image is a better response to the prompt. Now
| we've got preferences which can be used to fine-tune the LLM
| using DPO. And you could iteratively repeat the process - as the
| LLM is fine-tuned it may produce even better outputs which then
| produces new preferences for further fine-tuning.
|
| Would be interesting to see what kinds of results it might
| produce in practice.
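The preference-collection step of that workflow might look like the sketch below. The `sample` and `judge` callables are hypothetical stand-ins for the generator LLM and the visual judge; the triples produced are the (prompt, chosen, rejected) format DPO training expects:

```python
import itertools

# Sketch: sample two SVG candidates per prompt, let a judge pick the
# better one, and record a DPO-style (prompt, chosen, rejected) triple.
def collect_preferences(prompts, sample, judge):
    pairs = []
    for prompt in prompts:
        left, right = sample(prompt), sample(prompt)
        winner = judge(prompt, left, right)  # returns "left" or "right"
        chosen, rejected = (left, right) if winner == "left" else (right, left)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs

# Toy stand-ins: the generator alternates two outputs; the judge prefers
# the longer (here, "more detailed") candidate.
outputs = itertools.cycle(["<svg>a</svg>", "<svg>abc</svg>"])
pairs = collect_preferences(
    ["pelican riding a bicycle"],
    sample=lambda p: next(outputs),
    judge=lambda p, l, r: "left" if len(l) >= len(r) else "right",
)
print(pairs[0]["chosen"])
```

As the comment notes, the loop could be repeated: fine-tune on the pairs, regenerate, re-judge, and collect fresh preferences from the improved model.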
| e3a8 wrote:
| "Generate room with no elephant" https://imgur.com/BI61S1T
| boredhedgehog wrote:
| Some of the drawings, like the one from Amazon Nova Pro, are
| quite fascinating as abstract artworks. It's like the idea of a
| bicycle without its physicality.
| notatoad wrote:
| Assuming that LLMs are trained to generate human-like
| output, I think Claude and GPT-4o both aced this.
|
| No, they didn't get it right, but the output approximates what
| most humans can do.
| tessellated wrote:
| > aren't any pelican on a bicycle SVG files floating around (yet)
|
| _maintains website collecting SVG files of pelicans on bicycles_
___________________________________________________________________
(page generated 2024-12-16 23:01 UTC)