[HN Gopher] AI World Clocks
___________________________________________________________________
AI World Clocks
"Every minute, a new clock is rendered by nine different AI
models."
Author : waxpancake
Score : 1283 points
Date : 2025-11-14 18:35 UTC (1 day ago)
(HTM) web link (clocks.brianmoore.com)
(TXT) w3m dump (clocks.brianmoore.com)
| kfarr wrote:
| Add some voting and you got yourself an AI World Clock arena!
| https://artificialanalysis.ai/image/arena
| BrandoElFollito wrote:
| Thank you very much.... It was a fun game until I got to the
| prompt
|
| Place a baby elephant in the green chair
|
| I cannot unsee what I saw and it is 21:30 here so I have an
| hour or so to eliminate the picture from my mind or I will have
| nightmares.
| syx wrote:
| I'm very curious about the monthly bill for such a creative
| project, surely some of these are pre-rendered?
| coffeecoders wrote:
| Napkin math:
|
| 9 AIs x 43,200 minutes = 388,800 requests/month
|
| 388,800 requests x 200 tokens = 77,760,000 tokens/month ≈ 78M
| tokens
|
| Cost varies from 10 cents to $1 per 1M tokens.
|
| Using the mid-price, the cost is around $50/month.
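|
| A quick sanity check in Python (the 200-token and price figures
| above are guesses, so this is an order-of-magnitude estimate,
| not a bill):
|
| models = 9
| minutes_per_month = 60 * 24 * 30        # 43,200
| tokens_per_request = 200                # assumed output size
| tokens = models * minutes_per_month * tokens_per_request
| low, high = 0.10, 1.00                  # $ per 1M output tokens
| mid_price = (low + high) / 2
| print(f"{tokens / 1e6:.0f}M tokens/month")              # ~78M
| print(f"~${tokens / 1e6 * mid_price:.0f}/month (mid)")  # ~$43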
|
| ---
|
| Hopefully, the OP has this endpoint protected -
| https://clocks.brianmoore.com/api/clocks?time=11:19AM
| whimsicalism wrote:
| i think it is cached on the minute level, responses cannot be
| that fast
| ugh123 wrote:
| Cool, and marginally informative on the current state of things,
| but kind of a waste of energy given everything is re-done every
| minute to compare. We'd probably only need a handful of each to
| see the meaningful differences.
| whoisjuan wrote:
| It's actually quite fascinating if you watch it for 5 minutes.
| Some models are overall bad, but others nail it in one minute
| and butcher it in the next.
|
| It's perhaps the best example I have seen of model drift driven
| by just small, seemingly unimportant changes to the prompt.
| alister wrote:
| > _model drift driven by just small, seemingly unimportant
| changes to the prompt_
|
| What changes to the prompt are you referring to?
|
| According to the comment on the site, the prompt is the
| following:
|
| _Create HTML/CSS of an analog clock showing ${time}.
| Include numbers (or numerals) if you wish, and have a CSS
| animated second hand. Make it responsive and use a white
| background. Return ONLY the HTML/CSS code with no markdown
| formatting._
|
| The prompt doesn't seem to change.
| sambaumann wrote:
| presumably the time is replaced with the actual current
| time at each generation. I wonder if they are actually
| generated every minute or if all 6480 permutations (720 distinct
| 12-hour times * 9 llms) were generated and just shown on a
| schedule
| whoisjuan wrote:
| The time given to the model. So the difference between two
| generations is just something trivially different like:
| "12:35" vs "12:36"
| moffkalast wrote:
| Kimi seems the only reliable one which is a bit surprising,
| and GPT 4o is consistently better than GPT 5 which on the
| other hand is unfortunately not surprising at all.
| nbaugh1 wrote:
| It is really interesting to watch them for a while. QWEN
| keeps outputting some really abstract interpretations of a
| clock, KIMI is consistently very good, GPT5's results line up
| exactly with my experience with its code output (overly
| complex and never working correctly)
| bglusman wrote:
| We can't know how much is about the prompt though and how
| much is just stochastic randomness in the behavior of that
| model on that prompt, right? I mean, even given identical
| prompts, even at temp 0, models don't always behave
| identically.... at least, as far as I know? Some of the
| reasons why are I think still a research question, but I
| think it's a fact nonetheless.
| ascorbic wrote:
| The energy usage is minuscule.
| jdiff wrote:
| It's wasteful. If someone built a clock out of 47
| microservices that called out to 193 APIs to check the
| current time, location, time zone, and preferred display
| format we'd rightfully criticize it for similar reasons.
|
| In a world where Javascript and Electron are still getting
| (again, rightfully) skewered for inefficiency despite often
| exceeding the performance of many compiled languages, we
| should not dismiss the discussion around efficiency so
| easily.
| Arisaka1 wrote:
| What I find amusing with this argument is that no one ever
| brought up power savings when e.g. someone used "let me google
| that for you" instead of giving someone the answer to their
| question, because we saw the utility of teaching others how
| to Google. But apparently we can't see the utility of
| measuring the oversold competence of current AI models,
| given sufficiently large sampling size.
| saulpw wrote:
| Let's do some math.
|
| 60x24x30 = 40k AI calls per month per model. Let's suppose
| there are 1000 output tokens (might it be 10k tokens? Seems
| like a lot for this task). So 40m tokens per model.
|
| The price for 1m output tokens[0] ranges from $.10
| (qwen-2.5) to $60 (GPT-4). So $4/mo for the cheapest, and
| $2.5k/mo for the most expensive.
|
| So this might cost several thousand dollars a month?
| Something smells funny. But you're right, throttling it to
| once an hour would achieve a similar goal and likely cost
| less than $100/mo (which is still more than I would spend
| on a project like this).
|
| [0] https://pricepertoken.com/
| qwe----3 wrote:
| They use 4o (maybe a mini version?)
| berkes wrote:
| Yes it is wasteful.
|
| But I presume you light up Christmas lights in December,
| drive to the theater to watch a movie or fire up a campfire
| on holiday. That too is "wasteful". It's not needed; other,
| far more efficient ways exist to achieve the same. And
| in absolute numbers, far more energy intensive than running
| an LLM to create 9 clocks every minute. We do things to
| learn, have fun, be weird, make art, or just spend time.
|
| Now, if Rolex starts building watches by running an LLM to
| drive its production machines or if we replace millions of
| wall clocks with ones that "Run an LLM every second", then
| sure, the waste is an actual problem.
|
| The point I'm trying to make is that it's OK to consider or
| debate the energy use of LLMs compared to alternatives. But
| bringing up that debate in a context where someone is being
| creative, or having fun, is not, IMO. Because a lot
| of "fun" activities use a lot of energy, and that too isn't
| automatically "wasteful".
| ugh123 wrote:
| Hmm, curious. How did you come up with that?
| energy123 wrote:
| I sort of assumed they cached like 30 inferences and just
| repeat them, but maybe I'm being too cynical.
| PeterStuer wrote:
| Why? This is diagonal to how LLMs work, and trivially solved by
| a minimal hybrid front/sub system.
| em3rgent0rdr wrote:
| To gauge.
| bayindirh wrote:
| Because LLMs are touted to be the silver bullet of silver
| bullets. Built upon the world's knowledge, and with the capacity
| to call upon updated information with agents, they ought to
| rival the top programmers as of 3 days ago.
| awkwam wrote:
| They might be touted like that but it seems like you don't
| understand how they work. The example in the article shows
| that the prompt is limiting the LLM by giving it access to
| only 2000 tokens and also saying "ONLY OUTPUT ...". This is
| like me asking you to solve the same problem but forcing you
| to deactivate half of your brain + forget any programming
| experience you have. It's just stupid.
| bayindirh wrote:
| > like you don't understand how they work.
|
| I would not make such assumptions.
|
| > The example in the article shows that the prompt is
| limiting the LLM by giving it access to only 2000 tokens
| and also saying "ONLY OUTPUT ..."
|
| The site is pretty simple, method is pretty
| straightforward. If you believe this is unfair, you can
| always build one yourself.
|
| > It's just stupid.
|
| No, it's a great way of testing things within constraints.
| em3rgent0rdr wrote:
| Most look like they were done by a beginner programmer on crack,
| but every once in a while a correct one appears.
| morkalork wrote:
| I'd say more like a blind programmer in the early stages of
| dementia. Able to write code, unable to form a mental image of
| what it would render as and can't see the final result.
| pixl97 wrote:
| DeepSeek and Kimi seem to have correct ones most of the time
| I've looked.
| em3rgent0rdr wrote:
| yes, and sometimes Grok.
| pixl97 wrote:
| The hour hand commonly seems off on Grok.
| BrandoElFollito wrote:
| DeepSeek told me that it cannot generate pictures and
| suggested code (which is very different)
| shafoshaf wrote:
| It's interesting how drawing a clock is one of the primary
| signals for dementia. https://www.verywellhealth.com/the-clock-
| drawing-test-98619
| BrandoElFollito wrote:
| This is very interesting, thank you.
|
| I could not get to the story because of the cookie banner
| that does not work (at least on mobile Chrome and FF). The
| Internet Archive page: https://archive.ph/qz4ep
|
| I wonder how this test could be modified for people that have
| neurological problems - my father's hands shake a lot but I
| would like to try the test on him (I do not have suspicions,
| just curious).
|
| I passed it :)
| technothrasher wrote:
| "One variation of the test is to provide the person with a
| blank piece of paper and ask them to draw a clock showing 10
| minutes after 11. The word "hands" is not used to avoid
| giving clues."
|
| Hmm, ambiguity. I would be the smart ass that drew a digital
| clock for them, or a shaku-dokei.
| energy123 wrote:
| If they can identify which one is correct, then it's the same
| as always being correct, just with an expensive compute budget.
| larodi wrote:
| would be great to also see the prompt this was done with
| creade wrote:
| The ? button has "Create HTML/CSS of an analog clock showing ${time}.
| Include numbers (or numerals) if you wish, and have a CSS
| animated second hand. Make it responsive and use a white
| background. Return ONLY the HTML/CSS code with no markdown
| formatting."
| bananatron wrote:
| grok's looks like one of those clocks you'd find at a novelty
| shop
| AlfredBarnes wrote:
| It's cool to see them get it right... sometimes
| zkmon wrote:
| Why are Deepseek and Kimi beating other models by such a
| margin? Is this to do with their specialization for this task?
| baltimore wrote:
| Since the first (good) image generation models became available,
| I've been trying to get them to generate an image of a clock with
| 13 instead of the usual 12 hour divisions. I have not been
| successful. Usually they will just replace the "12" with a "13"
| and/or mess up the clock face in some other way.
|
| I'd be interested if anyone else is successful. Share how you did
| it!
| snek_case wrote:
| From my experience they quickly fail to understand anything
| beyond a superficial description of the image you want.
| atorodius wrote:
| That's less and less true
|
| https://minimaxir.com/2025/11/nano-banana-prompts/
| dang wrote:
| Related ongoing thread:
|
| _Nano Banana can be prompt engineered for nuanced AI image
| generation_ - https://news.ycombinator.com/item?id=45917875
| - Nov 2025 (214 comments)
| Scene_Cast2 wrote:
| I've noticed that image models are particularly bad at
| modifying popular concepts in novel ways (way worse
| "generalization" than what I observe in language models).
| emp17344 wrote:
| Maybe LLMs always fail to generalize outside their data set,
| and it's just less noticeable with written language.
| cluckindan wrote:
| This is it. They're language models which predict next
| tokens probabilistically and a sampler picks one according
| to the desired "temperature". Any generalization outside
| their data set is an artifact of random sampling:
| happenstance and circumstance, not genuine substance.
| cluckindan wrote:
| However: do humans have that genuine substance? Is human
| invention and ingenuity more than trial and error, more
| than adaptation and application of existing knowledge?
| Can humans generalize outside their data set?
|
| A yes-answer here implies belief in some sort of gnostic
| method of knowledge acquisition. Certainly that comes
| with a high burden of proof!
| dawidloubser wrote:
| Yes
| cluckindan wrote:
| Can you elaborate on what you mean by that, and prove it?
|
| https://journals.sagepub.com/doi/10.1177/09637214251336212
| sophrosyne42 wrote:
| Yes. Humans can perform abduction, extrapolating given
| information to new information. LLMs cannot, they can
| only interpolate new data based on existing data.
| IshKebab wrote:
| They definitely don't _completely fail_ to generalise. You
| can easily prove that by asking them something completely
| novel.
|
| Do you mean that LLMs might display a similar tendency to
| modify popular concepts? If so that definitely might be the
| case and would be fairly easy to test.
|
| Something like "tell me the lord's prayer but it's our
| mother instead of our father", or maybe "write a haiku but
| with 5 syllables on every line"?
|
| Let me try those ... nah ChatGPT nailed them both. Feels
| like it's particular to image generation.
| immibis wrote:
| They used to do poorly with modified riddles, but I
| assume those have been added to their training data now
| (https://huggingface.co/datasets/marcodsn/altered-riddles
| ?)
|
| Like, the response to "... The surgeon (who is male and
| is the boy's father) says: I can't operate on this boy!
| He's my son! How is this possible?" used to be "The
| surgeon is the boy's mother"
|
| The response to "... At each door is a guard, each of
| which always lies. What question should I ask to decide
| which door to choose?" would be an explanation of how
| asking the guard what the other guard would say would
| tell you the opposite of which door you should go
| through.
| phire wrote:
| Most image models are diffusion models, not LLMs, and have
| a bunch of other idiosyncrasies.
|
| So I suspect it's more that lessons from diffusion image
| models don't carry over to text LLMs.
|
| And the image models which are based on multimodal LLMs
| (like Nano Banana) seem to do a lot better at novel
| concepts.
| CobrastanJorji wrote:
| Also, they're fundamentally bad at math. They can draw a
| clock because they've seen clocks, but going further requires
| some calculations they can't do.
|
| For example, try asking Nano Banana to do something simpler,
| like "draw a picture of 13 circles." It likely will not work.
| IAmGraydon wrote:
| That's because they literally cannot do that. Doing what you're
| asking requires an understanding of why the numbers on the
| clock face are where they are and what it would mean if there
| was an extra hour on the clock (ie that you would have to
| divide 360 by 13 to begin to understand where the numbers would
| go). AI models have no concept of anything that's not included
| in their training data. Yet people continue to anthropomorphize
| this technology and are surprised when it becomes obvious that
| it's not actually thinking.
| bobbylarrybobby wrote:
| It's interesting because if you asked them to write code to
| generate an SVG of a clock, they'd probably use a loop from 1
| to 12, using sin and cos of the angle (given by the loop
| index over 12 times 2pi) to place the numerals. They know how
| to do this, and so they basically understand the process that
| generates a clock face. And extrapolating from that to 13
| hours is trivial (for a human). So the fact that they can't
| do this extrapolation on their own is very odd.
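|
| For illustration, a minimal sketch of the kind of code I mean
| (my own sketch, not something a model wrote); the whole 12-to-13
| "extrapolation" is a single constant:
|
| import math
|
| HOURS = 13          # change one number; nothing else moves
| R = 100             # face radius
| marks = []
| for i in range(1, HOURS + 1):
|     # angle measured clockwise from 12 o'clock at the top
|     a = 2 * math.pi * i / HOURS
|     x = (R - 15) * math.sin(a)
|     y = -(R - 15) * math.cos(a)
|     marks.append(f'<text x="{x:.1f}" y="{y:.1f}" '
|                  f'text-anchor="middle">{i}</text>')
| svg = ('<svg xmlns="http://www.w3.org/2000/svg" '
|        'viewBox="-120 -120 240 240">'
|        f'<circle r="{R}" fill="white" stroke="black"/>'
|        + "".join(marks) + '</svg>')
| print(svg)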
| echelon wrote:
| gpt-image-1 and Google Imagen understand prompts, they just
| don't have training data to cover these use cases.
|
| gpt-image-1 and Imagen are wickedly smart.
|
| The new Nano Banana 2 that has been briefly teased around the
| internet can solve incredibly complicated differential
| equations on chalk boards with full proof of work.
| phkahler wrote:
| >> The new Nano Banana 2 that has been briefly teased
| around the internet can solve incredibly complicated
| differential equations on chalk boards with full proof of
| work.
|
| That's great, but I bet it can't tie its own shoes.
| esafak wrote:
| And a submarine can't swim. Big deal.
| echelon wrote:
| No, but I can get it to do a lot of work.
|
| It's a part of my daily tool box.
| energy123 wrote:
| The hope was for this understanding to emerge as the most
| efficient solution to the next-token prediction problem.
|
| Put another way, it was hoped that once the dataset got rich
| enough, developing this understanding is actually more
| efficient for the neural network than memorizing the training
| data.
|
| The useful question to ask, if you believe the hope is not
| bearing fruit, is _why_. Point specifically to the absent
| data or the flawed assumption being made.
|
| Or more realistically, put in the creative and difficult
| research work required to discover the answer to that
| question.
| ryandrake wrote:
| I wonder if you would have more success if you painstakingly
| described the shape and features of a clock in great detail
| but never used the words clock or time or anything that might
| give the AI the hint that they were supposed to output
| something like a clock.
| BrandoElFollito wrote:
| And this is a problem for me. I guess that it would work,
| but as soon as the word "clock" appears, gone is the
| request because a clock HAS.12.HOURS.
|
| I use this a lot in cybersecurity when I need to do
| something "illegal". I am refused help, until I say that I
| am doing research on cybersecurity. In that case no
| problem.
| Workaccount2 wrote:
| The problem is more likely the tokenization of images than
| anything. These models do their absolute worst when pictures
| are involved, but are seemingly miraculous at generalizing
| with just text.
| chemotaxis wrote:
| I wonder if it's because we mean different things by
| generalization.
|
| For text, "generalization" is still "generate text that
| conforms to all the usual rules of the language". For
| images of 13-hour clock faces, we're explicitly asking the
| LLM to violate the inferred rules of the universe.
|
| I think a good analogy would be asking an LLM to write in
| English, except the word "the" now means "purple". They
| will struggle to adhere to this prompt in a conversation.
| Workaccount2 wrote:
| That's true, but I think humans would stumble a lot too
| (try reading old printed text from the 18fh cenfury where
| fhey used "f" insfead of t in prinf, if's a real frick fo
| gef frough).
|
| However humans are pretty adept at discerning images,
| even ones outside the norm. I really think there is some
| kind of architectural block hampering transformers
| ability to really "see" images. For instance if you show
| any model a picture of a dog with 5 legs (a fifth leg
| photoshopped to its belly) they all say there are only 4
| legs. And will argue with you about it. Hell GPT-5 even
| wrote a leg detection script in python (impressive) which
| detected the 5 legs, and then it said the script was
| bugged, and modified the parameters until one of the legs
| wasn't detected, lol.
| onraglanroad wrote:
| An "f" never replaced a "t".
|
| You probably mean the "long s" that looks like an "f".
| godelski wrote:
| Yes, the problem is that these so called "world models" do
| not actually contain a model of the world, or any world
| echelon wrote:
| That's just a patch to the training data.
|
| Once companies see this starting to show up in the evals and
| criticisms, they'll go out of their way to fix it.
| rideontime wrote:
| What would the "patch" be? Manually create some images of
| 13-hour clocks and add them to the training data? How does
| that solution scale?
| godelski wrote:
| s/13/17/g ;)
| coffeecoders wrote:
| LLMs are terrible at out-of-distribution (OOD) tasks. You
| should use chain-of-thought suppression and give constraints
| explicitly.
|
| My prompt to Grok:
|
| ---
|
| Follow these rules exactly:
|
| - There are 13 hours, labeled 1-13.
|
| - There are 13 ticks.
|
| - The center of each number is at angle: index * (360/13)
|
| - Do not infer anything else.
|
| - Do not apply knowledge of normal clocks.
|
| Use the following variables:
|
| HOUR_COUNT = 13
|
| ANGLE_PER_HOUR = 360 / 13 // 27.692307deg
|
| Use index i ∈ [0..12] for hour marks:
|
| angle_i = i * ANGLE_PER_HOUR
|
| I want html/css (single file) of a 13-hour analog clock.
|
| ---
|
| Output from grok.
|
| https://jsfiddle.net/y9zukcnx/1/
| BrandoElFollito wrote:
| Well, that's cheating :) You asked it to generate code, which
| is ok because it does not represent a direct generated image
| of a clock.
|
| Can grok generate images? What would the result be?
|
| I will try your prompt on chatgpt and gemini
| BrandoElFollito wrote:
| Gemini failed miserably - a standard 12 hours clock
|
| Same for chatgpt
|
| And perplexity replaced 12 with 13
| dwringer wrote:
| > Please create a highly unusual 13-hour analog clock
| widget, synchronized to system time, with fully animated
| hands that move in real time, and not 12 but 13 hour
| markings - each will be spaced at not 5-minute intervals,
| but at 4-minute-37-second intervals. This makes room for
| all 13 hour markings. Please pay attention to the correct
| alignment of the 13 numbers and the 13 hour marks, as
| well as the alignment of the hands on the face.
|
| This gave me a correct clock face on Gemini - after the
| model spent _a lot_ of time thinking (and kind of
| thrashing in a loop for a while). The functionality isn't
| quite right, not that it entirely makes sense in the
| first place, but the face - at least in terms of the hour
| marks - looks OK to me.[0]
|
| [0] https://aistudio.google.com/app/prompts?state=%7B%22i
| ds%22:%...
| chemotaxis wrote:
| > Follow these rules exactly:
|
| "Here's the line-by-line specification of the program I need
| you to write. Write that program."
| signatoremo wrote:
| Can you write this program in any language?
| chemotaxis wrote:
| No, do I need to?
| bigfishrunning wrote:
| Yes.
| serf wrote:
| it's lazy to brush off the major advantages of a pseudocode-
| to-anylanguage transpiler as if it's somehow easy or
| commonplace.
| chiwilliams wrote:
| I'll also note that the output isn't quite right --- the top
| number should be 13 rather than 1!
| layer8 wrote:
| I mean, the specification for the hour marks (angle_i)
| starts with a mark at angle 0. It just followed that spec.
| ;)
| NooneAtAll3 wrote:
| close enough, but digit at the top should be the highest, not
| 1 :/
| BrandoElFollito wrote:
| This is really cool. I tried to prompt gemini but every time I
| got _the same picture_. I do not know how to share a session
| (like it is possible with Chatgpt) but the prompts were
|
| If a clock had 13 hours, what would be the angle between two of
| these 13 hours?
|
| Generate an image of such a clock
|
| No, I want the clock to have 13 distinct hours, with the angle
| between them as you calculated above
|
| This is the same image. There need to be 13 hour marks around
| the dial, evenly spaced
|
| ... And its last answer was
|
| You are absolutely right, my apologies. It seems I made an
| error and generated the same image again. I will correct that
| immediately.
|
| Here is an image of a clock face with 13 distinct hour marks,
| evenly spaced around the dial, reflecting the angle we
| calculated.
|
| And the very same clock, with 12 hours, and a 13th above the
| 12...
| ryandrake wrote:
| This is probably my biggest problem with AI tools, having
| played around with them more lately.
|
| "You're absolutely right! I made a mistake. I have now
| comprehensively solved this problem. Here is the corrected
| output: [totally incorrect output]."
|
| None of them ever seem to have the ability to say "I cannot
| seem to do this" or "I am uncertain if this is correct,
| confidence level 25%" The only time they will give up or
| refuse to do something is when they are deliberately
| programmed to censor for often dubious "AI safety" reasons.
| All other times, they come back again and again with extreme
| confidence as they totally produce garbage output.
| BrandoElFollito wrote:
| I agree, I see the same even in simple code where they will
| bend backwards apologizing and generate very similar crap.
|
| It is like they are sometimes stuck in a local energetic
| minimum and will just wobble around various similar (and
| incorrect) answers.
|
| What was annoying in my attempt above is that the picture
| was _identical_ for every attempt
| ryandrake wrote:
| These tools' attitude reminds me of an eager, but
| incompetent intern or a poorly trained administrative
| assistant, who works for a powerful CEO. All sycophancy,
| confidence and positive energy, but not really getting
| much done.
| SamBam wrote:
| The issue is that they always say "Here's the final,
| correct answer" before they've written the answer, so of
| course the LLM has no idea if it's going to be right
| before it starts, because it has no clue what it's going
| to say.
|
| I wonder how it would do if instead it were told "Do not
| tell me at the start that the solution is going to be
| correct. Instead, tell me the solution, and at the end
| tell me if you think it's correct or not."
|
| I have found that on certain logic puzzles that it simply
| cannot get right, it always tells me that it's _going_ to
| get it right "this last time," but if asked later it
| always recognizes its errors.
| int_19h wrote:
| Gemini specifically is actually kinda notorious for giving
| up.
|
| https://www.reddit.com/r/artificial/comments/1mp5mks/this_i
| s...
| notatoad wrote:
| you can click the share icon (the two-way branch icon, it
| doesn't look like apple's share icon) under the image it
| generates to share the conversation.
|
| i'm curious if the clock image it was giving you was the same
| one it was giving me
|
| https://gemini.google.com/share/780db71cfb73
| BrandoElFollito wrote:
| Thanks for the tip about sharing!
|
| No, my clock was an old style one, to be put on a shelf.
| But at least it had a "13" proudly right above the "12" :)
|
| This reminds me of my kids when they were in kindergarten and
| were bringing home their art that needed extra explanation
| to realize what it was. But they were very proud!
| deathanatos wrote:
| Generate an image of a clock face, but instead of the usual 12
| hour numbering, number it with 13 hours.
|
| Gemini, 2.5 Flash or "Nano Banana" or whatever we're calling it
| these days. https://imgur.com/a/1sSeFX7
|
| A normal (ish) 12h clock. It numbered it twice, in two
| concentric rings. The outer ring is normal, but the inner ring
| numbers the 4th hour as "IIII" (fine, and a thing that clocks
| do) and the 8th hour as "VIIII" (wtf).
| bar000n wrote:
| It should be pretty clear already that anything which is
| based on (limited to?) communicating in words/text can never
| grasp conceptual thinking.
|
| We have yet to design a language to cover that, and it might
| be just a donquijotism we're all diving into.
| rideontime wrote:
| Really? I can grasp the concept behind that command just
| fine.
| bayindirh wrote:
| > We have yet to design a language to cover that, and it
| might be just a donquijotism we're all diving into.
|
| We have a very comprehensive and precise spec for that [0].
|
| If you don't want to hop through the certificate warning,
| here's the transcript:
|
| - Some day, we won't even need coders any more. We'll be
| able to just write the specification and the program will
| write itself.
|
| - Oh wow, you're right! We'll be able to write a
| comprehensive and precise spec and bam, we won't need
| programmers any more.
|
| - Exactly
|
| - And do you know the industry term for a project
| specification that is comprehensive and precise enough to
| generate a program?
|
| - Uh... no...
|
| - Code, it's called code.
|
| [0]: https://www.commitstrip.com/en/2016/08/25/a-very-
| comprehensi...
| snickerbockers wrote:
| I've been thinking about that a lot too. Fundamentally
| it's just a different way of telling the computer what to
| do and if it seems like telling an llm to make a program
| is less work than writing it yourself then either your
| program is extremely trivial or there are dozens of
| redundant programs in the training set that are nearly
| identical.
|
| If you're actually doing real work you have nothing to
| fear from LLMs because any prompt which is specific
| enough to create a given computer program is going to be
| comparable in terms of complexity and effort to having
| done it yourself.
| Uehreka wrote:
| I don't think that's clear at all. In fact the proficiency
| of LLMs at a wide variety of tasks would seem to indicate
| that language is a highly efficient encoding of human
| thought, much moreso than people used to think.
| tsunamifury wrote:
| Yeah, it's amazing that the parent post literally
| misunderstands the fundamental realities of LLMs; the
| compression they reveal in linguistics, even if blurry, is
| incredible.
| XenophileJKO wrote:
| I mean, that's not really "true".
|
| https://claude.ai/public/artifacts/0f1b67b7-020c-46e9-9536-
| c...
| giancarlostoro wrote:
| Weird, I never tried that, I tried all the usual tricks that
| usually work including swearing at the model (this scarily
| works surprisingly well with LLMs) and nothing. I even tried to
| go the opposite direction, I want a 6 hour clock.
| usui wrote:
| I've been trying for the longest time and across models to
| generate pictures or cartoons of people with six fingers and
| now they won't do it. They always say they accomplished it, but
| the result always has 5 fingers. I hate being gaslit.
| andix wrote:
| I gave this "riddle" to various models:
|
| > The farmer and the goat are going to the river. They look
| into the sky and see three clouds shaped like: a wolf, a
| cabbage and a boat that can carry the farmer and one item. How
| can they safely cross the river?
|
| Most of them are just giving the result to the well known river
| crossing riddle. Some "feel" that something is off, but still
| have a hard time figuring out that wolf, boat and cabbage are
| just clouds.
| userbinator wrote:
| Basically a variation of
| https://en.wikipedia.org/wiki/Age_of_the_captain
| jampa wrote:
| There are few examples of this as well:
|
| https://www.reddit.com/r/singularity/comments/1fqjaxy/contex.
| ..
| andix wrote:
| It really shows how LLMs work. It's all about
| probabilities, and not about understanding. If something
| looks very similar to a well known problem, the llm has a
| hard time "seeing" contradictions. Even if it's
| really easy to notice for humans.
| Recursing wrote:
| Claude has no problem with this: https://imgur.com/a/ifSNOVU
|
| Maybe older models?
| andix wrote:
| Try to twist around words and phrases, at some point it
| might start to fail.
|
| I tried it again yesterday with GPT. GPT-5 manages quite
| well too in thinking mode, but starts cracking in instant
| mode. 4o completely failed.
|
| It's not that LLMs are unable to solve things like that at
| all, but it's really easy to find some variations that make
| them struggle really hard.
| chanux wrote:
| Ah! This is so sad. The manager types won't be able to add an
| hour (actually, two) to the day even with AI.
| edub wrote:
| I was able to have AI generate such an image, not by
| diffusion/autoregression but by having it write Python code
| to create the image.
|
| ChatGPT made a nice looking clock with matplotlib that had some
| bugs that it had to fix (hours were counter-clockwise). Gemini
| made correct code one-shot, it used Pillow instead of
| matplotlib, but it didn't look as nice.
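|
| The core of that kind of matplotlib script ends up as something
| like this (a minimal sketch of my own, not the models' actual
| output; flipping the sign on the x term below is the sort of
| slip that gives the counter-clockwise layout ChatGPT had to
| fix):
|
| import math
| import matplotlib.pyplot as plt
|
| HOURS = 13
| fig, ax = plt.subplots(figsize=(4, 4))
| ax.add_patch(plt.Circle((0, 0), 1.0, fill=False))
| for i in range(1, HOURS + 1):
|     a = 2 * math.pi * i / HOURS   # clockwise from the top
|     ax.text(0.85 * math.sin(a), 0.85 * math.cos(a), str(i),
|             ha="center", va="center")
| ax.set_xlim(-1.1, 1.1); ax.set_ylim(-1.1, 1.1)
| ax.set_aspect("equal"); ax.axis("off")
| plt.savefig("clock13.png")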
| nl wrote:
| I do playing card generation and almost all struggle beyond the
| "6 of X"
|
| My working theory is that they were trained really hard to
| generate 5 fingers on hands but their counting drops off
| quickly.
| abathologist wrote:
| This is great. If you think that the phenomenon of human-like text
| generation evinces human-like intelligence, then this should be
| taken to evince that the systems likely have dementia.
| https://en.wikipedia.org/wiki/Montreal_Cognitive_Assessment
| AIorNot wrote:
| Imagine if I asked you to draw, as pixels, and operate a clock
| via html, or to create a jpeg with a pencil and paper, and have
| it be accurate... I suspect your hand-coded work would be off by
| an order of magnitude in comparison.
| jonplackett wrote:
| kimi is kicking ass
| busymom0 wrote:
| Because a new clock is generated every minute, looks like simply
| changing the time by a digit causes the result to be
| significantly different from the previous iteration.
| shevy-java wrote:
| Now that is actually creative.
|
| Granted, it is not a clock - but it could be art. It looks like a
| Picasso. When he was drunk. And took some LSD.
| kburman wrote:
| These types of tests are fundamentally flawed. I was able to
| create a perfect clock using gemini 2.5 pro -
| https://gemini.google.com/share/136f07a0fa78
| sinak wrote:
| How are they flawed?
| earthnail wrote:
| The results are not reproducible, as evidenced by the parent
| poster.
| micromacrofoot wrote:
| isn't that kind of the point of non-determinism?
| earthnail wrote:
| No. Good nondeterministic models reproducibly generate
| equally desirable output - not identical output, but
| interchangeable.
| micromacrofoot wrote:
| oh I see, thank you for clarifying
| jmdeon wrote:
| Aren't they attempting to also display the current time though?
| Your share is a clock starting at midnight/noon. Kimi K2 seems
| to be the best on each refresh.
| Drew_ wrote:
| The website is regenerating the clocks every minute. When I
| opened it, Gemini 2.5 was the only working one. Now, they are
| all broken.
|
| Also, your example is not showing the current time.
| system2 wrote:
| It wouldn't be hard to tell it to pick up browser time as the
| default start point. Just a piece of prompt.
| allenu wrote:
| I don't think this is a serious test. It's just an art piece to
| contrast different LLMs taking on the same task, and against
| themselves since it updates every minute. One minute one of the
| results was really good for me and the next minute it was very,
| very bad.
| dwringer wrote:
| Even Gemini Flash did really well for me[0] using two prompts -
| the initial query and one to fix the only error I could
| identify.
|
| > Please generate an analog clock widget, synchronized to
| actual system time, with hands that update in real time and a
| second hand that ticks at least once per second. Make sure all
| the hour markings are visible and put some effort into making a
| modern, stylish clock face.
|
| Followed by:
|
| > Currently the hands are working perfectly but they're
| translated incorrectly, making them uncentered. Can you ensure
| that each one is translated to the correct position on the
| clock face?
|
| [0]
| https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
| lxe wrote:
| Honestly, I think if you track the performance of each over time,
| since these get regenerated once in a while, you can then have a
| very, very useful and cohesive benchmark.
| 1yvino wrote:
| i wonder if the qwen prompt would look like hallucination?
| fschuett wrote:
| Reminds me of this: https://www.youtube.com/watch?v=OGbhJjXl9Rk
| S0y wrote:
| To be fair, this is a deceptively hard task.
| bobbylarrybobby wrote:
| Without AI assistance, this should take ~10-15 minutes for a
| human. Maybe add 5 minutes if you're not allowed to use d3.
| alexmorley wrote:
| It's just html/css so no js at all let alone d3.
| postalrat wrote:
| What's your hourly rate? I'll pay you to make as many as you
| can in a few hours if you share the video.
| Mashimo wrote:
| I would not even know how to draw a circle with CSS to be
| honest.
| Bolwin wrote:
| Pretty sure css has a sin() fn, that's half your work
| zkmon wrote:
| Was Claude banned from this Olympics?
| giancarlostoro wrote:
| Haiku is the lightweight Claude model, I'm not sure why they
| picked the weaker model.
| collimarco wrote:
| In any case those clocks are all extremely inaccurate, even if AI
| could build a decent UI (which is not the case).
|
| Some months ago I published this site for fun:
| https://timeutc.com There's a lot of code involved to make it
| precise to the ms, including adjusting based on network delay,
| frame refresh rate instead of using setTimeout and much more. If
| you are curious take a look at the source code.
| mstipetic wrote:
| GPT-5 is embarrassing itself. Kimi and DeepSeek are very
| consistently good. Wild that you can just download these models.
| shubham_zingle wrote:
| not sure about the accuracy though, although shooting in the dark
| awkwam wrote:
| Limiting the model to only use 2000 tokens while also asking it
| to output ONLY HTML/CSS is just stupid. It's like asking a
| programmer to perform the same task while removing half their
| brain and making them forget their programming experience. This is
| a stupid and meaningless benchmark.
| system2 wrote:
| Ask Claude or ChatGPT to write it in Python, and you will see
| what they are capable of. HTML + CSS has never been the strong
| suit of any of these models.
| camalouu wrote:
| Claude generates some js/css stuff even when i don't ask for
| it. I think Claude itself at least believes he is good at this.
| munro wrote:
| Amazing. Some people who use LLMs for soft outcomes are so
| enamored with them, and disagree with me when I say be careful,
| they're not perfect -- this is such a great non-technical way to
| explain the reality I'm seeing when using them on hard-outcome
| coding/logic tasks. "Hey this test is failing", _LLM deletes
| test_, "FIXED!"
| worldsayshi wrote:
| Yeah it seems crazy to use LLM on any task where the output
| can't be easily verified.
| palmotea wrote:
| > Yeah it seems crazy to use LLM on any task where the output
| can't be easily verified.
|
| I disagree, those tasks are _perfect_ for LLMs, since a bug
| you can't verify isn't a problem when vibecoding.
| mopsi wrote:
| > "Hey this test is failing", LLM deletes test, "FIXED!"
|
| A nice continuation of the tradition of folk stories about
| supernatural entities like teapots or lamps that grant wishes
| and take them literally. "And that's why, kids, you should
| always review your AI-assisted commits."
| derbOac wrote:
| Something that struck me when I was looking at the clocks is
| that we _know_ what a clock is supposed to look and act like.
|
| What about when we don't know what it's supposed to look like?
|
| Lately I've been wrestling with the fact that unlike, say, a
| generalized linear model fit to data with some inferential
| theory, we don't have a theory or model for the uncertainty
| about LLM products. We recognize when it's off about things we
| know are off, but don't have a way to estimate when it's off
| other than to check it against reality, which is probably the
| exception to how it's used rather than the rule.
| ehnto wrote:
| I need to be delicate with wording here, but this is why it's
| a worry that all the least intelligent people you know could
| be using AI.
|
| It's why non-coders think it's doing an amazing job at
| software.
|
| But it's worryingly why using it for research, where you
| necessarily don't know what you don't know, is going to trip
| up even smarter people.
| markatkinson wrote:
| To be fair I'd probably also delete the test.
| novemp wrote:
| Oh cool, it's the schizophrenia clock-drawing test but for AI.
| otterley wrote:
| Watching this over the past few minutes, it looks like Kimi K2
| generates the best clock face most consistently. I'd never heard
| of that model before today!
|
| Qwen 2.5's clocks, on the other hand, look like they never make
| it out of the womb.
| bArray wrote:
| It could be that the prompt is accidentally (or purposefully)
| more optimised for Kimi K2, or that Kimi K2 is better trained
| on this particular data. LLMs need "prompt engineers" for a
| reason to get the most out of a particular model.
| energy123 wrote:
| Goes to show the "frontier" is not really one frontier. It's
| a social/mathematical construct that's useful for a broad
| comparison, but if you have a niche task, there's no
| substitute for trying the different models.
| observationist wrote:
| It's not fair to use prompts tailored to a particular model
| when doing comparisons like this - one shot results that
| generalize across a domain demonstrate solid knowledge of the
| domain. You can use prompting and context hacking to get any
| particular model to behave pseudo-competently in almost any
| domain, even the tiny <1B models, for some set of questions.
| You could include an entire framework and model for rendering
| clocks and times that allowed all 9 models to perform fairly
| well.
|
| This experiment, however, clearly states the goal with this
| prompt: `Create HTML/CSS of an analog clock showing ${time}.
| Include numbers (or numerals) if you wish, and have a CSS
| animated second hand. Make it responsive and use a white
| background. Return ONLY the HTML/CSS code with no markdown
| formatting.`
|
| An LLM should be able to interpret that, and should be able
| to perform a wide range of tasks in that same style -
| countdown timers, clocks, calendars, floating quote bubble
| cycling through list of 100 pithy quotations, etc.
| Individual, clearly defined elements should have complex
| representations in latent space that correspond to the human
| understanding of those elements. Tasks and operations and
| goals should likewise align with our understanding. Qwen 2.5
| and some others clearly aren't modeling clocks very well, or
| maybe the html/css rendering latents are broken. If you pick
| a semantic axis (like analog clocks), you can run a suite of
| tests to demonstrate their understanding by using limited
| one-shot interactions.
|
| Reasoning models can adapt on the fly, and are capable of
| cheating - one shots might have crappy representations for
| some contexts, but after a lot of repetition and refinement,
| as long as there's a stable, well represented proxy for
| quality somewhere in the semantics it understands, it can
| deconstruct a task to fundamentals and eventually reach high
| quality output.
|
| These types of tests also allow us to identify mode collapses
| - you can use complex sophisticated prompting to get most
| image models to produce accurate analog clocks displaying any
| time, but in the simple one shot tests, the models tend to
| only be able to produce the time 10:10, and you'll get wild
| artifacts and distortions if you try to force any other
| configuration of hands.
|
| Image models are so bad at hands that they couldn't even get
| clock hands right, until recently anyway. Nano banana and
| some other models are much better at avoiding mode collapses,
| and can traverse complex and sophisticated compositions
| smoothly. You want that same sort of semantic generalization
| in text generating models, so hopefully some of the
| techniques cross over to other modalities.
|
| I keep hoping they'll be able to use SAE or some form of
| analysis on static weight distributions in order to uncover
| some sort of structural feature of mode collapse, with a
| taxonomy of different failure modes and causes, like limited
| data, or corrupt/poisoned data, and so on. Seems like if you
| had that, you could deliberately iterate on, correct issues,
| or generate supporting training material to offset big
| distortions in a model.
| jquery wrote:
| Qwen 2.5 is so bad it's good. Some really insane results if
| you watch it for a while. Almost like it's taking the piss.
| bigfishrunning wrote:
| How much engineering do prompt engineers do? Is it
| engineering when you add "photorealistic. correct number of
| fingers and teeth. High quality." to the end of a prompt?
|
| we should call them "prompt witch doctors" or maybe "prompt
| alchemists".
| Dilettante_ wrote:
| "How is engineering a real science? You just build the
| bridge so it doesn't fall down."
| vohk wrote:
| Nah.
|
| Actual engineers have professional standards bodies and
| legal liability when they shirk and the bridge falls down
| or the plane crashes or your wiring starts on fire.
|
| Software "engineers" are none of those things but can at
| least emulate the approaches and strive for
| reproducibility and testability. Skilled craftsmen; not
| engineers.
|
| Prompt "engineers" is yet another few steps down the
| ladder, working out mostly by feel what magic words best
| tickle each model, and generally with no understanding of
| what's actually going on under the hood. Closer to a chef
| coming up with new meals for a restaurant than anything
| resembling engineering.
|
| The battle on the use of language around engineer has
| long been lost but applying it to the subjective creative
| exercise of writing prompts is just more job title
| inflation. Something doesn't need to be engineering to be
| a legitimate job.
| Dilettante_ wrote:
| > _The battle on the use of language around engineer has long
| been lost_
|
|
| That's really the core of the issue: We're just having
| the age-old battle of prescriptivism vs descriptivism
| again. An "engineer", etymologically, is basically just
| "a person who comes up with stuff", one who is
| "ingenious". I'm tempted to say it's _you
| prescriptivists_ who are making a "battle" out of this.
|
| > _subjective creative exercise of writing prompts_
|
| Implying that there are no testable results, no objective
| success or failure states? Come on man.
| jahewson wrote:
| Engineers use their ingenuity. That's it.
|
| If physical engineers understood everything then
| standards would not have changed in many decades. Safety
| factors would be mostly unnecessary. Clearly not the
| case.
| skeeter2020 wrote:
| >> Engineers use their ingenuity. That's it.
|
| If this was enough all novel creation would be
| engineering and that's clearly not true. Engineering
| attempts to discover & understand consistent outcomes
| when a myriad of variables are altered, and the
| boundaries where the variables exceed a model's
| predictive powers - then add buffer for the unknown.
| Manipulating prompts (and much of software development)
| attempts to control the model to limit the number of
| variables to obtain some form of useful abstraction.
| Physical engineering can't do this.
| BoorishBears wrote:
| I like that actually, I've spent the last year probably
| 60:40 between post-training and prompt engineering/witch
| doctoring (the two go together more than most people
| realize)
|
| Some of it is engineering-like, but I've also picked up a
| sixth sense when modifying prompts about what parts are
| affecting the behavior I want to modify for certain models,
| and that feels very witch doctory!
|
| The more engineering-like part is essentially trying to RE
| a black box model's post-training, but that goes over some
| people's heads so I'm happy to help keep the "it's just
| voodoo and guessing" narrative going instead :)
| lanstin wrote:
| I think the coherence behind prompt engineering is not in
| the literal meanings of the words but finding the
| vocabulary used by the sources that have your solution.
| Ask questions like a high school math student and you get
| elementary words back. Ask questions in the lingo of a
| Linux bigot and you will get good awk scripts back. Use
| academic maths language and arXiv answers will be
| produced.
| scrollop wrote:
| "...and do it really well or my grandmother will be killed
| by her kidnappers! And I'll give you a tip of 2 billion
| dollars!!! Hurry, they're coming!"
| carterschonwald wrote:
| I've heard this actually works annoyingly well
| DrewADesign wrote:
| We've created technology so sophisticated it is
| vulnerable to social engineering attacks.
| skeeter2020 wrote:
| this has worked - and continues to do so - very well to
| escape guard rails. If a direct appeal doesn't work you
| can then talk them around with only a handful of prompts.
| carterschonwald wrote:
| Also the amount of adjacent remarks being always topical
| flavor confusion is cartoonish. I'm playing with ideas for
| making that better
| DrewADesign wrote:
| You're absolutely right! People should pay attention to
| this broadly applicable and important consideration.
| manmal wrote:
| Adding this to my snippets.
| WJW wrote:
| Well if it works consistently, I don't see any problem with
| that. If they have a clear theory of when to add
| "photorealistic" and when to add "correct number of wheels
| on the bus" to get the output they want, it's engineering.
| If they don't have a (falsifiable) theory, it's probably
| not engineering.
|
| Of course, the service they really provide is for
| businesses to feel they "do AI", and whether or not they do
| real engineering is as relevant as if your favorite
| pornstars' boobs are real or not.
| jahewson wrote:
| Maybe we could keep the conversation out of the gutter.
| rrr_oh_man wrote:
| Porn is taxable income, not the gutter.
| jrflowers wrote:
| You don't really see much porn in the gutters these days
| with the decline in popularity of print publishing. It's
| almost all online now
| leptons wrote:
| >as relevant as if your favorite pornstars' boobs are
| real or not
|
| This matters more than you might think.
| tomrod wrote:
| It could be bioengineering if you add that to a clock
| prompt then connect it to a CRISPR process for outputting
| DNA.
|
| Horrifying prospect, tbh
| int_19h wrote:
| I write quite a lot of prompts, and the closest analogy
| that I can think of is a shaman trying to appease the
| spirits.
| minikomi wrote:
| I find it a surprisingly similar mindset to songwriting,
| a lot of local maxima searching and spaghetti flinging.
| Sometimes you hit a good groove and explore it.
| skeeter2020 wrote:
| It might be even more ridiculous to make this something
| akin to art over engineering.
| davidsainez wrote:
| Sure, we are still closer to alchemy than materials
| science, but it's still early days. But consider this
| blogpost that was on the front page today:
| https://www.levs.fyi/blog/2-years-of-ml-vs-1-month-of-
| prompt.... The table on the bottom shows a generally steady
| increase in performance just by iterating on prompts. It
| feels like we are on the path to true engineering.
| raddan wrote:
| Engineers usually have at least some sense as to why
| their efforts work though. Does anybody who iterates on
| prompts have even the fuzziest idea why they work? Or
| what the improvement might be? I do not.
| skeeter2020 wrote:
| If there is ANY relationship to engineering here maybe
| it's like reverse engineering a bios in a clean room,
| where you poke away and see what happens. The missing part
| is the use of anything resembling the scientific method
| in terms of hypothesis, experiment design, observation
| guiding actions, etc and the deep knowledge that will
| allow you to understand WHY something might be happening
| based on the inputs. "Prompt Engineering" seems about as
| close to this as probing for land mines in a battlefield,
| only with no experience and your eyes closed.
| tamimio wrote:
| > we should call them "prompt witch doctors" or maybe
| "prompt alchemists".
|
| Oh absolutely not! Only in engineering are you allowed to
| get called an engineer for no apparent reason; do that in
| other white-collar professions and you are behind bars for
| fraudulent claims.
| skeeter2020 wrote:
| we used to just call them "good at googling". I've never
| met a self-described prompt engineer who had anything close
| to engineering education and experience. Seems like an
| extension on the 6-week boot camp == software engineer
| trend.
| woodson wrote:
| Just use something like DSPy/Ax and optimize your module for
| any given LLM (based on sample data and metrics) and you're
| mostly good. No need to manually wordsmith prompts.
| andix wrote:
| I think the selection of models is a bit off. Haiku instead
| of Sonnet for example. Kimi K2's capabilities are closer to
| Sonnet than to Haiku. GPT-5 might be in the non-reasoning
| mode, which routes to a smaller model.
| ceroxylon wrote:
| I had my suspicions about the GPT-5 routing as well. When I
| first looked at it, the clock was by far the best; after
| the minute went by and everything refreshed, the next three
| were some of the worst of the group. I was wondering if it
| just hit a lucky path in routing the first time.
| frizlab wrote:
| I knew of Kimi K2 because it's the model used by Kagi to
| generate the AI answers when a query ends with a question
| mark.
| OJFord wrote:
| It's also one of the few 'recommended' models in Kagi
| Assistant (multi-model ChatGPT basically, available on paid
| plans).
| Bolwin wrote:
| Really? They must've switched recently cause that was around
| before kimi came out
| frizlab wrote:
| Yes, this is recent. Before it was other model(s), not sure
| which.
| abixb wrote:
| >Qwen 2.5's clocks, on the other hand, look like they never
| make it out of the womb.
|
| More like fell headfirst into the ground.
|
| I'm disappointed with Gemini 2.5 (not sure Pro or Flash) --
| I've personally had _fantastic_ results with Gemini 2.5 Pro
| building PWA, especially since the May 2025 "coding update."
| [0]
|
| [0] https://blog.google/products/gemini/gemini-2-5-pro-updates/
| jquery wrote:
| I've been using Kimi K2 a lot this month. Gives me
| Japanese->English translations at near human levels of quality,
| while respecting rules and context I give it in a very long,
| multi-page system prompt to improve fidelity of translation for
| a given translation target (sometimes markup tags need to be
| preserved, sometimes deleted, etc.). It doesn't require a
| thinking step to generate this level of translation quality,
| making it suitable for real-time translation. It doesn't start
| getting confused when I feed it a couple dozen lines of
| previous translation context, like certain other LLMs do...
| instead the translation actually improves with more context
| instead of degrading. It's never refused a translation for
| "safety" purposes either (GPT and Gemini love to interrupt my
| novels and tell me certain behavior is illegal or immoral, and
| censor various anatomical words).
| komali2 wrote:
| > GPT and Gemini love to interrupt my novels and tell me
| certain behavior is illegal or immoral, and censor various
| anatomical words
|
| Lol, are you using ai to create fan translations of ero
| manhua?
| jquery wrote:
| soreHe nokotokaQuan Ran wakaran...Rong Tan dayo.
| meinhabiziyuarunoberutoranobe, tamanierow
| kbar13 wrote:
| i noticed the second hand is off tho. gemini has the most
| accurate one.
| buffaloPizzaBoy wrote:
| Right as you said that, I checked kimi k2's "clock" and it was
| just the ascii art: ¯\\_(ツ)_/¯
|
| I wonder if that is some type of fallback for errors querying
| the model, or k2 actually created the html/css to display that.
| basch wrote:
| my GPT-4o was 100% perfect on the first click. Since then,
| garbage. Gemini 2.5 perfect on the 3rd click.
| paulddraper wrote:
| Kimi K2 is legitimately good.
| stogot wrote:
| When I clicked, everything was garbage except Grok and
| DeepSeek. kimi was the worst clock
| frankfrank13 wrote:
| I find that Kimi K2 _looks_ the best, but I've noticed the
| time is often wrong!
| Mistletoe wrote:
| Qwen's clocks are highly entertaining. Like if you asked an
| alien "make me a clock".
| dilap wrote:
| I'm a huge K2 fan, it has a personality that feels very
| distinct from other models (not sycophantic at all), and is
| quite smart. Also pretty good at creative writing (tho not 100%
| slop free).
|
| K2 hosted on groq is pretty crazy for intelligence/second. (Low
| rate limits still, tho.)
| nightpool wrote:
| It would be cool to also AI generate the favicon using some
| sort of image model.
| oaktowner wrote:
| Perhaps Qwen 2.5 should be known as Dali 2!?
| wowczarek wrote:
| Interestingly, either I'm _hallucinating_ this, or DeepSeek
| started to consistently show a clock without failures and with
| good time, where it previously didn't. ...aaand as I was typing
| this, it barfed a train wreck. Never mind, move along... No,
| wait, it's good again, no, wait...
| earth2mars wrote:
| https://gemini.google.com/share/00967146a995 works perfectly fine
| with gemini 2.5 pro
| lanewinfield wrote:
| nice. I restrict to 2000 tokens for mine, how many was that?
| esafak wrote:
| how do you do that?
| earth2mars wrote:
| I used exactly the same prompt this site uses. Nothing else.
| agildehaus wrote:
| I'm assuming the "Gemini 2.5" referenced on this site is
| Flash, not Pro. Pro is insane, and 3.0 is just around the
| corner.
| lanewinfield wrote:
| hi, I made this. thank you for posting.
|
| I love clocks and I love finding the edges of what any given
| technology is capable of.
|
| I've watched this for many hours: Kimi frequently produces the
| most accurate clock but also has the least variation and is the
| most boring. Qwen is oftentimes the most insane and makes me
| laugh. Which one is "better"?
| anigbrowl wrote:
| I really like this. The broken ones are sometimes just
| failures, but sometimes provide intriguing new design ideas.
| jdiff wrote:
| This same principle is why my favorite image generation models
| are the earlier ones from 2019-2020, where they could only
| reliably generate soup. It's like Rorschach tests, it's not
| about what's there, it's about what you see in them. I don't
| want a bot to make art for me, sometimes I just want some
| shroom-induced inspirational smears.
| nemomarx wrote:
| I really miss that DeepDream aesthetic with the dog eyes
| popping up everywhere.
| csours wrote:
| LOVE IT!
|
| It would be really cool if I could zoom out and have everything
| scale properly!
| Fabricio20 wrote:
| Why is this different per user? I sent this to a few friends
| and they all see different things from what I'm seeing, for the
| same time...?
| samtheprogram wrote:
| It regenerates on page load. I find that pretty useful.
|
| Grok 4 and Kimi nailed it the first time for me, then only
| Kimi on the second pass.
| malfist wrote:
| Not on page load, it regenerates every minute. There's a
| little hovering question mark in the top right that
| explains things, including the prompt to the models.
| layer8 wrote:
| It's different per minute, not per user.
| bspammer wrote:
| If you're keeping all the generated clocks in a database, I'd
| love to see a Facemash style spin-off website where users pick
| the best clock between two options, with a leaderboard. I want
| to know what the best clock Qwen ever made was!
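|
| A leaderboard like that is basically one Elo update per vote. A
| minimal sketch (the K-factor and starting ratings are arbitrary,
| and none of this is from the site's code):
|
|     // One "which clock is better?" vote between two models.
|     function eloUpdate(winner, loser, k = 32) {
|       const expected = 1 / (1 + 10 ** ((loser - winner) / 400));
|       const delta = k * (1 - expected);
|       return [winner + delta, loser - delta];
|     }
|
|     let ratings = { kimi: 1000, qwen: 1000 };
|     // a user preferred Kimi's clock over Qwen's
|     [ratings.kimi, ratings.qwen] =
|       eloUpdate(ratings.kimi, ratings.qwen);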
| nightpool wrote:
| Yes! Please do this
| abixb wrote:
| We might be on to creating a new crowd-ranked LLM benchmark
| here.
| addandsubtract wrote:
| A pelican wearing a working watch
| danw1979 wrote:
| Using it to time a bicycle race?
| layer8 wrote:
| Not the best, but the most amusing.
| ks2048 wrote:
| Nice job! Maybe let users click an example to see the raw
| source (LLM output)
| chemotaxis wrote:
| This is honestly the best thing I've seen on HN this month.
| It's stupid, enlightening... funny and profound at the same
| time. I have a strong temptation to pick some of these designs
| and build them in real life.
|
| I applaud you for spending money to get it done.
| hakcermani wrote:
| Would you mind sharing the prompt... in a gist perhaps?
| ceroxylon wrote:
| They have it available on the site under the (?) button:
|
| "Create HTML/CSS of an analog clock showing ${time}. Include
| numbers (or numerals) if you wish, and have a CSS animated
| second hand. Make it responsive and use a white background.
| Return ONLY the HTML/CSS code with no markdown formatting."
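|
| For scale, a response that satisfies this prompt can be quite
| small. A minimal sketch of the kind of markup a model might
| return (hand angles are hard-coded here for 10:09 purely as an
| illustration; this is not any model's actual output):
|
|     <div class="clock">
|       <div class="hand hour"></div>
|       <div class="hand minute"></div>
|       <div class="hand second"></div>
|     </div>
|     <style>
|       body { background: #fff; }
|       .clock { position: relative; margin: auto;
|                width: min(80vw, 80vh); aspect-ratio: 1;
|                border: 4px solid #222; border-radius: 50%; }
|       .hand  { position: absolute; left: 50%; bottom: 50%;
|                background: #222;
|                transform-origin: bottom center; }
|       /* 10:09 -> hour 10*30 + 9*0.5 = 304.5deg, minute 9*6 = 54deg */
|       .hour   { width: 6px; margin-left: -3px; height: 25%;
|                 transform: rotate(304.5deg); }
|       .minute { width: 4px; margin-left: -2px; height: 35%;
|                 transform: rotate(54deg); }
|       .second { width: 2px; margin-left: -1px; height: 42%;
|                 background: #c00;
|                 animation: sweep 60s linear infinite; }
|       @keyframes sweep {
|         from { transform: rotate(0deg); }
|         to   { transform: rotate(360deg); }
|       }
|     </style>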
| smusamashah wrote:
| Please make it show the last 5 (or some other number of) clocks
| for each model. It would be nice to see the deviation and variety
| for each model at a glance.
| jdietrich wrote:
| Clock drawing is widely used as a test for assessing dementia.
| Sometimes the LLMs fail in ways that are fairly predictable if
| you're familiar with CSS and typical shortcomings of LLMs, but
| sometimes they fail in ways that are less obvious from a
| technical perspective but are _exactly the same_ failure modes
| as cognitively-impaired humans.
|
| I think you might have stumbled upon something surprisingly
| profound.
|
| https://www.psychdb.com/cognitive-testing/clock-drawing-test
| TheJoeMan wrote:
| Figure 6 with the square clock would be a cool modern art
| piece.
| xrisk wrote:
| Maybe explainable via the fact that these tests are part of
| the LLM training set?
| jorgesborges wrote:
| Conceptual deficit is a great failure mode description. The
| inability to retrieve "meaning" about the clock -- having
| some understanding about its shape and function but not its
| intent to convey time to us -- is familiar from a lot of bad
| LLM output.
| overfeed wrote:
| > Clock drawing is widely used as a test for assessing
| dementia
|
| Interestingly, clocks are also an easy tell for when you're
| dreaming, if you're a lucid dreamer; they never work normally
| in dreams.
| ghurtado wrote:
| In lucid dreams there's a whole category of things like
| this: reading a paragraph of text, looking at a clock
| (digital or analog), or working any kind of technology more
| complex than a calculator.
|
| For me personally, even light switches have been a huge
| tell in the past, so basically almost anything electrical.
|
| I've always held the utterly unscientific position that
| this is because the brain only has enough GPU cycles to
| show you an approximation of what the dream world looks
| like, but to actually run a whole simulation behind the
| scenes would require more FLOPs than it has available.
| After all, the brain also needs to run the "player"
| threads: It's already super busy.
|
| Stretching the analogy past the point of absurdity, this is
| a bit like modern video game optimizations: the mountains
| in the distance are just a painting on a surface, and the
| remote on that couch is just a messy blur of pixels when
| you look at it up close.
|
| So the dreaming brain is like a very clever video game
| developer, I guess.
| tablatom wrote:
| Wait, lucid dreamers need tells to know where they are?!?
| Kiro wrote:
| Yes, that's how you enter the lucid state. You find ways
| to tell that you're dreaming and condition yourself to
| check for those while awake. Eventually you will do it
| inside a dream and realize that you're dreaming.
| Kiboneu wrote:
| Yeah. It's very common to notice anomalies inside of a
| dream. But the anomalies weave into the dream and feel
| normal. You don't have much agency to enter a lucid state
| from a pre-lucid dream.
|
| So the idea is to develop habits called "reality checks"
| when you are awake. You look for the broken clock kind of
| anomalies that the grandparent comment mentioned. You
| have to be open to the possibility of dreaming, which is
| hard to do.
|
| Consider this difficulty. Are you dreaming?
|
| ...
|
| ...
|
| How much time did it take to think "no"? Or did you even
| take this question seriously? Maybe because you are
| reading an HN comment about lucid dreams, that question is
| interpreted as an example instead of a genuine question
| worth investigating, right? That's the difficulty. Try it
| again.
|
| The key is that the habit you're developing isn't just
| the check itself -- it's the thinking that you have
| during the check, which should lead you to investigate.
|
| You do these checks frequently enough you end up doing it
| in a dream. Boom.
|
| There's also an aspect of identifying recurring patterns
| during prelucidity. That's why it helps to keep a dream
| journal for your non-lucid dreams.
|
| There are other methods too.
| david-gpu wrote:
| Plenty of folks out there know when they are dreaming
| just like they know when they are awake. It varies from
| person to person.
| DuperPower wrote:
| Be careful, as adding consciousness to a dream costs CPU
| cycles, so you wake up more tired. It's cool for kids and
| teens, but grown adults shouldn't explore this, to avoid
| bad rest.
| travisjungroth wrote:
| That's a caution against getting addicted to it, not against
| ever doing it. I've had powerful experiences in lucid dreaming
| that I wouldn't trade for a little more rest. I was
| already in a retreat where I was basically resting all
| the time.
| conradev wrote:
| I met someone once who claimed that he lucid dreams
| almost every night by default and it is exhausting. He
| smokes weed at night to avoid dreaming entirely. I didn't
| dig in super deep, but it sounded pretty intense!
| david-gpu wrote:
| IMO they would benefit from skipping the weed and instead
| continue to practice lucid dreaming. Over time they will
| develop their skill and will learn to simply contemplate
| the dream without reacting to it. It is a calming
| experience.
| david-gpu wrote:
| Over time, with accumulated experience, all dreams are
| lucid from the start. Because of that they are very calm
| and pleasant; the dreamer is no longer reactive to what
| happens in the dream because they know nothing is at
| stake.
| lordnacho wrote:
| Didn't you ever watch Inception? You have to carry around
| a little spinning top to test which level of VM you're
| inside of.
| conradev wrote:
| The first time it happened to me, it was accidental. I
| dreamed that I was in a college classroom but I realized
| that I never went to college. I was not trying to and had
| never lucid dreamed before, and so it was very
| surprising.
| BoredomIsFun wrote:
| My brain learned how to maintain legible text in dreams, so
| I cannot use that as a tell for lucid dreaming anymore...
| danw1979 wrote:
| For me it's phones... specifically dialling a number
| manually. No matter how carefully I dial, the number on the
| screen is rarely correct.
| allarm wrote:
| It seems that I've been stuck in a lucid dream for a
| couple of decades; no matter how carefully I write text on
| a phone keyboard, it never comes out as intended.
| luckman212 wrote:
| Tank ypu foe wriiting this
| amelius wrote:
| Whenever I dial a number while in a dream, the person I'm
| trying to call always turns out to be right next to me.
| biztos wrote:
| Do they look normal but just not work normally?
|
| Maybe reality is a world of broken clocks, and they only
| "work" in the simulation.
| ACCount37 wrote:
| LLMs don't do this because they have "people with dementia
| draw clocks that way" in their data. They do it because
| they're similar enough to human minds in function that they
| often fail in similar ways.
|
| An amusing pattern that dates back to "1kg of steel is
| heavier of course" in GPT-3.5.
| kaffekaka wrote:
| How do you _know_ this?
|
| Obviously, humans failing in these ways ARE in the training
| set. So it should definitely affect LLM output.
| ACCount37 wrote:
| First: generalization. The failure modes extend to unseen
| tasks. That specific way to fail at "1kg of steel" sure
| was in the training data, but novel closed set logic
| puzzles couldn't have been. They display similar
| failures. The same "vibe-based reasoning" process of
| "steel has heavy vibes, feather has light vibes, thus,
| steel is heavier" produces other similar failures.
|
| Second: the failures go away with capability (raw scale,
| reasoning training, test-time compute), on seen and
| unseen tasks both. Which is a strong hint that the model
| was truly failing, rather than being capable of doing a
| task but choosing to faithfully imitate a human failure
| instead.
|
| I don't think the influence of human failures in the
| training data on the LLMs is nil, but it's not just a
| surface-level failure repetition behavior.
| BHSPitMonkey wrote:
| I would think the way humans draw clocks has more in common
| with image generation models (which probably do a bit better
| with this task overall) than a language model producing SVG
| markup, though.
| charliewallace wrote:
| Very cool! I also love clocks, especially weird ones, and
| recently put up this 3D Moebius Strip clock, hope you like it:
| https://www.mobiusclock.com
| AnonHP wrote:
| Could you please change and adjust the positions of the titles
| (like GPT 5)? On Firefox Focus on iOS, the spacing is
| inconsistent (seems like it moves due to the space taken by the
| clock). After one or two of them, I had to scroll all the way
| down to the bottom and come back up to understand which title
| is linked to which clock.
| brianjking wrote:
| This is an awesome benchmark. Officially one of my favorites
| now. Thank you for making this.
| ryandrake wrote:
| I've been struggling all week trying to get Claude Code to write
| code to produce visual (not the usual, verifiable, text on a
| terminal) output in the form of a SDL_GPU rendered scene
| consisting of the usual things like shaders, pipelines, buffers,
| textures and samplers, vertex and index data and so on, and boy
| it just doesn't seem to know what it's doing. Despite providing
| paragraphs-long, detailed prompts. Despite describing each
| uniform and each matrix that needs to be sent. Despite giving it
| extremely detailed guidance about what order things need to be
| done in. It would have been faster for me to just write the code
| myself.
|
| When it fails a couple of times it will try to put logging in
| place and then confidently tell me things like "The vertex data
| has been sent to the renderer, therefore the output is correct!"
| When I suggest it take a screenshot of the output each time to
| verify correctness, it does, and then declares victory over an
| entirely incorrect screenshot. When I suggest it write unit
| tests, it does so, but the tests are worthless and only test
| that the incorrect code it wrote is always incorrect in the same
| ways.
|
| When it fails even more times, it will get into what I like
| to call "intern engineer" mode where it just tries random things
| that I know are not going to work. And if I let it keep going, it
| will end up modifying the entire source tree with random "try
| this" crap. And each iteration, it confidently tells me:
| "Perfect! I have found the root cause! It is [garbage bullshit].
| I have corrected it and the code is now completely working!"
|
| These tools are cute, but they still have a long way to go
| before they are actually useful for anything more than trivial
| toy projects.
| fancy_pantser wrote:
| Have you given using MCPs to provide documentation and examples
| a shot? I always have to bring in docs since I don't work in
| Python and TS+React (which it seems more capable at) and force
| it to review those in addition to any specification. e.g.
| Context7
| ryandrake wrote:
| Haven't looked into MCPs yet. Thanks for the suggestion!
| rossant wrote:
| Have you tried OpenAI Codex with GPT5.1? I'm using it for
| similar GPU rendering stuff and it appears to do an excellent
| job.
| jamilton wrote:
| I know this has been said many times before, but I wonder why
| this is such a common outcome. Maybe from negative outcomes
| being underrepresented in the training data? Maybe that plus
| being something slightly niche and complex?
|
| The screenshot method not working is unsurprising to me; VLMs'
| visual reasoning is very bad with details because (as far as I
| understand) they do not really have access to those details,
| just the image embedding and maybe an OCR'd transcript.
| poszlem wrote:
| I'm not sure if it's just me, but I've also noticed Claude
| becoming even more lazy. For example, I've asked it several
| times to fix my tests. It'll fix four or five of them, then
| start struggling with the next couple, and suddenly declare
| something like: "All done, fixed 5 out of 10 tests. I can't fix
| the remaining ones", followed by a long, convoluted explanation
| about why that's actually a good thing.
| __MatrixMan__ wrote:
| I don't know if it has gotten worse, but I definitely find
| Claude is way too eager to celebrate success when it has done
| nothing.
|
| It's annoying but I prefer it to how Gemini gets depressed if
| it takes a few tries to make progress. Like, thanks for not
| gaslighting me, but now I'm feeling sorry for a big pile of
| numbers, which was not a stated goal in my prompt.
| paxys wrote:
| Something I'm not able to wrap my head around is that Kimi K2 is
| the only model that produces a ticking second hand on every
| attempt while the rest of them are always moving continuously.
| What fundamental differences in model training or implementation
| can result in this disparity? Or was this use case programmed in
| K2 after the fact?
| aavshr wrote:
| Just curious, why not the Sonnet models? In my personal
| experience, Anthropic's Sonnet models are the best when it comes
| to things like this!
| xyproto wrote:
| Try adding to the prompt that it has a PhD in Computer Science
| and has many methods for dealing with complexity.
|
| This gives better results, at least for me.
| bigfishrunning wrote:
| Why does that give better results? Is this phenomenon
| measurable? How would "you have a PhD in computer science"
| change its ability to interpret prose? Every interaction with
| an LLM seems like superstition.
| bpt3 wrote:
| It's wild how much the output varies for the same model for each
| run.
|
| I'm not sure if this was the intent or not, but it sure
| highlights how unreliable LLMs are.
| eastbound wrote:
| Security-wise, this is a website that takes the straight output
| of AI and serves it for execution on their website.
|
| I know, developers do the same, but at least they check it in Git
| to notice their mistakes. Here is an opportunity for the AI to
| pop a Google authentication prompt on you, or anything else.
| bongodongobob wrote:
| Weird. Sonnet 4.5 one shotted it with:
|
| Create an interactive artifact of an analog clock face that keeps
| time properly.
|
| https://claude.ai/public/artifacts/75daae76-3621-4c47-a684-d...
| amelius wrote:
| Maybe they can ask Sora to make variations of:
|
| https://slate.com/human-interest/2016/07/martin-baas-giant-r...
| whimsicalism wrote:
| Kimi K2 is obviously the best, but gpt-5 has the most gorgeous
| ones when it works
| orly01 wrote:
| What does it mean that each model is allowed 2000 tokens to
| generate its clock?
| jcmontx wrote:
| Grok is impressive, I should give it a shot
| Waterluvian wrote:
| How do they do time without JavaScript? Is there an API I'm not
| aware of?
| bloppe wrote:
| CSS animation. It's not the real time. Just a hypothetical
| time.
| Waterluvian wrote:
| I'm imagining some must be using JS because I'm seeing
| (rarely...) times that are perfectly correct.
| bloppe wrote:
| Actually you're right. If you view source, you can see
| `const response = await
| fetch(`/api/clocks?time=${encodeURIComponent(localTime)}`);`.
| I'm not sure how that API works,
| but it's definitely reading the current time using JS, then
| somehow embedding it in the HTML / CSS of each LLM.
| vultour wrote:
| It's crafted with a prompt that gives the AI the current
| time, then it simply refreshes every minute so the seconds
| start at zero correctly.
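|
| A minimal sketch of that client-side loop (the names and the
| response shape are guesses, not the site's actual code):
|
|     // Once a minute: ask the backend for freshly generated
|     // clocks, passing the browser's local time, then drop each
|     // model's HTML/CSS into its own sandboxed iframe.
|     async function refreshClocks() {
|       const localTime = new Date().toLocaleTimeString([],
|         { hour: 'numeric', minute: '2-digit' });
|       const res = await fetch(
|         `/api/clocks?time=${encodeURIComponent(localTime)}`);
|       const clocks = await res.json(); // assumed: { model: html }
|       for (const [model, html] of Object.entries(clocks)) {
|         const frame = document.querySelector(
|           `iframe[data-model="${model}"]`);
|         if (frame) frame.srcdoc = html;
|       }
|     }
|     refreshClocks();
|     setInterval(refreshClocks, 60_000);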
| bhandziuk wrote:
| Looks like CSS keyframes.
| ssl-3 wrote:
| This really needs to be an xscreensaver hack.
| nasir wrote:
| Where's Opus/Sonnet? Very curious about that!
| ticulatedspline wrote:
| This is cool, interesting to see how consistent some models are
| (both in success and failure)
|
| I tried gpt-oss-20b (my go-to local) and it looks ok though not
| very accurate. It decided to omit numbers. It also took 4500
| tokens while thinking.
|
| I'd be interested in seeing it with some more token leeway as
| well as comparing two or more similar prompts, like using
| "current time" instead of "${time}" and being more prescriptive
| about including numbers
| porphyra wrote:
| LLMs can't "look" at the rendered HTML output to see if what they
| generated makes sense or not. But there ought to be a way to do
| that, right? To let the model iterate until what it generates
| looks right.
|
| Currently, at work, I'm using Cursor for something that has an
| OpenGL visualization program. It's incredibly frustrating trying
| to describe bugs to the AI because it is completely blind. Like I
| just wanna tell it "there's no line connecting these two points
| but there ought to be one!" or "your polygon is obviously
| malformed as it is missing a bunch of points and intersects
| itself" but it's impossible. I end up having to make the AI add
| debug prints to, say, print out the position of each vertex, in
| order to convince it that it has a bug. Very high friction and
| annoying!!!
| TheKidCoder wrote:
| Kinda. Hand-waving over the question of whether an LLM can
| really "look", you can connect Cursor to a Puppeteer MCP server,
| which will allow it to iterate with "eyes" by using Puppeteer to
| screenshot its own output. Still has issues, but it does
| solve really silly mistakes often simply by having this MCP
| available.
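|
| Stripped of the MCP plumbing, the core of that loop is just
| Puppeteer rendering the generated file and saving a screenshot
| the model can be shown on its next turn. A rough sketch (paths
| and viewport size are arbitrary):
|
|     // npm install puppeteer
|     const puppeteer = require('puppeteer');
|
|     async function screenshotClock(htmlPath, outPath) {
|       const browser = await puppeteer.launch();
|       const page = await browser.newPage();
|       await page.setViewport({ width: 600, height: 600 });
|       await page.goto(`file://${htmlPath}`,
|         { waitUntil: 'networkidle0' });
|       await page.screenshot({ path: outPath });
|       await browser.close();
|     }
|
|     screenshotClock('/tmp/clock.html', '/tmp/clock.png')
|       .catch(console.error);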
| firtoz wrote:
| Cursor has this with their "browser" function for web dev,
| quite useful
|
| You can also give it an MCP setup that can send a screenshot
| to the conversation, though I'm unsure if anyone has made an
| easy enough "take a screenshot of a specific window id" kind of
| MCP, so it may need to be built first.
|
| I guess you could also ask it to build that mcp for you...
| fragmede wrote:
| Claude totally can, same with ChatGPT. Upload a picture to
| either one of them via the app and tell it there's no line
| where there should be. There's some plumbing involved to get it
| to work in Claude Code or Codex, but yes, computers can "see".
| If you have lm-server, there's tons of non-text models you can
| point your code at.
| pil0u wrote:
| I had some success providing screenshots to Cursor directly. It
| worked well for web UIs as well as generated graphs in Python.
| It makes them a bit less blind, though I feel more iterations
| are required.
| EMM_386 wrote:
| You can absolutely do this. In fact, with Claude, Anthropic
| encourages you to send it screenshots. It works very well if
| you aren't expecting pixel-perfection.
|
| YMMV with other models but Sonnet 4.5 is good with things like
| this - writing the code, "seeing" the output and then iterating
| on it.
| kwanbix wrote:
| What a waste of energy.
| mandolingual wrote:
| Always interesting/uncanny when AI is tested with human cognitive
| tests https://www.psychdb.com/cognitive-testing/clock-drawing-
| test.
| hansmayer wrote:
| Very funny. It seems Qwen generates the funniest outputs :)
| csours wrote:
| Oh, Qwen, buddy, you sure are TRYING
| Imanari wrote:
| Qwen's clocks are hilarious
| cornonthecobra wrote:
| I like Deepseek v3.1's idea of radially-aligning each hour
| number's y-axis ("1" is rotated 30deg from vertical, "2" at
| 60deg, etc.). It would be even better if the numbers were rotated
| anticlockwise.
|
| I'm not sure what Qwen 2.5 is doing, but I've seen similar in
| contemporary art galleries.
| gloosx wrote:
| Anyone tried opening this on mobile? Not a single clock renders
| correctly; it almost looks like a joke on LLMs.
| rtcode_io wrote:
| See https://clock.rt.ht/::code
|
| AI-optimized <analog-clock>!
|
| People expect perfection on the first attempt. This took a brief
| joint session:
|
| HI: define the custom element API design (attribute/property
| behavior) and the CSS parts
|
| AI: draw the rest of the f... owl
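|
| For the curious, a bare-bones <analog-clock> custom element can
| be sketched in a couple dozen lines. This is a hypothetical
| outline, not the element served at clock.rt.ht:
|
|     class AnalogClock extends HTMLElement {
|       connectedCallback() {
|         this.attachShadow({ mode: 'open' }).innerHTML = `
|           <style>
|             :host { position: relative; display: inline-block;
|                     width: 200px; aspect-ratio: 1;
|                     border: 3px solid #222; border-radius: 50%; }
|             .hand { position: absolute; left: calc(50% - 1px);
|                     bottom: 50%; width: 2px; background: #222;
|                     transform-origin: bottom center; }
|           </style>
|           <div class="hand" id="h" style="height:25%"></div>
|           <div class="hand" id="m" style="height:35%"></div>
|           <div class="hand" id="s"
|                style="height:42%;background:#c00"></div>`;
|         this.tick();
|         this.timer = setInterval(() => this.tick(), 1000);
|       }
|       disconnectedCallback() { clearInterval(this.timer); }
|       tick() {
|         // Rotate each hand to match the wall clock.
|         const now = new Date();
|         const set = (id, deg) => this.shadowRoot
|           .getElementById(id).style.transform = `rotate(${deg}deg)`;
|         set('s', now.getSeconds() * 6);
|         set('m', now.getMinutes() * 6 + now.getSeconds() * 0.1);
|         set('h', (now.getHours() % 12) * 30 + now.getMinutes() * 0.5);
|       }
|     }
|     customElements.define('analog-clock', AnalogClock);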
| speedgoose wrote:
| This is a white page, am I missing something?
| DeathArrow wrote:
| How can DeepSeek and Kimi get it right while Haiku, Gemini and
| GPT are making a mess?
| 0xCE0 wrote:
| Seems like Will's clock drawing test in Hannibal :)
| gwbas1c wrote:
| Reminds me of the Alzheimer's "draw a clock" test.
|
| Makes me think that LLMs are like people with dementia! Perhaps
| it's the best way to relate to an LLM?
| hollow-moe wrote:
| Obviously they're all broken on Firefox; no one uses Firefox
| anyway.
| kylecazar wrote:
| Non-determinism at its finest. The clock is perfect, the refresh
| happens, the clock looks like a Dali painting.
| jeremycarter wrote:
| Last year I wrote a simple system using Semantic Kernel, backed
| by functions inside Microsoft Orleans, which was, for the most
| part, an LLM-driven business-logic DSL processor. Your business
| logic was just text, and you gave it the operation as text.
|
| Nothing could be relied upon to be deterministic, it was so
| funny to see it try to do operations.
|
| Recently I re-ran it with newer models and it was drastically
| better, especially with temperature tweaks.
| __fst__ wrote:
| This is why we need terawatt DCs, to generate code for world
| clocks every minute.
| teaearlgraycold wrote:
| Qwen 2.5 doing a surprisingly good job (as of right now).
| maxdo wrote:
| The selection of Western models is weird: no GPT-5.1 or Opus
| 4.1 (which nailed it perfectly in something I quickly tested).
| Bengalilol wrote:
| Qwen doesn't care about clocks, it goes the Dali way, without
| melting.
|
| It even made a Nietzsche clock (I saw one <body> </body> which
| was surprisingly empty).
|
| It definitely wins the creative award.
| HarHarVeryFunny wrote:
| Looks like we've got a new Turing test here: "draw me a clock"
| bitwize wrote:
| I'm reminded of the "draw a clock" test neurologists use to
| screen for dementia and brain damage.
| accrual wrote:
| I love that GPT-5 is putting the clock hands way outside the
| frame and is just generally a mess. Maybe we'll look back on
| these mistakes just like watching kids grow up and fumble basic
| tasks. Humorous in its own unique way.
| palmotea wrote:
| > Maybe we'll look back on these hilarious mistakes just like
| watching kids grow up and fumble basic tasks.
|
| Or regret: "why didn't we stop it when we could?"
| anon_cow1111 wrote:
| I'm having a hard time believing this site is honest, especially
| with how ridiculous the scaling and rotation of numbers is for
| most of them. I dumped his prompt into ChatGPT to try it myself
| and it did create a very neat clock face with the numbers in the
| correct positions and an animated second hand; it just got the
| exact time wrong, being a few hours off.
|
| Edit: the time may actually have been perfect now that I account
| for my ISP's geo-located time zone.
| perfmode wrote:
| I read that the OP limited the output to 2000 tokens.
| lanewinfield wrote:
| ^ this! there's a lot of clocks to generate so I've
| challenged it to stick to a small(er) amount of code
| anon_cow1111 wrote:
| I got a ~1600-character reply from GPT, including spaces, and
| it worked on the first shot when dumped into an HTML doc. I think
| that probably fits OK in the limit? (If I missed something obvious
| feel free to tell me I'm an idiot)
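|
| (Rough arithmetic: at the usual ~4 characters per token, 1600
| characters is on the order of 400 tokens, so it should fit
| comfortably under a 2000-token cap.)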
| Springtime wrote:
| On the second minute I had the AI World Clocks site open,
| the GPT-5-generated version displayed a perfect clock. Its
| clock before and every clock from it since have had very
| apparent issues though.
|
| If you could get a perfect clock several times for the
| identical prompt in fresh contexts with the same model then
| it'd be a better comparison. Potentially, though, the ChatGPT
| site you're using is doing some adjustments that the API-fed
| version isn't.
| Zopieux wrote:
| On the contrary, in my experience this is very typical of the
| average failure mode / output of early-2025 LLMs for HTML or
| SVG.
| ada1981 wrote:
| Sonnet 4.5 did this easily
| https://claude.ai/public/artifacts/c1bb5d57-573b-49e0-9539-7...
| edfletcher_t137 wrote:
| The lack of Claude is a glaring oversight given how popular it
| is as an agentic coding model...
| chaosprint wrote:
| This is such a great idea! Surprisingly, Kimi K2 is the only
| one without any obvious problems. And it's not even the full
| K2 thinking version? This made me reread this article from a few
| days ago:
|
| https://entropytown.com/articles/2025-11-07-kimi-k2-thinking...
| esotericwarfare wrote:
| This is an ad for Kimi K2.
| miohtama wrote:
| The new Turing time test
| bigbluedots wrote:
| I just realized I'm running late, it's almost -2!
|
| More seriously, I'd love to see how the models perform the same
| task with a larger token allowance.
| bigbluedots wrote:
| Is there a "draw a pelican riding a bicycle" version?
| padolsey wrote:
| We've done this!
| https://weval.org/analysis/visual__pelican/f141a8500de7f37f/...
| anonzzzies wrote:
| Sonnet 4.5 does it flawlessly. Tried 8 times.
| fnord77 wrote:
| whatever model Cursor uses was telling me the date was March 12,
| 2023
| imchillyb wrote:
| I love Qwen; it tries so hard with its little paddle and never
| gets anywhere.
| cyberjill wrote:
| 666
| wanderingmind wrote:
| The more I look at it, the more I realise the reason for
| cognitive overload I feel when using LLMs for coding. Same prompt
| to the same model for a pretty straightforward task produces such
| wildly different outputs. Now, imagine how wildly different the
| code outputs when trying to generate two different logical
| functions. The casings are different, commenting is different, no
| semantic continuity. Now maybe if I give detailed prompts and ask
| it to follow, it might follow, but in my experience prompt
| adherence is not that great either. I am at the stage where I
| just use LLMs as autocorrect, rather than using them for any
| generation.
| bwhiting2356 wrote:
| You should render it, show an image to the model and allow it to
| iterate. No person has to one-shot code without seeing what it
| looks like.
| wewtyflakes wrote:
| It is funny to see the performance improve across many of the
| models, somewhat miraculously, throughout the day today.
| stym06 wrote:
| If a human had done this, these would be at a museum
| woopwoop wrote:
| The Qwen clocks are art.
| josfredo wrote:
| Watching these gives me a strong feeling of unease. Art-wise, it
| is a very beautiful project.
| 3oil3 wrote:
| I wonder which model will silently be updated and suddenly start
| drawing clocks with Audemars Piguet-level complications.
| jsmo wrote:
| lol
| shahzaibmushtaq wrote:
| Interesting idea!
|
| Why is a new clock being rendered every minute? Or are AI
| models evolving and improving every minute?
| Vera_Wilde wrote:
| It's really beautiful! Super clean UI.
|
| The thing I always want from timezone tools is: "Let me simulate
| a date after one side has shifted but the other hasn't."
|
| Humans do badly with DST offset transitions; computers do great
| with them.
| JamesAdir wrote:
| I believe that in a day or two, the companies will address this
| and it will be solved for that use case.
| surfingdino wrote:
| What a wonderfully visual example of the crap LLMs turn
| everything into. I am eagerly awaiting the collapse of the LLM
| bubble. JetBrains added this crap to their otherwise fine series
| of IDEs and now I have to keep removing randomly inserted import
| statements and keep fixing hallucinated names of functions
| suggested instead of the names of functions that I have already
| defined in the same file. Lack of determinism where we expect it
| (most of the things we do, tbh) is creating more problems than it
| is solving.
| anotheryou wrote:
| Claude Sonnet 4.5 with a little thinking:
| https://imgur.com/a/zcJOnKy
|
| no thinking: better clock but not current time (the prompt is
| confusing here though): https://imgur.com/a/kRK3Q18
| themgt wrote:
| Just saw Gemini 2.5 with a little thinking:
| https://imgur.com/a/nypRD7x
| arendtio wrote:
| Pretty cool already!
|
| I use 'Sonnet 4.5 thinking' and 'Composer 1' (Cursor) the most,
| so it would be interesting to see how such SOTA models perform in
| this task.
| boxedemp wrote:
| That's super neat. I'll keep checking back to this site as new
| models are released. It's an interesting benchmark.
| baidoct wrote:
| GPT-5 looks broken
| Zeraous wrote:
| How Kimi is better than other billion-dollar companies' models
| is really fun.
| warpspin wrote:
| Lol. This is supposed to replace me at my job already?
|
| Great experiment!
| adriatp wrote:
| deepseek representing
| RugnirViking wrote:
| What's going on with Kimi K2 being so reasonable/unique in so
| many of these benchmarks I've seen recently? I will have to try
| it out further. Is it any good at programming?
| Bolwin wrote:
| Yes, it trades blows with GLM for the best open-source model.
| adi_kurian wrote:
| I think this is just prompt engineering, tbh. One-shotted Haiku 3.5
| (https://claude.ai/share/66c17968-485e-4d15-974b-4f6958e1e2fd)
| decent looking too.
|
| Got it to work on GPT-3.5 Turbo with a modified prompt (albeit
| not as good
| - https://pastebin.com/gjEVSEcJ)
|
| `single html file, working analog clock showing current time,
| numbers positioned (aligned) correctly via trig calc (dynamic),
| all three hands, second hand ticks, 400px, clean AF aesthetic
| R/Greenberg Associates circa 2017. empathy, hci, define > design
| > implement.`
___________________________________________________________________
(page generated 2025-11-15 23:01 UTC)