[HN Gopher] AI World Clocks
       ___________________________________________________________________
        
       AI World Clocks
        
       "Every minute, a new clock is rendered by nine different AI
       models."
        
       Author : waxpancake
       Score  : 1283 points
       Date   : 2025-11-14 18:35 UTC (1 day ago)
        
 (HTM) web link (clocks.brianmoore.com)
 (TXT) w3m dump (clocks.brianmoore.com)
        
       | kfarr wrote:
       | Add some voting and you got yourself an AI World Clock arena!
       | https://artificialanalysis.ai/image/arena
        
         | BrandoElFollito wrote:
         | Thank you very much.... It was a fun game until I got to the
         | prompt
         | 
         | Place a baby elephant in the green chair
         | 
         | I cannot unsee what I saw and it is 21:30 here so I have an
         | hour or so to eliminate the picture from my mind or I will have
         | nightmares.
        
       | syx wrote:
        | I'm very curious about the monthly bill for such a creative
        | project; surely some of these are pre-rendered?
        
         | coffeecoders wrote:
         | Napkin math:
         | 
         | 9 AIs x 43,200 minutes = 388,800 requests/month
         | 
          | 388,800 requests x 200 tokens = 77,760,000 tokens/month ≈ 78M
          | tokens
         | 
         | Cost varies from 10 cents to $1 per 1M tokens.
         | 
         | Using the mid-price, the cost is around $50/month.
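          | 
          | (The same napkin math as a quick Python sketch; the 200-token
          | count and the prices are the rough assumptions above, not
          | measured values:)
          | 
          |     requests = 9 * 60 * 24 * 30        # 9 models, every minute: 388,800/mo
          |     tokens = requests * 200            # ~200 output tokens per clock
          |     price_per_m = 0.55                 # midpoint of $0.10..$1 per 1M tokens
          |     cost = tokens / 1e6 * price_per_m  # ~$43/month
          |     print(f"{tokens / 1e6:.1f}M tokens, ~${cost:.0f}/month")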
         | 
         | ---
         | 
         | Hopefully, the OP has this endpoint protected -
         | https://clocks.brianmoore.com/api/clocks?time=11:19AM
        
           | whimsicalism wrote:
            | I think it is cached at the minute level; responses cannot
            | be that fast.
        
       | ugh123 wrote:
        | Cool, and marginally informative on the current state of things,
       | but kind of a waste of energy given everything is re-done every
       | minute to compare. We'd probably only need a handful of each to
       | see the meaningful differences.
        
         | whoisjuan wrote:
         | It's actually quite fascinating if you watch it for 5 minutes.
         | Some models are overall bad, but others nail it in one minute
         | and butcher it in the next.
         | 
         | It's perhaps the best example I have seen of model drift driven
         | by just small, seemingly unimportant changes to the prompt.
        
           | alister wrote:
           | > _model drift driven by just small, seemingly unimportant
           | changes to the prompt_
           | 
           | What changes to the prompt are you referring to?
           | 
            | According to the comment on the site, the prompt is the
           | following:
           | 
            |  _Create HTML/CSS of an analog clock showing ${time}.
           | Include numbers (or numerals) if you wish, and have a CSS
           | animated second hand. Make it responsive and use a white
           | background. Return ONLY the HTML/CSS code with no markdown
           | formatting._
           | 
           | The prompt doesn't seem to change.
        
             | sambaumann wrote:
             | presumably the time is replaced with the actual current
             | time at each generation. I wonder if they are actually
             | generated every minute or if all 6480 permutations (720
             | minutes in a day * 9 llms) were generated and just show on
             | a schedule
        
             | whoisjuan wrote:
             | The time given to the model. So the difference between two
             | generations is just somethng trivially different like:
             | "12:35" vs 12:36"
        
           | moffkalast wrote:
           | Kimi seems the only reliable one which is a bit surprising,
           | and GPT 4o is consistently better than GPT 5 which on the
           | other hand is unfortunately not surprising at all.
        
           | nbaugh1 wrote:
           | It is really interesting to watch them for a while. QWEN
           | keeps outputting some really abstract interpretations of a
           | clock, KIMI is consistently very good, GPT5's results line up
           | exactly with my experience with its code output (overly
           | complex and never working correctly)
        
           | bglusman wrote:
           | We can't know how much is about the prompt though and how
           | much is just stochastic randomness in the behavior of that
            | model on that prompt, right? I mean, even given identical
            | prompts, even at temp 0, models don't always behave
            | identically... at least, as far as I know? Some of the
            | reasons why are, I think, still a research question, but I
            | think it's a fact nonetheless.
        
         | ascorbic wrote:
         | The energy usage is minuscule.
        
           | jdiff wrote:
           | It's wasteful. If someone built a clock out of 47
           | microservices that called out to 193 APIs to check the
           | current time, location, time zone, and preferred display
           | format we'd rightfully criticize it for similar reasons.
           | 
           | In a world where Javascript and Electron are still getting
           | (again, rightfully) skewered for inefficiency despite often
           | exceeding the performance of many compiled languages, we
           | should not dismiss the discussion around efficiency so
           | easily.
        
             | Arisaka1 wrote:
              | What I find amusing about this argument is that no one
              | ever brought up power savings when e.g. using "let me
              | google that for you" instead of giving someone the answer
              | to their question, because we saw the utility of teaching
              | others how to Google. But apparently we can't see the
              | utility of measuring the oversold competence of current AI
              | models, given a sufficiently large sample size.
        
             | saulpw wrote:
             | Let's do some math.
             | 
              | 60x24x30 = 43,200 ~ 43k AI calls per month per model.
              | Let's suppose there are 1000 output tokens (might it be
              | 10k tokens? Seems like a lot for this task). So ~43M
              | tokens per model.
             | 
              | The price for 1M output tokens[0] ranges from $0.10
              | (qwen-2.5) to $60 (GPT-4). So ~$4/mo for the cheapest, and
              | ~$2.6k/mo for the most expensive.
             | 
             | So this might cost several thousand dollars a month?
             | Something smells funny. But you're right, throttling it to
             | once an hour would achieve a similar goal and likely cost
             | less than $100/mo (which is still more than I would spend
             | on a project like this).
             | 
             | [0] https://pricepertoken.com/
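              | 
              | (A Python sketch of that range, per model, using the
              | 1000-token and price assumptions above:)
              | 
              |     calls = 60 * 24 * 30            # 43,200 calls/month per model
              |     tokens_m = calls * 1000 / 1e6   # ~43M output tokens
              |     for name, price in [("qwen-2.5", 0.10), ("GPT-4", 60.0)]:
              |         print(name, f"~${tokens_m * price:,.0f}/mo")
              |     # qwen-2.5 ~$4/mo, GPT-4 ~$2,592/mo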
        
               | qwe----3 wrote:
                | They use 4o (maybe a mini version?)
        
             | berkes wrote:
             | Yes it is wasteful.
             | 
              | But I presume you light up Christmas lights in December,
              | drive to the theater to watch a movie or fire up a campfire
              | on holiday. That too is "wasteful": it's not needed, and
              | other, far more efficient ways exist to achieve the same.
              | And in absolute numbers, it's far more energy intensive
              | than running an LLM to create 9 clocks every minute. We do
              | things to learn, have fun, be weird, make art, or just
              | spend time.
             | 
             | Now, if Rolex starts building watches by running an LLM to
             | drive its production machines or if we replace millions of
             | wall clocks with ones that "Run an LLM every second", then
             | sure, the waste is an actual problem.
             | 
              | The point I'm trying to make is that it's OK to consider
              | or debate the energy use of LLMs compared to alternatives.
              | But bringing up that debate in a context where someone is
              | being creative, or having a fun time, is not, IMO. Because
              | a lot of "fun" activities use a lot of energy, and that
              | too isn't automatically "wasteful".
        
           | ugh123 wrote:
           | Hmm, curious. How did you come up with that?
        
         | energy123 wrote:
         | I sort of assumed they cached like 30 inferences and just
         | repeat them, but maybe I'm being too cynical.
        
       | PeterStuer wrote:
        | Why? This is diagonal to how LLMs work, and trivially solved by
        | a minimal hybrid front/sub system.
        
         | em3rgent0rdr wrote:
         | To gauge.
        
         | bayindirh wrote:
          | Because LLMs are touted to be the silver bullet of silver
          | bullets. Built upon the world's knowledge, and with the
          | capacity to call upon updated information with agents, they
          | ought to have rivaled the top programmers 3 days ago.
        
           | awkwam wrote:
            | They might be touted like that, but it seems like you don't
            | understand how they work. The example in the article shows
            | that the prompt is limiting the LLM by giving it access to
            | only 2000 tokens and also saying "ONLY OUTPUT ...". This is
            | like me asking you to solve the same problem but forcing you
            | to deactivate half of your brain and forget any programming
            | experience you have. It's just stupid.
        
             | bayindirh wrote:
             | > like you don't understand how they work.
             | 
             | I would not make such assumptions.
             | 
             | > The example in the article shows that the prompt is
             | limiting the LLM by giving it access to only 2000 tokens
             | and also saying "ONLY OUTPUT ..."
             | 
              | The site is pretty simple; the method is pretty
              | straightforward. If you believe this is unfair, you can
              | always build one yourself.
             | 
             | > It's just stupid.
             | 
             | No, it's a great way of testing things within constraints.
        
       | em3rgent0rdr wrote:
       | Most look like they were done by a beginner programmer on crack,
       | but every once in a while a correct one appears.
        
         | morkalork wrote:
         | I'd say more like a blind programmer in the early stages of
         | dementia. Able to write code, unable to form a mental image of
         | what it would render as and can't see the final result.
        
         | pixl97 wrote:
         | DeepSeek and Kimi seem to have correct ones most of the time
         | I've looked.
        
           | em3rgent0rdr wrote:
           | yes, and sometimes Grok.
        
             | pixl97 wrote:
             | The hour hand commonly seems off on Grok.
        
           | BrandoElFollito wrote:
           | DeepSeek told me that it cannot generate pictures and
           | suggested code (which is very different)
        
         | shafoshaf wrote:
         | It's interesting how drawing a clock is one of the primary
         | signals for dementia. https://www.verywellhealth.com/the-clock-
         | drawing-test-98619
        
           | BrandoElFollito wrote:
           | This is very interesting, thank you.
           | 
            | I could not get to the story because of the cookie banner
            | that does not work (at least on mobile Chrome and FF). The
            | Internet Archive page: https://archive.ph/qz4ep
           | 
           | I wonder how this test could be modified for people that have
           | neurological problems - my father's hands shake a lot but I
           | would like to try the test on him (I do not have suspicions,
           | just curious).
           | 
           | I passed it :)
        
           | technothrasher wrote:
           | "One variation of the test is to provide the person with a
           | blank piece of paper and ask them to draw a clock showing 10
           | minutes after 11. The word "hands" is not used to avoid
           | giving clues."
           | 
           | Hmm, ambiguity. I would be the smart ass that drew a digital
           | clock for them, or a shaku-dokei.
        
         | energy123 wrote:
         | If they can identify which one is correct, then it's the same
         | as always being correct, just with an expensive compute budget.
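          | 
          | (A minimal best-of-n sketch of that idea; generate and
          | is_correct are whatever model call and verifier you have,
          | passed in as stand-ins:)
          | 
          |     def best_of_n(prompt, generate, is_correct, n=9):
          |         # n times the compute; "always correct" only insofar as
          |         # the verifier can pick out a correct candidate
          |         candidates = [generate(prompt) for _ in range(n)]
          |         return next((c for c in candidates if is_correct(c)), None)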
        
       | larodi wrote:
        | Would be great to also see the prompt this was done with.
        
         | creade wrote:
          | The "?" icon has "Create HTML/CSS of an analog clock showing ${time}.
         | Include numbers (or numerals) if you wish, and have a CSS
         | animated second hand. Make it responsive and use a white
         | background. Return ONLY the HTML/CSS code with no markdown
         | formatting."
        
       | bananatron wrote:
       | grok's looks like one of those clocks you'd find at a novelty
       | shop
        
       | AlfredBarnes wrote:
        | It's cool to see them get it right... sometimes.
        
       | zkmon wrote:
        | Why are DeepSeek and Kimi beating the other models by such a
        | margin? Is this to do with their specialization for this task?
        
       | baltimore wrote:
       | Since the first (good) image generation models became available,
       | I've been trying to get them to generate an image of a clock with
       | 13 instead of the usual 12 hour divisions. I have not been
       | successful. Usually they will just replace the "12" with a "13"
       | and/or mess up the clock face in some other way.
       | 
       | I'd be interested if anyone else is successful. Share how you did
       | it!
        
         | snek_case wrote:
         | From my experience they quickly fail to understand anything
         | beyond a superficial description of the image you want.
        
           | atorodius wrote:
           | That's less and less true
           | 
           | https://minimaxir.com/2025/11/nano-banana-prompts/
        
             | dang wrote:
             | Related ongoing thread:
             | 
             |  _Nano Banana can be prompt engineered for nuanced AI image
             | generation_ - https://news.ycombinator.com/item?id=45917875
             | - Nov 2025 (214 comments)
        
         | Scene_Cast2 wrote:
         | I've noticed that image models are particularly bad at
         | modifying popular concepts in novel ways (way worse
         | "generalization" than what I observe in language models).
        
           | emp17344 wrote:
           | Maybe LLMs always fail to generalize outside their data set,
           | and it's just less noticeable with written language.
        
             | cluckindan wrote:
             | This is it. They're language models which predict next
             | tokens probabilistically and a sampler picks one according
             | to the desired "temperature". Any generalization outside
             | their data set is an artifact of random sampling:
             | happenstance and circumstance, not genuine substance.
        
               | cluckindan wrote:
               | However: do humans have that genuine substance? Is human
               | invention and ingenuity more than trial and error, more
               | than adaptation and application of existing knowledge?
               | Can humans generalize outside their data set?
               | 
               | A yes-answer here implies belief in some sort of gnostic
               | method of knowledge acquisition. Certainly that comes
               | with a high burden of proof!
        
               | dawidloubser wrote:
               | Yes
        
               | cluckindan wrote:
               | Can you elaborate on what you mean by that, and prove it?
               | 
               | https://journals.sagepub.com/doi/10.1177/0963721425133621
               | 2
        
               | sophrosyne42 wrote:
                | Yes. Humans can perform abduction, extrapolating given
                | information to new information. LLMs cannot; they can
                | only interpolate new data based on existing data.
        
             | IshKebab wrote:
             | They definitely don't _completely fail_ to generalise. You
             | can easily prove that by asking them something completely
             | novel.
             | 
             | Do you mean that LLMs might display a similar tendency to
             | modify popular concepts? If so that definitely might be the
             | case and would be fairly easy to test.
             | 
             | Something like "tell me the lord's prayer but it's our
             | mother instead of our father", or maybe "write a haiku but
             | with 5 syllables on every line"?
             | 
             | Let me try those ... nah ChatGPT nailed them both. Feels
             | like it's particular to image generation.
        
               | immibis wrote:
               | They used to do poorly with modified riddles, but I
               | assume those have been added to their training data now
               | (https://huggingface.co/datasets/marcodsn/altered-riddles
               | ?)
               | 
               | Like, the response to "... The surgeon (who is male and
               | is the boy's father) says: I can't operate on this boy!
               | He's my son! How is this possible?" used to be "The
               | surgeon is the boy's mother"
               | 
               | The response to "... At each door is a guard, each of
               | which always lies. What question should I ask to decide
               | which door to choose?" would be an explanation of how
               | asking the guard what the other guard would say would
               | tell you the opposite of which door you should go
               | through.
        
             | phire wrote:
             | Most image models are diffusion models, not LLMs, and have
             | a bunch of other idiosyncrasies.
             | 
             | So I suspect it's more that lessons from diffusion image
             | models don't carry over to text LLMs.
             | 
              | And the image models which are based on multimodal LLMs
              | (like Nano Banana) seem to do a lot better at novel
              | concepts.
        
           | CobrastanJorji wrote:
           | Also, they're fundamentally bad at math. They can draw a
           | clock because they've seen clocks, but going further requires
           | some calculations they can't do.
           | 
           | For example, try asking Nano Banana to do something simpler,
           | like "draw a picture of 13 circles." It likely will not work.
        
         | IAmGraydon wrote:
         | That's because they literally cannot do that. Doing what you're
         | asking requires an understanding of why the numbers on the
         | clock face are where they are and what it would mean if there
            | was an extra hour on the clock (i.e. that you would have to
         | divide 360 by 13 to begin to understand where the numbers would
         | go). AI models have no concept of anything that's not included
         | in their training data. Yet people continue to anthropomorphize
         | this technology and are surprised when it becomes obvious that
         | it's not actually thinking.
        
           | bobbylarrybobby wrote:
           | It's interesting because if you asked them to write code to
           | generate an SVG of a clock, they'd probably use a loop from 1
           | to 12, using sin and cos of the angle (given by the loop
           | index over 12 times 2pi) to place the numerals. They know how
           | to do this, and so they basically understand the process that
           | generates a clock face. And extrapolating from that to 13
           | hours is trivial (for a human). So the fact that they can't
           | do this extrapolation on their own is very odd.
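            | 
            | (A minimal Python sketch of the loop described above,
            | extrapolated to 13 hours; all names here are illustrative:)
            | 
            |     import math
            | 
            |     HOURS = 13  # the extrapolation that's trivial for a human
            |     parts = ['<svg xmlns="http://www.w3.org/2000/svg"',
            |              '     width="200" height="200">',
            |              '<circle cx="100" cy="100" r="95"',
            |              '        fill="white" stroke="black"/>']
            |     for i in range(1, HOURS + 1):
            |         # hour HOURS lands at the top; numbering runs clockwise
            |         a = i * 2 * math.pi / HOURS - math.pi / 2
            |         x = 100 + 80 * math.cos(a)
            |         y = 100 + 80 * math.sin(a)
            |         parts.append(f'<text x="{x:.0f}" y="{y:.0f}"'
            |                      f' text-anchor="middle">{i}</text>')
            |     parts.append('</svg>')
            |     print("\n".join(parts))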
        
           | echelon wrote:
           | gpt-image-1 and Google Imagen understand prompts, they just
           | don't have training data to cover these use cases.
           | 
           | gpt-image-1 and Imagen are wickedly smart.
           | 
           | The new Nano Banana 2 that has been briefly teased around the
           | internet can solve incredibly complicated differential
           | equations on chalk boards with full proof of work.
        
             | phkahler wrote:
             | >> The new Nano Banana 2 that has been briefly teased
             | around the internet can solve incredibly complicated
             | differential equations on chalk boards with full proof of
             | work.
             | 
                | That's great, but I bet it can't tie its own shoes.
        
               | esafak wrote:
               | And a submarine can't swim. Big deal.
        
               | echelon wrote:
               | No, but I can get it to do a lot of work.
               | 
               | It's a part of my daily tool box.
        
           | energy123 wrote:
           | The hope was for this understanding to emerge as the most
           | efficient solution to the next-token prediction problem.
           | 
           | Put another way, it was hoped that once the dataset got rich
           | enough, developing this understanding is actually more
           | efficient for the neural network than memorizing the training
           | data.
           | 
           | The useful question to ask, if you believe the hope is not
           | bearing fruit, is _why_. Point specifically to the absent
           | data or the flawed assumption being made.
           | 
           | Or more realistically, put in the creative and difficult
           | research work required to discover the answer to that
           | question.
        
           | ryandrake wrote:
           | I wonder if you would have more success if you painstakingly
           | described the shape and features of a clock in great detail
           | but never used the words clock or time or anything that might
           | give the AI the hint that they were supposed to output
           | something like a clock.
        
             | BrandoElFollito wrote:
              | And this is a problem for me. I guess that it would work,
              | but as soon as the word "clock" appears, the request is
              | gone, because a clock HAS.12.HOURS.
             | 
             | I use this a lot in cybersecurity when I need to do
             | something "illegal". I am refused help, until I say that I
             | am doing research on cybersecurity. In that case no
             | problem.
        
           | Workaccount2 wrote:
           | The problem is more likely the tokenization of images than
           | anything. These models do their absolute worst when pictures
           | are involved, but are seemingly miraculous at generalizing
           | with just text.
        
             | chemotaxis wrote:
             | I wonder if it's because we mean different things by
             | generalization.
             | 
             | For text, "generalization" is still "generate text that
             | conforms to all the usual rules of the language". For
             | images of 13-hour clock faces, we're explicitly asking the
             | LLM to violate the inferred rules of the universe.
             | 
             | I think a good analogy would be asking an LLM to write in
             | English, except the word "the" now means "purple". They
             | will struggle to adhere to this prompt in a conversation.
        
               | Workaccount2 wrote:
               | That's true, but I think humans would stumble a lot too
               | (try reading old printed text from the 18fh cenfury where
               | fhey used "f" insfead of t in prinf, if's a real frick fo
               | gef frough).
               | 
               | However humans are pretty adept at discerning images,
               | even ones outside the norm. I really think there is some
               | kind of architectural block hampering transformers
               | ability to really "see" images. For instance if you show
               | any model a picture of a dog with 5 legs (a fifth leg
               | photoshopped to it's belly) they all say there are only 4
               | legs. And will argue with you about it. Hell GPT-5 even
               | wrote a leg detection script in python (impressive) which
               | detected the 5 legs, and then it said the script was
               | bugged, and modified the parameters until one of the legs
               | wasn't detected, lol.
        
               | onraglanroad wrote:
               | An "f" never replaced a "t".
               | 
               | You probably mean the "long s" that looks like an "f".
        
           | godelski wrote:
           | Yes, the problem is that these so called "world models" do
           | not actually contain a model of the world, or any world
        
         | echelon wrote:
         | That's just a patch to the training data.
         | 
         | Once companies see this starting to show up in the evals and
         | criticisms, they'll go out of their way to fix it.
        
           | rideontime wrote:
           | What would the "patch" be? Manually create some images of
           | 13-hour clocks and add them to the training data? How does
           | that solution scale?
        
           | godelski wrote:
           | s/13/17/g ;)
        
         | coffeecoders wrote:
          | LLMs are terrible at out-of-distribution (OOD) tasks. You
          | should use chain-of-thought suppression and give constraints
          | explicitly.
         | 
         | My prompt to Grok:
         | 
         | ---
         | 
         | Follow these rules exactly:
         | 
         | - There are 13 hours, labeled 1-13.
         | 
         | - There are 13 ticks.
         | 
         | - The center of each number is at angle: index * (360/13)
         | 
         | - Do not infer anything else.
         | 
         | - Do not apply knowledge of normal clocks.
         | 
         | Use the following variables:
         | 
         | HOUR_COUNT = 13
         | 
         | ANGLE_PER_HOUR = 360 / 13 // 27.692307deg
         | 
          | Use index i ∈ [0..12] for hour marks:
         | 
         | angle_i = i * ANGLE_PER_HOUR
         | 
         | I want html/css (single file) of a 13-hour analog clock.
         | 
         | ---
         | 
          | Output from Grok:
         | 
         | https://jsfiddle.net/y9zukcnx/1/
        
           | BrandoElFollito wrote:
            | Well, that's cheating :) You asked it to generate code,
            | which is OK, but it is not the same as directly generating
            | an image of a clock.
           | 
            | Can Grok generate images? What would the result be?
            | 
            | I will try your prompt on ChatGPT and Gemini.
        
             | BrandoElFollito wrote:
                | Gemini failed miserably - a standard 12-hour clock.
                | 
                | Same for ChatGPT.
                | 
                | And Perplexity replaced the 12 with a 13.
        
               | dwringer wrote:
               | > Please create a highly unusual 13-hour analog clock
               | widget, synchronized to system time, with fully animated
               | hands that move in real time, and not 12 but 13 hour
               | markings - each will be spaced at not 5-minute intervals,
               | but at 4-minute-37-second intervals. This makes room for
               | all 13 hour markings. Please pay attention to the correct
               | alignment of the 13 numbers and the 13 hour marks, as
               | well as the alignment of the hands on the face.
               | 
                | This gave me a correct clock face on Gemini - after the
                | model spent _a lot_ of time thinking (and kind of
                | thrashing in a loop for a while). The functionality
                | isn't quite right, not that it entirely makes sense in
                | the first place, but the face - at least in terms of the
                | hour marks - looks OK to me.[0]
               | 
               | [0] https://aistudio.google.com/app/prompts?state=%7B%22i
               | ds%22:%...
        
           | chemotaxis wrote:
           | > Follow these rules exactly:
           | 
           | "Here's the line-by-line specification of the program I need
           | you to write. Write that program."
        
             | signatoremo wrote:
             | Can you write this program in any language?
        
               | chemotaxis wrote:
               | No, do I need to?
        
               | bigfishrunning wrote:
               | Yes.
        
             | serf wrote:
              | it's lazy to brush off the major advantages of a
              | pseudocode-to-anylanguage transpiler as if it's somehow
              | easy or commonplace.
        
           | chiwilliams wrote:
           | I'll also note that the output isn't quite right --- the top
           | number should be 13 rather than 1!
        
             | layer8 wrote:
             | I mean, the specification for the hour marks (angle_i)
             | starts with a mark at angle 0. It just followed that spec.
             | ;)
        
           | NooneAtAll3 wrote:
            | Close enough, but the digit at the top should be the
            | highest, not 1 :/
        
         | BrandoElFollito wrote:
         | This is really cool. I tried to prompt gemini but every time I
         | got _the same picture_. I do not know how to share a session
          | (like it is possible with ChatGPT) but the prompts were
         | 
         | If a clock had 13 hours, what would be the angle between two of
         | these 13 hours?
         | 
         | Generate an image of such a clock
         | 
         | No, I want the clock to have 13 distinct hours, with the angle
         | between them as you calculated above
         | 
         | This is the same image. There need to be 13 hour marks around
         | the dial, evenly spaced
         | 
         | ... And its last answer was
         | 
         | You are absolutely right, my apologies. It seems I made an
         | error and generated the same image again. I will correct that
         | immediately.
         | 
         | Here is an image of a clock face with 13 distinct hour marks,
         | evenly spaced around the dial, reflecting the angle we
         | calculated.
         | 
         | And the very same clock, with 12 hours, and a 13th above the
         | 12...
        
           | ryandrake wrote:
           | This is probably my biggest problem with AI tools, having
           | played around with them more lately.
           | 
           | "You're absolutely right! I made a mistake. I have now
           | comprehensively solved this problem. Here is the corrected
           | output: [totally incorrect output]."
           | 
           | None of them ever seem to have the ability to say "I cannot
           | seem to do this" or "I am uncertain if this is correct,
           | confidence level 25%" The only time they will give up or
           | refuse to do something is when they are deliberately
           | programmed to censor for often dubious "AI safety" reasons.
           | All other times, they come back again and again with extreme
           | confidence as they totally produce garbage output.
        
             | BrandoElFollito wrote:
              | I agree. I see the same even in simple code, where they
              | will bend over backwards apologizing and generate very
              | similar crap.
             | 
              | It is like they are sometimes stuck in a local energy
              | minimum and will just wobble around various similar (and
              | incorrect) answers.
             | 
             | What was annoying in my attempt above is that the picture
             | was _identical_ for every attempt
        
               | ryandrake wrote:
                | These tools' "attitude" reminds me of an eager but
                | incompetent intern, or a poorly trained administrative
                | assistant who works for a powerful CEO. All sycophancy,
                | confidence and positive energy, but not really getting
                | much done.
        
               | SamBam wrote:
                | The issue is that they always say "Here's the final,
                | correct answer" before they've written the answer, so of
                | course the LLM has no idea if it's going to be right
                | before it starts, because it has no clue what it's going
                | to say.
               | 
               | I wonder how it would do if instead it were told "Do not
               | tell me at the start that the solution is going to be
               | correct. Instead, tell me the solution, and at the end
               | tell me if you think it's correct or not."
               | 
                | I have found that on certain logic puzzles that it simply
                | cannot get right, it always tells me that it's _going_ to
                | get it right "this last time," but if asked later it
                | always recognizes its errors.
        
             | int_19h wrote:
             | Gemini specifically is actually kinda notorious for giving
             | up.
             | 
             | https://www.reddit.com/r/artificial/comments/1mp5mks/this_i
             | s...
        
           | notatoad wrote:
            | You can click the share icon (the two-way branch icon; it
            | doesn't look like Apple's share icon) under the image it
            | generates to share the conversation.
            | 
            | I'm curious if the clock image it was giving you was the
            | same one it was giving me.
           | 
           | https://gemini.google.com/share/780db71cfb73
        
             | BrandoElFollito wrote:
             | Thanks for the tip about sharing!
             | 
             | No, my clock was an old style one, to be put on a shelf.
             | But at least it had a "13" proudly right above the "12" :)
             | 
              | This reminds me of my kids when they were in kindergarten
              | and were bringing home art that needed extra explanation
              | to realize what it was. But they were very proud!
        
         | deathanatos wrote:
         | Generate an image of a clock face, but instead of the usual 12
         | hour numbering, number it with 13 hours.
         | 
         | Gemini, 2.5 Flash or "Nano Banana" or whatever we're calling it
         | these days. https://imgur.com/a/1sSeFX7
         | 
         | A normal (ish) 12h clock. It numbered it twice, in two
         | concentric rings. The outer ring is normal, but the inner ring
         | numbers the 4th hour as "IIII" (fine, and a thing that clocks
         | do) and the 8th hour as "VIIII" (wtf).
        
           | bar000n wrote:
            | It should be pretty clear already that anything which is
            | based on (limited to?) communicating in words/text can never
            | grasp conceptual thinking.
            | 
            | We have yet to design a language to cover that, and it might
            | be just a donquijotism we're all diving into.
        
             | rideontime wrote:
             | Really? I can grasp the concept behind that command just
             | fine.
        
             | bayindirh wrote:
             | > We have yet to design a language to cover that, and it
             | might be just a donquijotism we're all diving into.
             | 
             | We have a very comprehensive and precise spec for that [0].
             | 
             | If you don't want to hop through the certificate warning,
             | here's the transcript:
             | 
             | - Some day, we won't even need coders any more. We'll be
             | able to just write the specification and the program will
             | write itself.
             | 
             | - Oh wow, you're right! We'll be able to write a
             | comprehensive and precise spec and bam, we won't need
             | programmers any more.
             | 
             | - Exactly
             | 
             | - And do you know the industry term for a project
             | specification that is comprehensive and precise enough to
             | generate a program?
             | 
             | - Uh... no...
             | 
             | - Code, it's called code.
             | 
             | [0]: https://www.commitstrip.com/en/2016/08/25/a-very-
             | comprehensi...
        
               | snickerbockers wrote:
                | I've been thinking about that a lot too. Fundamentally
                | it's just a different way of telling the computer what
                | to do, and if it seems like telling an LLM to make a
                | program is less work than writing it yourself, then
                | either your program is extremely trivial or there are
                | dozens of redundant programs in the training set that
                | are nearly identical.
                | 
                | If you're actually doing real work you have nothing to
                | fear from LLMs, because any prompt which is specific
                | enough to create a given computer program is going to be
                | comparable in terms of complexity and effort to having
                | done it yourself.
        
             | Uehreka wrote:
             | I don't think that's clear at all. In fact the proficiency
             | of LLMs at a wide variety of tasks would seem to indicate
             | that language is a highly efficient encoding of human
             | thought, much moreso than people used to think.
        
               | tsunamifury wrote:
                | Yeah, it's amazing that the parent post literally
                | misunderstands the fundamental realities of LLMs. The
                | compression they reveal in linguistics, even if blurry,
                | is incredible.
        
             | XenophileJKO wrote:
             | I mean, that's not really "true".
             | 
             | https://claude.ai/public/artifacts/0f1b67b7-020c-46e9-9536-
             | c...
        
         | giancarlostoro wrote:
          | Weird, I never tried that. I tried all the usual tricks that
          | usually work, including swearing at the model (this works
          | scarily well with LLMs), and nothing. I even tried to go the
          | opposite direction: I want a 6-hour clock.
        
         | usui wrote:
         | I've been trying for the longest time and across models to
         | generate pictures or cartoons of people with six fingers and
         | now they won't do it. They always say they accomplished it, but
         | the result always has 5 fingers. I hate being gaslit.
        
         | andix wrote:
         | I gave this "riddle" to various models:
         | 
         | > The farmer and the goat are going to the river. They look
         | into the sky and see three clouds shaped like: a wolf, a
         | cabbage and a boat that can carry the farmer and one item. How
         | can they safely cross the river?
         | 
          | Most of them just give the answer to the well-known river
          | crossing riddle. Some "feel" that something is off, but still
          | have a hard time figuring out that the wolf, boat and cabbage
          | are just clouds.
        
           | userbinator wrote:
           | Basically a variation of
           | https://en.wikipedia.org/wiki/Age_of_the_captain
        
           | jampa wrote:
            | There are a few examples of this as well:
           | 
           | https://www.reddit.com/r/singularity/comments/1fqjaxy/contex.
           | ..
        
             | andix wrote:
              | It really shows how LLMs work. It's all about
              | probabilities, not understanding. If something looks very
              | similar to a well-known problem, the LLM has a hard time
              | "seeing" contradictions, even if they're really easy for
              | humans to notice.
        
           | Recursing wrote:
           | Claude has no problem with this: https://imgur.com/a/ifSNOVU
           | 
           | Maybe older models?
        
             | andix wrote:
              | Try twisting around words and phrases; at some point it
              | might start to fail.
              | 
              | I tried it again yesterday with GPT. GPT-5 manages quite
              | well in thinking mode, but starts to crack in instant
              | mode. 4o completely failed.
              | 
              | It's not that LLMs are unable to solve things like this at
              | all, but it's really easy to find variations that make
              | them struggle really hard.
        
         | chanux wrote:
         | Ah! This is so sad. The manager types won't be able to add an
         | hour (actually, two) to the day even with AI.
        
         | edub wrote:
          | I was able to have AI generate an image like this, not by
          | diffusion/autoregression but by having it write Python code
          | to create the image.
          | 
          | ChatGPT made a nice-looking clock with matplotlib that had
          | some bugs it had to fix (the hours were counter-clockwise).
          | Gemini made correct code one-shot; it used Pillow instead of
          | matplotlib, but it didn't look as nice.
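          | 
          | (A minimal matplotlib sketch of that approach for a 13-hour
          | face; illustrative, not the actual generated code:)
          | 
          |     import numpy as np
          |     import matplotlib.pyplot as plt
          | 
          |     HOURS = 13
          |     fig, ax = plt.subplots(figsize=(4, 4))
          |     ax.set_aspect("equal"); ax.axis("off")
          |     ax.add_patch(plt.Circle((0, 0), 1.0, fill=False, lw=2))
          |     for i in range(1, HOURS + 1):
          |         t = np.pi / 2 - i * 2 * np.pi / HOURS  # clockwise, 13 at top
          |         ax.text(0.85 * np.cos(t), 0.85 * np.sin(t), str(i),
          |                 ha="center", va="center")
          |     ax.set_xlim(-1.1, 1.1); ax.set_ylim(-1.1, 1.1)
          |     plt.savefig("clock13.png")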
        
         | nl wrote:
          | I do playing card generation, and almost all of them struggle
          | beyond the "6 of X".
         | 
         | My working theory is that they were trained really hard to
         | generate 5 fingers on hands but their counting drops off
         | quickly.
        
       | abathologist wrote:
        | This is great. If you think that the phenomenon of human-like
        | text generation evinces human-like intelligence, then this
        | should be taken to evince that the systems likely have dementia.
       | https://en.wikipedia.org/wiki/Montreal_Cognitive_Assessment
        
         | AIorNot wrote:
          | Imagine if I asked you to draw pixels and operate a clock via
          | HTML, or create a JPEG with pencil and paper, and have it be
          | accurate... I suspect your handcoded work would be off by an
          | order of magnitude in comparison.
        
       | jonplackett wrote:
        | Kimi is kicking ass.
        
       | busymom0 wrote:
        | Because a new clock is generated every minute, it looks like
        | simply changing the time by a digit causes the result to be
       | significantly different from the previous iteration.
        
       | shevy-java wrote:
       | Now that is actually creative.
       | 
       | Granted, it is not a clock - but it could be art. It looks like a
       | Picasso. When he was drunk. And took some LSD.
        
       | kburman wrote:
        | These types of tests are fundamentally flawed. I was able to
        | create a perfect clock using Gemini 2.5 Pro -
       | https://gemini.google.com/share/136f07a0fa78
        
         | sinak wrote:
         | How are they flawed?
        
           | earthnail wrote:
            | The results are not reproducible, as evidenced by the
            | parent poster.
        
             | micromacrofoot wrote:
             | isn't that kind of the point of non-determinism?
        
               | earthnail wrote:
               | No. Good nondeterministic models reproducibly generate
               | equally desirable output - not identical output, but
               | interchangeable.
        
               | micromacrofoot wrote:
               | oh I see, thank you for clarifying
        
         | jmdeon wrote:
         | Aren't they attempting to also display current time though?
         | Your share is a clock starting at midnight/noon. Kimi K2 seems
         | to be the best on each refresh.
        
         | Drew_ wrote:
         | The website is regenerating the clocks every minute. When I
         | opened it, Gemini 2.5 was the only working one. Now, they are
         | all broken.
         | 
         | Also, your example is not showing the current time.
        
           | system2 wrote:
            | It wouldn't be hard to tell it to pick up browser time as
            | the default starting point. Just a piece of prompt.
        
         | allenu wrote:
         | I don't think this is a serious test. It's just an art piece to
         | contrast different LLMs taking on the same task, and against
         | themselves since it updates every minute. One minute one of the
         | results was really good for me and the next minute it was very,
         | very bad.
        
         | dwringer wrote:
         | Even Gemini Flash did really well for me[0] using two prompts -
         | the initial query and one to fix the only error I could
         | identify.
         | 
         | > Please generate an analog clock widget, synchronized to
         | actual system time, with hands that update in real time and a
         | second hand that ticks at least once per second. Make sure all
         | the hour markings are visible and put some effort into making a
         | modern, stylish clock face.
         | 
         | Followed by:
         | 
          | > Currently the hands are working perfectly but they're
          | translated incorrectly, making them uncentered. Can you ensure
          | that each one is translated to the correct position on the
          | clock face?
         | 
         | [0]
         | https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
        
       | lxe wrote:
       | Honestly, I think if you track the performance of each over time,
       | since these get regenerated once in a while, you can then have a
       | very, very useful and cohesive benchmark.
        
       | 1yvino wrote:
        | I wonder if the Qwen output would look like a hallucination?
        
       | fschuett wrote:
       | Reminds me of this: https://www.youtube.com/watch?v=OGbhJjXl9Rk
        
       | S0y wrote:
        | To be fair, this is a deceptively hard task.
        
         | bobbylarrybobby wrote:
         | Without AI assistance, this should take ~10-15 minutes for a
         | human. Maybe add 5 minutes if you're not allowed to use d3.
        
           | alexmorley wrote:
            | It's just HTML/CSS, so no JS at all, let alone d3.
        
           | postalrat wrote:
            | What's your hourly rate? I'll pay you to make as many as you
           | can in a few hours if you share the video.
        
           | Mashimo wrote:
           | I would not even know how to draw a circle with CSS to be
           | honest.
        
             | Bolwin wrote:
              | Pretty sure CSS has a sin() function; that's half your work.
        
       | zkmon wrote:
       | Was Claude banned from this Olympics?
        
         | giancarlostoro wrote:
          | Haiku is the lightweight Claude model; I'm not sure why they
          | picked the weaker model.
        
       | collimarco wrote:
       | In any case those clocks are all extremely inaccurate, even if AI
       | could build a decent UI (which is not the case).
       | 
       | Some months ago I published this site for fun:
        | https://timeutc.com There's a lot of code involved to make it
        | precise to the ms, including adjusting based on network delay,
        | using the frame refresh rate instead of setTimeout, and much
        | more. If you are curious, take a look at the source code.
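        | 
        | (The network-delay adjustment described here is the classic
        | NTP-style half-RTT correction; a Python sketch, where the
        | server-time endpoint is a hypothetical placeholder:)
        | 
        |     import time, urllib.request
        | 
        |     def clock_offset_ms(url):  # url: hypothetical endpoint returning ms
        |         t0 = time.time() * 1000
        |         server_ms = float(urllib.request.urlopen(url).read())
        |         t1 = time.time() * 1000
        |         # assume symmetric delay: the server stamped ~halfway through RTT
        |         return server_ms - (t0 + (t1 - t0) / 2)  # add to local time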
        
       | mstipetic wrote:
       | GPT-5 is embarrassing itself. Kimi and DeepSeek are very
       | consistently good. Wild that you can just download these models.
        
       | shubham_zingle wrote:
        | Not sure about the accuracy, though; I'm shooting in the dark.
        
       | awkwam wrote:
        | Limiting the model to only 2000 tokens while also asking it to
        | output ONLY HTML/CSS is just stupid. It's like asking a
        | programmer to perform the same task with half their brain
        | removed and all their programming experience forgotten. This is
        | a stupid and meaningless benchmark.
        
       | system2 wrote:
       | Ask Claude or ChatGPT to write it in Python, and you will see
       | what they are capable of. HTML + CSS has never been the strong
       | suit of any of these models.
        
         | camalouu wrote:
          | Claude generates some JS/CSS stuff even when I don't ask for
          | it. I think Claude itself at least believes it is good at this.
        
       | munro wrote:
        | Amazing. Some people who use LLMs for soft outcomes are so
        | enamored with them that they disagree with me when I say to be
        | careful, they're not perfect -- this is such a great
        | non-technical way to explain the reality I'm seeing when using
        | them on hard-outcome coding/logic tasks. "Hey, this test is
        | failing", _LLM deletes test_, "FIXED!"
        
         | worldsayshi wrote:
         | Yeah it seems crazy to use LLM on any task where the output
         | can't be easily verified.
        
           | palmotea wrote:
           | > Yeah it seems crazy to use LLM on any task where the output
           | can't be easily verified.
           | 
            | I disagree, those tasks are _perfect_ for LLMs, since a bug
            | you can't verify isn't a problem when vibecoding.
        
         | mopsi wrote:
         | > "Hey this test is failing", LLM deletes test, "FIXED!"
         | 
         | A nice continuation of the tradition of folk stories about
         | supernatural entities like teapots or lamps that grant wishes
         | and take them literally. "And that's why, kids, you should
         | always review your AI-assisted commits."
        
         | derbOac wrote:
         | Something that struck me when I was looking at the clocks is
         | that we _know_ what a clock is supposed to look and act like.
         | 
         | What about when we don't know what it's supposed to look like?
         | 
         | Lately I've been wrestling with the fact that unlike, say, a
         | generalized linear model fit to data with some inferential
         | theory, we don't have a theory or model for the uncertainty
         | about LLM products. We recognize when it's off about things we
         | know are off, but don't have a way to estimate when it's off
         | other than to check it against reality, which is probably the
         | exception to how it's used rather than the rule.
        
           | ehnto wrote:
           | I need to be delicate with wording here, but this is why it's
           | a worry that all the least intelligent people you know could
           | be using AI.
           | 
           | It's why non-coders think it's doing an amazing job at
           | software.
           | 
            | But, worryingly, it's why using it for research, where you
            | necessarily don't know what you don't know, is going to trip
            | up even smarter people.
        
         | markatkinson wrote:
         | To be fair I'd probably also delete the test.
        
       | novemp wrote:
       | Oh cool, it's the schizophrenia clock-drawing test but for AI.
        
       | otterley wrote:
       | Watching this over the past few minutes, it looks like Kimi K2
       | generates the best clock face most consistently. I'd never heard
       | of that model before today!
       | 
       | Qwen 2.5's clocks, on the other hand, look like they never make
       | it out of the womb.
        
         | bArray wrote:
          | It could be that the prompt is accidentally (or purposefully)
          | more optimised for Kimi K2, or that Kimi K2 is better trained
          | on this particular data. LLMs need "prompt engineers" for a
          | reason: to get the most out of a particular model.
        
           | energy123 wrote:
           | Goes to show the "frontier" is not really one frontier. It's
           | a social/mathematical construct that's useful for a broad
           | comparison, but if you have a niche task, there's no
           | substitute for trying the different models.
        
           | observationist wrote:
           | It's not fair to use prompts tailored to a particular model
           | when doing comparisons like this - one shot results that
           | generalize across a domain demonstrate solid knowledge of the
           | domain. You can use prompting and context hacking to get any
           | particular model to behave pseudo-competently in almost any
           | domain, even the tiny <1B models, for some set of questions.
           | You could include an entire framework and model for rendering
           | clocks and times that allowed all 9 models to perform fairly
           | well.
           | 
           | This experiment, however, clearly states the goal with this
           | prompt: `Create HTML/CSS of an analog clock showing ${time}.
           | Include numbers (or numerals) if you wish, and have a CSS
           | animated second hand. Make it responsive and use a white
           | background. Return ONLY the HTML/CSS code with no markdown
           | formatting.`
           | 
           | An LLM should be able to interpret that, and should be able
           | to perform a wide range of tasks in that same style -
           | countdown timers, clocks, calendars, floating quote bubble
           | cycling through list of 100 pithy quotations, etc.
           | Individual, clearly defined elements should have complex
           | representations in latent space that correspond to the human
           | understanding of those elements. Tasks and operations and
           | goals should likewise align with our understanding. Qwen 2.5
           | and some others clearly aren't modeling clocks very well, or
           | maybe the html/css rendering latents are broken. If you pick
            | a semantic axis (like analog clocks), you can run a suite of
           | tests to demonstrate their understanding by using limited
           | one-shot interactions.
           | 
           | Reasoning models can adapt on the fly, and are capable of
           | cheating - one shots might have crappy representations for
           | some contexts, but after a lot of repetition and refinement,
           | as long as there's a stable, well represented proxy for
           | quality somewhere in the semantics it understands, it can
           | deconstruct a task to fundamentals and eventually reach high
           | quality output.
           | 
           | These type of tests also allow us to identify mode collapses
           | - you can use complex sophisticated prompting to get most
           | image models to produce accurate analog clocks displaying any
           | time, but in the simple one shot tests, the models tend to
           | only be able to produce the time 10:10, and you'll get wild
           | artifacts and distortions if you try to force any other
           | configuration of hands.
           | 
           | Image models are so bad at hands that they couldn't even get
           | clock hands right, until recently anyway. Nano banana and
           | some other models are much better at avoiding mode collapses,
           | and can traverse complex and sophisticated compositions
           | smoothly. You want that same sort of semantic generalization
           | in text generating models, so hopefully some of the
           | techniques cross over to other modalities.
           | 
           | I keep hoping they'll be able to use SAE or some form of
           | analysis on static weight distributions in order to uncover
           | some sort of structural feature of mode collapse, with a
           | taxonomy of different failure modes and causes, like limited
           | data, or corrupt/poisoned data, and so on. Seems like if you
           | had that, you could deliberately iterate on, correct issues,
           | or generate supporting training material to offset big
           | distortions in a model.
        
             | jquery wrote:
             | Qwen 2.5 is so bad it's good. Some really insane results if
             | you watch it for a while. Almost like it's taking the piss.
        
           | bigfishrunning wrote:
           | How much engineering do prompt engineers do? Is it
           | engineering when you add "photorealistic. correct number of
           | fingers and teeth. High quality." to the end of a prompt?
           | 
           | we should call them "prompt witch doctors" or maybe "prompt
           | alchemists".
        
             | Dilettante_ wrote:
             | "How is engineering a real science? You just build the
             | bridge so it doesn't fall down."
        
               | vohk wrote:
               | Nah.
               | 
               | Actual engineers have professional standards bodies and
               | legal liability when they shirk and the bridge falls down
               | or the plane crashes or your wiring starts on fire.
               | 
               | Software "engineers" are none of those things but can at
               | least emulate the approaches and strive for
               | reproducibility and testability. Skilled craftsman; not
               | engineers.
               | 
               | Prompt "engineers" is yet another few steps down the
               | ladder, working out mostly by feel what magic words best
               | tickle each model, and generally with no understanding of
               | what's actually going on under the hood. Closer to a chef
               | coming up with new meals for a restaurant than anything
               | resembling engineering.
               | 
               | The battle on the use of language around engineer has
               | long been lost but applying it to the subjective creative
               | exercise of writing prompts is just more job title
               | inflation. Something doesn't need to be engineering to be
               | a legitimate job.
        
               | Dilettante_ wrote:
                | > _The battle on the use of language around engineer
                | has long been lost_
                | 
                | That's really the core of the issue: We're just having
                | the age-old battle of prescriptivism vs descriptivism
                | again. An "engineer", etymologically, is basically just
                | "a person who comes up with stuff", one who is
                | "ingenious". I'm tempted to say it's _you
                | prescriptivists_ who are making a "battle" out of this.
                | 
                | > _subjective creative exercise of writing prompts_
                | 
                | Implying that there are no testable results, no objective
                | success or failure states? Come on man.
        
               | jahewson wrote:
               | Engineers use their ingenuity. That's it.
               | 
               | If physical engineers understood everything then
               | standards would not have changed in many decades. Safety
               | factors would be mostly unnecessary. Clearly not the
               | case.
        
               | skeeter2020 wrote:
               | >> Engineers use their ingenuity. That's it.
               | 
                | If this were enough, all novel creation would be
                | engineering, and that's clearly not true. Engineering
                | attempts to discover & understand consistent outcomes
                | when a myriad of variables are altered, and the
                | boundaries where the variables exceed a model's
                | predictive powers - then adds a buffer for the unknown.
               | Manipulating prompts (and much of software development)
               | attempts to control the model to limit the number of
               | variables to obtain some form of useful abstraction.
               | Physical engineering can't do this.
        
             | BoorishBears wrote:
             | I like that actually, I've spent the last year probably
             | 60:40 between post-training and prompt engineering/witch
             | doctoring (the two go together more than most people
             | realize)
             | 
             | Some of it is engineering-like, but I've also picked up a
             | sixth sense when modifying prompts about what parts are
             | affecting the behavior I want to modify for certain models,
             | and that feels very witch doctory!
             | 
             | The more engineering-like part is essentially trying to RE
             | a black box model's post-training, but that goes over some
             | people's heads so I'm happy to help keep the "it's just
             | voodoo and guessing" narrative going instead :)
        
               | lanstin wrote:
                | I think the coherence behind prompt engineering is not in
                | the literal meanings of the words but in finding the
                | vocabulary used by the sources that have your solution.
               | Ask questions like a high school math student and you get
               | elementary words back. Ask questions in the lingo of a
               | Linux bigot and you will get good awk scripts back. Use
               | academic maths language and arXiv answers will be
               | produced.
        
             | scrollop wrote:
             | "...and do it really well or my grandmother will be killed
             | by her kidnappers! And I'll give you a tip of 2 billion
             | dollars!!! Hurry, they're coming!"
        
               | carterschonwald wrote:
                | I've heard this actually works annoyingly well.
        
               | DrewADesign wrote:
               | We've created technology so sophisticated it is
               | vulnerable to social engineering attacks.
        
               | skeeter2020 wrote:
               | this has worked - and continues to do so - very well to
               | escape guard rails. If a direct appeal doesn't work you
               | can then talk them around with only a handful of prompts.
        
               | carterschonwald wrote:
                | Also, the amount of "adjacent remarks are always
                | topical" flavor of confusion is cartoonish. I'm playing
                | with ideas for making that better.
        
               | DrewADesign wrote:
               | You're absolutely right! People should pay attention to
               | this broadly applicable and important consideration.
        
               | manmal wrote:
               | Adding this to my snippets.
        
             | WJW wrote:
             | Well if it works consistently, I don't see any problem with
             | that. If they have a clear theory of when to add
             | "photorealistic" and when to add "correct number of wheels
             | on the bus" to get the output they want, it's engineering.
             | If they don't have a (falsifiable) theory, it's probably
             | not engineering.
             | 
             | Of course, the service they really provide is for
             | businesses to feel they "do AI", and whether or not they do
             | real engineering is as relevant as if your favorite
             | pornstars' boobs are real or not.
        
               | jahewson wrote:
               | Maybe we could keep the conversation out of the gutter.
        
               | rrr_oh_man wrote:
               | Porn is taxable income, not the gutter.
        
               | jrflowers wrote:
               | You don't really see much porn in the gutters these days
               | with the decline in popularity of print publishing. It's
               | almost all online now
        
               | leptons wrote:
               | >as relevant as if your favorite pornstars' boobs are
               | real or not
               | 
               | This matters more than you might think.
        
             | tomrod wrote:
              | It could be bioengineering if you add that to a clock
              | prompt and then connect it to a CRISPR process for
              | outputting DNA.
             | 
             | Horrifying prospect, tbh
        
             | int_19h wrote:
             | I write quite a lot of prompts, and the closest analogy
             | that I can think of is a shaman trying to appease the
             | spirits.
        
               | minikomi wrote:
               | I find it a surprisingly similar mindset to songwriting,
               | a lot of local maxima searching and spaghetti flinging.
                | Sometimes you hit a good groove and explore it.
        
               | skeeter2020 wrote:
               | It might be even more ridiculous to make this something
               | akin to art over engineering.
        
             | davidsainez wrote:
              | Sure, we are still closer to alchemy than materials
              | science, but it's still early days. Consider this
              | blog post that was on the front page today:
             | https://www.levs.fyi/blog/2-years-of-ml-vs-1-month-of-
             | prompt.... The table on the bottom shows a generally steady
             | increase in performance just by iterating on prompts. It
             | feels like we are on the path to true engineering.
        
               | raddan wrote:
               | Engineers usually have at least some sense as to why
               | their efforts work though. Does anybody who iterates on
               | prompts have even the fuzziest idea why they work? Or
               | what the improvement might be? I do not.
        
               | skeeter2020 wrote:
               | If there is ANY relationship to engineering here maybe
                | it's like reverse engineering a BIOS in a clean room,
                | where you poke away and see what happens. The missing part
               | is the use of anything resembling the scientific method
               | in terms of hypothesis, experiment design, observation
               | guiding actions, etc and the deep knowledge that will
               | allow you to understand WHY something might be happening
               | based on the inputs. "Prompt Engineering" seems about as
               | close to this as probing for land mines in a battlefield,
               | only with no experience and your eyes closed.
        
             | tamimio wrote:
             | > we should call them "prompt witch doctors" or maybe
             | "prompt alchemists".
             | 
              | Oh, absolutely not! Only in engineering are you allowed
              | to be called an engineer for no apparent reason; do that
              | in any other white-collar field and you'd be behind bars
              | for fraudulent claims.
        
             | skeeter2020 wrote:
             | we used to just call them "good at googling". I've never
             | met a self-described prompt engineer who had anything close
              | to engineering education and experience. Seems like an
              | extension of the "6-week boot camp == software engineer"
              | trend.
        
           | woodson wrote:
           | Just use something like DSPy/Ax and optimize your module for
           | any given LLM (based on sample data and metrics) and you're
           | mostly good. No need to manually wordsmith prompts.
        
           | andix wrote:
           | I think the selection of models is a bit off. Haiku instead
           | of Sonnet for example. Kimi K2's capabilities are closer to
           | Sonnet than to Haiku. GPT-5 might be in the non-reasoning
           | mode, which routes to a smaller model.
        
             | ceroxylon wrote:
             | I had my suspicions about the GPT-5 routing as well. When I
             | first looked at it, the clock was by far the best; after
             | the minute went by and everything refreshed, the next three
             | were some of the worst of the group. I was wondering if it
             | just hit a lucky path in routing the first time.
        
         | frizlab wrote:
          | I knew of Kimi K2 because it's the model used by Kagi to
          | generate the AI answers when a query ends with a question
          | mark.
        
           | OJFord wrote:
           | It's also one of the few 'recommended' models in Kagi
           | Assistant (multi-model ChatGPT basically, available on paid
           | plans).
        
           | Bolwin wrote:
           | Really? They must've switched recently cause that was around
           | before kimi came out
        
             | frizlab wrote:
             | Yes, this is recent. Before it was other model(s), not sure
             | which.
        
         | abixb wrote:
         | >Qwen 2.5's clocks, on the other hand, look like they never
         | make it out of the womb.
         | 
         | More like fell headfirst into the ground.
         | 
         | I'm disappointed with Gemini 2.5 (not sure Pro or Flash) --
         | I've personally had _fantastic_ results with Gemini 2.5 Pro
          | building PWAs, especially since the May 2025 "coding update."
         | [0]
         | 
         | [0] https://blog.google/products/gemini/gemini-2-5-pro-updates/
        
         | jquery wrote:
         | I've been using Kimi K2 a lot this month. Gives me
         | Japanese->English translations at near human levels of quality,
         | while respecting rules and context I give it in a very long,
         | multi-page system prompt to improve fidelity of translation for
         | a given translation target (sometimes markup tags need to be
         | preserved, sometimes deleted, etc.). It doesn't require a
         | thinking step to generate this level of translation quality,
         | making it suitable for real-time translation. It doesn't start
         | getting confused when I feed it a couple dozen lines of
         | previous translation context, like certain other LLMs do...
         | instead the translation actually improves with more context
         | instead of degrading. It's never refused a translation for
         | "safety" purposes either (GPT and Gemini love to interrupt my
         | novels and tell me certain behavior is illegal or immoral, and
         | censor various anatomical words).
        
           | komali2 wrote:
           | > GPT and Gemini love to interrupt my novels and tell me
           | certain behavior is illegal or immoral, and censor various
           | anatomical words
           | 
            | Lol, are you using AI to create fan translations of ero
            | manhua?
        
             | jquery wrote:
              | No idea what that refers to... just kidding. Mainly
              | visual novels and light novels, the occasional ero.
        
         | kbar13 wrote:
         | i noticed the second hand is off tho. gemini has the most
         | accurate one.
        
         | buffaloPizzaBoy wrote:
          | Right as you said that, I checked Kimi K2's "clock" and it
          | was just the ASCII art: ¯\_(ツ)_/¯
          | 
          | I wonder if that is some type of fallback for errors querying
          | the model, or if K2 actually created the HTML/CSS to display
          | that.
        
         | basch wrote:
          | my GPT-4o was 100% perfect on the first click. Since then,
         | garbage. Gemini 2.5 perfect on the 3rd click.
        
         | paulddraper wrote:
         | Kimi K2 is legitimately good.
        
         | stogot wrote:
         | When I clicked, everything was garbage except Grok and
         | DeepSeek. kimi was the worst clock
        
         | frankfrank13 wrote:
         | I find that Kimi K2 _looks_ the best, but i 've noticed the
         | time is often wrong!
        
         | Mistletoe wrote:
         | Qwen's clocks are highly entertaining. Like if you asked an
         | alien "make me a clock".
        
         | dilap wrote:
          | I'm a huge K2 fan, it has a personality that feels very
          | distinct from other models (not sycophantic at all), and is
          | quite smart. Also pretty good at creative writing (tho not
          | 100% slop free).
          | 
          | K2 hosted on Groq is pretty crazy for intelligence/second.
          | (Low rate limits still, tho.)
        
         | nightpool wrote:
         | It would be cool to also AI generate the favicon using some
         | sort of image model.
        
         | oaktowner wrote:
          | Perhaps Qwen 2.5 should be known as Dali 2.5!?
        
         | wowczarek wrote:
         | Interestingly, either I'm _hallucinating_ this, or DeepSeek
         | started to consistently show a clock without failures and with
         | good time, where it previously didn't. ...aaand as I was typing
         | this, it barfed a train wreck. Never mind, move along... No,
         | wait, it's good again, no, wait...
        
       | earth2mars wrote:
       | https://gemini.google.com/share/00967146a995 works perfectly fine
       | with gemini 2.5 pro
        
         | lanewinfield wrote:
         | nice. I restrict to 2000 tokens for mine, how many was that?
        
         | esafak wrote:
         | how do you do that?
        
           | earth2mars wrote:
           | I used exactly the same prompt this site uses. Nothing else.
        
           | agildehaus wrote:
           | I'm assuming the "Gemini 2.5" referenced on this site is
           | Flash, not Pro. Pro is insane, and 3.0 is just around the
           | corner.
        
       | lanewinfield wrote:
       | hi, I made this. thank you for posting.
       | 
       | I love clocks and I love finding the edges of what any given
       | technology is capable of.
       | 
       | I've watched this for many hours and Kimi frequently gets the
       | most accurate clock but also the least variation and is most
        | boring. Qwen is oftentimes the most insane and makes me laugh.
       | Which one is "better?"
        
         | anigbrowl wrote:
         | I really like this. The broken ones are sometimes just
         | failures, but sometimes provide intriguing new design ideas.
        
           | jdiff wrote:
            | This same principle is why my favorite image generation
            | models are the earlier ones from 2019-2020, which could
            | only reliably generate soup. It's like Rorschach tests:
            | it's not about what's there, it's about what you see in
            | them. I don't want a bot to make art for me, sometimes I
            | just want some shroom-induced inspirational smears.
        
             | nemomarx wrote:
             | I really miss that deepdream aesthetic with the dogs eyes
             | popping up everywhere.
        
         | csours wrote:
         | LOVE IT!
         | 
         | It would be really cool if I could zoom out and have everything
         | scale properly!
        
         | Fabricio20 wrote:
         | Why is this different per user? I sent this to a few friends
         | and they all see different things from what i'm seeing, for the
         | same time..?
        
           | samtheprogram wrote:
           | It regenerates on page load. I find that pretty useful.
           | 
           | Grok 4 and Kimi nailed it the first time for me, then only
           | Kimi on the second pass.
        
             | malfist wrote:
             | Not on page load, it regenerates every minute. There's a
             | little hovering question mark in the top right that
             | explains things, including the prompt to the models.
        
           | layer8 wrote:
           | It's different per minute, not per user.
        
         | bspammer wrote:
         | If you're keeping all the generated clocks in a database, I'd
         | love to see a Facemash style spin-off website where users pick
         | the best clock between two options, with a leaderboard. I want
         | to know what the best clock Qwen ever made was!
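          | 
          | Facemash-style pairwise voting usually boils down to an Elo
          | update; a minimal sketch (the K-factor and the shape of the
          | records are illustrative):
          | 
          |     // The winner gains what the Elo model says it "risked";
          |     // the update is zero-sum.
          |     function eloUpdate(winner, loser, k = 32) {
          |       const e = 1 / (1 + 10 ** ((loser.rating - winner.rating) / 400));
          |       winner.rating += k * (1 - e);
          |       loser.rating += k * (e - 1);
          |     }
          |     // e.g. after a user picks Kimi's clock over Qwen's:
          |     // eloUpdate(kimi, qwen);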
        
           | nightpool wrote:
           | Yes! Please do this
        
           | abixb wrote:
           | We might be on to creating a new crowd-ranked LLM benchmark
           | here.
        
             | addandsubtract wrote:
             | A pelican wearing a working watch
        
               | danw1979 wrote:
                | Using it to time a bicycle race?
        
           | layer8 wrote:
           | Not the best, but the most amusing.
        
         | ks2048 wrote:
         | Nice job! Maybe let users click an example to see the raw
         | source (LLM output)
        
         | chemotaxis wrote:
         | This is honestly the best thing I've seen on HN this month.
          | It's stupid, enlightening... funny and profound at the same
         | time. I have a strong temptation to pick some of these designs
         | and build them in real life.
         | 
         | I applaud you for spending money to get it done.
        
         | hakcermani wrote:
          | Would you mind sharing the prompt... in a gist, perhaps?
        
           | ceroxylon wrote:
           | They have it available on the site under the (?) button:
           | 
           | "Create HTML/CSS of an analog clock showing ${time}. Include
           | numbers (or numerals) if you wish, and have a CSS animated
           | second hand. Make it responsive and use a white background.
           | Return ONLY the HTML/CSS code with no markdown formatting."
        
         | smusamashah wrote:
          | Please make it show the last 5 (or some other number) clocks
          | for each model. It would be nice to see the deviation and
          | variety for each model at a glance.
        
         | jdietrich wrote:
         | Clock drawing is widely used as a test for assessing dementia.
         | Sometimes the LLMs fail in ways that are fairly predictable if
         | you're familiar with CSS and typical shortcomings of LLMs, but
         | sometimes they fail in ways that are less obvious from a
         | technical perspective but are _exactly the same_ failure modes
         | as cognitively-impaired humans.
         | 
         | I think you might have stumbled upon something surprisingly
         | profound.
         | 
         | https://www.psychdb.com/cognitive-testing/clock-drawing-test
        
           | TheJoeMan wrote:
           | Figure 6 with the square clock would be a cool modern art
           | piece.
        
           | xrisk wrote:
           | Maybe explainable via the fact that these tests are part of
           | the LLM training set?
        
           | jorgesborges wrote:
           | Conceptual deficit is a great failure mode description. The
           | inability to retrieve "meaning" about the clock -- having
           | some understanding about its shape and function but not its
            | intent to convey time to us -- is familiar from a lot of
            | bad LLM output.
        
           | overfeed wrote:
           | > Clock drawing is widely used as a test for assessing
           | dementia
           | 
           | Interestingly, clocks are also an easy tell for when you're
           | dreaming, if you're a lucid dreamer; they never work normally
           | in dreams.
        
             | ghurtado wrote:
             | In lucid dreams there's a whole category of things like
             | this: reading a paragraph of text, looking at a clock
             | (digital or analog), or working any kind of technology more
             | complex than a calculator.
             | 
             | For me personally, even light switches have been a huge
             | tell in the past, so basically almost anything electrical.
             | 
             | I've always held the utterly unscientific position that
             | this is because the brain only has enough GPU cycles to
             | show you an approximation of what the dream world looks
             | like, but to actually run a whole simulation behind the
             | scenes would require more FLOPs than it has available.
             | After all, the brain also needs to run the "player"
             | threads: It's already super busy.
             | 
             | Stretching the analogy past the point of absurdity, this is
             | a bit like modern video game optimizations: the mountains
             | in the distance are just a painting on a surface, and the
             | remote on that couch is just a messy blur of pixels when
             | you look at it up close.
             | 
             | So the dreaming brain is like a very clever video game
             | developer, I guess.
        
               | tablatom wrote:
               | Wait, lucid dreamers need tells to know where they are?!?
        
               | Kiro wrote:
               | Yes, that's how you enter the lucid state. You find ways
               | to tell that you're dreaming and condition yourself to
               | check for those while awake. Eventually you will do it
               | inside a dream and realize that you're dreaming.
        
               | Kiboneu wrote:
               | Yeah. It's very common to notice anomalies inside of a
               | dream. But the anomalies weave into the dream and feel
               | normal. You don't have much agency to enter a lucid state
               | from a pre-lucid dream.
               | 
               | So the idea is to develop habits called "reality checks"
               | when you are awake. You look for the broken clock kind of
               | anomalies that the grandparent comment mentioned. You
               | have to be open to the possibility of dreaming, which is
               | hard to do.
               | 
               | Consider this difficulty. Are you dreaming?
               | 
               | ...
               | 
               | ...
               | 
               | How much time did it take to think "no"? Or did you even
               | take this question seriously? Maybe because you are
               | reading a hn comment about lucid dreams, that question is
               | interpreted as an example instead of a genuine question
               | worth investigating, right? That's the difficulty. Try it
               | again.
               | 
               | The key is that the habit you're developing isn't just
               | the check itself -- it's the thinking that you have
               | during the check, which should lead you to investigate.
               | 
               | You do these checks frequently enough you end up doing it
               | in a dream. Boom.
               | 
               | There's also an aspect of identifying recurring patterns
               | during prelucidity. That's why it helps to keep a dream
               | journal for your non-lucid dreams.
               | 
               | There are other methods too.
        
               | david-gpu wrote:
               | Plenty of folks out there know when they are dreaming
               | just like they know when they are awake. It varies from
               | person to person.
        
               | DuperPower wrote:
                | Be careful: adding consciousness to a dream costs CPU
                | cycles, so you wake up more tired. It's cool for kids
                | and teens, but grown adults shouldn't explore this, to
                | avoid bad rest.
        
               | travisjungroth wrote:
                | That's a caution against getting addicted to it, not
                | against ever doing it. I've had powerful experiences in
                | lucid dreaming
               | that I wouldn't trade for a little more rest. I was
               | already in a retreat where I was basically resting all
               | the time.
        
               | conradev wrote:
               | I met someone once who claimed that he lucid dreams
               | almost every night by default and it is exhausting. He
               | smokes weed at night to avoid dreaming entirely. I didn't
               | dig in super deep, but it sounded pretty intense!
        
               | david-gpu wrote:
               | IMO they would benefit from skipping the weed and instead
               | continue to practice lucid dreaming. Over time they will
               | develop their skill and will learn to simply contemplate
               | the dream without reacting to it. It is a calming
               | experience.
        
               | david-gpu wrote:
               | Over time, with accumulated experience, all dreams are
               | lucid from the start. Because of that they are very calm
               | and pleasant; the dreamer is no longer reactive to what
               | happens in the dream because they know nothing is at
               | stake.
        
               | lordnacho wrote:
               | Didn't you ever watch Inception? You have to carry around
               | a little spinning top to test which level of VM you're
               | inside of.
        
               | conradev wrote:
               | The first time it happened to me, it was accidental. I
               | dreamed that I was in a college classroom but I realized
               | that I never went to college. I was not trying to and had
               | never lucid dreamed before, and so it was very
               | surprising.
        
               | BoredomIsFun wrote:
                | My brain learned how to maintain legible text in dreams,
                | so I can't use that tell in lucid dreaming anymore...
        
             | danw1979 wrote:
             | For me it's phones... specifically dialling a number
             | manually. No matter how carefully I dial, the number on the
             | screen is rarely correct.
        
               | allarm wrote:
               | It seems that I've been stuck in a lucid dream for a
                | couple of decades; no matter how carefully I write text
                | on a phone keyboard, it never comes out as intended.
        
               | luckman212 wrote:
               | Tank ypu foe wriiting this
        
               | amelius wrote:
               | Whenever I dial a number while in a dream, the person I'm
               | trying to call always turns out to be right next to me.
        
             | biztos wrote:
             | Do they look normal but just not work normally?
             | 
             | Maybe reality is a world of broken clocks, and they only
             | "work" in the simulation.
        
           | ACCount37 wrote:
           | LLMs don't do this because they have "people with dementia
           | draw clocks that way" in their data. They do it because
           | they're similar enough to human minds in function that they
           | often fail in similar ways.
           | 
           | An amusing pattern that dates back to "1kg of steel is
           | heavier of course" in GPT-3.5.
        
             | kaffekaka wrote:
             | How do you _know_ this?
             | 
             | Obviously, humans failing in these ways ARE in the training
             | set. So it should definitely affect LLM output.
        
               | ACCount37 wrote:
               | First: generalization. The failure modes extend to unseen
               | tasks. That specific way to fail at "1kg of steel" sure
               | was in the training data, but novel closed set logic
               | puzzles couldn't have been. They display similar
               | failures. The same "vibe-based reasoning" process of
               | "steel has heavy vibes, feather has light vibes, thus,
               | steel is heavier" produces other similar failures.
               | 
               | Second: the failures go away with capability (raw scale,
               | reasoning training, test-time compute), on seen and
               | unseen tasks both. Which is a strong hint that the model
               | was truly failing, rather than being capable of doing a
               | task but choosing to faithfully imitate a human failure
               | instead.
               | 
               | I don't think the influence of human failures in the
               | training data on the LLMs is nil, but it's not just a
               | surface-level failure repetition behavior.
        
           | BHSPitMonkey wrote:
           | I would think the way humans draw clocks has more in common
           | with image generation models (which probably do a bit better
           | with this task overall) than a language model producing SVG
           | markup, though.
        
         | charliewallace wrote:
         | Very cool! I also love clocks, especially weird ones, and
         | recently put up this 3D Moebius Strip clock, hope you like it:
         | https://www.mobiusclock.com
        
         | AnonHP wrote:
         | Could you please change and adjust the positions of the titles
         | (like GPT 5)? On Firefox Focus on iOS, the spacing is
         | inconsistent (seems like it moves due to the space taken by the
         | clock). After one or two of them, I had to scroll all the way
         | down to the bottom and come back up to understand which title
         | is linked to which clock.
        
         | brianjking wrote:
         | This is an awesome benchmark. Officially one of my favorites
         | now. Thank you for making this.
        
       | ryandrake wrote:
       | I've been struggling all week trying to get Claude Code to write
       | code to produce visual (not the usual, verifiable, text on a
       | terminal) output in the form of a SDL_GPU rendered scene
       | consisting of the usual things like shaders, pipelines, buffers,
       | textures and samplers, vertex and index data and so on, and boy
       | it just doesn't seem to know what it's doing. Despite providing
       | paragraphs-long, detailed prompts. Despite describing each
       | uniform and each matrix that needs to be sent. Despite giving it
       | extremely detailed guidance about what order things need to be
       | done in. It would have been faster for me to just write the code
       | myself.
       | 
       | When it fails a couple of times it will try to put logging in
       | place and then confidently tell me things like "The vertex data
       | has been sent to the renderer, therefore the output is correct!"
       | When I suggest it take a screenshot of the output each time to
       | verify correctness, it does, and then declares victory over an
       | entirely incorrect screenshot. When I suggest it write unit
       | tests, it does so, but the tests are worthless and only tests
       | that the incorrect code it wrote is always incorrect in the same
       | ways.
       | 
        | When it fails even more times, it will get into what I like to
        | call "intern engineer" mode, where it just tries random things
       | that I know are not going to work. And if I let it keep going, it
       | will end up modifying the entire source tree with random "try
       | this" crap. And each iteration, it confidently tells me:
       | "Perfect! I have found the root cause! It is [garbage bullshit].
       | I have corrected it and the code is now completely working!"
       | 
       | These tools are cute, but they really need to go a long way
       | before they are actually useful for anything more than trivial
       | toy projects.
        
         | fancy_pantser wrote:
          | Have you tried using MCPs to provide documentation and
          | examples? I always have to bring in docs, since I don't work
          | in Python or TS+React (which it seems more capable at), and
          | force it to review those in addition to any specification,
          | e.g. Context7.
        
           | ryandrake wrote:
           | Haven't looked into MCPs yet. Thanks for the suggestion!
        
         | rossant wrote:
         | Have you tried OpenAI Codex with GPT5.1? I'm using it for
         | similar GPU rendering stuff and it appears to do an excellent
         | job.
        
         | jamilton wrote:
         | I know this has been said many times before, but I wonder why
         | this is such a common outcome. Maybe from negative outcomes
         | being underrepresented in the training data? Maybe that plus
         | being something slightly niche and complex?
         | 
          | The screenshot method not working is unsurprising to me;
          | VLMs' visual reasoning is very bad with details because they
          | (as far as I understand) do not really have access to those
          | details, just the image embedding and maybe an OCR'd
          | transcript.
        
         | poszlem wrote:
         | I'm not sure if it's just me, but I've also noticed Claude
         | becoming even more lazy. For example, I've asked it several
         | times to fix my tests. It'll fix four or five of them, then
         | start struggling with the next couple, and suddenly declare
         | something like: "All done, fixed 5 out of 10 tests. I can't fix
         | the remaining ones", followed by a long, convoluted explanation
         | about why that's actually a good thing.
        
           | __MatrixMan__ wrote:
           | I don't know if it has gotten worse, but I definitely find
           | Claude is way too eager to celebrate success when it has done
           | nothing.
           | 
           | It's annoying but I prefer it to how Gemini gets depressed if
           | it takes a few tries to make progress. Like, thanks for not
            | gaslighting me, but now I'm feeling sorry for a big pile of
           | numbers, which was not a stated goal in my prompt.
        
       | paxys wrote:
       | Something I'm not able to wrap my head around is that Kimi K2 is
       | the only model that produces a ticking second hand on every
       | attempt while the rest of them are always moving continuously.
       | What fundamental differences in model training or implementation
        | can result in this disparity? Or was this use case programmed
        | into K2 after the fact?
        
       | aavshr wrote:
       | just curious, why not the sonnet models? In my personal
       | experience, Anthropic's Sonnet models are the best when it comes
       | to things like this!
        
       | xyproto wrote:
       | Try adding to the prompt that it has a PhD in Computer Science
        | and has many methods for dealing with complexity.
       | 
       | This gives better results, at least for me.
        
         | bigfishrunning wrote:
          | Why does that give better results? Is this phenomenon
          | measurable? How would "you have a PhD in computer science"
         | change its ability to interpret prose? Every interaction with
         | an LLM seems like superstition.
        
       | bpt3 wrote:
       | It's wild how much the output varies for the same model for each
       | run.
       | 
       | I'm not sure if this was the intent or not, but it sure
       | highlights how unreliable LLMs are.
        
       | eastbound wrote:
       | Security-wise, this is a website that takes the straight output
       | of AI and serves it for execution on their website.
       | 
        | I know developers do the same, but at least they check it into
        | Git, where they can notice their mistakes. Here is an
        | opportunity for the AI to throw a fake Google authentication
        | prompt at you, or anything else.
        
       | bongodongobob wrote:
       | Weird. Sonnet 4.5 one shotted it with:
       | 
       | Create an interactive artifact of an analog clock face that keeps
       | time properly.
       | 
       | https://claude.ai/public/artifacts/75daae76-3621-4c47-a684-d...
        
       | amelius wrote:
       | Maybe they can ask Sora to make variations of:
       | 
       | https://slate.com/human-interest/2016/07/martin-baas-giant-r...
        
       | whimsicalism wrote:
       | Kimi K2 is obviously the best, but gpt-5 has the most gorgeous
       | ones when it works
        
       | orly01 wrote:
       | What does it mean that each model is allowed 2000 tokens to
       | generate its clock?
        
       | jcmontx wrote:
       | Grok is impressive, I should give it a shot
        
       | Waterluvian wrote:
       | How do they do time without JavaScript? Is there an API I'm not
       | aware of?
        
         | bloppe wrote:
         | CSS animation. It's not the real time. Just a hypothetical
         | time.
        
           | Waterluvian wrote:
           | I'm imagining some must be using JS because I'm seeing
           | (rarely...) times that are perfectly correct.
        
             | bloppe wrote:
              | Actually you're right. If you view source, you can see
              | `const response =
              | await fetch(`/api/clocks?time=${encodeURIComponent(localTime)}`);`.
              | I'm not sure how that API works, but it's definitely
              | reading the current time using JS, then somehow embedding
              | it in the HTML/CSS of each LLM.
        
             | vultour wrote:
             | It's crafted with a prompt that gives the AI the current
             | time, then it simply refreshes every minute so the seconds
             | start at zero correctly.
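              | 
              | One plausible shape for that refresh (a sketch, not the
              | site's actual code):
              | 
              |     // Reload at the top of the next minute so each new
              |     // batch of clocks starts with the second hand at :00.
              |     const wait = 60000 - (Date.now() % 60000);
              |     setTimeout(() => location.reload(), wait);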
        
         | bhandziuk wrote:
          | Looks like CSS keyframes.
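          | 
          | The usual trick (a sketch, modulo transform-origin plumbing;
          | angles assume the prompt said 3:45): the hour and minute
          | hands are rotated statically to the prompted time, and only
          | the second hand animates.
          | 
          |     .hour   { transform: rotate(112.5deg); } /* 3.75h x 30deg */
          |     .minute { transform: rotate(270deg); }   /* 45min x 6deg */
          |     .second { animation: spin 60s linear infinite; }
          |     @keyframes spin { to { transform: rotate(360deg); } }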
        
       | ssl-3 wrote:
       | This really needs to be an xscreensaver hack.
        
       | nasir wrote:
       | where's opus/sonnet! very curious on that!
        
       | ticulatedspline wrote:
       | This is cool, interesting to see how consistent some models are
       | (both in success and failure)
       | 
       | I tried gpt-oss-20b (my go-to local) and it looks ok though not
       | very accurate. It decided to omit numbers. It also took 4500
       | tokens while thinking.
       | 
       | I'd be interested in seeing it with some more token leeway as
        | well as comparing two or more similar prompts, like using
       | "current time" instead of "${time}" and being more prescriptive
       | about including numbers
        
       | porphyra wrote:
       | LLMs can't "look" at the rendered HTML output to see if what they
       | generated makes sense or not. But there ought to be a way to do
        | that, right? To let the model iterate until what it generates
       | looks right.
       | 
       | Currently, at work, I'm using Cursor for something that has an
       | OpenGL visualization program. It's incredibly frustrating trying
       | to describe bugs to the AI because it is completely blind. Like I
       | just wanna tell it "there's no line connecting these two points
       | but there ought to be one!" or "your polygon is obviously
       | malformed as it is missing a bunch of points and intersects
       | itself" but it's impossible. I end up having to make the AI add
       | debug prints to, say, print out the position of each vertex, in
       | order to convince it that it has a bug. Very high friction and
       | annoying!!!
        
         | TheKidCoder wrote:
          | Kinda - hand-waving over the question of whether an LLM can
          | really "look", but you can connect Cursor to a Puppeteer MCP
          | server, which will allow it to iterate with "eyes" by using
          | Puppeteer to screenshot its own output. It still has issues,
          | but it often solves really silly mistakes simply by having
          | this MCP available.
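          | 
          | The screenshot half of that loop is tiny with plain
          | Puppeteer (a minimal sketch, outside the MCP plumbing):
          | 
          |     const puppeteer = require('puppeteer');
          | 
          |     // Render the generated HTML/CSS and capture what it
          |     // actually looks like, to feed back to the model.
          |     async function snapshot(html, path) {
          |       const browser = await puppeteer.launch();
          |       const page = await browser.newPage();
          |       await page.setContent(html);
          |       await page.screenshot({ path });
          |       await browser.close();
          |     }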
        
         | firtoz wrote:
         | Cursor has this with their "browser" function for web dev,
         | quite useful
         | 
          | You can also give it an MCP setup so it can send a screenshot
          | to the conversation, though I'm unsure if anyone has made an
          | easy enough "take a screenshot of a specific window id" kind
          | of MCP, so it may need to be built first
         | 
         | I guess you could also ask it to build that mcp for you...
        
         | fragmede wrote:
         | Claude totally can, same with ChatGPT. Upload a picture to
         | either one of them via the app and tell it there's no line
         | where there should be. There's some plumbing involved to get it
         | to work in Claude code or codex, but yes, computers can "see".
         | If you have lm-server, there's tons of non-text models you can
         | point your code at.
        
         | pil0u wrote:
         | I had some success providing screenshots to Cursor directly. It
         | worked well for web UIs as well as generated graphs in Python.
         | It makes them a bit less blind, though I feel more iterations
         | are required.
        
         | EMM_386 wrote:
          | You can absolutely do this. In fact, with Claude, Anthropic
          | encourages you to send it screenshots. It works very well if
          | you aren't expecting pixel-perfection.
         | 
         | YMMV with other models but Sonnet 4.5 is good with things like
         | this - writing the code, "seeing" the output and then iterating
         | on it.
        
       | kwanbix wrote:
       | What a waste of energy.
        
       | mandolingual wrote:
       | Always interesting/uncanny when AI is tested with human cognitive
       | tests https://www.psychdb.com/cognitive-testing/clock-drawing-
       | test.
        
       | hansmayer wrote:
       | Very funny. It seems the Qwen generates the funniest outputs :)
        
         | csours wrote:
         | Oh, Qwen, buddy, you sure are TRYING
        
       | Imanari wrote:
       | Qwens clocks are hilarious
        
       | cornonthecobra wrote:
       | I like Deepseek v3.1's idea of radially-aligning each hour
       | number's y-axis ("1" is rotated 30deg from vertical, "2" at
       | 60deg, etc.). It would be even better if the numbers were rotated
       | anticlockwise.
       | 
       | I'm not sure what Qwen 2.5 is doing, but I've seen similar in
       | contemporary art galleries.
        
       | gloosx wrote:
       | anyone tried opening this from mobile? not a single clock renders
       | correctly, almost looks like a joke on LLMs
        
       | rtcode_io wrote:
       | See https://clock.rt.ht/::code
       | 
       | AI-optimized <analog-clock>!
       | 
       | People expect perfection on first attempt. This took a brief
       | joint session:
       | 
       | HI: define the custom element API design (attribute/property
       | behavior) and the CSS parts
       | 
       | AI: draw the rest of the f... owl
        
         | speedgoose wrote:
         | This is a white page, am I missing something?
        
       | DeathArrow wrote:
       | How can Deepseek and Kimi get it right while Haiku, Gemini and
       | GPT are making a mess?
        
       | 0xCE0 wrote:
       | Seems like Will's clock drawing test in Hannibal :)
        
       | gwbas1c wrote:
       | Reminds me of the Alzheimer's "draw a clock" test.
       | 
       | Makes me think that LLMs are like people with dementia! Perhaps
       | it's the best way to relate to an LLM?
        
       | hollow-moe wrote:
       | obviously they're all broken on firefox, no one uses firefox
       | anyways
        
       | kylecazar wrote:
        | Non-determinism at its finest. The clock is perfect, the refresh
       | happens, the clock looks like a Dali painting.
        
         | jeremycarter wrote:
          | Last year I wrote a simple system using Semantic Kernel,
          | backed by functions inside Microsoft Orleans, which for the
          | most part was an LLM-driven business-logic DSL processor. Your
          | business logic was just text, and you gave it the operation
          | as text.
          | 
          | Nothing could be relied upon to be deterministic; it was so
          | funny to see it try to do operations.
          | 
          | Recently I re-ran it with newer models and it was drastically
          | better, especially with temperature tweaks.
        
       | __fst__ wrote:
       | This is why we need TeraWatt DCs, to generate code for world
       | clocks every minute.
        
       | teaearlgraycold wrote:
       | Qwen 2.5 doing a surprisingly good job (as of right now).
        
       | maxdo wrote:
        | The selection of Western models is weird: no GPT-5.1 or Opus
        | 4.1 (which nailed it perfectly in something I quickly tested).
        
       | Bengalilol wrote:
       | Qwen doesn't care about clocks, it goes the Dali way, without
       | melting.
       | 
       | It even made a Nietzsche clock (I saw one <body> </body> which
       | was surprisingly empty).
       | 
       | It definitely wins the creative award.
        
       | HarHarVeryFunny wrote:
       | Looks like we've got a new Turing test here: "draw me a clock"
        
       | bitwize wrote:
       | I'm reminded of the "draw a clock" test neurologists use to
       | screen for dementia and brain damage.
        
       | accrual wrote:
       | I love that GPT-5 is putting the clock hands way outside the
       | frame and just generally is a mess. Maybe we'll look back on
       | these mistakes just like watching kids grow up and fumble basic
       | tasks. Humorous in its own unique way.
        
         | palmotea wrote:
         | > Maybe we'll look back on these hilarious mistakes just like
         | watching kids grow up and fumble basic tasks.
         | 
         | Or regret: "why didn't we stop it when we could?"
        
       | anon_cow1111 wrote:
       | I'm having a hard time believing this site is honest, especially
       | with how ridiculous the scaling and rotation of numbers is for
       | most of them. I dumped his prompt into chatgpt to try it myself
       | and it did create a very neat clock face with the numbers at the
       | correct position+animated second hand, it just got the exact time
       | wrong, being a few hours off.
       | 
       | Edit: the time may actually have been perfect now that I account
          | for my ISP's geo-located time zone
        
         | perfmode wrote:
            | I read that the OP limited the output to 2000 tokens.
        
           | lanewinfield wrote:
           | ^ this! there's a lot of clocks to generate so I've
           | challenged it to stick to a small(er) amount of code
        
           | anon_cow1111 wrote:
            | I got a ~1600 character reply from GPT, including spaces,
            | and it worked on the first shot when dumped into an HTML
            | doc. I think that probably fits OK in the limit? (If I
            | missed something obvious feel free to tell me I'm an idiot)
        
             | Springtime wrote:
              | During the second minute I had the AI World Clocks site
              | open, the GPT-5-generated version displayed a perfect
              | clock. Its clock before, and every clock from it since,
              | has had very apparent issues though.
             | 
             | If you could get a perfect clock several times for the
             | identical prompt in fresh contexts with the same model then
              | it'd be a better comparison. Potentially, though, the
              | ChatGPT site you're using is doing some adjustments that
              | the API-fed version isn't.
        
         | Zopieux wrote:
         | On the contrary, in my experience this is very typical of the
          | average failure mode / output of early-2025 LLMs for HTML or
          | SVG.
        
       | ada1981 wrote:
       | Sonnet 4.5 did this easily
       | https://claude.ai/public/artifacts/c1bb5d57-573b-49e0-9539-7...
        
       | edfletcher_t137 wrote:
       | Lack of Claude is a glaring oversight given how popular it is as
       | an agentic coding model...
        
       | chaosprint wrote:
        | This is such a great idea! Surprisingly, Kimi K2 is the only
        | one without any obvious problems. And it's not even the full
        | K2 Thinking version! This made me reread this article from a
        | few days ago:
       | 
       | https://entropytown.com/articles/2025-11-07-kimi-k2-thinking...
        
       | esotericwarfare wrote:
        | This is an ad for Kimi K2
        
       | miohtama wrote:
       | The new Turing time test
        
       | bigbluedots wrote:
       | I just realized I'm running late, it's almost -2!
       | 
       | More seriously, I'd love to see how the models perform the same
       | task with a larger token allowance.
        
       | bigbluedots wrote:
       | Is there a "draw a pelican riding a bicycle" version?
        
         | padolsey wrote:
         | We've done this!
         | https://weval.org/analysis/visual__pelican/f141a8500de7f37f/...
        
       | anonzzzies wrote:
        | Sonnet 4.5 does it flawlessly. Tried 8 times.
        
       | fnord77 wrote:
       | whatever model Cursor uses was telling me the date was March 12,
       | 2023
        
       | imchillyb wrote:
       | I love qwen, it tries so hard with its little paddle and never
       | gets anywhere.
        
       | cyberjill wrote:
       | 666
        
       | wanderingmind wrote:
        | The more I look at it, the more I realise the reason for the
        | cognitive overload I feel when using LLMs for coding. The same
        | prompt to the same model for a pretty straightforward task
        | produces wildly different outputs. Now imagine how wildly
        | different the code is when generating two different logical
        | functions: the casing is different, the commenting is
        | different, there's no semantic continuity. Maybe if I give
        | detailed prompts and ask it to follow them, it might, but in
        | my experience prompt adherence is not great either. I am at
        | the stage where I just use LLMs as autocorrect, rather than
        | using them for any generation.
        
       | bwhiting2356 wrote:
       | You should render it, show an image to the model and allow it to
       | iterate. No person has to one-shot code without seeing what it
       | looks like.
        
       | wewtyflakes wrote:
       | It is funny to see the performance improve across many of the
       | models, somewhat miraculously, throughout the day today.
        
       | stym06 wrote:
        | If a human had done this, these would be in a museum
        
       | woopwoop wrote:
       | The qwen clocks are art.
        
       | josfredo wrote:
       | Watching these gives me a strong feeling of unease. Art-wise, it
       | is a very beautiful project.
        
       | 3oil3 wrote:
       | I wonder which model will silently be updated and suddenly start
       | drawing clocks with Audemars-Piguet-level kind of complications.
        
       | jsmo wrote:
       | lol
        
       | shahzaibmushtaq wrote:
       | Interesting idea!
       | 
        | Why is a new clock being rendered every minute? Or are AI
        | models evolving and improving every minute?
        
       | Vera_Wilde wrote:
       | It's really beautiful! Super clean UI.
       | 
       | The thing I always want from timezone tools is: "Let me simulate
       | a date after one side has shifted but the other hasn't."
       | 
       | Humans do badly with DST offset transitions; computers do great
       | with them.
        
       | JamesAdir wrote:
        | I believe that in a day or two the companies will address
        | this, and it will be solved for that use case.
        
       | surfingdino wrote:
       | What a wonderfully visual example of the crap LLMs turn
       | everything into. I am eagerly awaiting the collapse of the LLM
       | bubble. JetBrains added this crap to their otherwise fine series
       | of IDEs and now I have to keep removing randomly inserted import
       | statements and keep fixing hallucinated names of functions
       | suggested instead of the names of functions that I have already
       | defined in the same file. Lack of determinism where we expect it
       | (most of the things we do, tbh) is creating more problems than it
       | is solving.
        
       | anotheryou wrote:
       | Claude Sonnet 4.5 with a little thinking:
       | https://imgur.com/a/zcJOnKy
       | 
       | no thinking: better clock but not current time (the prompt is
       | confusing here though): https://imgur.com/a/kRK3Q18
        
         | themgt wrote:
         | Just saw Gemini 2.5 with a little thinking:
         | https://imgur.com/a/nypRD7x
        
       | arendtio wrote:
       | Pretty cool already!
       | 
       | I use 'Sonnet 4.5 thinking' and 'Composer 1' (Cursor) the most,
       | so it would be interesting to see how such SOTA models perform in
       | this task.
        
       | boxedemp wrote:
       | That's super neat. I'll keep checking back to this site as new
       | models are released. It's an interesting benchmark.
        
       | baidoct wrote:
       | GPT-5 looks broken
        
       | Zeraous wrote:
        | How Kimi is better than other BILLION$ companies is really fun
        
       | warpspin wrote:
       | Lol. This is supposed to replace me at my job already?
       | 
       | Great experiment!
        
       | adriatp wrote:
       | deepseek representing
        
       | RugnirViking wrote:
        | What's going on with Kimi K2 being so reasonable/unique in so
        | many of these benchmarks I've seen recently? I'll have to try
        | it out further. Is it any good at programming?
        
         | Bolwin wrote:
          | Yes, it trades blows with GLM for the best open-source model
        
       | adi_kurian wrote:
       | Think this is just prompt eng tbh. One shot Haiku 3.5
       | (https://claude.ai/share/66c17968-485e-4d15-974b-4f6958e1e2fd)
       | decent looking too.
       | 
       | Got it to work on gpt 3.5T w modified prompt (albeit not as good
       | - https://pastebin.com/gjEVSEcJ)
       | 
       | `single html file, working analog clock showing current time,
       | numbers positioned (aligned) correctly via trig calc (dynamic),
       | all three hands, second hand ticks, 400px, clean AF aesthetic
       | R/Greenberg Associates circa 2017. empathy, hci, define > design
       | > implement.`
        
       ___________________________________________________________________
       (page generated 2025-11-15 23:01 UTC)