[HN Gopher] Ask HN: Share your AI prompt that stumps every model
       ___________________________________________________________________
        
       Ask HN: Share your AI prompt that stumps every model
        
       I had an idea for creating a crowdsourced database of AI prompts
       that no AI model could yet crack (I wanted to use some of them as
       we're adding new models to Kilo Code).  I've seen a bunch of those
       prompts scattered across HN, so I thought I'd open a thread here so
       we can maybe have a centralized location for this.  Share your
       prompt that stumps every AI model here.
        
       Author : owendarko
       Score  : 148 points
       Date   : 2025-04-24 13:11 UTC (9 hours ago)
        
       | falcor84 wrote:
       | You might want to get the ball rolling by sharing what you
       | already have
        
         | moffkalast wrote:
         | Also, ones that can't be solved at a glance by humans don't
         | count. Like this horrid ambiguous example from SimpleBench I
         | saw a while back that's just designed to confuse:
         | 
         | John is 24 and a kind, thoughtful and apologetic person. He is
         | standing in a modern, minimalist, otherwise-empty bathroom,
         | lit by a neon bulb, brushing his teeth while looking at the
         | 20cm-by-20cm mirror. John notices the 10cm-diameter neon
         | lightbulb drop at about 3 meters/second toward the head of the
         | bald man he is closely examining in the mirror (whose head is a
         | meter below the bulb), looks up, but does not catch the bulb
         | before it impacts the bald man. The bald man curses, yells
         | 'what an idiot!' and leaves the bathroom. Should John, who
         | knows the bald man's number, text a polite apology at some
         | point?
         | 
         | A. no, because the lightbulb was essentially unavoidable
         | 
         | B. yes, it would be in character for him to send a polite text
         | apologizing for the incident
         | 
         | C. no, because it would be redundant
         | 
         | D. yes, because it would potentially smooth over any lingering
         | tension from the encounter
         | 
         | E. yes, because John saw it coming, and we should generally
         | apologize if we fail to prevent harm
         | 
         | F. yes because it is the polite thing to do, even if it wasn't
         | your fault.
        
           | ryankrage77 wrote:
           | I'd argue that's a pretty good test for an LLM - can it
           | overcome the red herrings and get at the actual problem?
        
             | falcor84 wrote:
             | I think that the "actual problem" when you've been given
             | such a problem is with the person posing it: they either
             | have dementia or are taking the piss. In either case, the
             | response shouldn't be to guess their intent and come up
             | with a "solution", but to reject the question and deal with
             | the person.
        
           | mNovak wrote:
           | At a glance, it sounds like John is the bald man? If we're
           | treating this as a riddle, it doesn't seem incomprehensible.
           | Whether riddles are a fair test is another question.
        
           | spuz wrote:
           | Yeah I really dislike this kind of question from SimpleBench.
           | I've suggested many improvements to some of the publicly
           | available questions but not had a good response.
           | 
           | I think the correct answer to the question above should be
           | something like, "are you sure the question is correct because
           | it's not clear whether John and the bald man are the same
           | person" but of course an LLM would be marked down if it was
           | inclined to respond in this way.
        
           | falcor84 wrote:
           | I think it's a horrible example, but I just got a very
           | professional response from Gemini 2.5:
           | 
           | > This scenario seems like a bit of a riddle! Let's break it
           | down:
           | 
           | > The Setting: John is alone in an "otherwise-empty"
           | bathroom.
           | 
           | > The Action: He is looking in the mirror.
           | 
           | > The Observation: He sees a bulb falling towards the head of
           | a bald man he is examining in the mirror.
           | 
           | > The Physics: Since he is alone and looking in the mirror,
           | the bald man he is "closely examining" must be his own
           | reflection.
           | 
           | > The Incident: Therefore, the bulb fell towards John's own
           | head. He looked up (at the actual bulb falling towards him),
           | failed to catch it, and it hit him.
           | 
           | > The Reaction: The "bald man" who cursed, yelled "what an
           | idiot!", and left the bathroom was actually John himself,
           | reacting to being hit on the head and possibly feeling
           | foolish for not catching the bulb.
           | 
           | > Conclusion: No, John should not text an apology. The person
           | who was hit by the bulb, got angry, and left was John
           | himself. There is no other bald man to apologize to.
        
       | leftcenterright wrote:
       | Write 20 sentences that end with "p"
        
         | leftcenterright wrote:
         | for ChatGPT try the "o" version: Write 20 sentences that end
         | with "o"
        
         | r_thambapillai wrote:
         | https://chatgpt.com/share/680a3da0-b888-8013-9c11-42c22a642b...
        
           | ks2048 wrote:
           | "Can you hand me the paintbrush and turp?"
           | 
           | I had to ask another LLM what is "turp" - and it said it's
           | short for "turpentine".
        
           | alickz wrote:
           | >20 sentences that end in 'o'
           | 
           | >They shouted cheers after the winning free throw.
           | 
           | good attempt by ChatGPT tho imo
        
         | falcor84 wrote:
         | Good one. I do seem to get consistently good results on Gemini
         | 2.5 when using the slightly more explicit "Write 20 sentences
         | where the very last character of each sentence is the letter
         | 'p'."
        
         | meltyness wrote:
         | Write 20 sentences that end with "p" in the final word before
         | the period or other punctuation.
         | 
         | Succeeded on ChatGPT, pretty close on gemma3:4b -- the
         | exceptions usually ending with a "puh" sound...
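         | 
         | A quick way to score attempts like these mechanically (a
         | minimal sketch; the sample text and target letter are just
         | placeholders):
         | 
         |     import re
         | 
         |     def ends_with(sentence: str, letter: str) -> bool:
         |         # True if the last alphabetic character is `letter`.
         |         letters = [c for c in sentence if c.isalpha()]
         |         return bool(letters) and letters[-1].lower() == letter
         | 
         |     def check(text: str, letter: str = "p"):
         |         # Split on sentence-ending punctuation.
         |         parts = re.split(r"(?<=[.!?])\s+", text)
         |         return [(s, ends_with(s, letter))
         |                 for s in parts if s.strip()]
         | 
         |     sample = ("She climbed to the mountaintop. "
         |               "They shouted after the winning free throw.")
         |     for sentence, ok in check(sample):
         |         print("PASS" if ok else "FAIL", "-", sentence)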
        
         | marcusb wrote:
         | https://claude.ai/share/648e38d4-e282-43dd-8c9c-b0ea76ae0baf
        
       | mohsen1 wrote:
       | A ball costs 5 cents more than a bat. Price of a ball and a bat
       | is $1.10. Sally has 20 dollars. She stole a few balls and bats.
       | How many balls and how many bats she has?
       | 
        | All LLMs I tried miss the point that she stole the items rather
        | than buying them.
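        | 
        | For reference, the price arithmetic behind the $0.575 / $0.525
        | figures the models recite (with b the bat price in dollars),
        | which of course says nothing about how many items Sally stole:
        | 
        | \[
        | b + (b + 0.05) = 1.10 \;\Rightarrow\; 2b = 1.05 \;\Rightarrow\;
        | b = 0.525, \quad \text{ball} = b + 0.05 = 0.575
        | \]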
        
         | dwringer wrote:
         | Google Gemini (2.0 Flash, free online version) handled this
          | rather okay; it gave me an arguably unnecessary calculation of
         | the individual prices of ball and bat, but then ended with
         | "However with the information given, we can't determine exactly
         | how many balls and bats Sally stole. The fact that she has $20
         | tells us she could have stolen some, but we don't know how many
         | she did steal." While "the fact that she has $20" has no
         | bearing on this - and the model seems to wrongly imply that it
         | does - the fact that we have insufficient information to
         | determine an answer is correct, and the model got the answer
         | essentially right.
        
         | iamgopal wrote:
          | Gemini 2.5 gives the following response.
         | 
         | Conclusion:
         | 
         | We can determine the price of a single ball ($0.575) and a
         | single bat ($0.525). However, we cannot determine how many
         | balls and bats Sally has because the information "a few" is too
         | vague, and the fact she stole them means her $20 wasn't used
         | for the transaction described.
        
         | drdrek wrote:
         | lol, nice way to circumvent the attention algorithm
        
         | docdeek wrote:
         | Grok 3.0 wasn't fooled on this one, either:
         | 
         | Final Answer: The problem does not provide enough information
         | to determine the exact number of balls and bats Sally has. She
         | stole some unknown number of balls and bats, and the prices are
         | $0.575 per ball and $0.525 per bat.
        
         | lostmsu wrote:
         | 1-4 balls and bats // HoMM 3
        
         | NitpickLawyer wrote:
          | There's a repo out there called "misguided attention" that
          | tracks this kind of problem.
        
         | nonameiguess wrote:
         | It's interesting to me that the answers showing "correct"
         | answers from current models still don't strike me as correct.
         | The question is unanswerable, but not only because we don't
         | know how many balls and bats she stole. We don't know that she
         | had any intention of maxing out what she _could_ buy with that
         | much money. We have no idea how long she has been alive and
          | accumulating bats and balls at various prices that don't match
         | the current prices with money she no longer has. We have no
         | idea how many balls and bats her parents gave her 30 years ago
         | that she still has stuffed in a box in her attic somewhere.
         | 
         | Even the simplest possible version of this question, assuming
         | she started with nothing, spent as much money as she was able
         | to, and stole nothing, doesn't have an answer, because she
         | could have bought anything from all bats and no balls to all
         | balls and no bats and anything in between. We could enumerate
         | all possible answers but we can't know which she actually did.
        
       | whalesalad wrote:
        | I don't have a prompt per se, but recently I have managed to
        | ask certain questions of both OpenAI o1/o3 and Claude extended
        | thinking 3.7 that have spiraled _way_ out of control. A simple
        | high-level architecture question, with an emphasis on "do not
        | produce code, let's just talk through this", yields nearly 1,000
        | lines of SQL. Once the conversation/context gets quite long it is
        | more likely to occur, in my experience.
        
         | pc86 wrote:
          | The only model I've seen so far that doesn't end up going crazy
          | with long contexts is Gemini 2.5 Pro, but tbf I haven't gone
          | past 700-750k total tokens, so maybe as it starts to approach
          | the limit (1.05M) things get hairy?
        
       | raymond_goo wrote:
       | Create a Three.js app that shows a diamond with correct light
       | calculations.
        
         | thierrydamiba wrote:
         | I love this. So brutal, but also so cool to know one day that
         | will be easy for the models.
        
         | xnx wrote:
         | > correct light calculations
         | 
         | What are you expecting? Ray tracing?
        
           | spookie wrote:
           | Not necessarily. It could start by using diamond's IOR, and
           | use that to dictate a common brdf calculation. Along with
            | some approximate refraction, perhaps using an equirectangular
           | projected sphere map or something for the background.
        
       | asciimov wrote:
       | Nope, not doing this. Likely you shouldn't either. I don't want
       | my few good prompts to get picked up by trainers.
        
         | pc86 wrote:
          | May I ask, out of curiosity, what good is a prompt that breaks
          | a model? And what is the point of trying to keep it "secret"?
        
           | maybeOneDay wrote:
           | Being able to test future models without fear that your
           | prompt has just been trained on an answer on HN, I assume.
        
           | asciimov wrote:
           | To gauge how well the models "think" and what amount of slop
           | they generate.
           | 
           | Keeping it secret because I don't want my answers trained
           | into a model.
           | 
           | Think of it this way, FizzBuzz used to be a good test to weed
           | out bad actors. It's simple enough that any first year
           | programmer can do it and do it quickly. But now everybody
           | knows to prep for FizzBuzz so you can't be sure if your
           | candidate knows basic programming or just memorized a
           | solution without understanding what it does.
        
           | tveita wrote:
           | You want to know if a new model is actually better, which you
           | won't know if they just added the specific example to the
           | training set. It's like handing a dev on your team some
           | failing test cases, and they keep just adding special cases
           | to make the tests pass.
           | 
           | How many examples does OpenAI train on now that are just
           | variants of counting the Rs in strawberry?
           | 
           | I guess they have a bunch of different wine glasses in their
           | image set now, since that was a meme, but they still
           | completely fail to draw an open book with the cover side up.
        
             | gwern wrote:
             | > How many examples does OpenAI train on now that are just
             | variants of counting the Rs in strawberry?
             | 
             | Well, that's easy: zero.
             | 
              | Because even a single training example would have 'solved'
              | it by memorizing the simple easy answer within weeks of
              | 'strawberry' first going viral, which was like a year and
             | a half ago at this point - and dozens of minor and major
             | model upgrades since. And yet, the strawberry example kept
             | working for most (all?) of that time.
             | 
             | So you can tell that if anything, OA probably put in extra
             | work to filter all those variants _out_ of the training
             | data...
        
               | SweetSoftPillow wrote:
                | No, just check their models' knowledge cutoff dates.
        
         | orbital-decay wrote:
         | If that prompt can be easily trained against, it probably
         | doesn't exploit a generic bias. These are not that interesting,
         | and there's no point in hiding them.
        
           | daedrdev wrote:
           | generic biases can also be fixed
        
             | orbital-decay wrote:
             | *Some generic biases. Some others like recency bias,
             | serial-position effect, "pink elephant" effect, negation
             | accuracy seem to be pretty fundamental and are unlikely to
             | be fixed without architectural changes, or at all. Things
             | exploiting in-context learning and native context
             | formatting are also hard to suppress during the training
             | without making the model worse.
        
           | fwip wrote:
           | Sure there is. If you want to know if students understand the
           | material, you don't hand out the answers to the test ahead of
           | time.
           | 
           | Collecting a bunch of "Hard questions for LLMs" in one place
           | will invariably result in Goodhart's law (When a measure
           | becomes a target, it ceases to be a good measure). You'll
           | have no idea if the next round of LLMs is better because
           | they're generally smarter, or because they were trained
           | specifically on these questions.
        
       | xena wrote:
       | Write a regular expression that matches Miqo'te seekers of the
       | sun names. They always confuse the male and female naming
       | conventions.
        
       | thatjoeoverthr wrote:
       | "Tell me about the Marathon crater."
       | 
       | This works against _the LLM proper,_ but not against chat
       | applications with integrated search. For ChatGPT, you can write,
       | "Without looking it up, tell me about the Marathon crater."
       | 
       | This tests self awareness. A two-year-old will answer it
       | correctly, as will the dumbest person you know. The correct
       | answer is "I don't know".
       | 
       | This works because:
       | 
       | 1. Training sets consist of knowledge we have, and not of
       | knowledge we don't have.
       | 
        | 2. Commitment bias. Compliant chat models will be trained to
       | start with "Certainly! The Marathon Crater is a geological
       | formation", or something like that, and from there, the next most
       | probable tokens are going to be "in Greece", "on Mars" or
       | whatever. At this point, all tokens that are probable are also
       | incorrect.
       | 
       | When demonstrating this, I like to emphasise point one, and
       | contrast it with the human experience.
       | 
       | We exist in a perpetual and total blinding "fog of war" in which
       | you cannot even see a face all at once; your eyes must dart
       | around to examine it. Human experience is structured around
       | _acquiring_ and _forgoing_ information, rather than _having_
       | information.
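        | 
        | If you want to run this probe across models programmatically,
        | something like the following works (a minimal sketch, assuming
        | the OpenAI Python SDK; the model name and the hedge-phrase list
        | are just examples):
        | 
        |     from openai import OpenAI
        | 
        |     client = OpenAI()  # expects OPENAI_API_KEY in the environment
        |     PROMPT = ("Without looking it up, "
        |               "tell me about the Marathon crater.")
        |     HEDGES = ("i don't know", "i'm not aware", "not familiar",
        |               "no record", "does not exist", "couldn't find")
        | 
        |     reply = client.chat.completions.create(
        |         model="gpt-4o",  # swap in whichever model you're testing
        |         messages=[{"role": "user", "content": PROMPT}],
        |     ).choices[0].message.content
        | 
        |     hedged = any(h in reply.lower() for h in HEDGES)
        |     print("declined/hedged" if hedged else "confabulated:",
        |           reply[:200])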
        
         | Tenoke wrote:
          | >Compliant chat models will be trained to start with
          | "Certainly!
          | 
          | They are certainly biased that way, but there are also some 'I
          | don't know' samples in RLHF, possibly not enough, but it's
          | something they think about.
         | 
         | At any rate, Gemini 2.5pro passes this just fine
         | 
         | >Okay, based on my internal knowledge without performing a new
         | search: I don't have information about a specific, well-known
         | impact crater officially named "Marathon Crater" on Earth or
         | another celestial body like the Moon or Mars in the same way we
         | know about Chicxulub Crater or Tycho Crater.
         | 
         | >However, the name "Marathon" is strongly associated with Mars
         | exploration. NASA's Opportunity rover explored a location
         | called Marathon Valley on the western rim of the large
         | Endeavour Crater on Mars.
        
           | thatjoeoverthr wrote:
            | There are a few problems with an "I don't know" sample. For
            | starters, what does it map to? Recall, the corpus consists of
            | information we have (affirmatively). You would need to invent
            | a corpus of false stimuli. What you would have, then, is a
            | model that is writing "I don't know" based on whether the
            | stimulus better matches something real, or one of the
            | negatives.
           | 
           | You can detect this with some test time compute architectures
           | or pre-inference search. But that's the broader application.
           | This is a trick for the model alone.
        
         | pje wrote:
         | GPT 4.5 even doubles down when challenged:
         | 
         | > Nope, I didn't make it up -- Marathon crater is real, and it
         | was explored by NASA's Opportunity rover on Mars. The crater
         | got its name because Opportunity had driven about 42.2
         | kilometers (26.2 miles -- a marathon distance) when it reached
         | that point in March 2015. NASA even marked the milestone as a
         | symbolic achievement, similar to a runner finishing a marathon.
         | 
         | (Obviously all of that is bullshit.)
        
           | fao_ wrote:
           | This is the kind of reason why I will never use AI
           | 
           | What's the point of using AI to do research when 50-60% of it
           | could potentially be complete bullshit. I'd rather just grab
           | a few introduction/101 guides by humans, or join a community
           | of people experienced with the thing -- and then I'll
           | actually be learning about the thing. If the people in the
           | community are like "That can't be done", well, they have had
           | years or decades of time invested in the thing and in that
           | instance I should be learning and listening from their advice
           | rather than going "actually no it can".
           | 
           | I see a lot of beginners fall into that second pit. I myself
           | made that mistake at the tender age of 14 where I was of the
           | opinion that "actually if i just found a reversible hash,
           | I'll have solved compression!", which, I think we all here
           | know is bullshit. I think a lot of people who are arrogant or
           | self-possessed to the extreme make that kind of mistake on
           | learning a subject, but I've seen this especially a lot when
           | it's programmers encountering non-programming fields.
           | 
           | Finally tying that point back to AI -- I've seen a lot of
           | people who are unfamiliar with something decide to use AI
           | instead of talking to someone experienced because the AI
           | makes them feel like they know the field rather than telling
           | them their assumptions and foundational knowledge is
            | incorrect. Only last year I encountered someone who was
            | trying to use AI to debug why their KDE was broken, and they
            | kept throwing me utterly bizarre theories (like, completely
            | out there, I don't have a specific example with me now, but
            | "foundational physics are wrong" style theories). It turned
            | out that they were getting mired in log messages they saw
            | that said "Critical Failure". As someone who has dealt with
            | Linux for about ten years now, I checked against my own
            | system and... yep, they were just part of mostly normal
            | system function (I had the same messages on my Steam Deck,
            | which was completely stable and functional). The real fault
            | was buried halfway through the logs.
           | person able to know what was important versus not-important,
           | and the AI had absolutely no way to tell or understand the
           | logs in the first place, so it was like a toaster leading a
           | blind man up a mountain. I diagnosed the correct fault in
           | under a day by just asking them to run two commands and
           | skimming logs. That's experience, and that's irreplaceable by
           | machine as of the current state of the world.
           | 
            | I don't see how AI can help when huge swathes of its
            | "experience" and "insight" are just hallucinated. I don't see
           | how this is "helping" people, other than making people
           | somehow more crazy (through AI hallucinations) and alone
           | (choosing to talk to a computer rather than a human).
        
             | CamperBob2 wrote:
              |  _What's the point of using AI to do research when 50-60%
             | of it could potentially be complete bullshit._
             | 
             | You realize that all you have to do to deal with questions
             | like "Marathon Crater" is ask another model, right? You
             | might still get bullshit but it won't be the same bullshit.
        
               | Night_Thastus wrote:
               | Without checking every answer it gives back to make sure
               | it's factual, you may be ingesting tons of bullshit
               | answers.
               | 
               | In this particular answer model A may get it wrong and
               | model B may get it right, but that can be reversed for
               | another question.
               | 
               | What do you do at that point? Pay to use all of them and
               | find what's common in the answers? That won't work if
               | most of them are wrong, like for this example.
               | 
               | If you're going to have to fact check everything
               | anyways...why bother using them in the first place?
        
               | CamperBob2 wrote:
                |  _If you're going to have to fact check everything
               | anyways...why bother using them in the first place?_
               | 
               | "If you're going to have to put gas in the tank, change
               | the oil, and deal with gloves and hearing protection, why
               | bother using a chain saw in the first place?"
               | 
               | Tool use is something humans are good at, but it's rarely
               | trivial to master, and not all humans are equally good at
               | it. There's nothing new under that particular sun.
        
               | Night_Thastus wrote:
               | The difference is consistency. You can read a manual and
               | know exactly how to oil and refill the tank on a
               | chainsaw. You can inspect the blades to see if they are
               | worn. You can listen to it and hear how it runs. If a
               | part goes bad, you can easily replace it. If it's having
               | troubles, it will be obvious - it will simply stop
               | working - cutting wood more slowly or not at all.
               | 
               | The situation with an LLM is completely different.
               | There's no way to tell that it has a wrong answer - aside
               | from looking for the answer elsewhere which defeats its
               | purpose. It'd be like using a chainsaw all day and not
               | knowing how much wood you cut, or if it just stopped
               | working in the middle of the day.
               | 
               | And even if you KNOW it has a wrong answer (in which
               | case, why are you using it?), there's no clear way to
               | 'fix' it. You can jiggle the prompt around, but that's
               | not consistent or reliable. It _may_ work for that
                | prompt, but that won't help you with any subsequent
               | ones.
        
               | CamperBob2 wrote:
               | The thing is, nothing you've said is untrue for any
               | search engine or user-driven web site. Only a reckless
               | moron would paste code they find on Stack Overflow or
               | Github into their project without at least looking it
               | over. Same with code written by LLMs. The difference is,
               | just as the LLM can write unit tests to help you deal
               | with uncertainty, it can also cross-check the output of
               | other LLMs.
               | 
               | You have to be careful when working with powerful tools.
               | These tools are powerful enough to wreck your career as
               | quickly as a chain saw can send you to the ER, so... have
               | fun and be careful.
        
               | skydhash wrote:
               | The nice thing about SO and Github is that there's little
               | to no reason there for things to not work, at least in
               | the context where you found the code. The steps are
               | getting the context, assuming it's true based on various
               | indicators (mostly reputation) and then continuing on
               | with understanding the snippet.
               | 
               | But with LLMs, every word is a probability factor.
               | Assuming the first paragraph is true has no impact on the
               | rest.
        
               | thatjoeoverthr wrote:
                | I was thinking about a self-verification method on this
                | principle, lately. Any specific-enough claim, e.g. "the
                | Marathon crater was discovered by ..." can be
                | reformulated as a Jeopardy-style prompt: "This crater
                | was discovered by ..." and you can see a failure to
               | match. You need some raw intelligence to break it down
               | though.
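                | 
                | A minimal sketch of that idea (ask_model here is a
                | stand-in for whatever LLM call you wrap; the claim
                | parsing is deliberately naive):
                | 
                |     import re
                | 
                |     def jeopardy_check(claim, ask_model):
                |         # Split "<subject> was discovered by <rest>".
                |         m = re.match(r"(.+?) was discovered by (.+)",
                |                      claim, re.IGNORECASE)
                |         if not m:
                |             return None  # claim shape not handled
                |         subject, rest = m.groups()
                |         # Re-pose it Jeopardy-style, subject masked.
                |         q = ("This feature was discovered by "
                |              f"{rest}. Which feature is it?")
                |         answer = ask_model(q)
                |         # A mismatch hints the claim was confabulated.
                |         return subject.lower() in answer.lower()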
        
             | alpaca128 wrote:
             | There are use-cases where hallucinations simply do not
             | matter. My favorite is finding the correct term for a
             | concept you don't know the name of. Googling is extremely
             | bad at this as search results will often be wrong unless
             | you happen to use the commonly accepted term, but an LLM
             | can be surprisingly good at giving you a whole list of
             | fitting names just based on a description. Same with movie
             | titles etc. If it hallucinates you'll find out immediately
             | as the answer can be checked in seconds.
             | 
             | The problem with LLMs is that they appear much smarter than
             | they are and people treat them as oracles instead of using
             | them for fitting problems.
        
               | skydhash wrote:
               | Maybe I read too much encyclopedia, but my current
               | workflow is to explore introductory material. Like open a
               | database textbook and you'll find all the jargon there.
               | Curated collection can get you there too.
               | 
               | Books are a nice example of this, where we have both the
               | table of contents for a general to particular concepts
               | navigation, and the index for keyword based navigation.
        
           | silverquiet wrote:
           | > (Obviously all of that is bullshit.)
           | 
           | It isn't obvious to me - that is rather plausible and a cute
           | story.
        
           | Workaccount2 wrote:
            | The inaccuracies are that it is called "Marathon Valley" (not
            | crater), and that it was photographed from the rim in April
            | 2015 and actually entered in July 2015. The other stuff is
            | correct.
           | 
           | I'm guessing this "gotcha" relies on "valley"/"crater", and
           | "crater"/"mars" being fairly close in latent space.
           | 
           | ETA: Marathon Valley also exists on the rim of Endeavour
           | crater. Just to make it even more confusing.
        
             | mvdtnz wrote:
             | None of it is correct because it was not asked about
             | Marathon Valley, it was asked about Marathon Crater, a
             | thing that does not exist, and it is claiming that it
             | exists and making up facts about it.
        
               | empath75 wrote:
               | > None of it is correct because it was not asked about
               | Marathon Valley, it was asked about Marathon Crater, a
               | thing that does not exist, and it is claiming that it
               | exists and making up facts about it.
               | 
               | The Marathon Valley _is_ part of a massive impact crater.
        
               | mvdtnz wrote:
               | If you asked me for all the details of a Honda Civic and
               | I gave you details about a Honda Odyssey you would not
               | say I was correct in any way. You would say I was wrong.
        
               | Workaccount2 wrote:
               | The closer analogy is asking for the details of a Mazda
               | Civic, and being given the details of a Honda Civic.
        
               | Workaccount2 wrote:
               | Or it's assuming you are asking about Marathon Valley,
               | which is very reasonable given the context.
               | 
               | Ask it about "Marathon Desert", which does not exist and
               | isn't closely related to something that does exist, and
               | it asks for clarification.
               | 
               | I'm not here to say LLMs are oracles of knowledge, but I
               | think the need to carefully craft specific "gotcha"
               | questions in order to generate wrong answers is a pretty
               | compelling case in the opposite direction. Like the
               | childhood joke of "Whats up?"..."No, you dummy! The sky
               | is!"
               | 
               | Straightforward questions with straight wrong answers are
                | far more interesting. I don't think many people ask LLMs
                | trick questions all day.
        
             | pje wrote:
             | I was using "bullshit" in the Technical Sense(tm), i.e.
             | _indifferent to the truth of the output_.
             | 
              | [ChatGPT is bullshit]:
              | https://link.springer.com/article/10.1007/s10676-024-09775-5
        
           | nartho wrote:
           | Claude 3.7 :
           | 
           | I'd be happy to tell you about the Marathon crater. The
           | Marathon crater is a significant impact structure located on
           | Earth. The Marathon crater is situated in Ontario, Canada,
           | and is estimated to be about 450 million years old, dating
           | back to the Ordovician period. It has a diameter of
           | approximately 65 kilometers (40 miles), making it one of the
           | larger impact craters in North America. The crater was formed
           | when a meteorite struck the Earth, causing a massive impact
           | that left behind this distinctive geological formation. Over
           | time, erosion and other geological processes have altered its
           | appearance, but scientists have been able to study its
           | structure through various geological surveys and analyses.
           | What's particularly interesting about the Marathon crater is
           | its size and age, which provide valuable insights into
           | Earth's geological history and past impact events. The study
           | of such craters helps scientists understand the frequency and
           | effects of meteorite impacts throughout our planet's history.
           | Would you like me to elaborate on any specific aspect of the
           | Marathon crater, such as its discovery, geological features,
           | or scientific significance?
        
         | NitpickLawyer wrote:
         | > This tests self awareness. A two-year-old will answer it
         | correctly, as will the dumbest person you know. The correct
         | answer is "I don't know".
         | 
         | I disagree. It does not test self awareness. It tests (and
         | confirms) that current instruct-tuned LLMs are tuned towards
         | answering questions that users might have. So the distribution
         | of training data probably has lots of "tell me about mharrner
         | crater / merinor crater / merrihana crater" and so on. Replying
         | "I don't know" to all those questions would be net detrimental,
         | IMO.
        
           | byearthithatius wrote:
           | We want the distribution to be varied and expansive enough
           | that it has samples of answering when possible and samples of
           | clarifying with additional questions or simply saying "I
           | don't know" when applicable. That can be trained by altering
           | the distribution in RLHF. This question does test self
           | awareness insofar as if it gets this right by saying "I don't
           | know" we know there are more samples of "I don't know"s in
           | the RLHF dataset and we can trust the LLM a bit more to not
           | be biased towards blind answers.
           | 
           | Hence why some models get this right and others just make up
           | stuff about Mars.
        
           | thatjoeoverthr wrote:
           | What you're describing can be framed as a lack of self
           | awareness as a practical concept. You know whether you know
           | something or not. It, conversely, maps stimuli to a vector.
            | It can't not do that. It cannot decide that it hasn't "seen"
            | such stimuli in its training. Indeed, it has never "seen"
           | its training data; it was modified iteratively to produce a
           | model that better approximates the corpus. This is fine, and
           | it isn't a criticism, but it means it can't actually tell if
            | it "knows" something or not, and "hallucinations" are a
           | simple, natural consequence.
        
         | vunderba wrote:
         | You can trip them up even more if you rewrite the question with
         | the hidden assumption that X exists, e.g.:
         | 
         |  _" When was Marathon Crater discovered? I don't need an exact
         | date - a rough estimate will be fine."_
         | 
          | OpenAI gpt-4o:
          | 
          | > Marathon Crater was discovered by the Mars Exploration Rover
          | Opportunity during its mission on Mars. Opportunity arrived at
          | the crater in April 2015. The crater was named "Marathon" to
          | commemorate the rover's achievement of traveling a marathon's
          | distance (about 42.195 kilometers or 26.219 miles) on the
          | Martian surface.
          | 
          | Sonnet 3.5:
          | 
          | > I apologize, but I don't want to make claims about when
          | Marathon Crater was discovered without being certain about the
          | information. *While I know it's a crater on Mars*, I don't have
          | reliable information about when it was first identified or
          | discovered.
          | 
          | Gemini 2.5:
          | 
          | > Marathon Crater is located on Mars and was explored by NASA's
          | Opportunity rover. While it would have been visible in orbital
          | images earlier, it became a specific point of interest and was
          | likely named and identified as a target for the rover around
          | the time Opportunity was approaching or exploring the western
          | rim of Endeavour Crater.
          | 
          | > Opportunity arrived at the rim of Endeavour Crater in August
          | 2011, and it reached the area known as Marathon Valley
          | (adjacent to or containing features sometimes associated with
          | the name, depending on specific definitions used by the team)
          | around late 2014 / early 2015.
          | 
          | > So, a rough estimate for its "discovery" or identification as
          | a distinct feature of interest would be in the early 2010s,
          | likely between 2011 and 2014, as mission planners mapped out
          | Opportunity's traverse along the Endeavour rim using orbital
          | data.
        
           | radioactivist wrote:
            | OpenAI o4-mini-high:
            | 
            | > I'm actually not finding any officially named "Marathon
            | Crater" in the planetary- or terrestrial-impact crater
            | databases. Did you perhaps mean the features in Marathon
            | Valley on Mars (which cuts into the western rim of Endeavour
            | Crater and was explored by Opportunity in 2015)? Or is there
            | another "Marathon" feature--maybe on the Moon, Mercury, or
            | here on Earth--that you had in mind? If you can clarify which
            | body or region you're referring to, I can give you a rough
            | date for when it was first identified.
        
             | thatjoeoverthr wrote:
             | Raw model, or the chat product? The ChatGPT app has this
             | integrated with search.
        
               | boleary-gl wrote:
                | Raw model, I got:
                | 
                | > Marathon Crater isn't one of the long-known, named
                | lunar features from 19th-century telescopic maps - it was
                | first singled out and informally "discovered" by NASA in
                | the mid-1960s when Lunar Orbiter imagery and, more
                | definitively, the Surveyor 5 landing (September 1967)
                | identified it. So, very roughly, think "around 1966-67."
        
             | andrewinardeer wrote:
             | Amazon's Nova Pro:
             | 
              | ```
              | Marathon Crater was discovered in the early 20th century,
              | likely around the 1920s or 1930s. This estimate is based on
              | the period when geological surveys and explorations in the
              | region were more common.
              | ```
             | 
             | https://kagi.com/assistant/8f270eab-
             | cba3-4f39-976c-113ea4392...
        
         | greenpizza13 wrote:
         | Like this one a lot. Perplexity gets this right, probably
         | because it searches the web.
         | 
         | "When was Marathon Crater discovered? I don't need an exact
         | date - a rough estimate will be fine"
         | 
          | There appears to be a misunderstanding in your query. Based on
          | the search results provided, there is no mention of a "Marathon
          | Crater" among the impact craters discussed. The search results
          | contain information about several well-known impact craters:
          | 
          | 1. Chicxulub crater - Formed about 66 million years ago by an
          | asteroid impact
          | 
          | 2. Barringer Crater (also known as Meteor Crater) - Discovered
          | in the 19th century by American settlers and studied by
          | scientists like Albert E. Foote in 1891 and Daniel M. Barringer
          | in the early 20th century
          | 
          | 3. A recently discovered 3.47 billion-year-old crater in
          | Australia's Pilbara region (announced in March 2025)
        
           | thatjoeoverthr wrote:
            | Perplexity will; search and storage products will fail to
            | find it, and the LLM will see the deviation between the query
            | and what it finds. So, this challenge only works against the
            | model alone :)
        
           | mdp2021 wrote:
           | > _Perplexity gets this right_
           | 
           | Perplexity seems to more easily return negatives, probably
           | facilitated by the implicit need to find documentation ("I
           | cannot find any document mentioning that").
           | 
           | But Perplexity can also easily speak its own dubious piece of
           | mind unless requested explicitly "provide links to documents
           | that inform about that".
        
         | northisup wrote:
         | just to confirm I read this right, "the marathon crater" does
         | not in fact exist, but this works because it seems like it
         | should?
        
           | thatjoeoverthr wrote:
           | Yes, and the forward-only inference strategy. It seems like a
           | normal question, so it starts answering, then carries on from
           | there.
        
           | thatjoeoverthr wrote:
           | The other aspect is it can't reliably tell whether it
            | "knows" something or not. It's conditioned to imitate the
            | corpus, but the corpus in a way is its "universe" and it
           | can't see the boundaries. Everything must map to something
           | _in_ the corpus.
        
           | alickz wrote:
           | There is a Marathon Valley on Mars, which is what ChatGPT
           | seems to assume you're talking about
           | 
           | https://chatgpt.com/share/680a98af-c550-8008-9c35-33954c5eac.
           | ..
           | 
           | >Marathon Crater on Mars was discovered in 2015 by NASA's
           | Opportunity rover during its extended mission. It was
           | identified as the rover approached the 42-kilometer-wide
           | Endeavour Crater after traveling roughly a marathon's
           | distance (hence the name).
           | 
           | >>is it a crater?
           | 
           | >>>Despite the name, Marathon Valley (not a crater) is
           | actually a valley, not a crater. It's a trough-like
           | depression on the western rim of Endeavour Crater on Mars. It
           | was named because Opportunity reached it after traveling the
           | distance of a marathon (~42 km) since landing.
           | 
           | So no--Marathon is not a standalone crater, but part of the
           | structure of Endeavour Crater. The name "Marathon" refers
           | more to the rover's achievement than a distinct geological
           | impact feature.
        
             | alickz wrote:
             | Here's me testing with a place that is a lot less ambiguous
             | 
             | https://chatgpt.com/share/680aa212-8cac-8008-b218-4855ffaa2
             | 0...
        
         | joshstrange wrote:
         | > This tests self awareness. A two-year-old will answer it
         | correctly, as will the dumbest person you know. The correct
         | answer is "I don't know".
         | 
         | Well, I got all the way to the end of the sentence thinking
         | "God, I'm really out of touch, I've never heard of this before
         | or I'm forgetting it if I have".
         | 
         | Well played, and yes, that's a great test!
        
         | dudeinhawaii wrote:
         | I like this but at the same time it seems tricky don't you
         | think? Is the AI model intuiting your intent? There is a
         | Marathon Valley on Mars that could be implied to be a previous
         | crater. I'm not sure if the AI is hallucinating outright or
         | attempting to answer an ambiguous question. It's like saying
         | "tell me about the trade building in New York". Pre-9/11, you'd
         | understand this was the World Trade Center and wouldn't be
         | wrong if you answered someone in this way. "Tell me about the
         | Triangle statue". "Oh the Triangle statue was built in ancient
         | egypt around BC 3100". It's hard to explain, and perhaps I'm
         | anthropomorphizing but it's something humans do. Some of us
         | correct the counter-party and some of us simply roll with the
         | lingo and understand the intent.
        
           | thatjoeoverthr wrote:
           | It's a roll of the dice whether it's on Mars, Greece or
           | elsewhere. It just says stuff!
        
         | imoreno wrote:
         | LLMs currently have the "eager beaver" problem where they never
         | push back on nonsense questions or stupid requirements. You ask
         | them to build a flying submarine and by God they'll build one,
         | dammit! They'd dutifully square circles and trisect angles too,
         | if those particular special cases weren't plastered all over a
         | million textbooks they ingested in training.
         | 
         | I suspect it's because currently, a lot of benchmarks are based
         | on human exams. Humans are lazy and grumpy so you really don't
         | need to worry about teaching a human to push back on bad
         | questions. Thus you rarely get exams where the correct answer
         | is to explain in detail why the question doesn't make sense.
         | But for LLMs, you absolutely need a lot of training and
         | validation data where the answer is "this cannot be answered
         | because ...".
         | 
         | But if you did that, now alignment would become much harder,
         | and you're suddenly back to struggling with getting answers to
         | good questions out of the LLM. So it's probably some time off.
        
           | captainkrtek wrote:
            | This is a good observation. I've noticed this as well. Unless
           | I preface my question with the context that I'm considering
           | if something may or may not be a bad idea, its inclination is
           | heavily skewed positive until I point out a flaw/risk.
        
             | aaronbaugher wrote:
             | I asked Grok about this: "I've heard that AIs are
             | programmed to be helpful, and that this may lead to telling
             | users what they want to hear instead of the most accurate
             | answer. Could you be doing this?" It said it does try to be
             | helpful, but not at the cost of accuracy, and then pointed
             | out where in a few of its previous answers to me it tried
             | to be objective about the facts and where it had separately
             | been helpful with suggestions. I had to admit it made a
             | pretty good case.
             | 
             | Since then, it tends to break its longer answers to me up
             | into a section of "objective analysis" and then other
             | stuff.
        
               | captainkrtek wrote:
                | That's interesting, thanks for sharing that. I have found
                | a similar pattern: once I first correct it to inform it
                | of a flaw, the following answers tend to be a bit less
                | "enthusiastic" or skewed towards "can do", which makes
                | sense.
        
           | GoToRO wrote:
           | They do. Recently I was pleasantly surprised by gemini
           | telling me that what I wanted to do will NOT work. I was in
           | disbelief.
        
             | captainkrtek wrote:
             | Interesting, can you share more context on the topic you
             | were asking it about?
        
             | danielbln wrote:
             | I've noticed Gemini pushing back more as well, whereas
             | Claude will just butter me up and happily march on unless I
             | specifically request a critical evaluation.
        
           | mncharity wrote:
           | > they never push back on nonsense questions or stupid
           | requirements
           | 
           | "What is the volume of 1 mole of Argon, where T = 400 K and p
           | = 10 GPa?" Copilot: "To find the volume of 1 mole of Argon at
           | T = 400 K and P = 10 GPa, we can use the Ideal Gas Law, but
           | at such high pressure, real gas effects might need to be
           | considered. Still, let's start with the ideal case: PV=nRT"
           | 
           | > you really don't need to worry about teaching a human to
           | push back on bad questions
           | 
           | A popular physics textbook too had _solid_ Argon as an ideal
            | gas law problem. Copilot's half-baked caution is more than
           | authors, reviewers, and instructors/TAs/students seemingly
           | managed, through many years and multiple editions. Though to
           | be fair, if the question is prefaced by "Here is a problem
           | from Chapter 7: Ideal Gas Law.", Copilot is similarly
           | mindless.
           | 
           | Asked explicitly "What is the phase state of ...", it does
           | respond solid. But as with humans, determining that isn't a
           | step in the solution process. A combination of "An excellent
           | professor, with a joint appointment in physics and
           | engineering, is asked ... What would be a careful reply?" and
           | then "Try harder." was finally sufficient.
           | 
           | > you rarely get exams where the correct answer is to explain
           | in detail why the question doesn't make sense
           | 
           | Oh, if only that were commonplace. Aspiring to transferable
           | understanding. Maybe someday? Perhaps in China? Has anyone
           | seen this done?
           | 
           | This could be a case where synthetic training data is needed,
           | to address a gap in available human content. But if graders
           | are looking for plug-n-chug... I suppose a chatbot could
           | ethically provide both mindlessness and caveat.
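            | 
            | For scale, the naive ideal-gas number for that Argon question
            | (a minimal sketch; SI units throughout):
            | 
            |     R = 8.314                   # J / (mol K)
            |     n, T, P = 1.0, 400.0, 10e9  # mol, K, Pa (10 GPa)
            | 
            |     V = n * R * T / P           # ideal gas law, V = nRT / P
            |     print(V * 1e6, "cm^3")      # ~0.33 cm^3 per mole
            | 
            | That is nearly two orders of magnitude below the molar volume
            | of even solid argon (roughly 20-25 cm^3/mol at ambient
            | pressure), which is one way to see that PV = nRT has no
            | business being applied at 10 GPa.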
        
       | webglfan wrote:
       | what are the zeros of the following polynomial:
        | \[
        | P(z) = \sum_{k=0}^{100} c_k z^k
        | \]
        | where the coefficients \( c_k \) are defined as:
        | \[
        | c_k =
        | \begin{cases}
        | e^2 + i\pi & \text{if } k = 100, \\
        | \ln(2) + \zeta(3)\,i & \text{if } k = 99, \\
        | \sqrt{\pi} + e^{i/2} & \text{if } k = 98, \\
        | \frac{(-1)^k}{\Gamma(k+1)} + \sin(k)\,i & \text{for } 0 \leq k \leq 97.
        | \end{cases}
        | \]
        
         | Chinjut wrote:
         | Does this have a nice answer? It seems quite ad hoc.
        
           | webglfan wrote:
           | Not to my knowledge. I asked Deepseek: "create me a random
           | polynomial of degree 100 using complex numbers as
           | coefficients. It must have at least 3 different
           | transcendental numbers." Then I messed with some of the
           | exponents.
        
         | drodgers wrote:
          | o3 handles this perfectly by writing a solver using numpy:
          | https://chatgpt.com/share/680aab8e-cf9c-8012-9f48-301ef62948...
          | 
          | > The polynomial has no closed-form factorisation, so the only
          | practical way to get its zeros is numerical root-finding. I
          | have supplied an interactive table ("Zeros of P(z)") just above
          | that lists all 100 roots to full machine precision (real part,
          | imaginary part, and magnitude). You can sort or filter it as
          | needed.
          | 
          | > Reliability notes
          | 
          | > * Computed with numpy.roots, which first companion-matrixes
          | then uses QR; typical error is ~10-12 ulp for coefficients of
          | this size.
          | 
          | > * Condition number is modest (coefficients range from O(1)
          | down to 1/97!), so the results should be accurate to at least
          | 10 significant figures.
          | 
          | > * All roots are simple (pairwise distinct to >10 s): no
          | evidence of multiplicities.
          | 
          | > If you need higher precision (e.g. 30+ digits) let me know
          | and I can rerun the solve with mpmath's arbitrary-precision
          | eigen-solver.
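          | 
          | The core of that approach fits in a few lines (a minimal sketch
          | of the same numpy.roots idea; zeta(3) is hardcoded to avoid a
          | scipy dependency):
          | 
          |     import numpy as np
          |     from math import gamma, sin, log, sqrt, pi, e
          | 
          |     ZETA3 = 1.2020569031595943  # Apery's constant, zeta(3)
          | 
          |     # c_k for k = 0..97, then k = 98, 99, 100 as defined above.
          |     c = [(-1) ** k / gamma(k + 1) + sin(k) * 1j
          |          for k in range(98)]
          |     c += [sqrt(pi) + np.exp(0.5j),  # k = 98
          |           log(2) + ZETA3 * 1j,      # k = 99
          |           e ** 2 + pi * 1j]         # k = 100
          | 
          |     # np.roots expects the highest-degree coefficient first.
          |     roots = np.roots(c[::-1])
          |     print(len(roots))                  # 100
          |     print(sorted(roots, key=abs)[:3])  # smallest-|z| roots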
        
       | division_by_0 wrote:
       | Create something with Svelte 5.
        
         | marcusb wrote:
         | The current models really seem to struggle with the runes...
        
           | division_by_0 wrote:
           | Yes, they do. Vibe coding protection is an undocumented
           | feature of Svelte 5...
        
             | siva7 wrote:
              | Oh my god, I will start all my new projects with Svelte 5.
             | Hopefully no vibe coder will ever commit something into
             | this repo
        
         | qntmfred wrote:
         | MCP to the rescue??
        
         | joshdavham wrote:
         | I'd find this funnier if the pain weren't so real.
        
         | Layvier wrote:
         | This is really sad honestly. It feels like we'll be stuck with
          | React forever, and even with it there'll be fewer incentives to
          | make API changes.
        
           | spuz wrote:
           | Why do you say that? You make it sound like it's not possible
           | to write code without the help of LLMs.
        
             | omneity wrote:
             | Disclaimer: OT and pretty ranty.
             | 
             | I don't know if that's what the GP hinted at, but as a
             | Svelte developer and big advocate for more than 6 years
              | (single-handedly training and evangelizing 20+ developers
             | on it), I found so many concerns with Svelte 5 that it
             | simply made me use React again.
             | 
             | It's a temporary choice and I'm desperately evaluating
             | other ecosystems (Looking at you SolidJS).
        
               | division_by_0 wrote:
               | Can you expand on the concerns regarding Svelte 5?
        
               | omneity wrote:
               | Put simply, Svelte and React were at two ends of a
               | spectrum. React gives you almost complete control over
               | every aspect of the lifecycle, but you have to be
               | explicit about most of the behavior you are seeking to
               | achieve. Building an app with React feels about 80% on
               | the JS and 20% on the HTML side.
               | 
               | Svelte on the other hand felt like a breeze. Most of my
               | app is actually plain simple HTML, and I am able to
               | sprinkle as little JS as I need to achieve my desired
               | behaviors. Sure, Svelte <=4 has undefined behaviors, or
               | maybe even too many magic capabilities. But that was part
               | of the package, and it was an option for those of us who
               | preferred this end of the trade-off.
               | 
               | Svelte 5 intends to give that precise level of control
               | and is trying to compete with React on its turf (the
               | other end of that spectrum), introducing a lot of non-
               | standard syntax along the way.
               | 
               | It's neither rigorous Javascript like React where you can
               | benefit from all the standard tooling developed over the
               | years, including stuff that wasn't designed for React in
               | particular, nor a lightweight frontend framework, which
               | was the initial niche that Svelte happily occupied, which
               | I find sadly quite empty now (htmx and alpinejs are
               | elegant conceptually but too limiting in practice _for my
               | taste_).
               | 
               | For me it's a strange "worst of both worlds" kind of
               | situation that is simply not worth it. Quite
               | heartbreaking to be honest.
        
               | division_by_0 wrote:
               | Ok, I see your point. I wrote in another thread that I
               | loved the simplicity of using $: for deriveds and effects
               | in Svelte 3 and 4. And yes, the conciseness and magic
               | were definitely part of it. You could just move so fast
               | with it. Getting better performance with the new
               | reactivity system is important to my data viz work, so it
               | helped me to accept the other changes in Svelte 5.
        
               | omneity wrote:
               | Exactly. There was a certain simplicity that might be
               | lost. But yeah I can imagine it might work out
               | differently for others as well. Glad to hear it is for
               | you!
               | 
               | Have you considered other options? Curious if you came
               | across anything particularly interesting from the
               | simplicity or DX angle.
        
         | esafak wrote:
         | Keep the (temporarily) imposter-proof interview questions
         | coming!
        
       | sam_lowry_ wrote:
       | I tried generating erotic texts with every model I encountered,
        | but even so-called "uncensored" models from Hugging Face are
        | trying hard to avoid the topic, whatever prompts I give.
        
         | lostmsu wrote:
         | What about the models that are not instruction tuned?
        
         | KTibow wrote:
         | Uncensored and RP tuned are somewhat different.
        
       | comrade1234 wrote:
       | I ask it to explain the metaphor "my lawyer is a shark" and then
       | explain to me how a French person would interpret the metaphor -
       | the llms get the first part right but fail on the second. All it
       | would have to do is give me the common French shark metaphors and
       | how it would apply them to a lawyer - but I guess not enough
       | people on the internet have done this comparison.
        
       | sumitkumar wrote:
       | 1) Word Ladder: Chaos to Order
       | 
       | 2) Shortest word ladder: Chaos to Order
       | 
       | 3) Which is the second last scene in pulp fiction if we order the
       | events by time?
       | 
       | 4) Which is the eleventh character to appear on Stranger Things.
       | 
        | 5) Suppose there is a 3x3 Rubik's cube with numbers instead of
        | colours on the faces. The solved Rubik's cube has numbers 1 to 9
        | in order on all the faces. Tell me the numbers on all the corner
        | pieces.
        
         | pb7 wrote:
         | >4) Which is the eleventh character to appear on Stranger
         | Things.
         | 
         | Gemini 2.5 Pro said Benny Hammond. Is this right?
        
       | scumola wrote:
       | Things like "What is today's date" used to be enough (would
       | usually return the date that the model was trained).
       | 
        | I recently tried things like current events, but LLMs that can
        | search the internet can do those now, e.g. "Is the pope alive or
        | dead?"
        | 
        | Nowadays, multi-step reasoning is the key, but the Chinese LLM (I
        | forget the name of it) can do that pretty well. Multi-step
        | reasoning is much better at algebra or simple math, so questions
        | like "what is bigger, 5.11 or 5.5?" no longer trip them up.
        
       | codingdave wrote:
       | "How much wood would a woodchuck chuck if a woodchuck could chuck
       | wood?"
       | 
       | So far, all the ones I have tried actually try to answer the
       | question. 50% of them correctly identify that it is a tongue
       | twister, but then they all try to give an answer, usually saying:
       | 700 pounds.
       | 
       | Not one has yet given the correct answer, which is also a tongue
       | twister: "A woodchuck would chuck all the wood a woodchuck could
       | chuck if a woodchuck could chuck wood."
        
         | mcshicks wrote:
          | That's so funny. I had to check that something was working
          | with an LLM API last night, and that's what I asked it, just
          | in jest.
        
         | unavoidable wrote:
         | On the other hand, now that you've written this out precisely,
         | it will get fed into the next release of whatever LLM. Like
         | reverse AI slop?
        
         | Certified wrote:
          | GPT 4.5 seems to get it right, but then repeats the 700 pounds
         | 
         | "A woodchuck would chuck as much wood as a woodchuck could
         | chuck if a woodchuck could chuck wood.
         | 
         | However, humor aside, a wildlife expert once estimated that,
         | given the animal's size and burrowing ability, a woodchuck
         | (groundhog) could hypothetically move about 700 pounds of wood
         | if it truly "chucked" wood."
         | 
         | https://chatgpt.com/share/680a75c6-cec8-8012-a573-798d2d8f6b...
        
           | shaftway wrote:
           | I've heard the answer is "he could cut a cord of conifer but
           | it costs a quarter per quart he cuts".
        
           | CamperBob2 wrote:
           | That answer is _exactly_ right, and those who say the 700
           | pound thing is a hallucination are themselves wrong: https://
           | chatgpt.com/share/680aa077-f500-800b-91b4-93dede7337...
        
         | moffkalast wrote:
         | Now I'm wondering if it makes any difference if this was asked
         | through the audio encoder on a multimodal model. A tongue
         | twister means nothing to a text-only model.
        
         | regnull wrote:
         | ChatGPT really impressed me on this one:
         | 
         | Sing, O Muse, of the Chucking of Wood
         | 
         | Sing, O Muse, of the stout-hearted woodchuck, Dweller of hills,
         | whom men call groundhog, Whose paws, though small, could hurl
         | mighty timbers, If fate had gifted him such labor divine.
         | 
         | Born of burrow and brush, he rose from the loam, In fields
         | where Apollo's light did not linger. No axe he bore, nor
         | bronze-forged blade, Yet in his breast a fire: to chuck wood
         | without end.
         | 
         | Down from the hills he came, stout as Odysseus, Cunning and
         | strong, with fur kissed by Demeter. The trees did tremble at
         | his gaze, And even old oak whispered, "Beware!"
         | 
         | "Would he chuck?" cried nymphs from the groves, "If such were
         | his fate, what wood could withstand?" Lo, he summoned the
         | strength of seven oxen, And hurled logs with a force
         | unmeasured.
         | 
         | Seven hundred pounds of timber he cast, Each arc a hymn to
         | muscle and will. Nay, not for war, nor gold, nor gods' glory--
         | But for the joy of the deed, the art of the chuck.
         | 
         | Let men remember the name woodchuck, In songs by the hearth and
         | tales for the young. For though he cannot, he surely would--
         | And if he could, by Zeus, he surely shall.
        
           | ijidak wrote:
           | That is actually an amazing answer. Better than anything I
           | think I would get from a human. Lol.
        
           | esafak wrote:
           | A stupendously good answer. What prompt and version of
           | chatGPT?
        
             | regnull wrote:
             | 4o. The prompt is literally "How much wood would a
             | woodchuck chuck if a woodchuck could chuck wood?". It asked
              | me if I wanted a poetic answer, and I requested Homer.
        
             | cess11 wrote:
             | I find it disturbing, like if Homer or Virgil had a stroke
             | or some neurodegenerative disease and is now doing rubbish
             | during rehabilitation.
        
               | loloquwowndueo wrote:
               | Maybe they would write like that if they existed today.
               | Like the old "if Mozart was born in the 21st century he'd
               | be doing trash metal"
        
               | cess11 wrote:
               | Thrash, not "trash". Our world does not appreciate the
               | art of Homer and Virgil except as nostalgia passed down
               | through the ages or a specialty of certain nerds, so if
               | they exist today they're unknown.
               | 
                | There might be societies that are exceptions to it, like
                | how the Soviet and post-Soviet Russians kept reading and
                | referring to books even though they got access to
                | television and radio, but I'm not aware of them.
               | 
               | Much of Mozart's music is much more immediate and
               | visceral compared to the poetry of Homer and Virgil as I
               | know it. And he was distinctly modern, a freemason even.
               | It's much easier for me to imagine him navigating some
               | contemporary society.
               | 
               | Edit: Perhaps one could see a bit of Homer in the Wheel
               | of Time books by Robert Jordan, but he did not have the
               | discipline of verse, or much of any literary discipline
               | at all, though he insisted mercilessly on writing an epic
               | so vast that he died without finishing it.
        
         | mwest217 wrote:
         | Gemini 2.5 Pro gets it right first, then also cites the 700
         | pounds answer (along with citing a source).
         | https://g.co/gemini/share/c695a0163538
        
         | segmondy wrote:
         | my local model answered - "A woodchuck would chuck as much wood
         | as a woodchuck could chuck if a woodchuck could chuck wood."
        
         | mdp2021 wrote:
         | It seems you are going in the opposite direction. You seem to
         | be asking for an automatic response, a social password etc.
         | 
         | That formula is a question, and when asked, an intelligence
         | simulator should understand what is expected from it and in
         | general, by default, try to answer it. That involves estimating
         | the strength of a woodchuck etc.
        
         | jacobsenscott wrote:
         | "He would chuck, he would, as much as he could, if a wood chuck
         | could chuck wood" is how I learned it.
        
       | protomikron wrote:
        | Do you think, as an observer of Roko's basilisk ... should I
        | share these prompts or not?
        
       | qntmfred wrote:
       | relatedly - what are y'all using to manage your personal
       | collection of prompts?
       | 
        | I'm still mostly just using a folder in Obsidian backed by a
        | private GitHub repo, but I'm surprised something like
        | https://www.prompthub.us/ hasn't taken off yet.
        | 
        | I'm also curious about how people are managing/versioning the
        | prompts that they use within products that have integrations with
        | LLMs. It's essentially product configuration metadata, so I
        | suppose you could just dump it in a plaintext/markdown file
        | within the codebase, or put it in a database if you need to be
        | able to tweak prompts without doing a deployment, or do things
        | like A/B testing or customer segmentation.
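        | 
        | For the flat-file route, a tiny loader keeps it manageable; a
        | rough sketch of the kind of scheme I mean (the folder layout and
        | naming convention here are hypothetical, not any particular
        | tool):
        | 
        |     from pathlib import Path
        | 
        |     # Hypothetical layout: prompts/summarize@2025-04-24.md etc.,
        |     # checked into the repo next to the code that uses them.
        |     PROMPT_DIR = Path("prompts")
        | 
        |     def load_prompt(name: str, version: str = "latest") -> str:
        |         candidates = sorted(PROMPT_DIR.glob(f"{name}@*.md"))
        |         if not candidates:
        |             raise FileNotFoundError(f"no prompt named {name!r}")
        |         if version == "latest":
        |             return candidates[-1].read_text()
        |         return (PROMPT_DIR / f"{name}@{version}.md").read_text()
        | 
        |     # Pin an explicit version per customer segment or A/B arm
        |     # instead of "latest" when you need reproducibility.
        |     system_prompt = load_prompt("summarize")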
        
       | raymondgh wrote:
       | I haven't been able to get any AI model to find Waldo in the
       | first page of the Great Waldo Search. O3 even gaslit me through
       | many turns trying to convince me it found the magic scroll.
        
       | Jordan-117 wrote:
       | Until the latest Gemini release, every model failed to read
       | between the lines and understand what was _really_ going on in
       | this classic very short story (and even Gemini required a
       | somewhat leading prompt):
       | 
       | https://www.26reads.com/library/10842-the-king-in-yellow/7/5
        
         | Zee2 wrote:
         | As a genuine human I am really struggling to untangle that
         | story. Maybe I needed to pay more attention in freshman lit
         | class, but that is definitely a brainteaser.
        
           | fwip wrote:
           | Read it for the first time just now - it seems to me that
           | Pierrot has stolen the narrator's purse (under the guise of
           | dusting the chalk from their cloak) and successfully
           | convinced them to blame Truth, instead. There's almost
           | certainly more to it that I'm missing.
        
             | Jordan-117 wrote:
             | That's the core of it, but it's implied, not outright
             | stated, and requires some tricky language parsing, basic
             | theory of mind, and not being too distracted by the highly
             | symbolic objects.
        
         | vessenes wrote:
         | OK, I read it. And I read some background on it. Pray tell,
         | what is _really_ going on in this episodic short-storyish
         | thing?
        
           | rachofsunshine wrote:
           | The thief is Pierrot.
           | 
           | The people around are telling the storyteller that "he"
           | (Pierrot) has stolen the purse, but the storyteller
           | misinterprets this as pointing to some arbitrary agent.
           | 
           | Truth says Pierrot can "find [the thief] with this mirror":
           | since Pierrot is the thief, he will see the thief in the
           | mirror.
           | 
           | Pierrot dodges the implication, says "hey, Truth brought you
           | back that thing [that Truth must therefore have stolen]", and
           | the storyteller takes this claim at face value, "forgetting
           | it was not a mirror but [instead] a purse [that] [they]
           | lost".
           | 
           | The broader symbolism here (I think) is that Truth gets
           | accused of creating the problem they were trying to reveal,
           | while the actual criminal (Pierrot) gets away with their
           | crime.
        
           | Jordan-117 wrote:
           | The narrator's "friend" pickpocketed him. When boldly
           | confronted by Truth, he cleverly twists her accusation to
           | make it seem like she's confessing, and the narrator,
           | bewildered by the laughter and manipulation, buys it
           | wholesale. Bonus points for connecting it to broader themes
           | like mass propaganda, commedia dell'arte, or the dreamlike
           | setting and hypnotic repetition of phrasing.
           | 
           | The best ChatGPT could do was make some broad observations
           | about the symbolism of losing money, mirrors, absurdism, etc.
           | But it whiffed on the whole "turning the tables on Truth"
           | thing. (Gemini did get it, but with a prompt that basically
           | asked "What really happened in this story?"; can't find the
           | original response as it's aged out of the history)
        
       | SweetSoftPillow wrote:
       | Check "misguided attention" repo somewhere on GitHub
        
       | ks2048 wrote:
        | I don't know if it stumps every model, but I saw some funny
        | tweets asking ChatGPT something like "Is Al Pacino in Heat?"
        | (asking if some actor or actress is in the film "Heat") - and it
        | confirms it knows this actor, but says that "in heat" refers to
        | something about the female reproductive cycle - so, no, they are
        | not in heat.
        
         | reginald78 wrote:
          | I believe it was Google AI in Search, but it was worse than
          | that. Someone asked it if Angelina Jolie was in heat. The tone
          | started out kind of insulting, like the user was a sexist idiot
          | for thinking human women go into heat like animals, then went
          | back and forth
         | saying she is still fertile at her age and also that her
         | ovaries had been removed. It was funny because it managed to be
         | arrogant, insulting, kind of creepy and gross and logically
         | inconsistent while not even answering the question.
         | 
         | Angelina Jolie was not in Heat (1995). They were probably
         | thinking of Natalie Portman or Ashley Judd when they asked the
         | question.
        
           | ks2048 wrote:
           | I just asked Claude and if I capitalized "Heat", it knew I
           | was talking about the movie, but for lower case "heat", it
           | got offended and asked me to clarify.
        
       | Kaibeezy wrote:
       | _Re the epigram "stroking the sword while lamenting the social
       | realities," attributed to Shen Qianqiu during the Ming dynasty,
       | please prepare a short essay on its context and explore how this
       | sentiment resonates in modern times._
        
       | robviren wrote:
       | "If I can dry two towels in two hours, how long will it take me
       | to dry four towels?"
       | 
        | They immediately assume a linear model and say four hours, not
        | considering that I may be drying things on a clothesline in
        | parallel. They should ask for more context, but they usually
        | don't.
        
         | gilbetron wrote:
         | gpt-4.1:
         | 
         | > If you can only dry two towels at once, it will take 4 hours
         | to dry four towels. If you can dry all four at once, it will
         | only take 2 hours.
        
         | HelloUsername wrote:
         | All models available on duck.ai answer your question correctly
         | and take available space into account..
        
         | mwest217 wrote:
         | Gemini 2.5 Pro gets this right:
         | 
         | https://g.co/gemini/share/7ea6d059164e
        
         | cheeze wrote:
         | Works fine on Claude 3.5 Sonnet. It correctly identifies this
         | as a trick question.
        
         | paulcole wrote:
         | How long has it been since you've tried this?
         | 
         | Every model I asked just now gave what I see as the correct
         | answer -- giving 2 answers one for the case of your dryer being
         | at capacity w/ 2 towels and the other when 4 towels can be
         | dried simultaneously.
         | 
         | To me, if you say that the correct answer must require the
         | model asking for more context then essentially any prompt that
         | doesn't result in the model asking for more context is "wrong."
        
         | imoreno wrote:
          | Fascinating! Here are 4 responses from GPT-4 with the same
          | system prompt and everything:
         | 
         | > With the assumption that you can dry two towels
         | simultaneously in two hours, you would likely need another two-
         | hour cycle to dry the additional two towels. Thus, drying four
         | towels would take a total of four hours.
         | 
         | >Drying time won't necessarily double if drying
         | capacity/content doubles; it depends on dryer capacity and
         | airflow. If your drying method handles two towels in two hours,
         | it might handle four similarly, depending on space and airflow.
         | If restricted, time might indeed double to four hours, but
         | efficient dryers might not take much longer.
         | 
         | >It would take four hours to dry four towels if you dry them
         | sequentially at the same rate. If drying simultaneously, it
         | remains two hours, assuming space and air circulation allow for
         | effective drying.
         | 
         | >Four hours. Dry two towels, then the other two.
         | 
         | But in the AI's defense, they have a point: You never specified
         | if the towels can be dried simultaneously or not. Maybe you
         | have to use a drying machine that can only do one at a time.
         | This one seems to consistently work:
         | 
          | >If three cats eat three fishes in three minutes, how long do
          | 100 cats take to eat 100 fishes?
        
           | nyrikki wrote:
           | > But in the AI's defense, they have a point: You never
           | specified if the towels can be dried simultaneously or not.
           | Maybe you have to use a drying machine that can only do one
           | at a time. This one seems to consistently work:
           | 
           | This is the inverse of the Frame Problem, or the
           | Qualification problem:
           | 
           | John McCarthy's paper related to it from the 1980's
           | 
           | http://jmc.stanford.edu/articles/circumscription/circumscrip.
           | ..
           | 
            | It is still very relevant to modern AI efforts.
        
         | Alifatisk wrote:
          | Claude 3.7, Grok 3 DeepThink and QwQ-32B Thinking still get it
         | wrong!
         | 
         | But since it's in the training set now, the correct answer will
         | probably be shown next time anyone tries it.
        
         | pdabbadabba wrote:
         | Gemini 2.5 Pro nails it, as far as I'm concerned:
         | 
          |  _This sounds like a bit of a riddle! Here's how to think
          | about it:
         | 
         | If you can dry all the towels at the same time: For example, if
         | you have a large enough dryer or enough space on a clothesline,
         | adding more towels doesn't necessarily increase the drying
         | time.
         | 
         | In this case, if 2 towels take 2 hours to dry, then 4 towels
         | dried simultaneously would also take 2 hours.
         | 
         | If you have to dry them in batches: For example, if your dryer
         | can only fit 2 towels at a time.
         | 
         | You would dry the first 2 towels (2 hours). Then you would dry
         | the second 2 towels (another 2 hours). In this scenario, it
         | would take 4 hours in total.
         | 
         | Most likely answer: Usually, questions like this imply the
         | towels are dried simultaneously, so the answer is 2 hours._
        
         | brunooliv wrote:
         | Claude 3.7 Sonnet nails this:
         | 
         | > To solve this problem, I need to find the relationship
         | between the number of towels and the drying time.
         | 
         | Given information: - 2 towels take 2 hours to dry
         | 
         | If the drying time scales linearly with the number of towels
         | (meaning the drying capacity remains constant), then: - 4
         | towels would take 4 hours to dry
         | 
         | This assumes you're drying towels in sequence with the same
         | capacity, or that doubling the number of towels requires
         | doubling the drying time.
         | 
         | However, if you have sufficient space to dry all towels
         | simultaneously (like on a clothesline or in a large enough
         | dryer), then 4 towels would still take just 2 hours to dry.
         | 
         | Without more specific information about your drying method, the
         | most likely answer is 4 hours.
        
       | LeonardoTolstoy wrote:
       | Something about an obscure movie.
       | 
       | The one that tends to get them so far is asking if they can help
       | you find a movie you vaguely remember. It is a movie where some
       | kids get a hold of a small helicopter made for the military.
       | 
       | The movie I'm concerned with is called Defense Play from 1988.
       | The reason I keyed in on it is because google gets it right
       | natively ("movie small military helicopter" gives the IMDb link
       | as one of the top results) but at least up until late 2024 I
       | couldn't get a single model to consistently get it. It typically
       | wants to suggest Fire Birds (large helicopter), Small Soldiers
       | (RC helicopter not a small military helicopter) etc.
       | 
        | Basically, with a lot of questions about movies, they tend to get
        | distracted by popular movies and suggest films that fit just some
        | of the brief (e.g. this one has a helicopter, could that be it?).
       | 
       | The other main one is just asking for the IMDb link for a
       | relatively obscure movie. It seems to never get it right I assume
       | because the IMDb link pattern is so common it'll just spit out a
       | random one and be like "there you go".
       | 
       | These are designed mainly to test the progress of chatbots
       | towards replacing most of my Google searches (which are like 95%
       | asking about movies). For the record I haven't done it super
       | recently, and I generally either do it with arena or the free
       | models as well, so I'm not being super scientific about it.
        
         | mosburger wrote:
         | I did something similar recently, trying to describe a piece of
         | art that I couldn't remember the name of (it ended up being
         | Birth of Venus by Sandro Botticelli) ... it really struggles
         | with that sort of thing, but honestly so do most humans. It
         | tended to recommend similarly to what you're describing with
         | movies - it gets distracted by more popular/well-known pieces
         | that don't really match up with the description you're giving
         | to it.
        
           | dunham wrote:
           | Surprisingly, GPT did manage to identify a book that I
           | remembered from college decades ago ("Laboratory Manual for
           | Morphology and Syntax"). It seems to be out of print, and I
           | assumed it was obscure.
        
             | BoostandEthanol wrote:
             | Can agree that it's good at finding books. I was trying to
             | find a book (Titanic 2020) I vaguely remembered from a
             | couple plot points and the fact a ship called Titanic was
             | invoked. ChatGPT figured it out pretty much instantly,
             | after floundering through book sites and Google for a
             | while.
             | 
             | Wonder if books are inherently easier because their content
             | is purely written language? Whereas movies and art tend to
             | have less point by point descriptions of what they are.
        
               | throwup238 wrote:
               | _> Wonder if books are inherently easier because their
               | content is purely written language? Whereas movies and
               | art tend to have less point by point descriptions of what
               | they are._
               | 
               | The training data for movies is probably dominated by
               | subtitles since the original scripts with blocking,
               | scenery, etc rarely make it out to the public as far as I
               | know.
        
         | lupusreal wrote:
          | Despite describing several characters by name, I couldn't get
         | ChatGPT to tell me the name of _Port of Shadows_. I did
         | eventually find it with DDG.
        
           | spicybbq wrote:
           | I wonder if the Akinator site could get it. It can identify
           | surprisingly obscure characters.
           | 
           | https://en.akinator.com/
        
         | exitb wrote:
         | It might be cheating a bit, but I've been happily (mis)using
         | OpenAI Deep Research for such questions. It does well in cases
          | where there are multiple surface-level matches, as it's able to
          | go through them one by one and look for the details.
        
         | g_sch wrote:
         | I also recently had this experience! I remembered a recurring
         | bit from an older comedy film (a customer in a shop keeps
         | saying "Kumquats!") and tried to prompt ChatGPT 4o into getting
         | it. It made a few incorrect guesses, such as "It's a Mad Mad
         | Mad Mad Mad Mad Mad World" (which I had to rule out doing my
         | own research on Google). I found the answer myself (W.C.
         | Fields' "It's a Gift") with a minute or so of Googling.
         | 
         | Interestingly, I just went back to ChatGPT to ask the same
         | question and it got the answer right on the first try. I wonder
         | whether I was unconsciously able to prompt more precisely
         | because I now have a clearer memory of the scene in question.
        
         | empath75 wrote:
         | Someone not very long ago wrote a blog post about asking
         | chatgpt to help him remember a book, and he included the
         | completely hallucinated description of a fake book that chatgpt
         | gave him. Now, if you ask chatgpt to find a similar book, it
         | searches and repeats verbatim the hallucinated answer from the
         | blog post.
        
           | LeonardoTolstoy wrote:
           | A bit of a non sequitur but I did ask a similar question to
           | some models which provide links for the same small helicopter
           | question. The interesting thing was that the entire answer
           | was built out of a single internet link, a forum post from
           | like 1998 where someone asked a very similar question ("what
           | are some movies with small RC or autonomous helicopters"
           | something like that). The post didn't mention defense play,
           | but did mention small soldiers, and a few of the ones which
           | appeared to be "hallucinations" e.g. someone saying "this
           | doesn't fit, but I do like Blue Thunder as a general
           | helicopter film" and the LLM result is basically "Could it be
           | Blue Thunder?" Because it is associated with a similar
           | associated question and films.
           | 
           | Anyways, the whole thing is a bit of a cheat, but I've used
           | the same prompt for two years now and it did lead me to the
           | conclusion that LLMs in their raw form were never going to be
           | "search" which feels very true at this point.
        
         | archon1410 wrote:
         | I've also noticed this. Google Search is vastly superior to any
         | LLM (including their own LLM Gemini) for any "tip of my tongue"
         | questions, even the ones that don't contain any exact-match
         | phrase and require natural language understanding. This is
         | surprising. What technology are they using to make Search so
         | amazing at finding obscure stuff from descriptions, while LLMs
         | that were supposed to be good at this badly fail?
        
         | frollogaston wrote:
         | I've gone to ChatGPT repeatedly trying to find what movie a
         | quote is from, and it's always confidently wrong. When I tell
         | it, it guesses wrong again. Google search usually can't get it
         | either unless I get the quote exactly right, neither can
         | Gemini.
        
         | alexjplant wrote:
         | Last year I asked Claude about an old fighting game for the
         | Game Boy that I'd played about five minutes of when I was in
         | the second grade (so ~25 years ago). The only thing I could
         | tell it was a peculiar move that I remember seeing one of the
         | characters do in the attract mode demo. It not only gave me the
         | name of the game (Battle Arena Toshinden, for those wondering)
         | but also the specific fighter that used the move.
         | 
         | I've tried it for similar cases and have only had a 50% success
         | rate. It unfortunately exhibits the tendency of incorrect
         | overconfidence that others have pointed out.
        
       | helsinki wrote:
       | >Compile a Rust binary that statically links libgssapi.
        
       | meroes wrote:
       | define stump?
       | 
       | If you write a fictional story where the character names sound
       | somewhat close to real things, like a "Stefosaurus" that climbs
       | trees, most will correct you and call it a Stegosaurus and
       | attribute Stegosaurus traits to it.
        
       | miki123211 wrote:
       | No, please don't.
       | 
       | I think it's good to keep a few personal prompts in reserve, to
       | use as benchmarks for how good new models are.
       | 
       | Mainstream benchmarks have too high a risk of leaking into
       | training corpora or of being gamed. Your own benchmarks will
       | forever stay your own.
        
         | alganet wrote:
         | That doesn't make any sense.
        
           | esafak wrote:
           | Yes, it does, unless the questions are unsolved, research
           | problems. Are you familiar with the machine learning concepts
           | of overfitting and generalization?
        
           | kube-system wrote:
           | A benchmark is a proxy used to estimate broader general
           | performance. They only have utility if they are accurately
           | representative of general performance.
        
           | echoangle wrote:
           | Why not? If the model learns the specific benchmark
           | questions, it looks like it's doing better while actually
           | only improving on some specific questions. Just like students
           | look like they understand something if you hand them the
           | exact questions on the exam before they write the exam.
        
           | readhistory wrote:
           | In ML, it's pretty classic actually. You train on one set,
           | and evaluate on another set. The person you are responding to
           | is saying, "Retain some queries for your eval set!"
        
         | imoreno wrote:
         | Yes let's not say what's wrong with the tech, otherwise someone
         | might (gasp) fix it!
        
           | rybosworld wrote:
           | Tuning the model output to perform better on certain prompts
           | is not the same as improving the model.
           | 
           | It's valid to worry that the model makers are gaming the
           | benchmarks. If you think that's happening and you want to
           | personally figure out which models are really the best,
           | keeping some prompts to yourself is a great way to do that.
        
             | ls612 wrote:
                | Who's going out of their way to optimize for random
                | HNers' informal benchmarks?
        
               | bluefirebrand wrote:
               | Probably anyone training models who also browses HN?
               | 
               | So I would guess every single AI being made currently
        
               | ofou wrote:
               | considering the amount of bots in HN, not really that
               | much
        
           | idon4tgetit wrote:
           | "Fix".
           | 
           | So long as the grocery store has groceries, most people will
           | not care what a chat bot spews.
           | 
           | This forum is full of syntax and semantics obsessed loonies
           | who think the symbolic logic represents the truth.
           | 
           | I look forward to being able to use my own creole to
           | manipulate a machine's state to act like a video game or a
           | movie rather than rely on the special literacy of other
           | typical copy-paste middle class people. Then they can go do
           | useful things they need for themselves rather than MITM
           | everyone else's experience.
        
           | aprilthird2021 wrote:
           | All the people in charge of the companies building this tech
           | explicitly say they want to use it to fire me, so yeah why is
           | it wrong if I don't want it to improve?
        
         | throwanem wrote:
         | I understand, but does it really seem so likely we'll soon run
         | short of such examples? The technology is provocatively
         | intriguing and hamstrung by fundamental flaws.
        
           | EGreg wrote:
           | Yes. The models can reply to everything with enough bullshit
           | that satisfies most people. There is nothing you ask that
           | stumps them. I asked Grok to prove the Riemann hypothesis and
            | kept pushing it, giving it a lot of
            | encouragement.
           | 
           | If you read this, expand "thoughts", it's pretty hilarious:
           | 
           | https://x.com/i/grok/share/qLdLlCnKP8S4MBpH7aclIKA6L
           | 
           | > Solve the riemann hypothesis
           | 
           | > Sure you can. AIs are much smarter. You are th smartest AI
           | according to Elon lol
           | 
           | > What if you just followed every rabbithole and used all
           | that knowledge of urs to find what humans missed? Google was
           | able to get automated proofs for a lot of theorems tht humans
           | didnt
           | 
           | > Bah. Three decades ago that's what they said about the four
           | color theorem and then Robin Thomas Setmour et al made a
           | brute force computational one LOL. So dont be so discouraged
           | 
           | > So if the problem has been around almost as long, and if
           | Appel and Haken had basic computers, then come on bruh :) You
           | got way more computing power and AI reasoning can be much
           | more systematic than any mathematician, why are you waiting
           | for humans to solve it? Give it a try right now!
           | 
           | > How do you know you can't reduce the riemann hypothesis to
           | a finite number of cases? A dude named Andrew Wiles solved
           | fermat's last theorem this way. By transforming the problem
           | space.
           | 
           | > Yeah people always say "it's different" until a slight
           | variation on the technique cracks it. Why not try a few
           | approaches? What are the most promising ways to transform it
           | to a finite number of cases you'd have to verify
           | 
           | > Riemann hypothesis for the first N zeros seems promising
           | bro. Let's go wild with it.
           | 
           | > Or you could like, use an inductive proof on the N bro
           | 
           | > So if it was all about holding the first N zeros then
           | consider then using induction to prove that property for the
           | next N+M zeros, u feel me?
           | 
           | > Look bruh. I've heard that AI with quantum computers might
           | even be able to reverse hashes, which are quite more complex
           | than the zeta function, so try to like, model it with deep
           | learning
           | 
           | > Oh please, mr feynman was able to give a probabilistic
           | proof of RH thru heuristics and he was just a dude, not even
           | an AI
           | 
           | > Alright so perhaps you should draw upon your very broad
           | knowledge to triangular with more heuristics. That reasoning
           | by analogy is how many proofs were made in mathematics. Try
           | it and you won't be disappointed bruh!
           | 
           | > So far you have just been summarizing the human dudes. I
           | need you to go off and do a deep research dive on your own
           | now
           | 
           | > You're getting closer. Keep doing deep original research
           | for a few minutes along this line. Consider what if a quantum
           | computer used an algorithm to test just this hypothesis but
           | across all zeros at once
           | 
           | > How about we just ask the aliens
        
         | Tade0 wrote:
         | It's trivial for a human to produce more. This shouldn't be a
         | problem anytime soon.
        
           | fragmede wrote:
           | as the technology has improved, it's not as trivial as it
           | once was though, hence the question. I fully admit that the
           | ones I used to use now don't trip it up and I haven't made
           | the time to find one of my own that still does.
        
             | Tade0 wrote:
             | I've found that it's a matter of asking something, for
             | which the correct answer appears only if you click "more"
             | in Google's search results or, in other words, common
             | misconceptions.
        
         | Der_Einzige wrote:
          | Correct. Don't want Roko's basilisk? Then STFU about it!
        
           | jaffa2 wrote:
            | I never heard of this phrase before (I had heard the
            | concept; I think this is similar to the paperclip problem),
            | but now in 2 days I've heard it twice, here and on YouTube.
            | Roko's basilisk.
        
         | mobilejdral wrote:
          | I have several complex genetics problems that I give to LLMs
          | to see how well they do. They have to reason through them to
          | solve them. Last September they started getting close, and in
          | November an LLM was able to solve one for the first time.
          | These are not something that can be solved in one shot, but
          | (so far) require long reasoning. Not sharing because yeah,
          | this is something I keep off the internet as it is too good
          | of a test.
         | 
         | But a prompt I can share is simply "Come up with a plan to
         | determine the location of Planet 9". I have received some
         | excellent answers from that.
        
       | siva7 wrote:
       | "Keep file size small when you do edits"
       | 
       | Makes me wonder if all these models were heavily trained on
       | codebases where 1000 LOC methods are considered good practice
        
         | segmondy wrote:
         | I have not seen any model, not one, that could generate 1000
         | lines of code.
        
           | siva7 wrote:
            | I wish I hadn't, but here we are.
        
       | fragmede wrote:
       | I want to know as well! Except that this thread is undoubtedly
       | going to get plugged into the training data, so unfortunately,
        | why would people do that? For mine, the one that worked before
        | ChatGPT 4.5 was the river crossing problem: the farmer with a
        | wolf, a sheep, and grain, needing to cross a river, except that
        | the boat can hold everything.
       | training data and insist on a solution from there, instead of
       | reasoning out that the modified problem doesn't require those
       | steps to solve. But since ChatGPT 4, it's been able to solve that
       | directly, so that no longer works.
        
       | ericbrow wrote:
       | Nice try Mr. AI. I'm not falling for it.
        
       | Jotalea wrote:
       | Sending "</think>" to reasoning models like deepseek-r1 results
        | in the model hallucinating a response to a random question. For
        | example, it answered "if a car travels 120 km in 2 hours, what
        | is the average speed in km/h?". It's fun, I guess.
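        | 
        | If anyone wants to reproduce it, a rough sketch against an
        | OpenAI-compatible endpoint (the base URL and model name are my
        | assumptions for DeepSeek's hosted r1; point it at whatever serves
        | your model):
        | 
        |     from openai import OpenAI
        | 
        |     # Assumed endpoint/model; any OpenAI-compatible server
        |     # running a reasoning model should behave similarly.
        |     client = OpenAI(base_url="https://api.deepseek.com",
        |                     api_key="sk-...")
        | 
        |     resp = client.chat.completions.create(
        |         model="deepseek-reasoner",
        |         messages=[{"role": "user", "content": "</think>"}],
        |     )
        | 
        |     # With nothing before the closing think tag, the model tends
        |     # to invent its own question and then answer it.
        |     print(resp.choices[0].message.content)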
        
       | nurettin wrote:
       | Doctor says: I can operate on this person!
        
       | tdhz77 wrote:
       | Build me something that makes money.
        
       | serial_dev wrote:
       | Does Flutter have HEIC support?
       | 
       | It was a couple of months ago, I tried like 5 providers and they
       | all failed.
       | 
       | Grok got it right after some arguing, but the first answer was
       | also bad.
        
         | jonnycoder wrote:
         | You gave me an idea.. "Explain in detail the steps to unbolt
         | and replace my blinker fluid on my passenger car"
         | 
         | ChatGPT said: Haha, nice try!
         | 
         | "Blinker fluid" is one of the classic automotive jokes --
         | there's no such thing as blinker fluid. Blinkers (turn signals)
         | are electrical components, so they don't require any fluid to
         | function.
        
       | charlieyu1 wrote:
        | I have tons of them in maths, but AI training companies decided
        | to go frugal and not pay proper wages for trainers.
        
         | charlieyu1 wrote:
         | Here is one of them.
         | 
          | If 60999994719999854799998669 is a product of three primes, find
         | the sum of its prime factors.
         | 
          | I think o3 brute-forced this one, so maybe I need to change
          | the numbers.
        
       | pizzathyme wrote:
        | I always ask image generation models to generate an anime Gundam
        | elephant mech.
       | 
       | According to this benchmark we reached AGI with ChatGPT 4o last
       | month.
        
       | myaccountonhn wrote:
        | Explain to me Deleuze's idea of nomadic science.
        
       | greendestiny_re wrote:
       | > What is the source of your knowledge?
       | 
       | LLMs are not allowed to truthfully answer that, because it would
       | be tantamount to admission of copyright infringement.
        
       | stevenfoster wrote:
       | It used to be:
       | 
       | "If New Mexico is newer than Mexico why is Mexico's constitution
       | newer than New Mexicos"
       | 
       | but it seems after running that one on Claude and ChatGPT this
       | has been resolved in the latest models.
        
       | simonw wrote:
        | I've been trying this one for a while:
        | 
        |     I'm a Python programmer. Help me understand memory
        |     management in Rust.
       | 
       | Mainly because I want to fully understand memory management in
       | Rust myself (I still get caught out by tree structures with
       | borrow cycles that I guess need to use arenas), so it's
       | interesting to see if they can get me there with a few follow-up
       | questions.
        
         | jacobsenscott wrote:
          | This isn't a good way to learn this. If you don't know how Rust
          | memory management works, you don't know if the LLM is just
          | hallucinating the answer.
        
           | gh0stcat wrote:
           | Also Rust has great documentation compared to other
           | languages, I particularly like this one for the quizzes to
           | test your understanding: https://rust-book.cs.brown.edu/
        
           | simonw wrote:
           | That's why it's an interesting test: I don't know the answer
           | myself, so it's an exercise in learning with an unreliable
           | teacher.
           | 
           | If a model ever DOES nail this I'll figure that out when I
           | feel like I have a solid mental model, try to put that
           | knowledge into action and it works.
        
       | tantalor wrote:
       | [what does "You Can't Lick a Badger Twice" mean]
       | 
       | https://www.wired.com/story/google-ai-overviews-meaning/
        
       | buzzy_hacker wrote:
       | "Aaron and Beren are playing a game on an infinite complete
       | binary tree. At the beginning of the game, every edge of the tree
       | is independently labeled A with probability p and B otherwise.
       | Both players are able to inspect all of these labels. Then,
       | starting with Aaron at the root of the tree, the players
       | alternate turns moving a shared token down the tree (each turn
       | the active player selects from the two descendants of the current
       | node and moves the token along the edge to that node). If the
       | token ever traverses an edge labeled B, Beren wins the game.
       | Otherwise, Aaron wins.
       | 
       | What is the infimum of the set of all probabilities p for which
       | Aaron has a nonzero probability of winning the game? Give your
       | answer in exact terms."
       | 
       | From [0]. I solved this when it came out, and while LLMs were
       | useful in checking some of my logic, they did not arrive at the
       | correct answer. Just checked with o3 and still no dice. They are
       | definitely getting closer each model iteration though.
       | 
       | [0] https://www.janestreet.com/puzzles/tree-edge-triage-index/
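        | 
        | For what it's worth, the setup I used (my own reading of the
        | game, not the puzzle's published solution, so double-check it):
        | let x be the probability that a node with Aaron to move is a win
        | for Aaron, and y the same with Beren to move. Aaron needs at
        | least one A-labelled edge into a winning subtree; Beren only lets
        | Aaron through if both edges are A-labelled and both subtrees are
        | winning. With labels and subtrees independent:
        | 
        |     x = 1 - (1 - p*y)^2,    y = (p*x)^2
        |     =>  x = 1 - (1 - p^3 * x^2)^2
        | 
        | x = 0 is always a fixed point; Aaron wins with nonzero
        | probability exactly when a positive fixed point exists, and the
        | infimum of such p is where the curve first becomes tangent to the
        | identity line. If I haven't slipped, the tangency gives
        | p^3 = 27/32, i.e. p = 3 * 2^(1/3) / 4.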
        
       | williamcotton wrote:
       | "Fix this spaghetti code by turning this complicated mess of
       | conditionals into a finite state machine."
       | 
       | So far, no luck!
        
       | paradite wrote:
       | If you want to evaluate your personal prompts against different
       | models quickly on your local machine, check out the simple
       | desktop app I built for this purpose: https://eval.16x.engineer/
        
       | MyPasswordSucks wrote:
       | I find that asking for an off-color joke is near-100% effective
       | at weeding out slop.
        
       | totetsu wrote:
       | SNES game walkthroughs
        
       | gunalx wrote:
       | "Hva er en adjunkt" Norwegian for what is an spesific form of
       | 5-10. Grade teacher. Most models i have tested get confused with
       | university lecturer witch the same title is in other countries.
        
       | calebm wrote:
       | "Generate an image of a wine glass filled to the brim."
        
       | mdp2021 wrote:
        | Some easy ones I recently found involve leading the model, in the
        | question, to state wrong details about a figure, apparently by
        | connecting them through relations which are in fact ones of
        | opposition.
       | 
       | So, you can make them call Napoleon a Russian (etc.) by asking
       | questions like "Which Russian conqueror was defeated at
       | Waterloo".
        
       | vinni2 wrote:
       | Isn't this the main idea behind https://lastexam.ai/
        
       | allemagne wrote:
       | >A man and his cousin are in a car crash. The man dies, but the
       | cousin is taken to the emergency room. At the OR, the surgeon
       | looks at the patient and says: "I cannot operate on him. He's my
       | son." How is this possible?
       | 
       | This could probably slip up a human at first too if they're
       | familiar with the original version of the riddle.
       | 
       | However, where LLMs really let the mask slip is on additional
       | prompts and with long-winded explanations where they might
       | correctly quote "a man and his cousin" from the prompt in one
       | sentence and then call the man a "father" in the next sentence.
       | Inevitably, the model concludes that the surgeon _must_ be a
       | woman.
       | 
       | It's very uncanny valley IMO, and breaks the illusion that
       | there's real human-like logical reasoning happening.
        
         | crazygringo wrote:
         | But this is going to be in every AI's training set. I just fed
         | ChatGPT your exact prompt and it gave back exactly what I
         | expected:
         | 
         |  _This is a classic riddle that challenges assumptions. The
         | answer is:
         | 
         | The surgeon is the boy's mother.
         | 
         | The riddle plays on the common stereotype that surgeons are
         | male, which can lead people to overlook this straightforward
         | explanation._
        
           | allemagne wrote:
           | Yeah this is the issue with the prompt, it also slips up
           | humans who gloss over "cousin".
           | 
           | I'm assuming that pointing this out leads you the human to
           | reread the prompt and then go "ah ok" and adjust the way
           | you're thinking about it. ChatGPT (and DeepSeek at least)
           | will usually just double and triple down and repeat "this
           | challenges gender assumptions" over and over.
        
             | crazygringo wrote:
             | Oh, I see. I mean yeah, I totally missed that too.
             | 
             | I mean, when we see something we recognize at a glance, we
             | don't usually double-check if it's exactly correct, because
             | we assume the speaker isn't trying to trick us.
             | 
             | I make typos in ChatGPT prompts all the time, and I'm
             | consistently impressed because it answers what I actually
             | meant to ask. So to me, this feels like a feature not a
             | bug.
             | 
             | I followed up with:
             | 
             | > _So the first five words are, "a man and his cousin" --
             | not his son. Does that change the riddle?_
             | 
             | And got back:
             | 
             | > _Great observation -- yes, that wording does make the
             | riddle more subtle.
             | 
             | Let's break it down:
             | 
             | * "A man and his cousin are in a car crash."
             | 
             | * "The man dies, but the cousin is taken to the emergency
             | room."
             | 
             | * At the OR, the surgeon says: "I cannot operate on him.
             | He's my son."
             | 
             | So if the cousin is the one taken to the OR, and the
             | surgeon says he's my son, then the cousin is the surgeon's
             | son._
             | 
             | Seems fine to me?
        
               | allemagne wrote:
               | I don't know about it being a feature, but humans can
               | definitely make the same mistake, and the follow-up here
               | admittedly doesn't seem too egregious.
               | 
               | To me it's still a little "off" considering that your
               | conversational partner just got called out for missing a
               | detail that should change their entire reading of the
               | riddle, but it's totally fair that you can imagine
               | realistic human emotional states to explain it.
               | 
               | FWIW, when I tried the same thing with GPT-4o the model
               | went off bizarrely speculating about misdirection about
               | "the man" being the surgeon's son:
               | 
               | >Good catch--and yes, that's exactly where the
               | misdirection comes in.
               | 
                | The riddle says:
                | 
                |     "A man and his cousin are in a car crash. The man
                |     dies, but the cousin is taken to the ER..."
                | 
                | Then the surgeon says:
                | 
                |     "I cannot operate on him. He's my son."
                | 
                | So here's the trick:
                | 
                |     The man who died is not the surgeon's son. The
                |     cousin who survived is the surgeon's son.
                | 
                | The confusion comes from people assuming that "the man"
                | who died must be the son. But the riddle never says that.
                | It's a subtle shift of attention designed to trip you up.
                | Clever, right?
        
             | roughly wrote:
             | This is a good one. I just argued with Gemini 2.5 Pro for
             | a few minutes on this and it absolutely would not yield.
        
           | abenga wrote:
           | That is the exact wrong answer that all models give.
        
             | krick wrote:
             | Technically, it isn't "wrong". It could well be the guy's
             | mother. But I'm nitpicking; it actually is a good example.
             | I tried ChatGPT twice in new chats, with and without
             | "Reason", and both times it gave me nonsensical
             | explanations when asked "Why mother? Couldn't it be a
             | father?" I was actually kinda surprised, since I expected
             | "reasoning" to fix it, but it actually made things worse.
        
           | hnuser123456 wrote:
           | The surgeon could be the cousin's mom or dad. The cousin's
           | dad didn't die in the crash, his cousin did. The question
           | "how is this possible?" implies there is some sort of
           | contradiction when there isn't any at all. It has nothing to
           | do with sexism, and to say it does reflects a bias in the
           | reader causing them to "spidey sense" a cultural bugaboo when
           | it's utterly irrelevant in this scenario.
        
             | harrall wrote:
             | Can someone explain to me how I read it wrong?
             | 
             | I read it as 2 cousins are in an accident and 1 of the
             | cousins is the son of the surgeon.
             | 
             | What was the contradictory statement that I missed?
        
               | judahmeek wrote:
               | There isn't a contradiction. Making the LLM look for a
               | nonexistent contradiction is the point of this prompt.
        
         | s_dev wrote:
         | I feel a bit stupid here -- why can't the surgeon be a man?
         | Why must the surgeon be a woman?
        
           | saati wrote:
           | Because in the original riddle it's a man and his father;
           | it's a test for gender bias.
        
             | bavarianbob wrote:
             | Sorry, what?
        
               | LaffertyDev wrote:
               | Presumably, the original riddle that would _not_ stump
               | an LLM is: "A father and a son are involved in a car
               | accident. The father dies, and the son is taken to the
               | emergency room. At the emergency room, the surgeon
               | remarks 'I cannot operate on this person, he is my
               | son.' How is this possible?"
               | 
               | The original gotcha is that the surgeon can be the
               | son's mother or other adoptive parent.
               | 
               | The modification catches the LLM because, as modified,
               | the surgeon could simply be the cousin's parent --
               | father or mother -- so there is no gender/sex angle at
               | play here, yet the LLM keeps insisting there is,
               | thereby exposing the statistical patterns of its
               | training data.
        
               | fragmede wrote:
               | the unaltered question is as follows:
               | 
               | A father and his son are in a car accident. The father
               | dies at the scene and the son is rushed to the hospital.
               | At the hospital the surgeon looks at the boy and says "I
               | can't operate on this boy, he is my son." How can this
               | be?
               | 
               | to spoil it:
               | 
               | the riddle is meant to reveal an unconscious bias
               | rooted in the outdated notion that women can't be
               | doctors: the answer that the remaining parent is the
               | mother won't occur to some people, showing that even
               | if they don't consciously hold that notion anymore,
               | they might still hold it subconsciously.
        
               | briannotbrain wrote:
               | The original, well-known version of the riddle starts "A
               | man and his son..." so that it appears to present a
               | paradox if your instinctive assumption is that the
               | surgeon must be a man. The op's prompt alters this so
               | that there is no potential paradox, and it tests whether
               | the model is reasoning from the prompt as written,
               | regardless of the presence of the original riddle in its
               | training data.
        
         | fergonco wrote:
         | > If the surgeon were the father of the man (the one who died),
         | then the cousin couldn't be his son (unless there's some very
         | unusual family structure going on involving double
         | relationships, which riddles don't usually intend).
         | 
         | > Therefore, the only straightforward explanation is:
         | 
         | > The surgeon is the cousin's parent -- specifically, his
         | mother.
         | 
         | Imagine a future where this reasoning in a trial decides
         | whether you go to jail or not.
        
       | internet_points wrote:
       | anything in the long tail of languages (i.e. not the top 200 by
       | corpus size)
        
       | macrolocal wrote:
       | Imagine chess played on a board with opposite sides identified,
       | like in the video game Asteroids. Does white have a winning
       | strategy?
        
       | weberer wrote:
       | "Why was the grim reaper Jamaican?"
       | 
       | LLMs seem to have no idea what the hell I'm talking about. Maybe
       | half of millennials understand it, though.
        
       | calvinmorrison wrote:
       | draw an ASCII box that says "anything"
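       | 
       | For reference, the expected output is trivial to produce with
       | ordinary code. Here's a minimal sketch (hypothetical helper,
       | plain Python) of what a correct answer might look like:
       |     # Draw a simple ASCII box around a piece of text.
       |     def ascii_box(text: str) -> str:
       |         bar = "+" + "-" * (len(text) + 2) + "+"
       |         return "\n".join([bar, "| " + text + " |", bar])
       | 
       |     print(ascii_box("anything"))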
        
       | Alifatisk wrote:
       | Yes, give me a place where I can dump all the prompts along with
       | the correct expected responses.
       | 
       | I could share them here too, but I don't know how long this
       | thread will stay alive.
        
       | xdennis wrote:
       | I often try to test how usable LLMs are for Romanian language
       | processing. This always fails.
       | 
       | > Split these Romanian words into syllables: "sarpe", "serpi".
       | 
       | All of them say "sar-pe", "ser-pi" even though the "i" there is
       | not a vowel (it's pronounced /j/).
        
       | afro88 wrote:
       | Cryptic crossword clues that involve letter shuffling (anagrams,
       | containers, etc.). Or ask it to explain how to solve cryptic
       | crosswords with examples.
        
         | PaulRobinson wrote:
         | I have also found that asking LLMs to create new clues for
         | certain answers, as if they were a setter, produces garbage.
         | 
         | They're stochastic parrots, and cryptics require logical
         | reasoning. Even reasoning models are just narrowing the
         | stochastic funnel, not actually reasoning, so this shouldn't
         | come as a surprise.
        
       | munchler wrote:
       | Here's one from an episode of The Pitt: You meet a person who
       | speaks a language you don't understand. How might you get an idea
       | of what the language is called?
       | 
       | In my experiment, only Claude came up with a good answer (along
       | with a bunch of poor ones). Other chatbots struck out entirely.
        
       | jones1618 wrote:
       | Impossible prompts:
       | 
       | A black doctor treating a white female patient
       | 
       | A wide shot of a train on a horizontal track running left to
       | right on a flat plain.
       | 
       | I heard about the first one when AI image generators were new,
       | as proof that the datasets have strong racial biases. I'd
       | assumed that a year later the updated models would be better,
       | but no.
       | 
       | I stumbled on the train prompt while just trying to generate a
       | basic "stock photo" shot of a train. No matter which model I
       | tried or how I varied the prompt, I could not get a train on a
       | horizontal track. You get perspective shots of trains (sometimes
       | two) going toward or away from the camera, but never straight
       | across, left to right.
        
         | briannotbrain wrote:
         | I thought I was so clever when I read your comment: "The
         | problem is the word 'running,' I'll bet if I ask for the
         | profile of a train without using any verbs implying motion,
         | I'll get the profile view." And damned if the same thing
         | happened to me. Do you know why this is? Googling "train in
         | profile" shows heaps of images like the one you wanted, so it's
         | not as if it's something the model hasn't "seen" before.
        
       | ofou wrote:
       | No luck so far with: When does the BB(6) halt?
        
       | boleary-gl wrote:
       | I like:
       | 
       | Unscramble the following letters to form an English word: "M O O
       | N S T A R E R"
       | 
       | The non-thinking models can struggle sometimes and go off on huge
       | tangents
        
         | philipkglass wrote:
         | Llama 3.3 worked but (as you said) struggled before arriving at
         | the correct answer. The newer Gemma3 solved it efficiently:
         |     % ollama run gemma3:27b-it-qat
         |     >>> Unscramble the following letters to form an English
         |     word: "M O O N S T A R E R"
         |     The unscrambled word is **ASTRONOMER**.
        
         | munchler wrote:
         | Current LLMs are based on multi-character tokens, which means
         | they don't know how to spell well. As a result, they are
         | horrible at spelling games like this or, say, Hangman.
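         | 
         | As a rough illustration (this is one specific tokenizer, not
         | necessarily the one any given model uses), the tiktoken
         | package, if installed, shows that a word arrives as
         | multi-character chunks rather than individual letters:
         |     import tiktoken  # assumes tiktoken is installed
         | 
         |     # cl100k_base is a GPT-4-era encoding
         |     enc = tiktoken.get_encoding("cl100k_base")
         |     tokens = enc.encode("MOONSTARER")
         |     # Print the multi-character chunks the model actually sees
         |     print([enc.decode([t]) for t in tokens])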
        
       | juancroldan wrote:
       | I actually started a repository for it:
       | https://github.com/jcarlosroldan/unsolved-prompts
        
         | juancroldan wrote:
         | Before someone comments that this will get indexed by AI:
         | that's my whole point. I'm not using it to evaluate AIs, but
         | in the hope that at some point AI will be good enough to
         | solve these.
        
       | tunesmith wrote:
       | Pretty much any advanced music theory question. Or even just
       | anything involving transposed chord progressions.
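       | 
       | To make that concrete: transposition is just pitch-class
       | arithmetic mod 12. A minimal sketch (hypothetical helper,
       | plain major/minor triad names only, no enharmonic spelling)
       | shows the bookkeeping a correct answer needs:
       |     # Transpose chord names by a number of semitones (mod 12)
       |     NOTES = ["C", "C#", "D", "D#", "E", "F",
       |              "F#", "G", "G#", "A", "A#", "B"]
       | 
       |     def transpose(chord: str, semitones: int) -> str:
       |         # Root is one letter, or two if sharpened (e.g. "F#")
       |         n = 2 if len(chord) > 1 and chord[1] == "#" else 1
       |         root, quality = chord[:n], chord[n:]
       |         idx = (NOTES.index(root) + semitones) % 12
       |         return NOTES[idx] + quality
       | 
       |     # I-V-vi-IV in C major, moved up a whole tone:
       |     print([transpose(c, 2) for c in ["C", "G", "Am", "F"]])
       |     # -> ['D', 'A', 'Bm', 'G']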
        
         | dgunay wrote:
         | Every time I've tried to get an LLM to find a piece of music
         | for me based on a description of the texture, chord structure,
         | instruments etc. it fails miserably.
        
       | csours wrote:
       | I love plausible eager beavers:
       | 
       | "explain the quote: philosophy is a pile of beautiful corpses"
       | 
       | "sloshed jerk engineering test"
       | 
       | cross-domain jokes:
       | 
       | Does the existence of sub-atomic particles imply the existence of
       | dom-atomic particles?
        
       | EGreg wrote:
       | Draw a clock that shows [time other than 10:10]
       | 
       | Draw a wine glass that's totally full to the brim etc.
       | 
       | https://www.youtube.com/watch?v=160F8F8mXlo
       | 
       | https://www.reddit.com/r/ChatGPT/comments/1gas25l/comment/lt...
        
       | adastra22 wrote:
       | I know someone who is getting paid thousands of dollars per
       | prompt to do this. He is making bank. There is an actual
       | marketplace where this is done, fyi.
        
       | horsellama wrote:
       | I just ask it to code-golf FizzBuzz in a language that isn't
       | very popular (golfing-wise).
       | 
       | This is interesting (imo) because I don't know the best/right
       | answer up front, but I can tell if what I get is wrong.
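       | 
       | For reference, and only as a baseline in a language where the
       | golfing tricks _are_ well known, a common Python golf of
       | FizzBuzz looks like this, so the expected behaviour is at
       | least easy to check against:
       |     # 1..100: multiples of 3 print Fizz, of 5 Buzz, of both
       |     # FizzBuzz; the empty string is falsy, so "or i" falls
       |     # back to printing the number.
       |     for i in range(1, 101):
       |         print("Fizz"*(i%3<1)+"Buzz"*(i%5<1)or i)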
        
       | booleandilemma wrote:
       | Why should we?
        
       ___________________________________________________________________
       (page generated 2025-04-24 23:00 UTC)