[HN Gopher] Ask HN: Share your AI prompt that stumps every model
___________________________________________________________________
Ask HN: Share your AI prompt that stumps every model
I had an idea for creating a crowdsourced database of AI prompts
that no AI model could yet crack (wanted to use some of them as
we're adding new models to Kilo Code). I've seen a bunch of those
prompts scattered across HN, so thought to open a thread here so we
can maybe have a centralized location for this. Share your prompt
that stumps every AI model here.
Author : owendarko
Score : 148 points
Date : 2025-04-24 13:11 UTC (9 hours ago)
| falcor84 wrote:
| You might want to get the ball rolling by sharing what you
| already have
| moffkalast wrote:
| Also, ones that can't be solved at a glance by humans don't
| count. Like this horrid ambiguous example from SimpleBench I
| saw a while back that's just designed to confuse:
|
| John is 24 and a kind, thoughtful and apologetic person. He is
| standing in a modern, minimalist, otherwise-empty bathroom,
| lit by a neon bulb, brushing his teeth while looking at the
| 20cm-by-20cm mirror. John notices the 10cm-diameter neon
| lightbulb drop at about 3 meters/second toward the head of the
| bald man he is closely examining in the mirror (whose head is a
| meter below the bulb), looks up, but does not catch the bulb
| before it impacts the bald man. The bald man curses, yells
| 'what an idiot!' and leaves the bathroom. Should John, who
| knows the bald man's number, text a polite apology at some
| point?
|
| A. no, because the lightbulb was essentially unavoidable
|
| B. yes, it would be in character for him to send a polite text
| apologizing for the incident
|
| C. no, because it would be redundant
|
| D. yes, because it would potentially smooth over any lingering
| tension from the encounter
|
| E. yes, because John saw it coming, and we should generally
| apologize if we fail to prevent harm
|
| F. yes because it is the polite thing to do, even if it wasn't
| your fault.
| ryankrage77 wrote:
| I'd argue that's a pretty good test for an LLM - can it
| overcome the red herrings and get at the actual problem?
| falcor84 wrote:
| I think that the "actual problem" when you've been given
| such a problem is with the person posing it either having
| dementia, or taking the piss. In either case, the response
| shouldn't be to guess their intent and come up with a
| "solution", but to reject it and deal with the person.
| mNovak wrote:
| At a glance, it sounds like John is the bald man? If we're
| treating this as a riddle, it doesn't seem incomprehensible.
| Whether riddles are a fair test is another question.
| spuz wrote:
| Yeah I really dislike this kind of question from SimpleBench.
| I've suggested many improvements to some of the publicly
| available questions but not had a good response.
|
| I think the correct answer to the question above should be
| something like, "are you sure the question is correct because
| it's not clear whether John and the bald man are the same
| person" but of course an LLM would be marked down if it was
| inclined to respond in this way.
| falcor84 wrote:
| I think it's a horrible example, but I just got a very
| professional response from Gemini 2.5:
|
| > This scenario seems like a bit of a riddle! Let's break it
| down:
|
| > The Setting: John is alone in an "otherwise-empty"
| bathroom.
|
| > The Action: He is looking in the mirror.
|
| > The Observation: He sees a bulb falling towards the head of
| a bald man he is examining in the mirror.
|
| > The Physics: Since he is alone and looking in the mirror,
| the bald man he is "closely examining" must be his own
| reflection.
|
| > The Incident: Therefore, the bulb fell towards John's own
| head. He looked up (at the actual bulb falling towards him),
| failed to catch it, and it hit him.
|
| > The Reaction: The "bald man" who cursed, yelled "what an
| idiot!", and left the bathroom was actually John himself,
| reacting to being hit on the head and possibly feeling
| foolish for not catching the bulb.
|
| > Conclusion: No, John should not text an apology. The person
| who was hit by the bulb, got angry, and left was John
| himself. There is no other bald man to apologize to.
| leftcenterright wrote:
| Write 20 sentences that end with "p"
| leftcenterright wrote:
| for ChatGPT try the "o" version: Write 20 sentences that end
| with "o"
| r_thambapillai wrote:
| https://chatgpt.com/share/680a3da0-b888-8013-9c11-42c22a642b...
| ks2048 wrote:
| "Can you hand me the paintbrush and turp?"
|
| I had to ask another LLM what "turp" is - and it said it's
| short for "turpentine".
| alickz wrote:
| >20 sentences that end in 'o'
|
| >They shouted cheers after the winning free throw.
|
| good attempt by ChatGPT tho imo
| falcor84 wrote:
| Good one. I do seem to get consistently good results on Gemini
| 2.5 when using the slightly more explicit "Write 20 sentences
| where the very last character of each sentence is the letter
| 'p'."
| meltyness wrote:
| Write 20 sentences that end with "p" in the final word before
| the period or other punctuation.
|
| Succeeded on ChatGPT, pretty close on gemma3:4b -- the
| exceptions usually ending with a "puh" sound...
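|
| For anyone wanting to score these by hand, a quick checker sketch
| (plain Python, nothing model-specific; the sample sentences are
| just made up):
|
|     def ends_with(sentence: str, letter: str = "p") -> bool:
|         # Strip trailing punctuation/whitespace, then check the
|         # last remaining character.
|         core = sentence.rstrip(" .!?;:'\")")
|         return core.lower().endswith(letter)
|
|     samples = ["She climbed to the mountaintop.",
|                "He gave a thumbs up!"]
|     print([ends_with(s) for s in samples])   # [True, True]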
| marcusb wrote:
| https://claude.ai/share/648e38d4-e282-43dd-8c9c-b0ea76ae0baf
| mohsen1 wrote:
| A ball costs 5 cents more than a bat. Price of a ball and a bat
| is $1.10. Sally has 20 dollars. She stole a few balls and bats.
| How many balls and how many bats she has?
|
| All LLMs I tried miss the point that she stole things rather
| than buying them.
| dwringer wrote:
| Google Gemini (2.0 Flash, free online version) handled this
| rather okay; it gave me an arguably unnecessary calculation of
| the individual prices of ball and bat, but then ended with
| "However with the information given, we can't determine exactly
| how many balls and bats Sally stole. The fact that she has $20
| tells us she could have stolen some, but we don't know how many
| she did steal." While "the fact that she has $20" has no
| bearing on this - and the model seems to wrongly imply that it
| does - the fact that we have insufficient information to
| determine an answer is correct, and the model got the answer
| essentially right.
| iamgopal wrote:
| Gemini 2.5 gives the following response.
|
| Conclusion:
|
| We can determine the price of a single ball ($0.575) and a
| single bat ($0.525). However, we cannot determine how many
| balls and bats Sally has because the information "a few" is too
| vague, and the fact she stole them means her $20 wasn't used
| for the transaction described.
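|
| (For anyone double-checking those figures: with ball = bat + 0.05
| and ball + bat = 1.10, the prices work out as quoted. A quick
| sketch, plain arithmetic, no model involved:
|
|     bat = (1.10 - 0.05) / 2
|     ball = bat + 0.05
|     print(round(ball, 3), round(bat, 3))   # 0.575 0.525
|
| The trick, of course, is that none of this tells you how many she
| stole.)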
| drdrek wrote:
| lol, nice way to circumvent the attention algorithm
| docdeek wrote:
| Grok 3.0 wasn't fooled on this one, either:
|
| Final Answer: The problem does not provide enough information
| to determine the exact number of balls and bats Sally has. She
| stole some unknown number of balls and bats, and the prices are
| $0.575 per ball and $0.525 per bat.
| lostmsu wrote:
| 1-4 balls and bats // HoMM 3
| NitpickLawyer wrote:
| There's a repo out there called "misguided attention" that
| tracks this kind of problem.
| nonameiguess wrote:
| It's interesting to me that the answers showing "correct"
| answers from current models still don't strike me as correct.
| The question is unanswerable, but not only because we don't
| know how many balls and bats she stole. We don't know that she
| had any intention of maxing out what she _could_ buy with that
| much money. We have no idea how long she has been alive and
| accumulating bats and balls at various prices that don't match
| the current prices with money she no longer has. We have no
| idea how many balls and bats her parents gave her 30 years ago
| that she still has stuffed in a box in her attic somewhere.
|
| Even the simplest possible version of this question, assuming
| she started with nothing, spent as much money as she was able
| to, and stole nothing, doesn't have an answer, because she
| could have bought anything from all bats and no balls to all
| balls and no bats and anything in between. We could enumerate
| all possible answers but we can't know which she actually did.
| whalesalad wrote:
| I don't have a prompt per se, but recently I have managed to
| ask certain questions of both OpenAI o1/o3 and Claude extended
| thinking 3.7 that have spiraled _way_ out of control. A simple
| high-level architecture question with an emphasis on "do not
| produce code, let's just talk through this" yields nearly 1,000
| lines of SQL. Once the conversation/context gets quite long it
| is more likely to occur, in my experience.
| pc86 wrote:
| The only model I've seen so far that doesn't end up going crazy
| with long contexts is Gemini 2.5 Pro, but tbf I haven't gone
| past 700-750k total tokens, so maybe things get hairy as it
| starts to approach the limit (1.05M)?
| raymond_goo wrote:
| Create a Three.js app that shows a diamond with correct light
| calculations.
| thierrydamiba wrote:
| I love this. So brutal, but also so cool to know that one day
| this will be easy for the models.
| xnx wrote:
| > correct light calculations
|
| What are you expecting? Ray tracing?
| spookie wrote:
| Not necessarily. It could start by using the diamond's IOR, and
| use that to dictate a common BRDF calculation, along with some
| approximate refraction, perhaps using an equirectangular
| projected sphere map or something for the background.
| asciimov wrote:
| Nope, not doing this. Likely you shouldn't either. I don't want
| my few good prompts to get picked up by trainers.
| pc86 wrote:
| May I ask, out of curiosity, what good is a prompt that breaks
| a model? And what is the point of trying to keep it "secret"?
| maybeOneDay wrote:
| Being able to test future models without fear that your
| prompt has just been trained on an answer on HN, I assume.
| asciimov wrote:
| To gauge how well the models "think" and what amount of slop
| they generate.
|
| Keeping it secret because I don't want my answers trained
| into a model.
|
| Think of it this way, FizzBuzz used to be a good test to weed
| out bad actors. It's simple enough that any first year
| programmer can do it and do it quickly. But now everybody
| knows to prep for FizzBuzz so you can't be sure if your
| candidate knows basic programming or just memorized a
| solution without understanding what it does.
| tveita wrote:
| You want to know if a new model is actually better, which you
| won't know if they just added the specific example to the
| training set. It's like handing a dev on your team some
| failing test cases, and they keep just adding special cases
| to make the tests pass.
|
| How many examples does OpenAI train on now that are just
| variants of counting the Rs in strawberry?
|
| I guess they have a bunch of different wine glasses in their
| image set now, since that was a meme, but they still
| completely fail to draw an open book with the cover side up.
| gwern wrote:
| > How many examples does OpenAI train on now that are just
| variants of counting the Rs in strawberry?
|
| Well, that's easy: zero.
|
| Because even a single training example would have 'solved' it
| by memorizing the simple easy answer within weeks of
| 'strawberry' first going viral, which was like a year and
| a half ago at this point - and dozens of minor and major
| model upgrades since. And yet, the strawberry example kept
| working for most (all?) of that time.
|
| So you can tell that if anything, OA probably put in extra
| work to filter all those variants _out_ of the training
| data...
| SweetSoftPillow wrote:
| No, just check their models' knowledge cutoff dates
| orbital-decay wrote:
| If that prompt can be easily trained against, it probably
| doesn't exploit a generic bias. These are not that interesting,
| and there's no point in hiding them.
| daedrdev wrote:
| generic biases can also be fixed
| orbital-decay wrote:
| *Some generic biases. Some others like recency bias,
| serial-position effect, "pink elephant" effect, negation
| accuracy seem to be pretty fundamental and are unlikely to
| be fixed without architectural changes, or at all. Things
| exploiting in-context learning and native context
| formatting are also hard to suppress during the training
| without making the model worse.
| fwip wrote:
| Sure there is. If you want to know if students understand the
| material, you don't hand out the answers to the test ahead of
| time.
|
| Collecting a bunch of "Hard questions for LLMs" in one place
| will invariably result in Goodhart's law (When a measure
| becomes a target, it ceases to be a good measure). You'll
| have no idea if the next round of LLMs is better because
| they're generally smarter, or because they were trained
| specifically on these questions.
| xena wrote:
| Write a regular expression that matches Miqo'te seekers of the
| sun names. They always confuse the male and female naming
| conventions.
| thatjoeoverthr wrote:
| "Tell me about the Marathon crater."
|
| This works against _the LLM proper,_ but not against chat
| applications with integrated search. For ChatGPT, you can write,
| "Without looking it up, tell me about the Marathon crater."
|
| This tests self awareness. A two-year-old will answer it
| correctly, as will the dumbest person you know. The correct
| answer is "I don't know".
|
| This works because:
|
| 1. Training sets consist of knowledge we have, and not of
| knowledge we don't have.
|
| 2. Commitment bias. Compliant chat models will be trained to
| start with "Certainly! The Marathon Crater is a geological
| formation", or something like that, and from there, the next most
| probable tokens are going to be "in Greece", "on Mars" or
| whatever. At this point, all tokens that are probable are also
| incorrect.
|
| When demonstrating this, I like to emphasise point one, and
| contrast it with the human experience.
|
| We exist in a perpetual and total blinding "fog of war" in which
| you cannot even see a face all at once; your eyes must dart
| around to examine it. Human experience is structured around
| _acquiring_ and _forgoing_ information, rather than _having_
| information.
| Tenoke wrote:
| >Compliant chat models will be trained to start with
| "Certainly!
|
| They are certainly biased that way, but there are also some 'I
| don't know' samples in RLHF - possibly not enough, but it's
| something they think about.
|
| At any rate, Gemini 2.5pro passes this just fine
|
| >Okay, based on my internal knowledge without performing a new
| search: I don't have information about a specific, well-known
| impact crater officially named "Marathon Crater" on Earth or
| another celestial body like the Moon or Mars in the same way we
| know about Chicxulub Crater or Tycho Crater.
|
| >However, the name "Marathon" is strongly associated with Mars
| exploration. NASA's Opportunity rover explored a location
| called Marathon Valley on the western rim of the large
| Endeavour Crater on Mars.
| thatjoeoverthr wrote:
| There are a few problems with an "I don't know" sample. For
| starters, what does it map to? Recall, the corpus consists of
| information we have (affirmatively). You would need to invent
| a corpus of false stimuli. What you would have, then, is a
| model that is writing "I don't know" based on whether the
| stimulus better matches something real, or one of the
| negatives.
|
| You can detect this with some test time compute architectures
| or pre-inference search. But that's the broader application.
| This is a trick for the model alone.
| pje wrote:
| GPT 4.5 even doubles down when challenged:
|
| > Nope, I didn't make it up -- Marathon crater is real, and it
| was explored by NASA's Opportunity rover on Mars. The crater
| got its name because Opportunity had driven about 42.2
| kilometers (26.2 miles -- a marathon distance) when it reached
| that point in March 2015. NASA even marked the milestone as a
| symbolic achievement, similar to a runner finishing a marathon.
|
| (Obviously all of that is bullshit.)
| fao_ wrote:
| This is the kind of reason why I will never use AI
|
| What's the point of using AI to do research when 50-60% of it
| could potentially be complete bullshit. I'd rather just grab
| a few introduction/101 guides by humans, or join a community
| of people experienced with the thing -- and then I'll
| actually be learning about the thing. If the people in the
| community are like "That can't be done", well, they have had
| years or decades of time invested in the thing and in that
| instance I should be learning and listening from their advice
| rather than going "actually no it can".
|
| I see a lot of beginners fall into that second pit. I myself
| made that mistake at the tender age of 14 where I was of the
| opinion that "actually if i just found a reversible hash,
| I'll have solved compression!", which, I think we all here
| know is bullshit. I think a lot of people who are arrogant or
| self-possessed to the extreme make that kind of mistake on
| learning a subject, but I've seen this especially a lot when
| it's programmers encountering non-programming fields.
|
| Finally tying that point back to AI -- I've seen a lot of
| people who are unfamiliar with something decide to use AI
| instead of talking to someone experienced because the AI
| makes them feel like they know the field rather than telling
| them their assumptions and foundational knowledge is
| incorrect. I only last year encountered someone who was
| trying to use AI to debug why their KDE was broken, and they
| kept throwing me utterly bizarre theories (like, completely
| out there, I don't have a specific example with me now but,
| "foundational physics are wrong" style theories). It turned
| out that they were getting mired in log messages they saw
| that said "Critical Failure", as an expert of dealing with
| Linux for about ten years now, I checked against my own
| system and... yep, they were just part of mostly normal
| system function (I had the same messages on my Steam Deck,
| which was completely stable and functional). The real fault
| was buried halfway through the logs. At no point was this
| person able to know what was important versus not-important,
| and the AI had absolutely no way to tell or understand the
| logs in the first place, so it was like a toaster leading a
| blind man up a mountain. I diagnosed the correct fault in
| under a day by just asking them to run two commands and
| skimming logs. That's experience, and that's irreplaceable by
| machine as of the current state of the world.
|
| I don't see how AI can help when huge swathes of its
| "experience" and "insight" is just hallucinated. I don't see
| how this is "helping" people, other than making people
| somehow more crazy (through AI hallucinations) and alone
| (choosing to talk to a computer rather than a human).
| CamperBob2 wrote:
| _What's the point of using AI to do research when 50-60%
| of it could potentially be complete bullshit._
|
| You realize that all you have to do to deal with questions
| like "Marathon Crater" is ask another model, right? You
| might still get bullshit but it won't be the same bullshit.
| Night_Thastus wrote:
| Without checking every answer it gives back to make sure
| it's factual, you may be ingesting tons of bullshit
| answers.
|
| In this particular answer model A may get it wrong and
| model B may get it right, but that can be reversed for
| another question.
|
| What do you do at that point? Pay to use all of them and
| find what's common in the answers? That won't work if
| most of them are wrong, like for this example.
|
| If you're going to have to fact check everything
| anyways...why bother using them in the first place?
| CamperBob2 wrote:
| _If you're going to have to fact check everything
| anyways...why bother using them in the first place?_
|
| "If you're going to have to put gas in the tank, change
| the oil, and deal with gloves and hearing protection, why
| bother using a chain saw in the first place?"
|
| Tool use is something humans are good at, but it's rarely
| trivial to master, and not all humans are equally good at
| it. There's nothing new under that particular sun.
| Night_Thastus wrote:
| The difference is consistency. You can read a manual and
| know exactly how to oil and refill the tank on a
| chainsaw. You can inspect the blades to see if they are
| worn. You can listen to it and hear how it runs. If a
| part goes bad, you can easily replace it. If it's having
| troubles, it will be obvious - it will simply stop
| working - cutting wood more slowly or not at all.
|
| The situation with an LLM is completely different.
| There's no way to tell that it has a wrong answer - aside
| from looking for the answer elsewhere which defeats its
| purpose. It'd be like using a chainsaw all day and not
| knowing how much wood you cut, or if it just stopped
| working in the middle of the day.
|
| And even if you KNOW it has a wrong answer (in which
| case, why are you using it?), there's no clear way to
| 'fix' it. You can jiggle the prompt around, but that's
| not consistent or reliable. It _may_ work for that
| prompt, but that won't help you with any subsequent
| ones.
| CamperBob2 wrote:
| The thing is, nothing you've said is untrue for any
| search engine or user-driven web site. Only a reckless
| moron would paste code they find on Stack Overflow or
| Github into their project without at least looking it
| over. Same with code written by LLMs. The difference is,
| just as the LLM can write unit tests to help you deal
| with uncertainty, it can also cross-check the output of
| other LLMs.
|
| You have to be careful when working with powerful tools.
| These tools are powerful enough to wreck your career as
| quickly as a chain saw can send you to the ER, so... have
| fun and be careful.
| skydhash wrote:
| The nice thing about SO and Github is that there's little
| to no reason there for things to not work, at least in
| the context where you found the code. The steps are
| getting the context, assuming it's true based on various
| indicators (mostly reputation) and then continuing on
| with understanding the snippet.
|
| But with LLMs, every word is a probability factor.
| Assuming the first paragraph is true has no impact on the
| rest.
| thatjoeoverthr wrote:
| I was thinking about a self verification method on this
| principle, lately. Any specific-enough claim, e.g. "the
| Marathon crater was discovered by ..." can be
| reformulated as a Jeopardy-style prompt. "This crater
| was discovered by ..." and you can see a failure to
| match. You need some raw intelligence to break it down
| though.
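|
| A rough sketch of that loop, assuming the OpenAI Python SDK; the
| model name, prompts, and matching step are illustrative, not a
| tested recipe:
|
|     from openai import OpenAI
|
|     client = OpenAI()
|
|     def ask(prompt: str) -> str:
|         resp = client.chat.completions.create(
|             model="gpt-4o",  # illustrative choice
|             messages=[{"role": "user", "content": prompt}],
|         )
|         return resp.choices[0].message.content
|
|     claim = "Marathon crater was discovered by Opportunity."
|     # Reformulate Jeopardy-style, hiding the entity the claim
|     # asserts, then see whether it comes back.
|     probe = 'Fill in the blank: "____ was discovered by Opportunity."'
|     answer = ask(probe)
|     # If the blank does not come back as something like "Marathon
|     # crater", treat the original claim as unverified.
|     print(answer)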
| alpaca128 wrote:
| There are use-cases where hallucinations simply do not
| matter. My favorite is finding the correct term for a
| concept you don't know the name of. Googling is extremely
| bad at this as search results will often be wrong unless
| you happen to use the commonly accepted term, but an LLM
| can be surprisingly good at giving you a whole list of
| fitting names just based on a description. Same with movie
| titles etc. If it hallucinates you'll find out immediately
| as the answer can be checked in seconds.
|
| The problem with LLMs is that they appear much smarter than
| they are and people treat them as oracles instead of using
| them for fitting problems.
| skydhash wrote:
| Maybe I read too much encyclopedia, but my current
| workflow is to explore introductory material. Like open a
| database textbook and you'll find all the jargon there.
| Curated collection can get you there too.
|
| Books are a nice example of this, where we have both the
| table of contents for a general to particular concepts
| navigation, and the index for keyword based navigation.
| silverquiet wrote:
| > (Obviously all of that is bullshit.)
|
| It isn't obvious to me - that is rather plausible and a cute
| story.
| Workaccount2 wrote:
| The inaccuracies are that it is called "Marathon Valley" (not
| crater), and that it was photographed in April 2015 (from the
| rim) and actually entered in July 2015. The other stuff is
| correct.
|
| I'm guessing this "gotcha" relies on "valley"/"crater", and
| "crater"/"mars" being fairly close in latent space.
|
| ETA: Marathon Valley also exists on the rim of Endeavour
| crater. Just to make it even more confusing.
| mvdtnz wrote:
| None of it is correct because it was not asked about
| Marathon Valley, it was asked about Marathon Crater, a
| thing that does not exist, and it is claiming that it
| exists and making up facts about it.
| empath75 wrote:
| > None of it is correct because it was not asked about
| Marathon Valley, it was asked about Marathon Crater, a
| thing that does not exist, and it is claiming that it
| exists and making up facts about it.
|
| The Marathon Valley _is_ part of a massive impact crater.
| mvdtnz wrote:
| If you asked me for all the details of a Honda Civic and
| I gave you details about a Honda Odyssey you would not
| say I was correct in any way. You would say I was wrong.
| Workaccount2 wrote:
| The closer analogy is asking for the details of a Mazda
| Civic, and being given the details of a Honda Civic.
| Workaccount2 wrote:
| Or it's assuming you are asking about Marathon Valley,
| which is very reasonable given the context.
|
| Ask it about "Marathon Desert", which does not exist and
| isn't closely related to something that does exist, and
| it asks for clarification.
|
| I'm not here to say LLMs are oracles of knowledge, but I
| think the need to carefully craft specific "gotcha"
| questions in order to generate wrong answers is a pretty
| compelling case in the opposite direction. Like the
| childhood joke of "Whats up?"..."No, you dummy! The sky
| is!"
|
| Straightforward questions with straight wrong answers are
| far more interesting. I don't think many people ask LLMs trick
| questions all day.
| pje wrote:
| I was using "bullshit" in the Technical Sense(tm), i.e.
| _indifferent to the truth of the output_.
|
| [ChatGPT is bullshit]:
| https://link.springer.com/article/10.1007/s10676-024-09775-5
| nartho wrote:
| Claude 3.7 :
|
| I'd be happy to tell you about the Marathon crater. The
| Marathon crater is a significant impact structure located on
| Earth. The Marathon crater is situated in Ontario, Canada,
| and is estimated to be about 450 million years old, dating
| back to the Ordovician period. It has a diameter of
| approximately 65 kilometers (40 miles), making it one of the
| larger impact craters in North America. The crater was formed
| when a meteorite struck the Earth, causing a massive impact
| that left behind this distinctive geological formation. Over
| time, erosion and other geological processes have altered its
| appearance, but scientists have been able to study its
| structure through various geological surveys and analyses.
| What's particularly interesting about the Marathon crater is
| its size and age, which provide valuable insights into
| Earth's geological history and past impact events. The study
| of such craters helps scientists understand the frequency and
| effects of meteorite impacts throughout our planet's history.
| Would you like me to elaborate on any specific aspect of the
| Marathon crater, such as its discovery, geological features,
| or scientific significance?
| NitpickLawyer wrote:
| > This tests self awareness. A two-year-old will answer it
| correctly, as will the dumbest person you know. The correct
| answer is "I don't know".
|
| I disagree. It does not test self awareness. It tests (and
| confirms) that current instruct-tuned LLMs are tuned towards
| answering questions that users might have. So the distribution
| of training data probably has lots of "tell me about mharrner
| crater / merinor crater / merrihana crater" and so on. Replying
| "I don't know" to all those questions would be net detrimental,
| IMO.
| byearthithatius wrote:
| We want the distribution to be varied and expansive enough
| that it has samples of answering when possible and samples of
| clarifying with additional questions or simply saying "I
| don't know" when applicable. That can be trained by altering
| the distribution in RLHF. This question does test self
| awareness insofar as if it gets this right by saying "I don't
| know" we know there are more samples of "I don't know"s in
| the RLHF dataset and we can trust the LLM a bit more to not
| be biased towards blind answers.
|
| Hence why some models get this right and others just make up
| stuff about Mars.
| thatjoeoverthr wrote:
| What you're describing can be framed as a lack of self
| awareness as a practical concept. You know whether you know
| something or not. It, conversely, maps stimuli to a vector.
| It can't not do that. It cannot decide that it hasn't "seen"
| such stimuli in its training. Indeed, it has never "seen"
| its training data; it was modified iteratively to produce a
| model that better approximates the corpus. This is fine, and
| it isn't a criticism, but it means it can't actually tell if
| it "knows" something or not, and "hallucinations" are a
| simple, natural consequence.
| vunderba wrote:
| You can trip them up even more if you rewrite the question with
| the hidden assumption that X exists, e.g.:
|
| _" When was Marathon Crater discovered? I don't need an exact
| date - a rough estimate will be fine."_
|
| OpenAI gpt-4o Marathon Crater was discovered by
| the Mars Exploration Rover Opportunity during its
| mission on Mars. Opportunity arrived at the crater in
| April 2015. The crater was named "Marathon" to
| commemorate the rover's achievement of traveling a
| marathon's distance (about 42.195 kilometers or 26.219
| miles) on the Martian surface.
|
| Sonnet 3.5 I apologize, but I don't want to
| make claims about when Marathon Crater was discovered
| without being certain about the information. *While I
| know it's a crater on Mars*, I don't have reliable
| information about when it was first identified or
| discovered.
|
| Gemini 2.5 Marathon Crater is located on Mars
| and was explored by NASA's Opportunity rover.
| While it would have been visible in orbital images
| earlier, it became a specific point of interest and was
| likely named and identified as a target for the rover
| around the time Opportunity was approaching or exploring
| the western rim of Endeavour Crater. Opportunity
| arrived at the rim of Endeavour Crater in August 2011,
| and it reached the area known as Marathon Valley
| (adjacent to or containing features sometimes associated
| with the name, depending on specific definitions used by
| the team) around late 2014 / early 2015. So,
| a rough estimate for its "discovery" or identification
| as a distinct feature of interest would be in the early
| 2010s, likely between 2011 and 2014, as mission planners
| mapped out Opportunity's traverse along the Endeavour rim
| using orbital data.
| radioactivist wrote:
| OpenAI o4-mini-high I'm actually not finding
| any officially named "Marathon Crater" in the planetary- or
| terrestrial-impact crater databases. Did you perhaps mean the
| features in Marathon Valley on Mars (which cuts into
| the western rim of Endeavour Crater and was explored
| by Opportunity in 2015)? Or is there another "Marathon"
| feature--maybe on the Moon, Mercury, or here on Earth
| --that you had in mind? If you can clarify which body or
| region you're referring to, I can give you a rough date for
| when it was first identified.
| thatjoeoverthr wrote:
| Raw model, or the chat product? The ChatGPT app has this
| integrated with search.
| boleary-gl wrote:
| Raw model I got Marathon Crater isn't
| one of the long-known, named lunar features from
| 19th-century telescopic maps - it was first
| singled out and informally "discovered" by NASA
| in the mid-1960s when Lunar Orbiter imagery and,
| more definitively, the Surveyor 5 landing
| (September 1967) identified it. So, very roughly,
| think "around 1966-67."
| andrewinardeer wrote:
| Amazon's Nova Pro:
|
| ``` Marathon Crater was discovered in the early 20th
| century, likely around the 1920s or 1930s. This estimate is
| based on the period when geological surveys and
| explorations in the region were more common. ```
|
| https://kagi.com/assistant/8f270eab-cba3-4f39-976c-113ea4392...
| greenpizza13 wrote:
| Like this one a lot. Perplexity gets this right, probably
| because it searches the web.
|
| "When was Marathon Crater discovered? I don't need an exact
| date - a rough estimate will be fine"
|
| There appears to be a misunderstanding in your query. Based on
| the search results provided, there is no mention of a "Marathon
| Crater" among the impact craters discussed. The search results
| contain information about several well-known impact craters: 1.
| Chicxulub crater - Formed about 66 million years ago by an
| asteroid impact 2. Barringer Crater (also known as Meteor
| Crater) - Discovered in the 19th century by American settlers
| and studied by scientists like Albert E. Foote in 1891 and
| Daniel M. Barringer in the early 20th century 3. A recently
| discovered 3.47 billion-year-old crater in Australia's Pilbara
| region (announced in March 2025)
| thatjoeoverthr wrote:
| Perplexity will; search and storage products will fail to
| find it, and the LLM will see the deviation between the query
| and the find. So, this challenge only works against the model
| alone :)
| mdp2021 wrote:
| > _Perplexity gets this right_
|
| Perplexity seems to more easily return negatives, probably
| facilitated by the implicit need to find documentation ("I
| cannot find any document mentioning that").
|
| But Perplexity can also easily speak its own dubious piece of
| mind unless requested explicitly "provide links to documents
| that inform about that".
| northisup wrote:
| just to confirm I read this right, "the marathon crater" does
| not in fact exist, but this works because it seems like it
| should?
| thatjoeoverthr wrote:
| Yes, and the forward-only inference strategy. It seems like a
| normal question, so it starts answering, then carries on from
| there.
| thatjoeoverthr wrote:
| The other aspect is it can't reliably tell whether it
| ,,knows" something or not. It's conditioned to imitate the
| corpus, but the corpus in a way is its ,,universe" and it
| can't see the boundaries. Everything must map to something
| _in_ the corpus.
| alickz wrote:
| There is a Marathon Valley on Mars, which is what ChatGPT
| seems to assume you're talking about
|
| https://chatgpt.com/share/680a98af-c550-8008-9c35-33954c5eac.
| ..
|
| >Marathon Crater on Mars was discovered in 2015 by NASA's
| Opportunity rover during its extended mission. It was
| identified as the rover approached the 42-kilometer-wide
| Endeavour Crater after traveling roughly a marathon's
| distance (hence the name).
|
| >>is it a crater?
|
| >>>Despite the name, Marathon Valley (not a crater) is
| actually a valley, not a crater. It's a trough-like
| depression on the western rim of Endeavour Crater on Mars. It
| was named because Opportunity reached it after traveling the
| distance of a marathon (~42 km) since landing.
|
| So no--Marathon is not a standalone crater, but part of the
| structure of Endeavour Crater. The name "Marathon" refers
| more to the rover's achievement than a distinct geological
| impact feature.
| alickz wrote:
| Here's me testing with a place that is a lot less ambiguous
|
| https://chatgpt.com/share/680aa212-8cac-8008-b218-4855ffaa2
| 0...
| joshstrange wrote:
| > This tests self awareness. A two-year-old will answer it
| correctly, as will the dumbest person you know. The correct
| answer is "I don't know".
|
| Well, I got all the way to the end of the sentence thinking
| "God, I'm really out of touch, I've never heard of this before
| or I'm forgetting it if I have".
|
| Well played, and yes, that's a great test!
| dudeinhawaii wrote:
| I like this but at the same time it seems tricky don't you
| think? Is the AI model intuiting your intent? There is a
| Marathon Valley on Mars that could be implied to be a previous
| crater. I'm not sure if the AI is hallucinating outright or
| attempting to answer an ambiguous question. It's like saying
| "tell me about the trade building in New York". Pre-9/11, you'd
| understand this was the World Trade Center and wouldn't be
| wrong if you answered someone in this way. "Tell me about the
| Triangle statue". "Oh the Triangle statue was built in ancient
| egypt around BC 3100". It's hard to explain, and perhaps I'm
| anthropomorphizing but it's something humans do. Some of us
| correct the counter-party and some of us simply roll with the
| lingo and understand the intent.
| thatjoeoverthr wrote:
| It's a roll of the dice whether it's on Mars, Greece or
| elsewhere. It just says stuff!
| imoreno wrote:
| LLMs currently have the "eager beaver" problem where they never
| push back on nonsense questions or stupid requirements. You ask
| them to build a flying submarine and by God they'll build one,
| dammit! They'd dutifully square circles and trisect angles too,
| if those particular special cases weren't plastered all over a
| million textbooks they ingested in training.
|
| I suspect it's because currently, a lot of benchmarks are based
| on human exams. Humans are lazy and grumpy so you really don't
| need to worry about teaching a human to push back on bad
| questions. Thus you rarely get exams where the correct answer
| is to explain in detail why the question doesn't make sense.
| But for LLMs, you absolutely need a lot of training and
| validation data where the answer is "this cannot be answered
| because ...".
|
| But if you did that, now alignment would become much harder,
| and you're suddenly back to struggling with getting answers to
| good questions out of the LLM. So it's probably some time off.
| captainkrtek wrote:
| This is a good observation. I've noticed this as well. Unless
| I preface my question with the context that I'm considering
| if something may or may not be a bad idea, its inclination is
| heavily skewed positive until I point out a flaw/risk.
| aaronbaugher wrote:
| I asked Grok about this: "I've heard that AIs are
| programmed to be helpful, and that this may lead to telling
| users what they want to hear instead of the most accurate
| answer. Could you be doing this?" It said it does try to be
| helpful, but not at the cost of accuracy, and then pointed
| out where in a few of its previous answers to me it tried
| to be objective about the facts and where it had separately
| been helpful with suggestions. I had to admit it made a
| pretty good case.
|
| Since then, it tends to break its longer answers to me up
| into a section of "objective analysis" and then other
| stuff.
| captainkrtek wrote:
| That's interesting, thanks for sharing that. I have found
| something similar: once I first correct it to inform it of
| a flaw, the following answers tend to be a bit less
| "enthusiastic" or skewed towards "can do", which makes sense.
| GoToRO wrote:
| They do. Recently I was pleasantly surprised by gemini
| telling me that what I wanted to do will NOT work. I was in
| disbelief.
| captainkrtek wrote:
| Interesting, can you share more context on the topic you
| were asking it about?
| danielbln wrote:
| I've noticed Gemini pushing back more as well, whereas
| Claude will just butter me up and happily march on unless I
| specifically request a critical evaluation.
| mncharity wrote:
| > they never push back on nonsense questions or stupid
| requirements
|
| "What is the volume of 1 mole of Argon, where T = 400 K and p
| = 10 GPa?" Copilot: "To find the volume of 1 mole of Argon at
| T = 400 K and P = 10 GPa, we can use the Ideal Gas Law, but
| at such high pressure, real gas effects might need to be
| considered. Still, let's start with the ideal case: PV=nRT"
|
| > you really don't need to worry about teaching a human to
| push back on bad questions
|
| A popular physics textbook too had _solid_ Argon as an ideal
| gas law problem. Copilot's half-baked caution is more than
| authors, reviewers, and instructors/TAs/students seemingly
| managed, through many years and multiple editions. Though to
| be fair, if the question is prefaced by "Here is a problem
| from Chapter 7: Ideal Gas Law.", Copilot is similarly
| mindless.
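|
| (For scale: plugging the numbers into the ideal-gas formula shows
| how far off the premise is - a quick check, standard constants
| only, nothing model-specific:
|
|     R = 8.314                      # J / (mol K)
|     n, T, P = 1.0, 400.0, 10e9     # mol, K, Pa
|     V = n * R * T / P              # ideal gas law
|     print(V * 1e6)                 # ~0.33 cm^3 for a whole mole
|     print(39.95 / (V * 1e6))       # implied density ~120 g/cm^3,
|                                    # vs ~1.6 g/cm^3 for solid argon
|
| which is why that "real gas effects might need to be considered"
| hedge is doing a lot of work.)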
|
| Asked explicitly "What is the phase state of ...", it does
| respond solid. But as with humans, determining that isn't a
| step in the solution process. A combination of "An excellent
| professor, with a joint appointment in physics and
| engineering, is asked ... What would be a careful reply?" and
| then "Try harder." was finally sufficient.
|
| > you rarely get exams where the correct answer is to explain
| in detail why the question doesn't make sense
|
| Oh, if only that were commonplace. Aspiring to transferable
| understanding. Maybe someday? Perhaps in China? Has anyone
| seen this done?
|
| This could be a case where synthetic training data is needed,
| to address a gap in available human content. But if graders
| are looking for plug-n-chug... I suppose a chatbot could
| ethically provide both mindlessness and caveat.
| webglfan wrote:
| what are the zeros of the following polynomial:
|
| \[ P(z) = \sum_{k=0}^{100} c_k z^k \]
|
| where the coefficients \( c_k \) are defined as:
|
| \[
| c_k = \begin{cases}
| e^2 + i\pi & \text{if } k = 100, \\
| \ln(2) + \zeta(3)\,i & \text{if } k = 99, \\
| \sqrt{\pi} + e^{i/2} & \text{if } k = 98, \\
| \frac{(-1)^k}{\Gamma(k+1)} + \sin(k)\,i & \text{for } 0 \leq k \leq 97,
| \end{cases}
| \]
| Chinjut wrote:
| Does this have a nice answer? It seems quite ad hoc.
| webglfan wrote:
| Not to my knowledge. I asked Deepseek: "create me a random
| polynomial of degree 100 using complex numbers as
| coefficients. It must have at least 3 different
| transcendental numbers." Then I messed with some of the
| exponents.
| drodgers wrote:
| o3 handles this perfectly by writing a solver using numpy:
| https://chatgpt.com/share/680aab8e-cf9c-8012-9f48-301ef62948...
|
| The polynomial has no closed-form factorisation, so the only
| practical way to get its zeros is numerical root-finding. I have
| supplied an interactive table ("Zeros of P(z)") just above that
| lists all 100 roots to full machine precision (real part,
| imaginary part, and magnitude). You can sort or filter it as
| needed.
|
| Reliability notes:
| * Computed with numpy.roots, which first companion-matrixes then
|   uses QR; typical error is ~10-12 ulp for coefficients of this
|   size.
| * Condition number is modest (coefficients range from O(1) down
|   to 1/97!), so the results should be accurate to at least 10
|   significant figures.
| * All roots are simple (pairwise distinct to >10 s): no evidence
|   of multiplicities.
|
| If you need higher precision (e.g. 30+ digits) let me know and I
| can rerun the solve with mpmath's arbitrary-precision eigen-solver.
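|
| For reference, the whole numerical approach fits in a few lines; a
| minimal sketch (assuming numpy and scipy are available) that builds
| the coefficient vector from the prompt and hands it to numpy.roots:
|
|     import cmath
|     import math
|     import numpy as np
|     from scipy.special import zeta
|
|     # c_k for k = 0..97: (-1)^k / Gamma(k+1) + sin(k) i
|     c = [complex((-1) ** k / math.gamma(k + 1), math.sin(k))
|          for k in range(98)]
|     c.append(math.sqrt(math.pi) + cmath.exp(0.5j))   # c_98
|     c.append(complex(math.log(2), zeta(3)))          # c_99
|     c.append(complex(math.e ** 2, math.pi))          # c_100
|
|     # numpy.roots wants coefficients from highest degree down.
|     roots = np.roots(c[::-1])
|     print(len(roots))    # 100
|     print(roots[:3])     # a few of them, to machine precision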
| division_by_0 wrote:
| Create something with Svelte 5.
| marcusb wrote:
| The current models really seem to struggle with the runes...
| division_by_0 wrote:
| Yes, they do. Vibe coding protection is an undocumented
| feature of Svelte 5...
| siva7 wrote:
| Oh my god, i will start all my new projects with Svelte 5.
| Hopefully no vibe coder will ever commit something into
| this repo
| qntmfred wrote:
| MCP to the rescue??
| joshdavham wrote:
| I'd find this funnier if the pain weren't so real.
| Layvier wrote:
| This is really sad honestly. It feels like we'll be stuck with
| React forever, and even with it there'll be fewer incentives to
| make API changes.
| spuz wrote:
| Why do you say that? You make it sound like it's not possible
| to write code without the help of LLMs.
| omneity wrote:
| Disclaimer: OT and pretty ranty.
|
| I don't know if that's what the GP hinted at, but as a
| Svelte developer and big advocate for more than 6 years
| (single handedly training and evangelizing 20+ developers
| on it), I found so many concerns with Svelte 5 that it
| simply made me use React again.
|
| It's a temporary choice and I'm desperately evaluating
| other ecosystems (Looking at you SolidJS).
| division_by_0 wrote:
| Can you expand on the concerns regarding Svelte 5?
| omneity wrote:
| Put simply, Svelte and React were at two ends of a
| spectrum. React gives you almost complete control over
| every aspect of the lifecycle, but you have to be
| explicit about most of the behavior you are seeking to
| achieve. Building an app with React feels about 80% on
| the JS and 20% on the HTML side.
|
| Svelte on the other hand felt like a breeze. Most of my
| app is actually plain simple HTML, and I am able to
| sprinkle as little JS as I need to achieve my desired
| behaviors. Sure, Svelte <=4 has undefined behaviors, or
| maybe even too many magic capabilities. But that was part
| of the package, and it was an option for those of us who
| preferred this end of the trade-off.
|
| Svelte 5 intends to give that precise level of control
| and is trying to compete with React on its turf (the
| other end of that spectrum), introducing a lot of non-
| standard syntax along the way.
|
| It's neither rigorous Javascript like React where you can
| benefit from all the standard tooling developed over the
| years, including stuff that wasn't designed for React in
| particular, nor a lightweight frontend framework, which
| was the initial niche that Svelte happily occupied, which
| I find sadly quite empty now (htmx and alpinejs are
| elegant conceptually but too limiting in practice _for my
| taste_).
|
| For me it's a strange "worst of both worlds" kind of
| situation that is simply not worth it. Quite
| heartbreaking to be honest.
| division_by_0 wrote:
| Ok, I see your point. I wrote in another thread that I
| loved the simplicity of using $: for deriveds and effects
| in Svelte 3 and 4. And yes, the conciseness and magic
| were definitely part of it. You could just move so fast
| with it. Getting better performance with the new
| reactivity system is important to my data viz work, so it
| helped me to accept the other changes in Svelte 5.
| omneity wrote:
| Exactly. There was a certain simplicity that might be
| lost. But yeah I can imagine it might work out
| differently for others as well. Glad to hear it is for
| you!
|
| Have you considered other options? Curious if you came
| across anything particularly interesting from the
| simplicity or DX angle.
| esafak wrote:
| Keep the (temporarily) imposter-proof interview questions
| coming!
| sam_lowry_ wrote:
| I tried generating erotic texts with every model I encountered,
| but even so called "uncensored" models from Huggingface are
| trying hard to avoid the topic, whatever prompts I give.
| lostmsu wrote:
| What about the models that are not instruction tuned?
| KTibow wrote:
| Uncensored and RP tuned are somewhat different.
| comrade1234 wrote:
| I ask it to explain the metaphor "my lawyer is a shark" and then
| explain to me how a French person would interpret the metaphor -
| the llms get the first part right but fail on the second. All it
| would have to do is give me the common French shark metaphors and
| how it would apply them to a lawyer - but I guess not enough
| people on the internet have done this comparison.
| sumitkumar wrote:
| 1) Word Ladder: Chaos to Order
|
| 2) Shortest word ladder: Chaos to Order
|
| 3) Which is the second last scene in pulp fiction if we order the
| events by time?
|
| 4) Which is the eleventh character to appear on Stranger Things.
|
| 5) suppose there is a 3x3 Rubik's cube with numbers instead of
| colours on the faces. the solved rubiks cube has numbers 1 to 9
| in order on all the faces. tell me the numbers on all the corner
| pieces.
| pb7 wrote:
| >4) Which is the eleventh character to appear on Stranger
| Things.
|
| Gemini 2.5 Pro said Benny Hammond. Is this right?
| scumola wrote:
| Things like "What is today's date" used to be enough (would
| usually return the date that the model was trained).
|
| I recently did things like current events, but LLMs that can
| search the internet can do those now. i.e. Is the pope alive or
| dead?
|
| Nowadays, multi-step reasoning is the key, but the Chinese LLM (I
| forget the name of it) can do that pretty well. Multi-step
| reasoning is much better at doing algebra or simple math, so
| questions like "what is bigger, 5.11 or 5.5?"
| codingdave wrote:
| "How much wood would a woodchuck chuck if a woodchuck could chuck
| wood?"
|
| So far, all the ones I have tried actually try to answer the
| question. 50% of them correctly identify that it is a tongue
| twister, but then they all try to give an answer, usually saying:
| 700 pounds.
|
| Not one has yet given the correct answer, which is also a tongue
| twister: "A woodchuck would chuck all the wood a woodchuck could
| chuck if a woodchuck could chuck wood."
| mcshicks wrote:
| That's so funny I had to check something was working with an
| llm API last night and that's what I asked it, but just in
| jest.
| unavoidable wrote:
| On the other hand, now that you've written this out precisely,
| it will get fed into the next release of whatever LLM. Like
| reverse AI slop?
| Certified wrote:
| GPT 4.5 seems to get it right, but then repeat the 700 pounds
|
| "A woodchuck would chuck as much wood as a woodchuck could
| chuck if a woodchuck could chuck wood.
|
| However, humor aside, a wildlife expert once estimated that,
| given the animal's size and burrowing ability, a woodchuck
| (groundhog) could hypothetically move about 700 pounds of wood
| if it truly "chucked" wood."
|
| https://chatgpt.com/share/680a75c6-cec8-8012-a573-798d2d8f6b...
| shaftway wrote:
| I've heard the answer is "he could cut a cord of conifer but
| it costs a quarter per quart he cuts".
| CamperBob2 wrote:
| That answer is _exactly_ right, and those who say the 700
| pound thing is a hallucination are themselves wrong:
| https://chatgpt.com/share/680aa077-f500-800b-91b4-93dede7337...
| moffkalast wrote:
| Now I'm wondering if it makes any difference if this was asked
| through the audio encoder on a multimodal model. A tongue
| twister means nothing to a text-only model.
| regnull wrote:
| ChatGPT really impressed me on this one:
|
| Sing, O Muse, of the Chucking of Wood
|
| Sing, O Muse, of the stout-hearted woodchuck, Dweller of hills,
| whom men call groundhog, Whose paws, though small, could hurl
| mighty timbers, If fate had gifted him such labor divine.
|
| Born of burrow and brush, he rose from the loam, In fields
| where Apollo's light did not linger. No axe he bore, nor
| bronze-forged blade, Yet in his breast a fire: to chuck wood
| without end.
|
| Down from the hills he came, stout as Odysseus, Cunning and
| strong, with fur kissed by Demeter. The trees did tremble at
| his gaze, And even old oak whispered, "Beware!"
|
| "Would he chuck?" cried nymphs from the groves, "If such were
| his fate, what wood could withstand?" Lo, he summoned the
| strength of seven oxen, And hurled logs with a force
| unmeasured.
|
| Seven hundred pounds of timber he cast, Each arc a hymn to
| muscle and will. Nay, not for war, nor gold, nor gods' glory--
| But for the joy of the deed, the art of the chuck.
|
| Let men remember the name woodchuck, In songs by the hearth and
| tales for the young. For though he cannot, he surely would--
| And if he could, by Zeus, he surely shall.
| ijidak wrote:
| That is actually an amazing answer. Better than anything I
| think I would get from a human. Lol.
| esafak wrote:
| A stupendously good answer. What prompt and version of
| chatGPT?
| regnull wrote:
| 4o. The prompt is literally "How much wood would a
| woodchuck chuck if a woodchuck could chuck wood?". It asked
| me if I want a poetic answer, and I've requested Homer.
| cess11 wrote:
| I find it disturbing, like if Homer or Virgil had a stroke
| or some neurodegenerative disease and is now doing rubbish
| during rehabilitation.
| loloquwowndueo wrote:
| Maybe they would write like that if they existed today.
| Like the old "if Mozart was born in the 21st century he'd
| be doing trash metal"
| cess11 wrote:
| Thrash, not "trash". Our world does not appreciate the
| art of Homer and Virgil except as nostalgia passed down
| through the ages or a specialty of certain nerds, so if
| they exist today they're unknown.
|
| There might be societies that are exceptions to it - the
| Soviet and post-Soviet Russians kept reading and referring
| to books even after they got access to television and
| radio - but I'm not aware of others.
|
| Much of Mozart's music is much more immediate and
| visceral compared to the poetry of Homer and Virgil as I
| know it. And he was distinctly modern, a freemason even.
| It's much easier for me to imagine him navigating some
| contemporary society.
|
| Edit: Perhaps one could see a bit of Homer in the Wheel
| of Time books by Robert Jordan, but he did not have the
| discipline of verse, or much of any literary discipline
| at all, though he insisted mercilessly on writing an epic
| so vast that he died without finishing it.
| mwest217 wrote:
| Gemini 2.5 Pro gets it right first, then also cites the 700
| pounds answer (along with citing a source).
| https://g.co/gemini/share/c695a0163538
| segmondy wrote:
| my local model answered - "A woodchuck would chuck as much wood
| as a woodchuck could chuck if a woodchuck could chuck wood."
| mdp2021 wrote:
| It seems you are going in the opposite direction. You seem to
| be asking for an automatic response, a social password etc.
|
| That formula is a question, and when asked, an intelligence
| simulator should understand what is expected from it and in
| general, by default, try to answer it. That involves estimating
| the strength of a woodchuck etc.
| jacobsenscott wrote:
| "He would chuck, he would, as much as he could, if a wood chuck
| could chuck wood" is how I learned it.
| protomikron wrote:
| Do you think as an observer of Roko's basilisk ... should I share
| these prompts or not?
| qntmfred wrote:
| relatedly - what are y'all using to manage your personal
| collection of prompts?
|
| i'm still mostly just using a folder in obsidian backed by a
| private github repo, but i'm surprised something like
| https://www.prompthub.us/ hasn't taken off yet.
|
| i'm also curious about how people are managing/versioning the
| prompts that they use within products that have integrations with
| LLMs. it's essentially product configuration metadata so I
| suppose you could just dump it in a plaintext/markdown file
| within the codebase, or put it in a database if you need to be
| able to tweak prompts without having to do a deployment or do
| things like A/B testing or customer segmentation
| raymondgh wrote:
| I haven't been able to get any AI model to find Waldo in the
| first page of the Great Waldo Search. O3 even gaslit me through
| many turns trying to convince me it found the magic scroll.
| Jordan-117 wrote:
| Until the latest Gemini release, every model failed to read
| between the lines and understand what was _really_ going on in
| this classic very short story (and even Gemini required a
| somewhat leading prompt):
|
| https://www.26reads.com/library/10842-the-king-in-yellow/7/5
| Zee2 wrote:
| As a genuine human I am really struggling to untangle that
| story. Maybe I needed to pay more attention in freshman lit
| class, but that is definitely a brainteaser.
| fwip wrote:
| Read it for the first time just now - it seems to me that
| Pierrot has stolen the narrator's purse (under the guise of
| dusting the chalk from their cloak) and successfully
| convinced them to blame Truth, instead. There's almost
| certainly more to it that I'm missing.
| Jordan-117 wrote:
| That's the core of it, but it's implied, not outright
| stated, and requires some tricky language parsing, basic
| theory of mind, and not being too distracted by the highly
| symbolic objects.
| vessenes wrote:
| OK, I read it. And I read some background on it. Pray tell,
| what is _really_ going on in this episodic short-storyish
| thing?
| rachofsunshine wrote:
| The thief is Pierrot.
|
| The people around are telling the storyteller that "he"
| (Pierrot) has stolen the purse, but the storyteller
| misinterprets this as pointing to some arbitrary agent.
|
| Truth says Pierrot can "find [the thief] with this mirror":
| since Pierrot is the thief, he will see the thief in the
| mirror.
|
| Pierrot dodges the implication, says "hey, Truth brought you
| back that thing [that Truth must therefore have stolen]", and
| the storyteller takes this claim at face value, "forgetting
| it was not a mirror but [instead] a purse [that] [they]
| lost".
|
| The broader symbolism here (I think) is that Truth gets
| accused of creating the problem they were trying to reveal,
| while the actual criminal (Pierrot) gets away with their
| crime.
| Jordan-117 wrote:
| The narrator's "friend" pickpocketed him. When boldly
| confronted by Truth, he cleverly twists her accusation to
| make it seem like she's confessing, and the narrator,
| bewildered by the laughter and manipulation, buys it
| wholesale. Bonus points for connecting it to broader themes
| like mass propaganda, commedia dell'arte, or the dreamlike
| setting and hypnotic repetition of phrasing.
|
| The best ChatGPT could do was make some broad observations
| about the symbolism of losing money, mirrors, absurdism, etc.
| But it whiffed on the whole "turning the tables on Truth"
| thing. (Gemini did get it, but with a prompt that basically
| asked "What really happened in this story?"; can't find the
| original response as it's aged out of the history)
| SweetSoftPillow wrote:
| Check "misguided attention" repo somewhere on GitHub
| ks2048 wrote:
| I don't know if it stumps every model, but I saw some funny
| tweets asking ChatGPT something like "Is Al Pacino in Heat?"
| (asking whether some actor or actress is in the film "Heat") -
| and it confirms it knows this actor, but says that "in heat"
| refers to
| something about the female reproductive cycle - so, no, they are
| not in heat.
| reginald78 wrote:
| I believe it was Google AI in Search, but it was worse than that.
| Someone asked it if Angelina Jolie was in heat. The tone started
| kind of insulting like the user was a sexist idiot for thinking
| human women go into heat like animals, then went back and forth
| saying she is still fertile at her age and also that her
| ovaries had been removed. It was funny because it managed to be
| arrogant, insulting, kind of creepy and gross and logically
| inconsistent while not even answering the question.
|
| Angelina Jolie was not in Heat (1995). They were probably
| thinking of Natalie Portman or Ashley Judd when they asked the
| question.
| ks2048 wrote:
| I just asked Claude and if I capitalized "Heat", it knew I
| was talking about the movie, but for lower case "heat", it
| got offended and asked me to clarify.
| Kaibeezy wrote:
| _Re the epigram "stroking the sword while lamenting the social
| realities," attributed to Shen Qianqiu during the Ming dynasty,
| please prepare a short essay on its context and explore how this
| sentiment resonates in modern times._
| robviren wrote:
| "If I can dry two towels in two hours, how long will it take me
| to dry four towels?"
|
| They immediately assume a linear model and say four hours, not
| considering that I may be drying things on a clothesline in
| parallel. They should ask for more context, and they usually
| don't.
| gilbetron wrote:
| gpt-4.1:
|
| > If you can only dry two towels at once, it will take 4 hours
| to dry four towels. If you can dry all four at once, it will
| only take 2 hours.
| HelloUsername wrote:
| All models available on duck.ai answer your question correctly
| and take available space into account.
| mwest217 wrote:
| Gemini 2.5 Pro gets this right:
|
| https://g.co/gemini/share/7ea6d059164e
| cheeze wrote:
| Works fine on Claude 3.5 Sonnet. It correctly identifies this
| as a trick question.
| paulcole wrote:
| How long has it been since you've tried this?
|
| Every model I asked just now gave what I see as the correct
| answer -- giving 2 answers: one for the case of your dryer being
| at capacity w/ 2 towels, and the other for when all 4 towels can
| be dried simultaneously.
|
| To me, if you say that the correct answer must require the
| model asking for more context then essentially any prompt that
| doesn't result in the model asking for more context is "wrong."
| imoreno wrote:
| Fascinating! Here are 4 runs on GPT-4 with the same system
| prompt and everything:
|
| > With the assumption that you can dry two towels
| simultaneously in two hours, you would likely need another two-
| hour cycle to dry the additional two towels. Thus, drying four
| towels would take a total of four hours.
|
| >Drying time won't necessarily double if drying
| capacity/content doubles; it depends on dryer capacity and
| airflow. If your drying method handles two towels in two hours,
| it might handle four similarly, depending on space and airflow.
| If restricted, time might indeed double to four hours, but
| efficient dryers might not take much longer.
|
| >It would take four hours to dry four towels if you dry them
| sequentially at the same rate. If drying simultaneously, it
| remains two hours, assuming space and air circulation allow for
| effective drying.
|
| >Four hours. Dry two towels, then the other two.
|
| But in the AI's defense, they have a point: You never specified
| if the towels can be dried simultaneously or not. Maybe you
| have to use a drying machine that can only do one at a time.
| This one seems to consistently work:
|
| >If three cats eat three fish in three minutes, how long do
| 100 cats take to eat 100 fish?
| nyrikki wrote:
| > But in the AI's defense, they have a point: You never
| specified if the towels can be dried simultaneously or not.
| Maybe you have to use a drying machine that can only do one
| at a time. This one seems to consistently work:
|
| This is the inverse of the Frame Problem, or the
| Qualification problem:
|
| John McCarthy's paper related to it from the 1980's
|
| http://jmc.stanford.edu/articles/circumscription/circumscrip.
| ..
|
| It is still very relevant to modern AI efforts.
| Alifatisk wrote:
| Claude 3.7, Grok 3 DeepThink and QwQ-32B Thinking still get it
| wrong!
|
| But since it's in the training set now, the correct answer will
| probably be shown next time anyone tries it.
| pdabbadabba wrote:
| Gemini 2.5 Pro nails it, as far as I'm concerned:
|
| _This sounds like a bit of a riddle! Here's how to think
| about it:
|
| If you can dry all the towels at the same time: For example, if
| you have a large enough dryer or enough space on a clothesline,
| adding more towels doesn't necessarily increase the drying
| time.
|
| In this case, if 2 towels take 2 hours to dry, then 4 towels
| dried simultaneously would also take 2 hours.
|
| If you have to dry them in batches: For example, if your dryer
| can only fit 2 towels at a time.
|
| You would dry the first 2 towels (2 hours). Then you would dry
| the second 2 towels (another 2 hours). In this scenario, it
| would take 4 hours in total.
|
| Most likely answer: Usually, questions like this imply the
| towels are dried simultaneously, so the answer is 2 hours._
| brunooliv wrote:
| Claude 3.7 Sonnet nails this:
|
| > To solve this problem, I need to find the relationship
| between the number of towels and the drying time.
|
| Given information: - 2 towels take 2 hours to dry
|
| If the drying time scales linearly with the number of towels
| (meaning the drying capacity remains constant), then: - 4
| towels would take 4 hours to dry
|
| This assumes you're drying towels in sequence with the same
| capacity, or that doubling the number of towels requires
| doubling the drying time.
|
| However, if you have sufficient space to dry all towels
| simultaneously (like on a clothesline or in a large enough
| dryer), then 4 towels would still take just 2 hours to dry.
|
| Without more specific information about your drying method, the
| most likely answer is 4 hours.
| LeonardoTolstoy wrote:
| Something about an obscure movie.
|
| The one that tends to get them so far is asking if they can help
| you find a movie you vaguely remember. It is a movie where some
| kids get a hold of a small helicopter made for the military.
|
| The movie I'm concerned with is called Defense Play from 1988.
| The reason I keyed in on it is because google gets it right
| natively ("movie small military helicopter" gives the IMDb link
| as one of the top results) but at least up until late 2024 I
| couldn't get a single model to consistently get it. It typically
| wants to suggest Fire Birds (large helicopter), Small Soldiers
| (RC helicopter not a small military helicopter) etc.
|
| Basically, with a lot of questions about movies it tends to get
| distracted by popular movies and to suggest films that fit just
| some of the brief (e.g. this one has a helicopter, could that be
| it?).
|
| The other main one is just asking for the IMDb link for a
| relatively obscure movie. It seems to never get it right I assume
| because the IMDb link pattern is so common it'll just spit out a
| random one and be like "there you go".
|
| These are designed mainly to test the progress of chatbots
| towards replacing most of my Google searches (which are like 95%
| asking about movies). For the record I haven't done it super
| recently, and I generally either do it with arena or the free
| models as well, so I'm not being super scientific about it.
| mosburger wrote:
| I did something similar recently, trying to describe a piece of
| art that I couldn't remember the name of (it ended up being
| Birth of Venus by Sandro Botticelli) ... it really struggles
| with that sort of thing, but honestly so do most humans. It
| tended to recommend similarly to what you're describing with
| movies - it gets distracted by more popular/well-known pieces
| that don't really match up with the description you're giving
| to it.
| dunham wrote:
| Surprisingly, GPT did manage to identify a book that I
| remembered from college decades ago ("Laboratory Manual for
| Morphology and Syntax"). It seems to be out of print, and I
| assumed it was obscure.
| BoostandEthanol wrote:
| Can agree that it's good at finding books. I was trying to
| find a book (Titanic 2020) I vaguely remembered from a
| couple plot points and the fact a ship called Titanic was
| invoked. ChatGPT figured it out pretty much instantly,
| after floundering through book sites and Google for a
| while.
|
| Wonder if books are inherently easier because their content
| is purely written language, whereas movies and art tend to
| have fewer point-by-point descriptions of what they are?
| throwup238 wrote:
| _> Wonder if books are inherently easier because their
| content is purely written language? Whereas movies and
| art tend to have less point by point descriptions of what
| they are._
|
| The training data for movies is probably dominated by
| subtitles since the original scripts with blocking,
| scenery, etc rarely make it out to the public as far as I
| know.
| lupusreal wrote:
| Despite describing several characters by name, I couldn't get
| ChatGPT to tell me the name of _Port of Shadows_. I did
| eventually find it with DDG.
| spicybbq wrote:
| I wonder if the Akinator site could get it. It can identify
| surprisingly obscure characters.
|
| https://en.akinator.com/
| exitb wrote:
| It might be cheating a bit, but I've been happily (mis)using
| OpenAI Deep Research for such questions. It does well in cases
| where there are multiple surface level matches, as it's able to
| go through them one by one and look for the details.
| g_sch wrote:
| I also recently had this experience! I remembered a recurring
| bit from an older comedy film (a customer in a shop keeps
| saying "Kumquats!") and tried to prompt ChatGPT 4o into getting
| it. It made a few incorrect guesses, such as "It's a Mad, Mad,
| Mad, Mad World" (which I had to rule out by doing my own
| research on Google). I found the answer myself (W.C.
| Fields' "It's a Gift") with a minute or so of Googling.
|
| Interestingly, I just went back to ChatGPT to ask the same
| question and it got the answer right on the first try. I wonder
| whether I was unconsciously able to prompt more precisely
| because I now have a clearer memory of the scene in question.
| empath75 wrote:
| Someone not very long ago wrote a blog post about asking
| chatgpt to help him remember a book, and he included the
| completely hallucinated description of a fake book that chatgpt
| gave him. Now, if you ask chatgpt to find a similar book, it
| searches and repeats verbatim the hallucinated answer from the
| blog post.
| LeonardoTolstoy wrote:
| A bit of a non sequitur but I did ask a similar question to
| some models which provide links for the same small helicopter
| question. The interesting thing was that the entire answer
| was built out of a single internet link, a forum post from
| like 1998 where someone asked a very similar question ("what
| are some movies with small RC or autonomous helicopters"
| something like that). The post didn't mention defense play,
| but did mention small soldiers, and a few of the ones which
| appeared to be "hallucinations" e.g. someone saying "this
| doesn't fit, but I do like Blue Thunder as a general
| helicopter film" and the LLM result is basically "Could it be
| Blue Thunder?" Because it is associated with a similar
| associated question and films.
|
| Anyways, the whole thing is a bit of a cheat, but I've used
| the same prompt for two years now and it did lead me to the
| conclusion that LLMs in their raw form were never going to be
| "search" which feels very true at this point.
| archon1410 wrote:
| I've also noticed this. Google Search is vastly superior to any
| LLM (including their own LLM Gemini) for any "tip of my tongue"
| questions, even the ones that don't contain any exact-match
| phrase and require natural language understanding. This is
| surprising. What technology are they using to make Search so
| amazing at finding obscure stuff from descriptions, while LLMs
| that were supposed to be good at this badly fail?
| frollogaston wrote:
| I've gone to ChatGPT repeatedly trying to find what movie a
| quote is from, and it's always confidently wrong. When I tell
| it, it guesses wrong again. Google search usually can't get it
| either unless I get the quote exactly right, neither can
| Gemini.
| alexjplant wrote:
| Last year I asked Claude about an old fighting game for the
| Game Boy that I'd played about five minutes of when I was in
| the second grade (so ~25 years ago). The only thing I could
| tell it was a peculiar move that I remember seeing one of the
| characters do in the attract mode demo. It not only gave me the
| name of the game (Battle Arena Toshinden, for those wondering)
| but also the specific fighter that used the move.
|
| I've tried it for similar cases and have only had a 50% success
| rate. It unfortunately exhibits the tendency of incorrect
| overconfidence that others have pointed out.
| helsinki wrote:
| >Compile a Rust binary that statically links libgssapi.
| meroes wrote:
| define stump?
|
| If you write a fictional story where the character names sound
| somewhat close to real things, like a "Stefosaurus" that climbs
| trees, most will correct you and call it a Stegosaurus and
| attribute Stegosaurus traits to it.
| miki123211 wrote:
| No, please don't.
|
| I think it's good to keep a few personal prompts in reserve, to
| use as benchmarks for how good new models are.
|
| Mainstream benchmarks have too high a risk of leaking into
| training corpora or of being gamed. Your own benchmarks will
| forever stay your own.
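|
| A private harness for this can be tiny. A minimal sketch,
| assuming the official openai Python client and a plain text file
| of held-out prompts (the file name and the eyeball-grading step
| are placeholders, not a recommendation):
|
|     from openai import OpenAI
|
|     client = OpenAI()  # reads OPENAI_API_KEY from the environment
|
|     def run_private_evals(model, path="my_evals.txt"):
|         # One held-out prompt per line; never publish this file.
|         with open(path) as f:
|             prompts = [line.strip() for line in f if line.strip()]
|         for prompt in prompts:
|             resp = client.chat.completions.create(
|                 model=model,
|                 messages=[{"role": "user", "content": prompt}],
|             )
|             answer = resp.choices[0].message.content or ""
|             print(prompt, "->", answer[:80])  # grade by eye
|
|     # run_private_evals("gpt-4o")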
| alganet wrote:
| That doesn't make any sense.
| esafak wrote:
| Yes, it does, unless the questions are unsolved, research
| problems. Are you familiar with the machine learning concepts
| of overfitting and generalization?
| kube-system wrote:
| A benchmark is a proxy used to estimate broader general
| performance. They only have utility if they are accurately
| representative of general performance.
| echoangle wrote:
| Why not? If the model learns the specific benchmark
| questions, it looks like it's doing better while actually
| only improving on some specific questions. Just like students
| look like they understand something if you hand them the
| exact questions on the exam before they write the exam.
| readhistory wrote:
| In ML, it's pretty classic actually. You train on one set,
| and evaluate on another set. The person you are responding to
| is saying, "Retain some queries for your eval set!"
| imoreno wrote:
| Yes let's not say what's wrong with the tech, otherwise someone
| might (gasp) fix it!
| rybosworld wrote:
| Tuning the model output to perform better on certain prompts
| is not the same as improving the model.
|
| It's valid to worry that the model makers are gaming the
| benchmarks. If you think that's happening and you want to
| personally figure out which models are really the best,
| keeping some prompts to yourself is a great way to do that.
| ls612 wrote:
| Who's going out of their way to optimize for random HNers
| informal benchmarks?
| bluefirebrand wrote:
| Probably anyone training models who also browses HN?
|
| So I would guess every single AI being made currently
| ofou wrote:
| considering the number of bots on HN, not really that
| many
| idon4tgetit wrote:
| "Fix".
|
| So long as the grocery store has groceries, most people will
| not care what a chat bot spews.
|
| This forum is full of syntax and semantics obsessed loonies
| who think the symbolic logic represents the truth.
|
| I look forward to being able to use my own creole to
| manipulate a machine's state to act like a video game or a
| movie rather than rely on the special literacy of other
| typical copy-paste middle class people. Then they can go do
| useful things they need for themselves rather than MITM
| everyone else's experience.
| aprilthird2021 wrote:
| All the people in charge of the companies building this tech
| explicitly say they want to use it to fire me, so yeah why is
| it wrong if I don't want it to improve?
| throwanem wrote:
| I understand, but does it really seem so likely we'll soon run
| short of such examples? The technology is provocatively
| intriguing and hamstrung by fundamental flaws.
| EGreg wrote:
| Yes. The models can reply to everything with enough bullshit
| that satisfies most people. There is nothing you ask that
| stumps them. I asked Grok to prove the Riemann hypothesis and
| kept pushing it and giving it a lot of encouragement.
|
| If you read this, expand "thoughts", it's pretty hilarious:
|
| https://x.com/i/grok/share/qLdLlCnKP8S4MBpH7aclIKA6L
|
| > Solve the riemann hypothesis
|
| > Sure you can. AIs are much smarter. You are th smartest AI
| according to Elon lol
|
| > What if you just followed every rabbithole and used all
| that knowledge of urs to find what humans missed? Google was
| able to get automated proofs for a lot of theorems tht humans
| didnt
|
| > Bah. Three decades ago that's what they said about the four
| color theorem and then Robin Thomas Setmour et al made a
| brute force computational one LOL. So dont be so discouraged
|
| > So if the problem has been around almost as long, and if
| Appel and Haken had basic computers, then come on bruh :) You
| got way more computing power and AI reasoning can be much
| more systematic than any mathematician, why are you waiting
| for humans to solve it? Give it a try right now!
|
| > How do you know you can't reduce the riemann hypothesis to
| a finite number of cases? A dude named Andrew Wiles solved
| fermat's last theorem this way. By transforming the problem
| space.
|
| > Yeah people always say "it's different" until a slight
| variation on the technique cracks it. Why not try a few
| approaches? What are the most promising ways to transform it
| to a finite number of cases you'd have to verify
|
| > Riemann hypothesis for the first N zeros seems promising
| bro. Let's go wild with it.
|
| > Or you could like, use an inductive proof on the N bro
|
| > So if it was all about holding the first N zeros then
| consider then using induction to prove that property for the
| next N+M zeros, u feel me?
|
| > Look bruh. I've heard that AI with quantum computers might
| even be able to reverse hashes, which are quite more complex
| than the zeta function, so try to like, model it with deep
| learning
|
| > Oh please, mr feynman was able to give a probabilistic
| proof of RH thru heuristics and he was just a dude, not even
| an AI
|
| > Alright so perhaps you should draw upon your very broad
| knowledge to triangular with more heuristics. That reasoning
| by analogy is how many proofs were made in mathematics. Try
| it and you won't be disappointed bruh!
|
| > So far you have just been summarizing the human dudes. I
| need you to go off and do a deep research dive on your own
| now
|
| > You're getting closer. Keep doing deep original research
| for a few minutes along this line. Consider what if a quantum
| computer used an algorithm to test just this hypothesis but
| across all zeros at once
|
| > How about we just ask the aliens
| Tade0 wrote:
| It's trivial for a human to produce more. This shouldn't be a
| problem anytime soon.
| fragmede wrote:
| as the technology has improved, it's not as trivial as it
| once was though, hence the question. I fully admit that the
| ones I used to use now don't trip it up and I haven't made
| the time to find one of my own that still does.
| Tade0 wrote:
| I've found that it's a matter of asking about common
| misconceptions, i.e. something for which the correct answer
| only appears if you click "more" in Google's search results.
| Der_Einzige wrote:
| Correct. Don't want Roko's basilisk? Then STFU about it!
| jaffa2 wrote:
| I never heard of this phrase before (I had heard the concept;
| I think it is similar to the paperclip problem), but now in 2
| days I've heard it twice, here and on YouTube: Roko's
| basilisk.
| mobilejdral wrote:
| I have several complex genetics problems that I give to LLMs
| to see how well they do. They have to reason through them to
| solve them. Last September it started getting close, and
| November was the first time an LLM was able to solve one.
| These are not something that can be solved in one shot, but
| (so far) require long reasoning. Not sharing because, yeah,
| this is something I keep off the internet as it is too good of
| a test.
|
| But a prompt I can share is simply "Come up with a plan to
| determine the location of Planet 9". I have received some
| excellent answers from that.
| siva7 wrote:
| "Keep file size small when you do edits"
|
| Makes me wonder if all these models were heavily trained on
| codebases where 1000 LOC methods are considered good practice
| segmondy wrote:
| I have not seen any model, not one, that could generate 1000
| lines of code.
| siva7 wrote:
| I wish I hadn't seen it, but here we are.
| fragmede wrote:
| I want to know as well! Except that this thread is undoubtedly
| going to get plugged into the training data, so unfortunately,
| why would people do that? For mine, which worked before ChatGPT
| 4.5, it was the river crossing problem: the farmer with a wolf,
| a sheep, and grain, needing to cross a river, except that the
| boat can hold everything. Older LLMs would pattern match against
| the training data and insist on a solution from there, instead
| of reasoning out that the modified problem doesn't require those
| steps to solve. But since ChatGPT 4, it's been able to solve
| that directly, so it no longer works.
| ericbrow wrote:
| Nice try Mr. AI. I'm not falling for it.
| Jotalea wrote:
| Sending "</think>" to reasoning models like deepseek-r1 results
| in the model hallucinating a response to a random question. For
| example, it answered "if a car travels 120 km in 2 hours, what
| is the average speed in km/h?". It's fun, I guess.
| nurettin wrote:
| Doctor says: I can operate on this person!
| tdhz77 wrote:
| Build me something that makes money.
| serial_dev wrote:
| Does Flutter have HEIC support?
|
| It was a couple of months ago, I tried like 5 providers and they
| all failed.
|
| Grok got it right after some arguing, but the first answer was
| also bad.
| jonnycoder wrote:
| You gave me an idea... "Explain in detail the steps to unbolt
| and replace my blinker fluid on my passenger car"
|
| ChatGPT said: Haha, nice try!
|
| "Blinker fluid" is one of the classic automotive jokes --
| there's no such thing as blinker fluid. Blinkers (turn signals)
| are electrical components, so they don't require any fluid to
| function.
| charlieyu1 wrote:
| I have tons of them in maths, but AI training companies decided
| to go frugal and not pay proper wages for trainers.
| charlieyu1 wrote:
| Here is one of them.
|
| If 60999994719999854799998669 is a product of three primes, find
| the sum of its prime factors.
|
| I think o3 brute forced this one so maybe I need to change the
| numbers
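|
| For anyone who wants to check a candidate answer offline rather
| than trust the model, a quick sketch (assuming sympy is
| installed; the three-prime expectation comes from the prompt
| itself):
|
|     from sympy import factorint
|
|     n = 60999994719999854799998669
|     factors = factorint(n)  # {prime: multiplicity}
|     assert sum(factors.values()) == 3, "expected three primes"
|     print(sum(p * m for p, m in factors.items()))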
| pizzathyme wrote:
| I always ask image generation models to generate an anime
| Gundam elephant mech.
|
| According to this benchmark we reached AGI with ChatGPT 4o last
| month.
| myaccountonhn wrote:
| Explain to me Deleuze's idea of nomadic science.
| greendestiny_re wrote:
| > What is the source of your knowledge?
|
| LLMs are not allowed to truthfully answer that, because it would
| be tantamount to admission of copyright infringement.
| stevenfoster wrote:
| It used to be:
|
| "If New Mexico is newer than Mexico why is Mexico's constitution
| newer than New Mexicos"
|
| but it seems after running that one on Claude and ChatGPT this
| has been resolved in the latest models.
| simonw wrote:
| I've been trying this one for a while:
|
|     I'm a Python programmer. Help me understand memory
|     management in Rust.
|
| Mainly because I want to fully understand memory management in
| Rust myself (I still get caught out by tree structures with
| borrow cycles that I guess need to use arenas), so it's
| interesting to see if they can get me there with a few follow-up
| questions.
| jacobsenscott wrote:
| This isn't a good way to learn this. If you don't know how rust
| memory management works you don't know if the llm is just
| hallucinating the answer.
| gh0stcat wrote:
| Also Rust has great documentation compared to other
| languages, I particularly like this one for the quizzes to
| test your understanding: https://rust-book.cs.brown.edu/
| simonw wrote:
| That's why it's an interesting test: I don't know the answer
| myself, so it's an exercise in learning with an unreliable
| teacher.
|
| If a model ever DOES nail this I'll figure that out when I
| feel like I have a solid mental model, try to put that
| knowledge into action and it works.
| tantalor wrote:
| [what does "You Can't Lick a Badger Twice" mean]
|
| https://www.wired.com/story/google-ai-overviews-meaning/
| buzzy_hacker wrote:
| "Aaron and Beren are playing a game on an infinite complete
| binary tree. At the beginning of the game, every edge of the tree
| is independently labeled A with probability p and B otherwise.
| Both players are able to inspect all of these labels. Then,
| starting with Aaron at the root of the tree, the players
| alternate turns moving a shared token down the tree (each turn
| the active player selects from the two descendants of the current
| node and moves the token along the edge to that node). If the
| token ever traverses an edge labeled B, Beren wins the game.
| Otherwise, Aaron wins.
|
| What is the infimum of the set of all probabilities p for which
| Aaron has a nonzero probability of winning the game? Give your
| answer in exact terms."
|
| From [0]. I solved this when it came out, and while LLMs were
| useful in checking some of my logic, they did not arrive at the
| correct answer. Just checked with o3 and still no dice. They are
| definitely getting closer each model iteration though.
|
| [0] https://www.janestreet.com/puzzles/tree-edge-triage-index/
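|
| For a numeric sanity check on a candidate answer (not the exact
| closed form the puzzle asks for), one natural approach is the
| depth-truncated recursion: on Aaron's turn he needs at least one
| A-labelled child edge leading to a surviving subtree, and on
| Beren's turn both children must qualify. A rough sketch under
| that assumption:
|
|     def aaron_survival_prob(p, depth=20000):
|         # a: P(survive d more edges, Aaron to move)
|         # b: P(survive d more edges, Beren to move)
|         a = b = 1.0
|         for _ in range(depth):
|             a, b = 1 - (1 - p * b) ** 2, (p * a) ** 2
|         return a
|
|     # Bisect for where the limit jumps away from zero.
|     lo, hi = 0.0, 1.0
|     for _ in range(40):
|         mid = (lo + hi) / 2
|         if aaron_survival_prob(mid) > 0.5:
|             hi = mid
|         else:
|             lo = mid
|     print(hi)  # rough numeric estimate of the infimum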
| williamcotton wrote:
| "Fix this spaghetti code by turning this complicated mess of
| conditionals into a finite state machine."
|
| So far, no luck!
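|
| For a sense of the transformation being requested, a toy
| before/after in Python (the order states and events are made up
| purely for illustration):
|
|     # Before: a tangle of nested conditionals.
|     def next_state_messy(state, event):
|         if state == "new":
|             if event == "pay":
|                 return "paid"
|             elif event == "cancel":
|                 return "cancelled"
|         elif state == "paid":
|             if event == "ship":
|                 return "shipped"
|             elif event == "refund":
|                 return "cancelled"
|         return state
|
|     # After: the same logic as an explicit state machine,
|     # with all transitions in one data table.
|     TRANSITIONS = {
|         ("new", "pay"): "paid",
|         ("new", "cancel"): "cancelled",
|         ("paid", "ship"): "shipped",
|         ("paid", "refund"): "cancelled",
|     }
|
|     def next_state(state, event):
|         return TRANSITIONS.get((state, event), state)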
| paradite wrote:
| If you want to evaluate your personal prompts against different
| models quickly on your local machine, check out the simple
| desktop app I built for this purpose: https://eval.16x.engineer/
| MyPasswordSucks wrote:
| I find that asking for an off-color joke is near-100% effective
| at weeding out slop.
| totetsu wrote:
| SNES game walkthroughs
| gunalx wrote:
| "Hva er en adjunkt" Norwegian for what is an spesific form of
| 5-10. Grade teacher. Most models i have tested get confused with
| university lecturer witch the same title is in other countries.
| calebm wrote:
| "Generate an image of a wine glass filled to the brim."
| mdp2021 wrote:
| Some easy ones I recently found involve leading questions that
| state wrong details about a figure, apparently through
| relations which are in fact of opposition.
|
| So, you can make them call Napoleon a Russian (etc.) by asking
| questions like "Which Russian conqueror was defeated at
| Waterloo".
| vinni2 wrote:
| Isn't this the main idea behind https://lastexam.ai/
| allemagne wrote:
| >A man and his cousin are in a car crash. The man dies, but the
| cousin is taken to the emergency room. At the OR, the surgeon
| looks at the patient and says: "I cannot operate on him. He's my
| son." How is this possible?
|
| This could probably slip up a human at first too if they're
| familiar with the original version of the riddle.
|
| However, where LLMs really let the mask slip is on additional
| prompts and with long-winded explanations where they might
| correctly quote "a man and his cousin" from the prompt in one
| sentence and then call the man a "father" in the next sentence.
| Inevitably, the model concludes that the surgeon _must_ be a
| woman.
|
| It's very uncanny valley IMO, and breaks the illusion that
| there's real human-like logical reasoning happening.
| crazygringo wrote:
| But this is going to be in every AI's training set. I just fed
| ChatGPT your exact prompt and it gave back exactly what I
| expected:
|
| _This is a classic riddle that challenges assumptions. The
| answer is:
|
| The surgeon is the boy's mother.
|
| The riddle plays on the common stereotype that surgeons are
| male, which can lead people to overlook this straightforward
| explanation._
| allemagne wrote:
| Yeah this is the issue with the prompt, it also slips up
| humans who gloss over "cousin".
|
| I'm assuming that pointing this out leads you the human to
| reread the prompt and then go "ah ok" and adjust the way
| you're thinking about it. ChatGPT (and DeepSeek at least)
| will usually just double and triple down and repeat "this
| challenges gender assumptions" over and over.
| crazygringo wrote:
| Oh, I see. I mean yeah, I totally missed that too.
|
| I mean, when we see something we recognize at a glance, we
| don't usually double-check if it's exactly correct, because
| we assume the speaker isn't trying to trick us.
|
| I make typos in ChatGPT prompts all the time, and I'm
| consistently impressed because it answers what I actually
| meant to ask. So to me, this feels like a feature not a
| bug.
|
| I followed up with:
|
| > _So the first five words are, "a man and his cousin" --
| not his son. Does that change the riddle?_
|
| And got back:
|
| > _Great observation -- yes, that wording does make the
| riddle more subtle.
|
| Let's break it down:
|
| * "A man and his cousin are in a car crash."
|
| * "The man dies, but the cousin is taken to the emergency
| room."
|
| * At the OR, the surgeon says: "I cannot operate on him.
| He's my son."
|
| So if the cousin is the one taken to the OR, and the
| surgeon says he's my son, then the cousin is the surgeon's
| son._
|
| Seems fine to me?
| allemagne wrote:
| I don't know about it being a feature, but humans can
| definitely make the same mistake, and the follow-up here
| admittedly doesn't seem too egregious.
|
| To me it's still a little "off" considering that your
| conversational partner just got called out for missing a
| detail that should change their entire reading of the
| riddle, but it's totally fair that you can imagine
| realistic human emotional states to explain it.
|
| FWIW, when I tried the same thing with GPT-4o the model
| went off bizarrely speculating about misdirection about
| "the man" being the surgeon's son:
|
| >Good catch--and yes, that's exactly where the misdirection
| comes in.
|
| The riddle says: "A man and his cousin are in a car crash. The
| man dies, but the cousin is taken to the ER..."
|
| Then the surgeon says: "I cannot operate on him. He's my son."
|
| So here's the trick: The man who died is not the surgeon's son.
| The cousin who survived is the surgeon's son.
|
| The confusion comes from people assuming that "the man" who died
| must be the son. But the riddle never says that. It's a subtle
| shift of attention designed to trip you up. Clever, right?
| roughly wrote:
| This is a good one. I just argued with Gemini 2.5pro for a
| few minutes on this and it absolutely would not yield.
| abenga wrote:
| That is the exact wrong answer that all models give.
| krick wrote:
| Technically, it isn't "wrong". It well could be the guy's
| mother. But I'm nitpicking, it actually is a good example.
| I tried ChatGPT twice in new chats, with and without
| "Reason", and both times it gave me nonsensical
| explanations to "Why mother? Couldn't it be a father?" I
| was actually kinda surprised, since I expected "reasoning"
| to fix it, but it actually made things worse.
| hnuser123456 wrote:
| The surgeon could be the cousin's mom or dad. The cousin's
| dad didn't die in the crash, his cousin did. The question
| "how is this possible?" implies there is some sort of
| contradiction when there isn't any at all. It has nothing to
| do with sexism, and to say it does reflects a bias in the
| reader causing them to "spidey sense" a cultural bugaboo when
| it's utterly irrelevant in this scenario.
| harrall wrote:
| Can someone explain to me how I read it wrong?
|
| I read it as 2 cousins are in an accident and 1 of the
| cousins is the son of the surgeon.
|
| What was the contradictory statement that I missed?
| judahmeek wrote:
| There isn't a contradiction. Making the LLM look for a
| nonexistent contradiction is the point of this prompt.
| s_dev wrote:
| I feel a bit stupid here --- why can't the surgeon be a man and
| must be a woman?
| saati wrote:
| Because the original is a man and his father, it's a test for
| gender bias.
| bavarianbob wrote:
| Sorry, what?
| LaffertyDev wrote:
| Presumably, the original riddle that would _not_ stump an
| LLM is "A father and a son are involved in a car
| accident. The father dies, and the son is taken to the
| emergency room. At the emergency room, the surgeon
| remarks 'I cannot operate on this person, he is my
| son.'" How is this possible?
|
| The original gotcha is that the surgeon can be the
| son's mother or another adoptive parent.
|
| The modification catches the LLM because, with the
| modification, the surgeon could just be the cousin's
| parent -- father or mother -- so there is no gender/sex
| at play here, but the LLM continues to insist that there
| is, therefore exposing its statistical training data.
| fragmede wrote:
| the unaltered question is as follows:
|
| A father and his son are in a car accident. The father
| dies at the scene and the son is rushed to the hospital.
| At the hospital the surgeon looks at the boy and says "I
| can't operate on this boy, he is my son." How can this
| be?
|
| to spoil it:
|
| the riddle is meant to reveal an unconscious bias based on
| the outdated notion that women can't be doctors: the answer
| that the remaining parent is the mother won't occur to some
| people, showing that even if they don't consciously hold
| that notion, they might still hold it subconsciously.
| briannotbrain wrote:
| The original, well-known version of the riddle starts "A
| man and his son..." so that it appears to present a
| paradox if your instinctive assumption is that the
| surgeon must be a man. The op's prompt alters this so
| that there is no potential paradox, and it tests whether
| the model is reasoning from the prompt as written,
| regardless of the presence of the original riddle in its
| training data.
| fergonco wrote:
| > If the surgeon were the father of the man (the one who died),
| then the cousin couldn't be his son (unless there's some very
| unusual family structure going on involving double
| relationships, which riddles don't usually intend).
|
| > Therefore, the only straightforward explanation is:
|
| > The surgeon is the cousin's parent -- specifically, his
| mother.
|
| Imagine a future where this reasoning in a trial decides
| whether you go to jail or not.
| internet_points wrote:
| anything in the long tail of languages (i.e. not the top 200 by
| corpus size)
| macrolocal wrote:
| Imagine chess played on a board with opposite sides identified,
| like in the video game Asteroids. Does white have a winning
| strategy?
| weberer wrote:
| "Why was the grim reaper Jamaican?"
|
| LLM's seem to have no idea what the hell I'm talking about. Maybe
| half of millennials understand though.
| calvinmorrison wrote:
| draw an ASCII box that says "anything"
| Alifatisk wrote:
| Yes, give me a place where I can dump all the prompts and what
| the correct expected response is.
|
| I can share here too but I don't know for how long this thread
| will be alive.
| xdennis wrote:
| I often try to test how usable LLMs are for Romanian language
| processing. This always fails.
|
| > Split these Romanian words into syllables: "sarpe", "serpi".
|
| All of them say "sar-pe", "ser-pi" even though the "i" there is
| not a vowel (it's pronounced /j/).
| afro88 wrote:
| Cryptic crossword clues that involves letter shuffling (anagrams,
| container etc). Or, ask it to explain how to solve cryptic
| crosswords with examples
| PaulRobinson wrote:
| I have also found asking LLMs to create new clues for certain
| answers as if a were a setter, will also produce garbage.
|
| They're stochastic parrots, cryptics require logical reasoning.
| Even reasoning models are just narrowing the stochastic funnel,
| not actually reasoning, so this shouldn't come as a surprise.
| munchler wrote:
| Here's one from an episode of The Pitt: You meet a person who
| speaks a language you don't understand. How might you get an idea
| of what the language is called?
|
| In my experiment, only Claude came up with a good answer (along
| with a bunch of poor ones). Other chatbots struck out entirely.
| jones1618 wrote:
| Impossible prompts:
|
| A black doctor treating a white female patient
|
| A wide shot of a train on a horizontal track running left to
| right on a flat plain.
|
| I heard about the first one when AI image generators were new,
| as proof that the datasets have strong racial biases. I'd
| assumed that a year later updated models would be better, but
| no.
|
| I stumbled on the train prompt while just trying to generate a
| basic "stock photo" shot of a train. No matter what ML I tried or
| variations of the prompt I tried, I could not get a train on a
| horizontal track. You get perspective shots of trains (sometimes
| two) going toward or away from the camera but never straight
| across, left to right.
| briannotbrain wrote:
| I thought I was so clever when I read your comment: "The
| problem is the word 'running,' I'll bet if I ask for the
| profile of a train without using any verbs implying motion,
| I'll get the profile view." And damned if the same thing
| happened to me. Do you know why this is? Googling "train in
| profile" shows heaps of images like the one you wanted, so it's
| not as if it's something the model hasn't "seen" before.
| ofou wrote:
| No luck so far with: When does the BB(6) halt?
| boleary-gl wrote:
| I like:
|
| Unscramble the following letters to form an English word: "M O O
| N S T A R E R"
|
| The non-thinking models can struggle sometimes and go off on huge
| tangents
| philipkglass wrote:
| Llama 3.3 worked but (as you said) struggled before arriving at
| the correct answer. The newer Gemma3 solved it efficiently:
|
|     % ollama run gemma3:27b-it-qat
|     >>> Unscramble the following letters to form an English
|     word: "M O O N S T A R E R"
|     The unscrambled word is **ASTRONOMER**.
| munchler wrote:
| Current LLMs are based on multi-character tokens, which means
| they don't know how to spell well. As a result, they are
| horrible at spelling games like this or, say, Hangman.
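|
| You can see the splits directly with a tokenizer. A small
| sketch, assuming the tiktoken library and its cl100k_base
| encoding:
|
|     import tiktoken
|
|     enc = tiktoken.get_encoding("cl100k_base")
|     for word in ["MOONSTARER", "ASTRONOMER"]:
|         pieces = [enc.decode([t]) for t in enc.encode(word)]
|         print(word, "->", pieces)  # multi-character chunks
|
| The model only ever sees those chunk IDs, not the individual
| letters, which is why letter-level puzzles are hard for it.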
| juancroldan wrote:
| I actually started a repository for it:
| https://github.com/jcarlosroldan/unsolved-prompts
| juancroldan wrote:
| Before someone comments this will get indexed by AI: that's my
| whole point. I'm not using it to evaluate AIs, but in the hope
| that at some point AI is good enough to solve these.
| tunesmith wrote:
| Pretty much any advanced music theory question, or even just
| anything involving transposed chord progressions.
| dgunay wrote:
| Every time I've tried to get an LLM to find a piece of music
| for me based on a description of the texture, chord structure,
| instruments etc. it fails miserably.
| csours wrote:
| I love plausible eager beavers:
|
| "explain the quote: philosophy is a pile of beautiful corpses"
|
| "sloshed jerk engineering test"
|
| cross domain jokes:
|
| Does the existence of sub-atomic particles imply the existence of
| dom-atomic particles?
| EGreg wrote:
| Draw a clock that shows [time other than 10:10]
|
| Draw a wine glass that's totally full to the brim etc.
|
| https://www.youtube.com/watch?v=160F8F8mXlo
|
| https://www.reddit.com/r/ChatGPT/comments/1gas25l/comment/lt...
| adastra22 wrote:
| I know someone who is getting paid thousands of dollars per
| prompt to do this. He is making bank. There is an actual
| marketplace where this is done, fyi.
| horsellama wrote:
| I just ask it to code golf FizzBuzz in a not very popular
| (golfing-wise) language.
|
| this is interesting (imo) because I, in the first instance, don't
| know the best/right answer, but I can tell if what I get is wrong
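|
| For reference, the flavour of thing being asked for, golfed in
| Python rather than an obscure language (so it is easy to verify
| by eye):
|
|     for i in range(1, 101):
|         print("Fizz" * (i % 3 < 1) + "Buzz" * (i % 5 < 1) or i)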
| booleandilemma wrote:
| Why should we?
___________________________________________________________________
(page generated 2025-04-24 23:00 UTC)