[HN Gopher] PhD Knowledge Not Required: A Reasoning Challenge fo...
       ___________________________________________________________________
        
       PhD Knowledge Not Required: A Reasoning Challenge for Large
       Language Models
        
       Author : enum
       Score  : 62 points
       Date   : 2025-02-09 18:14 UTC (4 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | enum wrote:
       | Results and dataset explorer here:
       | https://huggingface.co/spaces/nuprl/verbal-reasoning-challen...
        
         | mh- wrote:
         | For ID=3, it shows o1 getting it wrong, but it seems to have
          | succeeded? It did add a space between _Tinker_ and _bell_, but
          | that is apparently the canonical way of spelling the character.
         | 
          | (That one just caught my attention because I was curious which
          | challenge o1-mini got correct that o1 did not.)
        
           | enum wrote:
           | Thanks, fixed. (Spaces rebuilding.) We have manually combed
           | labelled-wrong answers and tweaked the predicates that check
           | correctness. Sorry we missed this one.
        
             | mh- wrote:
              | I thought maybe I was just missing something silly... I
              | was hesitant to comment without additional coffee. And no
             | problem, thanks for sharing this!
        
         | ks2048 wrote:
         | Is there some standard markup assumed for LLMs? I just looked
         | at a couple and saw markdown-like text (which I expected), but
         | also "$\boxed{foo}$". Looks like LaTeX, but is there some
         | subset of LaTeX LLMs are assumed to know/use?
        
       | aghilmort wrote:
       | really great work! are you a co-author?
        
         | enum wrote:
         | Thanks, and yes :)
        
       | sega_sai wrote:
        | Is it really certain that those problems and their answers were
        | not in the training data for the tested LLMs? Presumably somebody
        | on the internet wrote about them...
        
         | enum wrote:
         | They are scraped from the web, and discussed on Reddit. So,
         | they are definitely in the training data. Despite that, the
         | non-reasoning LLMs struggle to solve them.
         | 
          | However, new problems are released every week, so we can
          | safely assume the latest problems are decontaminated. It
          | remains to be seen if and how performance drops on the
          | problems released in 2025. (Not enough problems yet to tell.)
        
       | shashanoid wrote:
       | Love prepending 'explain' to arxiv links xD
       | https://explainarxiv.org/abs/2502.01584
        
         | foundry27 wrote:
         | I get that sliding in references to a passion project on top-
         | scoring articles might seem like an easy way to give the
         | project exposure, but commenting the same thing over and over
         | comes off as a bit boorish. And just plugging the URL isn't
         | really contributing anything to the discussions IMO. Why not
         | show us something your tool explained or summarized from the
         | articles that isn't obvious from a cursory read? Citing the
         | tool as the source for something cool wouldn't be nearly as in-
         | your-face.
        
           | shashanoid wrote:
            | I won't do it again. My sincere apologies.
        
         | porridgeraisin wrote:
         | Similar tool (I'm not affiliated):
         | 
         | alphaXiv (www.alphaxiv.org)
        
           | shashanoid wrote:
           | wow that's a cool one! thanks for sharing
        
       | zinccat wrote:
        | My feeling is that a lot of the challenge could come from the
        | tokenizer used by the model, similar to the "r"s in "strawberry"
        | problem.
        
         | enum wrote:
         | There is likely a little bit of that, but I don't think that's
         | the primary factor that makes these hard. If you look at the
         | reasoning performed by R1 ("Show Thoughts" checkbox), it has no
         | problem splitting and joining candidate words into letters, or
         | even doing the letter-level manipulations. Just to give an
         | example:
         | 
         | Take a familiar brand name, seen along roads and highways in
         | the United States. It has five letters, two syllables. The
         | first syllable, phonetically, is a word that is the opposite of
         | the word spelled by the second syllable. What brand name is it?
         | 
         | <think> Okay, let me try to figure out this riddle. The
         | question is about a brand name that's familiar on roads and
         | highways in the US. It has five letters and two syllables. The
         | key part is that the first syllable phonetically is the
         | opposite of the second syllable spelled out. Hmm.
         | 
         | First, I need to think of common brands seen on highways. Gas
         | stations, fast food, hotels, maybe car brands. Five letters,
         | two syllables. Let's list some possibilities: Exxon, Shell,
         | Chevron, Mobil, BP (but BP is two letters, so probably not).
         | Maybe food places like Subway, but that's six letters.
         | McDonald's is too long. Burger King? No. Maybe hotels: Motel 6?
         | No, that's a number. Holiday Inn? Too long. Maybe Best Western?
         | No, too long.
         | 
         | Wait, five letters. Let's think of two-syllable brands. Exxon
         | (E-xxon), but Exxon is five letters. First syllable is "Ex",
         | second is "xon". But what's the opposite of "xon"? That doesn't
         | make sense. Maybe the first syllable is a word, and the second
         | syllable when spelled is the opposite. So, for example, if the
         | first syllable is "up", the second syllable's letters spell
         | "down". But "updown" isn't a brand. Wait, the first syllable
         | phonetically is a word, and the second syllable's spelling is
         | the opposite. So when you say the first syllable, it's a word,
         | and the letters of the second syllable spell its opposite.
         | 
         | So, for example, if the brand is "Updn" (made up), where "up"
         | is the first syllable, and "dn" spells "down". But that's not a
         | real brand. Let's think of real brands. Maybe "Comcast"? No,
         | too long. "Verizon"? Also longer. "Sprint"? Five letters.
         | Sprint. S-print. But "S" doesn't mean anything opposite of
         | "print". Hmm. Maybe "Exxon" again. "Ex" as in former, and the
         | opposite would be "xon" spelled as... no.
         | 
         | Wait, maybe the first syllable is a word that's the opposite of
         | the word formed by the second syllable's letters. For instance,
         | "Sunoco" is a gas station, but that's six letters. "Shell" is
         | five letters but one syllable. "Hertz" is car rental. Five
         | letters, one syllable. "Avis"? Four letters. "FedEx"? Five
         | letters? F-E-D-E-X? No, that's six. Wait, FedEx is two
         | syllables: Fed-Ex. Five letters? F-E-D-E-X? That's five
         | letters? F, E, D, E, X. Yes, five letters. FedEx. The first
         | syllable is "Fed", the second is "Ex". Now, "Fed" as in past
         | tense of feed, but maybe "Fed" as in federal. The opposite of
         | "Ex" could be "current" or "present". But that doesn't seem
         | right. Alternatively, "Ex" as in former, so the opposite would
         | be "current" or "new". Hmm, not sure. Maybe "FedEx" isn't the
         | answer.
         | 
         | Another thought: "Amtrak" is a brand on highways, but that's
         | six letters. "Greyhound"? Too long. "Dollar" car rental? Five
         | letters? No, six. "Thrifty" is seven. Wait, maybe "Lyft"? No,
         | four letters. "U-Haul" is hyphenated.
         | 
         | Wait, let's think of opposites. The first syllable's sound is a
         | word, and the second syllable's spelling is its opposite. So,
         | for example, if the brand is "Inote", where "I" (pronounced
         | "eye") and "note", but "eye" and "note" don't have opposites.
         | Alternatively, "Hi-Lo", but that's hyphenated. Maybe "Dayton"
         | (a tire company?), but "Day" vs "ton" - not opposites.
         | 
         | Alternatively, maybe "Duracell" - no, too long. "Energizer"?
         | No. "Eveready"? No. Wait, maybe "Goodyear". That's a tire
         | company. Good-year. "Good" and "year". Not opposites.
         | 
         | Wait, the key is that the first syllable phonetically is a
         | word, and the second syllable's spelling is the opposite. So,
         | for instance, if the first syllable is "pro", the second
         | syllable's letters spell "con". But "procon" isn't a brand. Or
         | "up" and "down". Maybe "Updown", but not a real brand.
         | 
         | Wait, the brand "Exxon" - "Ex" and "xon". If "Ex" is the
         | opposite of "xon" spelled as "xon" - no. Maybe "Ex" (as in
         | former) and "xon" - but "xon" isn't a word. Alternatively,
         | "Exxon" could be broken into "Ex" and "on". Wait, "Ex" (as in
         | former) and "on" (the opposite being "off"). But "Exxon" is
         | E-X-X-O-N, so the second syllable is "xon", not "on". Hmm.
         | 
         | Another approach: think of opposites like "yes/no", "hot/cold",
         | "fast/slow", "high/low", "in/out", "on/off", "start/stop",
         | "day/night". Maybe the brand incorporates one of these.
         | 
         | "Hi-Lo" is hyphenated. "Stop & Go" - no. "Speedpass" - not five
         | letters. "7-Eleven" - no.
         | 
         | Wait, "Loews" hotels? Five letters. "Lo-ews". "Lo" as in low,
         | and "ews" - not an opposite. Maybe not.
         | 
         | Wait, "Lucent" - a telecom company, but that's six letters.
         | "Verizon" - seven letters.
         | 
         | Wait, "Motel 6" - no, includes a number. "Super 8" - same.
         | 
         | Wait, "Shell" - five letters but one syllable. "Chevron" - two
         | syllables but seven letters.
         | 
         | Wait, "Pepsi" - five letters, two syllables. "Pep" and "si".
         | "Pep" as in energy, and "si" - not an opposite.
         | 
         | "Cola" - not a brand. "Coke" - one syllable.
         | 
         | Wait, "Adidas" - three syllables. "Nike" - two syllables but
         | four letters.
         | 
         | Wait, "Reebok" - six letters. "Fila" - four.
         | 
         | Wait, maybe a car brand. "Ford" - one syllable. "Chevy" - two
         | syllables but five letters? C-H-E-V-Y. Five letters, "Chev" and
         | "y"? Not opposites.
         | 
         | Wait, "Honda" - five letters, two syllables: Hon-da. Not
         | opposites. "Toyota" - too long. "Lexus" - two syllables but
         | five letters. "Le-xus". Not opposites.
         | 
         | Wait, "Nissan" - two syllables, six letters. "Mazda" - two
         | syllables, five letters. M-A-Z-D-A. "Maz" and "da". Doesn't
         | fit.
         | 
         | Wait, "Dodge" - one syllable. "Jeep" - one.
         | 
         | Wait, "Subaru" - three syllables.
         | 
         | Hmm. Let's think of gas stations again. "Shell", "Exxon",
         | "Mobil", "Texaco", "Chevron", "BP", "Valero", "Citgo".
         | 
         | Citgo: C-I-T-G-O. Two syllables: Cit-go. "Cit" (like "sit") and
         | "go". Wait, "sit" and "go" - those are opposites in a way. If
         | you sit, you're not moving; if you go, you're moving. So
         | "Citgo" - "Cit" phonetically is "sit", and "go" is spelled as
         | "go". So the first syllable sounds like "sit", which is the
         | opposite of "go". That might be it!
         | 
         | Citgo is a gas station brand, seen on highways, five letters
         | (C-I-T-G-O), two syllables: Cit-go. Phonetically, "Cit" sounds
         | like "sit", and "go" is the opposite. So the first syllable
         | (phonetically "sit") is the opposite of the second syllable
         | spelled "go". That seems to fit.
         | 
         | Let me double-check. The brand is Citgo. Five letters, two
         | syllables. First syllable "Cit" pronounced like "sit", which is
         | the opposite of "go". Yes, that makes sense. I think that's the
         | answer. </think>
         | 
         | The brand name is *Citgo*.
         | 
         | Citgo is a well-known gas station brand commonly seen along
         | U.S. roads and highways. It has five letters and two syllables:
         | *Cit* (pronounced like "sit") and *go*. The first syllable
         | phonetically represents "sit," which is the opposite of the
         | second syllable spelled as "go." This clever play on opposites
         | fits the riddle's criteria perfectly.
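The constraint check R1 performs at the end of that trace is mechanical once a candidate is in hand. A minimal sketch: the syllable split, the homophone map, and the antonym map below are all hand-made illustrative assumptions, not part of the original puzzle.

```python
# Check the riddle's constraints for a candidate brand: five letters,
# and the first syllable sounds like a word whose opposite is spelled
# by the second syllable. HOMOPHONES and OPPOSITES are tiny stand-in
# tables, not an exhaustive vocabulary.

HOMOPHONES = {"cit": "sit"}   # first syllable -> word it sounds like
OPPOSITES = {"sit": "go", "up": "down", "on": "off"}

def fits_riddle(brand: str, split: int) -> bool:
    """True if brand has five letters and syllable 1 (phonetically)
    is the opposite of the word spelled by syllable 2."""
    brand = brand.lower()
    first, second = brand[:split], brand[split:]
    word = HOMOPHONES.get(first, first)
    return len(brand) == 5 and OPPOSITES.get(word) == second
```

Under these assumed tables, fits_riddle("Citgo", 3) holds ("cit" sounds like "sit", the opposite of "go"), while candidates like "Pepsi" or "Exxon" fail the antonym check.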
        
           | zinccat wrote:
            | I see, but still there's a lot of reasoning just for
            | counting the letters. And ridiculous reasoning like:
           | 
           | FedEx"? Five letters? F-E-D-E-X? No, that's six. Wait, FedEx
           | is two syllables: Fed-Ex. Five letters? F-E-D-E-X? That's
           | five letters? F, E, D, E, X. Yes, five letters. FedEx.
        
             | enum wrote:
             | Definitely a lot of letter counting. It's not not a factor.
             | I think the real problem is that the search space for each
             | problem is enormous. When it gets stuck, it just gets stuck
             | enumerating candidates that meet some but not all of the
             | constraints.
        
       | lokimedes wrote:
        | As if the whole anti-intellectual hunt wasn't enough, now "PhD"
        | is a category implying a holder of rote knowledge at the highest
        | level. I guess it is hopeless to fight this, but a PhD is 100x
        | more about the apprenticeship and real-world training as a
        | scientist than any accumulated knowledge beyond one's prior
        | training.
       | 
       | I know this is a rant, sorry, just so tired of the stupidity.
        
         | mlinhares wrote:
         | There's no multibillion dollar business if they can't say
         | everyone will be replaced by their models.
        
         | DiscourseFan wrote:
         | PhD students aren't necessarily the brightest (which is
         | probably why many of them don't end up with tenure), but
         | studying something at the graduate level does open your eyes to
         | a particular kind of knowledge which is generally unavailable
         | otherwise. So, yes, of course the work is more important than
         | the credential, but the credential often stands in for the work
         | that gives that knowledge.
        
       | XCabbage wrote:
       | This doesn't feel like a "reasoning" challenge. The mental skill
       | required to solve most of these seems to be the ability to loop
       | over all known members of a category like "popular brand names"
       | or "well-known actors" and see if they fit the clue.
       | 
       | As a human, you'd expect to fail either because you didn't know a
       | category member (e.g. as a non-American I have no idea WTF
       | "Citgo" is; I could never get the answer to the first question
       | because I have never seen that name before in my life) or because
       | you weren't able to bring it to mind; the mental act of looping
       | over all members of a category is quite challenging for a human.
       | 
       | Admittedly this is something an AI system could in principle be
       | REALLY good at, and it's interesting to test and see that current
       | ones are not! But it seems weird to me to call what's being
       | tested "reasoning" when it's so heavily focused on memory recall
       | (and evaluating whether a candidate answer works or not is
       | trivial once you've brought it to mind and doesn't really require
       | any intelligent thought).
       | 
        | (If the questions were multiple-choice, eliminating the need to
        | bring candidate answers to mind, which is the main challenge for
        | a human, then I'd agree it was a "reasoning" test.)
        
         | enum wrote:
         | It's definitely U.S. centric.
         | 
          | But I do think this is reasoning. It requires recall, but so
          | does anything other than a pure logic puzzle. For example, on
          | a competition math problem or a programming problem, no person
          | or LLM is inventing well-known lemmas and algorithms from
          | first principles.
        
           | XCabbage wrote:
           | It's not just that it requires recall. It's that it requires
           | no non-trivial thought beyond recall.
        
             | enum wrote:
             | I think what you mean is that once you've managed to
             | recall, checking constraints is easy. Remarkably, a few
             | people are much better at this than others. They are able
             | to think fast and execute an explicit mental search over a
             | very small number of plausible candidates. Other people
             | take forever. Seems to be the case for models too.
        
         | jkhdigital wrote:
         | I had the same thought. It reminds me of solving Project Euler
         | problems, where there is often an obvious naive approach which
         | is guaranteed to produce the correct answer but would consume
         | prohibitive memory/compute resources to execute to completion.
         | I suspect the models would perform much better if prompted to
         | formulate a strategy for efficiently solving these challenges
         | rather than solving them directly... which indicates a
         | direction for potential improvement I suppose.
        
         | tkgally wrote:
          | I agree that recall seems to play an important role in solving
          | these problems, similar to how the ARC-AGI problems seem to
          | depend on visual recognition and classification of shapes and
         | colors. When I come up with the correct answers to such
         | puzzles, I feel subjectively that the answers flashed into my
         | mind, not that I reasoned my way to them.
        
       | mkoubaa wrote:
       | The term "PhD Knowledge" makes me cringe. PhD is about
       | contribution to a field, not a standardized test score or a weird
       | status-bearing substitute for IQ.
        
         | glial wrote:
         | Also - at least in my experience - doing a PhD makes a person
         | more aware of the boundaries of knowledge, and therefore
         | generally pretty skeptical, and capable of being usefully
         | critical, about developments in their field.
        
         | enum wrote:
         | There is a trend of building hard benchmarks for LLMs, such as
         | GPQA or Humanity's Last Exam, with questions crafted by PhDs.
         | Ergo, most people cannot understand what these benchmarks are
         | testing. If you're an expert in something, and if you're on HN
         | you may be an expert in some sub-field of CS, I recommend
         | trying to solve the questions from your field on Humanity's
         | Last Exam. I took a crack at the Programming Languages
         | questions myself.
        
           | charlieyu1 wrote:
           | - wants PhD knowledge - pays like $100-500 per question
        
         | levocardia wrote:
         | You cannot get a PhD (in the US) without also passing an
         | examination that, nominally at least, demonstrates you have
         | broad and deep knowledge about your field. The qualifying exam
         | is not the most important part, but it is a necessary part.
        
           | enum wrote:
           | Some still exist, but this is less and less true over time.
           | Even when a requirement is called an "exam", it isn't what a
           | high-schooler would recognize as one.
           | 
           | Many of the exams described on this page are personalized and
           | only require mastery of a handful of papers that really
           | matter to the individual student, and knowledge of related
           | work:
           | 
           | https://www.cs.stanford.edu/phd-program-requirements-
           | qualify...
           | 
           | I've been through something that sounds similar to these, and
           | I would not have called it an exam.
        
         | krisoft wrote:
          | In this case it is shorthand for specialist-level knowledge:
          | the kind of information you would expect a person pursuing a
          | PhD in a specific field to know, but someone outside that
          | field not to know.
         | 
         | For example how in the "Humanity's Last Exam" AI evaluation
         | dataset they have this question: "Hummingbirds within
         | Apodiformes uniquely have a bilaterally paired oval bone, a
         | sesamoid embedded in the caudolateral portion of the expanded,
         | cruciate aponeurosis of insertion of m. depressor caudae. How
         | many paired tendons are supported by this sesamoid bone? Answer
         | with a number."
         | 
         | That's the kind of "PhD knowledge" which this evaluation does
         | not require.
        
         | buescher wrote:
         | It means you have taken graduate level courses and have
         | completed an apprenticeship in doing publishable academic
         | research. Now that a master's is the new bachelor's there's
         | something of a trend to think of a PhD as the new master's but
         | it's really qualitatively different.
        
       | akomtu wrote:
       | The reasoning challenge is made of two parts:
       | 
        | 1. Can you apply an existing model to a problem? For example:
        | you're told how to multiply numbers and asked to multiply AHFG
        | by VRBD in a base-26 system.
       | 
       | 2. Can you come up with a model that explains the given examples?
       | For example: you're given 10 triples like AxB=C and asked to
       | explain what they have in common.
       | 
       | Simply imitating answers won't get you very far.
        
       | rahidz wrote:
       | What is so interesting to me is that the reasoning traces for
       | these often have the correct answer, but the model fails to
       | realize it.
       | 
       | Problem 3 ("Dry Eye"), R1: "Wait, maybe "cubitus valgus" - no,
       | too long. Wait, three letters each. Let me think again. Maybe
       | "hay fever" is two words but not three letters each. Maybe "dry
       | eye"? "Dry" and "eye" - both three letters. "Dry eye" is a
       | condition. Do they rhyme? "Dry" (d-rye) and "eye" (i) - no, they
       | don't rhyme. "Eye" is pronounced like "i", while "dry" is
       | "d-rye". Not the same ending."
       | 
       | Problem 8 ("Foot nose"), R1: "Wait, if the seventh letter is
       | changed to next letter, maybe the original word is "footnot" (but
       | that's not a word). Alternatively, maybe "foot" + "note", but
       | "note" isn't a body part."
        
         | Workaccount2 wrote:
         | The other day I watched Gemini reasoning go in circles for ~30
         | seconds trying to determine if 9.11 was greater than 9.9.
         | 
         | ----
         | 
         | "Realize a direct comparison of tenths is not enough: Wait,
         | this is not right. In 9.9, the tenths digit is 9. In 9.11, the
         | tenths digit is 1. This initial comparison is incorrectly
         | stated. Let's re-examine.
         | 
         | Re-examine tenths place comparison (correctly): In 9.9, the
         | tenths digit is 9. In 9.11, the tenths digit is 1. Wait again!
         | This is still incorrect. I'm misreading the digits. Let's slow
         | down.
         | 
         | Re-examine the numbers digit by digit from left to right, after
         | the decimal:
         | 
         | 9.9: 9 in the tenths place, nothing explicitly written in the
         | hundredths place (we can think of it as 0 in the hundredths
         | place, so 9.90).
         | 
         | 9.11: 1 in the tenths place, 1 in the hundredths place.
         | 
         | Compare tenths places: In 9.9 (or 9.90), the tenths digit is 9.
         | In 9.11, the tenths digit is 1. Wait... something is still
         | wrong with my comparison logic. I'm getting confused. Let's re-
         | read the numbers again.
         | 
         | Correctly read and compare tenths digits:
         | 
         | 9.9 - Tenths digit is 9.
         | 
         | 9.11 - Tenths digit is 1. This is still wrong! I am
         | consistently misreading the numbers. Let's go back to the
         | start."
         | 
         | ----
         | 
          | This shows steps 5 to 9. It went to step _35_ before
          | determining that 9.9 > 9.11. This was after it decided to
          | convert to money: $9.11 vs $9.90.
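For contrast, the comparison itself is a one-liner with exact decimal arithmetic; a minimal sketch using Python's stdlib decimal module:

```python
# Exact comparison of decimal literals, avoiding both float rounding
# and the digit-misreading loop in the trace above: 9.9 is 9.90, and
# its tenths digit (9) beats 9.11's tenths digit (1).
from decimal import Decimal

def greater(a: str, b: str) -> bool:
    """True if the decimal literal a is strictly greater than b."""
    return Decimal(a) > Decimal(b)
```

greater("9.9", "9.11") is True, with no step 35 required.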
        
         | armcat wrote:
          | It feels like a lot of the reasoning tokens go to waste on a
          | pure brute-force approach - plugging in numbers, evaluating,
          | and comparing against the answer. "Nope, that didn't work,
          | let's try 4 instead of 6 this time", etc. What if the reward
          | function instead focused on diversity of procedures within a
          | token budget (10k - 20k tokens)? I.e., RL rewards the model
          | for trying different methods or generating different
          | hypotheses, rather than brute-forcing its way through and
          | potentially getting stuck in loops.
        
         | enum wrote:
         | The nature of the problems makes it relatively easy to follow
         | along with the models' reasoning and reasoning errors. For
         | example, on this problem (answer "New England"):
         | 
         | > Think of a place in America. Two words, 10 letters
         | altogether. The first five letters read the same forward and
         | backward. The last five letters spell something found in the
         | body. What place is this?
         | 
         | R1 fixates on answers of the form "CITY, STATE" and eventually
         | returns some confidently wrong nonsense. It doesn't try to
         | explore answers that don't fit the "CITY, STATE" template.
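The two constraints in that puzzle are also easy to verify once a candidate is proposed; the search over candidates is the hard part. A minimal checker, where the body-part vocabulary is a small stand-in and not from the post:

```python
# Verify: 10 letters total (ignoring spaces), the first five read the
# same forward and backward, and the last five spell something found
# in the body. BODY_PARTS is a tiny stand-in list, not exhaustive.

BODY_PARTS = {"gland", "heart", "liver", "femur"}

def fits_place(place: str) -> bool:
    letters = place.replace(" ", "").lower()
    return (len(letters) == 10
            and letters[:5] == letters[:5][::-1]
            and letters[5:] in BODY_PARTS)
```

fits_place("New England") holds: "newen" is a palindrome and "gland" is found in the body, while "CITY, STATE"-shaped guesses fail one constraint or the other.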
        
       ___________________________________________________________________
       (page generated 2025-02-09 23:00 UTC)