[HN Gopher] PhD Knowledge Not Required: A Reasoning Challenge fo...
___________________________________________________________________
PhD Knowledge Not Required: A Reasoning Challenge for Large
Language Models
Author : enum
Score : 62 points
Date : 2025-02-09 18:14 UTC (4 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| enum wrote:
| Results and dataset explorer here:
| https://huggingface.co/spaces/nuprl/verbal-reasoning-challen...
| mh- wrote:
| For ID=3, it shows o1 getting it wrong, but it seems to have
| succeeded? It did add a space between _Tinker_ and _bell_, but
| that is the canonical way of spelling the character, apparently.
|
| (That one just caught my attention because I was curious what
| challenge o1-mini got correct that o1 did not.)
| enum wrote:
| Thanks, fixed. (Spaces rebuilding.) We have manually combed
| labelled-wrong answers and tweaked the predicates that check
| correctness. Sorry we missed this one.
| mh- wrote:
| I thought maybe I was just missing something silly... I was
| hesitant to comment without additional coffee. And no
| problem, thanks for sharing this!
| ks2048 wrote:
| Is there some standard markup assumed for LLMs? I just looked
| at a couple and saw markdown-like text (which I expected), but
| also "$\boxed{foo}$". Looks like LaTeX, but is there some
| subset of LaTeX LLMs are assumed to know/use?
| aghilmort wrote:
| really great work! are you a co-author?
| enum wrote:
| Thanks, and yes :)
| sega_sai wrote:
| Is it really certain that those problems and the answers were
| not in the training data for the tested LLMs? Presumably
| somebody on the internet wrote about them...
| enum wrote:
| They are scraped from the web, and discussed on Reddit. So,
| they are definitely in the training data. Despite that, the
| non-reasoning LLMs struggle to solve them.
|
| However, new problems are released every week, so we can
| safely assume the latest ones are decontaminated. It remains
| to be seen if and how performance drops on the problems
| released in 2025. (Not enough problems yet to tell.)
| shashanoid wrote:
| Love prepending 'explain' to arxiv links xD
| https://explainarxiv.org/abs/2502.01584
| foundry27 wrote:
| I get that sliding in references to a passion project on top-
| scoring articles might seem like an easy way to give the
| project exposure, but commenting the same thing over and over
| comes off as a bit boorish. And just plugging the URL isn't
| really contributing anything to the discussions IMO. Why not
| show us something your tool explained or summarized from the
| articles that isn't obvious from a cursory read? Citing the
| tool as the source for something cool wouldn't be nearly as in-
| your-face.
| shashanoid wrote:
| I won't do it again, my sincere apologies.
| porridgeraisin wrote:
| Similar tool (I'm not affiliated):
|
| alphaXiv (www.alphaxiv.org)
| shashanoid wrote:
| wow that's a cool one! thanks for sharing
| zinccat wrote:
| My feeling is that a lot of the challenge could come from the
| tokenizer used by the model, similar to the "count the r's in
| strawberry" problem.
| enum wrote:
| There is likely a little bit of that, but I don't think that's
| the primary factor that makes these hard. If you look at the
| reasoning performed by R1 ("Show Thoughts" checkbox), it has no
| problem splitting and joining candidate words into letters, or
| even doing the letter-level manipulations. Just to give an
| example:
|
| Take a familiar brand name, seen along roads and highways in
| the United States. It has five letters, two syllables. The
| first syllable, phonetically, is a word that is the opposite of
| the word spelled by the second syllable. What brand name is it?
|
| <think> Okay, let me try to figure out this riddle. The
| question is about a brand name that's familiar on roads and
| highways in the US. It has five letters and two syllables. The
| key part is that the first syllable phonetically is the
| opposite of the second syllable spelled out. Hmm.
|
| First, I need to think of common brands seen on highways. Gas
| stations, fast food, hotels, maybe car brands. Five letters,
| two syllables. Let's list some possibilities: Exxon, Shell,
| Chevron, Mobil, BP (but BP is two letters, so probably not).
| Maybe food places like Subway, but that's six letters.
| McDonald's is too long. Burger King? No. Maybe hotels: Motel 6?
| No, that's a number. Holiday Inn? Too long. Maybe Best Western?
| No, too long.
|
| Wait, five letters. Let's think of two-syllable brands. Exxon
| (E-xxon), but Exxon is five letters. First syllable is "Ex",
| second is "xon". But what's the opposite of "xon"? That doesn't
| make sense. Maybe the first syllable is a word, and the second
| syllable when spelled is the opposite. So, for example, if the
| first syllable is "up", the second syllable's letters spell
| "down". But "updown" isn't a brand. Wait, the first syllable
| phonetically is a word, and the second syllable's spelling is
| the opposite. So when you say the first syllable, it's a word,
| and the letters of the second syllable spell its opposite.
|
| So, for example, if the brand is "Updn" (made up), where "up"
| is the first syllable, and "dn" spells "down". But that's not a
| real brand. Let's think of real brands. Maybe "Comcast"? No,
| too long. "Verizon"? Also longer. "Sprint"? Five letters.
| Sprint. S-print. But "S" doesn't mean anything opposite of
| "print". Hmm. Maybe "Exxon" again. "Ex" as in former, and the
| opposite would be "xon" spelled as... no.
|
| Wait, maybe the first syllable is a word that's the opposite of
| the word formed by the second syllable's letters. For instance,
| "Sunoco" is a gas station, but that's six letters. "Shell" is
| five letters but one syllable. "Hertz" is car rental. Five
| letters, one syllable. "Avis"? Four letters. "FedEx"? Five
| letters? F-E-D-E-X? No, that's six. Wait, FedEx is two
| syllables: Fed-Ex. Five letters? F-E-D-E-X? That's five
| letters? F, E, D, E, X. Yes, five letters. FedEx. The first
| syllable is "Fed", the second is "Ex". Now, "Fed" as in past
| tense of feed, but maybe "Fed" as in federal. The opposite of
| "Ex" could be "current" or "present". But that doesn't seem
| right. Alternatively, "Ex" as in former, so the opposite would
| be "current" or "new". Hmm, not sure. Maybe "FedEx" isn't the
| answer.
|
| Another thought: "Amtrak" is a brand on highways, but that's
| six letters. "Greyhound"? Too long. "Dollar" car rental? Five
| letters? No, six. "Thrifty" is seven. Wait, maybe "Lyft"? No,
| four letters. "U-Haul" is hyphenated.
|
| Wait, let's think of opposites. The first syllable's sound is a
| word, and the second syllable's spelling is its opposite. So,
| for example, if the brand is "Inote", where "I" (pronounced
| "eye") and "note", but "eye" and "note" don't have opposites.
| Alternatively, "Hi-Lo", but that's hyphenated. Maybe "Dayton"
| (a tire company?), but "Day" vs "ton" - not opposites.
|
| Alternatively, maybe "Duracell" - no, too long. "Energizer"?
| No. "Eveready"? No. Wait, maybe "Goodyear". That's a tire
| company. Good-year. "Good" and "year". Not opposites.
|
| Wait, the key is that the first syllable phonetically is a
| word, and the second syllable's spelling is the opposite. So,
| for instance, if the first syllable is "pro", the second
| syllable's letters spell "con". But "procon" isn't a brand. Or
| "up" and "down". Maybe "Updown", but not a real brand.
|
| Wait, the brand "Exxon" - "Ex" and "xon". If "Ex" is the
| opposite of "xon" spelled as "xon" - no. Maybe "Ex" (as in
| former) and "xon" - but "xon" isn't a word. Alternatively,
| "Exxon" could be broken into "Ex" and "on". Wait, "Ex" (as in
| former) and "on" (the opposite being "off"). But "Exxon" is
| E-X-X-O-N, so the second syllable is "xon", not "on". Hmm.
|
| Another approach: think of opposites like "yes/no", "hot/cold",
| "fast/slow", "high/low", "in/out", "on/off", "start/stop",
| "day/night". Maybe the brand incorporates one of these.
|
| "Hi-Lo" is hyphenated. "Stop & Go" - no. "Speedpass" - not five
| letters. "7-Eleven" - no.
|
| Wait, "Loews" hotels? Five letters. "Lo-ews". "Lo" as in low,
| and "ews" - not an opposite. Maybe not.
|
| Wait, "Lucent" - a telecom company, but that's six letters.
| "Verizon" - seven letters.
|
| Wait, "Motel 6" - no, includes a number. "Super 8" - same.
|
| Wait, "Shell" - five letters but one syllable. "Chevron" - two
| syllables but seven letters.
|
| Wait, "Pepsi" - five letters, two syllables. "Pep" and "si".
| "Pep" as in energy, and "si" - not an opposite.
|
| "Cola" - not a brand. "Coke" - one syllable.
|
| Wait, "Adidas" - three syllables. "Nike" - two syllables but
| four letters.
|
| Wait, "Reebok" - six letters. "Fila" - four.
|
| Wait, maybe a car brand. "Ford" - one syllable. "Chevy" - two
| syllables but five letters? C-H-E-V-Y. Five letters, "Chev" and
| "y"? Not opposites.
|
| Wait, "Honda" - five letters, two syllables: Hon-da. Not
| opposites. "Toyota" - too long. "Lexus" - two syllables but
| five letters. "Le-xus". Not opposites.
|
| Wait, "Nissan" - two syllables, six letters. "Mazda" - two
| syllables, five letters. M-A-Z-D-A. "Maz" and "da". Doesn't
| fit.
|
| Wait, "Dodge" - one syllable. "Jeep" - one.
|
| Wait, "Subaru" - three syllables.
|
| Hmm. Let's think of gas stations again. "Shell", "Exxon",
| "Mobil", "Texaco", "Chevron", "BP", "Valero", "Citgo".
|
| Citgo: C-I-T-G-O. Two syllables: Cit-go. "Cit" (like "sit") and
| "go". Wait, "sit" and "go" - those are opposites in a way. If
| you sit, you're not moving; if you go, you're moving. So
| "Citgo" - "Cit" phonetically is "sit", and "go" is spelled as
| "go". So the first syllable sounds like "sit", which is the
| opposite of "go". That might be it!
|
| Citgo is a gas station brand, seen on highways, five letters
| (C-I-T-G-O), two syllables: Cit-go. Phonetically, "Cit" sounds
| like "sit", and "go" is the opposite. So the first syllable
| (phonetically "sit") is the opposite of the second syllable
| spelled "go". That seems to fit.
|
| Let me double-check. The brand is Citgo. Five letters, two
| syllables. First syllable "Cit" pronounced like "sit", which is
| the opposite of "go". Yes, that makes sense. I think that's the
| answer. </think>
|
| The brand name is *Citgo*.
|
| Citgo is a well-known gas station brand commonly seen along
| U.S. roads and highways. It has five letters and two syllables:
| *Cit* (pronounced like "sit") and *go*. The first syllable
| phonetically represents "sit," which is the opposite of the
| second syllable spelled as "go." This clever play on opposites
| fits the riddle's criteria perfectly.
| zinccat wrote:
| I see, but still there's a lot of reasoning just for
| counting the letters, and ridiculous reasoning like:
|
| FedEx"? Five letters? F-E-D-E-X? No, that's six. Wait, FedEx
| is two syllables: Fed-Ex. Five letters? F-E-D-E-X? That's
| five letters? F, E, D, E, X. Yes, five letters. FedEx.
| enum wrote:
| Definitely a lot of letter counting. It's not not a factor.
| I think the real problem is that the search space for each
| problem is enormous. When it gets stuck, it just gets stuck
| enumerating candidates that meet some but not all of the
| constraints.
| lokimedes wrote:
| As if the whole anti-intellectual hunt wasn't enough, now "PhD"
| is a category implying a holder of rote knowledge at the
| highest level. I guess it is hopeless to fight this, but a PhD
| is 100x more about the apprenticeship and real-world training
| as a scientist than any accumulated knowledge beyond one's
| prior training.
|
| I know this is a rant, sorry, just so tired of the stupidity.
| mlinhares wrote:
| There's no multibillion dollar business if they can't say
| everyone will be replaced by their models.
| DiscourseFan wrote:
| PhD students aren't necessarily the brightest (which is
| probably why many of them don't end up with tenure), but
| studying something at the graduate level does open your eyes to
| a particular kind of knowledge which is generally unavailable
| otherwise. So, yes, of course the work is more important than
| the credential, but the credential often stands in for the work
| that gives that knowledge.
| XCabbage wrote:
| This doesn't feel like a "reasoning" challenge. The mental skill
| required to solve most of these seems to be the ability to loop
| over all known members of a category like "popular brand names"
| or "well-known actors" and see if they fit the clue.
|
| As a human, you'd expect to fail either because you didn't know a
| category member (e.g. as a non-American I have no idea WTF
| "Citgo" is; I could never get the answer to the first question
| because I have never seen that name before in my life) or because
| you weren't able to bring it to mind; the mental act of looping
| over all members of a category is quite challenging for a human.
|
| Admittedly this is something an AI system could in principle be
| REALLY good at, and it's interesting to test and see that current
| ones are not! But it seems weird to me to call what's being
| tested "reasoning" when it's so heavily focused on memory recall
| (and evaluating whether a candidate answer works or not is
| trivial once you've brought it to mind and doesn't really require
| any intelligent thought).
|
| (If the questions were multiple-choice, eliminating the challenge
| of bringing candidate answers to mind that is the main challenge
| for a human, then I'd agree it was a "reasoning" test.)
| enum wrote:
| It's definitely U.S.-centric.
|
| But, I do think this is reasoning. It requires recall, but so
| does anything other than a pure logic puzzle. For example, on
| a competition math problem or a programming problem, no person
| or LLM is inventing well-known lemmas and algorithms from
| first principles.
| XCabbage wrote:
| It's not just that it requires recall. It's that it requires
| no non-trivial thought beyond recall.
| enum wrote:
| I think what you mean is that once you've managed to
| recall, checking constraints is easy. Remarkably, a few
| people are much better at this than others. They are able
| to think fast and execute an explicit mental search over a
| very small number of plausible candidates. Other people
| take forever. Seems to be the case for models too.
| jkhdigital wrote:
| I had the same thought. It reminds me of solving Project Euler
| problems, where there is often an obvious naive approach which
| is guaranteed to produce the correct answer but would consume
| prohibitive memory/compute resources to execute to completion.
| I suspect the models would perform much better if prompted to
| formulate a strategy for efficiently solving these challenges
| rather than solving them directly... which indicates a
| direction for potential improvement I suppose.
| tkgally wrote:
| I agree that recall seems to play an important role in solving
| these problems, similar to how the ARC-AGI problems seem to
| depend on visual recognition and classification of shapes and
| colors. When I come up with the correct answers to such
| puzzles, I feel subjectively that the answers flashed into my
| mind, not that I reasoned my way to them.
| mkoubaa wrote:
| The term "PhD Knowledge" makes me cringe. PhD is about
| contribution to a field, not a standardized test score or a weird
| status-bearing substitute for IQ.
| glial wrote:
| Also - at least in my experience - doing a PhD makes a person
| more aware of the boundaries of knowledge, and therefore
| generally pretty skeptical, and capable of being usefully
| critical, about developments in their field.
| enum wrote:
| There is a trend of building hard benchmarks for LLMs, such as
| GPQA or Humanity's Last Exam, with questions crafted by PhDs.
| Ergo, most people cannot understand what these benchmarks are
| testing. If you're an expert in something, and if you're on HN
| you may be an expert in some sub-field of CS, I recommend
| trying to solve the questions from your field on Humanity's
| Last Exam. I took a crack at the Programming Languages
| questions myself.
| charlieyu1 wrote:
| - wants PhD knowledge - pays like $100-500 per question
| levocardia wrote:
| You cannot get a PhD (in the US) without also passing an
| examination that, nominally at least, demonstrates you have
| broad and deep knowledge about your field. The qualifying exam
| is not the most important part, but it is a necessary part.
| enum wrote:
| Some still exist, but this is less and less true over time.
| Even when a requirement is called an "exam", it isn't what a
| high-schooler would recognize as one.
|
| Many of the exams described on this page are personalized and
| only require mastery of a handful of papers that really
| matter to the individual student, and knowledge of related
| work:
|
| https://www.cs.stanford.edu/phd-program-requirements-
| qualify...
|
| I've been through something that sounds similar to these, and
| I would not have called it an exam.
| krisoft wrote:
| In this case it is a shorthand for specialist-level knowledge:
| the kind of information you would expect a person pursuing a
| PhD in a specific field to know, but someone not doing that to
| not know.
|
| For example how in the "Humanity's Last Exam" AI evaluation
| dataset they have this question: "Hummingbirds within
| Apodiformes uniquely have a bilaterally paired oval bone, a
| sesamoid embedded in the caudolateral portion of the expanded,
| cruciate aponeurosis of insertion of m. depressor caudae. How
| many paired tendons are supported by this sesamoid bone? Answer
| with a number."
|
| That's the kind of "PhD knowledge" which this evaluation does
| not require.
| buescher wrote:
| It means you have taken graduate-level courses and have
| completed an apprenticeship in doing publishable academic
| research. Now that a master's is the new bachelor's, there's
| something of a trend to think of a PhD as the new master's,
| but it's really qualitatively different.
| akomtu wrote:
| The reasoning challenge is made of two parts:
|
| 1. Can you apply an existing model to a problem? For example:
| you're told how to multiply numbers and asked to multiply AHFG by
| VRBD in base-26 system.
|
| 2. Can you come up with a model that explains the given examples?
| For example: you're given 10 triples like AxB=C and asked to
| explain what they have in common.
|
| Simply imitating answers won't get you very far.
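The first kind of task described above is mechanical once the model is stated. A minimal sketch, assuming the common digit convention A=0 .. Z=25 (the comment doesn't specify one):

```python
# Multiply two base-26 numerals, as in the "multiply AHFG by VRBD"
# example above. Digit convention A=0 .. Z=25 is an assumption.

def from_base26(s):
    """Interpret an uppercase string as a base-26 integer, A=0 .. Z=25."""
    n = 0
    for ch in s:
        n = n * 26 + (ord(ch) - ord('A'))
    return n

def to_base26(n):
    """Render a non-negative integer as a base-26 numeral, A=0 .. Z=25."""
    if n == 0:
        return 'A'
    digits = []
    while n:
        n, r = divmod(n, 26)
        digits.append(chr(ord('A') + r))
    return ''.join(reversed(digits))

product = to_base26(from_base26('AHFG') * from_base26('VRBD'))
```

The second kind of task (inferring the rule from ten AxB=C triples) is the genuinely hard direction; no short snippet captures it.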
| rahidz wrote:
| What is so interesting to me is that the reasoning traces for
| these often have the correct answer, but the model fails to
| realize it.
|
| Problem 3 ("Dry Eye"), R1: "Wait, maybe "cubitus valgus" - no,
| too long. Wait, three letters each. Let me think again. Maybe
| "hay fever" is two words but not three letters each. Maybe "dry
| eye"? "Dry" and "eye" - both three letters. "Dry eye" is a
| condition. Do they rhyme? "Dry" (d-rye) and "eye" (i) - no, they
| don't rhyme. "Eye" is pronounced like "i", while "dry" is
| "d-rye". Not the same ending."
|
| Problem 8 ("Foot nose"), R1: "Wait, if the seventh letter is
| changed to next letter, maybe the original word is "footnot" (but
| that's not a word). Alternatively, maybe "foot" + "note", but
| "note" isn't a body part."
| Workaccount2 wrote:
| The other day I watched Gemini reasoning go in circles for ~30
| seconds trying to determine if 9.11 was greater than 9.9.
|
| ----
|
| "Realize a direct comparison of tenths is not enough: Wait,
| this is not right. In 9.9, the tenths digit is 9. In 9.11, the
| tenths digit is 1. This initial comparison is incorrectly
| stated. Let's re-examine.
|
| Re-examine tenths place comparison (correctly): In 9.9, the
| tenths digit is 9. In 9.11, the tenths digit is 1. Wait again!
| This is still incorrect. I'm misreading the digits. Let's slow
| down.
|
| Re-examine the numbers digit by digit from left to right, after
| the decimal:
|
| 9.9: 9 in the tenths place, nothing explicitly written in the
| hundredths place (we can think of it as 0 in the hundredths
| place, so 9.90).
|
| 9.11: 1 in the tenths place, 1 in the hundredths place.
|
| Compare tenths places: In 9.9 (or 9.90), the tenths digit is 9.
| In 9.11, the tenths digit is 1. Wait... something is still
| wrong with my comparison logic. I'm getting confused. Let's re-
| read the numbers again.
|
| Correctly read and compare tenths digits:
|
| 9.9 - Tenths digit is 9.
|
| 9.11 - Tenths digit is 1. This is still wrong! I am
| consistently misreading the numbers. Let's go back to the
| start."
|
| ----
|
| This shows steps 5 to 9. It went to step _35_ before
| determining that 9.9 > 9.11, and that was after it decided to
| convert to money, $9.11 vs $9.90.
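The comparison the model looped on is a one-liner when done numerically rather than as a string comparison. A minimal sketch, including the "convert to money" trick made explicit (the `cents` helper is illustrative, not from any library):

```python
from decimal import Decimal

# Compare as exact decimals: 9.9 means 9.90, so 9.9 > 9.11.
assert Decimal('9.9') > Decimal('9.11')

def cents(s):
    """The money trick: pad to two fractional digits, compare integers."""
    whole, _, frac = s.partition('.')
    return int(whole) * 100 + int(frac.ljust(2, '0'))

assert cents('9.9') == 990 and cents('9.11') == 911
```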
| armcat wrote:
| It feels like a lot of the reasoning tokens go to waste on a
| pure brute-force approach - plugging in numbers, evaluating,
| and comparing against the answer. "Nope, that didn't work,
| let's try 4 instead of 6 this time", etc. What if the reward
| function instead focused on diversity of procedures within a
| token budget (10k - 20k tokens)? I.e., RL rewards the model
| for trying different methods or generating different
| hypotheses, rather than brute-forcing its way through and
| potentially getting stuck in loops.
| enum wrote:
| The nature of the problems makes it relatively easy to follow
| along with the models' reasoning and reasoning errors. For
| example, on this problem (answer "New England"):
|
| > Think of a place in America. Two words, 10 letters
| altogether. The first five letters read the same forward and
| backward. The last five letters spell something found in the
| body. What place is this?
|
| R1 fixates on answers of the form "CITY, STATE" and eventually
| returns some confidently wrong nonsense. It doesn't try to
| explore answers that don't fit the "CITY, STATE" template.
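The constraints in that "New England" puzzle are mechanical to verify once a candidate comes to mind, which is what makes the recall step the hard part. A minimal sketch; the body-part list is an illustrative placeholder, not from the paper:

```python
# Check the two constraints from the puzzle above: first five letters
# read the same forward and backward, last five spell a body part.
# BODY_PARTS is an illustrative placeholder list.
BODY_PARTS = {'gland', 'heart', 'liver', 'ankle', 'elbow'}

def fits(place):
    letters = place.replace(' ', '').lower()
    if len(letters) != 10:
        return False
    head, tail = letters[:5], letters[5:]
    return head == head[::-1] and tail in BODY_PARTS

assert fits('New England')      # 'newen' is a palindrome; 'gland' fits
assert not fits('Los Angeles')  # 10 letters, but 'losan' isn't one
```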
___________________________________________________________________
(page generated 2025-02-09 23:00 UTC)