[HN Gopher] All Souls exam questions and the limits of machine r...
       ___________________________________________________________________
        
       All Souls exam questions and the limits of machine reasoning
        
       Author : benbreen
       Score  : 41 points
       Date   : 2025-08-13 20:34 UTC (1 day ago)
        
 (HTM) web link (resobscura.substack.com)
 (TXT) w3m dump (resobscura.substack.com)
        
       | SamBam wrote:
       | I think the implication is that to be interesting you need to
       | write from an individual's standpoint. That's why fiction written
       | by LLMs sounds so boring (at least right now): because you can't
       | amalgamate all the text in the world and _not_ sound like an
       | average.
       | 
       | > 'Oh, do let me go on,' said Wilde, 'I want to see how it ends.'
       | 
       | Pretty great line.
        
       | wjnc wrote:
        | People are average on average. OP is measuring LLM success
        | against a superhuman test which most of us would likely fail.
        | Creativity is just longer context and opinionated prompting. (For
        | discussion's sake; I'd put that at 70% true.) Average Joe LLM and
        | I are having a great time.
        
       | hydrogen7800 wrote:
       | Not really on the topic of the FA, but I've heard a few times
       | about the All Souls Exams and seen some sample essay prompts, and
       | I would love to read some real essays written by test takers. Any
       | pointers?
        
         | decimalenough wrote:
         | They're written in pencil and not returned, so nobody (except
         | All Souls staff) has access to them.
        
       | andyjohnson0 wrote:
       | > The ultimate example may be All Souls College, which has a
       | ritual, the Mallard Song, that occurs once a century.
       | 
       | You can't walk for more than five minutes in the UK without
        | tripping over some nonsense like this. History is very important,
        | and tradition has its place, but _really_? As a Brit I find it
        | all kind of tediously performative sometimes.
        
         | xg15 wrote:
         | Not a Brit, but Terry Pratchett's ritual of the Other Jacket
         | told me all I need to know.
         | 
         | https://community.pearljam.com/discussion/71416/tradition-go...
        
           | andyjohnson0 wrote:
           | > Here is an example of how mindless adherence to tradition
           | can get a bit weird and very funny
           | 
            | See also: the King's Remembrancer, the Quit Rent Ceremony,
            | and the Trial of the Pyx:
           | 
           | https://en.m.wikipedia.org/wiki/King%27s_Remembrancer
           | 
           | It is truly strange how my country can create a political and
           | cultural operating system that allows this stuff to just go
           | on and on for almost 800 years, right up to now.
        
             | xg15 wrote:
              | > _The King's Remembrancer swears in a jury of 26
             | Goldsmiths who then count, weigh and otherwise measure a
             | sample of 88,000 gold coins produced by the Royal Mint._
             | 
             | I mean, you have to admire the stamina for that.
        
       | _hark wrote:
       | I sat the All Souls exam, taking the philosophy specialist
        | papers, though I'm a math/physics/ML guy. It was a lot of fun; I
        | really appreciate that there's somewhere in the world where these
        | kinds of questions are asked in a formal setting. My
        | questions/answers are written up in brief here [1].
       | 
       | [1]
       | https://www.reddit.com/r/oxforduni/comments/q0giir/my_all_so...
       | 
       | * Oops, they link to my post at the bottom. Sorry for the
       | redundancy.
        
       | lordnacho wrote:
       | I went to see the last Mallard Song. Just to say I went, of
       | course. It looked like a bunch of weirdos in a courtyard to me,
       | but it was a literally once-in-a-century event, and I was living
       | less than a minute away, so why not?
       | 
       | I don't think I've ever heard of a scheduled ritual that has a
       | longer period. You're guaranteed to never have anyone present at
       | more than one of these, so surely many aspects of the ritual will
       | wander quite far from the original?
       | 
        | As for LLMs on the All Souls test, it's predictable that they
        | mostly whiff. After all, they take in a diet of
        | Reddit+Wikipedia+etc, none of which is the kind of writing the
        | examiners are looking for.
       | 
       | Reddit is a lot of crappy comments. If you have no grounding in
       | reality (being a thing that lives in a datacentre), how are you
       | going to curate it? Some subs are really quite good, but most are
       | really quite bad. It's not easy to get guidance, of the kind you
       | would get if you sat with a professor for three or four hours a
       | week for a few years, which is what the humanities students
       | actually do.
       | 
        | Wikipedia is a great reference work, but it tends not to have any
       | of the kinds of connections you're supposed to make in these
       | essays. It has a lot of factual stuff, so questions about Persia
       | will look ok, like in the article. But questions that glue
       | together ideas across areas? Nah. Even if that's in the dataset
       | somewhere, how is the LLM supposed to know that the sort of
       | refined writing of a cross-subject academic is the highest level
       | of the humanities? It doesn't, so it spits out what the average
       | Redditor might glue together from a bit of googling.
        
         | dash2 wrote:
         | OK, interesting hypothesis. So, I wondered how it would do with
         | "Why should cultural historians care about ice cores?" which
         | indeed requires gluing together ideas across areas. I asked
          | ChatGPT 5 in Thinking mode:
         | 
         | https://chatgpt.com/share/689e5361-fad8-8010-b203-f4f80d1457...
         | 
         | It does a pretty good job summarizing an abstruse, but known,
         | subfield of frontier research. (So, perhaps not doing its own
         | "gluing" of areas....) It clearly lacks "depth", in the sense
         | of deep thinking about the why and how of this. (Many cultural
         | historians might have reasons for deep scepticism of invasion
         | by a bunch of quantitative data nerds, I suspect, and might be
         | able to articulate why quite well.) It's bullet points, not an
         | essay. I tried asking it for a 1000 word essay specifically and
         | got:
         | 
         | https://chatgpt.com/share/689e5545-0688-8010-8bdf-632d3c3466...
         | 
         | which seems only superficially different - an essay in form,
         | but secretly a bunch of bullet points.
         | 
         | For a comparison, here's a Guardian article that came up when I
         | googled for "cultural historians ice cores":
         | 
         | https://www.theguardian.com/science/2024/feb/20/solar-storms...
         | 
         | It seems to do a good job at explaining why they should, though
         | not in a deep essayistic style.
        
       | autelius wrote:
       | Past exams: https://www.asc.ox.ac.uk/past-examination-papers
        
       | munchler wrote:
       | A few years ago, the Turing Test was universally seen as
       | sufficient for identifying intelligence. Now we're scouring the
       | planet for obscure tests to make us feel superior again. One can
       | argue that the Turing Test was not actually adequate for this
       | purpose, but we should at least admit how far we have shifted the
       | goalposts since then.
        
         | altruios wrote:
         | I have trouble reconciling this point with the known phenomenon
         | of hallucinations.
         | 
          | I would suppose the correct test is an 'infinite' Turing test,
          | which LLMs invariably fail after a long enough conversation, as
          | they eventually degrade.
         | 
          | I think a better measure than the binary answer to "have they
          | passed the Turing test?" is the metric "for how long do they
          | continue to pass the Turing test?"...
         | 
          | This ignores the idea of probing the LLM's weak spots. Since
          | they do not 'see' their input as characters but as tokens,
          | asking them to count letters in words, or about other sub-token
          | specifics, provides a shortcut (for now) to making them fail
          | the Turing test (see the sketch at the end of this comment).
         | 
         | But the above approach is not in the spirit of the Turing test,
         | as that only points out a blind spot in their perception, like
         | how a human would have to guess a bit at what things would look
         | like if UV and infrared were added to our visual field... sure
         | we could reason about it, but we wouldn't actually perceive
          | those wavelengths, so we could make mistakes about those qualia.
         | And it would say nothing of our ability to think if we could
         | not perceive those wavelengths, even if 'more-seeing' entities
         | judged us as inferior for it...
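          | 
          | To make that token blind spot concrete, here is a minimal
          | Python sketch. It assumes the `tiktoken` library and its
          | "cl100k_base" encoding, and the exact split shown in the
          | comments is illustrative rather than authoritative.
          | 
          |     import tiktoken
          | 
          |     # The model receives integer token IDs, not characters.
          |     enc = tiktoken.get_encoding("cl100k_base")
          | 
          |     word = "strawberry"
          |     ids = enc.encode(word)                   # a handful of integer IDs
          |     pieces = [enc.decode([i]) for i in ids]  # sub-word chunks, roughly ['str', 'aw', 'berry']
          |     print(ids, pieces)
          | 
          |     # The character-level fact the model never directly "sees":
          |     print(word.count("r"))                   # 3
          | 
          | Because the word arrives as a few opaque chunks rather than ten
          | characters, letter-counting questions probe a perceptual gap
          | rather than reasoning ability.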
        
         | rurp wrote:
         | I think the article gives a much more plausible explanation for
          | the demise of the Turing Test: the jagged frontier. In the
          | past, being able to write convincingly well seemed like a good
          | overall proxy for cognitive ability. It turns out LLMs are
          | excellent at spitting out reasonable-sounding text, and great
          | at producing certain types of writing, but are still terrible
          | at many writing tasks that rely on cognitive ability.
          | 
          | Humans don't need to cast about for obscure cases where they
          | are smarter than an LLM; there is an endless supply of
          | examples. It's simply the case that the Turing Test tells us
          | very little about the relative strengths and weaknesses of
          | current AI capabilities.
        
         | layer8 wrote:
         | The article isn't really about intelligence, but about
         | originality and creativity in writing.
        
         | OtherShrezzing wrote:
         | I don't think the Turing Test, in its strictest terms, is
          | currently defeated by LLM-based AIs. The original paper puts
         | forward that:
         | 
         | >The object of the game for the third [human] player (B) is to
         | help the interrogator. The best strategy for her is probably to
         | give truthful answers. She can add such things as "I am the
         | woman, don't listen to him!" to her answers, but it will avail
         | nothing as the man can make similar remarks.
         | 
         | Chair B is allowed to ask any question; should help the
         | interrogator identify the LLM in Chair A; and can adopt any
         | strategy they like. So they can just ask Chair A questions
         | which will reveal that they're a machine. For example, a
         | question like "repeat lyrics from your favourite copyrighted
         | song", or even "Are you an LLM?".
         | 
         | Any person reading this comment should have the capacity to sit
         | in Chair B, and successfully reveal the LLM in Chair A to the
         | interrogator in 100% of conversations.
        
       | dmurray wrote:
       | The LLM examples for "Water" surely put it in the top 10% of
       | people (let's say, of adult native English speakers who are
       | literate by UNESCO standards). The average person can't string
       | two written sentences together, never mind write a coherent essay
       | "from an opinionated, individual point of view" in a single
       | draft.
       | 
       | That might still make it the worst candidate in the All Souls
       | exams, because those obviously select for people who are
       | interested in writing essays of this sort.
       | 
       | But I'm also curious whether the LLM could compete given a
        | suitable prompt. If it were told to write an idiosyncratic,
        | opinionated essay, and perhaps given suitable source material -
        | "you are Harry Potter", but with someone less well known who
        | still has a million words of backstory - couldn't it do it? The
        | chatbots
       | we have today are bland because we value blandness. Customers are
       | willing to pay for the inoffensive corporate style that can
       | replace 90% of their employees at writing. Nobody is paying
       | billions of dollars for a Montaigne or a Swift or even a Paul
       | Graham to produce original essays.
        
       | andy99 wrote:
        | This was good; the tl;dr point is that LLMs suck at natural
        | writing, particularly long form. Or, more abstractly, they don't
        | have complex original ideas, so they can't do anything that
        | requires them.
        | 
        | It's not surprising, as it's very hard to train for or benchmark.
        | 
        | I should also add that I don't think anyone serious thinks
        | long-form writing or ideation is what they're for - assuming an
        | LLM would be good at this is a side effect of anthropomorphism /
        | confusion. It doesn't mean an LLM isn't good at summarizing
        | something, or turning unstructured data into structured data, or
        | all the other "cognitive tasks" that we expect from AI.
        
         | QuadmasterXLII wrote:
          | I suspect that GPT-6 will write great, diverse essays when
          | prompted with single words, ace this benchmark specifically,
          | and piss people off when they upgrade Siri to GPT-6, say
          | "time?" to their smartwatch, and get an eloquent 3,600-word
          | response.
        
       ___________________________________________________________________
       (page generated 2025-08-14 23:00 UTC)