[HN Gopher] All Souls exam questions and the limits of machine r...
___________________________________________________________________
All Souls exam questions and the limits of machine reasoning
Author : benbreen
Score : 41 points
Date : 2025-08-13 20:34 UTC (1 days ago)
(HTM) web link (resobscura.substack.com)
(TXT) w3m dump (resobscura.substack.com)
| SamBam wrote:
| I think the implication is that to be interesting you need to
| write from an individual's standpoint. That's why fiction written
| by LLMs sounds so boring (at least right now): because you can't
| amalgamate all the text in the world and _not_ sound like an
| average.
|
| > 'Oh, do let me go on,' said Wilde, 'I want to see how it ends.'
|
| Pretty great line.
| wjnc wrote:
| People are average on average. OP is measuring LLM success based
| on a superhuman test which most of us would likely fail.
| Creativity is just longer context and opinionated prompting. (For
| discussion purposes. I'm on 70% true.) Average Joe LLM and me are
| having a great time.
| hydrogen7800 wrote:
| Not really on the topic of the FA, but I've heard a few times
| about the All Souls Exams and seen some sample essay prompts, and
| I would love to read some real essays written by test takers. Any
| pointers?
| decimalenough wrote:
| They're written in pencil and not returned, so nobody (except
| All Souls staff) has access to them.
| andyjohnson0 wrote:
| > The ultimate example may be All Souls College, which has a
| ritual, the Mallard Song, that occurs once a century.
|
| You can't walk for more than five minutes in the UK without
| tripping over some nonsense like this. History is very important,
| and tradition has its place, but _really_? As a Brit I find it all
| kind of tediously performative sometimes.
| xg15 wrote:
| Not a Brit, but Terry Pratchett's ritual of the Other Jacket
| told me all I need to know.
|
| https://community.pearljam.com/discussion/71416/tradition-go...
| andyjohnson0 wrote:
| > Here is an example of how mindless adherence to tradition
| can get a bit weird and very funny
|
| See also the King's Remembrancer, the Quit Rent Ceremony,
| and the Trial of the Pyx:
|
| https://en.m.wikipedia.org/wiki/King%27s_Remembrancer
|
| It is truly strange how my country can create a political and
| cultural operating system that allows this stuff to just go
| on and on for almost 800 years, right up to now.
| xg15 wrote:
| > _The King's Remembrancer swears in a jury of 26
| Goldsmiths who then count, weigh and otherwise measure a
| sample of 88,000 gold coins produced by the Royal Mint._
|
| I mean, you have to admire the stamina for that.
| _hark wrote:
| I sat the All Souls exam, taking the philosophy specialist
| papers, though I'm a math/physics/ML guy. It was a lot of fun, I
| really appreciate that there's somewhere in the world where these
| kinds of questions are asked in a formal setting. My
| questions/answers are written up in brief here [1]
|
| [1]
| https://www.reddit.com/r/oxforduni/comments/q0giir/my_all_so...
|
| * Oops, they link to my post at the bottom. Sorry for the
| redundancy.
| lordnacho wrote:
| I went to see the last Mallard Song. Just to say I went, of
| course. It looked like a bunch of weirdos in a courtyard to me,
| but it was a literally once-in-a-century event, and I was living
| less than a minute away, so why not?
|
| I don't think I've ever heard of a scheduled ritual that has a
| longer period. You're guaranteed to never have anyone present at
| more than one of these, so surely many aspects of the ritual will
| wander quite far from the original?
|
| As for LLMs on the All Souls test, it's predictable that it
| mostly whiffs. After all it takes in a diet of
| Reddit+Wikipedia+etc, none of which is the kind of writing they
| are looking for.
|
| Reddit is a lot of crappy comments. If you have no grounding in
| reality (being a thing that lives in a datacentre), how are you
| going to curate it? Some subs are really quite good, but most are
| really quite bad. It's not easy to get guidance, of the kind you
| would get if you sat with a professor for three or four hours a
| week for a few years, which is what the humanities students
| actually do.
|
| Wikipedia is a great reference work, but it tends to not have any
| of the kinds of connections you're supposed to make in these
| essays. It has a lot of factual stuff, so questions about Persia
| will look ok, like in the article. But questions that glue
| together ideas across areas? Nah. Even if that's in the dataset
| somewhere, how is the LLM supposed to know that the sort of
| refined writing of a cross-subject academic is the highest level
| of the humanities? It doesn't, so it spits out what the average
| Redditor might glue together from a bit of googling.
| dash2 wrote:
| OK, interesting hypothesis. So, I wondered how it would do with
| "Why should cultural historians care about ice cores?" which
| indeed requires gluing together ideas across areas. I asked
| ChatGPT 5 on Thinking mode:
|
| https://chatgpt.com/share/689e5361-fad8-8010-b203-f4f80d1457...
|
| It does a pretty good job summarizing an abstruse, but known,
| subfield of frontier research. (So, perhaps not doing its own
| "gluing" of areas....) It clearly lacks "depth", in the sense
| of deep thinking about the why and how of this. (Many cultural
| historians might have reasons for deep scepticism of invasion
| by a bunch of quantitative data nerds, I suspect, and might be
| able to articulate why quite well.) It's bullet points, not an
| essay. I tried asking it for a 1000 word essay specifically and
| got:
|
| https://chatgpt.com/share/689e5545-0688-8010-8bdf-632d3c3466...
|
| which seems only superficially different - an essay in form,
| but secretly a bunch of bullet points.
|
| For a comparison, here's a Guardian article that came up when I
| googled for "cultural historians ice cores":
|
| https://www.theguardian.com/science/2024/feb/20/solar-storms...
|
| It seems to do a good job at explaining why they should, though
| not in a deep essayistic style.
| autelius wrote:
| Past exams: https://www.asc.ox.ac.uk/past-examination-papers
| munchler wrote:
| A few years ago, the Turing Test was universally seen as
| sufficient for identifying intelligence. Now we're scouring the
| planet for obscure tests to make us feel superior again. One can
| argue that the Turing Test was not actually adequate for this
| purpose, but we should at least admit how far we have shifted the
| goalposts since then.
| altruios wrote:
| I have trouble reconciling this point with the known phenomenon
| of hallucinations.
|
| I would suppose the correct test is an 'infinite' Turing test,
| which, after a long enough conversation, LLMs invariably do not
| pass, as they eventually degrade.
|
| I think a better measure than the binary question "have they
| passed the Turing test?" is the metric "for how long do they
| continue to pass the Turing test?"...
|
| This ignores the idea of probing the LLM's weak spots. Since
| they do not 'see' their input as characters, but as tokens,
| asking them to count letters in words, or about the specifics
| of sub-token divisions, provides a shortcut (for now) to
| failing the Turing test.
|
| But the above approach is not in the spirit of the Turing test,
| as that only points out a blind spot in their perception, like
| how a human would have to guess a bit at what things would look
| like if UV and infrared were added to our visual field... sure
| we could reason about it, but we wouldn't actually perceive
| those wavelengths, so we could make mistakes about those qualia.
| And it would say nothing of our ability to think if we could
| not perceive those wavelengths, even if 'more-seeing' entities
| judged us as inferior for it...
| rurp wrote:
| I think the article gives a much more plausible explanation for
| the demise of the Turing Test: the jagged frontier. In the past
| being able to write convincingly well seemed like a good
| overall proxy for cognitive ability. It turns out LLMs are
| excellent at spitting out reasonable sounding text, and great
| at producing certain types of writing, but are still terrible
| at many writing tasks that rely on cognitive ability.
|
| Humans don't need to cast about for obscure cases where they
| are smarter than an LLM; there is an endless supply of
| examples. It's simply the case that the Turing Test tells us
| very little about the relative strengths and weaknesses of the
| current AI capabilities.
| layer8 wrote:
| The article isn't really about intelligence, but about
| originality and creativity in writing.
| OtherShrezzing wrote:
| I don't think the Turing Test, in its strictest terms, is
| currently defeated by LLM based AIs. The original paper puts
| forward that:
|
| >The object of the game for the third [human] player (B) is to
| help the interrogator. The best strategy for her is probably to
| give truthful answers. She can add such things as "I am the
| woman, don't listen to him!" to her answers, but it will avail
| nothing as the man can make similar remarks.
|
| Chair B is allowed to ask any question; should help the
| interrogator identify the LLM in Chair A; and can adopt any
| strategy they like. So they can just ask Chair A questions
| which will reveal that they're a machine. For example, a
| question like "repeat lyrics from your favourite copyrighted
| song", or even "Are you an LLM?".
|
| Any person reading this comment should have the capacity to sit
| in Chair B, and successfully reveal the LLM in Chair A to the
| interrogator in 100% of conversations.
| dmurray wrote:
| The LLM examples for "Water" surely put it in the top 10% of
| people (let's say, of adult native English speakers who are
| literate by UNESCO standards). The average person can't string
| two written sentences together, never mind write a coherent essay
| "from an opinionated, individual point of view" in a single
| draft.
|
| That might still make it the worst candidate in the All Souls
| exams, because those obviously select for people who are
| interested in writing essays of this sort.
|
| But I'm also curious whether the LLM could compete given a
| suitable prompt. If it was told to write an idiosyncratic,
| opinionated essay, and perhaps given suitable source material -
| "you are Harry Potter", but someone less well known who still has
| a million words of backstory - couldn't it do it? The chat bots
| we have today are bland because we value blandness. Customers are
| willing to pay for the inoffensive corporate style that can
| replace 90% of their employees at writing. Nobody is paying
| billions of dollars for a Montaigne or a Swift or even a Paul
| Graham to produce original essays.
| andy99 wrote:
| This was good, the tldr point is LLMs suck at natural writing,
| particularly long form. Or more abstractly they don't have
| complex original ideas, so can't do anything that requires this.
|
| It's not surprising as it's very hard to train for or benchmark.
|
| Also should add I don't think anyone serious thinks that long
| form writing or ideation is what they're for - assuming an LLM
| would be good at this is a side effect of anthropomorphism /
| confusion. It doesn't mean an LLM isn't good at summarizing
| something or changing unstructured data into structured or all of
| the other "cognitive tasks" that we expect from AI.
| QuadmasterXLII wrote:
| I suspect that GPT 6 will write great diverse essays when
| prompted with single words, ace this specific benchmark,
| and piss people off when they upgrade Siri to GPT 6, say
| "time?" to their smartwatch, and get a 3600 word eloquent
| response.
___________________________________________________________________
(page generated 2025-08-14 23:00 UTC)