[HN Gopher] Results of "Humanity's Last Exam" benchmark published
___________________________________________________________________
Results of "Humanity's Last Exam" benchmark published
Author : tzury
Score : 53 points
Date : 2025-01-23 17:44 UTC (5 hours ago)
(HTM) web link (scale.com)
(TXT) w3m dump (scale.com)
| chvid wrote:
| So current AI can do less than 10% of these. But it probably
| won't be more than a few days until models start being trained on
| these, rendering the indicator invalid.
| LPisGood wrote:
| The 8 sample questions available here are interesting:
|
| https://lastexam.ai/
|
| I might be able to answer 2 of them with great effort (maybe!),
| and I would be highly surprised if any human alive could answer 5 or
| more without seeing the problems in advance.
| sebzim4500 wrote:
| I can answer 2 of them quite quickly with pen and paper
| (compsci, physics), and one more after looking up some
| definitions on Wikipedia (maths), so I am certain there are
| people who can do more than 5.
|
| The computer science one seems weirdly easy compared to the
| rest: it's multiple choice, and it is very easy to get it by
| process of elimination even if you don't understand how to
| actually do the problem.
| LPisGood wrote:
| Yes, many can answer the compsci and physics problems. The
| math problem is abstract and more difficult, but solving
| those 3 and 2 others seems nearly superhuman.
| zamalek wrote:
| I assume that the questions (and answers) aren't published
| anywhere? Else it would be "Humanity's Last Exam before the
| previous crawl".
| nico1207 wrote:
| You can just view the dataset on hugging face
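|
| For anyone curious, a minimal sketch of loading it with the
| datasets library (the dataset ID and field name below are
| assumptions; check the Hugging Face page for the exact ones):
|
|     from datasets import load_dataset  # pip install datasets
|
|     hle = load_dataset("cais/hle", split="test")  # ID assumed
|     print(hle[0]["question"])                     # field assumed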
| GaggiX wrote:
| It really shows how good Deepseek R1 is (even though it was
| evaluated only on text-only questions).
|
| The results are shown here: https://lastexam.ai/
|
| EDIT: the text-only evaluation of the models shown in the paper
| gives o1 an accuracy of 8.9%, so Deepseek R1 is even better than
| I thought.
| myrmidon wrote:
| Is there a text-only evaluation of the non-Deepseek models?
| Because being evaluated on text-only might have helped the
| other models immensely as well, from what I can tell.
| GaggiX wrote:
| >Is there a text-only evaluation of the non-Deepseek models?
|
| Not that I can see, but it would be cool to have; maybe the
| paper will have a more complete evaluation.
| og_kalu wrote:
| Section C.2 of the paper (pg 24) has text only evaluations
| of other models.
| GaggiX wrote:
| Oh I see, the paper is out, I read "(arXiv coming soon)"
| and thought it wasn't released yet.
| next_xibalba wrote:
| These types of exams, and most benchmarks to date, seem to be
| very one dimensional in terms of measuring intelligence. For
| instance, if we transported a human from 2,000 years ago to
| present day and asked him to take this exam, he would likely get
| 0%, given that he couldn't read or write, let alone comprehend
| the concepts and context required to solve these questions. But,
| that man would still undoubtedly be far more intelligent than an
| ape on all dimensions. He would likely be more intelligent than a
| toddler on many dimensions. He might even be more intelligent
| than some high school students on a few dimensions. I can't
| exactly articulate "what" is missing or how to measure it, but I
| can intuit that some things are missing from these benchmarks.
| oersted wrote:
| "Intelligence" itself is very ill-defined and we've never been
| able to measure it properly, IQ is rife with issues.
|
| At some point, you just have to be pragmatic and measure the
| questions you want the AI to be good at answering, rather than
| trying to measure intelligence in general.
|
| In that sense, I see this as one more benchmark that collects
| questions that we want/expect AI to be good at, that it is not
| yet good at, and that have been underrepresented in previous
| benchmarks.
| That's obviously valuable, there's nothing "magical" about it.
| Although it is reasonable to be annoyed at the "Humanity's Last
| Exam" naming, of course they must have missed plenty of edge-
| cases like everyone else and it is very arrogant to claim it
| will be the "Last" one.
| esotericimpl wrote:
| This is always the answer for anyone who thinks LLMs are
| capable of "intelligence".
|
| It's good at answering questions that it's trained on; I would
| suggest general intelligence is about answering things you didn't
| want/train the AI to be good at answering.
| fooker wrote:
| Are you good at answering questions you are not trained to
| answer?
|
| How about a middle school test in a language you don't
| speak?
| xanderlewis wrote:
| Yes -- reasonably so, anyway. I don't have to have seen
| millions of prior examples of exactly the same kind in
| order to tackle a novel problem in mathematics, say.
| oersted wrote:
| Well, LLMs are also remarkably good at generalizing. Look
| at the datasets, they don't literally train on every
| conceivable type of question the user might ask, the LLM
| can adapt just as you can.
|
| The actual challenge towards general intelligence is that
| LLMs struggle with certain types of questions even if you
| *do* train it on millions of examples of that type of
| question. Mostly questions that require complex logical
| reasoning, although consistent progress is being made in
| this direction.
| godelski wrote:
| > Well, LLMs are also remarkably good at generalizing.
| Look at the datasets, they don't literally train on every
| conceivable type of question the user might ask, the LLM
| can adapt just as you can.
|
| Proof needed.
|
| I'm serious. We don't have the datasets. But we do know
| the size of the datasets. And the sizes suggest
| incredible amounts of information.
|
| Take an estimate of 100 tokens ~= 75 words[0]. What is a
| trillion tokens? Well, that's 750bn words. There are
| approximately 450 words on a page[1]. So that's 1.66...
| bn pages! If we put that in 500 page books, that would
| come out to 3.33... million books!
|
| Llama 3 has a pretraining size of 15T tokens[2] (this
| does not include training, so more info added later). So
| that comes to ~50m books. Then, keep in mind that this
| data is filtered and deduplicated. Even considering a
| high failure rate in deduplication, this is an unimaginable
| amount of information.
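|
| For anyone who wants to check the arithmetic, a quick
| back-of-the-envelope sketch using the same rough estimates
| (~75 words per 100 tokens, ~450 words per page):
|
|     TOKENS = 15e12            # Llama 3 pretraining, ~15T tokens
|     WORDS_PER_TOKEN = 0.75    # 100 tokens ~= 75 words
|     WORDS_PER_PAGE = 450
|     PAGES_PER_BOOK = 500
|
|     words = TOKENS * WORDS_PER_TOKEN
|     pages = words / WORDS_PER_PAGE
|     books = pages / PAGES_PER_BOOK
|     print(f"{books:,.0f} books")  # ~50,000,000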
|
| [0] https://help.openai.com/en/articles/4936856-what-are-tokens-...
|
| [1] https://wordcounter.net/words-per-page
|
| [2] https://ai.meta.com/blog/meta-llama-3/
| oersted wrote:
| That's a very good point. I just speak from my experience
| of fine-tuning pre-trained models. At least at that stage
| they can memorize new knowledge, that couldn't have been
| in the training data, just by seeing it once during fine-
| tuning (one epoch), which seems magical. Most
| instruction-tuning datasets are also remarkably small
| (very roughly <100K samples). This is only possible if
| the model has internalized the knowledge quite deeply and
| generally, such that new knowledge is a tiny gradient
| update on top of existing expectations.
|
| But yes I see what you mean, they are dumping practically
| the whole internet at it, it's not unreasonable to think
| that it has memorized a massive proportion of common
| question types the user might come up with, such that
| minimal generalization is needed.
| godelski wrote:
| > that couldn't have been in the training data
|
| I'm curious, how do you know this? I'm not doubting, but
| is it falsifiable?
|
| I also am not going to claim that LLMs only perform
| recall. They fit functions in a continuous manner. Even
| if the data is discrete. So they can do more. The
| question is more about how much more.
|
| Another important point is that out of distribution
| doesn't mean "not in training". This is sometimes
| conflated, but if it were true then that's a test set
| lol. OOD means not belonging to the same distribution.
| Though that's a bit complicated, especially when dealing
| with high dimensional data
| xanderlewis wrote:
| I agree. It is surprising the degree to which they seem
| to be able to generalise, though I'd say in my experience
| the generalisation is very much at the syntax level and
| doesn't really reflect an underlying 'understanding' of
| what's being represented by the text -- just a _very,
| very_ good model of what text that represents reality
| tends to look like.
|
| The commenter below is right that the amount of data
| involved is ridiculously massive, so I don't think human
| intuition is well equipped to have a sense of how much
| these models have seen before.
| godelski wrote:
| > Are you good at answering questions you are not trained
| to answer?
|
| Yes. Most schooling is designed around this.
|
| Pick a random math textbook. Any will do. Read a chapter.
| Then move to the homework problems. The typical fashion
| is that the first few problems are quite similar to the
| examples in the chapter. Often solvable by substitution
| and repetition. Middle problems generally require a bit
| of extrapolation. To connect concepts from previous
| chapters or courses in ways that likely were not
| explicitly discussed. This has many forms and frequently
| includes taking the abstract form to a practical one (i.e. a
| word problem). Challenge problems are those that require
| you to extrapolate the information into new domains.
| Requiring the connection of many ideas and having to
| filter information for what is useful and not.
| > How about a middle school test in a language you don't
| speak?
|
| A language course often makes this explicitly clear. You
| are trained to learn the rules of the language.
| Conjugation is a good example. By learning the structure
| you can hear new words that you've never heard before and
| extract information about them even if not exactly. There's
| a reason you don't just learn vocabulary. It's also
| assumed that by learning vocabulary you'll naturally
| learn rules.
|
| Language is a great example in general. We constantly
| invent new words. It really is not uncommon for someone
| you know to be talking to you and in that discussion
| drop a word they made up on the spot or just make a sound
| or a gesture. An entirely novel thing yet you will likely
| understand. Often this is zero-shot (sometimes it might
| just appear to be zero-shot but actually isn't)
| jhbadger wrote:
| For a while I was into a trivia program on my phone. It
| was kind of easy, so I decided to set the language to
| Catalan, a language which I never studied. I was still
| able to do well, because I could figure out the questions
| more or less from languages I do know and could
| generalize from them. It would be interesting to know if
| you could, say, train an LLM on examples from Romance
| languages but specifically exclude Catalan and see if it
| could do the same.
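|
| A rough sketch of that held-out-language setup (illustrative
| only; the corpus, language codes, and record schema here are
| made up for the example):
|
|     # Hold out Catalan ("ca") from a Romance-language corpus.
|     ROMANCE = {"es", "pt", "fr", "it", "ro", "ca"}
|     HELD_OUT = {"ca"}
|
|     def split_corpus(records):
|         # records: dicts with "lang" and "text" keys (assumed)
|         train, held_out = [], []
|         for r in records:
|             if r["lang"] in HELD_OUT:
|                 held_out.append(r)   # never seen in training
|             elif r["lang"] in ROMANCE:
|                 train.append(r)
|         return train, held_out
|
|     corpus = [
|         {"lang": "es", "text": "¿Cuál es la capital de Francia?"},
|         {"lang": "ca", "text": "Quina és la capital de França?"},
|     ]
|     train, held_out = split_corpus(corpus)
|     print(len(train), len(held_out))  # 1 1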
| visarga wrote:
| > "Intelligence" itself is very ill-defined and we've never
| been able to measure it properly, IQ is rife with issues.
|
| Yes, because it is 1st person exclusively. If you expand a
| bit, consider "search efficiency". It's no longer just 1st
| person, it can be social. And it doesn't hide the search
| space. Intelligence is partially undefined because it doesn't
| specify the problem space, it is left blank. But "search
| efficiency" is more scientific and concrete.
| JohnMakin wrote:
| > IQ is rife with issues
|
| Indeed, and yet people are obsessed with it and the idea
| of measuring their own intelligence - I completely do not
| understand it. I am in an extremely high percentile, but I am
| a total moron in a lot of areas and if you met me would
| likely think so as well. It's a poor predictor for just about
| everything except how good a person is at recognizing
| patterns (I know there are many different kinds of tests, but
| inevitably, it feels like this) and how quickly they can
| reason. But people are _obsessed_ with it (Go on quora and
| search "IQ", you probably won't have to though, since half
| the questions there are seemingly about IQ).
|
| A thing I like to say is you didn't earn your intelligence
| any more than a 7'0" man earned his height - to some degree
| it seems innate (we don't even really know how).
|
| This all said, it seems even _more_ pointless to try to "IQ"
| test an AI in this manner. What does it predict? What is it
| measuring? And you're not going to be able to use the same
| questions for more than 1 test, because the AI will "learn"
| the answers.
| godelski wrote:
| The lowest IQ thing you can do is be obsessed with IQ.
|
| There are known knowns, there are known unknowns, and there
| are unknown unknowns. The wise man knows he cannot know
| what he does not know and that it'd be naive to presume he
| knows when he cannot know how much he doesn't know.
| Therefore, only the unintelligent man really _knows_
| anything.
| dingnuts wrote:
| IQ is compute speed, not storage. It has nothing to do
| with knowledge. IBM used to give one out as part of their
| hiring process, years ago, and I even took it; the entire
| test was a timed multiple-choice exam where every
| question was looking at an object made out of cubes and
| choosing the correct orientation of the object from the
| choices, after the object was arbitrarily rotated
| according to instructions in the question.
|
| Then, IQ can be derived by determining how quickly all
| participants can answer the questionnaire correctly, and
| ranking their speeds, and then normalizing the values so
| 100 is in the middle.
|
| Turns out, scores will fall along a bell curve if you do
| that. You can call that phenomenon whatever, but most
| people call it IQ and hopefully I've explained well why
| that has nothing at all to do with static knowledge in
| this comment.
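|
| A minimal sketch of that rank-then-normalize procedure
| (illustrative only; the times are made up and real IQ norming
| is considerably more involved):
|
|     from statistics import NormalDist
|
|     # Hypothetical completion times in seconds (lower = faster)
|     times = [95, 120, 80, 150, 110, 70, 130, 100]
|     n = len(times)
|
|     # Rank from slowest to fastest, turn ranks into percentiles
|     order = sorted(range(n), key=lambda i: times[i], reverse=True)
|     pct = {idx: (rank + 0.5) / n for rank, idx in enumerate(order)}
|
|     # Map percentiles onto a normal curve with mean 100, SD 15
|     scores = [NormalDist(100, 15).inv_cdf(pct[i]) for i in range(n)]
|     print([round(s) for s in scores])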
| 11101010001100 wrote:
| Speed can be learned though...chess for example.
| godelski wrote:
| > IQ is compute speed, not storage.
|
| Says who? Honestly. I've never seen that claim before.
| Sure, tests are timed but that's a proxy for efficiency
| in extrapolation.
|
| If we define IQ in this way then LLMs far outperform any
| human. I'm pretty confident this would be true of even
| more traditional LMs.
|
| Speed really is just the measurement of recall. I doubt
| we'd call someone intelligent if they memorized the
| multiplication table up to 100x100. Maybe at first but
| when we ask them for 126*8358?
| godelski wrote:
| > "Intelligence" itself is very ill-defined
|
| While this is true, it is well agreed upon (by domain
| experts) that intelligence is distinct from knowledge recall.
| But that's what most of these tests... test.
|
| If you look at IQ tests you'll see that they are attempts to
| test things that aren't knowledge based. You'll also notice
| that the main critiques of IQ tests are about how they often
| actually measure knowledge and that there's bias in natural
| knowledge acquisition. So even the disagreements about the
| definition of intelligence make clear that knowledge and
| intelligence are distinct. I feel that often people conflate
| "intelligence is ill-defined" with "intelligence has no
| definition." These two are not in opposition. Being ill-
| defined is more like "I know I left my phone in the house,
| but I'm not sure where." This is entirely different from "I
| lost my phone, it is somewhere in California" or "It is
| somewhere on Earth" and clearly different from "I lost my
| phone. I'm unsure if I had a phone. What even is a phone?"
| oersted wrote:
| Yes agreed, there is indeed a rough consensus on what
| intelligence is and reasonable ways to approximately
| measure it. These standard tests have been applied to LLMs
| from the beginning; they have not proven to be the most
| helpful for guiding research, but there's value in applying
| benchmarks that have been battle-tested with humans.
|
| It's just that OP was questioning this group's criteria for
| selecting the questions that determine intelligence. Then
| we get into endless discussions of semantics.
|
| At the end of the day, you are just testing which questions
| your AI performs well on, and you can describe how you
| chose those questions. Claiming it measures "general
| intelligence" is just unhelpful and frustrating.
| godelski wrote:
| They were applied in the beginning because we really
| weren't that good at solving the tasks. So like any good
| researchers, we break it down.
|
| But this is like trying to test an elephant but you can't
| get access to an elephant so you instead train a dog. But
| putting a dog in an elephant costume doesn't make it an
| elephant. Sure, dog training will likely mean you can
| learn to train an elephant faster than if you had not first
| trained a dog. Some things transfer, but others don't.
|
| I also want to stress that there is a rough consensus.
| But the ML field (which I'm a part of) often ignores
| this. I'm not sure why. We should be leveraging the work
| of others, not trying to start from scratch (unless
| there's good reason, in which case we must be explicit.
| But I'm just seeing simple claims of "intelligence is
| ill-defined" treated as if that means no definition instead
| of a fuzzy definition. Which gets extra
| weird when people talk about moving goal posts. That's
| how progress works? Especially when exploring into the
| unknown?)
| og_kalu wrote:
| This is true but that's because it's gotten hard to do much
| else. LLMs are eating up everything else that doesn't require
| long horizon planning or multimodality.
|
| If you created a new benchmark today that didn't lean on the
| things I've mentioned or esoteric/super specialized domain
| knowledge (that would actually require some sort of super-human
| performance to ace) like this or Frontier Math, LLMs would
| probably do pretty well.
| WanderPanda wrote:
| I mean it is humanity's LAST exam. Humanity's first exam would
| probably be something about communication? Or about building
| and predicting effects of certain tools?
| golol wrote:
| The things that are missing are what stops us from having
| useful agents so far: Agency, judgement, sense of time, long
| horizon planning, not being gullible. I kinda feel like some
| amount of ego is necessary to get a model to behave like that.
| fooker wrote:
| Put 'em in diverse simulations and see how long they survive.
|
| I can imagine a dystopian world where people are subject to
| this for training and testing AI.
| modeless wrote:
| ARC-AGI is a benchmark with no language that could plausibly be
| solved by primitive humans, assuming only intelligence.
| godelski wrote:
| > seem to be very one dimensional in terms of measuring
| intelligence.
|
| I would argue that they DON'T measure intelligence, rather they
| test knowledge.
|
| Frustratingly, I think we have a society greatly focused on
| knowledge-based testing due to its correlation with
| intelligence and because it is exponentially easier to test
| knowledge. But this is easy to hack. Being in CS it feels very
| odd since we all know a great way to get hired is to study
| leetcode questions. That is, study to the test.
|
| It is critical to recognize this difference because what we know
| for certain is that LLMs and other ML systems are analogous to
| a database with a human language interface[0]. What we DO NOT
| KNOW is whether these systems are intelligent. That is, whether
| they can exploit their knowledge in unfamiliar territories.
| Then there's the whole question of wisdom...
|
| This stuff is highly abstract and we can get fuzzy so it is
| natural to go for the simple thing but we need to graduate.
| Don't avoid the tough questions, dig in. As we advance in any
| study nuance takes over. This should be obvious. If we
| approximate things, to improve we need to tackle higher order
| terms, and that almost always becomes exponentially more
| difficult with each step.
|
| And come on, is this benchmark not obvious bait? Calling it
| "humanity's last exam" is extremely arrogant.
|
| Definitions:
|
| Knowledge: Awareness of facts. The ability to recall
| information.
|
| Intelligence: The ability to exploit knowledge in new settings.
| To be able to plan and reason. (Definitions of intelligence are
| much more debated than those of knowledge, but what is far less
| controversial is that intelligence is about the way one uses
| knowledge. These two are distinct. This is fairly well agreed
| upon throughout history and within modern literature around
| psychology and cognitive science.)
|
| Wisdom: The efficient use of one's knowledge.
|
| https://en.wikipedia.org/wiki/Knowledge
| https://en.wikipedia.org/wiki/Intelligence
| https://en.wikipedia.org/wiki/Wisdom
|
| There is an implicit hierarchy here[1] where knowledge is
| something to be had, intelligence is the utilization of that,
| and wisdom is about efficiency. There's a decent analogy to
| this hierarchy. Knowledge is like having a tool. Intelligence
| is like using it, a craftsman[2]. Wisdom is akin to being a
| master craftsman.
|
| [0] I mean that they fit the data. A database is discrete, but
| these curve fit, so that will be a continuous function (in most
| cases). Thus it won't be exact retrieval nor does this mean
| information can't be interpolated. But that gets to be a deeper
| and much more complex conversation than I think we like to
| admit.
|
| [1] This is clearly multi-dimensional. You can organize
| hierarchies in multiple ways, I'm not suggesting this is the
| only way or "the right way"
|
| [2] What is argued is what is a sufficient threshold. An
| armchair expert might know how to use a lathe because they read
| about its usage but does that mean they can use it? What about
| a novice who you can show something to and they can repeat it?
| Monkey see monkey do style. An apprentice? A craftsman? There's
| a lot of gray area between being able to recall something from
| a book and being a wizard (gray beard).
| tkgally wrote:
| I agree that many aspects of intelligence--and of the lack of
| intelligence--are not being measured by such benchmarks. One
| issue is they are only examining problems that have right
| answers.
|
| One of the most powerful uses of LLMs for me, at least, is
| brainstorming: having them suggest possible avenues for me to
| pursue with specific projects I am working on. If I give Claude
| or ChatGPT or Gemini enough context about my problems, they
| usually come up with useful suggestions--sometimes amazingly
| well. Are they better at that than the smartest human? I don't
| know. How do you quantify the quality of an idea? But those
| ideas often seem really, really good to me.
|
| Another difficult-to-measure capability is interaction. Back-
| and-forth conversations with models don't always go well, but
| when they work they frequently blow me away. But those
| successes are dependent partly on the model, partly on me, and
| partly on how the conversation happens to unfold. Again, that
| success or failure doesn't seem measurable with benchmarks that
| require objectively right answers.
| taeric wrote:
| I'm curious why you are confident they would be more
| intelligent than a modern toddler?
|
| I largely empathize with your point. But, as I can recognize
| there are some out there far better at problem solving than I
| am, I am growing ok with the idea that intelligence can be
| measured. Not to a single number, most likely, but to a variety
| of different aspects.
|
| Similarly, I'd imagine that a human from 2000 years ago is
| probably more hardy than one from the modern age. If only
| because of selection effects at play.
|
| Obviously, you can't extrapolate a straight line between either
| measurement and expect it to continue in either direction. But
| I don't know why you couldn't build up a measurement for it?
|
| (And it should go without saying that you shouldn't be judging
| a person's worth using this sort of measurement.)
| ianburrell wrote:
| Adults from 2000 years ago would absolutely be smarter than
| toddlers. Adults back then watched and out-thought their
| toddlers. Do you think toddlers now are much smarter?
| Especially when toddlers haven't been educated yet.
|
| Remember that 2000 years ago is 24AD, the middle of the Roman
| empire and Han dynasty which covered half of the world
| population. Nobles would be literate and well educated,
| artisans and soldiers would be skilled, and I bet there were
| lots of smart peasants that got ignored.
|
| They wouldn't do well on intelligence tests because they're not
| used to them, but that is more about the tests than their
| intelligence.
| I'm sure that the average intelligence is lower than now from
| lack of education and malnutrition. Smart ones would still be
| smart. Also, I bet people from now would do poorly in their
| environment.
| munchbunny wrote:
| I think the concept you're dancing around the edges of is the
| question of what parts of "intelligence" are driven by:
|
| 1. Language and how interrelated it is to our ability to
| transfer knowledge and experience, as well as its role in
| structuring our internal thinking. I haven't seen any academic
| research on the matter, but there are more and less concrete
| instances of this throughout history. This Wikipedia article
| about the history of Algebra is a great example of how 2000
| years of evolution led to a formulation of the same concepts,
| but with a reduced cognitive load that 10-12 year olds learn
| today as a matter of course
| (https://en.wikipedia.org/wiki/History_of_algebra#Stages_of_a...).
|
| 2. Knowledge, transferred through language, education, and
| culture. Calculus in the 1600s is a great example: without it
| and subsequent developments, probably 80% of the
| college/post-grad math/science/physics education wouldn't even
| exist. The stuff we teach our 18 year olds today required the
| 1600s' greatest minds to figure out.
|
| 3. The capacity of our human wetware.
|
| It's hard to treat #3 in isolation because our modern concept
| of intelligence is inextricably tied to #1 and #2. Also it's
| hard to place where "critical thinking" and "creativity" enter
| the picture, since they both rely heavily on all three aspects
| above.
| fakedang wrote:
| So Deepseek gives out the correct answer the highest percentage
| of all SOTA models, yet is the least confident of all models?
| myrmidon wrote:
| There is no text-only evaluation of the other models, though.
| The comparison might be completely invalid.
| og_kalu wrote:
| There is actually. It's a bit buried. Section C.2 of the
| paper (page 24).
|
| R1 is still the best. o1 drops a little (8.9)
| sottol wrote:
| I think it might mean the opposite of what one would expect.
| Afaict, calibration error means something along the lines of
| "how often was the model wrong but confident that the answer
| was correct".
|
| That means a low calibration error would be a good thing, ie
| the model correctly recognizes when it is unsure about answers
| instead of confidently stating the wrong answer.
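|
| A rough sketch of how a binned calibration error can be
| computed, assuming each answer comes with a stated confidence
| (a generic recipe, not necessarily the paper's exact formula):
|
|     # (model's stated confidence, whether the answer was correct)
|     results = [(0.9, True), (0.8, False), (0.95, True),
|                (0.7, False), (0.6, True), (0.99, False),
|                (0.5, False), (0.85, True)]
|
|     def calibration_error(results, n_bins=5):
|         # Average |confidence - accuracy| over confidence bins
|         bins = [[] for _ in range(n_bins)]
|         for conf, correct in results:
|             idx = min(int(conf * n_bins), n_bins - 1)
|             bins[idx].append((conf, correct))
|         err = 0.0
|         for b in bins:
|             if not b:
|                 continue
|             avg_conf = sum(c for c, _ in b) / len(b)
|             accuracy = sum(ok for _, ok in b) / len(b)
|             gap = abs(avg_conf - accuracy)
|             err += (len(b) / len(results)) * gap
|         return err
|
|     print(round(calibration_error(results), 3))  # lower = better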
| m3kw9 wrote:
| Looks more like first exam
| nottorp wrote:
| So who told all these "AI" companies that it's a good idea to
| market your product as the one who will bring the end of homo
| sapiens fastest?
| bananapub wrote:
| seems to be working fine, people seem to care what Sam Altman
| says and Elon Musk is making himself deputy emperor of a
| nuclear weapons state. pretty fucking dire indictment of the
| rest of us and what we let the world come to.
| tivert wrote:
| > seems to be working fine, people seem to care what Sam
| Altman says and Elon Musk is making himself deputy emperor of
| a nuclear weapons state.
|
| For the billionaires and chatterers.
|
| But even for non-chatterers, you probably should pay
| attention to what Altman says, not so much in a gullible
| take-it-at-face-value sense, but in a kremlinologist
| look-carefully-for-_hints_-about-what's-really-going-on sense.
|
| > pretty fucking dire indictment of the rest of us and what
| we let the world come to.
|
| What are the rest of us to do? Pretty much _everyone_ has
| been trained by society to follow the rules above all else as
| a strong moral imperative, no matter how stupid or how bad
| the collective outcome may be. If you do otherwise, you
| _will_ get smacked _hard_; and if you try to organize, you
| will all get smacked _harder_.
| xnx wrote:
| Interesting marketing for Scale AI. I'd be surprised if any
| foundation models started benchmarking against this.
|
| Captchas seem like the more interesting test. As long as there
| are captchas that average people can solve, but computers can't,
| we will still have a long way to go toward artificial
| intelligence.
| sebzim4500 wrote:
| I don't think this is necessarily true. I can imagine a future in
| which we have robots that can do 99% of human jobs but there's
| one thing they are strangely bad at, some otherwise unimportant
| skill that can be used as a captcha.
| elicksaur wrote:
| XKCD #927 vibes. https://xkcd.com/927/
|
| Prediction: Just like how ARC wasn't actually a measure of AGI,
| this too will get "solved" without AI being useful enough to gain
| mass adoption.
| GaggiX wrote:
| Haven't they already achieved mass adoption? And I'm talking
| about LLMs in particular, because AIs in general like the ones
| used by Instagram filters and the TikTok recommendation
| algorithm are already used by billions.
| sebzim4500 wrote:
| I don't think that's really relevant, because there is an
| actual need for a new benchmark given how the existing ones
| either keep getting saturated or are probably out of reach for
| the next generation of models.
|
| The closest existing thing is the frontierAI benchmark but
| that's just maths whereas this is more diverse.
| og_kalu wrote:
| chatgpt is like the 8th most visited site worldwide 2 years
| after release. It already has mass adoption lol. This is about
| more than that.
| renjimen wrote:
| I don't know about groundbreaking. It's just more academic
| questions. We already have a lot of those benchmarks, this is
| just a bit harder, but at this point these models are so
| glaringly bad at so many other areas APART from academic
| questions. Benchmarks for spatial reasoning or theory of mind are
| more interesting now, for example. These kinds of understanding
| are far more important if we expect to integrate AI into our
| everyday lives. I suspect even our most distant primate cousins
| could outperform multi-modal models on these kinds of tests.
| jfengel wrote:
| It does feel a bit like the early days of AI:
|
| "We want to make computers do what smart people do. What do
| smart people do? They play chess! Once we've solved that,
| everything else will be easier."
|
| It has been remarkable how much of the "easier" stuff they've
| made progress on -- like natural language and images. But after
| a huge quantum improvement, they don't seem very good at
| adapting to a lot of the things we really need them for.
| renjimen wrote:
| Exactly!
|
| Whatever world model LLMs have is like this crippled view
| through the lens of the internet. They are really like
| savants.
|
| It's annoying the AI companies are still touting their
| performance on all these metrics for domain knowledge in
| white collar jobs, but in truth they will fail in all but the
| most narrow application in those domains because they can't
| understand basic human behaviour.
| dccsillag wrote:
| Can we please rename this submission? This is excessively
| grandiose, way over the top......
| m_ke wrote:
| The only reliable final test will be a black box test suite that
| takes your model, executes it in a sealed environment and gives
| you a grade back, potentially with a performance break down by
| subject.
|
| No telling companies what the questions look like, what the
| output format is, what topics are covered, so that there's no
| room to make up synthetic data to interpolate from.
| sebzim4500 wrote:
| The name is obviously a bit stupid, but based on the sample
| questions I think they did a good job of creating a harder
| version of the existing academic question benchmarks.
|
| The questions are possible for a smart person familiar with the
| subject but still just beyond SOTA models.
|
| My guess is that within the next few years we will have models
| that can ace this test but are still bizarrely bad at things we
| find easy.
| pavel_lishin wrote:
| > _Hummingbirds within Apodiformes uniquely have a bilaterally
| paired oval bone, a sesamoid embedded in the caudolateral portion
| of the expanded, cruciate aponeurosis of insertion of m.
| depressor caudae. How many paired tendons are supported by this
| sesamoid bone? Answer with a number._
|
| I wonder how many questions give a gentle _nudge_ towards the
| answer like this. How many answers would have been wildly off the
| mark without specifying what the answer needs to look like?
| zeroonetwothree wrote:
| Good point. I wouldn't expect a human to need the last
| sentence.
| salynchnew wrote:
| The generous hypothesis, here, is that this is so they can
| automate the benchmarking itself. If that is true, then this
| is likely a result of the test authors being too clever for
| their own good and over-optimizing. If an LLM can't figure
| out on its own that "how many" is asking for a number, it
| has failed at a much more basic level.
|
| You should be able to easily accept answers like "four" and
| "4" as equivalent, for example. I doubt there will be that
| many frontier models running against this test at any time,
| and a simple glance at the answers from any human should be
| enough to catch edge cases like this one.
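|
| For what it's worth, a tiny normalization pass along those
| lines (purely illustrative):
|
|     WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3",
|              "four": "4", "five": "5", "six": "6", "seven": "7",
|              "eight": "8", "nine": "9", "ten": "10"}
|
|     def normalize(answer: str) -> str:
|         a = answer.strip().lower().rstrip(".")
|         return WORDS.get(a, a)
|
|     print(normalize("Four") == normalize("4"))  # True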
| sdwr wrote:
| Isn't this a terrible question to measure intelligence? It
| looks like it's testing niche domain knowledge along the lines
| of:
|
| > What color is the ball hidden behind the flowerpot in my
| neighbor's backyard?
|
| Maybe you can reason towards the answer if you only have a deep
| knowledge of bird anatomy and not Apodiformes anatomy, and
| that's the intelligence part?
| disambiguation wrote:
| I haven't been following up to the minute details of ai progress,
| training, and benchmarking - beyond a daily dose of HN articles.
|
| But the trend seems to be: today's benchmark becomes tomorrow's
| training data.
| bwfan123 wrote:
| please don't self-proclaim "groundbreaking" or "novel" or
| "innovative" - it diminishes your contribution since it clearly
| is an attention-grab.
| dang wrote:
| I briefly merged this thread into
| https://news.ycombinator.com/item?id=42804853, but actually the
| current article has more context, so probably we should keep this
| as the top link and then people can look at https://lastexam.ai
| also.
| dang wrote:
| The project site is https://lastexam.ai. Readers may want to look
| at both.
| jbenoit wrote:
| They started collecting problems last fall, saying the top 550
| submissions sent in by Nov 1st would get rewarded, to the tune of
| $500-$5000 each.
|
| Near the deadline, I counted the total number of submissions, and
| realized that each question I wrote had an expected value of
| hundreds of dollars, which is a great use of my time. So I wrote
| a good number, using the knowledge gained in my CS Ph. D.
|
| Then, as the Nov 1st deadline rolled around, they announced they
| extended the deadline to Nov 15th. Then Nov 15th came, and it
| said on their website they were still accepting submissions.
|
| Most of my submissions are being included in the benchmark, but
| I'm only getting paid $500, for one of them (the one I thought
| was most standard and least difficult, funnily enough). Had they
| closed submissions when they said they would, it seems likely I'd
| be paid for a few more.
|
| From my perspective, they basically conned hundreds of Ph. D.'s
| around the world to write questions for much less reward than
| promised. My close friend wrote a large number of questions for
| them, is getting paid thousands of dollars, and still feels
| defrauded.
|
| I'm not sure what they're doing in the end. It sounds like
| they're mostly just paying people who submitted before Nov 1st
| with a few exceptions, but either way they lied. There was no
| indication that people who submitted later would not get paid,
| and there was no indication that the deadline would be extended.
| Either they pay people who submitted after Nov 1st, meaning they
| lied to the people who submitted before about their expected
| reward. Or they don't, meaning they majorly lied to the people
| who submitted after. Either way, it's clear grounds for a class
| action lawsuit, and I hope one gets running.
| vkou wrote:
| You shouldn't engage in a CAL; a regular lawsuit from anyone
| wronged will be cheaper and way more painful for them.
|
| If you're in the US, consider small claims court. It's a small
| sum of money, you won't need to pay a lawyer, they'll probably
| not even show up.
| jbenoit wrote:
| Hmmm. I can see how it would be more painful for them to
| fight, but most people were conned out of <$200, and it's rather
| self-sacrificing to fight for that. Plus, no-one wants a
| reputation as litigious, but starting a CAL is less conducive
| to creating that reputation.
|
| I only submitted before Nov 1st, so I'm not sure to what
| extent I was personally conned.
| smandelbrot wrote:
| I think it'd be illuminating to see some overview stats on
| the submission dates and authors of all questions, accepted
| and not. Is something like this available somewhere?
| baobabKoodaa wrote:
| Scale AI's whole business model is wage theft. I don't mean to
| be insensitive, but out of all the Scale AI experiences I've
| heard about, yours is the least egregious. It's a dystopian,
| shitty company.
| levocardia wrote:
| I was similarly conned by Scale AI -- promised a significant
| bonus for some tasks, then rejected and not paid at all. Bet
| they kept my task text anyways.
|
| It's a classic scam: make a job post for freelancers, ask for
| a "work sample" or "take-home project," then have a few dozen
| applicants do the actual task you need them to do as their
| sample, then reject everybody.
| mrandish wrote:
| Assessing AI's progress toward replicating the full breadth and
| depth of human intelligence is a deceptively hard problem. A
| paper by Francois Chollet, who was until recently a researcher at
| Google, called "On the Measure of Intelligence" is the best
| overview of the challenges I've read. Highly recommended.
|
| https://arxiv.org/abs/1911.01547
| UncleOxidant wrote:
| Interesting that DeepSeek R1 which supposedly cost only $5.5M to
| train currently has the top score at 9.4%
| kaonwarb wrote:
| Quite the name! Looking forward to "Humanity's Last Exam
| v2.final.FINAL2..." coming next
| EncomLab wrote:
| I am reminded of the study that showed an AI trained on tumor
| identification was heavily biased toward indicating a tumor was
| cancerous if it was circled in purple ink or a visual scale was
| included in the image - as the cancerous tumors in its training
| set shared those traits while images of benign tumors did not.
|
| These systems do not possess some sort of "woo" that gives them
| magical powers when running LLM code which they would lose if they
| ran a spreadsheet. Whatever attributions of intelligence are given
| have far more to do with our human willingness to anthropomorphize
| than with a hidden ghost in the machine.
___________________________________________________________________
(page generated 2025-01-23 23:01 UTC)