[HN Gopher] Results of "Humanity's Last Exam" benchmark published
       ___________________________________________________________________
        
       Results of "Humanity's Last Exam" benchmark published
        
       Author : tzury
       Score  : 53 points
       Date   : 2025-01-23 17:44 UTC (5 hours ago)
        
 (HTM) web link (scale.com)
 (TXT) w3m dump (scale.com)
        
       | chvid wrote:
       | So current AI can do less than 10% of these. But it probably
       | won't be more than a few days until models start being trained on
        | these, rendering the indicator invalid.
        
       | LPisGood wrote:
       | The 8 sample questions available here are interesting:
       | 
       | https://lastexam.ai/
       | 
       | I might be able to answer 2 of them with great effort (maybe!),
        | and I would be highly surprised if any human alive could answer 5
        | or more without seeing the problems in advance.
        
         | sebzim4500 wrote:
         | I can answer 2 of them quite quickly with pen and paper
          | (compsci, physics), and one for which I had to look up some
          | definitions on Wikipedia (maths), so I am certain there are
          | people who can do more than 5.
         | 
         | The computer science one seems weirdly easy compared to the
         | rest, it's multiple choice and it is very easy to get it by
         | process of elimination even if you don't understand how to
         | actually do the problem.
        
           | LPisGood wrote:
           | Yes, many can answer the compsci and physics problems. The
           | math problem is abstract and more difficult, but solving
           | those 3 and 2 others seems nearly superhuman.
        
       | zamalek wrote:
       | I assume that the questions (and answers) aren't published
       | anywhere? Else it would be "Humanity's Last Exam before the
       | previous crawl".
        
         | nico1207 wrote:
          | You can just view the dataset on Hugging Face.
        
       | GaggiX wrote:
       | It really shows how good Deepseek R1 is (even though it was
       | evaluated only on text-only questions).
       | 
       | The results are shown here: https://lastexam.ai/
       | 
       | EDIT: the text-only evaluation of the models shown in the paper
       | gives o1 an accuracy of 8.9%, so Deepseek R1 is even better than
       | I thought.
        
         | myrmidon wrote:
         | Is there a text-only evaluation of the non-Deepseek models?
         | Because being evaluated on text-only might have helped the
         | other models immensely as well from what I can tell?
        
           | GaggiX wrote:
           | >Is there a text-only evaluation of the non-Deepseek models?
           | 
            | Not that I can see, but it would be cool to have; maybe the
            | paper will have a more complete evaluation.
        
             | og_kalu wrote:
             | Section C.2 of the paper (pg 24) has text only evaluations
             | of other models.
        
               | GaggiX wrote:
               | Oh I see, the paper is out, I read "(arXiv coming soon)"
                | and thought it wasn't released yet.
        
       | next_xibalba wrote:
       | These types of exams, and most benchmarks to date, seem to be
       | very one dimensional in terms of measuring intelligence. For
       | instance, if we transported a human from 2,000 years ago to
       | present day and asked him to take this exam, he would likely get
       | 0%, given that he couldn't read or write, let alone comprehend
       | the concepts and context required to solve these questions. But,
       | that man would still undoubtedly be far more intelligent than an
       | ape on all dimensions. He would likely be more intelligent than a
       | toddler on many dimensions. He might even be more intelligent
        | than some high school students on a few dimensions. I can't
        | exactly articulate "what" is missing or how to measure it, but I
        | can intuit that some things are missing from these benchmarks.
        
         | oersted wrote:
         | "Intelligence" itself is very ill-defined and we've never been
         | able to measure it properly, IQ is rife with issues.
         | 
         | At some point, you just have to be pragmatic and measure the
         | questions you want the AI to be good at answering, rather than
         | trying to measure intelligence in general.
         | 
          | In that sense, I see this as one more benchmark that collects
          | questions we want/expect AI to be good at, that it is not yet
          | good at, and that have been underrepresented in previous
          | benchmarks. That's obviously valuable; there's nothing "magical"
          | about it. Although it is reasonable to be annoyed at the
          | "Humanity's Last Exam" naming: of course they must have missed
          | plenty of edge cases like everyone else, and it is very arrogant
          | to claim it will be the "Last" one.
        
           | esotericimpl wrote:
           | This is always the answer for anyone who thinks LLMs are
           | capable of "intelligence".
           | 
            | It's good at answering questions that it's trained on; I would
            | suggest general intelligence is answering things you didn't
            | want/train the AI to be good at answering.
        
             | fooker wrote:
             | Are you good at answering questions you are not trained to
             | answer?
             | 
             | How about a middle school test in a language you don't
             | speak?
        
               | xanderlewis wrote:
               | Yes -- reasonably so, anyway. I don't have to have seen
               | millions of prior examples of exactly the same kind in
               | order to tackle a novel problem in mathematics, say.
        
               | oersted wrote:
               | Well, LLMs are also remarkably good at generalizing. Look
               | at the datasets, they don't literally train on every
               | conceivable type of question the user might ask, the LLM
               | can adapt just as you can.
               | 
               | The actual challenge towards general intelligence is that
               | LLMs struggle with certain types of questions even if you
               | *do* train it on millions of examples of that type of
               | question. Mostly questions that require complex logical
                | reasoning, although consistent progress is being made in
               | this direction.
        
               | godelski wrote:
               | > Well, LLMs are also remarkably good at generalizing.
               | Look at the datasets, they don't literally train on every
               | conceivable type of question the user might ask, the LLM
               | can adapt just as you can.
               | 
               | Proof needed.
               | 
               | I'm serious. We don't have the datasets. But we do know
               | the size of the datasets. And the sizes suggest
               | incredible amounts of information.
               | 
               | Take an estimate of 100 tokens ~= 75 words[0]. What is a
               | trillion tokens? Well, that's 750bn words. There are
               | approximately 450 words on a page[1]. So that's 1.66...
               | bn pages! If we put that in 500 page books, that would
               | come out to 3.33... million books!
               | 
               | Llama 3 has a pretraining size of 15T tokens[2] (this
               | does not include training, so more info added later). So
               | that comes to ~50m books. Then, keep in mind that this
               | data is filtered and deduplicated. Even considering a
                | high failure rate in deduplication, this is an unimaginable
               | amount of information.
               | 
               | [0] https://help.openai.com/en/articles/4936856-what-are-
               | tokens-...
               | 
               | [1] https://wordcounter.net/words-per-page
               | 
               | [2] https://ai.meta.com/blog/meta-llama-3/
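                | 
                | (A quick back-of-the-envelope sketch of the arithmetic
                | above, in Python; the token-to-word and words-per-page
                | ratios are just the rough estimates from [0] and [1],
                | not anything from the Llama 3 report itself.)
                | 
                |     # Rough scale of pretraining data, using the estimates above.
                |     WORDS_PER_TOKEN = 75 / 100  # ~100 tokens ~= 75 words [0]
                |     WORDS_PER_PAGE = 450        # [1]
                |     PAGES_PER_BOOK = 500
                | 
                |     def tokens_to_books(tokens):
                |         words = tokens * WORDS_PER_TOKEN
                |         pages = words / WORDS_PER_PAGE
                |         return pages / PAGES_PER_BOOK
                | 
                |     print(tokens_to_books(1e12))   # ~3.3 million books per 1T tokens
                |     print(tokens_to_books(15e12))  # ~50 million books for 15T tokens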
        
               | oersted wrote:
               | That's a very good point. I just speak from my experience
               | of fine-tuning pre-trained models. At least at that stage
               | they can memorize new knowledge, that couldn't have been
               | in the training data, just by seeing it once during fine-
               | tuning (one epoch), which seems magical. Most
               | instruction-tuning datasets are also remarkably small
               | (very roughly <100K samples). This is only possible if
               | the model has internalized the knowledge quite deeply and
               | generally, such that new knowledge is a tiny gradient
               | update on top of existing expectations.
               | 
               | But yes I see what you mean, they are dumping practically
               | the whole internet at it, it's not unreasonable to think
               | that it has memorized a massive proportion of common
               | question types the user might come up with, such that
               | minimal generalization is needed.
        
               | godelski wrote:
               | > that couldn't have been in the training data
               | 
               | I'm curious, how do you know this? I'm not doubting, but
               | is it falsifiable?
               | 
               | I also am not going to claim that LLMs only perform
               | recall. They fit functions in a continuous manner. Even
               | if the data is discrete. So they can do more. The
               | question is more about how much more.
               | 
               | Another important point is that out of distribution
               | doesn't mean "not in training". This is sometimes
               | conflated, but if it were true then that's a test set
               | lol. OOD means not belonging to the same distribution.
               | Though that's a bit complicated, especially when dealing
               | with high dimensional data
        
               | xanderlewis wrote:
               | I agree. It is surprising the degree to which they seem
               | to be able to generalise, though I'd say in my experience
               | the generalisation is very much at the syntax level and
               | doesn't really reflect an underlying 'understanding' of
               | what's being represented by the text -- just a _very,
               | very_ good model of what text that represents reality
               | tends to look like.
               | 
               | The commenter below is right that the amount of data
               | involved is ridiculously massive, so I don't think human
               | intuition is well equipped to have a sense of how much
               | these models have seen before.
        
               | godelski wrote:
               | > Are you good at answering questions you are not trained
               | to answer?
               | 
               | Yes. Most schooling is designed around this.
               | 
               | Pick a random math textbook. Any will do. Read a chapter.
               | Then move to the homework problems. The typical fashion
               | is that the first few problems are quite similar to the
               | examples in the chapter. Often solvable by substitution
               | and repetition. Middle problems generally require a bit
               | of extrapolation. To connect concepts from previous
               | chapters or courses in ways that likely were not
               | explicitly discussed. This has many forms and frequently
               | includes taking the abstract form to practical (i.e. a
               | word problem). Challenge problems are those that require
               | you to extrapolate the information into new domains.
               | Requiring the connection of many ideas and having to
                | filter information for what is useful and not.
                | 
                | > How about a middle school test in a language you don't
                | speak?
               | 
               | A language course often makes this explicitly clear. You
               | are trained to learn the rules of the language.
               | Conjugation is a good example. By learning the structure
               | you can hear new words that you've never heard before and
               | extract information about it even if not exactly. There's
               | a reason you don't just learn vocabulary. It's also
               | assumed that by learning vocabulary you'll naturally
               | learn rules.
               | 
               | Language is a great example in general. We constantly
               | invent new words. It really is not uncommon for someone
                | you know to be talking to you and in that discussion
               | drop a word they made up on the spot or just make a sound
               | or a gesture. An entirely novel thing yet you will likely
               | understand. Often this is zero-shot (sometimes it might
               | just appear to be zero-shot but actually isn't)
        
               | jhbadger wrote:
               | For a while I was into a trivia program on my phone. It
               | was kind of easy, so I decided to set the language to
               | Catalan, a language which I never studied. I was still
               | able to do well, because I could figure out the questions
               | more or less from languages I do know and could
                | generalize from them. It would be interesting to know if
                | you could, say, train an LLM on examples from Romance
                | languages but specifically exclude Catalan, and see if it
                | could do the same.
        
           | visarga wrote:
           | > "Intelligence" itself is very ill-defined and we've never
           | been able to measure it properly, IQ is rife with issues.
           | 
           | Yes, because it is 1st person exclusively. If you expand a
           | bit, consider "search efficiency". It's no longer just 1st
           | person, it can be social. And it doesn't hide the search
           | space. Intelligence is partially undefined because it doesn't
           | specify the problem space, it is left blank. But "search
           | efficiency" is more scientific and concrete.
        
           | JohnMakin wrote:
           | > IQ is rife with issues
           | 
            | Indeed, and yet people are obsessed with it and the idea
           | of measuring their own intelligence - I completely do not
           | understand it. I am in an extremely high percentile, but I am
           | a total moron in a lot of areas and if you met me would
           | likely think so as well. It's a poor predictor for just about
           | everything except how good a person is at recognizing
           | patterns (I know there are many different kinds of tests, but
           | inevitably, it feels like this) and how quickly they can
           | reason. But people are _obsessed_ with it (Go on quora and
           | search  "IQ", you probably won't half to though, since half
           | the questions there are seemingly about IQ).
           | 
           | A thing I like to say is you didn't earn your intelligence
           | any more than a 7'0" man earned his height - to some degree
           | it seems innate (we don't even really know how).
           | 
           | This all said, it seems even _more_ pointless to try to  "IQ"
           | test an AI in this manner. What does it predict? What is it
           | measuring? And you're not going to be able to use the same
           | questions for more than 1 test, because the AI will "learn"
           | the answers.
        
             | godelski wrote:
             | The lowest IQ thing you can do is be obsessed with IQ.
             | 
             | There are known knowns, there are known unknowns, and there
             | are unknown unknowns. The wise man knows he cannot know
             | what he does not know and that it'd be naive to presume he
             | knows when he cannot know how much he doesn't know.
             | Therefore, only the unintelligent man really _knows_
             | anything.
        
               | dingnuts wrote:
               | IQ is compute speed, not storage. It has nothing to do
               | with knowledge. IBM used to give one out as part of their
               | hiring process, years ago, and even I took it the entire
               | test was a timed multiple choice exam where every
               | question was looking at an object made out of cubes and
               | choosing the correct orientation of the object from the
               | choices, after the object was arbitrarily rotated
               | according to instructions in the question.
               | 
               | Then, IQ can be derived by determining how quickly all
               | participants can answer the questionnaire correctly, and
               | ranking their speeds, and then normalizing the values so
               | 100 is in the middle.
               | 
               | Turns out, scores will fall along a bell curve if you do
               | that. You can call that phenomenon whatever, but most
               | people call it IQ and hopefully I've explained well why
               | that has nothing at all to do with static knowledge in
               | this comment.
        
               | 11101010001100 wrote:
               | Speed can be learned though...chess for example.
        
               | godelski wrote:
               | > IQ is compute speed, not storage.
               | 
               | Says who? Honestly. I've never seen that claim before.
               | Sure, tests are timed but that's a proxy for efficiency
               | in extrapolation.
               | 
               | If we define IQ in this way then LLMs far outperform any
               | human. I'm pretty confident this would be true of even
               | more traditional LMs.
               | 
                | Speed really is just the measurement of recall. I doubt
               | we'd call someone intelligent if they memorized the
               | multiplication table up to 100x100. Maybe at first but
               | when we ask them for 126*8358?
        
           | godelski wrote:
           | > "Intelligence" itself is very ill-defined
           | 
           | While this is true, it is well agreed upon (by domain
           | experts) that intelligence is distinct from knowledge recall.
           | But that's what most of these tests... test.
           | 
           | If you look at IQ tests you'll see that they are attempts to
           | test things that aren't knowledge based. You'll also notice
           | that the main critiques of IQ tests are about how they often
           | actually measure knowledge and that there's bias in natural
           | knowledge acquisition. So even the disagreements about the
           | definition of intelligence make clear that knowledge and
           | intelligence are distinct. I feel that often people conflate
           | "intelligence is ill-defined" with "intelligence has no
           | definition." These two are not in opposition. Being ill-
           | defined is more like "I know I left my phone in the house,
           | but I'm not sure where." This is entirely different from "I
           | lost my phone, it is somewhere in California" or "It is
           | somewhere on Earth" and clearly different from "I lost my
           | phone. I'm unsure if I had a phone. What even is a phone?"
        
             | oersted wrote:
             | Yes agreed, there is indeed a rough consensus on what
             | intelligence is and reasonable ways to approximately
             | measure it. These standard tests have been applied to LLMs
             | from the beginning, they have not proven to be the most
             | helpful to guide research, but there's value to applying
             | benchmarks that have been battle-tested with humans.
             | 
             | It's just that OP was questioning this group's criteria for
             | selecting the questions that determine intelligence. Then
             | we get into endless discussions of semantics.
             | 
             | At the end of the day, you are just testing which questions
             | your AI performs well on, and you can describe how you
             | chose those questions. Claiming it measures "general
             | intelligence" is just unhelpful and frustrating.
        
               | godelski wrote:
               | They were applied in the beginning because we really
               | weren't that good at solving the tasks. So like any good
               | researchers, we break it down.
               | 
               | But this is like trying to test an elephant but you can't
               | get access to an elephant so you instead train a dog. But
               | putting a dog in an elephant costume doesn't make it an
               | elephant. Sure, dog training will likely mean you can
                | learn to train an elephant faster than had you not first
               | trained a dog. Some things transfer, but others don't
               | 
               | I also want to stress that there is a rough consensus.
               | But the ML field (which I'm a part of) often ignores
               | this. I'm not sure why. We should be leveraging the work
               | of others, not trying to start from scratch (unless
               | there's good reason, in which case we must be explicit.
               | But I'm just seeing simple claims of "intelligence is
               | ill-defined" and treating that as if that means no
               | definition instead of fuzzy definition. Which gets extra
               | weird when people talk about moving goal posts. That's
               | how progress works? Especially when exploring into the
               | unknown?)
        
         | og_kalu wrote:
         | This is true but that's because it's gotten hard to do much
          | else. LLMs are eating up everything else that doesn't require
         | long horizon planning or multimodality.
         | 
         | If you created a new benchmark today that didn't lean on the
         | things I've mentioned or esoteric/super specialized domain
         | knowledge (that would actually require some sort of super-human
         | performance to ace) like this or Frontier Math, LLMs would
         | probably do pretty well.
        
         | WanderPanda wrote:
         | I mean it is humanity's LAST exam. Humanity's first exam would
         | probably be something about communication? Or about building
         | and predicting effects of certain tools?
        
         | golol wrote:
         | The things that are missing are what stops us from having
         | useful agents so far: Agency, judgement, sense of time, long
         | horizon planning, not being gullible. I kinda feel like some
         | amount of ego is necessary to get a model to behave like that.
        
         | fooker wrote:
         | Put 'em in diverse simulations and see how long they survive.
         | 
         | I can imagine a dystopian world where people are subject to
         | this for training and testing AI.
        
         | modeless wrote:
         | ARC-AGI is a benchmark with no language that could plausibly be
         | solved by primitive humans, assuming only intelligence.
        
         | godelski wrote:
         | > seem to be very one dimensional in terms of measuring
         | intelligence.
         | 
         | I would argue that they DON'T measure intelligence, rather they
         | test knowledge.
         | 
         | Frustratingly, I think we have a society greatly focused on
          | knowledge-based testing due to its correlation with
          | intelligence and because it is exponentially easier to test
         | knowledge. But this is easy to hack. Being in CS it feels very
         | odd since we all know a great way to get hired is to study
         | leetcode questions. That is, study to the test.
         | 
          | It is critical to recognize this difference, as what we know
          | for certain is that LLMs and other ML systems are analogous to
          | a database with a human language interface[0]. What we DO NOT
          | KNOW is if these systems are intelligent. That is, whether they
          | can exploit their knowledge in unfamiliar territories. Then
          | there's the whole question of wisdom...
         | 
         | This stuff is highly abstract and we can get fuzzy so it is
         | natural to go for the simple thing but we need to graduate.
         | Don't avoid the tough questions, dig in. As we advance in any
         | study nuance takes over. This should be obvious. If we
         | approximate things, to improve we need to tackle higher order
         | terms, and that almost always becomes exponentially more
         | difficult with each step.
         | 
         | And come on, is this benchmark not obvious bait? Calling it
         | "humanity's last exam" is extremely arrogant.
         | 
          | Definitions:
          | 
          |     Knowledge: Awareness of facts. The ability to recall
          |     information.
          | 
          |     Intelligence: The ability to exploit knowledge in new
          |     settings. To be able to plan and reason. (Definitions of
          |     intelligence are much more debated than knowledge, but what
          |     is far less controversial is that intelligence is about the
          |     way one uses knowledge. These two are distinct. This is
          |     fairly well agreed upon throughout history and within
          |     modern literature around psychology and cognitive science.)
          | 
          |     Wisdom: The efficient use of one's knowledge.
          | 
          | https://en.wikipedia.org/wiki/Knowledge
          | https://en.wikipedia.org/wiki/Intelligence
          | https://en.wikipedia.org/wiki/Wisdom
         | 
          | There is an implicit hierarchy here[1] where knowledge is
         | something to be had, intelligence is the utilization of that,
         | and wisdom is about efficiency. There's a decent analogy to
         | this hierarchy. Knowledge is like having a tool. Intelligence
         | is like using it, a craftsman[2]. Wisdom is akin to being a
         | master craftsman.
         | 
         | [0] I mean that they fit the data. A database is discrete, but
         | these curve fit, so that will be a continuous function (in most
         | cases). Thus it won't be exact retrieval nor does this mean
         | information can't be interpolated. But that gets to be a deeper
         | and much more complex conversation that I think we like to
         | admit.
         | 
         | [1] This is clearly multi-dimensional. You can organize
         | hierarchies in multiple ways, I'm not suggesting this is the
         | only way or "the right way"
         | 
         | [2] What is argued is what is a sufficient threshold. An
         | armchair expert might know how to use a lathe because they read
         | about its usage but does that mean they can use it? What about
         | a novice who you can show something to and they can repeat it?
         | Monkey see monkey do style. An apprentice? A craftsman? There's
         | a lot of gray area between being able to recall something from
         | a book and being a wizard (gray beard).
        
         | tkgally wrote:
         | I agree that many aspects of intelligence--and of the lack of
         | intelligence--are not being measured by such benchmarks. One
         | issue is they are only examining problems that have right
         | answers.
         | 
         | One of the most powerful uses of LLMs for me, at least, is
         | brainstorming: having them suggest possible avenues for me to
         | pursue with specific projects I am working on. If I give Claude
         | or ChatGPT or Gemini enough context about my problems, they
         | usually come up with useful suggestions--sometimes amazingly
         | well. Are they better at that than the smartest human? I don't
         | know. How do you quantify the quality of an idea? But those
         | ideas often seem really, really good to me.
         | 
         | Another difficult-to-measure capability is interaction. Back-
         | and-forth conversations with models don't always go well, but
         | when they work they frequently blow me away. But those
         | successes are dependent partly on the model, partly on me, and
         | partly on how the conversation happens to unfold. Again, that
         | success or failure doesn't seem measurable with benchmarks that
         | require objectively right answers.
        
         | taeric wrote:
         | I'm curious why you are confident they would be more
         | intelligent than a modern toddler?
         | 
         | I largely empathize with your point. But, as I can recognize
         | there are some out there far better at problem solving than I
         | am, I am growing ok with the idea that intelligence can be
         | measured. Not to a single number, most likely, but to a variety
         | of different aspects.
         | 
         | Similarly, I'd imagine that a human from 2000 years ago is
         | probably more hardy than one from the modern age. If only
         | because of selection effects at play.
         | 
         | Obviously, you can't extrapolate a straight line between either
         | measurement and expect it to continue in either direction. But
         | I don't know why you couldn't build up a measurement for it?
         | 
         | (And it should go without saying that you shouldn't be judging
         | worth using this sort of measurement.)
        
           | ianburrell wrote:
           | Adults from 2000 years ago would absolutely be smarter than
            | toddlers. Adults back then watched and outthought their
           | toddlers. Do you think toddlers now are much smarter?
           | Especially when toddlers are from before they get educated.
           | 
           | Remember that 2000 years ago is 24AD, the middle of the Roman
           | empire and Han dynasty which covered half of the world
           | population. Nobles would be literate and well educated,
           | artisans and soldiers would be skilled, and I bet there were
           | lots of smart peasants that got ignored.
           | 
            | They wouldn't do well on intelligence tests because they're
            | not used to them, but that is more about the tests than their
            | intelligence.
           | I'm sure that the average intelligence is lower than now from
           | lack of education and malnutrition. Smart ones would still be
           | smart. Also, I bet people from now would do poorly in their
           | environment.
        
         | munchbunny wrote:
         | I think the concept you're dancing around the edges of is the
         | nature of what parts of "intelligence" are driven by:
         | 
         | 1. Language and how interrelated it is to our ability to
         | transfer knowledge and experience, as well as its role in
         | structuring our internal thinking. I haven't seen any academic
         | research on the matter, but there are more and less concrete
         | instances of this throughout history. This Wikipedia article
         | about the history of Algebra is a great example of how 2000
          | years of evolution led to a formulation of the same concepts,
          | but with a reduced cognitive load that 10-12 year olds learn
          | today as a matter of course.
          | (https://en.wikipedia.org/wiki/History_of_algebra#Stages_of_a...).
         | 
         | 2. Knowledge, transferred through language, education, and
          | culture. Calculus in the early 1600's is a great example:
          | without it and subsequent developments, probably 80% of the
         | college/post-grad math/science/physics education wouldn't even
         | exist. The stuff we teach our 18 year olds today required the
         | 1600s' greatest minds to figure out.
         | 
         | 3. The capacity of our human wetware.
         | 
         | It's hard to treat #3 in isolation because our modern concept
         | of intelligence is inextricably tied to #1 and #2. Also it's
         | hard to place where "critical thinking" and "creativity" enter
         | the picture, since they both rely heavily on all three aspects
         | above.
        
       | fakedang wrote:
       | So Deepseek gives out the correct answer the highest percentage
       | of all SOTA models, yet is the least confident of all models?
        
         | myrmidon wrote:
         | There is no text-only evaluation of the other models, though.
         | The comparison might be completely invalid.
        
           | og_kalu wrote:
           | There is actually. It's a bit buried. Section C.2 of the
            | paper (page 24).
           | 
           | R1 is still the best. o1 drops a little (8.9)
        
         | sottol wrote:
         | I think it might mean the opposite of what one would expect.
         | Afaict, calibration error means something along the lines of
         | "how often was the model wrong but confident that the answer
         | was correct".
         | 
         | That means a low calibration error would be a good thing, ie
         | the model correctly recognizes when it is unsure about answers
         | instead of confidently stating the wrong answer.
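          | 
          | As a rough illustration (my own sketch, not necessarily the
          | exact metric the benchmark reports), a binned expected
          | calibration error averages the gap between stated confidence
          | and actual accuracy; lower is better:
          | 
          |     import numpy as np
          | 
          |     def expected_calibration_error(confidences, correct, n_bins=10):
          |         confidences = np.asarray(confidences, dtype=float)
          |         correct = np.asarray(correct, dtype=float)
          |         edges = np.linspace(0.0, 1.0, n_bins + 1)
          |         ece = 0.0
          |         for lo, hi in zip(edges[:-1], edges[1:]):
          |             mask = (confidences > lo) & (confidences <= hi)
          |             if mask.any():
          |                 acc = correct[mask].mean()
          |                 conf = confidences[mask].mean()
          |                 ece += mask.mean() * abs(conf - acc)
          |         return ece
          | 
          |     # Confident (90%) but right only 10% of the time -> large error.
          |     print(expected_calibration_error([0.9] * 10, [1] + [0] * 9))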
        
       | m3kw9 wrote:
       | Looks more like first exam
        
       | nottorp wrote:
       | So who told all these "AI" companies that it's a good idea to
       | market your product as the one who will bring the end of homo
       | sapiens fastest?
        
         | bananapub wrote:
         | seems to be working fine, people seem to care what Sam Altman
         | says and Elon Musk is making himself deputy emperor of a
         | nuclear weapons state. pretty fucking dire indictment of the
         | rest of us and what we let the world come to.
        
           | tivert wrote:
           | > seems to be working fine, people seem to care what Sam
           | Altman says and Elon Musk is making himself deputy emperor of
           | a nuclear weapons state.
           | 
           | For the billionaires and chatterers.
           | 
           | But even for non-chatterers, you probably should pay
           | attention to what Altman says, not so much in a gullible
            | take-it-at-face-value sense, but in a kremlinologist
            | look-carefully-for-_hints_-about-what's-really-going-on sense.
           | 
           | > pretty fucking dire indictment of the rest of us and what
           | we let the world come to.
           | 
           | What are the rest of us to do? Pretty much _everyone_ has
           | been trained by society to follow the rules above all else as
           | a strong moral imperative, no matter how stupid or how bad
           | the collective outcome may be. If you do otherwise, you
            | _will_ get smacked _hard_; and if you try to organize, you
           | will all get smacked _harder_.
        
       | xnx wrote:
       | Interesting marketing for Scale AI. I'd be surprised if any
       | foundation models started benchmarking against this.
       | 
       | Captchas seem like the more interesting test. As long as there
       | are captchas that average people can solve, but computers can't,
       | we will still have a long way to go toward artificial
       | intelligence.
        
         | sebzim4500 wrote:
          | I don't think this is necessarily true. I can imagine a future
          | in which we have robots that can do 99% of human jobs but
          | there's one thing they are strangely bad at: some otherwise
          | unimportant skill that can be used as a captcha.
        
       | elicksaur wrote:
       | XKCD #927 vibes. https://xkcd.com/927/
       | 
       | Prediction: Just like how ARC wasn't actually a measure of AGI,
       | this too will get "solved" without AI being useful enough to gain
       | mass adoption.
        
         | GaggiX wrote:
          | Haven't they already achieved mass adoption? And I'm talking
          | about LLMs in particular, because AIs in general, like the ones
          | used by Instagram filters and the TikTok recommendation
          | algorithm, are already used by billions.
        
         | sebzim4500 wrote:
         | I don't think that's really relevant, because there is an
         | actual need for a new benchmark given how the existing ones
          | either keep getting saturated or are probably out of reach for
          | the next generation of models.
         | 
          | The closest existing thing is the FrontierMath benchmark, but
          | that's just maths whereas this is more diverse.
        
         | og_kalu wrote:
         | chatgpt is like the 8th most visited site worldwide 2 years
         | after release. It already has mass adoption lol. This is about
         | more than that.
        
       | renjimen wrote:
       | I don't know about groundbreaking. It's just more academic
       | questions. We already have a lot of those benchmarks, this is
       | just a bit harder, but at this point these models are so
       | glaringly bad at so many other areas APART from academic
       | questions. Benchmarks for spatial reasoning or theory of mind are
       | more interesting now, for example. These kinds of understanding
       | are far more important if we expect to integrate AI into our
       | everyday lives. I suspect even our most distant primate cousins
       | could outperform multi-modal models on these kinds of tests.
        
         | jfengel wrote:
         | It does feel a bit like the early days of AI:
         | 
         | "We want to make computers do what smart people do. What do
         | smart people do? They play chess! Once we've solved that,
         | everything else will be easier."
         | 
         | It has been remarkable how much of the "easier" stuff they've
         | made progress on -- like natural language and images. But after
          | a huge quantum improvement, they don't seem very good at
          | adapting to a lot of the things we really need them for.
        
           | renjimen wrote:
           | Exactly!
           | 
           | Whatever world model LLMs have is like this crippled view
           | through the lens of the internet. They are really like
           | savants.
           | 
           | It's annoying the AI companies are still touting their
           | performance on all these metrics for domain knowledge in
           | white collar jobs, but in truth they will fail in all but the
           | most narrow application in those domains because they can't
           | understand basic human behaviour.
        
       | dccsillag wrote:
       | Can we please rename this submission? This is excessively
       | grandiose, way over the top......
        
       | m_ke wrote:
       | The only reliable final test will be a black box test suite that
       | takes your model, executes it in a sealed environment and gives
       | you a grade back, potentially with a performance break down by
       | subject.
       | 
       | No telling companies what the questions look like, what the
       | output format is, what topics are covered, so that there's no
       | room to make up synthetic data to interpolate from.
        
       | sebzim4500 wrote:
       | The name is obviously a bit stupid, but based on the sample
       | questions I think they did a good job of creating a harder
       | version of the existing academic question benchmarks.
       | 
       | The questions are possible for a smart person familiar with the
       | subject but still just beyond SOTA models.
       | 
       | My guess is that within the next few years we will have models
       | that can ace this test but are still bizarrely bad at things we
       | find easy.
        
       | pavel_lishin wrote:
       | > _Hummingbirds within Apodiformes uniquely have a bilaterally
       | paired oval bone, a sesamoid embedded in the caudolateral portion
       | of the expanded, cruciate aponeurosis of insertion of m.
       | depressor caudae. How many paired tendons are supported by this
       | sesamoid bone? Answer with a number._
       | 
       | I wonder how many questions give a gentle _nudge_ towards the
       | answer like this. How many answers would have been wildly off the
       | mark without specifying what the answer needs to look like?
        
         | zeroonetwothree wrote:
         | Good point. I wouldn't expect a human to need the last
         | sentence.
        
           | salynchnew wrote:
           | The generous hypothesis, here, is that this is so they can
           | automate the benchmarking itself. If that is true, then this
           | is likely a result of the test authors being too clever for
           | their own good and over-optimizing. If an LLM can't figure
            | out on its own that "how many" is asking for a number, it
           | has failed at a much more basic level.
           | 
           | You should be able to easily accept answers like "four" and
           | "4" as equivalent, for example. I doubt there will be that
           | many frontier models running against this test at any time,
           | and a simple glance at the answers from any human should be
           | enough to catch edge cases like this one.
        
         | sdwr wrote:
         | Isn't this a terrible question to measure intelligence? It
         | looks like it's testing niche domain knowledge along the lines
         | of:
         | 
         | > What color is the ball hidden behind the flowerpot in my
         | neighbor's backyard?
         | 
         | Maybe you can reason towards the answer if you only have a deep
         | knowledge of bird anatomy and not Apodiformes anatomy, and
         | that's the intelligence part?
        
       | disambiguation wrote:
       | I haven't been following up to the minute details of ai progress,
       | training, and benchmarking - beyond a daily dose of HN articles.
       | 
       | But the trend seems to be: today's benchmark becomes tomorrow's
       | training data.
        
       | bwfan123 wrote:
        | please don't self-proclaim "groundbreaking" or "novel" or
        | "innovative" - it diminishes your contribution since it clearly
        | is an attention-grab.
        
       | dang wrote:
       | I briefly merged this thread into
       | https://news.ycombinator.com/item?id=42804853, but actually the
       | current article has more context, so probably we should keep this
       | as the top link and then people can look at https://lastexam.ai
       | also.
        
       | dang wrote:
       | The project site is https://lastexam.ai. Readers may want to look
       | at both.
        
       | jbenoit wrote:
       | They started collecting problems last fall, saying the top 550
       | submissions sent in by Nov 1st would get rewarded, to the tune of
       | $500-$5000 each.
       | 
       | Near the deadline, I counted the total number of submissions, and
       | realized that each question I wrote had an expected value of
       | hundreds of dollars, which is a great use of my time. So I wrote
       | a good number, using the knowledge gained in my CS Ph. D.
       | 
       | Then, as the Nov 1st deadline rolled around, they announced they
       | extended the deadline to Nov 15th. Then Nov 15th came, and it
       | said on their website they were still accepting submissions.
       | 
       | Most of my submissions are being included in the benchmark, but
       | I'm only getting paid $500, for one of them (the one I thought
       | was most standard and least difficult, funnily enough). Had they
       | closed submissions when they said they would, it seems likely I'd
       | be paid for a few more.
       | 
       | From my perspective, they basically conned hundreds of Ph. D.'s
       | around the world to write questions for much less reward than
       | promised. My close friend wrote a large number of questions for
       | them, is getting paid thousands of dollars, and still feels
       | defrauded.
       | 
       | I'm not sure what they're doing in the end. It sounds like
       | they're mostly just paying people who submitted before Nov 1st
       | with a few exceptions, but either way they lied. There was no
       | indication that people who submitted later would not get paid,
       | and there was no indication that the deadline would be extended.
       | Either they pay people who submitted after Nov 1st, meaning they
       | lied to the people who submitted before about their expected
       | reward. Or they don't, meaning they majorly lied to the people
       | who submitted after. Either way, it's clear grounds for a class
       | action lawsuit, and I hope one gets running.
        
         | vkou wrote:
          | You shouldn't engage in a CAL; a regular lawsuit from anyone
         | wronged will be cheaper and way more painful for them.
         | 
         | If you're in the US, consider small claims court. It's a small
         | sum of money, you won't need to pay a lawyer, they'll probably
         | not even show up.
        
           | jbenoit wrote:
           | Hmmm. I can see how it would be more painful for them to
           | fight, but most people were conned <$200, and it's rather
           | self-sacrificing to fight for that. Plus, no-one wants a
           | reputation as litigious, but starting a CAL is less conducive
           | to creating that reputation.
           | 
           | I only submitted before Nov 1st, so I'm not sure to what
           | extent I was personally conned.
        
             | smandelbrot wrote:
             | I think it'd be illuminating to see some overview stats on
             | the submission dates and authors of all questions, accepted
             | and not. Is something like this available somewhere?
        
         | baobabKoodaa wrote:
         | Scale AI's whole business model is wage theft. I don't mean to
         | be insensitive, but out of all the Scale AI experiences I've
         | heard about, yours is the least egregious. It's a dystopian,
         | shitty company.
        
           | levocardia wrote:
           | I was similarly conned by Scale AI -- promised a significant
           | bonus for some tasks, then rejected and not paid at all. Bet
           | they kept my task text anyways.
           | 
           | It's a classic scam: make a job post for freelancers, ask for
           | a "work sample" or "take-home project," then have a few dozen
           | applicants do the actual task you need them to do as their
           | sample, then reject everybody.
        
       | mrandish wrote:
       | Assessing AI's progress toward replicating the full breadth and
       | depth of human intelligence is a deceptively hard problem. A
       | paper by Francois Chollet, who was until recently a researcher at
       | Google, called "On the Measure of Intelligence" is the best
       | overview of the challenges I've read. Highly recommended.
       | 
       | https://arxiv.org/abs/1911.01547
        
       | UncleOxidant wrote:
       | Interesting that DeepSeek R1 which supposedly cost only $5.5M to
       | train currently has the top score at 9.4%
        
       | kaonwarb wrote:
       | Quite the name! Looking forward to "Humanity's Last Exam
       | v2.final.FINAL2..." coming next
        
       | EncomLab wrote:
       | I am reminded of the study that showed an AI trained on tumor
       | identification was heavily biased toward indicating a tumor was
       | cancerous if it was circled in purple ink or a visual scale was
       | included in the image - as the cancerous tumors in its training
       | set shared those traits while images of benign tumors did not.
       | 
        | These systems do not possess some sort of "woo" that gives them
        | magical powers when running LLM code that they would lose if they
        | ran a spreadsheet. Whatever attributions of intelligence are
        | given have far more to do with our human willingness to
        | anthropomorphize than with a hidden ghost in the machine.
        
       ___________________________________________________________________
       (page generated 2025-01-23 23:01 UTC)