[HN Gopher] GPT4 Can't Ace MIT
       ___________________________________________________________________
        
       GPT4 Can't Ace MIT
        
       Author : YeGoblynQueenne
       Score  : 121 points
       Date   : 2023-06-17 14:51 UTC (8 hours ago)
        
 (HTM) web link (flower-nutria-41d.notion.site)
 (TXT) w3m dump (flower-nutria-41d.notion.site)
        
       | underanalyzer wrote:
        | Great analysis; props to these students for taking the time
        | to challenge such a sensational headline. In the conclusion
        | they mention my biggest problem with the paper, which is that
        | it appears gpt4 grades the answers as well (see section 2.6
        | "Automatic Grading").
       | 
        | In a way it makes perfect sense that gpt4 can score 100% on a
        | test gpt4 also grades. To be clear, the grading gpt4 has the
        | answers, so it does have more information, but it still might
        | overlook important subtleties in how the real answer differs
        | from the generated answer due to its own failure to understand
        | the material.
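        | 
        | To make the circularity concrete, here is a minimal sketch of
        | the kind of self-grading loop section 2.6 describes (my
        | guesses at the prompt and code, not the paper's actual
        | implementation; gpt4() is assumed to wrap the chat API):
        | 
        |     import re
        | 
        |     # hypothetical reconstruction of "GPT-4 grades GPT-4"
        |     def auto_grade(question, reference, generated):
        |         prompt = (f"Question: {question}\n"
        |                   f"Reference answer: {reference}\n"
        |                   f"Submitted answer: {generated}\n"
        |                   "Grade the submission from 0 to 5.")
        |         reply = gpt4(prompt)  # same model that wrote `generated`
        |         return int(re.search(r"[0-5]", reply).group())
        | 
        | Whatever the grader misunderstands about the material, the
        | answer generator misunderstands in exactly the same way.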
        
         | ghaff wrote:
          | I noticed that when I read the paper. I know it's hard to
          | scale, but I'd want to see competent TAs doing the grading.
          | I also found the distribution of courses a bit odd. Some of
          | it might just be individual samples, but intro courses I'd
          | expect to be pretty cookie-cutter (for GPT) were fairly far
          | down the list, and things I'd expect to be really
          | challenging had relatively good results.
        
           | raunakchowdhuri wrote:
           | Can attest that the distribution is odd from the test set
           | that we sampled.
           | 
            | We've already run the zero-shot GPT model on all of the
            | datapoints in the provided test set. We're going through
            | the process of grading them manually now (our whole
            | fraternity is chipping in!) and should have the results
            | out relatively soon.
           | 
           | I can say that, so far, it's not looking good for that 90%
           | correct zero-shot claim either.
        
             | mquander wrote:
              | Since you are here: when I was reading the paper, I
              | wondered - when they show the "zero-shot solve rates",
              | does that mean they are basically running the same
              | experiment code, but without the prompts that call
              | `few_shot_response` (i.e. still trying each question
              | with every expert prefix and every critique)? It wasn't
              | clear to me at a glance.
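              | 
              | For concreteness, this is the loop structure I'm
              | picturing (Python-ish pseudocode; only
              | `few_shot_response` is a name from the paper's code, the
              | rest are mine):
              | 
              |     for prefix in EXPERT_PREFIXES:
              |         ans = zero_shot_response(q, prefix)
              |         # omitted when reporting "zero-shot"?
              |         for ex in FEW_SHOT_EXAMPLES:
              |             ans = few_shot_response(q, prefix, ex)
              |         # and are the critique rounds still run?
              |         for _ in range(MAX_CRITIQUES):
              |             ans = self_critique(q, ans)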
        
         | code51 wrote:
         | This "GPT4 evaluating LLMs" problem is not limited to this
         | case. I don't know why exactly but everyone seems to have
         | accepted the evaluation of other LLM outputs using GPT4. GPT-4
         | at this point is being regarded as "ground-truth" with each
         | passing day.
         | 
         | Couple this with the reliance on crowd-sourcing to create
         | evaluation datasets and heavy use of GPT3.5 and GPT4 by MTurk
         | workers, you have a big fat feed-forward process benefiting
         | only one party: OpenAI.
         | 
         | The Internet we know is dead - this is a fact. I think OpenAI
         | exactly knew how this would play out. Reddit, Twitter and the
         | like are awakening just now - to find that they're basically
         | powerless against this wave of distorted future standards.
         | 
         | When sufficiently proven to pass every existing test on Earth,
         | every institution would be so reliant on producing work with
         | GPT that we won't have a "%100 handmade exam" anymore. No
         | problem will be left for GPT to be tackled with.
        
         | mquander wrote:
         | > In a way it makes perfect sense that gpt4 can score 100% on a
         | test gpt4 also grades.
         | 
         | Even this is overstating it, because for each question, GPT-4
         | is considered to get it "correct" if, across the (18?) trials
         | with various prompts, it ever produces one single answer that
         | GPT-4 then, for whatever reason, accepts. That's not getting
         | "100%" on a test.
        
           | [deleted]
        
         | afro88 wrote:
         | > but it still might overlook important subtleties
         | 
          | If there's one thing we can be certain of, it's that LLMs
          | often overlook important subtleties.
         | 
         | Can't believe they used GPT4 to also evaluate the results. I
         | mean, we wouldn't trust a student to grade their own exam even
         | when given the right answers to grade with.
        
       | asylteltine wrote:
       | [dead]
        
       | constantcrying wrote:
        | Pretty damning. Certainly seems retraction-worthy.
        | 
        | It's also somewhat upsetting that something so low-quality is
        | actually getting published. It seems entirely driven by hype,
        | not intellectual rigor.
        
         | ericpauley wrote:
         | It should be noted that the original paper is a preprint, i.e.,
         | not peer-reviewed.
        
           | constantcrying wrote:
           | I count 15 people as authors. Surely one of them was able to
           | look over the data and methodology.
           | 
           | "Not peer reviewed" is no excuse to publish non-information
           | for the sake of headlines.
        
             | rat9988 wrote:
             | It's not published though. That's what he is trying to tell
             | you. There is no guarantee it would have been published.
        
               | Sharlin wrote:
               | Different definitions of "published". Uploading a
               | preprint to arxiv or whatever definitely counts as
               | publishing it in the nontechnical sense of the word - to
               | an audience comprising several billion eyeballs, no less!
        
               | CameronNemo wrote:
               | Still, the question remains who published it. Some of the
               | authors (perhaps the supervising ones) may have wished
               | not to submit it to journals, and a zealous undergrad may
               | have uploaded it to arxiv without removing the other
               | authors.
        
               | QuantumCodester wrote:
               | Actually, it was the senior author who posted it on his
               | twitter:
               | https://twitter.com/iddo/status/1669665897454227456?s=20
        
               | constantcrying wrote:
                | It absolutely is published, just as a preprint.
                | 
                | Again, none of this excuses it. This isn't an innocent
                | mistake that could have been caught later. The dataset
                | is flawed and the methodology is questionable, yet the
                | authors _published_ it on arxiv with spectacular
                | claims.
                | 
                | If you don't know, there has been a significant shift
                | in how scientific papers (STEM for the most part) are
                | distributed. Instead of going through journals (which
                | have lost almost all their use in a digital world),
                | papers are made freely accessible online without any
                | formal quality control, before potentially being
                | published in a journal later. Arxiv, where these
                | papers appear, controls who gets to publish (it is not
                | open to the public) but doesn't require a lengthy
                | formal process. In mathematics this has worked
                | remarkably well; notably, one of the millennium
                | problems was solved when the solution was uploaded to
                | arxiv.
                | 
                | Polluting arxiv with low-quality clickbait is
                | destructive; "not being peer reviewed" is no excuse
                | for bad science.
        
         | onos wrote:
         | Makes one wonder what fraction of papers are equally bad, but
         | just haven't been subjected to similar scrutiny.
        
           | disgruntledphd2 wrote:
            | Most of them. Given a set of incentives to publish as much
            | as possible, one would expect the quality to decline at
            | least linearly (assuming that the rate of actual discovery
            | is constant).
        
       | raunakchowdhuri wrote:
       | Am one of the authors of the linked article. Thanks for all of
       | your kind comments! Let me know if you have any questions and
       | I'll be happy to answer.
        
         | dmazzoni wrote:
         | Whose idea was it? Was this something done for fun or was it
         | suggested by a professor you're working with?
        
           | raunakchowdhuri wrote:
           | Neil posted the paper in our fraternity's ML group chat (MIT
           | things lol), and I expressed some skepticism at the results.
           | 
            | Initially we started looking into it more for curiosity's
            | sake, but as we started digging we kept finding more and
            | more ridiculous stuff, to the point where we decided to
            | start working on a blog post. Then David joined in on the
            | action and helped a ton with the research and writeup.
           | 
           | No professor was involved. The paper was released yesterday,
           | so we just documented as we went along in the investigation
           | process. It only took like 8 hours of work to compile the
           | doc. We finished it last night and posted to Twitter in the
           | morning.
        
       | LouisSayers wrote:
        | There's no need for such a complicated task.
        | 
        | Recently I found that GPT4 can't even reliably create a list
        | of German nouns with a given article (der / die / das).
        | 
        | It will mess up a simple list - yet if you ask it to analyse
        | the list, it'll be able to tell you that it's wrong.
        | 
        | Then you get it to correct the list, and it may still be
        | wrong.
        | 
        | It can take several iterations to make the list correct. I
        | would have thought this would be a super easy task for it, but
        | apparently not.
        
         | [deleted]
        
         | wenc wrote:
         | Curious, could you share your prompt? I just tried asking GPT4
         | (paid) to create a list of German nouns with der/die/das and it
         | managed to do it correctly.
        
           | LouisSayers wrote:
           | "What are the 200 most common nouns in German with the
           | artikel "der"?"
           | 
           | I actually didn't realise it was giving me incorrect info
           | until my gf started looking at it!
           | 
            | (I was trying to use it to help me learn German)
        
             | wenc wrote:
             | I gave GPT4 that prompt and as far as I can tell, the list
             | is correct.
             | 
             | GPT answer:
             | 
             | It's difficult to determine the "200 most common" nouns for
             | a particular gender in German since the frequency of word
             | use can vary greatly depending on the context (e.g.,
             | conversational, academic, technical). However, here's a
             | list of 100 common masculine nouns, which use the definite
             | article "der".
             | 
             | der Mann (man), der Tag (day), der Freund (friend), der
             | Vater (father), der Bruder (brother), der Sohn (son), der
             | Monat (month), der Kurs (course), der Tisch (table), der
             | Stuhl (chair), der Garten (garden), der Zug (train), der
             | Platz (place), der Himmel (sky), der Wagen (car), der Brief
              | (letter), der König (king), der Schuh (shoe), der Baum
              | (tree), der Hund (dog), der Bär (bear), der Fisch (fish),
              | der Vogel (bird), der Schüler (pupil), der Regen (rain),
             | der Wind (wind), der Wald (forest), der Berg (mountain),
             | der Fluss (river), der Junge (boy), der Kaffee (coffee),
             | der Lehrer (teacher), der Arzt (doctor), der Student
              | (student), der Polizist (policeman), der Künstler (artist),
              | der Teller (plate), der Löffel (spoon), der Geldbeutel
             | (wallet), der Laptop (laptop), der Schalter (switch), der
             | Fernseher (television), der Schrank (cupboard), der Helm
             | (helmet), der Hut (hat), der Schmuck (jewelry), der Ring
             | (ring), der Sport (sport), der Schaden (damage), der Boden
             | (floor), der Sand (sand), der Sturm (storm), der Preis
             | (price), der Feind (enemy), der Held (hero), der Raum
             | (room), der Morgen (morning), der Abend (evening), der
             | Unfall (accident), der Vortrag (lecture), der Urlaub
             | (vacation), der Ausflug (trip), der Hafen (harbor), der
             | Bericht (report), der Name (name), der Bauer (farmer), der
             | Rechner (calculator), der Traum (dream), der Anzug (suit),
             | der Geschmack (taste), der Eindruck (impression), der Zweck
             | (purpose), der Vertrag (contract), der Krieg (war), der
             | Kunde (customer), der Arbeitgeber (employer), der
             | Mitarbeiter (employee), der Kollege (colleague), der
             | Bewohner (resident), der Fahrer (driver), der Gast (guest),
             | der Kritiker (critic), der Profi (professional), der Sieger
             | (winner), der Kandidat (candidate), der Beamte (official),
             | der Insasse (inmate), der Zeuge (witness), der Beweis
             | (proof), der Schatten (shadow), der Zweifel (doubt), der
             | Trauer (grief), der Frieden (peace), der Nerv (nerve), der
             | Horizont (horizon), der Gedanke (thought), der Lohn (wage),
             | der Antrag (application), der Verlust (loss), der Betrag
             | (amount),
        
             | whats_a_quasar wrote:
             | This isn't the same task that you described in the first
             | comment. Did GPT4 include nouns that didn't use the article
             | "der" in its output? Or did it fail to reply with the 200
             | most common ones?
        
               | LouisSayers wrote:
               | Apologies, I was trying to give a little context for
               | those that don't know about the three "the's" in German.
               | 
               | The list was mostly correct, but yes it added nouns that
               | were not "der" nouns into the list.
               | 
               | It then attempted to correct the list and failed at
               | correcting it.
               | 
               | In terms of output, it also didn't want to give a list of
               | 200, but I did manage to get a list of around 100 back.
        
               | dmazzoni wrote:
               | I wonder if it'd help to ask it to write the noun with
               | the article. I just tried it now - I asked it to list the
               | top 100 German "der" nouns, then asked it to repeat the
               | list but with the article. That made it obvious which
               | ones were wrong!
               | 
               | It was unwilling to write "Der Jahr" when "Das Jahr" is
               | correct.
        
         | nielsole wrote:
          | With the following prompt I got 50 consecutive words with
          | correct articles:
          | 
          | > Erstelle eine Liste mit 10 Nomen, die den Artikel "der"
          | haben. ("Create a list of 10 nouns that take the article
          | "der".")
         | 
         | Maybe "reliably" is doing a lot of the heavy lifting?
        
           | LouisSayers wrote:
            | It's not that it did a bad job, but more that you couldn't
            | totally rely on what it had produced, nor rely on it
            | having corrected the list without first having it
            | reanalyse the "corrected" list.
            | 
            | It's still extremely helpful. I just found it strange that
            | something that has been fed millions of documents could
            | still give some incorrect results on such a seemingly
            | simple task - especially AFTER it had analysed its own
            | results and found some noun articles to be incorrect.
        
             | afro88 wrote:
              | I've found you still can't rely on LLMs to do anything
              | 100% correctly without human oversight, unless you spend
              | a lot of time prompt engineering and testing. Even then
              | you might not get as close to 100% as you'd like.
             | 
             | But as you say, they are still extremely helpful anyway.
        
         | akira2501 wrote:
         | > I would have thought this would be a super easy task for it
         | 
          | Why did you think that? This isn't meant to be critical, but
          | I'm honestly curious: what led you to believe that the
          | technology underlying GPT-4 made it a good fit for this or
          | any particular task?
        
           | constantcrying wrote:
           | >Why did you think that?
           | 
            | It is a purely statistical model. It does not know any
            | "rules" about the language (it doesn't know any language
            | at all); it is fed data and derives from it sophisticated
            | probabilistic relationships between words.
            | 
            | It shouldn't have much of a problem generating the correct
            | grammatical formulations, as it has been extensively
            | trained on them. More so than any other technology, neural
            | networks are suited to this kind of task, where hard rules
            | do not exist (as a German, I couldn't tell you why "rain"
            | is masculine but "machine" is feminine) yet lots of data
            | correctly implements the rule.
        
           | londons_explore wrote:
           | I too would think it a super easy task.
           | 
            | It has probably seen the correct nouns used millions of
            | times in the training data - and asking it to produce the
            | correct articles for a bunch of words is really just "tell
            | me which case you saw most during training", which is
            | something LLMs are really good at.
        
         | wudangmonk wrote:
         | I think this task is beyond the capabilities of what GPT4 can
         | handle, this is simply asking too much out of it. For other
         | languages I'm sure it has no problems.
         | 
         | https://faculty.georgetown.edu/jod/texts/twain.german.html
        
           | [deleted]
        
           | freehorse wrote:
            | In Greek it is also really bad (and makes rather obvious
            | mistakes); in French it seems much better, but it makes
            | some very obvious mistakes too. To make it interesting, I
            | emphasise that I want nouns that refer to objects only
            | (otherwise it just spits out profession names and stuff
            | like that, which is not interesting).
           | 
           | Also tbh, with all the hype of LLMs one would think that such
           | a task would not be such a challenge.
        
             | dontupvoteme wrote:
             | >Also tbh, with all the hype of LLMs one would think that
             | such a task would not be such a challenge.
             | 
              | The strange/sad thing is that despite being "large
              | language models" they're often hypermyopic on English...
              | 
              | I've done some measurements comparing generation across
              | various prompt languages, and no matter what I do, half
              | the time I cannot get them to not include English text
              | or comments in code unless the request is made in
              | Japanese, Chinese, or a similarly very different
              | language.
        
         | dontupvoteme wrote:
          | Does it work if you ask in German? I found it's better if
          | you tell it via the system prompt that it's a language
          | professor (using your target language) than if you just use
          | English for tasks involving a foreign language. The power of
          | the LARP.
          | 
          | (I use a normal machine translation API for a lot of this,
          | but you can also ask it in another context window to
          | translate the text to other languages. I use this approach
          | for e.g. Sindarin)
        
           | LouisSayers wrote:
           | I didn't do that, but will give it another go. Thanks for the
           | suggestion!
           | 
           | Otherwise my current strategy is to put it in an analysis
           | loop until it deems the list to be correct.
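            | 
            | Roughly like this (a sketch against the 2023-era openai
            | Python client; prompts abbreviated and the loop bound
            | arbitrary, so treat it as illustrative only):
            | 
            |     import openai  # assumes openai.api_key is set
            | 
            |     def ask(prompt):
            |         r = openai.ChatCompletion.create(
            |             model="gpt-4",
            |             messages=[{"role": "user", "content": prompt}])
            |         return r.choices[0].message.content
            | 
            |     nouns = ask('List 20 common German nouns that take '
            |                 'the article "der".')
            |     for _ in range(5):  # bail out if it never converges
            |         verdict = ask("Check the article of each noun:\n"
            |                       f"{nouns}\nReply OK if all are "
            |                       "correct, otherwise output a "
            |                       "corrected list.")
            |         if verdict.strip() == "OK":
            |             break
            |         nouns = verdict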
        
       | Spivak wrote:
        | I think this is a good peer review with a too-defensive tone.
        | You can tell the authors really want this to be false, and it
        | clouds their conclusions. In particular, they're falling prey
        | to treating absence of evidence as evidence of absence: their
        | null is that GPT4 can ace MIT, and they haven't provided
        | evidence to reject it.
        | 
        | The real conclusion is "we think the paper's results might be
        | weaker if repeated", and that's a good result on its own. But
        | it could also happen that, if the study were redone with the
        | issues addressed, it would still pass.
        
         | mquander wrote:
         | I think it's appropriate to be extremely critical. The paper is
         | basically useless. The thing that they actually measured is
         | "can GPT-4, when given a 'question' with lots of additional
         | information and many tries with small permutations to produce
         | an 'answer', at some point produce an 'answer' that GPT-4 will
         | then claim is a 5 out of 5 answer, on a dataset of extremely
         | messy 'questions' and 'answers' from MIT coursework."
         | 
         | That's not an interesting thing to measure. The paper talks
         | about it in terms that make it sound like it's a close proxy
         | for whether GPT-4 "knows" how to do things in MIT coursework,
         | by writing misleadingly about "fulfilling the graduation
         | requirements" and having a "perfect solve rate." But in fact
         | it's totally different. The result is that a bunch of people
         | hear about this paper and get fooled into thinking that there
         | is new interesting evidence about GPT-4's capabilities, unless
         | they manage to read closely enough to see what actually
         | happened.
         | 
         | It's not a matter of whether the results would get weaker if
         | repeated, it's a matter of the results being totally
         | disconnected from any useful real-world information about what
         | GPT-4 can do, or how it can do it.
        
         | constantcrying wrote:
         | >I think this is a good peer review with a too defensive tone.
         | 
          | Not surprising when you have undergrads calling a seemingly
          | reputable paper borderline fraud. I certainly cannot blame
          | them for the tone.
        
         | raunakchowdhuri wrote:
          | You're right that our post doesn't quite show that GPT4
          | cannot perform well on MIT curriculum. We try to be up front
          | about this in the conclusion:
          | 
          | > Our critiques are largely of the methodology and rigor of
          | this study, not about its content. We make no claim about
          | the ability of large language models to actually solve MIT
          | curricula, only that this paper fails to prove it in a
          | scientifically rigorous way. Though, as MIT undergraduates
          | ourselves, we can at the very least say that the test set
          | that we accessed does not, at least in our experience,
          | accurately represent the breadth and depth of understanding
          | required to complete an EECS degree at MIT.
        
         | shiomiru wrote:
         | > Our critiques are largely of the methodology and rigor of
         | this study, not about its content. We make no claim about the
         | ability of large language models to actually solve MIT
         | curricula, only that this paper fails to prove it in a
         | scientifically rigorous way. Though, as MIT undergraduates
         | ourselves, we can at the very least say that the test set that
         | we accessed does not, at least in our experience, accurately
         | represent the breadth and depth of understanding required to
         | complete an EECS degree at MIT.
         | 
          | Directly from TFA. This does make the title somewhat click-
          | baity, but they definitely don't claim the opposite.
        
         | quadrifoliate wrote:
         | > peer review...Their null is GPT4 can ace MIT and they haven't
         | provided any evidence to reject.
         | 
          | I haven't been in research for a while, but I don't think
          | that's how peer review works. You don't have to take a
          | paper's claims (especially ones as novel as this one's) as
          | the null hypothesis and provide a compelling refutation.
          | 
          | The detection of sloppy question framing, of answer feeding
          | via the few-shot learning examples, and of the problems with
          | having GPT-4 check itself reasonably shows that there are
          | serious flaws in multiple parts of the experiments described
          | in the paper.
         | 
         | > The real conclusion is "we think the paper's results might be
         | weaker if repeated" and that's a good result on its own.
         | 
         | No, they don't know enough to say that. The paper's results
         | might be better with better experimentation! Or they might be
         | totally false.
         | 
         | The conclusion they provided is accurate precisely because it
         | focuses on the _methods_ and not the conclusions, like this:
         | 
         | > One particularly worrying trend is the technique of
         | evaluating a model's accuracy using a language-based model like
         | GPT-4. While a useful tool, its conclusions should never be
         | overstated or treated as ground truth.
         | 
         | ...and this:
         | 
         | > Additionally, it is extremely important to reevaluate every
         | data point and perform basic sanity checks before using data at
         | all, whether for training, inference, benchmarking, or
         | something else. Given the small size of the dataset in
         | question, a simple manual validation would have been easily
         | within the scope of the work.
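          | 
          | For a dataset this small, those sanity checks are only a few
          | lines; a sketch (the file name and field names here are
          | assumptions, not the paper's actual schema):
          | 
          |     import json
          | 
          |     rows = [json.loads(l) for l in open("test_set.jsonl")]
          |     blank = [r for r in rows
          |              if not r.get("question", "").strip()]
          |     dupes = len(rows) - len({r.get("question") for r in rows})
          |     missing = [r for r in rows if not r.get("answer")]
          |     print(len(blank), dupes, len(missing))
          | 
          | Anything nonzero means a human needs to look before a single
          | benchmark number gets reported.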
        
         | stefan_ wrote:
         | No, it turns out GPT-4 can not answer impossible questions.
         | Maybe it's you that is _too defensive_ and _wants this to be
         | false_?
        
       | freehorse wrote:
        | This is one of the most embarrassing reviews I have ever read
        | (embarrassing for the paper reviewed, that is). AI research
        | urgently needs good practices to adhere to, but as it stands
        | it is really hard to take many of its results seriously, given
        | the opaqueness that characterises many steps of the process.
        | Serious mistakes and bad practices like these certainly do not
        | help the field achieve any credibility.
        
       | psyklic wrote:
       | Even research from OpenAI has attempted to use GPT-4 as quasi-
       | ground truth (as a replacement for human evaluators). For
       | example, their method in the recent paper "Language models can
       | explain neurons in language models" [1] is:
       | 
       | 1. Using GPT-4, generate a text explanation of a neuron's
       | activations on sample input.
       | 
       | 2. Using GPT-4 again, use the text explanation to simulate the
       | neuron on some new text input.
       | 
       | 3. Compare the result to the actual neuron's activations on the
       | new text input.
       | 
       | They justify this by saying human contractors do equally poorly
       | at coming up with text descriptions. However, the procedure is
       | such a black box that it is difficult to make scientific
       | conclusions from the results.
       | 
       | [1] https://openai.com/research/language-models-can-explain-
       | neur...
        
         | malisper wrote:
         | > Even research from OpenAI has attempted to use GPT-4 as
         | quasi-ground truth (as a replacement for human evaluators).
         | 
          | The way OpenAI used GPT-4 is fundamentally different from
          | the way GPT-4 was used to score the answers to the MIT exam.
          | In OpenAI's case, they had GPT-4 generate an explanation of
          | when a neuron in GPT-3 would fire. They then gave that
          | explanation back to GPT-4 and had GPT-4 predict when the
          | specific neuron in GPT-3 would fire. The scoring was done by
          | computing the correlation between when GPT-4 predicted the
          | neuron would fire and when it actually fired. The scoring
          | _was not_ done by GPT-4, as it was for the MIT exam.
         | 
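          | In other words, the final score is an ordinary correlation
          | between two activation vectors, computed outside the model
          | entirely. A sketch with made-up activations:
          | 
          |     import numpy as np
          | 
          |     actual = np.array([0.0, 2.1, 0.0, 5.3, 0.2])
          |     simulated = np.array([0.1, 1.8, 0.0, 4.9, 0.0])
          |     score = np.corrcoef(actual, simulated)[0, 1]  # ~0.99
          | 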
          | In addition, OpenAI did have human evaluators score the
          | explanations as well, to make sure they were human-
          | interpretable[0].
         | 
         | [0] https://openaipublic.blob.core.windows.net/neuron-
         | explainer/...
        
           | tehsauce wrote:
            | Correct, except they did this for GPT-2, not GPT-3.
        
         | EGreg wrote:
         | Can't we just have GPT-4 make the scientific conclusions from
         | these results? /s
        
       | varjag wrote:
       | tl;dr the dataset was nonsensical and the researchers used GPT-4
       | to rate its own answers in the tests.
        
       | [deleted]
        
       ___________________________________________________________________
       (page generated 2023-06-17 23:00 UTC)