[HN Gopher] GPT4 Can't Ace MIT
___________________________________________________________________
GPT4 Can't Ace MIT
Author : YeGoblynQueenne
Score : 121 points
Date : 2023-06-17 14:51 UTC (8 hours ago)
(HTM) web link (flower-nutria-41d.notion.site)
(TXT) w3m dump (flower-nutria-41d.notion.site)
| underanalyzer wrote:
| Great analysis, props to these students for taking the time to
| challenge such a sensational headline. In the conclusion they
| mention my biggest problem with the paper which is that it
| appears gpt4 grades the answers as well (see section 2.6
| "Automatic Grading").
|
| In a way it makes perfect sense that gpt4 can score 100% on a
| test gpt4 also grades. To be clear, the grading gpt4 is given
| the reference answers, so it does have more information, but it
| still might overlook important subtleties in how the real answer
| differs from the generated one, due to its own failure to
| understand the material.
| ghaff wrote:
| I noticed that when I read the paper. I know it's hard to scale
| but I'd want to see competent TAs doing the grading. I also
| found the distribution of courses a bit odd. Some of it might
| just be sampling noise, but intro courses I'd expect to be
| pretty cookie-cutter (for GPT) were fairly far down the list,
| and things I'd expect to be really challenging had relatively
| good results.
| raunakchowdhuri wrote:
| Can attest that the distribution is odd from the test set
| that we sampled.
|
| We've already run the zero-shot GPT model on all of the
| datapoints in the provided test set. We're going through the
| process of grading them manually now (our whole fraternity is
| chipping in!) and should have the results out relatively soon.
|
| I can say that, so far, it's not looking good for that 90%
| correct zero-shot claim either.
| mquander wrote:
| Since you are here: when I was reading the paper I wondered --
| when they show the "zero-shot solve rates", does that mean
| they are basically running the same experiment code, but
| without the prompts that call `few_shot_response` (i.e., still
| trying each question with every expert prefix and every
| critique)? It wasn't clear to me at a glance.
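|
| For readers who haven't opened the paper's repo, here is a
| minimal, self-contained sketch of the cascade being asked
| about. Only `few_shot_response` is a name from the actual
| experiment code; the prefixes, critique count, and stubbed
| model calls are assumptions for illustration:
|
|     import random
|
|     EXPERT_PREFIXES = ["expert A: ", "expert B: ", "expert C: "]
|     NUM_CRITIQUES = 2
|
|     def call_gpt4(prompt):  # stub for the real API call
|         return f"answer to <{prompt}>"
|
|     def grade_with_gpt4(question, answer):
|         # stub for GPT-4 self-grading; pretend it accepts 20%
|         # of attempts regardless of their quality
|         return random.random() < 0.2
|
|     def few_shot_response(question, prefix):
|         # only this function name comes from the paper's repo
|         return call_gpt4(prefix + "with solved examples: " + question)
|
|     def is_marked_solved(question, zero_shot=False):
|         for prefix in EXPERT_PREFIXES:
|             attempts = [call_gpt4(prefix + question)]
|             if not zero_shot:
|                 attempts.append(few_shot_response(question, prefix))
|             for _ in range(NUM_CRITIQUES):
|                 attempts.append(call_gpt4("critique, then retry: "
|                                           + attempts[-1]))
|             # the open question above: does zero_shot also drop
|             # the prefixes and critiques, or only the few-shot
|             # prompt?
|             if any(grade_with_gpt4(question, a) for a in attempts):
|                 return True  # one accepted attempt = "solved"
|         return False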
| code51 wrote:
| This "GPT4 evaluating LLMs" problem is not limited to this
| case. I don't know why exactly but everyone seems to have
| accepted the evaluation of other LLM outputs using GPT4. GPT-4
| at this point is being regarded as "ground-truth" with each
| passing day.
|
| Couple this with the reliance on crowd-sourcing to create
| evaluation datasets, and the heavy use of GPT3.5 and GPT4 by
| MTurk workers, and you have a big fat feed-forward process
| benefiting only one party: OpenAI.
|
| The Internet we know is dead - this is a fact. I think OpenAI
| knew exactly how this would play out. Reddit, Twitter and the
| like are only now awakening - to find that they're basically
| powerless against this wave of distorted future standards.
|
| Once GPT is shown to pass every existing test on Earth, every
| institution will be so reliant on producing work with it that
| we won't have a "100% handmade exam" anymore. No problem will
| be left that GPT hasn't been used to tackle.
| mquander wrote:
| > In a way it makes perfect sense that gpt4 can score 100% on a
| test gpt4 also grades.
|
| Even this is overstating it, because for each question, GPT-4
| is considered to get it "correct" if, across the (18?) trials
| with various prompts, it ever produces one single answer that
| GPT-4 then, for whatever reason, accepts. That's not getting
| "100%" on a test.
| [deleted]
| afro88 wrote:
| > but it still might overlook important subtleties
|
| If there's one thing we can be certain of, it's that LLMs often
| overlook important subtleties.
|
| Can't believe they used GPT4 to also evaluate the results. I
| mean, we wouldn't trust a student to grade their own exam even
| when given the right answers to grade with.
| asylteltine wrote:
| [dead]
| constantcrying wrote:
| Pretty damning. Certainly seems retraction worthy.
|
| It's also somewhat upsetting that something of such low quality
| is actually getting published. It seems entirely driven by hype
| rather than intellectual rigor.
| ericpauley wrote:
| It should be noted that the original paper is a preprint, i.e.,
| not peer-reviewed.
| constantcrying wrote:
| I count 15 people as authors. Surely one of them was able to
| look over the data and methodology.
|
| "Not peer reviewed" is no excuse to publish non-information
| for the sake of headlines.
| rat9988 wrote:
| It's not published though. That's what he is trying to tell
| you. There is no guarantee it would have been published.
| Sharlin wrote:
| Different definitions of "published". Uploading a
| preprint to arxiv or whatever definitely counts as
| publishing it in the nontechnical sense of the word - to
| an audience comprising several billion eyeballs, no less!
| CameronNemo wrote:
| Still, the question remains who published it. Some of the
| authors (perhaps the supervising ones) may have wished
| not to submit it to journals, and a zealous undergrad may
| have uploaded it to arxiv without removing the other
| authors.
| QuantumCodester wrote:
| Actually, it was the senior author who posted it on his
| twitter:
| https://twitter.com/iddo/status/1669665897454227456?s=20
| constantcrying wrote:
| It absolutely is published, just as a preprint.
|
| Again, none of this is an excuse. It isn't an innocent mistake
| that could have been caught later: the dataset is flawed and
| the methodology is questionable, yet the authors _published_
| it on arxiv, with spectacular claims.
|
| If you don't know, there has been a significant shift in how
| scientific papers (STEM for the most part) are distributed.
| Instead of journals (which have lost almost all of their use
| in a digital world), papers are published freely accessible
| online, without any formal quality control, before potentially
| being published in some journal later. Arxiv, where these
| papers are published, has control over who gets to publish (it
| is not open to the public) but doesn't require a lengthy
| formal process. In mathematics this has worked remarkably
| well; notably, one of the Millennium Prize Problems was solved
| when the solution was uploaded to arxiv.
|
| Polluting arxiv with low-quality clickbait is destructive;
| "not being peer reviewed" is no excuse for bad science.
| onos wrote:
| Makes one wonder what fraction of papers are equally bad, but
| just haven't been subjected to similar scrutiny.
| disgruntledphd2 wrote:
| Most of them. Given a set of incentives to publish as much as
| possible, one would expect quality to decline at least
| linearly (assuming that the rate of actual discovery is
| constant).
| raunakchowdhuri wrote:
| Am one of the authors of the linked article. Thanks for all of
| your kind comments! Let me know if you have any questions and
| I'll be happy to answer.
| dmazzoni wrote:
| Whose idea was it? Was this something done for fun or was it
| suggested by a professor you're working with?
| raunakchowdhuri wrote:
| Neil posted the paper in our fraternity's ML group chat (MIT
| things lol), and I expressed some skepticism at the results.
|
| Initially we started looking into it out of curiosity, but as
| we dug we kept finding more and more ridiculous stuff, to the
| point where we decided to start working on a blog post. Then
| David joined in on the action and helped a ton with the
| research and writeup.
|
| No professor was involved. The paper was released yesterday,
| so we just documented things as we went along in the
| investigation. It only took about 8 hours of work to compile
| the doc. We finished it last night and posted it to Twitter
| in the morning.
| LouisSayers wrote:
| There's no need to even give it such a complicated task.
|
| Recently I found that GPT4 can't even reliably create a list of
| German nouns with a given article (der / die / das).
|
| It will mess up a simple list - though if you then ask it to
| analyse the list, it'll be able to tell you that it's wrong.
|
| Then you get it to correct the list and it may still be wrong.
|
| It can take several iterations to make the list correct. I would
| have thought this would be a super easy task for it, but
| apparently not.
| [deleted]
| wenc wrote:
| Curious, could you share your prompt? I just tried asking GPT4
| (paid) to create a list of German nouns with der/die/das and it
| managed to do it correctly.
| LouisSayers wrote:
| "What are the 200 most common nouns in German with the
| artikel "der"?"
|
| I actually didn't realise it was giving me incorrect info
| until my gf started looking at it!
|
| (I was trying to use it to help me learn German)
| wenc wrote:
| I gave GPT4 that prompt and as far as I can tell, the list
| is correct.
|
| GPT answer:
|
| It's difficult to determine the "200 most common" nouns for
| a particular gender in German since the frequency of word
| use can vary greatly depending on the context (e.g.,
| conversational, academic, technical). However, here's a
| list of 100 common masculine nouns, which use the definite
| article "der".
|
| der Mann (man), der Tag (day), der Freund (friend), der
| Vater (father), der Bruder (brother), der Sohn (son), der
| Monat (month), der Kurs (course), der Tisch (table), der
| Stuhl (chair), der Garten (garden), der Zug (train), der
| Platz (place), der Himmel (sky), der Wagen (car), der Brief
| (letter), der König (king), der Schuh (shoe), der Baum
| (tree), der Hund (dog), der Bär (bear), der Fisch (fish),
| der Vogel (bird), der Schüler (pupil), der Regen (rain),
| der Wind (wind), der Wald (forest), der Berg (mountain),
| der Fluss (river), der Junge (boy), der Kaffee (coffee),
| der Lehrer (teacher), der Arzt (doctor), der Student
| (student), der Polizist (policeman), der Künstler (artist),
| der Teller (plate), der Löffel (spoon), der Geldbeutel
| (wallet), der Laptop (laptop), der Schalter (switch), der
| Fernseher (television), der Schrank (cupboard), der Helm
| (helmet), der Hut (hat), der Schmuck (jewelry), der Ring
| (ring), der Sport (sport), der Schaden (damage), der Boden
| (floor), der Sand (sand), der Sturm (storm), der Preis
| (price), der Feind (enemy), der Held (hero), der Raum
| (room), der Morgen (morning), der Abend (evening), der
| Unfall (accident), der Vortrag (lecture), der Urlaub
| (vacation), der Ausflug (trip), der Hafen (harbor), der
| Bericht (report), der Name (name), der Bauer (farmer), der
| Rechner (calculator), der Traum (dream), der Anzug (suit),
| der Geschmack (taste), der Eindruck (impression), der Zweck
| (purpose), der Vertrag (contract), der Krieg (war), der
| Kunde (customer), der Arbeitgeber (employer), der
| Mitarbeiter (employee), der Kollege (colleague), der
| Bewohner (resident), der Fahrer (driver), der Gast (guest),
| der Kritiker (critic), der Profi (professional), der Sieger
| (winner), der Kandidat (candidate), der Beamte (official),
| der Insasse (inmate), der Zeuge (witness), der Beweis
| (proof), der Schatten (shadow), der Zweifel (doubt), der
| Trauer (grief), der Frieden (peace), der Nerv (nerve), der
| Horizont (horizon), der Gedanke (thought), der Lohn (wage),
| der Antrag (application), der Verlust (loss), der Betrag
| (amount),
| whats_a_quasar wrote:
| This isn't the same task that you described in the first
| comment. Did GPT4 include nouns that didn't use the article
| "der" in its output? Or did it fail to reply with the 200
| most common ones?
| LouisSayers wrote:
| Apologies, I was trying to give a little context for
| those who don't know about the three "the"s in German.
|
| The list was mostly correct, but yes it added nouns that
| were not "der" nouns into the list.
|
| It then attempted to correct the list and failed at
| correcting it.
|
| In terms of output, it also didn't want to give a list of
| 200, but I did manage to get a list of around 100 back.
| dmazzoni wrote:
| I wonder if it'd help to ask it to write the noun with
| the article. I just tried it now - I asked it to list the
| top 100 German "der" nouns, then asked it to repeat the
| list but with the article. That made it obvious which
| ones were wrong!
|
| It was unwilling to write "Der Jahr" when "Das Jahr" is
| correct.
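|
| A minimal version of that check in code (the function name
| and prompt format are mine; the "das Jahr" example is from
| the attempt above):
|
|     def find_wrong_entries(relisted):
|         # `relisted` is the model's second output, one
|         # "article noun" per line; anything not starting with
|         # "der" was originally mis-gendered
|         entries = [line.strip() for line in relisted.splitlines()]
|         return [e for e in entries if e and not e.startswith("der ")]
|
|     print(find_wrong_entries("der Tag\ndas Jahr\nder Hund"))
|     # -> ['das Jahr']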
| nielsole wrote:
| With the following prompt I got 50 consecutive words with
| correct articles:
|
| > Erstelle eine Liste mit 10 Nomen, die den Artikel "der"
| haben. ("Create a list of 10 nouns that take the article
| 'der'.")
|
| Maybe "reliably" is doing a lot of the heavy lifting?
| LouisSayers wrote:
| It's not that it did a bad job, more that you couldn't totally
| rely on what it produced, nor rely on it having corrected the
| list without first having it reanalyse the "corrected" list.
|
| It's still extremely helpful; I just found it strange that
| something that has been fed millions of documents would get
| such a seemingly simple task wrong - especially AFTER it had
| analysed its own results and found some of the articles to be
| incorrect.
| afro88 wrote:
| I've found you still can't rely on LLMs to get anything 100%
| correct without human oversight, unless you spend a lot of
| time prompt engineering and testing. Even then you might not
| get as close to 100% as you'd like.
|
| But as you say, they are still extremely helpful anyway.
| akira2501 wrote:
| > I would have thought this would be a super easy task for it
|
| Why did you think that? This isn't meant to be critical, but
| I'm honestly curious, what led you to believe that technology
| underlying GPT-4 made it a good fit for this or any particular
| task?
| constantcrying wrote:
| >Why did you think that?
|
| It is a purely statistical model. It does not know any "rules"
| about the language (it doesn't know any language at all);
| rather, it is fed data and derives from it sophisticated
| probabilistic relationships between words.
|
| It shouldn't have much of a problem generating the correct
| grammatical formulations, as it has been extensively trained
| on them. More so than any other technology, neural networks
| are suited to exactly this kind of task, where no hard rules
| exist (as a German I couldn't tell you why "rain" is masculine
| but "machine" is feminine) but lots of data correctly
| implements the rule.
| londons_explore wrote:
| I too would have thought it a super easy task.
|
| It has probably seen the correct noun-article pairs millions
| of times in the training data - and asking it to produce the
| correct article for a bunch of words is really just "tell me
| which case you saw most during training", which is something
| LLMs are really good at.
| wudangmonk wrote:
| I think this task is beyond the capabilities of GPT4; this is
| simply asking too much of it. For other languages I'm sure it
| has no problems.
|
| https://faculty.georgetown.edu/jod/texts/twain.german.html
| [deleted]
| freehorse wrote:
| In Greek it is also really bad (and makes rather obvious
| mistakes); in French it seems much better but makes some very
| obvious mistakes too. To make it interesting, I emphasise
| that I want nouns that refer to objects only (otherwise it
| just spits out profession names and the like, which is not
| interesting).
|
| Also, tbh, with all the hype around LLMs one would think that
| such a task would not be such a challenge.
| dontupvoteme wrote:
| >Also tbh, with all the hype of LLMs one would think that
| such a task would not be such a challenge.
|
| The strange/sad thing is that despite being "large language
| models" they're often hypermyopic on English..
|
| I've done some measurements comparing generation between
| various languages in the prompt and no matter what I do
| half the time i cannot get them to not include english text
| or comments in code unless the request is made in japanese,
| chinese, or a similar very-different-language.
| dontupvoteme wrote:
| Does it work if you ask in German? I found it's better if you
| tell it via the system prompt that it's a language professor
| (using your target language) than if you just use English for
| tasks involving a foreign language. The power of the LARP.
|
| (I use a normal machine translation API for a lot of this,
| but you can also ask it in another context window to
| translate the text to other languages. I use this approach
| for e.g. Sindarin.)
| LouisSayers wrote:
| I didn't do that, but will give it another go. Thanks for the
| suggestion!
|
| Otherwise my current strategy is to put it in an analysis
| loop until it deems the list to be correct.
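|
| For reference, a minimal sketch of that kind of loop, using
| the openai package's 2023-era ChatCompletion API; the prompts
| and the round cap are my own choices, and (per the thread)
| note that the checker is the same unreliable model:
|
|     import openai
|
|     def ask(prompt):
|         resp = openai.ChatCompletion.create(
|             model="gpt-4",
|             messages=[{"role": "user", "content": prompt}],
|         )
|         return resp["choices"][0]["message"]["content"]
|
|     def der_noun_list(n=100, max_rounds=5):
|         nouns = ask(f"Liste {n} deutsche Nomen mit dem "
|                     "Artikel 'der' auf.")
|         for _ in range(max_rounds):
|             verdict = ask("Pruefe diese Liste. Haben alle Nomen "
|                           "den Artikel 'der'? Antworte nur 'OK' "
|                           "oder gib eine korrigierte Liste aus:\n"
|                           + nouns)
|             if verdict.strip() == "OK":
|                 break        # the model deems its own list correct
|             nouns = verdict  # loop again on the "corrected" list
|         return nouns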
| Spivak wrote:
| I think this is a good peer review with an overly defensive
| tone. You can tell the authors really want this to be false,
| and it clouds their conclusions - particularly because they're
| treating absence of evidence as evidence of absence. Their
| null is that GPT4 can ace MIT, and they haven't provided any
| evidence to reject it.
|
| The real conclusion is "we think the paper's results might be
| weaker if repeated", and that's a good result on its own. But
| it could also turn out that, done again with the issues
| addressed, it would still pass.
| mquander wrote:
| I think it's appropriate to be extremely critical. The paper is
| basically useless. The thing that they actually measured is
| "can GPT-4, when given a 'question' with lots of additional
| information and many tries with small permutations to produce
| an 'answer', at some point produce an 'answer' that GPT-4 will
| then claim is a 5 out of 5 answer, on a dataset of extremely
| messy 'questions' and 'answers' from MIT coursework."
|
| That's not an interesting thing to measure. The paper talks
| about it in terms that make it sound like it's a close proxy
| for whether GPT-4 "knows" how to do things in MIT coursework,
| by writing misleadingly about "fulfilling the graduation
| requirements" and having a "perfect solve rate." But in fact
| it's totally different. The result is that a bunch of people
| hear about this paper and get fooled into thinking that there
| is new interesting evidence about GPT-4's capabilities, unless
| they manage to read closely enough to see what actually
| happened.
|
| It's not a matter of whether the results would get weaker if
| repeated, it's a matter of the results being totally
| disconnected from any useful real-world information about what
| GPT-4 can do, or how it can do it.
| constantcrying wrote:
| >I think this is a good peer review with a too defensive tone.
|
| Not surprising when you have undergrads calling a seemingly
| reputable paper borderline fraud. I certainly cannot blame
| them for the tone.
| raunakchowdhuri wrote:
| You're right that our post doesn't quite show that GPT4 cannot
| perform well on MIT curriculum. We try to be up front about
| this in the conclusion:
|
| > Our critiques are largely of the methodology and rigor of
| this study, not about its content. We make no claim about the
| ability of large language models to actually solve MIT
| curricula, only that this paper fails to prove it in a
| scientifically rigorous way. Though, as MIT undergraduates
| ourselves, we can at the very least say that the test set that
| we accessed does not, at least in our experience, accurately
| represent the breadth and depth of understanding required to
| complete an EECS degree at MIT.
| shiomiru wrote:
| > Our critiques are largely of the methodology and rigor of
| this study, not about its content. We make no claim about the
| ability of large language models to actually solve MIT
| curricula, only that this paper fails to prove it in a
| scientifically rigorous way. Though, as MIT undergraduates
| ourselves, we can at the very least say that the test set that
| we accessed does not, at least in our experience, accurately
| represent the breadth and depth of understanding required to
| complete an EECS degree at MIT.
|
| Directly from TFA. This does make the title somewhat
| click-baity, but they definitely don't claim the opposite.
| quadrifoliate wrote:
| > peer review...Their null is GPT4 can ace MIT and they haven't
| provided any evidence to reject.
|
| I haven't been in research for a while, but I don't think
| that's how peer review works. You don't have to assume the
| paper's claims (especially ones as novel as this one's) as
| the null hypothesis and then provide a compelling refutation.
|
| The detection of sloppy question framing, of answer feeding
| via the few-shot learning examples, and of the problems with
| grading via GPT-4 itself reasonably shows that there are
| serious flaws in multiple parts of the experiments described
| in the paper.
|
| > The real conclusion is "we think the paper's results might be
| weaker if repeated" and that's a good result on its own.
|
| No, they don't know enough to say that. The paper's results
| might be better with better experimentation! Or they might be
| totally false.
|
| The conclusion they provided is accurate precisely because it
| focuses on the _methods_ and not the conclusions, like this:
|
| > One particularly worrying trend is the technique of
| evaluating a model's accuracy using a language-based model like
| GPT-4. While a useful tool, its conclusions should never be
| overstated or treated as ground truth.
|
| ...and this:
|
| > Additionally, it is extremely important to reevaluate every
| data point and perform basic sanity checks before using data at
| all, whether for training, inference, benchmarking, or
| something else. Given the small size of the dataset in
| question, a simple manual validation would have been easily
| within the scope of the work.
| stefan_ wrote:
| No, it turns out GPT-4 can not answer impossible questions.
| Maybe it's you that is _too defensive_ and _wants this to be
| false_?
| freehorse wrote:
| This is one of the most embarrassing reviews I have ever read
| (for the paper being reviewed). AI research urgently needs
| good practices to adhere to, but as things stand it is really
| hard to take many of its results seriously, due to the opacity
| that characterises many steps of the process. Serious mistakes
| and bad practices like these certainly do not help the field
| achieve any credibility.
| psyklic wrote:
| Even research from OpenAI has attempted to use GPT-4 as quasi-
| ground truth (as a replacement for human evaluators). For
| example, their method in the recent paper "Language models can
| explain neurons in language models" [1] is:
|
| 1. Using GPT-4, generate a text explanation of a neuron's
| activations on sample input.
|
| 2. Using GPT-4 again, use the text explanation to simulate the
| neuron on some new text input.
|
| 3. Compare the result to the actual neuron's activations on the
| new text input.
|
| They justify this by saying human contractors do equally poorly
| at coming up with text descriptions. However, the procedure is
| such a black box that it is difficult to make scientific
| conclusions from the results.
|
| [1] https://openai.com/research/language-models-can-explain-
| neur...
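|
| Schematically, a minimal sketch of that three-step loop (all
| model calls are stand-ins, not OpenAI's code; the correlation
| scoring in step 3 matches the description in the reply below):
|
|     import numpy as np
|
|     def explain_with_gpt4(tokens, activations):
|         # Step 1 (stub): GPT-4 writes a text explanation of
|         # when the subject model's neuron fires on sample text.
|         return "fires on words related to water"
|
|     def simulate_with_gpt4(explanation, new_tokens):
|         # Step 2 (stub): GPT-4, given only the explanation,
|         # predicts the neuron's activation on each new token.
|         rng = np.random.default_rng(0)
|         return rng.random(len(new_tokens))
|
|     def score(explanation, new_tokens, true_activations):
|         # Step 3: correlate simulated vs. actual activations.
|         simulated = simulate_with_gpt4(explanation, new_tokens)
|         return np.corrcoef(simulated, true_activations)[0, 1]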
| malisper wrote:
| > Even research from OpenAI has attempted to use GPT-4 as
| quasi-ground truth (as a replacement for human evaluators).
|
| The way OpenAI used GPT-4 is fundamentally different from how
| GPT-4 was used to score the answers to the MIT exam. In
| OpenAI's case, they had GPT-4 generate an explanation of when
| a neuron in GPT-3 would fire. They then gave that explanation
| back to GPT-4 and had GPT-4 predict when the specific neuron
| in GPT-3 would fire. The scoring was done by computing the
| correlation between when GPT-4 predicted the neuron would
| fire and when it actually fired. The scoring _was not_ done
| by GPT-4, as it was for the MIT exam.
|
| In addition, OpenAI did have human evaluators score the
| explanations as well, to make sure they were human
| interpretable [0].
|
| [0] https://openaipublic.blob.core.windows.net/neuron-
| explainer/...
| tehsauce wrote:
| Correct, except they did this for GPT-2, not GPT-3.
| EGreg wrote:
| Can't we just have GPT-4 make the scientific conclusions from
| these results? /s
| varjag wrote:
| tl;dr the dataset was nonsensical and the researchers used GPT-4
| to rate its own answers in the tests.
| [deleted]
___________________________________________________________________
(page generated 2023-06-17 23:00 UTC)