[HN Gopher] Benchmarking GPT-4 Turbo - A Cautionary Tale
       ___________________________________________________________________
        
       Benchmarking GPT-4 Turbo - A Cautionary Tale
        
       Author : ja3k
       Score  : 202 points
       Date   : 2023-11-09 13:00 UTC (10 hours ago)
        
 (HTM) web link (blog.mentat.ai)
 (TXT) w3m dump (blog.mentat.ai)
        
       | Kuinox wrote:
        | tldr: GPT-4 Turbo has a worse score on the synthetic benchmark on
        | the first attempt; they speculate it's a smaller model and isn't
        | able to memorize the responses as well.
        
       | sunpazed wrote:
       | This reflects my anecdotal findings in a very narrow domain:
       | GPT-4 consistently resolves complex SQL from natural language,
       | whereas GPT-4-Turbo is hit-and-miss, and similar to 3.5-Turbo in
       | performance.
        
         | Semaphor wrote:
         | But that's not what the article is saying at all?
        
       | boxed wrote:
       | > We designed a test for this theory: we reran the benchmarks
       | without showing the models the instructions to each exercise.
       | Instead, we just told them that they were Exercism exercises, and
       | gave them the exercise names and function stubs.
       | 
        | This summarizes all my skepticism against the AI field. Pretty
        | clear that they aren't solving the problems; they have them
        | memorized.
        
         | mavhc wrote:
         | LLMs are lossy compression
        
           | stoicfungi wrote:
            | All models are, including the human brain.
        
             | ChatGTP wrote:
             | The human brain is a model?
        
               | rocgf wrote:
               | It models the world around it, so it's fairly similar to
               | what GPT does, especially with the newly-added image
               | capabilities and stuff.
        
               | FrustratedMonky wrote:
               | Consciousness itself is a model of the world.
               | 
               | Our experience of the world is a model executing.
               | 
                | Comparing the latest neuroscience to the latest neural
                | networks, they look and behave very similarly.
        
         | og_kalu wrote:
         | "They memorized all the problems" is not what was found here
         | and still a wrong overcorrection.
        
           | dartos wrote:
           | "Gpt-4 has more problems memorized than gpt-4 turbo" was
           | exactly what was found here.
           | 
           | That doesn't mean it's only able to solve problems in its
           | training set (tho it's much better at that obviously.)
        
           | boxed wrote:
            | If you are shown only the title of a coding problem and the
            | site name where it's from, and you manage to solve it, you are
            | showing that you either cheated or knew the answer.
        
             | og_kalu wrote:
             | I mean sure, it memorized some of the answers. I'm not
             | denying that. Clearly, it didn't memorize all of them.
        
               | boxed wrote:
                | When people say "oh look how amazing, it can solve
                | programming problems!" when in fact the models CHEAT, that
                | is an enormous problem.
                | 
                | For cases where finding the answer is the goal, it's
                | perfectly fine, but it's not fine for claims that it can
                | code. There's a huge difference.
        
               | FeepingCreature wrote:
               | I've seen it code on completely novel tasks, so I'm not
               | sure what you're suggesting here. The model can
               | unquestionably code.
        
               | og_kalu wrote:
               | Okay... Funny how forcing it to not CHEAT did not
               | increase apparent ability.
               | 
                | "It can code" and "it has memorized some coding questions"
                | are not mutually exclusive.
        
               | boxed wrote:
               | > Okay... Funny how forcing it to not CHEAT did not
               | increase apparent ability.
               | 
                | The article did the opposite. It forced the models to
                | cheat to solve the problems. Which they did happily. They
                | should have stated "there is no actual problem to solve
                | here, you must supply a problem for me to solve".
               | 
               | > It can code and it has memorized some coding questions
               | are not mutually exclusive
               | 
                | This I will give you. Many humans try to cheat at basic
                | math because they are lazy; so does this model. Maybe
                | that's a sign of intelligence :P
        
               | raincole wrote:
               | Me: What's 6x6?
               | 
               | You: 36
               | 
               | Me: You cheated! You just cited the answer you memorized!
               | You should have started from addition.
               | 
               | You: ...okay? 6+6=12, 12+6=18, 18+...
               | 
               | Me: You cheated again! You just have 6+6=12 memorized!
               | You should make the rule of addition out of Peano axioms.
               | 
               | You: ...you're being annoying, but okay? First axiom, we
               | define 0 as...
               | 
               | Me: You cheated _again_! You memorized Peano Axioms!
               | Jesus Christ, is there any intelligent creature left?
        
               | boxed wrote:
               | But in this case it's not like that at all. They only saw
               | the NAME of the problem. Like if I said "Page 23 of
               | Mathbook Y, problem number 3". Which happens to be 6x6.
        
               | raincole wrote:
                | Me: I was in such a blah blah situation... does article 3
                | of the Digital Government Act apply here?
                | 
                | My lawyer: Hmm, article 3 says--
               | 
               | Me: I knew it! Lawyers are not intelligent...
        
               | furyofantares wrote:
                | It said they gave the exercise name, which doesn't sound
                | like just the exercise number but is probably mildly
                | descriptive -- and they also gave it function stubs.
        
               | stnmtn wrote:
               | If I gave you a programming problem and all I told you
               | was that the problem name was Traveling Salesman, you
               | might be able to solve it based on that.
               | 
                | If not that, then if I just said "fizzbuzz" to you, I'm
                | sure you would be able to give the solution without me
                | needing to give any other description of the problem.
        
               | boxed wrote:
               | Again, because of memorization, not being able to code.
        
               | qup wrote:
               | I know this is deep down a bad comment thread, but I
               | thought I'd chime in here.
               | 
                | I have been writing function names and test names, and
                | then telling gpt to fill in the test, which it usually
                | does how I want (maybe with errors, but it tests the
                | correct thing), and then I tell it to fill out the
                | answers.
                | 
                | This is in a thing I'm building that's never been built,
                | with names that I made up (but that describe the
                | functionality well).
                | 
                | It cannot have this memorized; I just invented it
                | myself.
        
               | jsight wrote:
               | TBH, people underestimate how much of coding is just
               | memorization. I'm guessing those of us with bad memories
               | understand this more than the ones with good memories. :)
               | 
               | I can't remember how many times I've googled, "how do I
               | create a directory in Python?". Now bard often generates
               | an inline answer for me.
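                | 
                | (For reference, the answer is a one-liner; a minimal
                | sketch, assuming Python 3.2+ where os.makedirs gained the
                | exist_ok flag; the path is just an example:)
                | 
                | ```
                | import os
                | 
                | # Create the directory, parents included; no error if it
                | # already exists.
                | os.makedirs("path/to/new/dir", exist_ok=True)
                | ```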
        
               | hanselot wrote:
                | Though this is exactly what happened. The initial test
                | was run on a model that "cheated" (aka has memorized the
                | answers). The second test was run on a model that didn't
                | "cheat" as much, yet still scored only 2% lower. So the
                | question is not really resolved: how much did the first
                | model cheat, and how much did the second? If the second
                | model "cheats" less, then it wins.
               | 
               | Also, I don't understand your obsession with the word
               | cheating. If you have solved a problem before on a
               | different website and solve it again, did you cheat? Or
               | did you just use your brain to store the solution for
               | later?
        
               | boxed wrote:
               | > Also, I don't understand your obsession with the word
               | cheating.
               | 
               | It's all about the rule set yea. Since the rule set is
               | not defined, technically nothing is cheating. I just
               | interpret the rule set as "can it code?" and for this
               | rule set, it seems to me that it's cheating.
        
               | boxed wrote:
               | > How much did the first model cheat, and how much did
               | the second? If the second model "cheats" less, then it
               | wins.
               | 
               | They both cheated 100%. Because they both never saw the
               | problem. AT ALL. They just saw the title and the name of
               | the website.
        
               | raincole wrote:
                | Almost 2024 and people still can't accept that LLMs can
                | code...
        
               | danielmarkbruce wrote:
               | Of course they can't. And self-driving cars also don't
               | exist, it's like 10 years away at best.
        
               | notnaut wrote:
                | It can generate never-before-seen strings of
                | comprehensible language. It can react to the inherent
                | logic embedded in words and text and provide a brute-
                | forced version of what a human could produce. That it can
                | "solve" a problem only through "cheating" is an
                | anthropomorphism that betrays the magic that is evident
                | to anyone who has used these things.
        
             | Terretta wrote:
              | On the contrary, it could mean you were, to some percentage
              | of success, able to guess what the problem is, and then, to
              | some multiplier percentage of success, solve it.
             | 
             | The key is, can you guess the problem from the title and
             | the function name? I'd argue, sure, at least half the
             | time?, why not...
        
         | DecayingOrganic wrote:
         | Memorization often gets a bad rap as the underachiever's
         | shortcut. However, it's a fundamental component of any learning
         | process! Our ability to reason, solve problems, and innovate is
         | all built upon a foundation of memorized information. In fact,
         | it's precisely the reason humans have thrived for so long; we
         | were able to memorize and pass down knowledge culturally long
         | before the written word, not because we were 100 times smarter
         | than our nearest cousins. Without memorization, be it in our
         | brains or AI algorithms, there's no foundation to build upon
         | for higher reasoning.
        
         | viraptor wrote:
          | It's hard for me to decide without seeing the data. Even if you
         | don't know the exact exercise, seeing the title and the
         | function name/parameters is often enough for me to guess what
         | the challenge is. I checked the public questions on exercism
         | and almost all of those (that I spot checked) that contained
         | the function name were extremely obvious. Knowing it's a
         | programming challenge would also improve my guessing chances.
         | 
         | For example the function stubs I can find are
         | "value_of_card(<card>)" in exercise "Black Jack", or
         | "generate_seat_letters(<number>)" in exercise "Plane Tickets".
         | I think I could guess those without seeing the rest of the
         | question.
        
         | GaggiX wrote:
          | So how can it solve novel problems? The internet does not have
          | all combinations of every possible task with any random
          | programming language, library, or constraints. It can even solve
          | problems with non-existent programming languages and libraries
          | if you describe them; if that's just memorization, then I don't
          | know what isn't.
        
         | m3kw9 wrote:
          | From a black-box point of view, and from one angle, GPT is a
          | web filter that will try to find you the exact thing you are
          | looking for, but from memory. With Google, you have to distill
          | all the info into what you need yourself.
        
         | ren_engineer wrote:
          | Even if it's not true AI, or even an architecture with the
          | potential to become AI, LLMs are already good enough to provide
          | real-world value. Obviously "super autocomplete" isn't as sexy
          | as true AI, but it's still very useful.
        
         | drcode wrote:
          | You can call it whatever you want; all I know is I used to
          | write programs in lines of code, then in blocks of code at a
          | time, spit out by LLMs.
         | 
         | Using GPT-4 Turbo yesterday, I feel like I'm moving to pages of
         | code at a time now.
         | 
         | Taking the ideas in my head and turning them into reality is so
         | easy now
        
         | blovescoffee wrote:
         | Ok but you understand there's a body of literature that shows
         | that LLMs don't "just" memorize
        
         | gloosx wrote:
          | +100 to that. My biggest scepticism is people actually creating
          | a new problem while thinking they are solving one. Don't get me
          | wrong, translating natural-language ideas into code is fun and
          | all, but the truth is it is also code, just given to the
          | machine in an ambiguous language format.
          | 
          | When did natural language become better for expressing
          | development ideas than code? I know - when you don't know how
          | to code in the first place. Then you have to bet on all of the
          | ambiguities, cultural and metaphysical, which words carry, in
          | order to hack your thing together instead of expressing
          | yourself directly and explicitly.
          | 
          | Finally, what is beautiful about the strict code format we are
          | so used to is that it is truly the fastest and shortest path to
          | getting your thing done, provided you possess the knowledge
          | needed.
        
           | ericrallen wrote:
           | That sounds a lot like gatekeeping.
           | 
           | These tools will empower folks who aren't developers to build
           | stuff and maybe learn a bit more about how programming works.
           | 
           | They will enable folks who have ideas, but can't express
           | them, to actually be able to create what they are imagining.
           | 
           | That's awesome.
           | 
           | Code isn't beautiful (except for a few rare exceptions).
           | Creating something with code is.
        
           | doug_durham wrote:
          | Natural language isn't superior to computer languages. NL
          | allows you to describe a software concept in a way that is
          | neutral with respect to computer language and framework. The
          | LLM generates the code. The real benefit is when you work
          | across languages and frameworks: it is difficult to keep all of
          | the details of all of the framework calls in your head all of
          | the time.
        
         | caesil wrote:
         | If that's your takeaway from this then you really missed the
         | point. The implication here is that gpt-4 to gpt-4-turbo
         | represents a leap away from memorization and toward better
         | reasoning with a more complete world model.
        
         | nathanfig wrote:
         | "memorize" implies they can only recite things verbatim and
         | that's ignoring the massive leap in being able to synthesize
         | disjoint "memories" in new ways.
        
       | mccrory wrote:
        | GPT-4 Turbo is still in preview; maybe wait until it is fully
        | released before judging?
        
         | msp26 wrote:
         | The point of a preview phase is to test the model in real world
         | use.
        
           | seanhunter wrote:
            | This isn't really real-world use, any more than putting these
            | same problems to people as a whiteboard coding exercise in an
            | interview is real-world coding. Yet a lot of people seem to
            | be generalising from this tiny sample to all manner of
            | overarching statements about the performance of the model in
            | general: "it's faster but dumber", "this proves it only
            | memorises", etc.
        
       | KaoruAoiShiho wrote:
       | Why are all the comments here so negative... this is a good
       | thing, turbo has less memorization but keeps the same reasoning
       | ability. That's excellent and a relief.
        
         | boxed wrote:
          | Or the programming quiz problems it tried to "solve" were in
          | fact posted elsewhere too, so it cheated on the ones it got
          | right as well.
        
         | CamperBob2 wrote:
         | People here spent a lot of time (and money) in school learning
         | to do things that can now be automated. The whining is just
         | beginning.
        
       | anotherpaulg wrote:
       | Aider has had an Exercism benchmarking suite for quite some time.
       | 
       | Interestingly, my benchmark results of GPT 4 Turbo show an
       | opposite result: the new gpt-4-1106-preview did significantly
       | _better_ on the first try than the March and June models.
       | 
       | https://aider.chat/docs/benchmarks-1106.html
       | 
        | Aider benchmarks against the 133 Exercism python exercises, not
        | the js exercises that mentat's benchmark uses. So this is not an
        | apples-to-apples comparison, but there doesn't seem to be a
        | strong reason to expect qualitatively different results.
       | 
       | I also notice that the instructions prompt that mentat uses seems
       | to be _inspired by_ the aider benchmark? Glad to see others
       | adopting similar benchmarking approaches.
       | 
       | https://github.com/AbanteAI/mentat/blob/main/tests/benchmark...
       | 
       | https://github.com/paul-gauthier/aider/blob/main/benchmark/p...
       | 
       | Edit: Not sure if the mentat authors are in this thread? After
       | looking around a bit, there seems to be a bunch of aider code in
       | your repo. Some attribution would be appreciated. It might even
       | be required under aider's Apache 2.0 license?
        
         | naiv wrote:
         | I am also noticing a massive improvement over the old model
        
         | hanselot wrote:
          | Isn't it a good thing that, of the benchmarks they ran, the
          | newer model has fewer of the answers memorized (aka, it's
          | parroting less)?
          | 
          | Wouldn't this actually be exactly proof that the model has
          | improved over its predecessor by having to solve the problems
          | itself rather than relying on memory?
          | 
          | What use is a model that memorizes the answers to all the
          | benchmarks (see the 7b models on the open llm leaderboard for
          | more info on that)?
        
         | derwiki wrote:
         | I've been using the new model with Aider since it was released,
          | and my anecdata agrees--the "edits applied successfully"
          | failure rate is much lower than classic gpt4.
         | 
         | Also THANK YOU for Aider! I talk it up to all my programmer
         | friends; it really feels like a glimpse into the future of
         | coding.
        
         | strich wrote:
          | Does aider work with C# at all?
        
           | anotherpaulg wrote:
           | Yes!
           | 
           | Thanks for asking. I've been meaning to address these kinds
           | of questions in the aider FAQ [0]. Here's the entry I just
           | added:
           | 
           | Aider supports pretty much all the popular coding languages.
           | This is partly because GPT-4 is fluent in most mainstream
           | languages, and familiar with popular libraries, packages and
           | frameworks.
           | 
           | In fact, coding with aider is sometimes the most magical when
           | you're working in a language that you are less familiar with.
           | GPT often knows the language better than you, and can
           | generate all the boilerplate to get to the heart of your
           | problem. GPT will often solve your problem in an elegant way
           | using a library or package that you weren't even aware of.
           | 
           | Aider uses tree-sitter to do code analysis and help GPT
           | navigate larger code bases by producing a repository map [1].
           | 
           | Aider can currently produce repository maps for most
           | mainstream languages, listed below. But aider should work
           | quite well for other languages, even without repo map
            | support.
            | 
            |     - C
            |     - C#
            |     - C++
            |     - Emacs Lisp
            |     - Elixir
            |     - Elm
            |     - Go
            |     - Java
            |     - Javascript
            |     - OCaml
            |     - PHP
            |     - Python
            |     - QL
            |     - Ruby
            |     - Rust
            |     - Typescript
           | 
           | [0] https://aider.chat/docs/faq.html#what-code-languages-
           | does-ai...
           | 
           | [1] https://aider.chat/docs/repomap.html
        
             | lemming wrote:
             | I was actually wondering this myself yesterday. So it's not
             | possible to plug a different tree-sitter implementation in
             | for a niche language?
        
               | anotherpaulg wrote:
               | It should be possible, but not currently. Aider would
               | need a bit more configurability to be able to load up
               | arbitrary tree-sitter language implementations at
               | runtime.
               | 
               | There's an open issue you might want to follow for
               | updates:
               | 
               | https://github.com/paul-gauthier/aider/issues/321
        
             | epiccoleman wrote:
             | I've _just_ started playing with aider this week, and I
              | find it extremely fun and exciting. But I will say that
              | I've had middling results with an Elixir / Phoenix app. I
             | don't think this has anything to do with aider - rather, I
             | think that the GPT models haven't quite internalized the
             | new approaches in Phoenix 1.7, since up until Turbo their
             | training data was fairly old and probably still contains
             | more pre 1.7 Phoenix examples than post 1.7.
             | 
             | In spite of these frustrations, I have had some genuinely
             | amazing moments coding with GPT-4 lately though. I upgraded
             | to ChatGPT plus lately and it's just mindblowing how
             | helpful it can be in the right contexts. I'm hoping that as
             | I get better with aider I might just drop the ChatGPT sub
             | and stick to API usage.
             | 
             | I totally understand the skepticism many have, because this
             | stuff is still a bit finicky - but I'm overwhelmed by a
             | sense of how fucking _cool_ this stuff is quite often.
        
         | biobootloader wrote:
         | Hey Paul, I'm a Mentat author.
         | 
         | > I also notice that the instructions prompt that mentat uses
         | seems to be inspired by the aider benchmark? Glad to see others
         | adopting similar benchmarking approaches.
         | 
         | We were inspired by you to use Exercism as a benchmark, thank
         | you! We will add attribution for that. We switched our original
          | instruction prompts for that benchmark to be similar to Aider's
         | to allow for fair comparison.
         | 
         | > After looking around a bit, there seems to be a bunch of
         | aider code in your repo. Some attribution would be appreciated.
         | 
         | We have an unused implementation of your output response format
         | (https://github.com/AbanteAI/mentat/blob/main/mentat/parsers/..
         | .), but I don't know what else you are seeing? We implemented
         | that to compare with our response formats and didn't find much
         | difference in performance.
        
           | anotherpaulg wrote:
           | I didn't spend much time looking, but your benchmark
           | prompting inspired me to search your repo for "aider". The
           | results were 3 PRs where aider was mentioned in the
           | conversations [0].
           | 
           | The "code map" PR in particular mentions being "inspired by
           | aider", links to aider and seems to include a bunch of code
           | from aider's old ctags based "repo map" implementation. This
           | isn't an insignificant component of an AI coding tool.
           | 
           | Aider is open source and I try and share my learnings as I'm
           | building it. So it's great when other projects get
           | inspiration from aider! But it is polite to provide
           | attribution for such inspiration, especially if you crib from
           | code with an attribution license.
           | 
           | [0] https://github.com/search?q=repo%3AAbanteAI%2Fmentat+aide
           | r&t...
        
         | ja3k wrote:
         | Sorry about that. We updated the blog with attribution and put
         | an attributing comment in our code base where we use your
         | benchmarking prompts. We'll probably delete our implementation
         | of your response format later today since we just had it for
         | benchmarking.
        
       | mufasachan wrote:
       | > Although the author OCR'ed the SAT questions and believes that
       | they weren't in the training data
       | 
        | I agree that the author of the tweet rather underestimates the
        | potential portion of OCR'ed content in OpenAI's training data. In
        | late August, Nougat[1], an OCR model, was released by Meta. Its
        | performance is wild and the model is open source.
        | 
        | I find it hard to believe that OpenAI does not spend effort on
        | getting more training data from OCR'ed content. I also find it
        | hard to believe that OpenAI would wait for a Meta paper to have a
        | performant internal OCR model.
       | 
       | [1]: https://arxiv.org/abs/2308.13418
        
       | msp26 wrote:
       | I'm interested in more testing on the context side of things.
       | 
        | For my NLP pipelines, I batch n articles together to process
        | (extract fields from) in one prompt; the final output is
        | something like {"1":[{}], "2": [{},{}], ...}. Compute-wise it's
        | inefficient, but OpenAI charges by the token so it doesn't
        | matter. It's very reliable on gpt-4 8k.
       | 
       | I was also pretty happy with the results on 4-turbo initially but
       | it seems that once you go past 30k-ish tokens in context (needs
       | way more testing), it shits itself. The indexes don't match
       | anymore and n_final_output is different from n_articles.
       | 
       | Still, great model and even if the limits are lower in practice I
       | suspect I'll get good use out of it.
       | 
       | Edit: With better prompting, it feels stable at n=42, ~42000
       | prompt tokens.
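        | 
        | Roughly, the batching pattern looks like this (a minimal sketch,
        | assuming the pre-1.0 `openai` Python client; the prompt wording
        | and extracted fields are illustrative, not the actual pipeline):
        | 
        | ```
        | import json
        | import openai
        | 
        | def extract_batch(articles, model="gpt-4"):
        |     # Number each article so the model can key its output by index.
        |     numbered = "\n\n".join(f"[{i+1}] {a}" for i, a in enumerate(articles))
        |     prompt = (
        |         "For each numbered article below, extract the fields "
        |         '{"title": str, "topic": str}. Reply with one JSON object '
        |         'keyed by article number, e.g. {"1": [{}], "2": [{}]}.'
        |         "\n\n" + numbered
        |     )
        |     resp = openai.ChatCompletion.create(
        |         model=model,
        |         temperature=0,
        |         messages=[{"role": "user", "content": prompt}],
        |     )
        |     result = json.loads(resp["choices"][0]["message"]["content"])
        |     # The long-context failure mode described above: the indexes
        |     # stop matching, so verify n_final_output == n_articles.
        |     assert len(result) == len(articles), "index mismatch: retry or split"
        |     return result
        | ```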
        
         | phillipcarter wrote:
         | Interesting. I was skeptical about some of their claims
         | regarding longer context, since it's been my experience that
         | these models just get lost after enough of it.
        
           | msp26 wrote:
           | Yeah, degraded performance on long contexts has been observed
           | in plenty of other models [https://arxiv.org/abs/2307.03172]
           | so I was cautious too. Unfortunately I don't have access to
           | 4-32k. I would have liked to test that out too.
        
       | FeepingCreature wrote:
       | I wonder how often a human could guess the exercise based on just
       | the function stub.
        
         | ConnorMooneyhan wrote:
         | yeah, some of the exercises are like the following:
         | 
          | ```
          | function helloWorld() {
          |     return "";
          | }
          | 
          | helloWorld()
          | ```
         | 
         | but those sorts of obvious examples are mostly in the beginner
         | exercises, so I wonder what the distribution of the correct
         | answers was. If it was guessing based on function stubs, the
         | prediction would be that correct answers would be clustered
         | around the beginner exercises, and that as the exercises
         | advanced in difficulty, there were fewer correct answers.
        
       | og_kalu wrote:
        | I think it's interesting that forcing models out of memorization
        | doesn't always show a steep drop in ability.
       | 
       | I've definitely had instances where 4 memorized a common puzzle
       | and failed a subtly altered variant but then got the variant
       | after changing variable names or otherwise making it look
       | different from what it would have memorized.
        
       | Havoc wrote:
        | Is it known what exactly OpenAI does in the background when they
        | make these turbo editions?
        | 
        | It seems like they're sacrificing some quality for large gains in
        | speed and cost, but does anyone know more detail?
        
         | thorax wrote:
         | Don't think so, but there were some guesses on 3.5-turbo-- i.e.
         | training a much smaller model on quality questions/answers from
         | GPT-4. Same tactic worked again and again for other LLMs.
         | 
          | I'm definitely curious about the context window increase-- I'm
         | having a hard time telling if it's 'real' vs a fast specially
         | trained summarization prework step. That being said, it's been
         | doing a rather solid job not losing info in that context window
         | in my minor anecdotal use cases.
        
       | maciejgryka wrote:
       | I have similar conclusions so far. We have a custom data set
       | (basically visual Q&A about web apps) and `gpt4` gets roughly 90%
       | correct, while `gpt-4-1106-preview` only 86%. It's a little noisy
       | (I didn't yet check out the new seeds functionality), but roughly
       | consistent.
       | 
       | Since I created this dataset by hand, it can't really be
       | memorized. I'm sure there's _similar_ data in the training set,
       | but answering correctly still requires some reasoning-like
       | capabilities.
        
       | m3kw9 wrote:
        | Now do a programming task that requires more than 32k of context
        | and see who's "better". If you don't benchmark that, you cannot
        | get an overall picture. GitHub copilot, for example, could
        | benefit a lot from the increased context.
        
         | broast wrote:
          | Obviously it's a drawback, but the silver lining of the small
          | context window is that it forces me to decouple everything and
          | have very sensible and strict APIs, where I just write the docs
          | and it writes the code.
        
         | biobootloader wrote:
         | we are working on creating "real world" benchmarks that require
         | a lot of context, and will report when we have results!
        
       | m3kw9 wrote:
        | That's why it's called 4 Turbo, not "4.5". But the context
        | length is a bigger cargo space.
        
       | minihat wrote:
       | GPT-4 Turbo is dramatically worse at one task I often try:
       | 
       | Read the following passage from [new ML article]. Identify their
       | assumptions, and tell me which mathematical operations or
       | procedures they use depend upon these assumptions.
       | 
       | GPT-4: Usually correctly identifies the assumptions, and often
       | quotes the correct mathematics in its reply.
       | 
       | GPT-4 Turbo: Sometimes identifies the assumptions, and is
       | guaranteed to stop trying at that point and then give me a
       | Wikipedia-like summary about the assumptions rather than finish
       | the task. Further prompting will not improve its result.
        
         | thorax wrote:
         | Do you have a link or gist of an example run you tried? I'd be
         | curious to try something similar.
        
       | jphoward wrote:
        | The problem is the discussed results are comparing proportions of
        | a relatively small number - 67 questions. If you model this as a
        | binomial distribution, then the 62/67 that GPT4-turbo got gives a
        | 95% confidence interval for the 'true' performance of 83.4% to
        | 97.5%, i.e. it comfortably includes the proportion that GPT4
        | achieved (64/67 = 95.5%).
       | 
        | I think the evidence from these tests is not strong enough to
        | draw conclusions from.
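        | 
        | (A quick way to check these numbers; a minimal sketch assuming
        | the exact Clopper-Pearson interval, which statsmodels calls
        | method "beta":)
        | 
        | ```
        | from statsmodels.stats.proportion import proportion_confint
        | 
        | # 95% CI for GPT4-turbo's 62/67 pass rate.
        | lo, hi = proportion_confint(62, 67, alpha=0.05, method="beta")
        | print(f"62/67: 95% CI [{lo:.1%}, {hi:.1%}]")  # roughly 83.4% to 97.5%
        | 
        | # GPT4's 64/67 = 95.5% falls comfortably inside that interval.
        | print(f"64/67 = {64/67:.1%}")
        | ```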
        
         | epups wrote:
         | Yes. I see people make this mistake time and again when
         | evaluating LLMs. For a proper comparison, it's not enough to
          | simply throw fewer than a hundred questions at it and point to
          | a single-digit difference. Not to mention that LLMs have some
         | inherent randomness, so even if you passed the exact same tasks
         | to the same model you would expect some variance.
         | 
          | I see a lot of room for improvement in how we apply statistics
          | to understanding LLM performance.
        
         | Racing0461 wrote:
         | > I think the evidence from these tests are not strong enough
         | to draw conclusions from.
         | 
         | I've used gpt4 turbo for some coding problems yesterday. It was
         | worse. That's enough to draw conclusions for me.
        
         | Maxion wrote:
         | I'm not surprised, most people can't even tell the median from
         | the mean.
        
       | iJohnDoe wrote:
        | A slightly off-topic question. When people are talking about
       | costs with GPT, like the following link. Does the cost concern
       | only apply to the API? If you're using the WebUI and have a Plus
       | account, is it always just the flat $20 amount?
       | 
       | https://news.ycombinator.com/item?id=38193978
        
         | tosh wrote:
         | usually, yes (either cost of the API or cost to serve for
         | OpenAI)
        
       | geraltofrivia wrote:
       | In my day job we use GPT4 quite a bit and we shifted to GPT4
       | Turbo today. We got a 2-5% performance increase, and quite a bit
       | of speed increase as well.
       | 
        | Not to say that the parent post is incorrect, of course. Only
        | that it's not as cut and dried as "GPT4 Turbo is distilled (read:
        | watered down) GPT4".
        
         | hu3 wrote:
         | Interesting. What do you use it for?
        
           | geraltofrivia wrote:
           | Currently only for unstructured (OCR) text to structured text
           | conversion.
           | 
           | We're transitioning from a legacy codebase full of regexes
           | and undocumented functions that are understood only by the
           | developer and god. The developers left and I don't believe in
            | god. We tried throwing the unstructured mess at GPT, along
            | with a few examples, and got surprisingly good results.
        
       | Satam wrote:
       | Very interesting and basically confirms that GPT-4 turbo is a
       | faster but dumber model. When a task doesn't rely on memorization
       | of the training set, it reasons similarly well to GPT-4. Where
       | memorization is helpful, it performs worse (due to quantization-
       | induced "memory loss").
       | 
       | This also makes me look at GPT-4 as a "weak reasoner with a lot
       | of knowledge". That really aligns with my experience where it is
       | immensely helpful and has a superhuman knowledge base but still
       | needs handholding to solve real problems.
        
       | jessenaser wrote:
        | The thing is, why do GPT-4 Turbo and the updated GPT-3.5 Turbo
        | have an output of only 4,096 tokens?
       | 
       | Previous Model: gpt-3.5-turbo-16k, 16385 tokens context and
       | completion (shared)
       | 
       | New Model: gpt-3.5-turbo-1106, 16385 tokens context, 4096 tokens
       | completion
       | 
       | Previous Model: gpt-4, 8192 tokens context and completion
       | (shared)
       | 
       | New Model: gpt-4-1106-preview, 128000 tokens context, 4096 tokens
       | completion
       | 
       | Why would the same size of a 16K GPT-3.5 model now not allow
       | larger completion sizes?
       | 
        | Why would the new GPT-4 reduce the completion tokens as well?
        | gpt-4 can do 8192 and gpt-4-32k can do 32768 completion tokens;
        | now the limit is 4096.
       | 
       | So you would need to change the way you prompt (split the
       | responses) to be able to get a longer response.
       | 
       | ---
       | 
       | So are these new models taking the old base models of 4K tokens
       | context and completion and changing the context to 128000 but
        | leaving the completion the same? If they could get gpt-4 to have
        | gpt-4-8k and gpt-4-32k, why couldn't it have been 128000 context
        | and 32768 completion?
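        | 
        | (A sketch of the response-splitting workaround; this assumes the
        | pre-1.0 `openai` Python client, and the "continue" re-prompt is
        | just one common approach, not an official API feature:)
        | 
        | ```
        | import openai
        | 
        | def long_completion(prompt, model="gpt-4-1106-preview", max_rounds=4):
        |     messages = [{"role": "user", "content": prompt}]
        |     parts = []
        |     for _ in range(max_rounds):
        |         resp = openai.ChatCompletion.create(model=model, messages=messages)
        |         choice = resp["choices"][0]
        |         parts.append(choice["message"]["content"])
        |         # finish_reason == "length" means we hit the 4,096-token
        |         # completion cap; ask the model to pick up where it stopped.
        |         if choice["finish_reason"] != "length":
        |             break
        |         messages.append({"role": "assistant",
        |                          "content": choice["message"]["content"]})
        |         messages.append({"role": "user",
        |                          "content": "Continue exactly where you left off."})
        |     return "".join(parts)
        | ```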
        
         | srdjanr wrote:
          | Probably because it's too expensive. The prompt can be
          | processed quickly, but output tokens are much slower (and that
          | means more expensive).
         | 
         | From my local test on a 13B model, output tokens are 20-30x
         | more expensive than input tokens. So OpenAI's pricing structure
         | is based on expectation that there's much more input than
         | output tokens in an average response. It didn't matter too much
         | if a small percentage of requests used all 4k tokens for
         | output, but with 128k it's a different story.
        
         | Racing0461 wrote:
          | I believe openai wants to lower the time it takes for requests
          | to finish, to be able to accept more requests per server/gpu,
          | i.e. money.
        
         | qup wrote:
          | If I'm not mistaken, the model has to be trained for a
          | specific context window.
        
       | cultureswitch wrote:
       | Everyone who has any knowledge of machine learning knows that you
       | don't evaluate your model by testing it on parts of its training
       | data. The issue is, the training data for GPT-4 appears to be
       | "everything".
        
       | driverdan wrote:
        | This is a big problem with independent LLM testing. You need to
        | make sure your test set isn't included in the training set, which
        | isn't easy with closed-source models.
       | 
       | This makes me think of how hardware manufacturers optimize for
       | benchmarks. Closed source LLMs can intentionally include likely
       | test data in their training set to artificially inflate results.
       | I'm not saying they are intentionally doing that now, but they
       | could.
        
       | Racing0461 wrote:
        | OpenAI should be covered in lawsuits for nerfing a product that
        | people paid for and expect not to degrade over time while still
        | paying the same amount.
        
       | maxrmk wrote:
       | I've always been skeptical of benchmarking because of the
       | memorization problem. I recently made up my own (simple) date
       | reasoning benchmark to test this, and found that GPT-4 Turbo
       | actually outperformed GPT-4:
       | https://open.substack.com/pub/talcai/p/making-up-a-new-llm-b...
        
         | tornato7 wrote:
         | I like the test but do you take multiple samples / runs of a
         | result? IMO for a proper benchmark you should ask it the same
         | question 10+ times and show a confidence interval, otherwise
         | you don't know if it's just a fluke or a lucky guess.
        
           | maxrmk wrote:
           | Ahh good suggestion, I should clarify this in the article. I
           | tried to compensate with volume -- I used a set of 200
           | questions for the testing. I was using temperature 0, so I'd
           | get the same answer if I ran a single question multiple
           | times.
        
       ___________________________________________________________________
       (page generated 2023-11-09 23:01 UTC)