[HN Gopher] Benchmarking GPT-4 Turbo - A Cautionary Tale
___________________________________________________________________
Benchmarking GPT-4 Turbo - A Cautionary Tale
Author : ja3k
Score : 202 points
Date : 2023-11-09 13:00 UTC (10 hours ago)
(HTM) web link (blog.mentat.ai)
(TXT) w3m dump (blog.mentat.ai)
| Kuinox wrote:
| tldr: GPT-4 Turbo has a worse score on the synthetic benchmark
| on the first attempt; they speculate it's a smaller model that
| isn't able to memorize the responses as well.
| sunpazed wrote:
| This reflects my anecdotal findings in a very narrow domain:
| GPT-4 consistently resolves complex SQL from natural language,
| whereas GPT-4-Turbo is hit-and-miss, and similar to 3.5-Turbo in
| performance.
| Semaphor wrote:
| But that's not what the article is saying at all?
| boxed wrote:
| > We designed a test for this theory: we reran the benchmarks
| without showing the models the instructions to each exercise.
| Instead, we just told them that they were Exercism exercises, and
| gave them the exercise names and function stubs.
|
| This summarizes all my skepticism against the AI field. Pretty
| clear that they aren't solving the problems; they have them
| memorized.
| mavhc wrote:
| LLMs are lossy compression
| stoicfungi wrote:
| All models are, including the human brain.
| ChatGTP wrote:
| The human brain is a model?
| rocgf wrote:
| It models the world around it, so it's fairly similar to
| what GPT does, especially with the newly-added image
| capabilities and stuff.
| FrustratedMonky wrote:
| Consciousness itself is a model of the world.
|
| Our experience of the world is a model executing.
|
| Comparing the latest neuroscience to the latest neural
| networks, they look and behave very similarly.
| og_kalu wrote:
| "They memorized all the problems" is not what was found here
| and still a wrong overcorrection.
| dartos wrote:
| "Gpt-4 has more problems memorized than gpt-4 turbo" was
| exactly what was found here.
|
| That doesn't mean it's only able to solve problems in its
| training set (tho it's much better at that obviously.)
| boxed wrote:
| If you are shown only the title of a coding problem and the
| site name where it's from, and you manage to solve it, you are
| showing that you either cheated or knew the answer.
| og_kalu wrote:
| I mean sure, it memorized some of the answers. I'm not
| denying that. Clearly, it didn't memorize all of them.
| boxed wrote:
| When people say "oh look how amazing, it can solve
| programming problems!" when in fact they have only seen the
| models CHEAT, that is an enormous problem.
|
| For cases where just finding the answer is the goal it's
| perfectly fine, but it's not fine for claims that it can
| code. There's a huge difference.
| FeepingCreature wrote:
| I've seen it code on completely novel tasks, so I'm not
| sure what you're suggesting here. The model can
| unquestionably code.
| og_kalu wrote:
| Okay... Funny how forcing it to not CHEAT did not
| increase apparent ability.
|
| "It can code" and "it has memorized some coding questions"
| are not mutually exclusive.
| boxed wrote:
| > Okay... Funny how forcing it to not CHEAT did not
| increase apparent ability.
|
| The article did the opposite. It forced the models to
| cheat to solve the problems. Which they did happily. They
| should have stated "there is no actual problem to solve
| here, you must supply a problem for me to solve".
|
| > "It can code" and "it has memorized some coding questions"
| are not mutually exclusive
|
| This I will give you. Many humans try to cheat at basic
| math because they are lazy, and so does this model. Maybe
| that's a sign of intelligence :P
| raincole wrote:
| Me: What's 6x6?
|
| You: 36
|
| Me: You cheated! You just cited the answer you memorized!
| You should have started from addition.
|
| You: ...okay? 6+6=12, 12+6=18, 18+...
|
| Me: You cheated again! You just have 6+6=12 memorized!
| You should derive the rule of addition from the Peano axioms.
|
| You: ...you're being annoying, but okay? First axiom, we
| define 0 as...
|
| Me: You cheated _again_! You memorized Peano Axioms!
| Jesus Christ, is there any intelligent creature left?
| boxed wrote:
| But in this case it's not like that at all. They only saw
| the NAME of the problem. Like if I said "Page 23 of
| Mathbook Y, problem number 3". Which happens to be 6x6.
| raincole wrote:
| Me: I was in such a blah blah situation... does article 3
| of the Digital Government Act apply here?
|
| My lawyer: Hmm, article 3 says--
|
| Me: I knew it! Lawyers are not intelligent...
| furyofantares wrote:
| It said they gave the exercise name, which doesn't sound
| like just an exercise number but is probably mildly
| descriptive -- and they also gave it function stubs.
| stnmtn wrote:
| If I gave you a programming problem and all I told you
| was that the problem name was Traveling Salesman, you
| might be able to solve it based on that.
|
| If not that, then if I just said "fizzbuzz" to you, I'm
| sure you would be able to give the solution without me
| needing to give any other description of the problem.
| boxed wrote:
| Again, because of memorization, not being able to code.
| qup wrote:
| I know this is deep down a bad comment thread, but I
| thought I'd chime in here.
|
| I have been writing function names and test names, and
| then telling GPT to fill in the test, which it usually
| does how I want (maybe with errors, but it tests the
| correct thing), and then I tell it to fill out the
| answers.
|
| This is in a thing I'm building that's never been built,
| with names that I made up (but that describe the
| functionality well).
|
| It cannot have this memorized; I just invented it
| myself.
| jsight wrote:
| TBH, people underestimate how much of coding is just
| memorization. I'm guessing those of us with bad memories
| understand this more than the ones with good memories. :)
|
| I can't remember how many times I've googled, "how do I
| create a directory in Python?". Now bard often generates
| an inline answer for me.
| hanselot wrote:
| Though this is exactly what happened. The initial test
| was run on a model that "cheated" (aka has memorized the
| answers). The second test was run on a model that didn't
| "cheat" as much, yet scored only 2% lower. So the
| question is not really resolved. How much did the
| first model cheat, and how much did the second? If the
| second model "cheats" less, then it wins.
|
| Also, I don't understand your obsession with the word
| cheating. If you have solved a problem before on a
| different website and solve it again, did you cheat? Or
| did you just use your brain to store the solution for
| later?
| boxed wrote:
| > Also, I don't understand your obsession with the word
| cheating.
|
| It's all about the rule set, yeah. Since the rule set is
| not defined, technically nothing is cheating. I just
| interpret the rule set as "can it code?", and for this
| rule set, it seems to me that it's cheating.
| boxed wrote:
| > How much did the first model cheat, and how much did
| the second? If the second model "cheats" less, then it
| wins.
|
| They both cheated 100%. Because they both never saw the
| problem. AT ALL. They just saw the title and the name of
| the website.
| raincole wrote:
| Almost 2024 and people still can't accept that LLMs can
| code...
| danielmarkbruce wrote:
| Of course they can't. And self-driving cars also don't
| exist, it's like 10 years away at best.
| notnaut wrote:
| It can generate never-before-seen strings of
| comprehensible language. It can react to the inherent
| logic embedded in words and text and provide a brute
| forced version of what a human could. That it can "solve"
| a problem only through "cheating" is an anthropomorphism
| that betrays the magic that is evident to anyone who has
| used these things.
| Terretta wrote:
| On the contrary, it could mean you were, to some percentage
| of success, able to guess what the problem is, and then, to
| some multiplier percentage of success, solve it.
|
| The key is, can you guess the problem from the title and
| the function name? I'd argue: sure, at least half the
| time? Why not...
| DecayingOrganic wrote:
| Memorization often gets a bad rap as the underachiever's
| shortcut. However, it's a fundamental component of any learning
| process! Our ability to reason, solve problems, and innovate is
| all built upon a foundation of memorized information. In fact,
| it's precisely the reason humans have thrived for so long; we
| were able to memorize and pass down knowledge culturally long
| before the written word, not because we were 100 times smarter
| than our nearest cousins. Without memorization, be it in our
| brains or AI algorithms, there's no foundation to build upon
| for higher reasoning.
| viraptor wrote:
| It's hard for me to decide without seeing the data. Even if you
| don't know the exact exercise, seeing the title and the
| function name/parameters is often enough for me to guess what
| the challenge is. I checked the public questions on exercism
| and almost all of those (that I spot checked) that contained
| the function name were extremely obvious. Knowing it's a
| programming challenge would also improve my guessing chances.
|
| For example the function stubs I can find are
| "value_of_card(<card>)" in exercise "Black Jack", or
| "generate_seat_letters(<number>)" in exercise "Plane Tickets".
| I think I could guess those without seeing the rest of the
| question.
| GaggiX wrote:
| So how can it solve novel problems? The internet does not have
| all combinations of every possible task with any random
| programming language, library or constraints. It can even solve
| problems with non-existent programming languages and libraries
| if you describe them; if that's just memorization then I don't
| know what isn't.
| m3kw9 wrote:
| From a black-box point of view, and from one angle, GPT is a
| web filter that tries to find you the exact thing you are
| looking for, but from memory. Vs Google, where you have to
| distill all the info into what you need.
| ren_engineer wrote:
| even if it's not true AI or even an architecture with the
| potential to become AI, LLMs are already good enough to provide
| real world value. Obviously "super autocomplete" isn't as sexy
| as true AI, but still very useful
| drcode wrote:
| You can call it whatever you want, all I know is I used to
| write programs in lines of code, then blocks of code at a time,
| spit out by LLMs
|
| Using GPT-4 Turbo yesterday, I feel like I'm moving to pages of
| code at a time now.
|
| Taking the ideas in my head and turning them into reality is so
| easy now
| blovescoffee wrote:
| Ok, but you understand there's a body of literature that shows
| that LLMs don't "just" memorize?
| gloosx wrote:
| +100 to that. My biggest scepticism is people actually creating
| a new problem while thinking they are solving a problem. Don't
| get me wrong, translating natural language ideas into code is
| fun and all, but the truth is it's still code, just given to
| the machine in an ambiguous language format.
|
| When did natural language become better for expressing
| development ideas than code? I know - when you don't know how
| to code in the first place. Then you have to bet on all of the
| ambiguities, cultural and metaphysical, that words carry in
| order to hack your thing together instead of expressing
| yourself directly and explicitly.
|
| Finally, what is beautiful about the strict code format we are
| so used to is that it is truly the fastest and shortest path to
| get your thing done, provided you possess the knowledge needed.
| ericrallen wrote:
| That sounds a lot like gatekeeping.
|
| These tools will empower folks who aren't developers to build
| stuff and maybe learn a bit more about how programming works.
|
| They will enable folks who have ideas, but can't express
| them, to actually be able to create what they are imagining.
|
| That's awesome.
|
| Code isn't beautiful (except for a few rare exceptions).
| Creating something with code is.
| doug_durham wrote:
| Natural language isn't superior to computer languages. NL
| allows you to describe a software concept in a language- and
| framework-neutral way. The LLM generates the code. The real
| benefit is when you work across languages and frameworks. It
| is difficult to keep all of the details of all of the
| framework calls in your head all of the time.
| caesil wrote:
| If that's your takeaway from this then you really missed the
| point. The implication here is that gpt-4 to gpt-4-turbo
| represents a leap away from memorization and toward better
| reasoning with a more complete world model.
| nathanfig wrote:
| "memorize" implies they can only recite things verbatim and
| that's ignoring the massive leap in being able to synthesize
| disjoint "memories" in new ways.
| mccrory wrote:
| GPT-4 Turbo is still in preview, maybe wait until it is fully
| released before judging?
| msp26 wrote:
| The point of a preview phase is to test the model in real world
| use.
| seanhunter wrote:
| This isn't really real-world use any more than putting these
| same problems to people as a whiteboard coding exercise in an
| interview is real-world coding, yet seemingly a lot of people
| seem to be generalising from this tiny sample to all manner
| of overarching statements about performance of the model in
| general "it's faster but dumber", "this proves it only
| memorises" etc.
| KaoruAoiShiho wrote:
| Why are all the comments here so negative... this is a good
| thing, turbo has less memorization but keeps the same reasoning
| ability. That's excellent and a relief.
| boxed wrote:
| Or the programming quiz problems it tried to "solve" were in
| fact also posted elsewhere, so it cheated on the ones it got
| right too.
| CamperBob2 wrote:
| People here spent a lot of time (and money) in school learning
| to do things that can now be automated. The whining is just
| beginning.
| anotherpaulg wrote:
| Aider has had an Exercism benchmarking suite for quite some time.
|
| Interestingly, my benchmark results for GPT-4 Turbo show the
| opposite result: the new gpt-4-1106-preview did significantly
| _better_ on the first try than the March and June models.
|
| https://aider.chat/docs/benchmarks-1106.html
|
| Aider benchmarks against the 133 Exercism python exercises, not
| the js exercises that mentat's benchmark uses. So this is not
| an apples-to-apples comparison, but there doesn't seem to be a
| strong reason to expect qualitatively different results.
|
| I also notice that the instructions prompt that mentat uses seems
| to be _inspired by_ the aider benchmark? Glad to see others
| adopting similar benchmarking approaches.
|
| https://github.com/AbanteAI/mentat/blob/main/tests/benchmark...
|
| https://github.com/paul-gauthier/aider/blob/main/benchmark/p...
|
| Edit: Not sure if the mentat authors are in this thread? After
| looking around a bit, there seems to be a bunch of aider code in
| your repo. Some attribution would be appreciated. It might even
| be required under aider's Apache 2.0 license?
| naiv wrote:
| I am also noticing a massive improvement over the old model
| hanselot wrote:
| Isn't it a good thing that of the benchmarks they ran, the
| newer model has fewer of the answers memorized (aka, its
| parroting less)?
|
| Wouldn't this actually be exactly proof that the model has
| improved over its predecessor by having to solve the problem
| itself rather than rely on memory?
|
| What use is a model that memorizes the answers to all the
| benchmarks? (See the 7b models on the open llm leaderboard for
| more info on that.)
| derwiki wrote:
| I've been using the new model with Aider since it was released,
| and my anecdata agrees--the "edits applied successfully"
| failure rate is much lower than with classic GPT-4.
|
| Also THANK YOU for Aider! I talk it up to all my programmer
| friends; it really feels like a glimpse into the future of
| coding.
| strich wrote:
| Does aider work with c# at all?
| anotherpaulg wrote:
| Yes!
|
| Thanks for asking. I've been meaning to address these kinds
| of questions in the aider FAQ [0]. Here's the entry I just
| added:
|
| Aider supports pretty much all the popular coding languages.
| This is partly because GPT-4 is fluent in most mainstream
| languages, and familiar with popular libraries, packages and
| frameworks.
|
| In fact, coding with aider is sometimes the most magical when
| you're working in a language that you are less familiar with.
| GPT often knows the language better than you, and can
| generate all the boilerplate to get to the heart of your
| problem. GPT will often solve your problem in an elegant way
| using a library or package that you weren't even aware of.
|
| Aider uses tree-sitter to do code analysis and help GPT
| navigate larger code bases by producing a repository map [1].
|
| Aider can currently produce repository maps for most
| mainstream languages, listed below. But aider should work
| quite well for other languages, even without repo map
| support: C, C#, C++, Emacs Lisp, Elixir, Elm, Go, Java,
| Javascript, OCaml, PHP, Python, QL, Ruby, Rust, Typescript.
|
| [0] https://aider.chat/docs/faq.html#what-code-languages-
| does-ai...
|
| [1] https://aider.chat/docs/repomap.html
| lemming wrote:
| I was actually wondering this myself yesterday. So it's not
| possible to plug a different tree-sitter implementation in
| for a niche language?
| anotherpaulg wrote:
| It should be possible, but not currently. Aider would
| need a bit more configurability to be able to load up
| arbitrary tree-sitter language implementations at
| runtime.
|
| There's an open issue you might want to follow for
| updates:
|
| https://github.com/paul-gauthier/aider/issues/321
| epiccoleman wrote:
| I've _just_ started playing with aider this week, and I
| find it extremely fun and exciting. But I will say that
| I've had middling results with an Elixir / Phoenix app. I
| don't think this has anything to do with aider - rather, I
| think that the GPT models haven't quite internalized the
| new approaches in Phoenix 1.7, since up until Turbo their
| training data was fairly old and probably still contains
| more pre-1.7 Phoenix examples than post-1.7.
|
| In spite of these frustrations, I have had some genuinely
| amazing moments coding with GPT-4 lately though. I upgraded
| to ChatGPT plus lately and it's just mindblowing how
| helpful it can be in the right contexts. I'm hoping that as
| I get better with aider I might just drop the ChatGPT sub
| and stick to API usage.
|
| I totally understand the skepticism many have, because this
| stuff is still a bit finicky - but I'm overwhelmed by a
| sense of how fucking _cool_ this stuff is quite often.
| biobootloader wrote:
| Hey Paul, I'm a Mentat author.
|
| > I also notice that the instructions prompt that mentat uses
| seems to be inspired by the aider benchmark? Glad to see others
| adopting similar benchmarking approaches.
|
| We were inspired by you to use Exercism as a benchmark, thank
| you! We will add attribution for that. We switched our original
| instruction prompts for that benchmark to be similar to Aider's
| to allow for a fair comparison.
|
| > After looking around a bit, there seems to be a bunch of
| aider code in your repo. Some attribution would be appreciated.
|
| We have an unused implementation of your output response format
| (https://github.com/AbanteAI/mentat/blob/main/mentat/parsers/...),
| but I don't know what else you are seeing? We implemented that
| to compare with our response formats and didn't find much
| difference in performance.
| anotherpaulg wrote:
| I didn't spend much time looking, but your benchmark
| prompting inspired me to search your repo for "aider". The
| results were 3 PRs where aider was mentioned in the
| conversations [0].
|
| The "code map" PR in particular mentions being "inspired by
| aider", links to aider and seems to include a bunch of code
| from aider's old ctags based "repo map" implementation. This
| isn't an insignificant component of an AI coding tool.
|
| Aider is open source and I try and share my learnings as I'm
| building it. So it's great when other projects get
| inspiration from aider! But it is polite to provide
| attribution for such inspiration, especially if you crib from
| code with an attribution license.
|
| [0] https://github.com/search?q=repo%3AAbanteAI%2Fmentat+aide
| r&t...
| ja3k wrote:
| Sorry about that. We updated the blog with attribution and put
| an attributing comment in our code base where we use your
| benchmarking prompts. We'll probably delete our implementation
| of your response format later today since we just had it for
| benchmarking.
| mufasachan wrote:
| > Although the author OCR'ed the SAT questions and believes that
| they weren't in the training data
|
| I agree that the author of the tweet rather underestimates the
| potential portion of OCR'ed content in OpenAI's training data.
| In late August, Nougat[1] was released by Meta; it is an OCR
| model. Its performance is wild and the model is open source.
|
| I find it hard to believe that OpenAI does not spend effort on
| getting more training data from OCR'ed content. I also find it
| hard to believe that OpenAI would wait for a Meta paper to have
| a performant internal OCR model.
|
| [1]: https://arxiv.org/abs/2308.13418
| msp26 wrote:
| I'm interested in more testing on the context side of things.
|
| For my NLP pipelines, I batch n articles together to process
| (extract fields from) in one prompt (the final output is
| something like {"1":[{}], "2": [{},{}]...}). Compute-wise it's
| inefficient but OpenAI charges by the token so it doesn't
| matter. It's very reliable on gpt-4 8k.
|
| I was also pretty happy with the results on 4-turbo initially but
| it seems that once you go past 30k-ish tokens in context (needs
| way more testing), it shits itself. The indexes don't match
| anymore and n_final_output is different from n_articles.
|
| Still, great model and even if the limits are lower in practice I
| suspect I'll get good use out of it.
|
| Edit: With better prompting, it feels stable at n=42, ~42000
| prompt tokens.
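|
| A rough sketch of the batching-and-validation pattern described
| above (the prompt wording, field schema and function name are
| illustrative, not the exact pipeline):
|
| ```
| import json
| from openai import OpenAI  # assumes the openai v1.x package
|
| client = OpenAI()  # reads OPENAI_API_KEY from the environment
|
| def extract_fields(articles, model="gpt-4"):
|     """Batch several articles into one prompt and parse the
|     index-keyed JSON that comes back."""
|     numbered = "\n\n".join(
|         f"### Article {i + 1}\n{text}"
|         for i, text in enumerate(articles)
|     )
|     prompt = (
|         "For each article below, extract the fields as JSON. "
|         "Reply with one object keyed by article number, e.g. "
|         '{"1": [ ... ], "2": [ ... ]}.\n\n' + numbered
|     )
|     resp = client.chat.completions.create(
|         model=model,
|         temperature=0,
|         messages=[{"role": "user", "content": prompt}],
|     )
|     result = json.loads(resp.choices[0].message.content)
|     # The failure mode described above: the number of returned
|     # entries drifts away from the number of articles sent in.
|     if len(result) != len(articles):
|         raise ValueError(
|             f"expected {len(articles)}, got {len(result)}"
|         )
|     return result
| ```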
| phillipcarter wrote:
| Interesting. I was skeptical about some of their claims
| regarding longer context, since it's been my experience that
| these models just get lost after enough of it.
| msp26 wrote:
| Yeah, degraded performance on long contexts has been observed
| in plenty of other models [https://arxiv.org/abs/2307.03172]
| so I was cautious too. Unfortunately I don't have access to
| 4-32k. I would have liked to test that out too.
| FeepingCreature wrote:
| I wonder how often a human could guess the exercise based on just
| the function stub.
| ConnorMooneyhan wrote:
| yeah, some of the exercises are like the following:
|
| ```
| function helloWorld() {
|   return "";
| }
|
| helloWorld()
| ```
|
| but those sorts of obvious examples are mostly in the beginner
| exercises, so I wonder what the distribution of the correct
| answers was. If it was guessing based on function stubs, the
| prediction would be that correct answers would be clustered
| around the beginner exercises, and that as the exercises
| advanced in difficulty, there would be fewer correct answers.
| og_kalu wrote:
| I think it's interesting that forcing models out of memorization
| doesn't always produce a steep drop in ability.
|
| I've definitely had instances where 4 memorized a common puzzle
| and failed a subtly altered variant, but then got the variant
| after changing variable names or otherwise making it look
| different from what it would have memorized.
| Havoc wrote:
| Is it known what exactly OpenAI does in the background when they
| make these turbo editions?
|
| It seems like sacrificing some quality for large gains in speed
| and cost, but does anyone know more detail?
| thorax wrote:
| Don't think so, but there were some guesses on 3.5-turbo -- i.e.
| training a much smaller model on quality questions/answers from
| GPT-4. The same tactic has worked again and again for other
| LLMs.
|
| I'm definitely curious about the context window increase -- I'm
| having a hard time telling if it's 'real' vs a fast specially
| trained summarization prework step. That being said, it's been
| doing a rather solid job of not losing info in that context
| window in my minor anecdotal use cases.
| maciejgryka wrote:
| I have similar conclusions so far. We have a custom data set
| (basically visual Q&A about web apps) and `gpt4` gets roughly 90%
| correct, while `gpt-4-1106-preview` only 86%. It's a little noisy
| (I didn't yet check out the new seeds functionality), but roughly
| consistent.
|
| Since I created this dataset by hand, it can't really be
| memorized. I'm sure there's _similar_ data in the training set,
| but answering correctly still requires some reasoning-like
| capabilities.
| m3kw9 wrote:
| Now do a programming task that requires more than 32k of context
| and see who's "better". If you don't bench mark that you cannot
| get an overall pic. GitHub copilot for example could benefit big
| from the increased context
| broast wrote:
| Obviously it's a drawback, but the silver lining of the small
| context window is that it forces me to decouple everything and
| have very sensible and strict APIs, where I just write the docs
| and it writes the code.
| biobootloader wrote:
| we are working on creating "real world" benchmarks that require
| a lot of context, and will report when we have results!
| m3kw9 wrote:
| That's why it's called 4 Turbo, not "4.5". But the context
| length is a bigger cargo space.
| minihat wrote:
| GPT-4 Turbo is dramatically worse at one task I often try:
|
| Read the following passage from [new ML article]. Identify their
| assumptions, and tell me which mathematical operations or
| procedures they use depend upon these assumptions.
|
| GPT-4: Usually correctly identifies the assumptions, and often
| quotes the correct mathematics in its reply.
|
| GPT-4 Turbo: Sometimes identifies the assumptions, and is
| guaranteed to stop trying at that point and then give me a
| Wikipedia-like summary about the assumptions rather than finish
| the task. Further prompting will not improve its result.
| thorax wrote:
| Do you have a link or gist of an example run you tried? I'd be
| curious to try something similar.
| jphoward wrote:
| The problem is that the discussed results compare proportions
| out of a relatively small number of questions - 67. If you
| model this as a binomial distribution, then the 62/67 that
| GPT-4 Turbo got gives a 95% confidence interval for the 'true'
| performance of 83.4% to 97.5%, i.e. it comfortably includes the
| proportion that GPT-4 achieved (64/67 = 95.5%).
|
| I think the evidence from these tests is not strong enough to
| draw conclusions from.
| epups wrote:
| Yes. I see people make this mistake time and again when
| evaluating LLMs. For a proper comparison, it's not enough to
| simply throw less than a hundred questions at it and point to a
| single digit difference. Not to mention that LLMs have some
| inherent randomness, so even if you passed the exact same tasks
| to the same model you would expect some variance.
|
| I see a lot of room of improvement in how we apply statistics
| to understanding LLM performance.
| Racing0461 wrote:
| > I think the evidence from these tests are not strong enough
| to draw conclusions from.
|
| I used GPT-4 Turbo for some coding problems yesterday. It was
| worse. That's enough to draw conclusions for me.
| Maxion wrote:
| I'm not surprised, most people can't even tell the median from
| the mean.
| iJohnDoe wrote:
| A slightly off-topic question. When people are talking about
| costs with GPT, like in the following link, does the cost
| concern only apply to the API? If you're using the WebUI and
| have a Plus account, is it always just the flat $20 amount?
|
| https://news.ycombinator.com/item?id=38193978
| tosh wrote:
| usually, yes (either cost of the API or cost to serve for
| OpenAI)
| geraltofrivia wrote:
| In my day job we use GPT-4 quite a bit and we shifted to GPT-4
| Turbo today. We got a 2-5% performance increase, and quite a
| bit of a speed increase as well.
|
| Not to say that the parent post is incorrect, of course. Only
| that it's not as cut and dried as "GPT-4 Turbo is distilled
| (read: watered down) GPT-4".
| hu3 wrote:
| Interesting. What do you use it for?
| geraltofrivia wrote:
| Currently only for unstructured (OCR) text to structured text
| conversion.
|
| We're transitioning from a legacy codebase full of regexes
| and undocumented functions that are understood only by the
| developer and god. The developers left and I don't believe in
| god. We tried throwing the unstructured mess at GPT, along with
| a few examples, and got surprisingly good results.
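|
| The few-shot setup can be as simple as the sketch below (the
| invoice fields and example text are invented; the real schema
| obviously depends on the documents):
|
| ```
| from openai import OpenAI  # assumes the openai v1.x package
|
| client = OpenAI()
| ocr_text = open("scan.txt").read()  # new document to convert
|
| # One worked example, then the new OCR text.
| messages = [
|     {"role": "system", "content":
|      "Convert the OCR'ed text into JSON with the fields "
|      "invoice_no, date and total. Output JSON only."},
|     {"role": "user", "content":
|      "Inv 0042  12 Mar 2021  Total due: 118,40 EUR"},
|     {"role": "assistant", "content":
|      '{"invoice_no": "0042", "date": "2021-03-12", '
|      '"total": "118.40"}'},
|     {"role": "user", "content": ocr_text},
| ]
|
| resp = client.chat.completions.create(
|     model="gpt-4", temperature=0, messages=messages
| )
| print(resp.choices[0].message.content)
| ```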
| Satam wrote:
| Very interesting, and it basically confirms that GPT-4 Turbo is
| a faster but dumber model. When a task doesn't rely on
| memorization of the training set, it reasons similarly well to
| GPT-4. Where memorization is helpful, it performs worse (due to
| quantization-induced "memory loss").
|
| This also makes me look at GPT-4 as a "weak reasoner with a lot
| of knowledge". That really aligns with my experience where it is
| immensely helpful and has a superhuman knowledge base but still
| needs handholding to solve real problems.
| jessenaser wrote:
| The thing is, why do GPT-4 Turbo and the updated GPT-3.5
| Turbo have only an output of 4,096 tokens?
|
| Previous Model: gpt-3.5-turbo-16k, 16385 tokens context and
| completion (shared)
|
| New Model: gpt-3.5-turbo-1106, 16385 tokens context, 4096 tokens
| completion
|
| Previous Model: gpt-4, 8192 tokens context and completion
| (shared)
|
| New Model: gpt-4-1106-preview, 128000 tokens context, 4096 tokens
| completion
|
| Why would the same size of a 16K GPT-3.5 model now not allow
| larger completion sizes?
|
| Why would the new GPT-4 reduce the completion tokens as well?
| gpt-4 can do 8192 and gpt-4-32k can do 32768 completion tokens;
| now the limit is 4096.
|
| So you would need to change the way you prompt (split the
| responses) to be able to get a longer response.
|
| ---
|
| So are these new models taking the old base models with 4K
| tokens of context and completion and raising the context to
| 128000 while leaving the completion the same? If they could get
| gpt-4 to have gpt-4-8k and gpt-4-32k, why couldn't it have been
| 128000 context and 32768 completion?
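|
| One practical way to "split the responses" around the
| 4,096-token completion cap is a continuation loop, roughly like
| the sketch below (illustrative only; it just re-asks for more
| output while finish_reason comes back as "length"):
|
| ```
| from openai import OpenAI  # assumes the openai v1.x package
|
| client = OpenAI()
|
| def long_completion(prompt, model="gpt-4-1106-preview"):
|     """Ask the model to continue whenever it stops because it
|     ran out of completion tokens."""
|     messages = [{"role": "user", "content": prompt}]
|     parts = []
|     while True:
|         resp = client.chat.completions.create(
|             model=model, messages=messages
|         )
|         choice = resp.choices[0]
|         parts.append(choice.message.content)
|         if choice.finish_reason != "length":
|             break  # the model finished on its own
|         # Keep the partial answer in context, ask for the rest.
|         messages.append({"role": "assistant",
|                          "content": choice.message.content})
|         messages.append({"role": "user",
|                          "content": "Continue exactly where "
|                                     "you left off."})
|     return "".join(parts)
| ```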
| srdjanr wrote:
| Probably because it's too expensive. The prompt can be
| processed quickly, but output tokens are much slower (and that
| means more expensive).
|
| From my local test on a 13B model, output tokens are 20-30x
| more expensive than input tokens. So OpenAI's pricing structure
| is based on the expectation that there are many more input than
| output tokens in an average request. It didn't matter too much
| if a small percentage of requests used all 4k tokens for
| output, but with 128k it's a different story.
| Racing0461 wrote:
| I believe OpenAI wants to lower the time it takes for requests
| to finish so they can accept more requests per server/GPU, i.e.
| money.
| qup wrote:
| If I'm not mistaken, the model has to be trained for a specific
| context window.
| cultureswitch wrote:
| Everyone who has any knowledge of machine learning knows that you
| don't evaluate your model by testing it on parts of its training
| data. The issue is, the training data for GPT-4 appears to be
| "everything".
| driverdan wrote:
| This is a big problem with independent LLM testing. You need to
| make sure your test set isn't included in the training set which
| isn't easy with closed source models.
|
| This makes me think of how hardware manufacturers optimize for
| benchmarks. Closed source LLMs can intentionally include likely
| test data in their training set to artificially inflate results.
| I'm not saying they are intentionally doing that now, but they
| could.
| Racing0461 wrote:
| OpenAI should be covered in lawsuits for nerfing a product people
| paid for and expect to not degrade over time while still paying
| the same amount.
| maxrmk wrote:
| I've always been skeptical of benchmarking because of the
| memorization problem. I recently made up my own (simple) date
| reasoning benchmark to test this, and found that GPT-4 Turbo
| actually outperformed GPT-4:
| https://open.substack.com/pub/talcai/p/making-up-a-new-llm-b...
| tornato7 wrote:
| I like the test but do you take multiple samples / runs of a
| result? IMO for a proper benchmark you should ask it the same
| question 10+ times and show a confidence interval, otherwise
| you don't know if it's just a fluke or a lucky guess.
| maxrmk wrote:
| Ahh good suggestion, I should clarify this in the article. I
| tried to compensate with volume -- I used a set of 200
| questions for the testing. I was using temperature 0, so I'd
| get the same answer if I ran a single question multiple
| times.
___________________________________________________________________
(page generated 2023-11-09 23:01 UTC)