[HN Gopher] OpenAI and Microsoft Azure to deprecate GPT-4 32K
___________________________________________________________________
OpenAI and Microsoft Azure to deprecate GPT-4 32K
Author : tosh
Score : 60 points
Date : 2024-06-16 18:16 UTC (4 hours ago)
(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)
| antman wrote:
| So much for code migrations! Although most examples are about
| summarization or needle-in-a-haystack search, applications whose
| output should be roughly the size of the input are probably more
| important, although less advertised.
|
| Curious whether that is a business decision or a technical one,
| i.e. whether the optimizations that make the 128k gpt4o cheap
| and fast only work for small outputs.
| xena wrote:
| One of the decent parts about open weights models is that you
| can't have this happen to you. You get to keep access to any
| model you want. This is essential for continuity of products.
| MuffinFlavored wrote:
| what is chatgpt as a product more than just "really good
| weights"?
| xena wrote:
| Really good weights that you don't have to run yourself
| BiteCode_dev wrote:
| Probably a lot that we don't know about, because ChatGPT has
| better reasoning skills than most other systems out there,
| and we know reasoning is not really about the weights.
| Slyfox33 wrote:
| Chatgpt can't reason about anything.
| TeMPOraL wrote:
| It's just _really_ good weights - significantly better
| weights than anyone else has.
| azeemba wrote:
| I am having trouble understanding what the complaint is here.
|
| The docs still mention bigger models with 128k tokens and smaller
| models with 8k tokens. It seems reasonable to optimize for big
| and small use cases differently? I don't see how we are being
| "robbed".
| xena wrote:
| The main limit is that you can have 128k tokens of input, but
| only 4k tokens of output per run. gpt-4-32k lets you have up to
| 32k tokens of output per run. Some applications need that much
| output, especially for token-dense things like code and JSON.
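|
| A minimal sketch of where that cap bites, assuming the current
| openai Python client and a long-output task (the prompt content
| is just a placeholder): the request succeeds, but the completion
| is cut off once it hits max_tokens and finish_reason comes back
| as "length".
|
|       from openai import OpenAI
|
|       client = OpenAI()  # reads OPENAI_API_KEY from the env
|       msg = {"role": "user",
|              "content": "Port this whole module: ..."}
|       resp = client.chat.completions.create(
|           model="gpt-4o",
|           messages=[msg],
|           # output cap per run, separate from the 128k
|           # input window the model can read
|           max_tokens=4096,
|       )
|       choice = resp.choices[0]
|       if choice.finish_reason == "length":
|           # truncated at the cap, not because it was done
|           print("output cut off after",
|                 len(choice.message.content), "chars")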
| seeknotfind wrote:
| So it's a price concern? Because you could run for 4k output
| 8 times to get 32K? Or does the RLHF stuff prevent you from
| feeding the output back in as more input and still getting a
| decent result? The underlying transformers shouldn't care,
| because they're effectively doing that already.
| xena wrote:
| I'd say it's less a price concern and more a concern about
| consistency of output. I don't think it makes much sense to
| continue incomplete JSON like that. I need to do some more
| research.
| peab wrote:
| You can just feed that output into another call and have the
| next call continue it, since you have more than 28k of spare
| context. Per-token output is faster anyway, right? So speed
| isn't an issue. It's just slightly more dev work (really only
| a couple of lines of code).
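|
| Roughly this sketch, assuming the openai Python client: the
| partial completion goes back in as an assistant message and the
| loop keeps calling until the model stops on its own rather than
| at the output cap (whether it always resumes cleanly is the
| open question discussed below).
|
|       from openai import OpenAI
|
|       client = OpenAI()
|       prompt = "Rewrite this whole file in Go: ..."
|       messages = [{"role": "user", "content": prompt}]
|       parts = []
|       while True:
|           resp = client.chat.completions.create(
|               model="gpt-4o",
|               messages=messages,
|               max_tokens=4096,
|           )
|           choice = resp.choices[0]
|           parts.append(choice.message.content)
|           if choice.finish_reason != "length":
|               break  # the model finished on its own
|           # feed the partial output back, ask for the rest
|           messages.append(
|               {"role": "assistant",
|                "content": choice.message.content})
|           messages.append(
|               {"role": "user",
|                "content": "Continue exactly where "
|                           "you left off."})
|       full_output = "".join(parts)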
| DelightOne wrote:
| How do you know it will have the same state of mind? And
| how much does that cost?
| jhgg wrote:
| Because the state of mind is derived from the input
| tokens.
| DelightOne wrote:
| Is there a study or anything guaranteeing that, if you
| add an incomplete assistant response as the input, the
| API picks up in exactly the same way from the same
| position?
| sshumaker wrote:
| It's how LLMs work - they are effectively recursive at
| inference time: after each token is sampled, you feed it
| back in. You will end up with the same model state (not
| counting sampling noise) as if that had been the original
| input prompt.
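|
| A toy illustration of that loop with a small open model (gpt2
| via Hugging Face transformers, chosen only because it is easy
| to run locally): each new token is appended to the input ids
| and the next forward pass treats it exactly like part of the
| original prompt.
|
|       import torch
|       from transformers import (AutoModelForCausalLM,
|                                 AutoTokenizer)
|
|       tok = AutoTokenizer.from_pretrained("gpt2")
|       model = AutoModelForCausalLM.from_pretrained("gpt2")
|
|       ids = tok("The quick brown fox",
|                 return_tensors="pt").input_ids
|       for _ in range(20):
|           with torch.no_grad():
|               logits = model(ids).logits
|           # greedy pick, so there is no sampling noise at all
|           next_id = logits[0, -1].argmax()
|           # the new token becomes ordinary input next pass
|           ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
|       print(tok.decode(ids[0]))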
| DelightOne wrote:
| LLMs sure. My question is whether it is the same in
| practice for LLMs behind said API. I found no official
| documentation that we will get exactly the same result as
| far as I can tell.
|
| And no one here touched how high a multiple the cost is,
| so I assume its pretty high.
| t-writescode wrote:
| > I am having trouble understanding what the complaint is here.
|
| The appropriate level of due diligence for each LLM model
| transition is to run your various prompts against the new model
| and make sure they still produce the correct output; and, if
| they don't, to update the prompts until they do.
|
| Just yesterday, I was experimenting with 4o and assumed I could
| do a flat migration for some work. 4o actually provided worse
| results - results I explicitly asked to *not* have in my 4
| output (and that I didn't have in my 4 output).
|
| It's tedious to have to change models after you've already done
| a proper validation suite against one model.
|
| That would be (at least my) complaint.
|
| I've even version-stamped the models I use on purpose to avoid
| surprises.
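|
| Concretely, that just means pinning a dated snapshot id instead
| of the floating alias, e.g. (the snapshot names here are
| OpenAI's published ones at the time of writing):
|
|       # floating alias: silently re-points to newer snapshots
|       MODEL = "gpt-4o"
|       # pinned snapshot: behaviour only changes when I change it
|       MODEL = "gpt-4o-2024-05-13"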
| hit8run wrote:
| At that point why not use your own hosted open source model
| that is more reproducible for you?
| t-writescode wrote:
| Cost of maintaining architecture, cost of complexity of
| internal infrastructure, knowledge level required for self-
| hosting, complexity of local one-boxing, and a slew of
| other reasons.
| xena wrote:
| Everything's a tradeoff, but it seems that part of this
| tradeoff is that tools critical for your product to
| function correctly can be taken away without a way to get
| them back. Maybe that can be an acceptable tradeoff, but
| I'd personally not like living with that.
| t-writescode wrote:
| At present, my startup isn't making money - in fact, it's
| not even released. As a result, I'm trying to prioritize
| getting it out of the door while still being affordable
| for myself enough to bootstrap it.
|
| To do this, I've made, and continue to make, tradeoffs.
| Among the tradeoffs I'm currently making that I intend to
| resolve ASAP is having OpenAI as a single point of
| failure. I intend to add some of the other hosted
| solutions as alternative options for LLM processing, and
| self-hosting will be one of the many options considered
| at that time.
|
| I've already spent more time than I should perfecting
| various smaller pieces, increasing reliability, etc. Each
| time I choose perfection, I lose more time, more runway,
| more potential market share; and, something I've recently
| had to learn:
|
| Each time I lock myself in a previous step to get that
| step perfect, I miss the lessons I'm about to have to
| learn in the next stage of the process, including new
| issues I'll run into that increase the next step's
| complexity above my initial estimates.
|
| Everything is a tradeoff. Choosing a commercially
| available solution with known and relatively fixed costs,
| while accepting that it may slowly change underfoot (and
| knowing I have alternatives I can swap to fairly quickly
| if an emergency comes up), is one I've made.
| ComplexSystems wrote:
| Because open source models aren't as good as GPT-4.
| psanford wrote:
| Not directly addressing your point, but asking an LLM to not
| include something in its output often doesn't work well. It's
| a bit like saying to someone "whatever you do, don't think
| about elephants."
| jdsully wrote:
| 4o is pretty bad in my experience. They shouldn't have used
| the "4.0" naming for it; they should have called it Lite or
| something. Instead they are trying to market it as roughly
| equivalent, which it definitely is not.
| matsemann wrote:
| If you've spent ages fine tuning your prompt/context to have it
| work for your integration, it's not a given it will work
| similarly on a model of a different size. Might have to
| essentially start from scratch.
| kmeisthax wrote:
| I'm confused here. What's an "output context"? My assumption was
| that the context window was shared across input and (prior)
| output. You put everything into the model at once, and then at
| the end the first unused context vector becomes a vector you can
| decode for a single token of output; multiple-token output means
| you repeatedly run inference, decode, sample, append, and repeat
| until you sample an end token. Is this just a limit of the
| OpenAI APIs, or something I'm forgetting?
| asabla wrote:
| This isn't unique to OpenAI models. A lot of the open source
| ones have similar limitations.
| xena wrote:
| It's the number of tokens the model can output in one pass.
| There's subtle differences between running it multiple times to
| get a bigger output and running it once to get a bigger output.
| These are things that only really show up when you integrate
| these models into production code.
| ComplexSystems wrote:
| No. Newer models have 128K of input tokens, but only 4096
| output tokens.
| arugulum wrote:
| Long story short: you are technically correct, but in
| practice things are a little different. There are 2 factors to
| consider here:
|
| 1. Model Capability
|
| You are right that mechanically, input and output tokens in a
| standard decoder Transformer are "the same". A 32K context
| should mean you can have 1 input token and 32K output tokens
| (you actually get 1 bonus token), or 32K input tokens and 1
| output token.
|
| However, if you feed an LM "too much" of its own output (read:
| have too long an output length), it empirically starts to go
| off the rails. The phrase "too much" is doing some work here:
| it's a balance of both (1) LLM labs having data that covers
| that many output tokens in an example and (2) LLM labs having
| empirical tests that give confidence the model won't go off
| the rails within some output limit. (Note, this isn't
| pretraining but the instruction tuning/RLHF after, so you
| don't just get examples for free.)
|
| In short, labs will often train a model targeting an output
| context length, and put out an offering based on that.
|
| 2. Infrastructure
|
| While mathematically having the model read external input and
| having it read its own output are the same, the infrastructure
| is wildly different. This is one of the first things you learn
| when deploying these models: you basically have a different
| stack for "encoding" and "decoding" (using those terms loosely;
| this is, after all, still a decoder-only model). This means you
| need to set max lengths for encoding and decoding separately.
|
| So, after a long time of optimizing both the implementation and
| length hyperparameters (or just winging it), the lab will
| decide "we have a good implementation for up to 31K input and
| 1k output" and then go from there. If they wanted to change
| that, there's a bunch of infrastructure work involved. And
| because of the economies of batching, you want many inputs to
| have as close to the same lengths as possible, so you want to
| offer fewer configurations (some of this bucketing may happen
| hidden from the user). Anyway, this is why it may
| become uneconomical to offer a model at a given length
| configuration (input or output) after some time.
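|
| You can see the same split in self-hosted serving stacks. A
| rough sketch with vLLM (the model name is just an example; the
| knobs are vLLM's own): the total context is fixed when the
| engine starts, while the per-request output budget is a
| separate sampling parameter.
|
|       from vllm import LLM, SamplingParams
|
|       # engine-level: total context (prompt + output) the
|       # server allocates KV cache for, fixed at startup
|       llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2",
|                 max_model_len=8192)
|
|       # request-level: how long the decode loop may run
|       params = SamplingParams(max_tokens=512)
|       out = llm.generate(["Summarize: ..."], params)
|       print(out[0].outputs[0].text)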
| JoeCortopassi wrote:
| A lot of people here haven't integrated GPT into a customer
| facing production system, and it shows
|
| gpt-4, gpt-4-turbo, and gpt-4o are not the same models. They are
| mostly close enough when you have a human in the loop, and loose
| constraints. But if you are building systems off of the (already
| fragile) prompt-based output, you will have to go through a very
| manual process of tuning your prompts to get the same/similar
| output out of the new model. It will break in weird ways that
| make you feel like you are trying to nail Jello to a tree.
|
| There are software tools/services that help with this, and a ton
| more that merely promise to, but most of the tooling around LLMs
| these days gives the illusion of a reliable tool rather than the
| results of one. It's still the early days of the gold rush, and
| everyone wants to be seen as one of the first.
| tmpz22 wrote:
| Maybe we shouldn't be selling products built on such a shaky
| foundation? Like health insurance products, for example.
|
| [2]: https://insurtechdigital.com/articles/chatgpt-the-risks-
| and-...
|
| --- please disregard [1]; it was a terrible initial source I
| pulled off Google
|
| [1]: https://medium.com/artivatic/use-of-chatgpt-4-in-health-
| insu...
| bcrl wrote:
| Minimum Viable Products are pretty much by definition built
| on shaky foundations. At least with software written by
| humans the failure modes are somewhat bounded by the
| architecture of the system as opposed to the who-knows-what-
| the-model-will-hallucinate of AI.
| nerdjon wrote:
| I think that is the key problem: a traditional MVP is a
| mostly known entity. It may be missing some features, have
| some bugs, etc. But it is an MVP not because it was
| necessarily rushed out the door (I mean... it was, but
| differently) but because it has some rough edges and is
| likely missing major features.
|
| Whereas what we seem to be getting, with a lot of these
| companies shoving AI into something and calling it a
| product, is an MVP that is an MVP due to its unknown and
| untested nature.
| mewpmewp2 wrote:
| But ultimately we have to test and release things to see
| what works and what doesn't. Very many use cases don't
| require perfect accuracy.
| zerkten wrote:
| The term MVP was cover for shoving poor quality software
| out on the market long before AI became involved. This is
| unfortunate, but inevitable when the term was
| popularized. AI is incredibly easy to tack on now, so
| people are doing that too.
| nerdjon wrote:
| That is true, but I think rushing to add AI features made
| it a completely different situation.
|
| We got a lot of MVP crap before, don't get me wrong. But
| at least it was understood crap. Sure, it may have had
| bugs in it, and that is to be expected. But there was a
| limit to how wrong it could go, since at the end of the
| day it was still limited to the code within the
| application and the server (if there was one).
|
| Meanwhile, when an over-reliance on an LLM goes wrong,
| depending on how it goes wrong, it could be catastrophic.
|
| As we have seen time and time again just in the last
| couple of months, when LLMs are shoved into something we
| seem to get a serious lack of testing under the guise of
| "beta".
| Sharlin wrote:
| Building products on shaky foundations is a tried-and-true
| approach in IT business.
| benreesman wrote:
| For a different point of view from someone with extremely
| credible credentials (learned this stuff from Hinton among
| many other things) and a much more sober and balanced take on
| all this I recommend the following interview with Nick Frosst
| (don't be put off by the clickbait YouTube title, that's a
| very silly caption):
|
| https://youtu.be/4JF1V2hzGKE
| djohnston wrote:
| > It will break in weird ways that make you feel like you are
| trying to nail Jello to a tree
|
| Probably the best description of working with LLM agents I've
| read
| visarga wrote:
| It gets more interesting when you get to benchmarking your
| prompts for accuracy. If you don't have an evaluation set you
| are flying blind. Any model update or small prompt fix could
| break edge cases without you even knowing.
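|
| Even a crude harness beats nothing. A sketch, assuming the
| openai client and a hand-maintained list of prompts paired
| with cheap programmatic checks (the single case here is just
| an illustration):
|
|       from openai import OpenAI
|
|       client = OpenAI()
|       CASES = [
|           ("Reply with only this JSON: {\"ok\": true}",
|            lambda out: out.strip().startswith("{")),
|       ]
|
|       def run_evals(model):
|           passed = 0
|           for prompt, check in CASES:
|               resp = client.chat.completions.create(
|                   model=model,
|                   messages=[{"role": "user",
|                              "content": prompt}])
|               if check(resp.choices[0].message.content):
|                   passed += 1
|           return passed / len(CASES)
|
|       # compare the pinned model against a candidate upgrade
|       print(run_evals("gpt-4-0613"),
|             run_evals("gpt-4o-2024-05-13"))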
| amluto wrote:
| Make sure you don't upload that evaluation set to any
| service that resells data (or gets scraped) for LLM
| training!
| djohnston wrote:
| We are using benchmarking on our own eval sets, which makes
| it easier to measure the variance that I've found
| impossible to eliminate.
| barrell wrote:
| Came here to say the same thing, it sums it up perfectly
| bbor wrote:
| My naive answer: turn away from Silicon Valley modernity with
| its unicorns and runways and ""marketing"", and embrace the
| boring stuffy academics! https://dspy-docs.vercel.app/
| tbarbugli wrote:
| hosted on Vercel and Github...
| mdp2021 wrote:
| Is it Winter already?
| stavros wrote:
| I never got DSPy. I only tried a brief example, but can
| someone explain why it's better than alternatives? Not that I
| hold LangChain in particularly high regard...
| adamgordonbell wrote:
| I've seen people mention this lib before and I have a hard
| time understanding the use cases and how it's used.
| outside1234 wrote:
| Hopefully you built a solid eval system around the core of your
| GenAI usage, otherwise, yes, this is going to be very painful
| :)
| mmastrac wrote:
| Looks like they are just cleaning house of lesser-used models?
| This came via mail last week:
|
|       Back in June 2023 and November 2023, we announced the
|       following models will be deprecated on June 13th, 2024:
|
|           gpt-3.5-turbo-0301
|           gpt-3.5-turbo-0613
|           gpt-3.5-turbo-16k-0613
|
|       We noticed that your organization recently used at least
|       one of these models. To help minimize any disruption, we
|       are extending your access to these models for an
|       additional 3 month grace period until September 13th,
|       2024. After this date, these models will be fully
|       decommissioned.
| nicce wrote:
| Probably more expensive as well.
| aftbit wrote:
| Ah I see, taking a page from the Google playbook and aggressively
| culling less popular products. I wonder how many sales Google's
| reputation for capricious culling has cost them.
| Terretta wrote:
| Arguably it's more like taking a page from versioned builds:
| not supporting old builds indefinitely, with notice and a
| grace period before deprecation.
___________________________________________________________________
(page generated 2024-06-16 23:01 UTC)