[HN Gopher] OpenAI and Microsoft Azure to deprecate GPT-4 32K
       ___________________________________________________________________
        
       OpenAI and Microsoft Azure to deprecate GPT-4 32K
        
       Author : tosh
       Score  : 60 points
       Date   : 2024-06-16 18:16 UTC (4 hours ago)
        
 (HTM) web link (twitter.com)
 (TXT) w3m dump (twitter.com)
        
       | antman wrote:
        | So much for code migrations! Although most examples are about
        | summarization or needle-in-a-haystack search, applications whose
        | output should be roughly the size of the input are probably more
        | important, although less advertised.
        | 
        | Curious whether that is a business decision or a technical one,
        | i.e. whether the optimizations that make 128k gpt-4o cheap and
        | fast only work for small outputs.
        
       | xena wrote:
        | One of the good parts about open-weights models is that this
        | can't happen to you: you keep access to any model you want. This
        | is essential for continuity of products.
        
         | MuffinFlavored wrote:
          | What is ChatGPT as a product, beyond just "really good
          | weights"?
        
           | xena wrote:
           | Really good weights that you don't have to run yourself
        
           | BiteCode_dev wrote:
           | Probably a lot that we don't know about, because ChatGPT has
           | better reasoning skills than most other systems out there,
           | and we know reasoning is not really about the weights.
        
             | Slyfox33 wrote:
             | Chatgpt can't reason about anything.
        
           | TeMPOraL wrote:
           | It's just _really_ good weights - significantly better
           | weights than anyone else has.
        
       | azeemba wrote:
       | I am having trouble understanding what the complaint is here.
       | 
       | The docs still mention bigger models with 128k tokens and smaller
       | models with 8k tokens. It seems reasonable to optimize for big
       | and small use cases differently? I don't see how we are being
       | "robbed".
        
         | xena wrote:
          | The main limit is that you can have 128k tokens of input, but
          | only 4k tokens of output per run. gpt-4-32k lets a single run
          | produce up to 32k tokens of output (minus whatever the input
          | uses). Some applications need that much output, especially for
          | token-dense things like code and JSON.
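          | 
          | For example, with the openai Python client (v1-style; the
          | model name and exact cap here are illustrative, not
          | authoritative), the output limit shows up as a ceiling on
          | max_tokens:
          | 
          |     # Illustrative sketch: a 128k-context model can take a huge
          |     # prompt, but the completion itself is capped (~4k tokens).
          |     from openai import OpenAI
          | 
          |     client = OpenAI()
          |     long_prompt = "..."  # placeholder for ~100k tokens of input
          |     resp = client.chat.completions.create(
          |         model="gpt-4o",   # large input context
          |         messages=[{"role": "user", "content": long_prompt}],
          |         max_tokens=4096,  # output cap per request (illustrative)
          |     )
          |     print(resp.choices[0].message.content)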
        
           | seeknotfind wrote:
            | So it's a price concern? Because you could run for 4k output
            | 8 times to get 32k? Or does the RLHF stuff prevent you from
            | feeding the output back in as more input and still getting a
            | decent result? The underlying transformers shouldn't care,
            | because they're effectively doing that already.
        
             | xena wrote:
              | I'd say it's less a price concern and more a concern about
              | consistency of output. I don't think it makes much sense
              | to continue incomplete JSON like that. I need to do some
              | more research.
        
           | peab wrote:
            | You can just feed that output into another call and have the
            | next call continue it, since you have more than 28k of spare
            | context. Per-token output speed is the same either way, so
            | speed isn't an issue. It's just slightly more dev work
            | (really only a couple of lines of code), as in the sketch
            | below.
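            | 
            | A rough sketch of that chaining (openai v1-style client
            | assumed; the "continue" framing is just one way to do it):
            | 
            |     # Sketch: build a long output by feeding each completion
            |     # back in as an assistant message and asking to continue.
            |     from openai import OpenAI
            | 
            |     client = OpenAI()
            | 
            |     def generate_long(prompt: str, max_chunks: int = 8) -> str:
            |         messages = [{"role": "user", "content": prompt}]
            |         parts = []
            |         for _ in range(max_chunks):
            |             resp = client.chat.completions.create(
            |                 model="gpt-4o", messages=messages, max_tokens=4096)
            |             choice = resp.choices[0]
            |             parts.append(choice.message.content)
            |             if choice.finish_reason != "length":
            |                 break  # the model stopped on its own
            |             messages.append({"role": "assistant",
            |                              "content": choice.message.content})
            |             messages.append({"role": "user",
            |                              "content": "Continue exactly where you left off."})
            |         return "".join(parts)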
        
             | DelightOne wrote:
                | How do you know it will have the same state of mind? And
                | how much does that cost?
        
               | jhgg wrote:
               | Because the state of mind is derived from the input
               | tokens.
        
               | DelightOne wrote:
                | Is there a study or anything guaranteeing that, if you
                | add an incomplete assistant response as the input, the
                | API picks up in exactly the same way from exactly the
                | same position?
        
               | sshumaker wrote:
                | It's how LLMs work: they are effectively recursive at
                | inference time. After each token is sampled, you feed it
                | back in, so you end up with the same model state
                | (sampling noise aside) as if that text had been the
                | original input prompt.
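                | 
                | A conceptual sketch of that loop (model, tokenizer, and
                | sampler here are stand-ins, not any real API):
                | 
                |     def decode(model, tokenizer, sample_fn, prompt,
                |                max_context=32_768):
                |         # Each sampled token is appended and fed back in,
                |         # so the state after N tokens is the same whether
                |         # they came from the prompt or earlier sampling.
                |         tokens = tokenizer.encode(prompt)
                |         while len(tokens) < max_context:
                |             logits = model(tokens)
                |             next_token = sample_fn(logits[-1])
                |             if next_token == tokenizer.eos_token_id:
                |                 break
                |             tokens.append(next_token)
                |         return tokenizer.decode(tokens)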
        
               | DelightOne wrote:
                | LLMs, sure. My question is whether it is the same in
                | practice for LLMs behind said API. As far as I can tell,
                | there is no official documentation that we will get
                | exactly the same result.
                | 
                | And no one here has touched on how high a multiple the
                | cost is, so I assume it's pretty high.
        
         | t-writescode wrote:
         | > I am having trouble understanding what the complaint is here.
         | 
          | The appropriate level of due diligence for each LLM model
          | transition is to run your various prompts against the new
          | model and make sure they still produce correct output; if they
          | don't, update the prompts until they do.
         | 
         | Just yesterday, I was experimenting with 4o and assumed I could
         | do a flat migration for some work. 4o actually provided worse
         | results - results I explicitly asked to *not* have in my 4
         | output (and that I didn't have in my 4 output).
         | 
         | It's tedious to have to change models after you've already done
         | a proper validation suite against one model.
         | 
         | That would be (at least my) complaint.
         | 
         | I've even version-stamped the models I use on purpose to avoid
         | surprises.
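          | 
          | A minimal sketch of that kind of regression check (openai
          | v1-style client assumed; the pinned snapshot names and test
          | cases are placeholders):
          | 
          |     # Run the same prompt suite against a pinned old snapshot
          |     # and a candidate new one, and compare simple checks.
          |     from openai import OpenAI
          | 
          |     client = OpenAI()
          |     SUITE = [  # (prompt, predicate the output must satisfy)
          |         ("Summarize this ticket: ...", lambda out: len(out) < 2000),
          |     ]
          | 
          |     def run(model: str, prompt: str) -> str:
          |         resp = client.chat.completions.create(
          |             model=model,  # pin exact snapshots, e.g. "gpt-4-0613"
          |             messages=[{"role": "user", "content": prompt}],
          |             temperature=0,
          |         )
          |         return resp.choices[0].message.content
          | 
          |     for prompt, check in SUITE:
          |         for model in ("gpt-4-0613", "gpt-4o-2024-05-13"):
          |             print(model, "pass" if check(run(model, prompt)) else "FAIL")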
        
           | hit8run wrote:
           | At that point why not use your own hosted open source model
           | that is more reproducible for you?
        
             | t-writescode wrote:
             | Cost of maintaining architecture, cost of complexity of
             | internal infrastructure, knowledge level required for self-
             | hosting, complexity of local one-boxing, and a slew of
             | other reasons.
        
               | xena wrote:
                | Everything's a tradeoff, but here part of the tradeoff
                | is that access to tools critical for your product to
                | function correctly can be taken away with no way to get
                | it back. Maybe that's an acceptable tradeoff, but I'd
                | personally not like living with it.
        
               | t-writescode wrote:
               | At present, my startup isn't making money - in fact, it's
               | not even released. As a result, I'm trying to prioritize
               | getting it out of the door while still being affordable
               | for myself enough to bootstrap it.
               | 
                | To do this, I've made, and continue to make, tradeoffs.
                | Among the tradeoffs I'm currently making, and intend to
                | resolve ASAP, is having OpenAI as a single point of
                | failure. I intend to add some of the other hosted
                | solutions as alternatives for LLM processing, and
                | self-hosting is one of the many options that will be
                | considered at that time.
               | 
               | I've already spent more time than I should perfecting
               | various smaller pieces, increasing reliability, etc. Each
               | time I choose perfection, I lose more time, more runway,
               | more potential market share; and, something I've recently
               | had to learn:
               | 
                | Each time I lock myself into a previous step to get it
                | perfect, I miss the lessons waiting in the next stage of
                | the process, including new issues that push the next
                | step's complexity above my initial estimates.
                | 
                | Everything is a tradeoff. Choosing a commercially
                | available solution with known, relatively fixed costs,
                | accepting that it may slowly change underfoot, and
                | knowing I have alternatives I can swap to fairly quickly
                | if an emergency comes up, is one tradeoff I've made.
        
             | ComplexSystems wrote:
             | Because open source models aren't as good as GPT-4.
        
           | psanford wrote:
            | Not directly addressing your point, but asking an LLM not to
            | include something in its output often doesn't work well.
            | It's a bit like saying to someone "whatever you do, don't
            | think about elephants."
        
           | jdsully wrote:
            | 4o is pretty bad in my experience. They shouldn't have used
            | the "4.0"-style naming for it; they should have called it
            | Lite or something. Instead they are trying to market it as
            | roughly equivalent, which it definitely is not.
        
         | matsemann wrote:
         | If you've spent ages fine tuning your prompt/context to have it
         | work for your integration, it's not a given it will work
         | similarly on a model of a different size. Might have to
         | essentially start from scratch.
        
       | kmeisthax wrote:
        | I'm confused here. What's an "output context"? My assumption was
        | that the context window is shared across input and (prior)
        | output: you put everything into the model at once, and at the
        | end the first unused context position becomes a vector you can
        | decode into a single token of output. Multi-token output then
        | means you repeatedly run inference, decode, sample, append, and
        | repeat until you sample an end token. Is this just a limit of
        | OpenAI's APIs, or something I'm forgetting?
        
         | asabla wrote:
          | This isn't unique to OpenAI models. A lot of the open-source
          | ones have similar limitations.
        
         | xena wrote:
          | It's the number of tokens the model can output in one pass.
          | There are subtle differences between running it multiple times
          | to get a bigger output and running it once to get a bigger
          | output. These are things that only really show up when you
          | integrate these models into production code.
        
         | ComplexSystems wrote:
          | No. Newer models accept 128K input tokens but emit at most
          | 4,096 output tokens.
        
         | arugulum wrote:
          | Long story short: you are technically correct, but in practice
          | things are a little different. There are two factors to
          | consider here:
         | 
         | 1. Model Capability
         | 
          | You are right that, mechanically, input and output tokens in a
          | standard decoder Transformer are "the same". A 32K context
          | should mean you can have 1 input token and 32K output tokens
          | (you actually get 1 bonus token), or 32K input tokens and 1
          | output token.
         | 
          | However, if you feed an LM "too much" of its own output (read:
          | let the output run too long), it empirically starts to go off
          | the rails. The phrase "too much" is doing some work here: it's
          | a balance of (1) LLM labs having training data whose examples
          | cover that many output tokens and (2) LLM labs having
          | empirical tests that give confidence the model won't go off
          | the rails within some output limit. (Note: this isn't
          | pretraining but the instruction tuning/RLHF afterward, so you
          | don't get such examples for free.)
         | 
         | In short, labs will often train a model targeting an output
         | context length, and put out an offering based on that.
         | 
         | 2. Infrastructure
         | 
          | While mathematically having the model read external input and
          | having it read its own output are the same, the infrastructure
          | is wildly different. This is one of the first things you learn
          | when deploying these models: you basically have a different
          | stack for "encoding" and "decoding" (using those terms
          | loosely; this is, after all, still a decoder-only model). This
          | means you need to set max lengths for encoding and decoding
          | separately.
         | 
          | So, after a long time spent optimizing both the implementation
          | and the length hyperparameters (or just winging it), the lab
          | will decide "we have a good implementation for up to 31K input
          | and 1K output" and go from there. If they want to change that,
          | there's a bunch of infrastructure work involved. And because
          | of the economies of batching, you want many inputs to have as
          | close to the same lengths as possible, so you want to offer
          | fewer configurations (some of this bucketing may happen hidden
          | from the user). Anyway, this is why it may become uneconomical
          | to offer a model at a given length configuration (input or
          | output) after some time.
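          | 
          | A rough illustration of that budgeting (the numbers are
          | examples, not official limits for any particular model):
          | 
          |     # Output is limited both by the shared context window and
          |     # by the separately provisioned output cap.
          |     CONTEXT_WINDOW = 128_000
          |     MAX_OUTPUT = 4_096
          | 
          |     def output_budget(input_tokens: int) -> int:
          |         return max(0, min(MAX_OUTPUT, CONTEXT_WINDOW - input_tokens))
          | 
          |     print(output_budget(120_000))  # 4096: the output cap binds
          |     print(output_budget(127_000))  # 1000: the context window binds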
        
       | JoeCortopassi wrote:
        | A lot of people here haven't integrated GPT into a
        | customer-facing production system, and it shows.
       | 
       | gpt-4, gpt-4-turbo, and gpt-4o are not the same models. They are
       | mostly close enough when you have a human in the loop, and loose
       | constraints. But if you are building systems off of the (already
        | fragile) prompt-based output, you will have to go through a very
        | manual process of tuning your prompts to get the same or similar
        | output out of the new model. It will break in weird ways that
        | make you feel like you are trying to nail Jello to a tree.
       | 
        | There are software tools/services that help with this, and a ton
        | more that merely promise to, but most of the tooling around LLMs
        | these days gives the illusion of a reliable tool rather than the
        | results of one. It's still the early days of the gold rush, and
        | everyone wants to be seen as one of the first.
        
         | tmpz22 wrote:
         | Maybe we shouldn't be selling products built on such a shaky
         | foundation? Like Health Insurance products for example.
         | 
         | [2]: https://insurtechdigital.com/articles/chatgpt-the-risks-
         | and-...
         | 
          | --- please disregard [1]; it was a terrible initial source I
          | pulled off Google
         | 
         | [1]: https://medium.com/artivatic/use-of-chatgpt-4-in-health-
         | insu...
        
           | bcrl wrote:
           | Minimum Viable Products are pretty much by definition built
           | on shaky foundations. At least with software written by
           | humans the failure modes are somewhat bounded by the
           | architecture of the system as opposed to the who-knows-what-
           | the-model-will-hallucinate of AI.
        
             | nerdjon wrote:
              | I think that is the key problem: a traditional MVP is a
              | mostly known entity. It may be missing some features, have
              | some bugs, etc. But it is an MVP not because it was
              | necessarily rushed out the door (I mean... it was, but in
              | a different sense) but because it has some rough edges and
              | is likely missing major features.
              | 
              | Whereas what we seem to be getting from a lot of these
              | companies shoving AI into something and calling it a
              | product is an MVP that is an MVP due to its unknown and
              | untested nature.
        
               | mewpmewp2 wrote:
                | But ultimately we have to test and release things to see
                | what works and what doesn't. Many use cases don't
                | require perfect accuracy.
        
               | zerkten wrote:
                | The term MVP was cover for shoving poor-quality software
                | onto the market long before AI became involved. That's
                | unfortunate, but it was inevitable once the term was
                | popularized. AI is incredibly easy to tack on now, so
                | people are doing that too.
        
               | nerdjon wrote:
                | That is true, but I think rushing to add AI features
                | makes it a completely different situation.
                | 
                | We got a lot of MVP crap before, don't get me wrong. But
                | at least it was understood crap. Sure, it may have had
                | bugs in it, and that is to be expected, but there was a
                | limit to how wrong it could go, since at the end of the
                | day it was still limited to the code within the
                | application and the server (if there is one).
                | 
                | Meanwhile, when over-reliance on an LLM goes wrong,
                | depending on how it goes wrong, the result could be
                | catastrophic.
                | 
                | As we have seen time and time again just in the last
                | couple of months, when LLMs are shoved into something we
                | seem to get a serious lack of testing under the guise of
                | "beta".
        
           | Sharlin wrote:
           | Building products on shaky foundations is a tried-and-true
           | approach in IT business.
        
           | benreesman wrote:
            | For a different point of view from someone with extremely
            | credible credentials (he learned this stuff from Hinton,
            | among many other things) and a much more sober and balanced
            | take on all this, I recommend the following interview with
            | Nick Frosst (don't be put off by the clickbait YouTube
            | title; it's a very silly caption):
            | 
            | https://youtu.be/4JF1V2hzGKE
        
         | djohnston wrote:
          | > It will break in weird ways that make you feel like you are
          | trying to nail Jello to a tree
         | 
         | Probably the best description of working with LLM agents I've
         | read
        
           | visarga wrote:
            | It gets more interesting when you get to benchmarking your
            | prompts for accuracy. If you don't have an evaluation set,
            | you are flying blind: any model update or small fix could
            | break edge cases without you knowing.
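            | 
            | A tiny sketch of such an evaluation set (call_model and the
            | labeled examples are placeholders):
            | 
            |     # Score a prompt/model combination against labeled cases
            |     # so regressions show up as a number, not a feeling.
            |     EVAL_SET = [  # (input, expected substring)
            |         ("2+2", "4"),
            |         ("capital of France", "Paris"),
            |     ]
            | 
            |     def accuracy(call_model, eval_set=EVAL_SET) -> float:
            |         hits = sum(expected.lower() in call_model(x).lower()
            |                    for x, expected in eval_set)
            |         return hits / len(eval_set)
            | 
            |     # Usage: accuracy(lambda x: my_llm(PROMPT_TEMPLATE.format(x)))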
        
             | amluto wrote:
             | Make sure you don't upload that evaluation set to any
             | service that resells data (or gets scraped) for LLM
             | training!
        
             | djohnston wrote:
             | We are using benchmarking on our own eval sets, which makes
             | it easier to measure the variance that I've found
             | impossible to eliminate.
        
           | barrell wrote:
           | Came here to say the same thing, it sums it up perfectly
        
         | bbor wrote:
          | My naive answer: turn away from Silicon Valley modernity with
          | its unicorns and runways and "marketing", and embrace the
          | boring, stuffy academics! https://dspy-docs.vercel.app/
        
           | tbarbugli wrote:
           | hosted on Vercel and Github...
        
           | mdp2021 wrote:
           | Is it Winter already?
        
           | stavros wrote:
           | I never got DSPy. I only tried a brief example, but can
           | someone explain why it's better than alternatives? Not that I
           | hold LangChain in particularly high regard...
        
           | adamgordonbell wrote:
            | I've seen people mention this lib before, and I have a hard
            | time understanding the use cases and how it's used.
        
         | outside1234 wrote:
         | Hopefully you built a solid eval system around the core of your
         | GenAI usage, otherwise, yes, this is going to be very painful
         | :)
        
       | mmastrac wrote:
       | Looks like they are just cleaning house of lesser-used models?
        | This came via mail last week:
        | 
        |     Back in June 2023 and November 2023, we announced the
        |     following models will be deprecated on June 13th, 2024:
        | 
        |       gpt-3.5-turbo-0301
        |       gpt-3.5-turbo-0613
        |       gpt-3.5-turbo-16k-0613
        | 
        |     We noticed that your organization recently used at least one
        |     of these models. To help minimize any disruption, we are
        |     extending your access to these models for an additional
        |     3 month grace period until September 13th, 2024. After this
        |     date, these models will be fully decommissioned.
        
         | nicce wrote:
         | Probably more expensive as well.
        
       | aftbit wrote:
       | Ah I see, taking a page from the Google playbook and aggressively
       | culling less popular products. I wonder how many sales Google's
       | reputation for capricious culling has cost them.
        
         | Terretta wrote:
          | Arguably it's more like taking a page from versioned builds:
          | old builds aren't supported indefinitely, and there's a notice
          | and grace period before deprecation.
        
       ___________________________________________________________________
       (page generated 2024-06-16 23:01 UTC)