[HN Gopher] Who is working on forward and backward compatibility...
       ___________________________________________________________________
        
       Who is working on forward and backward compatibility for LLMs?
        
       Author : nijfranck
       Score  : 74 points
       Date   : 2023-06-09 15:02 UTC (7 hours ago)
        
 (HTM) web link (huyenchip.com)
 (TXT) w3m dump (huyenchip.com)
        
       | brucethemoose2 wrote:
       | I dunno...
       | 
       | This sounds like making diffusion backwards compatible with
       | ESRGAN. _Technically_ they are both upscaling denoisers (with
       | finetunes for specific tasks), and you can set up objective tests
        | compatible with both, but the actual way they are used is so
        | different that it's not even a good performance measurement.
       | 
       | The same thing applies to recent LLMs, and the structural changes
       | are only going to get more drastic and fundamental. For instance,
        | what about LLMs with separate instruction and data context? Or
       | multimodal LLMs with multiple inputs/outputs? Or LLMs that
       | finetune themselves during inference? That is just scratching the
       | surface.
        
         | TeMPOraL wrote:
          | > _what about LLMs with separate instruction and data context?_
         | 
          | Do such architectures exist? Isn't this separation
          | _impossible_, for fundamental reasons?
        
       | beepbooptheory wrote:
       | Like 4 months ago people were saying the Singularity has pretty
       | much already happened and everything is going to change/the world
       | is over, but here we are now dealing with hard and very boring
       | problems around versioning/hardening already somewhat counter-
        | intuitive and highly-engineered prompts in order to hopefully eke
       | out a single piece of consistent functionality, maybe.
        
       | aldousd666 wrote:
        | Meta is getting it done for free by releasing their models as
        | open source. Now everyone is building things that work with
        | their models.
        
       | HarHarVeryFunny wrote:
       | OpenAI have some degree of versioning with the models used by
       | their APIs, but it seems they are perhaps still updating (fine
       | tuning) models without changing the model name/version. For
       | ChatGPT itself (not the APIs) many people have reported recent
       | regressions in capability, so it seems the model is being changed
       | there too.
       | 
        | As people start to use these APIs in production, there needs to
       | be stricter version control, especially given how complex
       | (impossible unless you are only using a fixed set of prompts) it
       | is for anyone to test for backwards compatibility. Maybe
       | something like Ubuntu's stable long-term releases vs bleeding
       | edge ones would work. Have some models that are guaranteed not to
       | change for a specified amount of time, and others that will be
       | periodically updated for people who want cutting edge behavior
       | and care less about backwards compatibility.
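        | 
        | Concretely, what I have in mind would look something like this
        | from the caller's side (rough sketch with the openai Python
        | client; the model names are just examples):
        | 
        |     import openai
        | 
        |     messages = [{"role": "user",
        |                  "content": "Summarize this support ticket: ..."}]
        | 
        |     # Floating alias: behavior may change whenever the model is
        |     # updated behind the scenes ("bleeding edge").
        |     nightly = openai.ChatCompletion.create(
        |         model="gpt-3.5-turbo", messages=messages)
        | 
        |     # Dated snapshot: the "LTS"-style option I'm describing,
        |     # guaranteed not to change for a stated period.
        |     pinned = openai.ChatCompletion.create(
        |         model="gpt-3.5-turbo-0301", messages=messages)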
        
         | d_watt wrote:
          | Regarding your comment on updating models, are you saying you
          | think they're updating the "pinned" models, e.g. gpt-4-0314?
         | 
         | Otherwise, I think effectively they already have what you're
         | describing with LTS models as pinned versions, and the
         | unversioned model is effectively the "nightly."
         | 
          | From what I've seen in the materials, it seems like if you pay
         | for dedicated compute, you can also have some control over your
         | model versions.
        
           | HarHarVeryFunny wrote:
           | I don't know. There have been a lot of recent complaints
           | about changed behavior, but not sure which model versions
           | people are talking about.
        
         | cj wrote:
         | I agree with this.
         | 
         | Although there are APIs used in production whose output
         | changes/evolves over time.
         | 
         | First one that comes to mind is Google Translate. We spend
         | 6-figures with the Google Translate API annually, and recently
         | we went back and checked if Google Translate is
         | improving/changing the translations over time, and found that
         | they indeed are (presumably as they improve their internal
          | models, which aren't versioned or exposed in a changelog
         | anywhere). The majority of translations were different for the
         | same content today compared to 6 months ago.
         | 
         | I don't particularly agree with this approach. Speaking as a
         | power user of Google Translate API, it would be nice to be able
          | to pin to a specific version/model and then manually upgrade
         | versions (with a changelog to understand what's changing under
         | the hood).
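          | 
          | For anyone curious, the drift check itself can be as simple as
          | something like this (rough sketch, not our actual code; the
          | client calls follow the google-cloud-translate v2 library and
          | the snapshot filename is made up):
          | 
          |     import json
          |     from google.cloud import translate_v2 as translate
          | 
          |     client = translate.Client()
          | 
          |     # Snapshot saved ~6 months ago: {source_text: translation}
          |     with open("translations_2022-12.json") as f:
          |         old = json.load(f)
          | 
          |     changed = 0
          |     for source, old_translation in old.items():
          |         result = client.translate(source, target_language="de")
          |         if result["translatedText"] != old_translation:
          |             changed += 1
          | 
          |     print(f"{changed}/{len(old)} translations differ from the"
          |           " 6-month-old snapshot")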
        
           | ladberg wrote:
           | At this point the changelog is surely just stuff along the
           | lines of "retrained with more data" and "slightly tweaked
           | model architecture in a random way that improved
           | performance".
           | 
           | And out of curiosity: as someone with a lot of expertise and
           | money on the line, how would you compare Google Translate
           | with LLMs? And also smaller self-hosted models with bigger
           | ones that require API access like OpenAI? Do they perform
           | better or worse and are they cheaper or more expensive?
        
         | [deleted]
        
         | nico wrote:
         | My guess is that they are doing something like:
         | 
         | * use some version of GPT, to preprocess the prompts
         | 
         | * send preprocessed prompt to "real" model for inference
         | 
         | * post process result to filter out undesired output
         | 
         | Then they can honestly say they haven't changed the version of
         | the model, without telling you that they have probably changed
         | a lot of their pipeline to process your prompts and deliver a
          | result for you.
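          | 
          | In pseudocode, the kind of wrapper I'm imagining (pure
          | speculation; call_model and moderation_filter are made-up
          | helpers standing in for whatever they actually run):
          | 
          |     def handle_request(user_prompt: str) -> str:
          |         # Step 1 (speculative): a cheaper model rewrites or
          |         # sanitizes the incoming prompt.
          |         cleaned = call_model("gpt-3.5-turbo",
          |                              "Rewrite safely: " + user_prompt)
          | 
          |         # Step 2: the "real", unchanged model does the actual
          |         # completion.
          |         raw_answer = call_model("gpt-4-0314", cleaned)
          | 
          |         # Step 3 (speculative): filter undesired content out of
          |         # the output before returning it.
          |         return moderation_filter(raw_answer)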
        
           | jstarfish wrote:
           | In the last few years I've noticed lying by omission has
           | become the new fun corporate/gen-z internet trend (see also:
           | overfunding/gofundme fraud). Like priests and fund managers,
           | their product is a black box, and there's a lot of mischief
           | you can get into when you're entrusted with one of those.
           | They play a fucked-up game of Rumpelstiltskin, where they
           | mislead by default and only admit the truth if you can guess
           | the right question to ask.
           | 
           | You're on the right track, and I too think that's what their
           | actual pipeline looks like, but you're missing a step. I
           | think there's another step where they effectively alter the
           | output of production models by hot-swapping different LORAs
           | (or whatever) to them.
           | 
           | This lets them plausibly claim they haven't changed the
            | version of the model, because they _haven't_ messed with the
           | model. They messed with middleware, which nobody knows enough
           | about to press them on. You ask them if anything changed with
           | the model/API, they say no, and leave you to think you're
           | going crazy because shit's just not working like it was last
           | week.
           | 
           | Nobody's asking them about changes to middleware though,
           | which genuinely surprises me. I am never the smartest person
           | in the room-- only the most skeptical.
        
         | deeviant wrote:
         | FYI, OpenAI has denied that there have been any unannounced
         | changes to the models.
         | 
         | And it seems more than possible that people see what they want
          | to see in LLM output. So I would be careful about making
          | completely unsupported claims.
        
           | ineedasername wrote:
           | Sure but they could be playing with semantics a bit as well.
           | When they say "models" they might just mean the LLM that it's
            | all based on. But there's a lot more going on in the
           | pipeline to turn that into a consumer facing service like
           | ChatGPT. They might have changed any combination of the
           | following:
           | 
           | 1) Fine Tuning
           | 
           | 2) Embedding
           | 
           | 3) The initializing prompt
           | 
           | 4) Filtering a prompt prior to ingestion & tokenization of
           | the prompt
           | 
           | 5) Filtering the output from the application after it has
           | generated a response.
           | 
           | The statement "we have no unannounced changes to the models"
           | can be _true_ while still substantially changing
           | functionality  & response quality through any of the 5 above
           | areas, and probably some I missed.
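            | 
            | As a toy example of #3 alone: the same pinned model behaves
            | noticeably differently when the initializing prompt changes
            | (rough sketch with the openai Python client; both system
            | prompts here are made up):
            | 
            |     import openai
            | 
            |     question = [{"role": "user",
            |                  "content": "Should I pin my model version?"}]
            | 
            |     # Same model, two hypothetical initializing prompts.
            |     for system in (
            |             "You are a helpful assistant.",
            |             "You are a cautious assistant. Keep answers short "
            |             "and hedge heavily."):
            |         resp = openai.ChatCompletion.create(
            |             model="gpt-4-0314",
            |             messages=[{"role": "system", "content": system}]
            |                      + question,
            |         )
            |         print(resp.choices[0].message.content)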
        
           | qumpis wrote:
            | I'd be pleasantly surprised if the model has actually
            | remained the same. Every time I see speedups in its
            | generation speed I assume they've distilled the model
            | further. The outputs
           | subjectively also feel weaker. Surely someone has compared
           | the outputs from the beginning and now?
        
       | ITB wrote:
       | I suggest this is the wrong way to think about this. Alexa tried
        | for a very long time to agree on an "Alexa Ontology" and it just
       | doesn't work for large enough surface areas. Testing that new
       | versions of LLMs work is better than trying to make everything
        | backward compatible. Also, the "structured" component of the
        | response (e.g. "send your answer in JSON format") should be
        | something not super brittle. In fact, if the structure takes a
        | lot of prompting to work, you are probably setting yourself up
        | for failure.
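        | 
        | By "not super brittle" I mean things like parsing defensively
        | instead of assuming the reply is exactly JSON (small sketch; the
        | extraction heuristic is just an example):
        | 
        |     import json
        |     import re
        | 
        |     def parse_model_json(text: str) -> dict:
        |         # Tolerate chatty wrappers like "Sure, here is the JSON:
        |         # {...}" by pulling out the first {...} block instead of
        |         # requiring the whole response to be valid JSON.
        |         match = re.search(r"\{.*\}", text, re.DOTALL)
        |         if not match:
        |             raise ValueError("no JSON object in model output")
        |         return json.loads(match.group(0))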
        
       | nijfranck wrote:
        | When a newer LLM comes out (e.g. GPT-3.5 to GPT-4), your old
       | prompts become obsolete. How are you solving this problem in your
       | company? Are there companies working on solving this problem?
        
         | [deleted]
        
         | devit wrote:
         | What? Prompts are natural language, so I don't see how they can
         | possibly become obsolete.
        
           | ineedasername wrote:
            | I have to prompt engineer a lot more with 3.5 than 4. The way
            | I ask questions and convey what I want tends to be much more
            | structured with 3.5, in a less natural way than I can do with
            | 4. Hopefully 4 would be even better at answering a structured
            | prompt like that, but also maybe not: for quick questions
            | with a short answer, 3.5 will sometimes give a simpler answer
            | than 4, but 3.5 is still correct. 4 isn't necessarily wrong,
            | but it sort of reads into the question a bit more, so the
            | answer is less succinct, with more caveats and nuances
            | explained, etc. In cases like this, even though both give a
            | correct answer, the one from 4 may be undesirable. You don't
            | want to have to read through an extra paragraph to pick out
            | the answer to your question. There's more friction.
           | 
            | Of course the above scenario is easily solved: change your
            | prompt to include "Be brief". But that's exactly the
            | argument-- the old prompt is at least in part obsolete and
            | must change to achieve functional equivalence with 4. And
            | then you need to check for unanticipated changes to the
            | answer that "be brief" would cause: maybe it would now be too
            | brief! Maybe not, but you have to have some method of
            | checking.
        
           | tyree731 wrote:
           | They can. Results change as you change your models, and
           | results aren't always strictly better or worse, which is why
           | testing gold-standard results with any prompt and model
           | changes is so important for applications utilizing LLMs.
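            | 
            | Even a dumb table of gold-standard cases, re-run on every
            | prompt or model change, goes a long way (sketch; run_prompt
            | is whatever callable wraps your model client, and the cases
            | are made up):
            | 
            |     GOLD_CASES = [
            |         # (input text, predicate the output must satisfy)
            |         ("Total due: $42.10", lambda out: "42.10" in out),
            |         ("I love this product!",
            |          lambda out: "positive" in out.lower()),
            |     ]
            | 
            |     def regression_pass_rate(run_prompt, template) -> float:
            |         # run_prompt: takes the filled-in prompt, returns the
            |         # model's text output.
            |         passed = 0
            |         for text, check in GOLD_CASES:
            |             if check(run_prompt(template.format(text=text))):
            |                 passed += 1
            |         return passed / len(GOLD_CASES)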
        
           | SkyPuncher wrote:
           | They're loosely natural language.
           | 
           | There's a lot of tweaking and non-natural language that goes
           | into them to get the exact results you expect.
        
           | TeMPOraL wrote:
           | Prompts are natural language, but you're using them with the
           | model in a way similar to getting a split-second gut feel
           | reaction from a human - that reaction may very well vary
           | between people.
        
           | electroly wrote:
           | The performance of the model can be improved with tweaks to
           | the prompt, but the tweaks end up being model-specific. This
           | is why "prompt engineering" exists for productionized use
           | cases instead of people just spitting words semi-randomly
           | into a textbox. Your old prompts probably won't completely
           | fail but they'll behave differently under a different model.
        
           | cosmojg wrote:
           | RLHF and fine-tuning! While these methods make prompting more
           | accessible and approachable to people unfamiliar with LLMs
           | and otherwise expecting an omniscient chatbot, they make the
           | underlying dynamics a lot more unstable. Personally, I prefer
           | the untuned base models. In fact, I depend upon a set of
           | high-quality prompts (none of which are questions or
           | instructions) which perform similarly across _different_ base
           | models of _different_ sizes (e.g., GPT-2-1.5B, code-
           | davinci-002, LLaMA-65B, etc.) but frequently break between
           | different instruction-tuned models and different versions of
            | the _same_ instruction-tuned model (I think Google's
           | Flan-T5-XXL has been the only standout exception in my tests,
           | consistently outperforming its corresponding base model, and
           | although it's not saying much, I admit that GPT-4 does do a
           | lot better than GPT-3.5-turbo in remaining consistent across
           | updates).
        
         | iLoveOncall wrote:
         | We solve this by not building a company whose sole value
          | proposition is a thousand characters that you feed into an AI
         | that you don't control.
        
           | IanCal wrote:
           | You can use it for processes or features without having your
           | whole company be based on it.
        
       | lachlan_gray wrote:
       | LMQL helps a lot with this kind of thing. It makes it really easy
       | to swap prompts and models out, and in general it allows you to
       | maintain your prompt workflows in whatever way you maintain the
       | rest of your python code.
       | 
       | I'm expecting there will be more examples soon, but you can check
        | out my tree of thoughts implementation below to see what I mean:
       | 
       | https://github.com/LachlanGray/lmql-tree-of-thoughts
        
       | netruk44 wrote:
       | > If you expect the models you use to change at all, it's
       | important to unit-test all your prompts using evaluation
       | examples.
       | 
       | It's mentioned earlier in the article, but I'd like to emphasize
        | that if you go down this route you should either do
       | _multiple_ evaluations per prompt and come up with some kind of
       | averaged result, or set the temperature to 0.
       | 
       | FTA:
       | 
       | > LLMs are stochastic - there's no guarantee that an LLM will
       | give you the same output for the same input every time.
       | 
       | > You can force an LLM to give the same response by setting
       | temperature = 0, which is, in general, a good practice.
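        | 
        | In practice that can look roughly like this (sketch with the
        | openai Python client; check is whatever pass/fail rule fits
        | your prompt):
        | 
        |     import openai
        | 
        |     def prompt_pass_rate(prompt: str, check, n_runs: int = 5):
        |         """Run the same prompt n_runs times, return pass rate."""
        |         passes = 0
        |         for _ in range(n_runs):
        |             resp = openai.ChatCompletion.create(
        |                 model="gpt-3.5-turbo-0301",
        |                 messages=[{"role": "user", "content": prompt}],
        |                 # 0 for (near-)repeatability; drop it and rely on
        |                 # n_runs to average things out if you need
        |                 # higher-temperature behavior.
        |                 temperature=0,
        |             )
        |             if check(resp.choices[0].message.content):
        |                 passes += 1
        |         return passes / n_runs
        | 
        |     # e.g. require at least 4/5 runs to pass:
        |     assert prompt_pass_rate("Reply with exactly: OK",
        |                             lambda out: out.strip() == "OK") >= 0.8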
        
         | bequanna wrote:
         | Serious question: how do you unit test variable text output
          | from an LLM?
        
         | jerpint wrote:
         | Temperature = 0 will give deterministic results, but might not
         | be as "creative". Also it's not enough to guarantee determinism
         | , hardware executing the LLM can lead to different results as
         | well
        
           | netruk44 wrote:
           | In terms of being part of a test suite, I think determinism >
            | creativity in the response. But I would agree there are
            | probably rough edges there; it's possible that some prompts
           | never perform well with temperature set to 0.
        
         | morelisp wrote:
         | Even setting temp to 0 retains some nondeterminism.
        
       ___________________________________________________________________
       (page generated 2023-06-09 23:01 UTC)