[HN Gopher] Who is working on forward and backward compatibility...
___________________________________________________________________
Who is working on forward and backward compatibility for LLMs?
Author : nijfranck
Score : 74 points
Date : 2023-06-09 15:02 UTC (7 hours ago)
(HTM) web link (huyenchip.com)
(TXT) w3m dump (huyenchip.com)
| brucethemoose2 wrote:
| I dunno...
|
| This sounds like making diffusion backwards compatible with
| ESRGAN. _Technically_ they are both upscaling denoisers (with
| finetunes for specific tasks), and you can set up objective tests
| compatible with both, but the actual way they are used is so
| different that it's not even a good performance measurement.
|
| The same thing applies to recent LLMs, and the structural changes
| are only going to get more drastic and fundamental. For instance,
| what about LLMs with separate instruction and data context? Or
| multimodal LLMs with multiple inputs/outputs? Or LLMs that
| finetune themselves during inference? That is just scratching the
| surface.
| TeMPOraL wrote:
| > _what about LLMs with separate instruction and data context?_
|
| Do such architectures exist? Isn't this separation _impossible_
| , for fundamental reasons?
| beepbooptheory wrote:
| Like 4 months ago people were saying the Singularity has pretty
| much already happened and everything is going to change/the world
| is over, but here we are now dealing with hard and very boring
| problems around versioning/hardening already somewhat counter-
| intuitive and highly-engineered prompts in order to hopefully eke
| out a single piece of consistent functionality, maybe.
| aldousd666 wrote:
| Meta is getting it done for free by releasing their models open
| source. Now everyone is building things that work with their
| models.
| HarHarVeryFunny wrote:
| OpenAI have some degree of versioning with the models used by
| their APIs, but it seems they are perhaps still updating (fine
| tuning) models without changing the model name/version. For
| ChatGPT itself (not the APIs) many people have reported recent
| regressions in capability, so it seems the model is being changed
| there too.
|
| As people start to use these APIs in production, there needs to
| be stricter version control, especially given how complex
| (impossible unless you are only using a fixed set of prompts) it
| is for anyone to test for backwards compatibility. Maybe
| something like Ubuntu's stable long-term releases vs bleeding
| edge ones would work. Have some models that are guaranteed not to
| change for a specified amount of time, and others that will be
| periodically updated for people who want cutting edge behavior
| and care less about backwards compatibility.
| d_watt wrote:
| Regarding your comment on updating models, are you saying you
| think they're updating the "pinned" models, e.g. gpt-4-0314?
|
| Otherwise, I think effectively they already have what you're
| describing with LTS models as pinned versions, and the
| unversioned model is effectively the "nightly."
|
| From what I've seen in the materials, it seems like if you pay
| for dedicated compute, you can also have some control over your
| model versions.
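|
| As a rough sketch of that split (using the 2023-era openai
| Python client; the prompt and model names are just
| illustrative), the difference is basically:
|
|     import openai  # assumes the pre-1.0 (0.27.x) client
|
|     msgs = [{"role": "user", "content": "Summarize: ..."}]
|
|     # Floating alias: silently tracks whatever is currently
|     # being served behind the name.
|     latest = openai.ChatCompletion.create(
|         model="gpt-4", messages=msgs)
|
|     # Dated snapshot: stays fixed until the snapshot is retired.
|     pinned = openai.ChatCompletion.create(
|         model="gpt-4-0314", messages=msgs)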
| HarHarVeryFunny wrote:
| I don't know. There have been a lot of recent complaints
| about changed behavior, but not sure which model versions
| people are talking about.
| cj wrote:
| I agree with this.
|
| Although there are APIs used in production whose output
| changes/evolves over time.
|
| First one that comes to mind is Google Translate. We spend
| 6-figures with the Google Translate API annually, and recently
| we went back and checked if Google Translate is
| improving/changing the translations over time, and found that
| they indeed are (presumably as they improve their internal
| models, which aren't versioned and aren't exposed in a changelog
| anywhere). The majority of translations were different for the
| same content today compared to 6 months ago.
|
| I don't particularly agree with this approach. Speaking as a
| power user of Google Translate API, it would be nice to be able
| to pin to a specific version/model and then manually upgrade
| versions (with a changelog to understand what's changing under
| the hood).
| ladberg wrote:
| At this point the changelog is surely just stuff along the
| lines of "retrained with more data" and "slightly tweaked
| model architecture in a random way that improved
| performance".
|
| And out of curiosity: as someone with a lot of expertise and
| money on the line, how would you compare Google Translate
| with LLMs? And also smaller self-hosted models with bigger
| ones that require API access like OpenAI? Do they perform
| better or worse and are they cheaper or more expensive?
| [deleted]
| nico wrote:
| My guess is that they are doing something like:
|
| * use some version of GPT, to preprocess the prompts
|
| * send preprocessed prompt to "real" model for inference
|
| * post process result to filter out undesired output
|
| Then they can honestly say they haven't changed the version of
| the model, without telling you that they have probably changed
| a lot of their pipeline to process your prompts and deliver a
| result for you.
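|
| Purely as a sketch of that guess (every name below is made up;
| this is not OpenAI's actual pipeline):
|
|     def answer(user_prompt: str) -> str:
|         # 1. hypothetical preprocessing pass over the prompt
|         cleaned = preprocess(user_prompt)     # made-up helper
|         # 2. the "real", unchanged model does the inference
|         raw = pinned_model.complete(cleaned)  # made-up client
|         # 3. post-filter undesired output before returning it
|         return postprocess(raw)               # made-up helper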
| jstarfish wrote:
| In the last few years I've noticed lying by omission has
| become the new fun corporate/gen-z internet trend (see also:
| overfunding/gofundme fraud). Like priests and fund managers,
| their product is a black box, and there's a lot of mischief
| you can get into when you're entrusted with one of those.
| They play a fucked-up game of Rumpelstiltskin, where they
| mislead by default and only admit the truth if you can guess
| the right question to ask.
|
| You're on the right track, and I too think that's what their
| actual pipeline looks like, but you're missing a step. I
| think there's another step where they effectively alter the
| output of production models by hot-swapping different LoRAs
| (or whatever) to them.
|
| This lets them plausibly claim they haven't changed the
| version of the model, because they _haven 't_ messed with the
| model. They messed with middleware, which nobody knows enough
| about to press them on. You ask them if anything changed with
| the model/API, they say no, and leave you to think you're
| going crazy because shit's just not working like it was last
| week.
|
| Nobody's asking them about changes to middleware though,
| which genuinely surprises me. I am never the smartest person
| in the room-- only the most skeptical.
| deeviant wrote:
| FYI, OpenAI has denied that there have been any unannounced
| changes to the models.
|
| And it seems more than possible that people see what they want
| to see in LLM output. So I would be careful about making
| completely unsupported claims.
| ineedasername wrote:
| Sure but they could be playing with semantics a bit as well.
| When they say "models" they might just mean the LLM that it's
| all based on. But there's a lot more going on in the
| pipeline to turn that into a consumer facing service like
| ChatGPT. They might have changed any combination of the
| following:
|
| 1) Fine Tuning
|
| 2) Embedding
|
| 3) The initializing prompt
|
| 4) Filtering a prompt prior to ingestion & tokenization of
| the prompt
|
| 5) Filtering the output from the application after it has
| generated a response.
|
| The statement "we have no unannounced changes to the models"
| can be _true_ while functionality & response quality still
| change substantially through any of the 5 areas above, and
| probably some I missed.
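|
| To illustrate point 3 with a made-up example (2023-era openai
| client; nothing here reflects OpenAI's real system prompt),
| the served "model" string never changes even though behavior
| does:
|
|     import openai
|
|     def chat(system_prompt, user_msg):
|         return openai.ChatCompletion.create(
|             model="gpt-3.5-turbo-0301",  # pinned, unchanged
|             messages=[
|                 {"role": "system", "content": system_prompt},
|                 {"role": "user", "content": user_msg},
|             ])
|
|     v1 = chat("You are a helpful assistant.", "Explain DNS.")
|     v2 = chat("Answer in one short sentence.", "Explain DNS.")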
| qumpis wrote:
| I'd be so pleasantly surprised if the model has actually
| remained the same. Every time I see speedups in its generation
| speed I assume they've distilled the model further. The outputs
| subjectively also feel weaker. Surely someone has compared
| the outputs from the beginning and now?
| ITB wrote:
| I suggest this is the wrong way to think about this. Alexa tried
| for a very long time to agree on an "Alexa Ontology" and it just
| doesn't work for large enough surface areas. Testing that new
| versions of LLMs work is better than trying to make everything
| backward compatible. Also, the "structured" component of the
| response (e.g. "send your answer in JSON format") should not be
| super brittle. In fact, if the structure takes a lot of prompting
| to work, you are probably setting yourself up for failure.
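|
| One way to keep that from being brittle is to validate and
| retry instead of leaning entirely on the prompt -- a rough
| sketch, where call_llm is a placeholder for whatever client
| you use:
|
|     import json
|
|     def get_json_answer(call_llm, prompt, retries=2):
|         for _ in range(retries + 1):
|             raw = call_llm(prompt +
|                            "\nRespond with valid JSON only.")
|             try:
|                 return json.loads(raw)
|             except json.JSONDecodeError:
|                 continue  # re-ask rather than over-prompting
|         raise ValueError("model never returned parseable JSON")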
| nijfranck wrote:
| When a newer LLM comes out (e.g. GPT-3.5 to GPT-4), your old
| prompts become obsolete. How are you solving this problem in your
| company? Are there companies working on solving this problem?
| [deleted]
| devit wrote:
| What? Prompts are natural language, so I don't see how they can
| possibly become obsolete.
| ineedasername wrote:
| I have to prompt engineer a lot more with 3.5 than 4. The way
| I ask questions and convey what I want tends to be much
| more structured with 3.5, in a less natural way than I can do
| with 4. Hopefully 4 would be even better at answering a
| structured prompt like that, but also maybe not: For quick
| questions with a short answer 3.5 will sometimes give a
| simpler answer than 4, but 3.5 is correct. 4 isn't
| necessarily wrong, but it sort of reads into the question a
| bit more, so the answer is less succinct, more caveats and
| nuances explained, etc. In examples like this even though
| both give a correct answer, the one from 4 may be
| undesirable. You don't want to have to read through an extra
| paragraph to pick out the answer to your question. There's
| more friction.
|
| Of course the above scenario is easily solved: Change your
| prompt to include "Be Brief", but that's exactly the
| argument-- the old prompt is at least in part obsolete and
| must change to achieve functional equivalency in 4. And then
| you need to check for unanticipated changes to the answer
| that "be brief" would cause: maybe it would now be too brief!
| Maybe not, but you have to have some method of checking.
| tyree731 wrote:
| They can. Results change as you change your models, and
| results aren't always strictly better or worse, which is why
| testing against gold-standard results whenever prompts or
| models change is so important for applications utilizing LLMs.
| SkyPuncher wrote:
| They're loosely natural language.
|
| There's a lot of tweaking and non-natural language that goes
| into them to get the exact results you expect.
| TeMPOraL wrote:
| Prompts are natural language, but you're using them with the
| model in a way similar to getting a split-second gut feel
| reaction from a human - that reaction may very well vary
| between people.
| electroly wrote:
| The performance of the model can be improved with tweaks to
| the prompt, but the tweaks end up being model-specific. This
| is why "prompt engineering" exists for productionized use
| cases instead of people just spitting words semi-randomly
| into a textbox. Your old prompts probably won't completely
| fail but they'll behave differently under a different model.
| cosmojg wrote:
| RLHF and fine-tuning! While these methods make prompting more
| accessible and approachable to people unfamiliar with LLMs
| and otherwise expecting an omniscient chatbot, they make the
| underlying dynamics a lot more unstable. Personally, I prefer
| the untuned base models. In fact, I depend upon a set of
| high-quality prompts (none of which are questions or
| instructions) which perform similarly across _different_ base
| models of _different_ sizes (e.g., GPT-2-1.5B, code-
| davinci-002, LLaMA-65B, etc.) but frequently break between
| different instruction-tuned models and different versions of
| the _same_ instruction-tuned model (I think Google 's
| Flan-T5-XXL has been the only standout exception in my tests,
| consistently outperforming its corresponding base model, and
| although it's not saying much, I admit that GPT-4 does do a
| lot better than GPT-3.5-turbo in remaining consistent across
| updates).
| iLoveOncall wrote:
| We solve this by not building a company whose sole value
| proposition is a thousand characters that you feed in an AI
| that you don't control.
| IanCal wrote:
| You can use it for processes or features without having your
| whole company be based on it.
| lachlan_gray wrote:
| LMQL helps a lot with this kind of thing. It makes it really easy
| to swap prompts and models out, and in general it allows you to
| maintain your prompt workflows in whatever way you maintain the
| rest of your python code.
|
| I'm expecting there will be more examples soon, but you can check
| out my tree of thoughts implementation below to see what I mean
|
| https://github.com/LachlanGray/lmql-tree-of-thoughts
| netruk44 wrote:
| > If you expect the models you use to change at all, it's
| important to unit-test all your prompts using evaluation
| examples.
|
| It's mentioned earlier in the article, but I'd like to emphasize
| that if you go down this route you should either do
| _multiple_ evaluations per prompt and come up with some kind of
| averaged result, or set the temperature to 0.
|
| FTA:
|
| > LLMs are stochastic - there's no guarantee that an LLM will
| give you the same output for the same input every time.
|
| > You can force an LLM to give the same response by setting
| temperature = 0, which is, in general, a good practice.
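|
| As a simplified sketch of what that can look like (pytest plus
| the 2023-era openai client; the prompts and expected strings
| are invented):
|
|     import openai
|     import pytest
|
|     CASES = [
|         ("Classify the sentiment: 'I love it'", "positive"),
|         ("Classify the sentiment: 'This is awful'", "negative"),
|     ]
|
|     @pytest.mark.parametrize("prompt,expected", CASES)
|     def test_prompt_regression(prompt, expected):
|         out = openai.ChatCompletion.create(
|             model="gpt-3.5-turbo-0301",  # pin a dated snapshot
|             temperature=0,               # minimize randomness
|             messages=[{"role": "user", "content": prompt}],
|         )
|         text = out.choices[0].message.content.lower()
|         assert expected in text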
| bequanna wrote:
| Serious question: how do you unit test variable text output
| from an LLM model?
| jerpint wrote:
| Temperature = 0 will give deterministic results, but might not
| be as "creative". Also it's not enough to guarantee determinism
| , hardware executing the LLM can lead to different results as
| well
| netruk44 wrote:
| In terms of being part of a test suite, I think determinism >
| creativity in the response. But I would agree there are
| probably rough edges there; it's possible that some prompts
| never perform well with temperature set to 0.
| morelisp wrote:
| Even setting temp to 0 retains some nondeterminism.
___________________________________________________________________
(page generated 2023-06-09 23:01 UTC)