[HN Gopher] Show HN: Faster LLM evaluation with Bayesian optimiz...
___________________________________________________________________
Show HN: Faster LLM evaluation with Bayesian optimization
Recently I've been working on making LLM evaluations fast by using
Bayesian optimization to select a sensible subset. Bayesian
optimization is used because it is good at balancing exploration and
exploitation of an expensive black box (here, the LLM). I would love
to hear your thoughts and suggestions on this!
Author : renchuw
Score : 94 points
Date : 2024-02-13 15:21 UTC (7 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| renchuw wrote:
| Side note:
|
| OP here, I came up with this idea because I was chatting with a
| friend about how to make LLM evaluations fast (they are so
| painfully slow on large datasets) and realized that somehow no
| one has tried this. So I decided to give it a go!
| enonimal wrote:
| This is a cool idea -- is this an inner-loop process (i.e. after
| each LLM evaluation, the output is considered to choose the next
| sample) or a pre-loop process (get a subset of samples before
| tests are run)?
| renchuw wrote:
| This would be an inner-loop process. However, the selection is
| _way_ faster than the LLM itself, so it shouldn't be noticeable
| (hopefully).
| ReD_CoDE wrote:
| It seems that you're the only one who understood the idea. I
| don't know whether current LLMs use such a method or not, but
| the idea could be 10 times faster.
| enonimal wrote:
| AFAICT, this is a more advanced way of using Embeddings
| (which can encode for the _vibes similarity_ (not an official
| term) of prompts) to determine where you get the most "bang
| for your buck" in terms of testing.
|
| For instance, if there are three conversations that you can
| use to test if your AI is working correctly:
|
| (1) HUMAN: "Please say hello"    AI: "Hello!"
|
| (2) HUMAN: "Please say goodbye"  AI: "Goodbye!"
|
| (3) HUMAN: "What is 2 + 2?"      AI: "4!"
|
| Let's say you can only pick two conversations to evaluate how
| good your AI is. Would you pick 1 & 2? Probably not. You'd
| pick 1 & 3, or 2 & 3.
|
| Because Embeddings allow us to determine how _similar in vibes_
| things are, we have a tool with which we can automatically
| search over our dataset for things that have _very different
| vibes_, meaning that each evaluation run is more likely to
| return _new information_ about how well the model is doing.
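|
| A toy sketch of that selection (using sentence-transformers as
| an example embedder, not necessarily what the OP uses; just
| picking the least-similar pair):
|
|     import numpy as np
|     from sentence_transformers import SentenceTransformer
|
|     prompts = ["Please say hello", "Please say goodbye",
|                "What is 2 + 2?"]
|     model = SentenceTransformer("all-MiniLM-L6-v2")
|     emb = model.encode(prompts, normalize_embeddings=True)
|     sim = emb @ emb.T  # pairwise cosine similarity
|     # least similar pair = most new information per test
|     i, j = np.unravel_index(np.argmin(sim), sim.shape)
|     print(prompts[i], "/", prompts[j])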
|
| My question to the OP was mostly about whether or not this
| "vibe differentiated dataset" was constructed prior to the
| evaluation run, or populated gradually, based on each
| individual test case result.
|
| so anyway it's just vibes man
| ShamelessC wrote:
| biodigital jazz man
| abhgh wrote:
| That's probably the intent, but I don't know if this actually
| achieves it (I have another comment about the use of bayesopt
| here). But even if it did, bayesopt operates sequentially (it's
| a Sequential Model-Based Optimizer, or SMBO), so the trajectory
| of queries that different LLMs get evaluated on would differ.
| Unless there is something to correct this cascading bias, I
| don't know if you could use this to compare LLMs, or to obtain a
| score that's comparable to standard reported numbers.
|
| On a different note, if all we want is a diverse set of
| representative samples (based on embeddings), there are
| algorithms like DivRank that do that quite well.
| azinman2 wrote:
| What I don't get from the webpage is what are you evaluating,
| exactly?
| observationist wrote:
| This, exactly - what is meant by evaluate in this context? Is
| this more efficient inference using approximation, so you can
| create novel generations, or is it some test of model
| attributes?
|
| What the OP is doing here is completely opaque to the rest of
| us.
| swalsh wrote:
| This is becoming so common in AI discussions. Everyone with a
| real use case is opaque, or just flat out doesn't talk. The
| ones who are talking have toy use cases. I think it's because
| it's so hard to build a moat, and techniques are one of the
| ways to build one.
| renchuw wrote:
| Hi, OP here. I kind of have to disagree here. You raised some
| interesting points, but I don't think something qualifies as a
| *moat* if it can be overcome simply by sharing the use cases.
| For example, we all know Google's use case is search, but no one
| has built a search engine as good as theirs. Their moat is in
| their technology and brand recognition.
| littlestymaar wrote:
| Not to disagree with your argument as a whole, but Google's
| moat hasn't been technological for years; it instead comes from
| their ability to be the default search engine everywhere they
| can, including paying Apple billions for that position.
| renchuw wrote:
| Fair question.
|
| "Evaluate" refers to the phase after training where you check
| whether the training went well.
|
| Usually the flow goes training -> evaluation -> deployment
| (what you called inference). This project is aimed at the
| evaluation step. Evaluation can be slow (it might even be slower
| than training if you're finetuning on a small domain-specific
| subset)!
|
| So there are
| [quite](https://github.com/microsoft/promptbench)
| [a](https://github.com/confident-ai/deepeval)
| [few](https://github.com/openai/evals)
| [frameworks](https://github.com/EleutherAI/lm-evaluation-
| harness) working on evaluation. However, all of them are quite
| slow, because LLMs are slow if you don't have infinite money.
| [This](https://github.com/open-compass/opencompass) one tries to
| speed things up by parallelizing across multiple machines, but
| none of them takes advantage of the fact that many evaluation
| queries might be similar; they all evaluate on every given
| query. That's where this project might come in handy.
| observationist wrote:
| Your explanations are still unclear.
|
| I know what evaluation is, and inference, and training.
| Deployment means to deploy - to put a model in production.
| It does not mean inference. Inference means to input a
| prompt into a model and get the next token, or tokens as
| the case may be. Training and inference are closely
| related, since during training, inference is run and the
| error given by the difference between the prediction and
| target is backpropagated, etc.
|
| Evaluation is running inference over a suite of tests and
| comparing the outcomes to some target ideal. An evaluation
| on the MMLU dataset lets you run inference on zero and few
| shot prompts to test the knowledge and function acquisition
| of your model, for example.
|
| So is your code using Bayesian Optimization to select a
| subset of a corpus, like a small chunk of the MMLU dataset,
| that is representative of the whole, so you can test on
| that subset instead of the whole thing?
| renchuw wrote:
| Hi, OP here, sorry for the late reply. I am not actually
| "evaluating" per se, but rather using the "side effects" of
| Bayesian optimization, which allow zooming in/out on regions of
| the latent space. Since embedders are so fast compared to LLMs,
| it saves time by sparing the LLM from evaluating similar
| queries. Hope that makes sense!
| azinman2 wrote:
| But aren't you really just evaluating the embeddings /
| quality of the latent space then?
| PheonixPharts wrote:
| "Evaluation" has a pretty standard meaning in the LLM community
| the same way that "unit test" does in software. Evaluations are
| suites of challenges presented to an LLM to _evaluate_ how well
| it does as a form of bench-marking.
|
| Nobody would chime in on an article on "faster unit testing in
| software with..." and complain that it's not clear because "is
| it a history unit? a science unit? what kind of tests are those
| students taking!?", so I find it odd that on HN people often
| complain about something similar for a very popular niche in
| this community.
|
| If you're interested in LLMs, the term "evaluation" should be
| very familiar, and if you're _not_ interested in LLMs then this
| post likely isn't for you.
| waldrews wrote:
| Unit testing isn't an overloaded term. Evaluation by itself
| is overloaded, though "LLM evaluation" disambiguates it. I
| first parsed the title as 'faster inference' rather than
| 'faster evaluation' even being aware of what LLM evaluation
| is, because that's a probable path given 'show' 'faster' and
| 'LLM' in the context window.
|
| That misreading could also suggest some interesting research
| directions. Bayesian optimization to choose some parameters
| which guide which subset of the neurons to include in the
| inference calculation? Why not.
| azinman2 wrote:
| There's lots to evaluate. If you're evaluating model quality,
| there are many benchmarks all trying to measure different
| things... accuracy in translation, common sense reasoning,
| how well it stays on topic, whether it can regurgitate a reference
| in the prompt text, how biased is the output along a societal
| dimension, other safety measures, etc. I'm in the field but
| not an LLM researcher per se, so perhaps this is more
| meaningful to others, but given the post it seems useful to
| answer my question which was what _exactly_ is being
| evaluated?
|
| In particular this is only working off the encoded sentences
| so it seems to me that things that involve attention etc
| aren't being evaluated here.
| skyde wrote:
| what do they mean by "evaluating the model on corpus." and
| "Evalutes the corpus on the model".
|
| I know what an LLM is and I know very well what Bayesian
| Optimization is. But I don't understand what this library is
| trying to do.
|
| I am guessing it's trying to test the model's ability to
| generate correct and relevant responses to a given input.
|
| But who is the judge?
| causal wrote:
| Same. "Evaluate" and "corpus" need to be defined. I don't think
| OP intended this to be clickbait but without clarification it
| sounds like they're claiming 10x faster inference, which I'm
| pretty sure it's not.
| renchuw wrote:
| Hi, OP here. It's not 10 times faster inference, but faster
| evaluation. You use evaluation on a dataset to check if your
| model is performing well. This takes a lot of time (it might
| take longer than training if you are just finetuning a
| pre-trained model on a small dataset)!
|
| So the pipeline goes training -> evaluation -> deployment
| (inference).
|
| Hope that explanation helps!
| ragona wrote:
| The "eval" phase is done after a model is trained to assess its
| performance on whatever tasks you wanted it to do. I think this
| is basically saying, "don't evaluate on the entire corpus, find
| a smart subset."
| deckar01 wrote:
| Evaluate is referring to measuring the accuracy of a model on a
| standard dataset for the purpose of comparing model
| performance. AKA benchmark.
|
| https://rentruewang.github.io/bocoel/research/
| skyde wrote:
| Right, I guess I am not familiar with how automated benchmarks
| for LLMs work. I assumed that deciding whether an LLM answer was
| good required human evaluation.
| MacsHeadroom wrote:
| Multiple choice tests, LM Eval (e.g. have GPT-4 rate an
| answer, or use M-of-N GPT-4 ratings as pass/fail),
| perplexity (i.e. how accurately can it reproduce a corpus
| that it was trained on).
|
| Lots of ways to evaluate without humans. Most (nearly all)
| LLM benchmarks are fully automated, without any humans
| involved.
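|
| A tiny sketch of the multiple-choice flavour (model.generate is
| a made-up stand-in for whatever inference call you use, not any
| particular framework's API):
|
|     def multiple_choice_accuracy(model, items):
|         # items: list of (question, choices, correct_letter)
|         hits = 0
|         for question, choices, answer in items:
|             prompt = question + "\n" + "\n".join(
|                 f"{c}. {t}" for c, t in zip("ABCD", choices)
|             ) + "\nAnswer with a single letter:"
|             reply = model.generate(prompt).strip().upper()[:1]
|             hits += int(reply == answer)
|         return hits / len(items)  # fully automated, no humans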
| renchuw wrote:
| Hi, OP here. So you evaluate LLMs on corpora to measure their
| performance, right? Bayesian optimization is there to select
| points (in the latent space) and tell the LLM where to evaluate
| next. To be precise, entropy search is used here (coupled with
| some latent space reduction techniques like N-sphere
| representation and embedding whitening). Hope that makes sense!
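|
| Roughly, the inner loop looks like this sketch (plain GP
| uncertainty sampling standing in for entropy search, no
| whitening or N-sphere tricks, and made-up names like
| score_with_llm; not the actual bocoel code):
|
|     import numpy as np
|     from sklearn.gaussian_process import GaussianProcessRegressor
|
|     def run_eval(embeddings, score_with_llm, budget):
|         # embeddings: (N, d) query embeddings (cheap to compute)
|         # score_with_llm(i): run the slow LLM + metric on item i
|         seen = [int(np.random.randint(len(embeddings)))]
|         scores = [score_with_llm(seen[0])]
|         for _ in range(budget - 1):
|             gp = GaussianProcessRegressor()
|             gp.fit(embeddings[seen], scores)
|             _, std = gp.predict(embeddings, return_std=True)
|             std[seen] = 0.0             # never re-query an item
|             nxt = int(np.argmax(std))   # most informative region
|             seen.append(nxt)
|             scores.append(score_with_llm(nxt))  # only LLM call
|         return float(np.mean(scores))   # cheap eval estimate
|
| The LLM only ever runs on the selected items; the embedder and
| the GP surrogate handle everything else.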
| hackerlight wrote:
| The definition of "evaluate" isn't clear. Do you mean
| inference?
| renchuw wrote:
| Perhaps I should clarify it in the project README. It's the
| phase where you evaluate how well your model is performing. The
| pipeline goes training -> evaluation -> deployment (inference),
| corresponding to the dataset splits in supervised learning:
| training (training set) -> evaluation (validation set) ->
| deployment (test set).
| anentropic wrote:
| is this an alternative way of doing RAG ?
| renchuw wrote:
| Hi, OP here. I would say not really, because the goals are
| different. Although both use retrieval techniques, RAG wants to
| augment your query with factual information, whereas here we
| retrieve in order to evaluate on as few queries as possible
| (with performance guaranteed by Bayesian optimization).
| abhgh wrote:
| What's the BayesOpt maximizing? As in it identifies a subset
| based on what criteria?
| renchuw wrote:
| I designed 2 modes in the project, _exploration_ mode and
| _exploitation_ mode.
|
| Exploration mode uses entropy search to explore the latent space
| (used when evaluating the LLM on the selected subset of the
| corpus), and exploitation mode is used to figure out how well or
| how badly the model is performing in which regions of the
| corpus.
|
| For accurate evaluations, exploration is used. However, I'm also
| working on a visualization tool so that users can see how well
| the model is performing in which region (courtesy of the
| Gaussian process models built into Bayesian optimization), and
| that is where exploitation mode can come in handy.
|
| Sorry for the slightly messy explanation. Hope it clarifies
| things!
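|
| To make the two modes concrete, a rough sketch (generic GP
| surrogate with illustrative names, reading "exploitation" as
| homing in on low-scoring regions; not the project's actual
| acquisition functions):
|
|     import numpy as np
|     from sklearn.gaussian_process import GaussianProcessRegressor
|
|     def next_point(embeddings, seen, scores, mode="explore"):
|         gp = GaussianProcessRegressor()
|         gp.fit(embeddings[seen], scores)
|         mean, std = gp.predict(embeddings, return_std=True)
|         if mode == "explore":         # map out uncertain regions
|             return int(np.argmax(std))
|         return int(np.argmin(mean))   # focus on weak regions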
| abhgh wrote:
| Thanks for the explanation!
|
| I don't entirely understand what the two modes mean here,
| because typically the search strategy (or acquisition
| function) in bayesopt - which in your case seems to be some
| form of entropy search (ES) - decides the explore-vs-exploit
| tradeoff for itself (possibly with some additional
| hyperparams ofc). For ex., ES would do this one way, Expected
| Improvement (EI) would do it differently, etc. - all this in
| the service of the bayesopt objective you want to maximize
| (or minimize).
|
| Assuming that you mean this objective when you mention
| exploitation, which here is based on the model performing
| well, wouldn't it just pick queries that the model can (or is
| likely to) answer correctly? This would be a very optimistic
| evaluation of the LLM.
| eximius wrote:
| This is "evaluating" LLMs in the sense of benchmarking how good
| they are, not improving LLM inference in speed or quality, yes?
| renchuw wrote:
| Correct.
| pama wrote:
| Does this method build in assumptions about the distribution of
| the evaluation dataset, and does it make bit-level reproduction
| of an evaluation unlikely?
| renchuw wrote:
| Well, this method is based on the assumption that embeddings
| can accurately represent the texts and that their structural
| relations are preserved.
|
| As long as you have all the random seeds fixed, I think
| reproduction should be straightforward.
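|
| For reference, the usual seed boilerplate looks something like
| this (assuming a torch-based stack):
|
|     import random
|     import numpy as np
|     import torch
|
|     def fix_seeds(seed=0):
|         random.seed(seed)
|         np.random.seed(seed)
|         torch.manual_seed(seed)
|         torch.cuda.manual_seed_all(seed)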
| endernac wrote:
| I looked through the github.io documentation and skimmed through
| the code and research article draft. Correct me if I am wrong.
| What I think you are doing (at a high level) is that you create
| a corpus of QA tasks, embeddings, and similarity metrics. Then
| you are somehow using NLP scoring and Bayesian Optimization to
| find a subset of the corpus that best matches a particular
| evaluation task. Then you can just evaluate the LLM on this
| subset rather than the entire corpus, which is much faster.
|
| I agree with the other comments. You need to do a much better
| job of motivating and contextualizing the research problem, as
| well as explaining your method in specific, precise language in
| the README and other documentation (preferably in the README).
| You should make it clear that you are using GLUE and Big-Bench
| for the evaluation (as well as any other evaluation benchmarks
| that you are using). You should also be explicit about which LLM
| models and embeddings you have tested and what datasets you used
| to train and evaluate on. You should also add graphs and tables
| showing your method's speed and evaluation performance compared
| to the SOTA. I like the reference/overview section that shows
| the diagram (I think you should put it in the README to make it
| more visible to first-time viewers). However, the descriptions
| of the classes are cryptic. For example, the Score class says
| "Evaluate the target with respect to the references." I had no
| idea what that meant, and I had to google some of the class
| names to get an idea of what Score was trying to do. That's true
| for pretty much all the classes. Also, you need to explain what
| the factory classes are and how they differ from the model
| classes, e.g. why does the bocoel.models.adaptors class require
| a score and a corpus (from the overview), but factories.adaptor
| requires "GLUE", lm, and choices (looking at the code from
| examples/getting_started/__main__.py)? However, I do like the
| fact that you have an example (although I haven't tried running
| it).
| renchuw wrote:
| Thanks for the feedback! The reason the "code" part is more
| complete than the "research" part is that I originally planned
| for this to just be a hobby project and only much later decided
| to try to be more serious and turn it into a research work.
|
| Not trying to make excuses tho. Your points are very valid and
| I will take them into account!
| marclave wrote:
| this is unreal! i was just thinking about this on a walk
| yesterday for our internal evals on our new models we are
| building.
|
| big kudos for this, so wonderfully excited to see this on HN and
| we will be using this
___________________________________________________________________
(page generated 2024-02-13 23:00 UTC)