[HN Gopher] Show HN: Faster LLM evaluation with Bayesian optimiz...
       ___________________________________________________________________
        
       Show HN: Faster LLM evaluation with Bayesian optimization
        
        Recently I've been working on making LLM evaluations faster by
        using Bayesian optimization to select a sensible subset of the
        evaluation set. Bayesian optimization is a good fit because it
        balances exploration and exploitation of an expensive black box
        (in this case, the LLM). I would love to hear your thoughts and
        suggestions on this!
        
       Author : renchuw
       Score  : 94 points
       Date   : 2024-02-13 15:21 UTC (7 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | renchuw wrote:
       | Side note:
       | 
        | OP here. I came up with this cool idea while chatting with a
        | friend about how to make LLM evaluations faster (they are so
        | painfully slow on large datasets) and realized that somehow no
        | one had tried this. So I decided to give it a go!
        
       | enonimal wrote:
       | This is a cool idea -- is this an inner-loop process (i.e. after
       | each LLM evaluation, the output is considered to choose the next
       | sample) or a pre-loop process (get a subset of samples before
       | tests are run)?
        
         | renchuw wrote:
          | This would be an inner-loop process. However, the selection is
          | _way_ faster than running the LLM, so it shouldn't be
          | noticeable (hopefully).
        
         | ReD_CoDE wrote:
          | It seems that you're the only one who understood the idea. I
          | don't know whether current LLM evaluations use such a method
          | or not, but the idea could be 10 times faster.
        
           | enonimal wrote:
            | AFAICT, this is a more advanced way of using embeddings
            | (which can encode the _vibes similarity_ (not an official
            | term) of prompts) to determine where you get the most "bang
            | for your buck" in terms of testing.
           | 
           | For instance, if there are three conversations that you can
           | use to test if your AI is working correctly:
           | 
           | (1) HUMAN: "Please say hello"                   AI: "Hello!"
           | 
           | (2) HUMAN: "Please say goodbye"                   AI:
           | "Goodbye!"
           | 
           | (3) HUMAN: "What is 2 + 2?"                   AI: "4!"
           | 
           | Let's say you can only pick two conversations to evaluate how
           | good your AI is. Would you pick 1 & 2? Probably not. You'd
           | pick 1 & 3, or 2 & 3.
           | 
            | Because embeddings allow us to determine how _similar in
            | vibes_ things are, we have a tool with which we can
            | automatically search over our dataset for things that have
            | _very different vibes_, meaning that each evaluation run is
            | more likely to return _new information_ about how well the
            | model is doing.
           | 
           | My question to the OP was mostly about whether or not this
           | "vibe differentiated dataset" was constructed prior to the
           | evaluation run, or populated gradually, based on each
           | individual test case result.
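            | 
            | For concreteness, the kind of selection I mean looks roughly
            | like the sketch below (greedy farthest-point selection over
            | precomputed embeddings). This is just my own illustration,
            | not necessarily what the OP's library actually does:
            | 
            |     import numpy as np
            | 
            |     def pick_diverse(emb, k):
            |         # emb: (n, d) numpy array, one embedding per
            |         # conversation. Greedily pick the sample whose
            |         # vibes are farthest from everything picked
            |         # so far.
            |         chosen = [0]
            |         dist = np.linalg.norm(emb - emb[0], axis=1)
            |         for _ in range(k - 1):
            |             nxt = int(np.argmax(dist))
            |             chosen.append(nxt)
            |             new = np.linalg.norm(emb - emb[nxt], axis=1)
            |             dist = np.minimum(dist, new)
            |         return chosen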
           | 
           | so anyway it's just vibes man
        
             | ShamelessC wrote:
             | biodigital jazz man
        
             | abhgh wrote:
              | That's probably the intent, but I don't know if it
              | actually achieves that (I have another comment about the
              | use of bayesopt here). But even if it did, bayesopt
              | operates sequentially (it's a Sequential Model-Based
              | Optimizer, or SMBO), so the trajectory of queries that
              | different LLMs evaluate would differ. Unless there is
              | something to correct this cascading bias, I don't know if
              | you could use this to compare LLMs, or to obtain a score
              | comparable to standard reported numbers.
             | 
             | On a different note, if all we want is a diverse set of
             | representative samples (based on embeddings), there are
             | algorithms like DivRank that do that quite well.
        
       | azinman2 wrote:
       | What I don't get from the webpage is what are you evaluating,
       | exactly?
        
         | observationist wrote:
         | This, exactly - what is meant by evaluate in this context? Is
         | this more efficient inference using approximation, so you can
         | create novel generations, or is it some test of model
         | attributes?
         | 
         | What the OP is doing here is completely opaque to the rest of
         | us.
        
           | swalsh wrote:
            | This is becoming so common in AI discussions. Everyone with a
            | real use case is opaque, or just flat out doesn't talk. The
            | ones who are talking have toy use cases. I think it's because
            | it's so hard to build a moat, and techniques are one of the
            | ways to build one.
        
             | renchuw wrote:
              | Hi, OP here. I would kind of have to disagree here. You
              | raised some interesting points, but I don't think something
              | qualifies as a *moat* if it can be overcome just by sharing
              | the use cases. For example, we all know Google's use case
              | is search, but no one has built a search engine as good as
              | theirs. Their moat is in their technology and brand
              | recognition.
        
               | littlestymaar wrote:
                | Not to disagree with your argument as a whole, but
                | Google's moat hasn't been technological for years; it
                | instead comes from their ability to be the default
                | search engine everywhere they can, including paying
                | Apple billions for that position.
        
           | renchuw wrote:
           | Fair question.
           | 
            | Evaluation refers to the phase after training where you
            | check how well the training went.
            | 
            | Usually the flow goes training -> evaluation -> deployment
            | (what you called inference). This project is aimed at
            | evaluation. Evaluation can be slow (it might even be slower
            | than training if you're finetuning on a small
            | domain-specific subset)!
           | 
            | So there are
            | [quite](https://github.com/microsoft/promptbench)
            | [a](https://github.com/confident-ai/deepeval)
            | [few](https://github.com/openai/evals)
            | [frameworks](https://github.com/EleutherAI/lm-evaluation-harness)
            | working on evaluation; however, all of them are quite slow,
            | because LLMs are slow if you don't have infinite money.
            | [This](https://github.com/open-compass/opencompass) one
            | tries to speed things up by parallelizing across multiple
            | machines, but none of them takes advantage of the fact that
            | many evaluation queries are similar; they all evaluate on
            | every given query. And that's where this project might come
            | in handy.
        
             | observationist wrote:
             | Your explanations are still unclear.
             | 
             | I know what evaluation is, and inference, and training.
             | Deployment means to deploy - to put a model in production.
             | It does not mean inference. Inference means to input a
             | prompt into a model and get the next token, or tokens as
             | the case may be. Training and inference are closely
             | related, since during training, inference is run and the
             | error given by the difference between the prediction and
             | target is backpropagated, etc.
             | 
             | Evaluation is running inference over a suite of tests and
             | comparing the outcomes to some target ideal. An evaluation
              | on the MMLU dataset lets you run inference on zero- and
              | few-shot prompts to test the knowledge and function
              | acquisition of your model, for example.
             | 
             | So is your code using Bayesian Optimization to select a
             | subset of a corpus, like a small chunk of the MMLU dataset,
             | that is representative of the whole, so you can test on
             | that subset instead of the whole thing?
        
         | renchuw wrote:
          | Hi, OP here, sorry for the late reply. I am not actually
          | "evaluating", but rather using the "side effects" of Bayesian
          | optimization, which allow zooming in/out on regions of the
          | latent space. Since embedders are so fast compared to LLMs, it
          | saves time by sparing the LLM from evaluating similar queries.
          | Hope that makes sense!
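          | 
          | Roughly, the loop looks something like the sketch below. This
          | is a simplification (a plain scikit-learn GP with a
          | pure-uncertainty acquisition rather than the entropy search
          | actually used), and llm_score is a hypothetical stand-in for
          | the expensive LLM call plus metric:
          | 
          |     import numpy as np
          |     from sklearn.gaussian_process import (
          |         GaussianProcessRegressor,
          |     )
          | 
          |     def evaluate_subset(emb, llm_score, budget=50):
          |         # emb: (n, d) embeddings of every query,
          |         # computed once up front (cheap).
          |         # llm_score(i): run the LLM on query i and
          |         # score it (expensive).
          |         picked = [0]
          |         scores = [llm_score(0)]
          |         for _ in range(budget - 1):
          |             gp = GaussianProcessRegressor()
          |             gp.fit(emb[picked], scores)
          |             _, std = gp.predict(emb, return_std=True)
          |             std[picked] = -np.inf  # no repeats
          |             nxt = int(np.argmax(std))  # most new info
          |             picked.append(nxt)
          |             scores.append(llm_score(nxt))
          |         return picked, scores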
        
           | azinman2 wrote:
           | But aren't you really just evaluating the embeddings /
           | quality of the latent space then?
        
         | PheonixPharts wrote:
         | "Evaluation" has a pretty standard meaning in the LLM community
         | the same way that "unit test" does in software. Evaluations are
         | suites of challenges presented to an LLM to _evaluate_ how well
          | it does, as a form of benchmarking.
         | 
         | Nobody would chime in on an article on "faster unit testing in
         | software with..." and complain that it's not clear because "is
         | it a history unit? a science unit? what kind of tests are those
         | students taking!?", so I find it odd that on HN people often
         | complain about something similar for a very popular niche in
         | this community.
         | 
         | If you're interested in LLMs, the term "evaluation" should be
         | very familiar, and if you're _not_ interested in LLMs then this
          | post likely isn't for you.
        
           | waldrews wrote:
           | Unit testing isn't an overloaded term. Evaluation by itself
           | is overloaded, though "LLM evaluation" disambiguates it. I
            | first parsed the title as 'faster inference' rather than
            | 'faster evaluation', even while being aware of what LLM
            | evaluation is, because that's a probable path given 'show',
            | 'faster', and 'LLM' in the context window.
           | 
           | That misreading could also suggest some interesting research
           | directions. Bayesian optimization to choose some parameters
           | which guide which subset of the neurons to include in the
           | inference calculation? Why not.
        
           | azinman2 wrote:
           | There's lots to evaluate. If you're evaluating model quality,
           | there are many benchmarks all trying to measure different
            | things... accuracy in translation, common-sense reasoning,
            | how well it stays on topic, whether it can regurgitate a
            | reference from the prompt text, how biased the output is
            | along societal dimensions, other safety measures, etc. I'm
            | in the field but not an LLM researcher per se, so perhaps
            | this is more meaningful to others, but given the post it
            | seems useful to answer my question, which was: what
            | _exactly_ is being evaluated?
           | 
            | In particular, this only works off the encoded sentences, so
            | it seems to me that things that involve attention etc.
            | aren't being evaluated here.
        
       | skyde wrote:
        | What do they mean by "evaluating the model on corpus." and
        | "Evalutes the corpus on the model"?
        | 
        | I know what an LLM is, and I know very well what Bayesian
        | optimization is. But I don't understand what this library is
        | trying to do.
        | 
        | I am guessing it's trying to test the model's ability to
        | generate correct and relevant responses to a given input.
        | 
        | But who is the judge?
        
         | causal wrote:
         | Same. "Evaluate" and "corpus" need to be defined. I don't think
         | OP intended this to be clickbait but without clarification it
         | sounds like they're claiming 10x faster inference, which I'm
         | pretty sure it's not.
        
           | renchuw wrote:
           | Hi, OP here. It's not 10 times faster inference, but faster
           | evaluation. You use evaluation on a dataset to check if your
            | model is performing well. This takes a lot of time (it might
            | take longer than training if you are just finetuning a
            | pre-trained model on a small dataset)!
           | 
           | So the pipeline goes training -> evaluation -> deployment
           | (inference).
           | 
           | Hope that explanation helps!
        
         | ragona wrote:
         | The "eval" phase is done after a model is trained to assess its
         | performance on whatever tasks you wanted it to do. I think this
         | is basically saying, "don't evaluate on the entire corpus, find
         | a smart subset."
        
         | deckar01 wrote:
          | "Evaluate" refers to measuring the accuracy of a model on a
          | standard dataset for the purpose of comparing model
          | performance. AKA benchmarking.
         | 
         | https://rentruewang.github.io/bocoel/research/
        
           | skyde wrote:
            | Right, I guess I am not familiar with how automated
            | benchmarks for LLMs work. I assumed that deciding whether an
            | LLM answer was good required human evaluation.
        
             | MacsHeadroom wrote:
             | Multiple choice tests, LM Eval (e.g. have GPT-4 rate an
             | answer, or use M-of-N GPT-4 ratings as pass/fail),
              | perplexity (i.e. how accurately it can reproduce a corpus
              | that it was trained on).
             | 
             | Lots of ways to evaluate without humans. Most (nearly all)
             | LLM benchmarks are fully automated, without any humans
             | involved.
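              | 
              | Perplexity, for instance, is completely mechanical once
              | you have the model's per-token log-probs. A rough sketch
              | (assuming natural-log probabilities):
              | 
              |     import math
              | 
              |     def perplexity(token_logprobs):
              |         # exp of the average negative
              |         # log-likelihood per token
              |         n = len(token_logprobs)
              |         total = sum(token_logprobs)
              |         return math.exp(-total / n)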
        
         | renchuw wrote:
          | Hi, OP here. So you evaluate LLMs on corpora to measure their
          | performance, right? Bayesian optimization is used here to
          | select points (in the latent space) and tell the LLM where to
          | evaluate next. To be precise, entropy search is used (coupled
          | with some latent-space reduction techniques like an N-sphere
          | representation and embedding whitening). Hope that makes sense!
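          | 
          | (Embedding whitening here roughly means centering the
          | embeddings and rescaling them so their covariance is close to
          | the identity. A minimal sketch of the idea:)
          | 
          |     import numpy as np
          | 
          |     def whiten(emb):
          |         # emb: (n, d) embedding matrix
          |         mu = emb.mean(axis=0, keepdims=True)
          |         cov = np.cov(emb - mu, rowvar=False)
          |         u, s, _ = np.linalg.svd(cov)
          |         w = u @ np.diag(1.0 / np.sqrt(s))
          |         # whitened: covariance ~= identity
          |         return (emb - mu) @ w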
        
           | hackerlight wrote:
           | The definition of "evaluate" isn't clear. Do you mean
           | inference?
        
             | renchuw wrote:
              | Perhaps I should clarify it in the project README. It's the
              | phase where you check how well your model is performing.
              | The pipeline goes training -> evaluation -> deployment
              | (inference), which corresponds to the dataset splits in
              | supervised learning: training (training set) -> evaluation
              | (validation set) -> deployment (test set).
        
       | anentropic wrote:
        | is this an alternative way of doing RAG?
        
         | renchuw wrote:
          | Hi, OP here. I would say not really, because the goals are
          | different. Although both use retrieval techniques, RAG wants
          | to augment your query with factual information, whereas here
          | we retrieve in order to evaluate on as few queries as possible
          | (with performance guarantees from Bayesian optimization).
        
       | abhgh wrote:
       | What's the BayesOpt maximizing? As in it identifies a subset
       | based on what criteria?
        
         | renchuw wrote:
          | I designed two modes in the project, _exploration_ mode and
          | _exploitation_ mode.
          | 
          | Exploration mode uses entropy search to explore the latent
          | space (used for evaluating the LLM on the selected subset
          | of the corpus), and exploitation mode is used to figure out
          | how well or badly the model is performing in which regions
          | of the corpus.
          | 
          | For accurate evaluations, exploration is used. However, I'm
          | also working on a visualization tool so that users can see
          | how well the model is performing in each region (courtesy of
          | the Gaussian process model built by Bayesian optimization),
          | and that is where exploitation mode can come in handy.
          | 
          | Sorry for the slightly messy explanation. Hope it clarifies
          | things!
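          | 
          | As a crude simplification of the acquisition step (the real
          | thing uses entropy search, but this gives the flavor):
          | 
          |     def acquisition(mu, sigma, mode="explore"):
          |         # mu, sigma: GP posterior mean / std
          |         # over the latent space.
          |         # explore: go where the score is most
          |         #     uncertain, so the overall estimate
          |         #     improves fastest.
          |         # exploit: follow the predicted score
          |         #     itself (or its negation) to map out
          |         #     where the model does well or badly.
          |         return sigma if mode == "explore" else mu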
        
           | abhgh wrote:
           | Thanks for the explanation!
           | 
            | I don't entirely understand what the two modes mean here,
           | because typically the search strategy (or acquisition
           | function) in bayesopt - which in your case seems to be some
           | form of entropy search (ES) - decides the explore-vs-exploit
           | tradeoff for itself (possibly with some additional
           | hyperparams ofc). For ex., ES would do this one way, Expected
           | Improvement (EI) would do it differently, etc. - all this in
           | the service of the bayesopt objective you want to maximize
           | (or minimize).
           | 
           | Assuming that you mean this objective when you mention
           | exploitation, which here is based on the model performing
           | well, wouldn't it just pick queries that the model can (or is
           | likely to) answer correctly? This would be a very optimistic
           | evaluation of the LLM.
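            | 
            | (For reference, EI already mixes an "exploit" term driven by
            | the posterior mean with an "explore" term driven by the
            | posterior std; the standard closed form for maximization is
            | roughly:)
            | 
            |     import numpy as np
            |     from scipy.stats import norm
            | 
            |     def expected_improvement(mu, sigma, best, xi=0.01):
            |         # mu, sigma: GP posterior mean / std at
            |         # candidate points; best: best observed value
            |         sigma = np.maximum(sigma, 1e-12)
            |         z = (mu - best - xi) / sigma
            |         ei = (mu - best - xi) * norm.cdf(z)
            |         return ei + sigma * norm.pdf(z)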
        
       | eximius wrote:
       | This is "evaluating" LLMs in the sense of benchmarking how good
       | they are, not improving LLM inference in speed or quality, yes?
        
         | renchuw wrote:
         | Correct.
        
       | pama wrote:
        | Does this method build in assumptions about the distribution of
        | the evaluation dataset and make bit-level reproduction of an
        | evaluation unlikely?
        
         | renchuw wrote:
          | Well, this method is based on the assumption that embeddings
          | can accurately represent the texts and that their structural
          | relations are preserved.
          | 
          | So long as you have all the random seeds fixed, I think
          | reproduction should be straightforward.
        
       | endernac wrote:
       | I looked through the github.io documentation and skimmed through
       | the code and research article draft. Correct me if I am wrong.
        | What I think you are doing (at a high level) is creating a
        | corpus of QA tasks, embeddings, and similarity metrics. Then you
        | are somehow using NLP scoring and Bayesian optimization to find
        | a subset of the corpus that best matches a particular evaluation
        | task. Then you can just evaluate the LLM on this subset rather
        | than the entire corpus, which is much faster.
       | 
        | I agree with the other comments. You need to do a much better
        | job of motivating and contextualizing the research problem, as
        | well as explaining your method in specific, precise language in
        | the README and other documentation (preferably in the README).
        | You should make it clear that you are using GLUE and Big-Bench
        | for the evaluation (as well as any other evaluation benchmarks
        | that you are using). You should also be explicit about which LLM
        | models and embeddings you have tested and what datasets you used
        | to train and evaluate on. You should also add graphs and tables
        | showing your method's speed and evaluation performance compared
        | to the SOTA.
        | 
        | I like the reference/overview section that shows the diagram (I
        | think you should put it in the README to make it more visible to
        | first-time viewers). However, the descriptions of the classes
        | are cryptic. For example, the Score class says "Evaluate the
        | target with respect to the references." I had no idea what that
        | meant, and I had to google some of the class names to get an
        | idea of what Score was trying to do. That's true for pretty much
        | all the classes. Also, you need to explain what factory classes
        | are and how they differ from the models classes, e.g. why does
        | the bocoel.models.adaptors class require a score and a corpus
        | (from the overview), but factories.adaptor require "GLUE", lm,
        | and choices (looking at the code from
        | examples/getting_started/__main__.py)? However, I do like the
        | fact that you have an example (although I haven't tried running
        | it).
        
         | renchuw wrote:
          | Thanks for the feedback! The reason the "code" part is more
          | complete than the "research" part is that I originally planned
          | for it to just be a hobby project, and only much later decided
          | to try to be serious and make it a research work.
          | 
          | Not trying to make excuses tho. Your points are very valid and
          | I will take them into account!
        
       | marclave wrote:
       | this is unreal! i was just thinking about this on a walk
       | yesterday for our internal evals on our new models we are
       | building.
       | 
       | big kudos for this, so wonderfully excited to see this on HN and
       | we will be using this
        
       ___________________________________________________________________
       (page generated 2024-02-13 23:00 UTC)