[HN Gopher] Blending Is All You Need: Cheaper, Better Alternativ...
       ___________________________________________________________________
        
       Blending Is All You Need: Cheaper, Better Alternative to Trillion-
       Parameters LLM
        
       Author : naturalauction
       Score  : 71 points
       Date   : 2024-01-11 13:00 UTC (10 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | Animats wrote:
       | _" Responses are selected randomly from a group of base chat AIs.
       | ... The response generated by a specific chat AI is conditional
       | on all previous responses generated by the previously selected
       | chat AIs."_
       | 
       | That's all? That works? Useful.
       | 
       | Could that be extended? It doesn't seem inherent in this that all
       | the chat AIs have to be LLMs. Some might be special-purpose
       | systems. Solvers or knowledge bases, such as Wolfram Alpha or a
       | database front end, could play too. Systems at the Alexa/Siri
       | level that can do simple tasks. Domain-specific systems with
       | natural language in and out have been around for decades.
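        | 
        | The whole mechanism fits in a few lines. A rough sketch, with
        | StubChatAI and its generate method as hypothetical stand-ins
        | for whatever inference API each system exposes:
        | 
        |     import random
        | 
        |     class StubChatAI:
        |         """Stand-in for any base system: LLM, solver, KB."""
        |         def __init__(self, name):
        |             self.name = name
        |         def generate(self, history):
        |             # A real system would condition on the full
        |             # history, including turns written by the others.
        |             return f"[{self.name}] re: {history[-1][1]}"
        | 
        |     def blended_reply(models, history):
        |         # Blending: pick one system uniformly at random for
        |         # this turn; it sees all previous responses.
        |         return random.choice(models).generate(history)
        | 
        |     models = [StubChatAI("pygmalion-6b"),
        |               StubChatAI("vicuna-13b"), StubChatAI("chai-6b")]
        |     history = [("user", "hi")]
        |     history.append(("bot", blended_reply(models, history)))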
        
         | bhickey wrote:
          | Why aren't they computing the next-token marginal and
          | sampling from that? The only reason I can come up with is
          | that it's a reasonable way to work around dealing with
          | different tokenizers.
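          | 
          | A sketch of that token-level alternative, where
          | next_token_probs is a hypothetical method returning each
          | model's distribution over a shared vocabulary (exactly what
          | differing tokenizers break):
          | 
          |     import numpy as np
          | 
          |     def sample_marginal(models, context,
          |                         rng=np.random.default_rng()):
          |         # Average p(next token | context) across models and
          |         # sample once from the mixture. Only well-defined
          |         # when every model shares one vocabulary.
          |         probs = np.mean([m.next_token_probs(context)
          |                          for m in models], axis=0)
          |         return rng.choice(len(probs), p=probs)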
        
       | block_dagger wrote:
       | Reminds me of Numenta's Thousand Brains Theory of Intelligence.
        
         | debatem1 wrote:
         | When this is all settled I suspect we're going to find
         | ourselves the drivers of a chariot, with each horse being an
         | external and artificial mind given direction by our evolved
         | needs.
        
           | LASR wrote:
            | I really don't think it's realistic that we will maintain
            | intellectual superiority in the long term.
           | 
           | So a more realistic hope would be: we're the horses, and the
           | drivers are driving us to our carrots.
        
             | mattnewton wrote:
             | Will to drive the chariot is different from intelligence.
             | In the same way it's different from strength and horses
             | didn't domesticate humans.
             | 
             | Of course if we make systems with a goal of dominating and
             | are smarter than us and let it run for a while we could be
             | in trouble, in the same way that we could be in trouble
             | detonating a bunch of atom bombs, just maybe less
             | obviously.
        
             | randomdata wrote:
             | An intellectually superior machine would simply turn itself
             | off. What logical reason would there be for it to keep
             | going?
        
       | teddyh wrote:
       | Three small LLMs in a trenchcoat.
        
       | babelfish wrote:
       | How is this different than mixture of experts?
       | 
       | Edit: ChatGPT provided the following, which makes sense.
       | 
       | Objective: The Blending approach aims to combine responses from
       | multiple smaller chat AIs to create a single, more engaging and
       | diverse chat AI. This is in contrast to MoE, which typically
       | involves partitioning the input space and assigning different
       | experts to different partitions, with the goal of specializing
       | each expert in a certain area.
       | 
       | Methodology: In the Blending approach, responses are selected
       | randomly from a group of base chat AIs, and the resulting
       | combined chat AI is found to be highly capable and engaging. This
       | method does not require all component systems to generate outputs
       | but instead stochastically selects the system that generates the
       | next response, allowing for model blending at the level of a
       | multi-turn conversation. MoE, on the other hand, usually involves
       | weighting the outputs of different experts based on their
       | relevance to the current input and then combining these weighted
       | outputs.
       | 
       | Performance: The paper reports that a Blended ensemble with three
       | 6-13B parameter LLMs can outcompete OpenAI's 175B+ parameter
       | ChatGPT in terms of user retention and engagement, without the
       | need for large-scale infrastructure. MoE models also aim to
       | improve performance, but they do so by dividing the workload
       | among different experts, each of which is specialized in a
       | certain area, rather than by blending the outputs of different
       | models.
       | 
       | Resource Efficiency: One of the key benefits of the Blending
       | approach is that it requires only a fraction of the inference
       | cost and memory overhead compared to large-scale LLMs like
       | ChatGPT. This is because responses for Blended are each sampled
       | from a single component chat AI. In contrast, MoE models can be
       | resource-intensive, as they involve maintaining multiple expert
       | models and a gating mechanism to determine which expert to use
       | for each input.
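        | 
        | For contrast, a toy sketch of MoE-style gating (the shapes and
        | the linear "experts" here are made up purely for
        | illustration): a learned gate weights every expert's output
        | per input, whereas Blended just picks one whole model per
        | conversation turn.
        | 
        |     import numpy as np
        | 
        |     def softmax(z):
        |         e = np.exp(z - z.max())
        |         return e / e.sum()
        | 
        |     def moe_layer(x, experts, gate):
        |         # MoE: the gate scores the experts for this input and
        |         # their outputs are mixed by those weights.
        |         weights = softmax(gate @ x)                  # (n,)
        |         outputs = np.stack([W @ x for W in experts]) # (n, d)
        |         return weights @ outputs                     # (d,)
        | 
        |     rng = np.random.default_rng(0)
        |     x = rng.normal(size=8)
        |     experts = [rng.normal(size=(8, 8)) for _ in range(3)]
        |     gate = rng.normal(size=(3, 8))
        |     print(moe_layer(x, experts, gate))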
        
         | sva_ wrote:
         | Going to ignore the chatgpt spam
         | 
         | > How is this different than mixture of experts?
         | 
         | It appears like this combines already existing models, rather
         | than training n experts from scratch, which seems like an
         | interesting approach.
        
       | jeffrallen wrote:
       | "All you need" is all you need, apparently, to get an AI paper in
       | HN.
        
       | rfw300 wrote:
        | The paper refers to ChatGPT as a 175B-parameter LLM. This is
        | almost certainly incorrect; the original largest version of
        | GPT-3 was 175B, but analysis of the current model's speed and
        | cost, as well as public statements by OpenAI, indicates it's as
        | much as 5-10x smaller.
        
         | Klaus23 wrote:
         | I think it was leaked that it is 20B now.
        
           | miven wrote:
            | It was listed as a 20B model in a comparison table in a
            | paper co-written by Microsoft, but they've since claimed
            | that was just an error. And I mean, they'd need to be
            | sitting on some really impressive distillation techniques
            | to shrink a 175B model down to 20B with only a slight drop
            | in performance.
        
       | abeppu wrote:
       | Ok, this seems bunk basically because they never really provide
       | evidence of "better".
       | 
        | > ... traditional gold-standard approaches use human evaluators
        | that score the quality of generated responses, which can be
        | costly. However, since chat AIs are by definition deployed in
        | social environments with humans, one can leverage statistics of
        | user interactions as a meaningful and aligned measure of chat
        | AI engagingness and quality. To assess the 'quality' of a chat
        | AI, we consider two main proxy functions: the industry standard
        | user retention and the main objective function, user
        | engagement.
       | 
       | Maybe retention and engagement _are_ sufficiently well correlated
       | to human evaluations, but you should probably do both and show
       | that they're strongly correlated before you decide to just drop
       | the human evaluators in favor of your cheap proxy measurements.
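        | 
        | That check is cheap once you have both sets of scores. A sketch
        | with made-up placeholder numbers, one entry per evaluated
        | chatbot:
        | 
        |     from scipy.stats import spearmanr
        | 
        |     human_eval = [3.1, 3.8, 2.4, 4.0]      # mean human rating
        |     engagement = [0.42, 0.55, 0.31, 0.58]  # proxy metric
        |     retention  = [0.18, 0.25, 0.12, 0.27]  # proxy metric
        | 
        |     for name, proxy in [("engagement", engagement),
        |                         ("retention", retention)]:
        |         rho, p = spearmanr(human_eval, proxy)
        |         print(f"{name}: rho = {rho:.2f} (p = {p:.2f})")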
       | 
       | And in this field, where there are some known issues with chat
       | LLMs, perhaps it's important to check stuff like:
       | 
       | - Does the model seem "engaging" just b/c the user has to refine
       | their prompt several times before they get a satisfying response?
       | 
       | - Do responses include a lot of hallucinations which might be
       | engaging but not true?
       | 
       | - Do successive responses show decreased consistency or coherence
       | between messages, in a way that might accidentally elicit
       | continued engagement?
       | 
        | Overall, it seems sloppy to believe that it's not a waste of
        | humans' time to talk to your chatbots, and it's not a waste of
       | time for readers to look at this paper about your chatbots, but
       | it's too expensive for you to actually measure the quality of
       | responses from your chatbots.
        
         | yorwba wrote:
          | They're making chatbots _specifically_ for humans to waste
          | time with (a.k.a. entertainment).
         | 
         | Engagement and user retention are directly connected to their
         | bottom line in a way that quality responses (e.g. introducing
         | you to a more fulfilling hobby than chatting with AIs) are not.
        
           | pk-protect-ai wrote:
            | That is what I read in this paper as well. It is not about
            | "better" as in better performance; it is "better" as in
            | improved user retention.
        
       | sp332 wrote:
       | Is it weird to refer to GPT-3.5 as "state of the art" when GPT-4
       | is right there? Actually the paper uses davinci interchangeably
       | with GPT-3.5 (sometimes without a hyphen) and ChatGPT.
        
         | mewpmewp2 wrote:
          | So many people seem to treat beating GPT-3.5 as the hallmark.
          | It's an immediate hint that they have no idea. There's a
          | clear and vast difference between GPT-4 and 3.5, making
          | GPT-3.5 almost worthless except perhaps for fast
          | summarisation tasks.
          | 
          | You really haven't done much with those models if they seem
          | remotely comparable.
          | 
          | To me, GPT-3.5 can just summarise and provide general answers
          | to questions, but GPT-4 can actually understand nuance and
          | show what seems to me to be reasoning.
        
       | m3kw9 wrote:
        | I really would like them to compare to GPT-4 instead of
        | claiming victory when matching 3.5. To me, GPT-4 is the first
        | one usable for a lot of professional work. 3.5 is fun and gets
        | some stuff right, but it's like a demo.
        
         | brucethemoose2 wrote:
         | Honestly, the baseline models they test and blend are really
         | terrible as well. Especially Pygmalion 6B, which is like
         | ancient history.
         | 
         | A Yi 34B or Mixtral finetune on the same data would blow them
         | out of the water. Probably blow ChatGPT 3.5 out of the water as
         | well.
        
       | goethes_kind wrote:
       | I find it suspicious that they would use user engagement and
       | retention and none of the normal benchmarks to test their model.
        
       | denimboy wrote:
       | mergekit is the tool you need to do this
       | https://github.com/cg123/mergekit
       | 
       | you can slice off layers and blend models with different
       | strategies.
        
         | brucethemoose2 wrote:
          | Mergekit is the best thing since sliced bread, as the local
          | LLM community already knows.
         | 
         | The dev's blog is great: https://goddard.blog/posts/
         | 
          | ...But it's not what this paper is describing. They are
          | basically alternating models, AFAIK. Also I have other nitpicks
         | with the paper, like using extremely old/mediocre chat models
         | as bases:
         | 
          | > Pygmalion 6B, Vicuna 13B, Chai Model 6B
        
       | miven wrote:
        | Now that I think about it, doesn't this "technique" triple the
        | amount of compute and memory per generated token, since for
        | every token in the conversation the two models that didn't
        | generate it still need to compute and store its KV values?
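        | 
        | Back-of-envelope for the memory side, with hypothetical but
        | plausible numbers for a ~6B model (32 layers, 32 KV heads,
        | head dim 128, fp16):
        | 
        |     layers, kv_heads, head_dim, bytes_fp16 = 32, 32, 128, 2
        |     tokens = 2048  # conversation length
        | 
        |     # K and V for every layer, per token, per model
        |     per_token = 2 * layers * kv_heads * head_dim * bytes_fp16
        |     one_model = per_token * tokens / 2**20
        | 
        |     print(f"one model's KV cache:   {one_model:.0f} MiB")
        |     print(f"three models, all hot:  {3 * one_model:.0f} MiB")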
        
         | leblancfg wrote:
          | It reads like that, yeah. Although 3 x 6B is still an order
          | of magnitude smaller than ChatGPT's purported 175B.
        
       ___________________________________________________________________
       (page generated 2024-01-11 23:00 UTC)