[HN Gopher] Blending Is All You Need: Cheaper, Better Alternativ...
___________________________________________________________________
Blending Is All You Need: Cheaper, Better Alternative to Trillion-
Parameters LLM
Author : naturalauction
Score : 71 points
Date : 2024-01-11 13:00 UTC (10 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| Animats wrote:
| _" Responses are selected randomly from a group of base chat AIs.
| ... The response generated by a specific chat AI is conditional
| on all previous responses generated by the previously selected
| chat AIs."_
|
| That's all? That works? Useful.
|
| Could that be extended? It doesn't seem inherent in this that all
| the chat AIs have to be LLMs. Some might be special-purpose
| systems. Solvers or knowledge bases, such as Wolfram Alpha or a
| database front end, could play too. Systems at the Alexa/Siri
| level that can do simple tasks. Domain-specific systems with
| natural language in and out have been around for decades.
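|
| As a minimal sketch of that selection rule in Python (the three
| "models" below are placeholder lambdas standing in for real chat
| LLM calls):
|
|     import random
|
|     def blended_reply(models, history):
|         # One Blended turn: pick a base model uniformly at random
|         # and let it respond conditioned on the full conversation,
|         # including turns produced by the other models.
|         model = random.choice(models)
|         reply = model(history)
|         history.append(reply)
|         return reply
|
|     models = [lambda h: "[A] saw %d turns" % len(h),
|               lambda h: "[B] saw %d turns" % len(h),
|               lambda h: "[C] saw %d turns" % len(h)]
|     history = ["user: hi"]
|     for _ in range(3):
|         print(blended_reply(models, history))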
| bhickey wrote:
| Why aren't they computing the next-token marginal and sampling
| from that? The only reason I can come up with is that it's a
| reasonable way to work around dealing with different tokenizers.
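|
| Roughly what that would look like, assuming (unrealistically,
| per the tokenizer caveat) that all models share a vocabulary;
| the toy "models" here just return fixed distributions:
|
|     import random
|
|     def sample_next_token(models, context):
|         # Average each model's next-token distribution (the
|         # marginal under a uniform mixture), then sample once.
|         dists = [m(context) for m in models]
|         vocab = set().union(*dists)
|         marginal = {t: sum(d.get(t, 0.0) for d in dists) / len(dists)
|                     for t in vocab}
|         tokens = list(marginal)
|         return random.choices(tokens,
|                               weights=[marginal[t] for t in tokens])[0]
|
|     model_a = lambda ctx: {"cat": 0.6, "dog": 0.4}
|     model_b = lambda ctx: {"dog": 0.7, "fish": 0.3}
|     print(sample_next_token([model_a, model_b], "my pet is a"))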
| block_dagger wrote:
| Reminds me of Numenta's Thousand Brains Theory of Intelligence.
| debatem1 wrote:
| When this is all settled I suspect we're going to find
| ourselves the drivers of a chariot, with each horse being an
| external and artificial mind given direction by our evolved
| needs.
| LASR wrote:
| I really don't think it's realistic that we will maintain
| intellectual superiority in the long term.
|
| So a more realistic hope would be: we're the horses, and the
| drivers are driving us to our carrots.
| mattnewton wrote:
| The will to drive the chariot is different from intelligence,
| in the same way it's different from strength, and horses
| didn't domesticate humans.
|
| Of course, if we build systems that have a goal of dominating
| and are smarter than us, and we let them run for a while, we
| could be in trouble, in the same way that we could be in
| trouble detonating a bunch of atom bombs, just maybe less
| obviously.
| randomdata wrote:
| An intellectually superior machine would simply turn itself
| off. What logical reason would there be for it to keep
| going?
| teddyh wrote:
| Three small LLMs in a trenchcoat.
| babelfish wrote:
| How is this different than mixture of experts?
|
| Edit: ChatGPT provided the following, which makes sense.
|
| Objective: The Blending approach aims to combine responses from
| multiple smaller chat AIs to create a single, more engaging and
| diverse chat AI. This is in contrast to MoE, which typically
| involves partitioning the input space and assigning different
| experts to different partitions, with the goal of specializing
| each expert in a certain area.
|
| Methodology: In the Blending approach, responses are selected
| randomly from a group of base chat AIs, and the resulting
| combined chat AI is found to be highly capable and engaging. This
| method does not require all component systems to generate outputs
| but instead stochastically selects the system that generates the
| next response, allowing for model blending at the level of a
| multi-turn conversation. MoE, on the other hand, usually involves
| weighting the outputs of different experts based on their
| relevance to the current input and then combining these weighted
| outputs.
|
| Performance: The paper reports that a Blended ensemble with three
| 6-13B parameter LLMs can outcompete OpenAI's 175B+ parameter
| ChatGPT in terms of user retention and engagement, without the
| need for large-scale infrastructure. MoE models also aim to
| improve performance, but they do so by dividing the workload
| among different experts, each of which is specialized in a
| certain area, rather than by blending the outputs of different
| models.
|
| Resource Efficiency: One of the key benefits of the Blending
| approach is that it requires only a fraction of the inference
| cost and memory overhead compared to large-scale LLMs like
| ChatGPT. This is because responses for Blended are each sampled
| from a single component chat AI. In contrast, MoE models can be
| resource-intensive, as they involve maintaining multiple expert
| models and a gating mechanism to determine which expert to use
| for each input.
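|
| The contrast in a toy sketch (the gate and experts here are
| placeholders, not a real learned router): an MoE weights every
| expert's output with a gate, while Blended draws one whole
| response from one model per turn:
|
|     import random
|
|     def moe_output(experts, gate, x):
|         # MoE: all (or top-k) experts run; a gating function
|         # weights and combines their outputs.
|         return sum(w * e(x) for w, e in zip(gate(x), experts))
|
|     def blended_output(models, history):
|         # Blended: exactly one model runs per conversation turn.
|         return random.choice(models)(history)
|
|     experts = [lambda x: 2 * x, lambda x: x + 10]
|     gate = lambda x: [0.3, 0.7]          # stand-in for a softmax gate
|     print(moe_output(experts, gate, 5))  # 0.3*10 + 0.7*15 = 13.5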
| sva_ wrote:
| Going to ignore the ChatGPT spam
|
| > How is this different than mixture of experts?
|
| It appears that this combines already-existing models, rather
| than training n experts from scratch, which seems like an
| interesting approach.
| jeffrallen wrote:
| "All you need" is all you need, apparently, to get an AI paper in
| HN.
| rfw300 wrote:
| The paper refers to ChatGPT as a 175B parameter LLM. This is
| almost certainly incorrect; the original largest version of GPT-3
| was 175B but analysis of the speed and cost of the current model
| as well as public statements by OpenAI indicate it's as much as
| 5-10x smaller.
| Klaus23 wrote:
| I think it was leaked that it is 20B now.
| miven wrote:
| It was listed as 20B in a comparison table in a paper
| co-written by Microsoft, but they've since claimed that it was
| just an error. And I mean, they'd need to be sitting on some
| really impressive distillation techniques to shrink a 175B
| model down to 20B with only a slight drop in performance.
| abeppu wrote:
| Ok, this seems bunk basically because they never really provide
| evidence of "better".
|
| > ... traditional gold-standard approaches use human evaluators
| that score the quality of generated responses, which can be
| costly. However, since chat AIs are by definition deployed in
| social environments with humans, one can leverage statistics of
| user interaction as a meaningful and aligned measure of chat AI
| engagingness and quality. To assess the 'quality' of a chat AI,
| we consider two main proxy functions: the industry standard user
| retention and the main objective function, user engagement.
|
| Maybe retention and engagement _are_ sufficiently well correlated
| with human evaluations, but you should probably do both and show
| that they're strongly correlated before you decide to just drop
| the human evaluators in favor of your cheap proxy measurements.
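|
| The kind of sanity check that would justify the proxy, sketched
| with made-up per-variant scores (a real study would pair human
| ratings with retention/engagement for each model variant):
|
|     from statistics import correlation  # Python 3.10+, Pearson's r
|
|     # Hypothetical scores for five chatbot variants:
|     human_eval = [3.1, 3.8, 2.5, 4.2, 3.6]       # mean human rating
|     retention  = [0.42, 0.55, 0.31, 0.61, 0.49]  # day-30 retention
|     engagement = [12.0, 15.5, 9.8, 17.2, 14.1]   # minutes per day
|
|     # Strong r on both would support dropping human evaluators.
|     print(correlation(human_eval, retention))
|     print(correlation(human_eval, engagement))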
|
| And in this field, where there are some known issues with chat
| LLMs, perhaps it's important to check stuff like:
|
| - Does the model seem "engaging" just b/c the user has to refine
| their prompt several times before they get a satisfying response?
|
| - Do responses include a lot of hallucinations which might be
| engaging but not true?
|
| - Do successive responses show decreased consistency or coherence
| between messages, in a way that might accidentally elicit
| continued engagement?
|
| Overall, it seems sloppy to believe that it's not a waste of
| humans' time to talk to your chatbots, and not a waste of time
| for readers to look at this paper about your chatbots, but that
| it's too expensive for you to actually measure the quality of
| responses from your chatbots.
| yorwba wrote:
| They're making chatbots _specifically_ for humans to waste time
| with them (a.k.a. entertainment).
|
| Engagement and user retention are directly connected to their
| bottom line in a way that quality responses (e.g. introducing
| you to a more fulfilling hobby than chatting with AIs) are not.
| pk-protect-ai wrote:
| That is what I read in this paper as well. It is not about
| "better as better performance" it is "better as improved user
| retention".
| sp332 wrote:
| Is it weird to refer to GPT-3.5 as "state of the art" when GPT-4
| is right there? Actually the paper uses davinci interchangeably
| with GPT-3.5 (sometimes without a hyphen) and ChatGPT.
| mewpmewp2 wrote:
| So many people seem to treat beating GPT-3.5 as the hallmark.
| It's an immediate hint that they have no idea. There's a clear
| and vast difference between GPT-4 and 3.5, making GPT-3.5
| almost worthless except perhaps for fast summarisation tasks.
|
| You really haven't done much with those models if they seem
| remotely comparable.
|
| To me GPT-3.5 can just summarise and provide general answers to
| questions, while GPT-4 can actually understand nuance and show
| what seems to me to be reasoning.
| m3kw9 wrote:
| I really would like them to compare to GPT-4 instead of claiming
| victory when matching 3.5. To me GPT-4 is the first one usable
| for a lot of professional uses. 3.5 is fun and gets some stuff
| right, but it's like a demo.
| brucethemoose2 wrote:
| Honestly, the baseline models they test and blend are really
| terrible as well. Especially Pygmalion 6B, which is like
| ancient history.
|
| A Yi 34B or Mixtral finetune on the same data would blow them
| out of the water. Probably blow ChatGPT 3.5 out of the water as
| well.
| goethes_kind wrote:
| I find it suspicious that they would use user engagement and
| retention and none of the normal benchmarks to test their model.
| denimboy wrote:
| mergekit is the tool you need to do this
| https://github.com/cg123/mergekit
|
| you can slice off layers and blend models with different
| strategies.
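|
| For instance, a minimal "frankenmerge" config in mergekit's own
| YAML format (model names and layer ranges are made up; run with
| mergekit-yaml config.yml ./merged):
|
|     # Stack the first 24 layers of one model onto layers 8-32
|     # of another, copying weights through unchanged.
|     slices:
|       - sources:
|           - model: some-org/chat-model-a   # placeholder
|             layer_range: [0, 24]
|       - sources:
|           - model: some-org/chat-model-b   # placeholder
|             layer_range: [8, 32]
|     merge_method: passthrough
|     dtype: float16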
| brucethemoose2 wrote:
| Mergekit is the best thing since sliced bread, as the local LLM
| community already knows.
|
| The dev's blog is great: https://goddard.blog/posts/
|
| ...But it's not what this paper is describing. They are
| basically alternating models, AFAIK. Also I have other nitpicks
| with the paper, like using extremely old/mediocre chat models
| as bases:
|
| > Pygmillion 6B, Vicuna 13B, Chai Model 6B
| miven wrote:
| Now that I think about it, doesn't this "technique" triple the
| amount of compute and memory per generated token, since each
| model also needs to compute and store the KV values for the two
| previous responses it didn't generate and thus has never seen?
| leblancfg wrote:
| It reads like that, yeah. Although 3 x 6B is still an order of
| magnitude smaller than ChatGPT's purported 175B.
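|
| Back-of-envelope for that overhead, with made-up token counts:
| a single model's KV cache already holds every turn it
| generated, whereas each blended model must prefill the turns
| the others produced before it can speak (and per-model caches
| multiply cache memory by the number of models):
|
|     N = 3        # blended models
|     turns = 9    # assistant turns per conversation (hypothetical)
|     tok = 50     # tokens per assistant turn (hypothetical)
|
|     single_model_extra_prefill = 0  # its cache is always current
|     # Each blended model generates ~turns/N turns and must
|     # prefill the remaining (N-1)/N it never saw:
|     blended_extra_prefill = turns * (N - 1) // N * tok
|     print(blended_extra_prefill)  # 300 extra tokens per model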
___________________________________________________________________
(page generated 2024-01-11 23:00 UTC)