[HN Gopher] Tree Attention: Topology-Aware Decoding for Long-Con...
___________________________________________________________________
Tree Attention: Topology-Aware Decoding for Long-Context
Author : diwank
Score : 76 points
Date : 2024-08-11 20:02 UTC (1 days ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| brrrrrm wrote:
| how does this approach differ from Nvidia's 2019 writeup on using
| trees to improve allreduce operations?
| https://developer.nvidia.com/blog/massively-scale-deep-learn...
| mrfox321 wrote:
| Similar, but specialized to softmax(QK)V computation
| Narhem wrote:
| How often do papers like this make it to industry
| applications/published research? Seems stuck in between the two.
| mithametacs wrote:
| This is going to shred when it reaches industry.
|
| But yeah, very math-heavy for a software engineering paper.
| ynniv wrote:
| How long can a page of python take? https://github.com/Zyphra
| /tree_attention/blob/main/tree_shar...
| mithametacs wrote:
| You seem to know your stuff.
|
| Will this technique work with existing model weights?
| kasmura wrote:
| Yes, it is just a way of computing the self-attention in
| a distributed way
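|
| (Not the authors' code -- a minimal single-process numpy sketch
| of the idea, for anyone wondering why existing weights work
| unchanged: split K and V into chunks, as if each chunk lived on
| its own GPU, compute partial softmax statistics per chunk, and
| merge the partials with an associative combine. That combine is
| what can be arranged as a tree reduction across devices, and the
| final result matches ordinary attention up to floating-point
| rounding.)
|
|   import numpy as np
|
|   def partial_attn(q, k_chunk, v_chunk):
|       # Per-chunk softmax statistics: running max, denominator,
|       # and unnormalised numerator (weighted sum of values).
|       s = q @ k_chunk.T
|       m = s.max(axis=-1, keepdims=True)
|       p = np.exp(s - m)
|       return m, p.sum(axis=-1, keepdims=True), p @ v_chunk
|
|   def combine(a, b):
|       # Associative merge of two partials -- the op that a tree
|       # reduction / allreduce applies across devices.
|       m = np.maximum(a[0], b[0])
|       wa, wb = np.exp(a[0] - m), np.exp(b[0] - m)
|       return m, wa * a[1] + wb * b[1], wa * a[2] + wb * b[2]
|
|   rng = np.random.default_rng(0)
|   q = rng.normal(size=(1, 64))
|   k = rng.normal(size=(1024, 64))
|   v = rng.normal(size=(1024, 64))
|
|   # "Distribute" K and V over 8 pretend devices, then reduce.
|   parts = [partial_attn(q, kc, vc)
|            for kc, vc in zip(np.split(k, 8), np.split(v, 8))]
|   m, denom, numer = parts[0]
|   for p in parts[1:]:
|       m, denom, numer = combine((m, denom, numer), p)
|
|   # Reference: ordinary (single-device) softmax attention.
|   s = q @ k.T
|   full = np.exp(s - s.max()) @ v / np.exp(s - s.max()).sum()
|   print(np.allclose(numer / denom, full))  # True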
| mjburgess wrote:
| It's from industry. It's a funded research startup.
| mjburgess wrote:
| I recall reading recently that someone went back and trained an
| RNN at a similar scale to a GPT and got similar performance on
| modern hardware (perhaps someone can link me that paper?).
|
| I.e., the innovation in statistical AI isn't in making the
| algorithms "smarter"; it's in finding ways to align the
| computation with modern GPU hardware -- this has been the story
| since 2012.
|
| In the end, the function all such algorithms are approximating is
| a conditional probability. I.e., the perfect answer to any prompt
| would be to ignore training entirely and, at inference time,
| compute an expectation across all historical data. All training
| really does is cache, roughly optimally, a large part of that
| computation.
|
| This is very different from how it's typically sold/understood,
| in the sense that there's an appearance that at inference time
| some unbounded computation is going on, i.e.,
| "thinking"/"reasoning"/etc. But at inference time, _for any
| prompt_, the same amount of computation is used, regardless of
| the question's complexity. So the system will appear to reason
| (etc.) if it can sample convincingly from its pre-cached
| computation.
|
| This means "innovation" here follows a Moore's-law S-curve for
| GPU hardware.
| doctorpangloss wrote:
| But often, less training performs better.
| mjburgess wrote:
| You'd have to give me an example.
|
| If less training performs better, it's because the model is
| overfitting. In the case of NLP/CV you generally cannot
| really "overfit", because most of the goal isn't predictive
| but representational -- training is templating rather than
| predicting. In these cases, training for longer should
| mostly improve the quality of the output, since there isn't
| any ground truth to overfit against.
| kasmura wrote:
| I disagree. Ring attention and tree attention are so general
| that the core ideas are independent of the details of modern
| GPUs. Maybe that's true for flash attention, but not this. I
| also disagree because these algorithms are fundamentally about
| enabling long context by distributing across GPUs, and that
| would not be enabled by a "Moore's law for GPU hardware".
| viraptor wrote:
| > But at inference time for any prompt the same amount of
| computation is used, regardless of the question complexity.
|
| That's not true, and it's why "think step by step" works. The
| output becomes part of the context. Forcing that way of
| answering essentially merges two queries into one: split the
| question into subtasks, then solve each subtask separately.
|
| The complex question does cause more computation and yields
| better-quality answers.
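|
| (A rough arithmetic sketch of that point, with illustrative
| numbers rather than anything from the paper: the compute per
| generated token is roughly fixed, but a step-by-step answer
| emits more tokens, so the total compute spent on the query
| grows.)
|
|   def decode_flops(n_params, generated_tokens):
|       # ~2 FLOPs per parameter per generated token is a common
|       # rule of thumb for the forward pass; the attention term
|       # is ignored here for simplicity.
|       return 2 * n_params * generated_tokens
|
|   terse = decode_flops(7e9, 20)    # short direct answer
|   cot = decode_flops(7e9, 400)     # "think step by step" answer
|   print(f"{cot / terse:.0f}x the compute")  # 20x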
| mjburgess wrote:
| Sure, it's not quite true, but it's close enough for the
| analysis to be correct.
|
| I.e., if we take a family of problems P1..Pn, each statable
| as a prompt and each having an 'essential complexity to
| solve' O(P1)...O(Pn), then it's trivial to show that no curve-
| fitting statistical algorithm assigns the 'correct'
| computational budget to each. The budget is basically
| constant, and it does not grow as the problem complexity
| increases.
|
| (E.g., consider the prompt "by applying Dijkstra's algorithm,
| find a path...." on graphs G1, G2..Gn -- with ever more
| complex graphs.)
|
| It is right to say that step-by-step prompting expands the
| computational resources available to the LLM and therefore
| often improves the quality of the answer, but it is
| incorrect to say that it does so by _allocating_ computation
| according to that complexity. At best you could say that it
| is we, the human prompters, who, seeing that the answer is
| bad, increase its budget -- so _we_ are partially modelling
| the complexity of the problem _for the LLM_.
|
| If you could create an LLM that "prompted itself" according
| to the complexity of the problem, and measured that its
| energy/time/etc. scaled as we would expect, then you'd be on
| firmer ground with this claim. But I'd be inclined to deny
| that this is possible, given what an LLM is...
|
| I.e., I claim that I can always find a family of problems
| P1..Pn where the computational distribution will rule out
| that reasoning is taking place. Why? Because the LLM is only
| sampling from a compression of its training data; I deny it
| is at all sensitive to the meaning of the terms.
|
| So whereas a person will understand a novel problem (family)
| on the basis of its meaning, and expend time appropriately,
| the LLM is limited to sampling from prior examples of such
| problems in training data.
| viraptor wrote:
| > ie., I claim that I can always find a family of problems
| P1..Pn where the computational distribution will rule out
| that reasoning is taking place.
|
| Finding edge cases is not a great way to evaluate
| probabilistic processes. You'll almost always find a
| pathological case. I'm sure we can find a series of
| questions of increasing difficulty where humans think
| they're of decreasing difficulty. That wouldn't prove
| humans don't reason.
|
| > If you could create an LLM that "prompted itself"
| according to the complexity of the problem
|
| You don't have to create one. We know that in-context
| learning is roughly equivalent to training, so you can add
| instructions to the system prompt about estimating task
| difficulty and taking different approaches depending on the
| result. You'll see that the estimate is possible. Agents use
| this all the time to decide when to split a task into
| subtasks and when to proceed with solving directly.
|
| > So whereas a person will understand a novel problem
| (family) on the basis of its meaning, and expend time
| appropriately, the LLM is limited to sampling from prior
| examples of such problems in training data.
|
| Can you prove humans aren't just sampling from training
| data + preserving context? I don't believe anyone has
| really documented how human thought works at that level.
| I'm not saying it's true or false, just - do we actually
| know enough to validate the claim you made?
| FeepingCreature wrote:
| Rather, the amount of computation for every token is bounded.
| spencerchubb wrote:
| Hopefully people will figure out how to make AI work with
| dynamic compute rather than fixed compute!
| cs702 wrote:
| Interesting.
|
| The authors claim this outperforms Ring Attention for distributed
| computation of self-attention over multiple GPUs.
|
| Distributing computation is necessary whenever context is too
| long for self-attention's computation to fit in a single GPU's
| available memory.
|
| Github link: https://github.com/Zyphra/tree_attention
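|
| Back-of-the-envelope on why distribution becomes necessary, with
| illustrative model dimensions (not numbers from the paper): the
| fp16 KV cache alone for a 70B-class model outgrows a single
| 80 GB GPU well before a million tokens.
|
|   def kv_cache_gib(seq_len, n_layers=80, n_kv_heads=64,
|                    head_dim=128, bytes_per_value=2):
|       # 2x for keys and values, fp16 (2 bytes) per element.
|       return (2 * n_layers * n_kv_heads * head_dim
|               * seq_len * bytes_per_value) / 2**30
|
|   for n in (8_192, 131_072, 1_048_576):
|       print(n, "tokens ->", kv_cache_gib(n), "GiB")
|   # prints roughly 20, 320, and 2560 GiB respectively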
| tveita wrote:
| The same authors also have a language model at
| https://github.com/Zyphra/Zamba2 but it's not clear to me if that
| model is connected to tree attention.
|
| The announcement at https://www.zyphra.com/post/zamba2-small
| links to this paper, but the paper doesn't actually mention
| Zamba2 anywhere.
___________________________________________________________________
(page generated 2024-08-12 23:02 UTC)