[HN Gopher] Tree Attention: Topology-Aware Decoding for Long-Con...
       ___________________________________________________________________
        
       Tree Attention: Topology-Aware Decoding for Long-Context
        
       Author : diwank
       Score  : 76 points
        Date   : 2024-08-11 20:02 UTC (1 day ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | brrrrrm wrote:
       | how does this approach differ from Nvidia's 2019 writeup on using
       | trees to improve allreduce operations?
       | https://developer.nvidia.com/blog/massively-scale-deep-learn...
        
         | mrfox321 wrote:
         | Similar, but specialized to softmax(QK)V computation
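          | 
          | The key observation (as I read it) is that the softmax
          | numerator and denominator can be merged associatively across
          | key/value chunks, so the reduction can be arranged as a tree,
          | much like an allreduce. A rough numpy sketch of that combine
          | step (my own toy code, not from the paper or repo; one query
          | row, no batching or heads):
          | 
          |     import numpy as np
          | 
          |     def chunk_stats(q, k, v):
          |         # per-chunk partials: running max, sum of exp, weighted V
          |         s = q @ k.T
          |         m = s.max(axis=-1, keepdims=True)
          |         p = np.exp(s - m)
          |         return m, p.sum(axis=-1, keepdims=True), p @ v
          | 
          |     def combine(a, b):
          |         # associative merge of two chunks' partials, so they
          |         # can be reduced in any tree/allreduce order
          |         m_a, d_a, o_a = a
          |         m_b, d_b, o_b = b
          |         m = np.maximum(m_a, m_b)
          |         d = d_a * np.exp(m_a - m) + d_b * np.exp(m_b - m)
          |         o = o_a * np.exp(m_a - m) + o_b * np.exp(m_b - m)
          |         return m, d, o
          | 
          |     # toy check: two key/value shards match full attention
          |     rng = np.random.default_rng(0)
          |     q = rng.normal(size=(1, 8))
          |     k, v = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
          |     m, d, o = combine(chunk_stats(q, k[:8], v[:8]),
          |                       chunk_stats(q, k[8:], v[8:]))
          |     s = q @ k.T
          |     w = np.exp(s - s.max(axis=-1, keepdims=True))
          |     full = (w / w.sum(axis=-1, keepdims=True)) @ v
          |     assert np.allclose(o / d, full)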
        
       | Narhem wrote:
        | How often do papers like this make it into industry
        | applications or published research? This one seems stuck in
        | between the two.
        
         | mithametacs wrote:
         | This is going to shred when it reaches industry.
         | 
          | But yeah, very math-heavy for a software engineering paper.
        
           | ynniv wrote:
           | How long can a page of python take? https://github.com/Zyphra
           | /tree_attention/blob/main/tree_shar...
        
             | mithametacs wrote:
             | You seem to know your stuff.
             | 
             | Will this technique work with existing model weights?
        
               | kasmura wrote:
                | Yes, it's just a way of computing the same
                | self-attention in a distributed way, so pretrained
                | weights work unchanged.
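                | 
                | It computes the same softmax(QK)V exactly, just reduced
                | across devices, which is why no retraining is needed. A
                | toy sketch of the tree-shaped reduction over simulated
                | "devices" (my own code, not the repo's; a real version
                | also carries a running max for numerical stability):
                | 
                |     import numpy as np
                | 
                |     rng = np.random.default_rng(0)
                |     q = rng.normal(size=(1, 8))
                |     k = rng.normal(size=(64, 8))
                |     v = rng.normal(size=(64, 8))
                | 
                |     # each "device" holds one shard of K/V and returns a
                |     # partial (numerator, denominator) of softmax(qK^T)V
                |     def partial(ks, vs):
                |         w = np.exp(q @ ks.T)
                |         return w @ vs, w.sum(axis=-1, keepdims=True)
                | 
                |     parts = [partial(ks, vs) for ks, vs
                |              in zip(np.split(k, 8), np.split(v, 8))]
                | 
                |     # tree reduction: pairwise merges, log2(#devices) rounds
                |     while len(parts) > 1:
                |         parts = [(a[0] + b[0], a[1] + b[1])
                |                  for a, b in zip(parts[::2], parts[1::2])]
                | 
                |     num, den = parts[0]
                |     w = np.exp(q @ k.T)
                |     full = (w / w.sum(axis=-1, keepdims=True)) @ v
                |     assert np.allclose(num / den, full)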
        
         | mjburgess wrote:
          | It's from industry: the authors are at Zyphra, a funded
          | research startup.
        
       | mjburgess wrote:
       | I recall reading recently that someone went back and trained an
       | RNN at a similar scale to a GPT and got similar performance on
       | modern hardware (perhaps someone can link me that paper?).
       | 
        | I.e., the innovation in statistical AI isn't in making the
        | algorithms "smarter"; it's in finding ways to align the
        | computation with modern GPU hardware -- this has been the story
        | since 2012.
       | 
        | In the end, the function all such algorithms approximate is a
        | conditional probability. I.e., the "perfect" answer to any
        | prompt would be to ignore training entirely and, at inference
        | time, compute an expectation across all historical data. All
        | training really does is cache a large part of that computation
        | up front.
       | 
        | This is very different from how it's typically sold/understood,
        | in the sense that there's an appearance that at inference time
        | some unbounded computation is going on, i.e.,
        | "thinking"/"reasoning"/etc. But at inference time, _for any
        | prompt_, the same amount of computation is used regardless of
        | the question's complexity. So the system will appear to reason
        | (etc.) if it can sample convincingly from its pre-cached
        | computation.
       | 
        | This means "innovation" here follows a Moore's-law S-curve for
        | GPU hardware.
        
         | doctorpangloss wrote:
         | But often, less training performs better.
        
           | mjburgess wrote:
           | You'd have to give me an example.
           | 
            | If less training performs better, it's because the model is
            | overfitting. In the case of NLP/CV you generally cannot
            | really "overfit", because most of the goal isn't predictive
            | but representational -- training is templating rather than
            | predicting. In these cases training for longer should mostly
            | improve the quality of the output, since there isn't any
            | ground truth to overfit against.
        
         | kasmura wrote:
          | I disagree. Ring attention and tree attention are so general
          | that the core ideas are independent of the details of modern
          | GPUs. Maybe that's true of flash attention, but not this. I
          | also disagree because these algorithms are fundamentally about
          | enabling long context by distributing across GPUs, which is
          | not something "Moore's law for GPU hardware" would provide.
        
         | viraptor wrote:
         | > But at inference time for any prompt the same amount of
         | computation is used, regardless of the question complexity.
         | 
          | That's not true, and it's why "think step by step" works. The
          | output becomes part of the context. Forcing that way of
          | answering essentially merges two queries into one: split the
          | question into subtasks, then solve each subtask separately.
          | 
          | The more complex question does cause more computation, and it
          | produces better-quality answers.
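          | 
          | Back-of-the-envelope (treating decode cost as roughly 2 FLOPs
          | per parameter per generated token, for a hypothetical
          | 7B-parameter model, and ignoring the growing KV cache):
          | 
          |     # total decode compute scales with the number of tokens
          |     # produced, even though per-token cost is roughly fixed
          |     params = 7e9
          |     flops_per_token = 2 * params
          |     for label, n_tokens in [("terse answer", 20),
          |                             ("step-by-step answer", 400)]:
          |         print(f"{label}: ~{n_tokens * flops_per_token:.1e} FLOPs")
          |     # the step-by-step answer spends ~20x more total compute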
        
           | mjburgess wrote:
           | Sure, it's not quite true, but it's close enough for the
           | analysis to be correct.
           | 
            | I.e., if we take a family of problems P1..Pn, each statable
            | as a prompt, each with an 'essential complexity to solve'
            | O(P1)...O(Pn), then it's trivial to show that no curve-
            | fitting statistical algorithm assigns the 'correct'
            | computational budget to each. The budget is basically
            | constant, and it does not grow with the problem's complexity
            | as it should.
            | 
            | (E.g., consider the prompt: by applying Dijkstra's
            | algorithm, find a path....on graph G1, G2..Gn -- with ever
            | more complex graphs.)
           | 
            | It is right to say that step-by-step prompting expands the
            | computational resources available to the LLM and therefore
            | often improves the quality of the answer... but it is
            | incorrect to say that it does so by _allocating_ computation
            | according to that complexity. At best you could say that we,
            | the human prompters, seeing that the answer is bad, increase
            | its budget -- so _we_ are partially modelling the complexity
            | of the problem _for the LLM_.
           | 
            | If you could create an LLM that "prompted itself" according
            | to the complexity of the problem, and its measured
            | energy/time/etc. scaled as we would expect... then you'd be
            | on firmer ground with this claim. But I'd be inclined to
            | deny that this is possible, given what an LLM is...
           | 
            | I.e., I claim that I can always find a family of problems
            | P1..Pn where the distribution of computation will rule out
            | that reasoning is taking place. Why? Because the LLM is only
            | sampling from a compression of the training data, I deny
            | that it is at all sensitive to the meaning of the terms.
           | 
           | So whereas a person will understand a novel problem (family)
           | on the basis of its meaning, and expend time appropriately,
           | the LLM is limited to sampling from prior examples of such
           | problems in training data.
        
             | viraptor wrote:
             | > ie., I claim that I can always find a family of problems
             | P1..Pn where the computational distribution will rule out
             | that reasoning is taking place.
             | 
             | Finding edge cases is not a great way to evaluate
             | probabilistic processes. You'll almost always find a
             | pathological case. I'm sure we can find a series of
             | questions of increasing difficulty where humans think
             | they're of decreasing difficulty. That wouldn't prove
             | humans don't reason.
             | 
             | > If you could create an LLM that "prompted itself"
             | according to the complexity of the problem
             | 
              | You don't have to create one. We know that in-context
              | learning is roughly equivalent to training, so you can add
              | instructions to the system prompt about estimating task
              | difficulty and taking different approaches depending on
              | the result. You'll see that the estimate is possible.
              | Agents use this all the time to decide when to split a
              | task into subtasks and when to just solve it directly.
             | 
             | > So whereas a person will understand a novel problem
             | (family) on the basis of its meaning, and expend time
             | appropriately, the LLM is limited to sampling from prior
             | examples of such problems in training data.
             | 
             | Can you prove humans aren't just sampling from training
             | data + preserving context? I don't believe anyone has
             | really documented how human thought works at that level.
             | I'm not saying it's true or false, just - do we actually
             | know enough to validate the claim you made?
        
         | FeepingCreature wrote:
         | Rather, the amount of computation for every token is bounded.
        
         | spencerchubb wrote:
         | Hopefully people will figure out how to make AI work with
         | dynamic compute rather than fixed compute!
        
       | cs702 wrote:
       | Interesting.
       | 
       | The authors claim this outperforms Ring Attention for distributed
       | computation of self-attention over multiple GPUs.
       | 
       | Distributing computation is necessary whenever context is too
       | long for self-attention's computation to fit in a single GPU's
       | available memory.
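        | 
        | For a rough sense of scale, a back-of-the-envelope sketch of
        | the KV-cache footprint alone (hypothetical dense model: 80
        | layers, hidden size 8192, fp16 cache, no grouped-query
        | attention):
        | 
        |     # the KV cache stores one K and one V vector per layer per
        |     # token of context
        |     layers, hidden, bytes_per_val = 80, 8192, 2
        |     kv_bytes_per_token = 2 * layers * hidden * bytes_per_val
        |     for context in (8_192, 131_072, 1_000_000):
        |         gb = context * kv_bytes_per_token / 1e9
        |         print(f"{context:>9,} tokens -> ~{gb:,.0f} GB of KV cache")
        |     # ~21 GB at 8K tokens and terabytes near 1M tokens -- far
        |     # beyond a single GPU's memory, hence sharding keys and
        |     # values across devices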
       | 
       | Github link: https://github.com/Zyphra/tree_attention
        
       | tveita wrote:
       | The same authors also have a language model at
       | https://github.com/Zyphra/Zamba2 but it's not clear to me if that
       | model is connected to tree attention.
       | 
       | The announcement at https://www.zyphra.com/post/zamba2-small
       | links to this paper, but the paper doesn't actually mention
       | Zamba2 anywhere.
        
       ___________________________________________________________________
       (page generated 2024-08-12 23:02 UTC)