[HN Gopher] DenseFormer: Enhancing Information Flow in Transformers
       ___________________________________________________________________
        
       DenseFormer: Enhancing Information Flow in Transformers
        
       Author : tipsytoad
       Score  : 41 points
       Date   : 2024-03-22 18:13 UTC (4 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | sp332 wrote:
       | Even better is the result on page 7 that perplexity drops faster
       | by wall-clock time. Even if you're getting fewer iterations per
       | hour of rented GPU time, you're still coming out ahead in model
       | performance.
        
       | p1esk wrote:
        | This method has only been tested on tiny models (<1B
        | parameters) and a tiny dataset (17B tokens). It's not clear
        | whether it scales.
        
         | ml_basics wrote:
          | To be fair to the authors, they are affiliated with a
          | university rather than a big industrial lab, so they may
          | be working with significantly constrained resources. I'm
          | not sure what the best solution is here, given that the
          | same constraint affects nearly everyone outside of a very
          | select few labs.
        
         | jal278 wrote:
         | But it may scale -- that's science in progress
        
       | ml_basics wrote:
        | Cool paper. Really interesting to see that even quite
        | straightforward architectural modifications haven't all been
        | exhausted yet, despite all the resources being poured into
        | LLMs.
        
         | samus wrote:
          | The problem is that such changes have to be tested on
          | models of at least ~7B parameters to show promise at
          | larger scale, and that requires significant compute
          | resources.
        
           | tbalsam wrote:
            | Based on my personal experience with model development
            | over the years, I believe this is more a failure of the
            | current mainline Transformer recipe (the Transformer++
            | variant, I believe) to scale properly than something
            | inherent to scale itself.
            | 
            | If that's the case, then it may well be possible to fix
            | some of the scaling issues that are more apparent in
            | smaller transformer models (maybe not, though). That's
            | at least part of the reasoning I've been applying when
            | developing hlb-gpt, for example. It's also partly why I
            | think changing how we use nonlinearities within the
            | network might impact scaling, given some of the
            | activation spikes used in the more linear regions of
            | the network to control its behavior in ways that
            | weren't originally intended.
            | 
            | Agreed that it does require a ton of resources, though.
            | But I do think the problem can be studied at a smaller
            | scale: if we don't see a cleanly logarithmic loss
            | curve, then something is deeply wrong with our base
            | architecture. (Of course, I may be missing something
            | here.)
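            | 
            | A quick way to eyeball what I mean (entirely made-up
            | numbers, just to illustrate checking whether small-scale
            | losses sit on a clean power-law / log-linear trend):
            | 
            |     import numpy as np
            | 
            |     # hypothetical small-scale runs: params vs. val loss
            |     params = np.array([10e6, 30e6, 100e6, 300e6, 1e9])
            |     loss = np.array([4.10, 3.72, 3.35, 3.04, 2.76])
            | 
            |     # fit loss ~ a * params**b: a line in log-log space
            |     b, log_a = np.polyfit(np.log(params), np.log(loss), 1)
            |     pred = np.exp(log_a) * params ** b
            |     resid = np.abs(np.log(loss) - np.log(pred))
            | 
            |     max_resid = resid.max()
            |     print(f"fit slope {b:.3f}, max residual {max_resid:.3f}")
            |     # big residuals = the curve isn't cleanly log-linear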
        
       | aoeusnth1 wrote:
       | > Impact statement:
       | 
       | > This paper presents work whose goal is to advance the field of
       | Machine Learning. There are many potential societal consequences
       | of our work, none which we feel must be specifically highlighted
       | here.
       | 
       | I found this particularly charming.
        
         | polygamous_bat wrote:
          | AFAIK this was the default copy-paste impact statement
          | from the ICML template.
        
       | tbalsam wrote:
        | This is a very interesting idea. With DenseNets there are
        | often some terrible memory gotchas that have bitten me over
        | the past 7-8 years or so, so part of me is leaning back
        | waiting for some memory-usage shoe to drop that isn't
        | spelled out in the paper (even with the activation
        | patterns!).
        | 
        | However, maybe that's not the case here. I have a bit of a
        | history of messing with residuals in neural networks, so
        | seeing more work on them is good. Fast-training networks are
        | something of an obsession of mine as well, and very useful
        | to the field. Here's hoping this pans out as a motif;
        | curious to see where it goes.
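        | 
        | For anyone curious where the memory worry comes from, here
        | is roughly how I read the paper's depth-weighted averaging
        | (a PyTorch-flavored sketch; the names are mine, not the
        | authors' code):
        | 
        |     import torch
        |     import torch.nn as nn
        | 
        |     class DWAStack(nn.Module):
        |         # After block i, mix the embedded input and every
        |         # block output so far with learned weights, so all
        |         # of those tensors have to stay alive.
        |         def __init__(self, blocks):
        |             super().__init__()
        |             self.blocks = nn.ModuleList(blocks)
        |             self.alphas = nn.ParameterList()
        |             for i in range(len(self.blocks)):
        |                 w = torch.zeros(i + 2)
        |                 w[-1] = 1.0  # start as a plain skip
        |                 self.alphas.append(nn.Parameter(w))
        | 
        |         def forward(self, x):
        |             history = [x]
        |             for block, w in zip(self.blocks, self.alphas):
        |                 history.append(block(history[-1]))
        |                 # weighted average of everything so far
        |                 mixed = sum(a * h for a, h in zip(w, history))
        |                 history[-1] = mixed  # feeds the next block
        |             return history[-1]
        | 
        | Nothing fancy, but that growing history list of activations
        | is the part that tends to bite in DenseNet-style setups.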
        
       ___________________________________________________________________
       (page generated 2024-03-22 23:00 UTC)