[HN Gopher] DenseFormer: Enhancing Information Flow in Transformers
___________________________________________________________________
DenseFormer: Enhancing Information Flow in Transformers
Author : tipsytoad
Score : 41 points
Date : 2024-03-22 18:13 UTC (4 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| sp332 wrote:
| Even better is the result on page 7 that perplexity drops
| faster as a function of wall-clock time. Even if you're getting
| fewer iterations per hour of rented GPU time, you're still
| coming out ahead in model performance.
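  (The trade-off above is simple arithmetic. A toy back-of-the-
  envelope in Python with made-up numbers, not figures from the
  paper:)

      # Hypothetical numbers for illustration only (not from the paper):
      # suppose the modified model runs 10% fewer steps per GPU-hour but
      # needs 25% fewer steps to reach a target perplexity.
      baseline_steps_per_hour = 1000
      dense_steps_per_hour = 900            # 10% slower per step

      baseline_steps_to_target = 100_000    # steps to hit the target ppl
      dense_steps_to_target = 75_000        # 25% fewer steps

      baseline_hours = baseline_steps_to_target / baseline_steps_per_hour
      dense_hours = dense_steps_to_target / dense_steps_per_hour
      print(baseline_hours, dense_hours)    # 100.0 vs ~83.3 GPU-hours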
| p1esk wrote:
| This method has only been tested on tiny models (<1B parameters)
| and a tiny dataset (17B tokens). It's not clear whether it
| scales.
| ml_basics wrote:
| To be fair to the authors, they are affiliated with a university
| and not a big industrial lab, so they may be working with
| significantly constrained resources. Not sure exactly what the
| best solution is here, given that this constraint affects
| almost everyone outside of a very select few labs.
| jal278 wrote:
| But it may scale -- that's science in progress
| ml_basics wrote:
| Cool paper. Really interesting to see that even quite
| straightforward architectural modifications haven't all been
| exhausted yet, despite all the resources being poured into LLMs.
| samus wrote:
| The problem is that they have to be tested on at least 7B
| models to show promise for larger models. And that requires
| significant compute resources.
| tbalsam wrote:
| Based on some of my personal experiences over the years with
| model development, I believe this is more a failure of the
| current mainline Transformer recipe (the "++" variant, I
| believe) to scale properly than an indicator of scale itself.
|
| If that is the case, then it may well be possible to fix some
| of the scaling issues that are most apparent in smaller
| transformer models (maybe not, though). That's at least part of
| the reasoning I've been applying when developing hlb-gpt, for
| example. It's also partly why I think changing how we use
| nonlinearities within the network might impact scaling: some of
| the activation spikes in the more linear regions of the network
| end up controlling network behavior in a way that was never
| originally intended.
|
| Agreed that it does require a ton of resources, though. But I
| do think the problem can be studied at a smaller scale: if we
| don't get a cleanly logarithmic scaling curve there, then
| something is deeply wrong with our base architecture. (However,
| of course, I may entirely be missing something here.)
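  (A concrete, if simplistic, version of that "cleanly
  logarithmic curve" check: fit a power law to loss vs. parameter
  count across a small sweep and look at the residuals. A minimal
  numpy sketch with placeholder numbers, not measurements:)

      import numpy as np

      # Placeholder sweep: model sizes and final validation losses.
      params = np.array([10e6, 30e6, 100e6, 300e6, 1e9])
      losses = np.array([4.10, 3.72, 3.35, 3.05, 2.80])

      # A power law loss ~ C * params**(-a) is a straight line in log-log.
      x, y = np.log(params), np.log(losses)
      slope, intercept = np.polyfit(x, y, 1)
      residuals = y - (slope * x + intercept)
      print(f"fitted exponent: {slope:.3f}, "
            f"max residual: {abs(residuals).max():.4f}")
      # Large residuals at a particular scale would be the kind of
      # "something is wrong with the base architecture" signal meant above.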
| aoeusnth1 wrote:
| > Impact statement:
|
| > This paper presents work whose goal is to advance the field of
| Machine Learning. There are many potential societal consequences
| of our work, none which we feel must be specifically highlighted
| here.
|
| I found this particularly charming.
| polygamous_bat wrote:
| AFAIK this is the default, copy-paste impact statement from the
| ICML template.
| tbalsam wrote:
| This is a very interesting idea. With DenseNets there are
| oftentimes some terrible memory gotchas that have bitten me
| over the past 7-8 years or so, so a part of me is sorta leaning
| back waiting for some memory-usage shoe to drop that isn't
| spelled out in the paper (even with the activation patterns!).
|
| However, maybe that's not the case here. I have a bit of a
| history of messing with residuals in neural networks, and
| seeing more work on them is good. Fast-training networks are,
| of course, a not-so-mild obsession of mine as well, and very
| useful to the field. Here's hoping it pans out as a motif;
| curious to see where it goes.
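  (For context on the memory worry: the paper's depth-weighted
  averaging makes each block's input a learned weighted average
  over the embedding and everything computed so far, so those
  intermediate outputs have to stay resident. A rough PyTorch
  sketch of that idea, based on my reading of the abstract and
  not the authors' implementation; `blocks` is assumed to be any
  stack of standard transformer blocks:)

      import torch
      import torch.nn as nn

      class DWAStack(nn.Module):
          """Sketch of DenseFormer-style depth-weighted averaging (DWA)."""
          def __init__(self, blocks: nn.ModuleList):
              super().__init__()
              self.blocks = blocks
              # alphas[i] has i + 2 weights: one for the embedding and one
              # for each output produced so far. Initialized to put all
              # weight on the newest output, i.e. a plain residual stack.
              self.alphas = nn.ParameterList(
                  [nn.Parameter(torch.eye(i + 2)[-1])
                   for i in range(len(blocks))]
              )

          def forward(self, x):
              # Every entry in `history` stays alive until the end of the
              # forward pass -- the memory gotcha in question.
              history = [x]
              for block, alpha in zip(self.blocks, self.alphas):
                  history.append(block(history[-1]))
                  # The weighted average becomes the next block's input.
                  history[-1] = sum(w * h for w, h in zip(alpha, history))
              return history[-1]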
___________________________________________________________________
(page generated 2024-03-22 23:00 UTC)