[HN Gopher] Any Deep ReLU Network Is Shallow
       ___________________________________________________________________
        
       Any Deep ReLU Network Is Shallow
        
       Author : headalgorithm
       Score  : 38 points
       Date   : 2023-06-23 19:25 UTC (3 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | [deleted]
        
       | jmole wrote:
       | this is an interesting attempt at explainability, but that's all.
       | 
        | The backward pass of a "shallowed" deep network would likely
        | result in a useless network with today's training methods.
        
       | kixiQu wrote:
       | The benefits here are not really to do with implementing
       | inference, but rather the improvement in explainability that a
       | shallow network can provide. Skip to section 5 (page 8) for that.
        
         | throwawaymaths wrote:
          | If anything, inference will probably get worse. Typically in
          | ML, constraining your architecture (which is what you're doing
          | when you make a deep narrow network out of a shallow wide one,
          | or RNNs or GRUs/LSTMs, or transformers out of FCNNs) yields
          | inference and training performance benefits.
        
       | foota wrote:
       | I'm not an expert, but I wonder if the wide network could then be
       | trimmed down for special purposes? E.g., find the parts of the
       | network needed for a certain task by looking at activations on
       | inputs and dump the rest.
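        | 
        | E.g., something in that spirit, as a rough numpy sketch (the
        | shapes and the keep-threshold are made up, not from the paper):
        | 
        |     import numpy as np
        | 
        |     rng = np.random.default_rng(0)
        |     W1, b1 = rng.normal(size=(512, 64)), rng.normal(size=512)  # wide hidden layer
        |     W2 = rng.normal(size=(10, 512))                            # output layer
        | 
        |     X_task = rng.normal(size=(1000, 64))   # inputs from the task of interest
        |     H = np.maximum(W1 @ X_task.T + b1[:, None], 0.0)
        | 
        |     # keep only the hidden units that fire strongly on this task's inputs
        |     mean_act = H.mean(axis=1)
        |     keep = mean_act > np.quantile(mean_act, 0.75)
        | 
        |     W1_small, b1_small = W1[keep], b1[keep]
        |     W2_small = W2[:, keep]
        |     print(f"kept {keep.sum()} of {len(keep)} hidden units")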
        
       | WithinReason wrote:
       | I thought it was known for a long time that any function can be
       | represented with a neural network with a single layer. It's an
       | almost trivial finding if you think about it: imagine you have a
       | steep step function that looks something like this: __/^^ with
       | the non-zero derivative in a small range, e.g. 0.000-0.001 (or
       | [?] if you like). Let's call this f. You can piece together any
       | function from these tiny pieces as [?] cf(ax+b)+d. You just need
       | an infinite number of them. This sum is a neural network with a
       | single hidden layer.
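        | 
        | For the ReLU version, a minimal numpy sketch (target function
        | and knot count chosen arbitrarily): a single hidden layer can
        | reproduce any 1-D piecewise-linear interpolant exactly, so more
        | knots means a better fit.
        | 
        |     import numpy as np
        | 
        |     def relu(z):
        |         return np.maximum(z, 0.0)
        | 
        |     # knots where the approximation is allowed to bend
        |     knots = np.linspace(0.0, 2 * np.pi, 50)
        |     target = np.sin(knots)
        |     slopes = np.diff(target) / np.diff(knots)
        | 
        |     # output weight of unit i is the change in slope at knot i, so
        |     # f(x) = f(x0) + sum_i c_i * relu(x - k_i) is the interpolant
        |     c = np.concatenate([[slopes[0]], np.diff(slopes)])
        |     bias = target[0]
        | 
        |     def shallow_net(x):
        |         hidden = relu(x[:, None] - knots[:-1][None, :])  # one ReLU per knot
        |         return hidden @ c + bias
        | 
        |     x = np.linspace(0.0, 2 * np.pi, 1000)
        |     print(np.max(np.abs(shallow_net(x) - np.sin(x))))  # shrinks as knots are added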
        
         | madmaxmcfly wrote:
         | Isn't that just like a Taylor series?
        
           | charliea0 wrote:
           | Yes - or Fourier series or other decomposition into a sum of
           | orthogonal basis functions.
        
         | throwawaymaths wrote:
          | This is the "universal approximation theorem".
         | 
         | https://cognitivemedium.com/magic_paper/assets/Hornik.pdf
         | 
         | > infinite number of [activations]
         | 
          | I don't think you need an infinite number of them; there is a
         | relationship between "how close you want to get" and "how many
         | of them you need".
        
         | extasia wrote:
         | Thanks for the explanation. I've heard this before but think I
         | get it now!
        
         | curiousllama wrote:
         | > Based on this proof, we provide an algorithm that, given a
         | deep ReLU network, finds the explicit weights of the
         | corresponding shallow network.
         | 
         | I'm out of date on the research, but I suspect the real value
         | here is the algorithm. Sounds like some version of this could
          | eventually help reduce inference time.
        
           | tripplyons wrote:
            | I haven't read the paper, but most of the time, in order to
            | achieve matching outputs with fewer layers, you will need
            | exponentially more neurons per layer.
            | 
            | If you have infinite parallelism I believe the shallow
            | network would be faster, but deep networks will use fewer
            | total operations and will be faster in practice.
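            | 
            | Back-of-the-envelope, taking that exponential blow-up as an
            | assumption (the widths and depths below are made up):
            | 
            |     # multiply-adds for one forward pass, input dim d, scalar output
            |     d, w, L = 64, 128, 6
            | 
            |     deep_ops = d * w + (L - 1) * w * w + w   # L hidden layers of width w
            |     W = w ** L                               # assumed width of a matching shallow net
            |     shallow_ops = d * W + W                  # one hidden layer of width W
            | 
            |     print(f"deep:    {deep_ops:.2e} multiply-adds")
            |     print(f"shallow: {shallow_ops:.2e} multiply-adds")
            | 
            | The shallow pass is only two matrix multiplies, so with
            | unlimited parallelism its latency is lower, but the total
            | work is vastly larger.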
        
           | idontpost wrote:
           | [dead]
        
           | WithinReason wrote:
           | I suspect the trade-off for a shallow network is an
           | exponential explosion in weights and computation.
        
             | throwawaymaths wrote:
             | I'm not sure it's exponential, but it is dramatic.
        
             | tdullien wrote:
             | Yep
        
         | amelius wrote:
         | This is like how you can write any boolean function as a
         | (large) sum of minterms.
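          | 
          | Concretely (a tiny sketch; the function below is arbitrary):
          | OR together one minterm per input row where the function is 1.
          | 
          |     from itertools import product
          | 
          |     def f(a, b, c):                   # any Boolean function
          |         return (a and not b) or c
          | 
          |     # minterms: the input rows on which f is 1
          |     minterms = [bits for bits in product([0, 1], repeat=3) if f(*bits)]
          | 
          |     def sum_of_minterms(a, b, c):
          |         x = (a, b, c)
          |         # each minterm is an AND matching exactly one row; OR them all
          |         return any(all(xi == mi for xi, mi in zip(x, m)) for m in minterms)
          | 
          |     assert all(bool(f(*bits)) == sum_of_minterms(*bits)
          |                for bits in product([0, 1], repeat=3))
          | 
          | Same catch as with the shallow net: the number of minterms can
          | grow exponentially in the number of inputs.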
        
         | wewxjfq wrote:
         | This is such a bullshit HN comment, lmao.
        
       | version_five wrote:
       | This was posted yesterday:
       | https://news.ycombinator.com/item?id=36430442
       | 
       | Actually I know because I also tried posting it but wasn't
       | allowed because it was a dupe (4 or 5 hours old) - instead I just
       | got taken to that post and was automatically deemed to have voted
       | for it.
        
       | superkuh wrote:
        | So any N-layer network with the relu _/ activation function can
        | be restated in terms of a 3-layer network, as long as the 3
        | layers can have infinitely many parameters and as long as you
        | have near-infinite computing power to convert the big ones. Cool
        | but not yet useful.
        
         | PartiallyTyped wrote:
         | > How well does a classic deep net architecture like AlexNet or
         | VGG19 classify on a standard dataset such as CIFAR-10 when its
         | "width"-- namely, number of channels in convolutional layers,
         | and number of nodes in fully-connected internal layers -- is
         | allowed to increase to infinity? Such questions have come to
         | the forefront in the quest to theoretically understand deep
         | learning and its mysteries about optimization and
         | generalization. They also connect deep learning to notions such
         | as Gaussian processes and kernels. A recent paper [Jacot et
         | al., 2018] introduced the Neural Tangent Kernel (NTK) which
         | captures the behavior of fully-connected deep nets in the
         | infinite width limit trained by gradient descent; this object
         | was implicit in some other recent papers. An attraction of such
         | ideas is that a pure kernel-based method is used to capture the
         | power of a fully-trained deep net of infinite width.
         | 
         | https://openreview.net/pdf?id=rkl4aESeUH,
         | https://github.com/google/neural-tangents
         | 
         | > It has long been known that a single-layer fully-connected
         | neural network with an i.i.d. prior over its parameters is
         | equivalent to a Gaussian process (GP), in the limit of infinite
         | network width.
         | 
         | https://arxiv.org/abs/1711.00165
         | 
         | And of course, one needs to look back at SVMs applying a kernel
         | function and separating with a line, which looks a lot like an
         | ANN with a single hidden layer followed by a linear mapping.
         | 
         | https://stats.stackexchange.com/questions/238635/kernel-meth...
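          | 
          | The GP claim in the second quote is easy to see numerically
          | (an illustrative sketch, not the neural-tangents API): over
          | random i.i.d. weights, the output of a single-hidden-layer
          | ReLU net looks more and more Gaussian as the width grows.
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          |     x = rng.normal(size=8)
          |     trials = 20000
          | 
          |     for width in (1, 4, 16, 256):
          |         samples = np.empty(trials)
          |         for t in range(trials):
          |             W = rng.normal(size=(width, 8))
          |             v = rng.normal(size=width)
          |             samples[t] = v @ np.maximum(W @ x, 0.0) / np.sqrt(width)
          |         z = (samples - samples.mean()) / samples.std()
          |         print(width, (z ** 4).mean() - 3.0)  # excess kurtosis -> 0 (Gaussian)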
        
         | visarga wrote:
         | > So any N-layer network with the relu _/ activation function
         | 
         | The paper talks only about models with additive operations
         | among activations. It doesn't say anything about more complex
         | networks like transformers.
         | 
          | In transformers there are multiplicative interactions between
          | activations inside the attention matrix; it is unclear whether
          | they can be approximated with just a 3-layer ReLU network, or
          | whether such a conversion would be practical at all.
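          | 
          | For reference, the multiplicative part in question, as a
          | bare-bones single-head attention in numpy (shapes made up):
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          |     T, d = 5, 16                         # sequence length, model width
          |     X = rng.normal(size=(T, d))          # activations from the previous layer
          |     Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
          | 
          |     Q, K, V = X @ Wq, X @ Wk, X @ Wv
          |     scores = Q @ K.T / np.sqrt(d)        # activations multiplied by activations
          |     weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
          |     weights /= weights.sum(axis=-1, keepdims=True)
          |     out = weights @ V
          | 
          | A plain ReLU MLP only multiplies activations by fixed weights,
          | which is why the purely additive analysis doesn't carry over
          | directly.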
        
       | sp332 wrote:
       | Nice! I have to admit I didn't follow all the math, but the
       | layers get really wide, right? You still need to describe all the
       | partitions between the pieces of the piecewise linear functions.
       | Does this save any computation at inference time?
        
         | polygamous_bat wrote:
          | My hunch is that you can quite easily find (adversarial)
          | counterexamples showing that in the worst case you will always
          | use more compute by shallow-ing. In my head it should be
          | related to the no-free-lunch theorem.
        
       ___________________________________________________________________
       (page generated 2023-06-23 23:00 UTC)