[HN Gopher] Any Deep ReLU Network Is Shallow
___________________________________________________________________
Any Deep ReLU Network Is Shallow
Author : headalgorithm
Score : 38 points
Date : 2023-06-23 19:25 UTC (3 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| [deleted]
| jmole wrote:
| this is an interesting attempt at explainability, but that's all.
|
| The backward pass of a "shallowed" deep network would likely
| result in a useless network with today's training methods.
| kixiQu wrote:
| The benefits here are not really to do with implementing
| inference, but rather the improvement in explainability that a
| shallow network can provide. Skip to section 5 (page 8) for that.
| throwawaymaths wrote:
| If anything, inference will probably get worse. Typically in ML,
| constraining your architecture (which is what you're doing when
| you make a deep narrow network out of a shallow wide network, or
| RNNs or GRUs/LSTMs, or transformers out of FCNNs) yields
| inference and training performance benefits.
| foota wrote:
| I'm not an expert, but I wonder if the wide network could then be
| trimmed down for special purposes? E.g., find the parts of the
| network needed for a certain task by looking at activations on
| inputs and dump the rest.
| WithinReason wrote:
| I thought it was known for a long time that any function can be
| represented with a neural network with a single hidden layer. It's an
| almost trivial finding if you think about it: imagine you have a
| steep step function that looks something like this: __/^^ with
| the non-zero derivative in a small range, e.g. 0.000-0.001 (or
| ε if you like). Let's call this f. You can piece together any
| function from these tiny pieces as Σ c·f(ax+b) + d. You just need
| an infinite number of them. This sum is a neural network with a
| single hidden layer.
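|
| A minimal NumPy sketch of that construction (illustrative only;
| the target function, grid, and step width are arbitrary choices).
| Each steep step below is a difference of two ReLUs, so the whole
| sum is a single hidden layer of ReLU units:
|
|   import numpy as np
|
|   def relu(x):
|       return np.maximum(x, 0.0)
|
|   # Steep step "__/^^": 0 below x0, 1 above x0 + eps, written as
|   # a difference of two ReLUs (two hidden units per step).
|   def step(x, x0, eps=1e-3):
|       return (relu(x - x0) - relu(x - x0 - eps)) / eps
|
|   f = np.sin                               # target to approximate
|   xs = np.linspace(0.0, 2 * np.pi, 1000)
|
|   grid = np.linspace(0.0, 2 * np.pi, 200)  # one step per grid point
|   heights = np.diff(f(grid))               # c: jump at each grid point
|   d = f(grid[0])                           # constant offset
|
|   approx = d + sum(c * step(xs, x0) for c, x0 in zip(heights, grid[1:]))
|   print(np.max(np.abs(approx - f(xs))))    # ~0.03 for this grid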
| madmaxmcfly wrote:
| Isn't that just like a Taylor series?
| charliea0 wrote:
| Yes - or Fourier series or other decomposition into a sum of
| orthogonal basis functions.
| throwawaymaths wrote:
| This is the "universal approximation theorem".
|
| https://cognitivemedium.com/magic_paper/assets/Hornik.pdf
|
| > infinite number of [activations]
|
| I don't think you need an infinite number of them; there is a
| relationship between "how close you want to get" and "how many
| of them you need".
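|
| Roughly: for an L-Lipschitz target on [0, 1], N evenly spaced
| steps of the kind described above give an approximation error on
| the order of L/N, so about L/ε hidden units suffice for accuracy
| ε (the standard piecewise-constant bound, not a figure from the
| paper).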
| extasia wrote:
| Thanks for the explanation. I've heard this before but think I
| get it now!
| curiousllama wrote:
| > Based on this proof, we provide an algorithm that, given a
| deep ReLU network, finds the explicit weights of the
| corresponding shallow network.
|
| I'm out of date on the research, but I suspect the real value
| here is the algorithm. Sounds like some version of this could
| eventually help reduce inference time.
| tripplyons wrote:
| I haven't read the paper, but most of the time, in order to
| achieve matching outputs with fewer layers, you will need
| exponentially more neurons per layer.
|
| If you have infinite parallelism I believe the shallow
| network would be faster, but deep networks use fewer total
| operations and will be faster in practice.
| idontpost wrote:
| [dead]
| WithinReason wrote:
| I suspect the trade-off for a shallow network is an
| exponential explosion in weights and computation.
| throwawaymaths wrote:
| I'm not sure it's exponential, but it is dramatic.
| tdullien wrote:
| Yep
| amelius wrote:
| This is like how you can write any boolean function as a
| (large) sum of minterms.
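| For example, XOR(a, b) written as a sum of minterms is
| (¬a AND b) OR (a AND ¬b): one product term for each input
| combination where the function is 1.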
| wewxjfq wrote:
| This is such a bullshit HN comment, lmao.
| version_five wrote:
| This was posted yesterday:
| https://news.ycombinator.com/item?id=36430442
|
| Actually I know because I also tried posting it but wasn't
| allowed because it was a dupe (4 or 5 hours old) - instead I just
| got taken to that post and was automatically deemed to have voted
| for it.
| superkuh wrote:
| So any N-layer network with the ReLU _/ activation function can
| be restated in terms of a 3-layer network, as long as the 3
| layers are allowed to be arbitrarily wide and you have
| near-infinite computing power to do the conversion for large
| networks. Cool but not yet useful.
| PartiallyTyped wrote:
| > How well does a classic deep net architecture like AlexNet or
| VGG19 classify on a standard dataset such as CIFAR-10 when its
| "width"-- namely, number of channels in convolutional layers,
| and number of nodes in fully-connected internal layers -- is
| allowed to increase to infinity? Such questions have come to
| the forefront in the quest to theoretically understand deep
| learning and its mysteries about optimization and
| generalization. They also connect deep learning to notions such
| as Gaussian processes and kernels. A recent paper [Jacot et
| al., 2018] introduced the Neural Tangent Kernel (NTK) which
| captures the behavior of fully-connected deep nets in the
| infinite width limit trained by gradient descent; this object
| was implicit in some other recent papers. An attraction of such
| ideas is that a pure kernel-based method is used to capture the
| power of a fully-trained deep net of infinite width.
|
| https://openreview.net/pdf?id=rkl4aESeUH,
| https://github.com/google/neural-tangents
|
| > It has long been known that a single-layer fully-connected
| neural network with an i.i.d. prior over its parameters is
| equivalent to a Gaussian process (GP), in the limit of infinite
| network width.
|
| https://arxiv.org/abs/1711.00165
|
| And of course, one needs to look back at SVMs applying a kernel
| function and separating with a line, which looks a lot like an
| ANN with a single hidden layer followed by a linear mapping.
|
| https://stats.stackexchange.com/questions/238635/kernel-meth...
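|
| A minimal sketch of what the linked neural-tangents library does
| (based on its documented stax API; the layer widths and toy
| inputs below are arbitrary): the infinite-width limit of the
| architecture is summarized by a closed-form kernel, and no
| network is actually trained.
|
|   from jax import random
|   from neural_tangents import stax
|
|   # Fully-connected ReLU architecture; kernel_fn describes its
|   # infinite-width limit.
|   init_fn, apply_fn, kernel_fn = stax.serial(
|       stax.Dense(512), stax.Relu(), stax.Dense(1)
|   )
|
|   key1, key2 = random.split(random.PRNGKey(0))
|   x1 = random.normal(key1, (3, 8))   # 3 toy inputs of dimension 8
|   x2 = random.normal(key2, (5, 8))   # 5 more
|
|   # NNGP kernel (Bayesian infinite-width net) and NTK (infinite-
|   # width net trained by gradient descent), both in closed form.
|   nngp = kernel_fn(x1, x2, 'nngp')
|   ntk = kernel_fn(x1, x2, 'ntk')
|   print(nngp.shape, ntk.shape)       # (3, 5) (3, 5)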
| visarga wrote:
| > So any N-layer network with the relu _/ activation function
|
| The paper talks only about models with additive operations
| among activations. It doesn't say anything about more complex
| networks like transformers.
|
| In transformers there are multiplicative interactions between
| activations inside the attention matrix; it is unclear whether
| these can be approximated with just a 3-layer ReLU network, or
| whether such a conversion would be practical at all.
| sp332 wrote:
| Nice! I have to admit I didn't follow all the math, but the
| layers get really wide, right? You still need to describe all the
| partitions between the pieces of the piecewise linear functions.
| Does this save any computation at inference time?
| polygamous_bat wrote:
| My hunch is that you can fairly easily construct (adversarial)
| counterexamples showing that in the worst case you will always
| use more compute by shallowing. In my head it should be related
| to the no-free-lunch theorem.
___________________________________________________________________
(page generated 2023-06-23 23:00 UTC)