[HN Gopher] Do wide and deep networks learn the same things?
       ___________________________________________________________________
        
       Do wide and deep networks learn the same things?
        
       Author : MindGods
       Score  : 82 points
       Date   : 2021-06-02 16:27 UTC (6 hours ago)
        
 (HTM) web link (ai.googleblog.com)
 (TXT) w3m dump (ai.googleblog.com)
        
       | joe_the_user wrote:
       | So, it seems like the "blocks" they're talking about are
       | basically redundancies, duplicated logic. It makes sense to me
        | that since they provide the same functionality, how or where
        | these duplicates exist doesn't matter. But I'm an amateur.
        
       | godelski wrote:
        | For more context: we have the universal approximation theorem
        | for neural nets, which basically says that a network with at
        | least 2 layers can approximate any continuous function
        | arbitrarily well if it is wide enough. So a lot of early
        | networks were really wide. Then VGG[0] came out and showed
        | that deep networks were very effective (along with other
        | papers; discoveries happen in parallel, like Leibniz and
        | Newton). Then you get ResNets[1] with skip connections, and
        | move forward to today. Today we've started looking more at
        | what networks are doing and where their biases lie. This is
        | because we're running into some roadblocks with CNNs vs
        | Transformers. They have different inductive biases. Vision
        | transformers still aren't defeating CNNs, but they are close,
        | and it is clear they learn different things. So we're seeing
        | more papers doing these types of analyses. ML will likely
        | never be fully interpretable, but we're getting better at
        | understanding. This is good because a lot of the time picking
        | your model and network architecture is more art than science
        | (especially when choosing hyperparameters).
       | 
       | [0] https://arxiv.org/abs/1409.1556
       | 
       | [1] https://arxiv.org/abs/1512.03385
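        | 
        | As a minimal sketch of the "wide enough" idea (my own toy
        | example, not from the post): fix random hidden weights in a
        | one-hidden-layer ReLU net and fit only the output layer; with
        | enough width the random features already span the target
        | almost exactly. The width and target function here are
        | arbitrary choices for illustration.
        | 
        |   import numpy as np
        | 
        |   rng = np.random.default_rng(0)
        |   width = 2000
        |   x = np.linspace(-3, 3, 500)[:, None]
        |   y = np.sin(2 * x) + 0.5 * np.cos(5 * x)  # target function
        | 
        |   W = rng.normal(size=(1, width))   # random input weights
        |   b = rng.normal(size=width)        # random biases
        |   H = np.maximum(x @ W + b, 0)      # hidden ReLU activations
        | 
        |   # Fit only the output layer by least squares; a wide H
        |   # already spans the target well.
        |   w_out, *_ = np.linalg.lstsq(H, y, rcond=None)
        |   print("max abs error:", np.abs(H @ w_out - y).max())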
        
         | dimatura wrote:
          | I would say AlexNet [1], rather than VGG, was the landmark
          | paper that got the computer vision community to pay attention
          | to deep learning, specifically by winning the 2012 ImageNet
          | competition by a large margin. Not that there weren't
          | successes before (specifically, deep nets had also been
          | getting traction in speech processing), and of course deep
          | learning itself is much older than AlexNet. But I think most
          | people, especially in vision, would say the 2012 ImageNet
          | competition was the watershed moment for DL. By current
          | standards AlexNet is not very deep, but it was definitely
          | "deeper" than the popular models of the time (which were
          | mostly not neural networks).
         | 
          | VGG is also super influential, of course -- it reinforced the
          | trend toward ever-deeper networks, which ResNet then took to
          | another level.
         | 
         | [1] https://papers.nips.cc/paper/4824-imagenet-classification-
         | wi...
        
         | joe_the_user wrote:
          | Deep networks have also been shown to be universal, just FYI.
        
       | sova wrote:
        | At first I thought this had something to do with the classic
        | "breadth vs. depth" notion in learning -- if you're preparing
        | for the MCAT, it is better to have breadth that covers all the
        | topics than depth in one or two particulars. But this is
        | actually just about the dimensions of the neural network used
        | to create representations. Naturally, one would expect a "sweet
        | spot" or series of "sweet spots."
       | 
       | From the paper at https://arxiv.org/pdf/2010.15327.pdf
       | 
        | > As the model gets wider or deeper, we see the emergence of a
        | distinctive block structure -- a considerable range of hidden
        | layers that have very high representation similarity (seen as a
        | yellow square on the heatmap). This block structure mostly
        | appears in the later layers (the last two stages) of the network.
       | 
       | I wonder if we could do similar analysis on the human brain and
       | find "high representational similarity" for people who do the
       | same task over and over again, such as play chess.
       | 
        | Also, I don't really know what sort of data they are analyzing
        | or looking at with these NNs -- maybe someone with better
        | scansion can let me know?
        
         | andersource wrote:
         | Haven't read thoroughly but it seems they are investigating
         | ResNet models [0] trained for image classification.
         | 
         | > We apply CKA to a family of ResNets of varying depths and
         | widths, trained on common benchmark datasets (CIFAR-10,
         | CIFAR-100 and ImageNet)
         | 
         | [0] https://arxiv.org/abs/1512.03385
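          | 
          | CKA here is centered kernel alignment. A minimal sketch of
          | the linear variant in numpy (my paraphrase of the formula,
          | not the authors' code); X and Y are activation matrices,
          | examples by features, from the two layers being compared:
          | 
          |   import numpy as np
          | 
          |   def linear_cka(X, Y):
          |       # Center each feature, then compare the cross-layer
          |       # covariance to the within-layer covariances.
          |       # Returns a similarity in [0, 1].
          |       X = X - X.mean(axis=0)
          |       Y = Y - Y.mean(axis=0)
          |       num = np.linalg.norm(X.T @ Y) ** 2   # Frobenius norm
          |       den = np.linalg.norm(X.T @ X) * np.linalg.norm(Y.T @ Y)
          |       return num / den
          | 
          | The paper's heatmaps evaluate this for every pair of layers
          | in a network; the yellow "block structure" is a contiguous
          | run of layers whose pairwise CKA is near 1.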
        
         | verdverm wrote:
          | IIRC, the human neocortex is only 6 layers deep, with some
          | interesting vertical connection structures, perhaps similar
          | to skip connections in NNs.
         | 
         | It would be interesting to see where the deep vs wide analysis
         | ends up when many problem types are used. Can a single network
         | be trained on multiple problems at once and perform well on
         | all?
        
           | ubercore wrote:
           | That sounds fascinating. When talking about something as
           | complex and interconnected as a human neocortex, what does
           | "only 6 layers deep" mean?
        
             | mattkrause wrote:
             | If you look at a slice of cortex under the microscope,
             | there appear to be six physical layers (like a cake), owing
             | to the different types, numbers, and arrangement of neurons
             | in each.
             | 
              | Canonically, the cortex is built out of columns, each of
              | which repeats the same motif. Within a cortical column,
              | signals enter a cortical region through Layer IV, 'ascend'
              | to other cortical areas via Layers II and III, and project
              | elsewhere in the brain via Layers V/VI. Layer I mostly
              | contains passing fibers going elsewhere. There are also
              | "horizontal" or lateral connections between and within
              | columns.
              | 
              | This is sort of an abstraction, though. It's often hard to
              | clearly delineate the boundary between Layers II and III.
              | Layer IV of primary visual cortex has many small sublayers
              | (e.g., 4C alpha), but it is very small in other areas.
        
           | Buttons840 wrote:
            | I'm not sure what skip connections are, but I think I have
            | a good guess.
           | 
           | I've wanted to try a neural network where the output of every
           | layer goes into every subsequent layer. Each layer would thus
           | provide a different perspective to subsequent layers.
           | 
           | Anyone know if this has been tried?
        
             | verdverm wrote:
              | Have a look at AlphaStar; it's a pretty interesting
              | network of networks that has some skip functionality.
             | 
             | The DeepMind lecture series on YouTube is pretty great.
             | 
              | You'd likely overdo it with skips everywhere, though: too
              | many connections to learn, and backprop through all of
              | them would likely be difficult.
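              | 
              | For reference, a skip connection in the ResNet sense just
              | adds a block's input back to its output, so the block
              | only has to learn a correction to the identity. A minimal
              | PyTorch sketch (sizes are arbitrary):
              | 
              |   import torch
              |   import torch.nn as nn
              | 
              |   class ResidualBlock(nn.Module):
              |       def __init__(self, dim):
              |           super().__init__()
              |           self.f = nn.Sequential(
              |               nn.Linear(dim, dim), nn.ReLU(),
              |               nn.Linear(dim, dim))
              | 
              |       def forward(self, x):
              |           return x + self.f(x)  # the "+ x" is the skip
              | 
              |   x = torch.randn(8, 64)
              |   print(ResidualBlock(64)(x).shape)  # torch.Size([8, 64])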
        
             | nomad225 wrote:
             | Densely Connected Convolutional Networks
             | (https://arxiv.org/abs/1608.06993) use the idea you're
             | talking about quite effectively.
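              | 
              | A minimal sketch of that dense pattern, simplified to
              | Linear layers rather than the paper's conv blocks: each
              | layer consumes the concatenation of the input and all
              | earlier layers' outputs.
              | 
              |   import torch
              |   import torch.nn as nn
              | 
              |   class DenseMLP(nn.Module):
              |       def __init__(self, dim, growth, n_layers):
              |           super().__init__()
              |           self.layers = nn.ModuleList([
              |               nn.Linear(dim + i * growth, growth)
              |               for i in range(n_layers)])
              | 
              |       def forward(self, x):
              |           feats = [x]  # running list of all outputs
              |           for layer in self.layers:
              |               z = torch.cat(feats, dim=1)
              |               feats.append(torch.relu(layer(z)))
              |           return torch.cat(feats, dim=1)
              | 
              |   out = DenseMLP(16, 8, 3)(torch.randn(4, 16))
              |   print(out.shape)  # (4, 16 + 3*8) = (4, 40)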
        
               | Buttons840 wrote:
               | Yeah. That seems like exactly what I was describing.
               | Thanks.
        
           | dimatura wrote:
            | There are a lot of differences between mainstream CNNs and
            | biological NNs, though. For one, inference in most CNNs is
           | just feed-forward, whereas in the brain the information flow
           | is a lot more complex, modulated by all sorts of feedback
           | loops as well as a dynamic state given by memory,
           | expectation, attention, etc. Biological neurons are also a
           | lot more complicated (and diverse) than the artificial ones.
           | So those six layers aren't really very comparable to six
           | layers in a typical modern CNN.
           | 
           | Of course, I'm talking about just the typical current day
           | CNN. There's a lot of ongoing work in recurrent neural nets,
           | memory for neural nets, attention (though the idea of
           | "attention" that is hot right now is quite simplified
           | compared to what we usually call attention), etc.
        
           | mattkrause wrote:
           | Each cortical _area_ has six layers, but most behaviors
           | require interactions between many cortical areas, so  "input"
           | passes through many more than six layers before it produces
           | an output.
           | 
            | Felleman and Van Essen is a classic paper on the organization
            | of the visual system. Figure 2 (p. 4) might give you a good
            | sense of how much of the brain it occupies, and Figure 4 (p.
            | 30) is the well-known "wiring diagram."
           | 
           | In the 30 years since that paper was written, we've found a
           | few more boxes and a lot more wires! We've also come to
            | appreciate that there are lots of recurrent loops. V1 -> V2
            | is one of the biggest connections in the brain, and V2 -> V1
            | is a near runner-up.
           | 
           | https://cogsci.ucsd.edu/~sereno/201/readings/04.03-MacaqueAr.
           | ..
        
             | verdverm wrote:
              | Indeed, there are many "horizontal" and recurrent
              | connections, and simply thinking of it as a 6-layer
              | feed-forward network is a gross oversimplification. It's
              | more like a complex network of complex networks... of the
              | spiking variety.
        
               | mattkrause wrote:
               | Yup, and that's just the "classical" synaptic
               | transmission.
               | 
                | Mixed in with that, there is also slower signaling via
                | neuromodulators (dopamine, norepinephrine, etc.), the
                | neuroendocrine system, and God only knows whatever the
               | astrocytes are doing. Every neuron has its own internal
               | dynamics too, over scales ranging from milliseconds
               | (channel inactivation) to hours or days (receptor
               | internalization).
               | 
               | There's even the possibility of "ephaptic coupling",
               | wherein the electric fields produced by some neurons
               | affect the activity of others, without making any sort of
               | direct contact. We've collected some of the stronger data
               | in favor of that possibility and yet I remain firmly in
               | denial because it would make the brain so absurdly
               | complicated.
        
       | rajansaini wrote:
        | Those are very interesting empirical results. This lecture
        | explains the deep vs. shallow tradeoff theoretically:
        | https://www.youtube.com/watch?v=qpuLxXrHQB4. He's an amazing
        | lecturer; wish I didn't need subtitles!
        | 
        | (If you're too lazy to watch: it turns out there exist
        | functions that deep networks compute compactly but that
        | shallow networks cannot approximate without exponentially
        | many more units)
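        | 
        | A minimal sketch of the standard depth-separation example (a
        | Telgarsky-style triangle wave; my numpy paraphrase, not taken
        | from the lecture). Each 2-ReLU "tent" layer doubles the number
        | of oscillations, so k layers give ~2^k linear pieces, while a
        | 1-hidden-layer ReLU net needs roughly one unit per kink:
        | 
        |   import numpy as np
        | 
        |   def tent(x):  # one layer: 2 ReLUs, maps [0,1] onto [0,1]
        |       relu = lambda z: np.maximum(z, 0)
        |       return 2 * relu(x) - 4 * relu(x - 0.5)
        | 
        |   x = np.linspace(0, 1, 10001)
        |   y = x
        |   for _ in range(8):  # depth 8 -> 2**8 linear pieces
        |       y = tent(y)
        | 
        |   # Count direction changes, i.e. kinks that a shallow net
        |   # must pay for unit by unit.
        |   slopes = np.sign(np.diff(y))
        |   print((np.diff(slopes) != 0).sum())  # 255 = 2**8 - 1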
        
       ___________________________________________________________________
       (page generated 2021-06-02 23:01 UTC)