[HN Gopher] Do vision transformers see like convolutional neural...
       ___________________________________________________________________
        
       Do vision transformers see like convolutional neural networks?
        
       Author : jonbaer
       Score  : 84 points
       Date   : 2021-08-25 15:36 UTC (7 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | kettleballroll wrote:
       | It's always nice to see big labs working more towards building an
       | understanding of things instead of just chasing SOTA. On the
        | other hand, I'm not sure there are many actionable findings in
        | here. I guess that's the trade-off with these things, though.
        
       | stared wrote:
        | I am much more interested in whether they fall for the same
        | tricks.
       | 
        | For example, whether it is easy to fool them with optical
        | illusions, such as innocent images that look racy at first
        | glance:
       | https://medium.com/@marekkcichy/does-ai-have-a-dirty-mind-to...
       | CW: Even though it does not contain a single explicit picture, it
        | might be considered NSFW (literally, as at first glance it looks
        | like nudity); full disclosure: I mentored the project.
        
         | PartiallyTyped wrote:
          | I suggest you take a look at Geometric Deep Learning, but the
          | gist is this: convolutions can be thought of as translation-
          | equivariant functions, and pooling operations as permutation-
          | invariant combinations. Both operate on a graph with as many
          | components as there are output locations, where each component
          | consists of the pixels the operation acts on. So there is
          | local information that is slowly combined through the layers,
          | and relative positioning can be decoded when the
          | representation is flattened into a dense 1d vector.
         | 
          | In contrast, attention mechanisms in transformers can be seen
          | as operating on a dense graph over the whole input (at least
          | in text; I haven't really worked with vision transformers,
          | but if an attention mechanism exists then it should be
          | similar), along with some positional encoding and a
          | neighborhood summary.
         | 
          | If they indeed can be thought of as stacking neighborhood
         | summaries along with attention mechanisms, then they shouldn't
         | fall for the same tricks since they have access to "global"
         | information instead of disconnected components.
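          | 
          | To make the contrast concrete, here is a minimal PyTorch
          | sketch (my own illustration; the sizes and module names are
          | arbitrary). The convolution only mixes a local neighborhood,
          | while a single self-attention layer over patch tokens mixes
          | information globally in one step:
          | 
          |     import torch
          |     import torch.nn as nn
          |     import torch.nn.functional as F
          |     
          |     x = torch.randn(1, 3, 224, 224)  # one RGB image
          |     
          |     # Convolution: each output depends only on a 3x3
          |     # neighborhood, and shifting the input shifts the
          |     # output accordingly (translation equivariance).
          |     conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)
          |     local_feats = conv(x)  # (1, 64, 224, 224)
          |     
          |     # ViT-style attention: embed 16x16 patches, then let
          |     # every patch attend to every other patch at once.
          |     embed = nn.Conv2d(3, 64, kernel_size=16, stride=16)
          |     tokens = embed(x).flatten(2).transpose(1, 2)
          |     # tokens: (1, 196, 64), i.e. 14 x 14 patches of dim 64
          |     
          |     wq, wk, wv = (nn.Linear(64, 64) for _ in range(3))
          |     q, k, v = wq(tokens), wk(tokens), wv(tokens)
          |     # scale by sqrt(dim) = 8 before the softmax
          |     attn = F.softmax(q @ k.transpose(-2, -1) / 8.0, dim=-1)
          |     global_feats = attn @ v  # every token mixes all tokens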
         | 
         | But take this reply with a grain of salt as I am still learning
         | about Geometric DL. If I misunderstood something, please
         | correct me.
        
           | [deleted]
        
           | dpflan wrote:
           | Thank you for bringing up GDL. I've been following its
           | developments, and a really great resource is this site:
           | https://geometricdeeplearning.com/
           | 
            | It contains links to the paper and lectures, and the keynote
            | by M. Bronstein is illuminating: it discusses the operations
            | on graphs that lead to equivalences with other network
            | topologies and designs, transformers among them.
           | 
           | > Keynote: https://www.youtube.com/watch?v=9cxhvQK9ALQ
           | 
           | How are you learning/studying GDL? Would you like someone
           | else to discuss/learn it with?
        
             | PartiallyTyped wrote:
             | I believe Bronstein is onto something huge here, with
              | massive implications. Illuminating is the best adjective to
              | describe it. I watched this keynote:
             | 
             | > https://www.youtube.com/watch?v=w6Pw4MOzMuo
             | 
              | Everything clicked into place, and I was given a new
              | language for seeing the world, one that ties everything
              | together, well beyond the way standard DL is taught:
             | 
             | > we do feature extraction using this function that
             | resembles the receptive fields of the visual cortex and
             | then we project the dense feature representation onto
             | multiple other vectors and pass that through stacked non-
              | linearities, and oh by the way we have a myriad of different,
             | seemingly disconnected, architectures that we are not sure
             | why they work, but we call it inductive bias.
             | 
             | > https://geometricdeeplearning.com/
             | 
              | That's my main source, along with the papers that led up
              | to the proto-book, so pretty much Bronstein's work plus
              | related papers found using `connectedpapers.com`. I don't
              | have an appropriate background, so I am grinding through
              | abstract algebra and geometric algebra, and will then go
              | into geometry and whatever my supervisor suggests I should
              | read. Sure, I would like to have other people to discuss
              | it with, but don't expect much just yet.
        
               | dpflan wrote:
               | I agree, this perspective is very interesting and tames
               | the zoo of architectures through mathematical
               | unification. It is indeed exciting!
               | 
               | Good luck with your studies/learning!
        
         | spullara wrote:
          | Since it is easy to fool people with optical illusions, I
          | doubt that you will be able to train a computer not to be
          | fooled by optical illusions.
        
           | stared wrote:
            | It fools humans only at first glance. A few seconds later we
            | make a correct assessment.
           | 
           | Typical CNNs miss this second stage.
        
             | heavyset_go wrote:
             | And the humor of the images comes from our initial
             | expectations and how different they are from our actual
             | understanding of what we're seeing.
        
         | _untom_ wrote:
         | Not sure if this is along the lines of what you're thinking,
         | but we tried looking at this a while ago:
         | https://arxiv.org/abs/2103.14586
        
         | codeflo wrote:
         | Those pictures are definitely NSFW when viewed at low res/from
         | far away, which is how coworkers typically see your monitor
         | contents. An argument that starts with "Well, technically" is
         | unlikely to carry much weight in a discussion with HR (and
         | probably rightfully so).
        
           | codeflo wrote:
           | Counter argument in addition to downvote, please?
        
           | gugagore wrote:
           | I think the scientific point here is that visual processing
           | is not a one-shot process. Tasked with object detection, some
           | scenes demand more careful processing and more computation.
           | 
           | Almost all neural network architectures process a given input
           | size in the same amount of time, and some applications and
           | datasets would benefit from an "anytime" approach, where the
           | output is gradually refined given more time.
           | 
           | I understand the point you are making, but it's kind of
           | irrelevant. The task is to produce an answer for the image at
           | the given resolution. It is an accident and coincidence that
           | the neural network produces an answer that is arguably
           | correct for a blurrier version of the image.
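            | 
            | As a toy sketch of that "anytime" idea, here is an early-
            | exit classifier in PyTorch (purely illustrative; the
            | architecture and names are made up). Every block emits a
            | prediction, so a caller can stop whenever its time budget
            | runs out and keep the latest, progressively refined answer:
            | 
            |     import torch
            |     import torch.nn as nn
            |     
            |     class AnytimeClassifier(nn.Module):
            |         def __init__(self, dim=64, n_cls=10, depth=4):
            |             super().__init__()
            |             self.stem = nn.Linear(784, dim)
            |             self.blocks = nn.ModuleList(
            |                 nn.Sequential(nn.Linear(dim, dim),
            |                               nn.ReLU())
            |                 for _ in range(depth))
            |             self.heads = nn.ModuleList(
            |                 nn.Linear(dim, n_cls)
            |                 for _ in range(depth))
            |     
            |         def forward(self, x, budget=None):
            |             h, outs = self.stem(x), []
            |             pairs = zip(self.blocks, self.heads)
            |             for i, (blk, head) in enumerate(pairs):
            |                 h = blk(h)
            |                 outs.append(head(h))  # guess per block
            |                 if budget and i + 1 >= budget:
            |                     break  # stop early when out of time
            |             return outs  # later entries are more refined
            |     
            |     model = AnytimeClassifier()
            |     x = torch.randn(8, 784)         # flattened 28x28 images
            |     quick = model(x, budget=1)[-1]  # coarse, cheap answer
            |     full = model(x, budget=4)[-1]   # full-depth answer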
        
       | omarhaneef wrote:
       | A useful HN feature would be small space to put in a summary,
       | like the abstract:
       | 
        | Convolutional neural networks (CNNs) have so far been the de
        | facto model for visual data. Recent work has shown that (Vision)
       | Transformer models (ViT) can achieve comparable or even superior
       | performance on image classification tasks. This raises a central
       | question: how are Vision Transformers solving these tasks?
       | 
       | Are they acting like convolutional networks, or learning entirely
       | different visual representations? Analyzing the internal
       | representation structure of ViTs and CNNs on image classification
       | benchmarks, we find striking differences between the two
       | architectures, such as ViT having more uniform representations
       | across all layers.
       | 
       | We explore how these differences arise, finding crucial roles
       | played by self-attention, which enables early aggregation of
       | global information, and ViT residual connections, which strongly
       | propagate features from lower to higher layers. We study the
       | ramifications for spatial localization, demonstrating ViTs
       | successfully preserve input spatial information, with noticeable
       | effects from different classification methods. Finally, we study
       | the effect of (pretraining) dataset scale on intermediate
       | features and transfer learning, and conclude with a discussion on
       | connections to new architectures such as the MLP-Mixer.
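        | 
        | For context, the representation analysis the abstract refers to
        | is usually done with a similarity metric such as linear centered
        | kernel alignment (CKA). A minimal NumPy sketch (the activations
        | below are random placeholders, purely for illustration):
        | 
        |     import numpy as np
        |     
        |     def linear_cka(x, y):
        |         """Linear CKA between (examples, features) matrices."""
        |         x = x - x.mean(axis=0)  # center each feature
        |         y = y - y.mean(axis=0)
        |         cross = np.linalg.norm(y.T @ x, "fro") ** 2
        |         return cross / (np.linalg.norm(x.T @ x, "fro")
        |                         * np.linalg.norm(y.T @ y, "fro"))
        |     
        |     # Compare a ViT block's pooled token features with a CNN
        |     # layer's pooled features over the same batch of images.
        |     vit_acts = np.random.randn(512, 768)   # (n_images, width)
        |     cnn_acts = np.random.randn(512, 256)   # (n_images, channels)
        |     print(linear_cka(vit_acts, cnn_acts))  # 1 = same structure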
        
         | 6gvONxR4sf7o wrote:
         | Is the idea to not make people click the links before
         | discussing?
        
         | p_j_w wrote:
         | This would be handy, but at the same time, I think I like that
         | not doing so encourages people to click the link and read more
         | of the article than they might otherwise.
        
         | codetrotter wrote:
         | > A useful HN feature would be small space to put in a summary,
         | like the abstract
         | 
          | Interestingly, HN does actually save accompanying text when
          | you submit a link; it just doesn't show that text on the
          | website.
         | 
         | https://news.ycombinator.com/item?id=28180298
         | 
          | Personally I like it the way it is. I think showing
          | accompanying text for links would make it too easy for anyone
          | posting a link to "force" everyone to read their comment on
          | it. Requiring comments to be posted separately in order to be
          | visible in the thread means that useful accompanying comments
          | can float to the top, a useless comment from the submitter
          | sinks to the bottom, and the submitted link can still be voted
          | on individually.
        
           | omarhaneef wrote:
           | I guess this is really in response to all the other responses
           | as well, but I thought the idea would be:
           | 
           | to help people decide if they want to click the link.
           | 
            | So the title alone may not be informative enough to let
            | people know whether they can understand the article, whether
            | they are interested in it, whether it is at the right
            | technical level, and so on.
           | 
           | I think you're right that it will be abused in many instances
           | and might not be worth it.
        
         | detaro wrote:
         | There is little point in duplicating the first thing you see on
         | the page on HN too.
        
       | mrfusion wrote:
       | It makes sense to me that attention would be hugely beneficial
       | for vision tasks. We use contextual clues every day to decide
       | what we're looking at.
        
         | sillysaurusx wrote:
         | It may make sense, but it also makes no sense. CNNs already
         | have full view of the entire input image. That's how
         | discriminators are able to discriminate in GANs.
         | 
         | We added attention and observed no benefits at all in our GAN
         | experiments.
        
       | mrfusion wrote:
        | Does Tesla use transformers for Autopilot?
        
         | sillysaurusx wrote:
         | Doubtful. The biggest downside of transformers for vision is
         | how ungodly long they take to produce results. Tesla has to
            | operate in real time.
        
           | gamegoblin wrote:
           | In Karpathy's recent AI day presentation he specifically
           | stated they use transformers.
           | 
           | But not on the raw camera input -- they use regnets for that.
           | The transformers come higher up the stack:
           | 
           | https://youtu.be/j0z4FweCy4M
           | 
           | Transformers mentioned on the slide at timestamp 1:00:18.
        
             | tandem5000 wrote:
              | They use the key-value lookup/routing mechanism from
              | Transformers to predict pixel-wise labels in bird's-eye
              | view (lane, car, obstacle, intersection, etc.). The
              | motivation is that some of the regions being predicted may
              | temporarily be occluded, so it helps to attend to remote
              | parts of the input images. That requires long-range
              | dependencies that depend heavily on the input itself (e.g.
              | on whether there is an occlusion), which is exactly where
              | the key-value mechanism excels. I'm not sure they even
              | process past camera frames at this point; they only mention
              | that later in the pipeline they have an LSTM-like NN
              | incorporating past camera frames (Schmidhuber will be
              | proud!!).
             | 
             | Edit: A random observation which just occurred to me is
             | that their predictions seem surprisingly temporally
             | unstable. Observe, for example, the lane layout wildly
              | changing while the car makes a left turn at the
             | intersection (https://youtu.be/j0z4FweCy4M?t=2608). You can
             | use the comma and period keys to step through the video
             | frame-by-frame.
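              | 
              | Roughly, the mechanism looks like the sketch below:
              | generic cross-attention with learned bird's-eye-view
              | queries (an illustration of the idea only, not Tesla's
              | actual code; all sizes and names are invented):
              | 
              |     import torch
              |     import torch.nn as nn
              |     import torch.nn.functional as F
              |     
              |     dim = 128
              |     # flattened image-space features (all cameras)
              |     img_tokens = torch.randn(1, 1600, dim)
              |     # one learned query per bird's-eye-view cell
              |     bev_queries = torch.randn(1, 400, dim)
              |     
              |     wq, wk, wv = (nn.Linear(dim, dim)
              |                   for _ in range(3))
              |     q = wq(bev_queries)  # where should this cell look?
              |     k = wk(img_tokens)   # keys from image content
              |     v = wv(img_tokens)   # values carrying image features
              |     
              |     # Each BEV cell attends over *all* image locations,
              |     # so an occluded region can pull information from
              |     # wherever in the images it happens to be visible.
              |     scores = q @ k.transpose(-2, -1) / dim ** 0.5
              |     attn = F.softmax(scores, dim=-1)  # (1, 400, 1600)
              |     bev_feats = attn @ v              # (1, 400, dim)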
        
             | sillysaurusx wrote:
             | Thank you!
        
         | dexter89_kp3 wrote:
          | They do. Karpathy spoke about it on Tesla AI Day. They use it
          | for transforming image space into a vector space.
         | 
         | See: https://youtu.be/j0z4FweCy4M (timestamp 54.40 onwards)
        
           | sillysaurusx wrote:
           | Can you provide a more specific timestamp? 54.40 doesn't seem
           | to mention anything about transformers, and "onwards" is two
           | hours.
           | 
           | I'd be really surprised if they use transformers due to how
           | computationally expensive they are for anything involving
           | vision.
           | 
           | EDIT: Found it. 1h:
           | https://www.youtube.com/watch?v=j0z4FweCy4M?t=1h
           | 
           | Fascinating. I guess transformers are efficient.
        
             | dexter89_kp3 wrote:
             | I gave the timestamp where they start talking about the
             | problem they are trying to solve using transformers. As you
              | said, it is around the one-hour mark.
        
         | [deleted]
        
       | mrfusion wrote:
        | Somewhat off topic, but does anyone know if folks are working on
        | combining vision and natural language in one model? I think that
        | could yield some interesting results.
        
         | abalaji wrote:
          | Yeah, there has definitely been work done in that space; these
          | are called multi-modal models.
          | 
          | Not sure if this is the latest work, but here are some results
          | from Google's AI Blog:
         | 
         | https://ai.googleblog.com/2017/06/multimodel-multi-task-mach...
        
           | im3w1l wrote:
           | What would be really cool is neural networks with routing.
           | Like circuit switching or packet switching. No idea how you
           | would train such a beast though.
           | 
              | Like imagine the vision part making a phone call to the
              | natural language part to ask it for help with something.
        
             | wittenator wrote:
              | Capsule networks have a routing algorithm, as far as I know.
        
             | TylerLives wrote:
             | Sounds like The Society of Mind -
             | https://en.m.wikipedia.org/wiki/Society_of_Mind
        
         | sillysaurusx wrote:
         | https://github.com/openai/CLIP
         | 
         | The results are quite interesting:
         | 
         | https://www.reddit.com/r/Art/comments/p866wv/deep_dive_meai_...
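          | 
          | A minimal usage sketch, roughly following the openai/CLIP
          | README (the image path and the prompts are placeholders):
          | 
          |     import torch
          |     import clip  # installed from the repo above
          |     from PIL import Image
          |     
          |     device = "cuda" if torch.cuda.is_available() else "cpu"
          |     model, preprocess = clip.load("ViT-B/32", device=device)
          |     
          |     image = preprocess(Image.open("photo.jpg"))
          |     image = image.unsqueeze(0).to(device)
          |     text = clip.tokenize(["a painting of a forest",
          |                           "a photo of a dog",
          |                           "a city at night"]).to(device)
          |     
          |     with torch.no_grad():
          |         # image-text similarity scores from the joint model
          |         logits_per_image, _ = model(image, text)
          |         probs = logits_per_image.softmax(dim=-1)
          |     
          |     print(probs)  # probability of each caption for the image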
        
           | codetrotter wrote:
            | And here is a short guide and a link to a Google Colab
           | notebook that anyone can use to create their own AI-powered
           | art using VQGAN+CLIP:
           | https://sourceful.us/doc/935/introduction-to-vqganclip
        
       ___________________________________________________________________
       (page generated 2021-08-25 23:01 UTC)