[HN Gopher] Do vision transformers see like convolutional neural...
___________________________________________________________________
Do vision transformers see like convolutional neural networks?
Author : jonbaer
Score : 84 points
Date : 2021-08-25 15:36 UTC (7 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| kettleballroll wrote:
| It's always nice to see big labs working more towards building
| an understanding of things instead of just chasing SOTA. On
| the other hand, I'm not sure there are many actionable
| findings in here. I guess that's the trade-off with these
| things, though...
| stared wrote:
| I am much more interested in whether they fall for the same
| tricks.
|
| For example, whether it is easy to fool them with optical
| illusions, such as innocent images that look racy at first
| glance:
| https://medium.com/@marekkcichy/does-ai-have-a-dirty-mind-to...
| CW: Even though it does not contain a single explicit picture,
| it might be considered NSFW (literally - at first glance it
| looks like nudity). Full disclosure: I mentored the project.
| PartiallyTyped wrote:
| I suggest you take a look at Geometric Deep Learning, but the
| gist is that convolutions can be thought of as translation-
| equivariant functions, and pooling operations as permutation-
| invariant aggregations. Both operate on a graph with as many
| components as there are output locations, where each component
| is composed of the pixels the operation acts on. Local
| information is thus slowly combined through the layers, and
| relative positioning can be decoded when the representation is
| transformed into a dense 1d vector, a.k.a. flattening.
|
| In contrast, attention mechanisms in transformers can be seen
| as operating on dense graphs over the whole input (at least
| in text; I haven't really worked with vision transformers,
| but if an attention mechanism exists there, it should be
| similar), along with some positional encoding and a
| neighborhood summary.
|
| If they can indeed be thought of as stacking neighborhood
| summaries along with attention mechanisms, then they shouldn't
| fall for the same tricks, since they have access to "global"
| information instead of disconnected components.
|
| But take this reply with a grain of salt as I am still learning
| about Geometric DL. If I misunderstood something, please
| correct me.
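|
| A minimal PyTorch sketch of the equivariance claim above (my
| own illustration, not from the GDL materials; circular padding
| is chosen so the identity holds exactly under torch.roll):
|
|     import torch
|     import torch.nn.functional as F
|
|     # Translation equivariance: conv(shift(x)) == shift(conv(x)).
|     x = torch.randn(1, 1, 16, 16)  # a random single-channel "image"
|     w = torch.randn(1, 1, 3, 3)    # a random 3x3 filter
|
|     def conv(t):
|         # Circular padding keeps the identity exact at the borders.
|         return F.conv2d(F.pad(t, (1, 1, 1, 1), mode="circular"), w)
|
|     def shift(t):
|         return torch.roll(t, shifts=(2, 3), dims=(-2, -1))
|
|     print(torch.allclose(conv(shift(x)), shift(conv(x)), atol=1e-5))
|     # True: the convolution commutes with translations.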
| [deleted]
| dpflan wrote:
| Thank you for bringing up GDL. I've been following its
| developments, and a really great resource is this site:
| https://geometricdeeplearning.com/
|
| It contains links to the paper and lectures, and the keynote
| by M. Bronstein is illuminating: it discusses the operations
| on graphs that lead to equivalences with other network
| topologies and designs (transformer equivalence, and more).
|
| > Keynote: https://www.youtube.com/watch?v=9cxhvQK9ALQ
|
| How are you learning/studying GDL? Would you like someone
| to discuss/learn it with?
| PartiallyTyped wrote:
| I believe Bronstein is onto something huge here, with
| massive implications. Illuminating is the best adjective to
| describe it. When I watched this keynote:
|
| > https://www.youtube.com/watch?v=w6Pw4MOzMuo
|
| everything clicked into place. It gave me a new language
| for seeing the world, one that ties everything together,
| well beyond the way standard DL is taught:
|
| > we do feature extraction using this function that
| resembles the receptive fields of the visual cortex, then we
| project the dense feature representation onto multiple other
| vectors and pass that through stacked non-linearities; and
| oh, by the way, we have a myriad of different, seemingly
| disconnected architectures that we are not sure why they
| work, but we call it inductive bias.
|
| > https://geometricdeeplearning.com/
|
| That's my main source, along with the papers that led up
| to the proto-book, so pretty much Bronstein's work along
| with related papers found using `connectedpapers.com`. I
| don't have an appropriate background, so I am grinding
| through abstract algebra and geometric algebra, and will
| then go into geometry and whatever my supervisor suggests
| I should read. Sure, I would like other people to discuss
| it with, but don't expect much just yet.
| dpflan wrote:
| I agree, this perspective is very interesting and tames
| the zoo of architectures through mathematical
| unification. It is indeed exciting!
|
| Good luck with your studies/learning!
| spullara wrote:
| Since it is easy to fool people with optical illusions, I
| doubt that you will be able to train a computer not to be
| fooled by optical illusions.
| stared wrote:
| It fools humans only at first glance. A few seconds later
| we make a correct assessment.
|
| Typical CNNs miss this second stage.
| heavyset_go wrote:
| And the humor of the images comes from our initial
| expectations and how different they are from our actual
| understanding of what we're seeing.
| _untom_ wrote:
| Not sure if this is along the lines of what you're thinking,
| but we tried looking at this a while ago:
| https://arxiv.org/abs/2103.14586
| codeflo wrote:
| Those pictures are definitely NSFW when viewed at low res/from
| far away, which is how coworkers typically see your monitor
| contents. An argument that starts with "Well, technically" is
| unlikely to carry much weight in a discussion with HR (and
| probably rightfully so).
| codeflo wrote:
| A counterargument in addition to the downvote, please?
| gugagore wrote:
| I think the scientific point here is that visual processing
| is not a one-shot process. Tasked with object detection, some
| scenes demand more careful processing and more computation.
|
| Almost all neural network architectures process a given input
| size in the same amount of time, and some applications and
| datasets would benefit from an "anytime" approach, where the
| output is gradually refined given more time.
|
| I understand the point you are making, but it's kind of
| irrelevant. The task is to produce an answer for the image at
| the given resolution. It is an accident and coincidence that
| the neural network produces an answer that is arguably
| correct for a blurrier version of the image.
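|
| One common way to get the "anytime" behavior described above
| is early-exit heads: attach a classifier after every block so
| inference can stop at any depth. A minimal sketch (all names
| and sizes are illustrative):
|
|     import torch
|     import torch.nn as nn
|
|     class AnytimeNet(nn.Module):
|         def __init__(self, dim=64, n_classes=10, n_blocks=4):
|             super().__init__()
|             self.blocks = nn.ModuleList(
|                 [nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
|                  for _ in range(n_blocks)])
|             self.heads = nn.ModuleList(
|                 [nn.Linear(dim, n_classes) for _ in range(n_blocks)])
|
|         def forward(self, x, budget=None):
|             preds = []
|             for block, head in zip(self.blocks, self.heads):
|                 x = block(x)
|                 preds.append(head(x))  # a usable answer after every block
|                 if budget is not None and len(preds) >= budget:
|                     break              # out of time: return what we have
|             return preds
|
|     net = AnytimeNet()
|     coarse = net(torch.randn(1, 64), budget=1)[-1]  # fast, rough answer
|     refined = net(torch.randn(1, 64))[-1]           # full-depth answer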
| omarhaneef wrote:
| A useful HN feature would be a small space to put in a
| summary, like the abstract:
|
| Convolutional neural networks (CNNs) have so far been the de-
| facto model for visual data. Recent work has shown that (Vision)
| Transformer models (ViT) can achieve comparable or even superior
| performance on image classification tasks. This raises a central
| question: how are Vision Transformers solving these tasks?
|
| Are they acting like convolutional networks, or learning entirely
| different visual representations? Analyzing the internal
| representation structure of ViTs and CNNs on image classification
| benchmarks, we find striking differences between the two
| architectures, such as ViT having more uniform representations
| across all layers.
|
| We explore how these differences arise, finding crucial roles
| played by self-attention, which enables early aggregation of
| global information, and ViT residual connections, which strongly
| propagate features from lower to higher layers. We study the
| ramifications for spatial localization, demonstrating ViTs
| successfully preserve input spatial information, with noticeable
| effects from different classification methods. Finally, we study
| the effect of (pretraining) dataset scale on intermediate
| features and transfer learning, and conclude with a discussion on
| connections to new architectures such as the MLP-Mixer.
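|
| For the curious: the paper's layer-by-layer representation
| comparisons use centered kernel alignment (CKA). A minimal
| linear-CKA sketch (my own, not the paper's code):
|
|     import torch
|
|     def linear_cka(x, y):
|         # x: (n, d1), y: (n, d2) activations for the same n inputs.
|         x = x - x.mean(dim=0)  # center each feature
|         y = y - y.mean(dim=0)
|         dot = (y.T @ x).norm(p="fro") ** 2  # ||Y^T X||_F^2
|         return dot / ((x.T @ x).norm(p="fro") * (y.T @ y).norm(p="fro"))
|
|     a = torch.randn(512, 64)
|     print(linear_cka(a, a))                     # ~1.0: identical reps
|     print(linear_cka(a, torch.randn(512, 32)))  # near 0: unrelated reps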
| 6gvONxR4sf7o wrote:
| Is the idea to not make people click the links before
| discussing?
| p_j_w wrote:
| This would be handy, but at the same time, I think I like that
| not doing so encourages people to click the link and read more
| of the article than they might otherwise.
| codetrotter wrote:
| > A useful HN feature would be small space to put in a summary,
| like the abstract
|
| Interestingly, HN does actually save accompanying text
| submitted along with a link. It just doesn't show the text
| on the website.
|
| https://news.ycombinator.com/item?id=28180298
|
| Personally I like it the way that it is. I think showing an
| accompanying text for links would allow too much for anyone
| posting a link to "force" everyone to read their comment on it.
| Leaving it so that comments must be posted separately in order
| to be visible in the thread makes it so that useful
| accompanying comments can float to the top, whereas a useless
| comment from the submitter sinks to the bottom while still
| allowing the submitted link to be voted on individually.
| omarhaneef wrote:
| I guess this is really in response to all the other responses
| as well, but I thought the idea would be to help people
| decide whether they want to click the link.
|
| The title may not be sufficiently informative to let people
| know whether they can understand the article, whether they
| are interested in it, whether it is at the right technical
| level, and so on.
|
| I think you're right that it would be abused in many
| instances and might not be worth it.
| detaro wrote:
| There is little point in duplicating on HN the first thing
| you see on the linked page anyway.
| mrfusion wrote:
| It makes sense to me that attention would be hugely beneficial
| for vision tasks. We use contextual clues every day to decide
| what we're looking at.
| sillysaurusx wrote:
| It may make sense, but it also makes no sense. CNNs already
| have full view of the entire input image. That's how
| discriminators are able to discriminate in GANs.
|
| We added attention and observed no benefits at all in our GAN
| experiments.
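|
| For context, "adding attention" to a GAN usually means a
| SAGAN-style self-attention layer between conv blocks; a rough
| sketch (an illustration, not the experiment described above):
|
|     import torch
|     import torch.nn as nn
|
|     class SelfAttention2d(nn.Module):
|         def __init__(self, channels):
|             super().__init__()
|             self.q = nn.Conv2d(channels, channels // 8, 1)
|             self.k = nn.Conv2d(channels, channels // 8, 1)
|             self.v = nn.Conv2d(channels, channels, 1)
|             self.gamma = nn.Parameter(torch.zeros(1))  # starts as a no-op
|
|         def forward(self, x):
|             b, c, h, w = x.shape
|             q = self.q(x).flatten(2).transpose(1, 2)  # (b, hw, c//8)
|             k = self.k(x).flatten(2)                  # (b, c//8, hw)
|             v = self.v(x).flatten(2)                  # (b, c, hw)
|             attn = torch.softmax(q @ k, dim=-1)  # each pixel attends to all
|             out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
|             return x + self.gamma * out          # residual connection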
| mrfusion wrote:
| Does Tesla use transformers for Autopilot?
| sillysaurusx wrote:
| Doubtful. The biggest downside of transformers for vision is
| how ungodly long they take to produce results. Tesla has to
| operate in real time.
| gamegoblin wrote:
| In Karpathy's recent AI day presentation he specifically
| stated they use transformers.
|
| But not on the raw camera input -- they use regnets for that.
| The transformers come higher up the stack:
|
| https://youtu.be/j0z4FweCy4M
|
| Transformers are mentioned on the slide at timestamp 1:00:18.
| tandem5000 wrote:
| They use the key-value lookup/routing mechanism from
| Transformers to predict pixel-wise labels in bird's-eye view
| (lane, car, obstacle, intersection, etc.). The motivation is
| that some of the predicted regions may temporarily be
| occluded, so for predicting these occluded areas it may be
| particularly helpful to attend to remote regions of the
| input images. That requires long-range dependencies which
| depend highly on the input itself (e.g. on whether there is
| an occlusion), which is exactly where the key-value
| mechanism excels. I'm not sure they even process past camera
| frames at this point; they only mention that later in the
| pipeline they have an LSTM-like NN incorporating past camera
| frames (Schmidhuber will be proud!!).
|
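| A hypothetical sketch of that query/key/value pattern, with
| learned bird's-eye-view (BEV) queries attending over flattened
| camera features (all names and sizes here are invented for
| illustration):
|
|     import torch
|     import torch.nn as nn
|
|     d, n_bev, n_img = 64, 32 * 32, 4 * 100  # illustrative sizes
|
|     bev_queries = nn.Parameter(torch.randn(1, n_bev, d))  # one per BEV cell
|     img_feats = torch.randn(1, n_img, d)  # flattened multi-camera features
|
|     attn = nn.MultiheadAttention(embed_dim=d, num_heads=8,
|                                  batch_first=True)
|     bev_out, _ = attn(
|         bev_queries,  # queries: where in the BEV we want labels
|         img_feats,    # keys:    where to look in the images
|         img_feats,    # values:  what to copy from there
|     )
|     # bev_out would then be decoded into per-cell labels
|     # (lane, car, obstacle, ...).
|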
| Edit: A random observation that just occurred to me is
| that their predictions seem surprisingly temporally
| unstable. Observe, for example, the lane layout wildly
| changing while the car makes a left turn at the
| intersection (https://youtu.be/j0z4FweCy4M?t=2608). You can
| use the comma and period keys to step through the video
| frame-by-frame.
| sillysaurusx wrote:
| Thank you!
| dexter89_kp3 wrote:
| They do. Karpathy spoke about it on Tesla AI Day. They use
| it for transforming image space to a vector space.
|
| See: https://youtu.be/j0z4FweCy4M (timestamp 54.40 onwards)
| sillysaurusx wrote:
| Can you provide a more specific timestamp? 54.40 doesn't seem
| to mention anything about transformers, and "onwards" is two
| hours.
|
| I'd be really surprised if they use transformers due to how
| computationally expensive they are for anything involving
| vision.
|
| EDIT: Found it. 1h:
| https://www.youtube.com/watch?v=j0z4FweCy4M?t=1h
|
| Fascinating. I guess transformers are efficient.
| dexter89_kp3 wrote:
| I gave the timestamp where they start talking about the
| problem they are trying to solve using transformers. As you
| said, it is around the 1hr mark.
| [deleted]
| mrfusion wrote:
| Off-topic, sort of, but does anyone know if folks are working
| on combining vision and natural language in one model? I
| think that could yield some interesting results.
| abalaji wrote:
| Yeah, there has definitely been work done in that space;
| they're called multi-modal models.
|
| Not sure if this is the latest work, but here are some
| results from Google's AI Blog:
|
| https://ai.googleblog.com/2017/06/multimodel-multi-task-mach...
| im3w1l wrote:
| What would be really cool is neural networks with routing,
| like circuit switching or packet switching. No idea how you
| would train such a beast, though.
|
| Imagine the vision part making a phone call to the natural
| language part to ask it for help with something.
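|
| Soft versions of this routing do exist: mixture-of-experts
| layers learn a gate that sends each input to expert sub-
| networks, and the whole thing trains end-to-end because the
| gate's softmax weights the experts. A minimal sketch
| (illustrative names and sizes):
|
|     import torch
|     import torch.nn as nn
|
|     class Router(nn.Module):
|         def __init__(self, dim=32, n_experts=4):
|             super().__init__()
|             self.gate = nn.Linear(dim, n_experts)
|             self.experts = nn.ModuleList(
|                 [nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
|                  for _ in range(n_experts)])
|
|         def forward(self, x):
|             w = torch.softmax(self.gate(x), dim=-1)  # (batch, n_experts)
|             outs = torch.stack([e(x) for e in self.experts], dim=1)
|             return (w.unsqueeze(-1) * outs).sum(dim=1)  # soft routing
|
|     y = Router()(torch.randn(8, 32))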
| wittenator wrote:
| Capsule networks have a routing algorithm, as far as I know.
| TylerLives wrote:
| Sounds like The Society of Mind -
| https://en.m.wikipedia.org/wiki/Society_of_Mind
| sillysaurusx wrote:
| https://github.com/openai/CLIP
|
| The results are quite interesting:
|
| https://www.reddit.com/r/Art/comments/p866wv/deep_dive_meai_...
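|
| The repo's basic usage, adapted from its README (zero-shot
| scoring of captions against an image; the filename is a
| placeholder):
|
|     import torch
|     import clip
|     from PIL import Image
|
|     model, preprocess = clip.load("ViT-B/32", device="cpu")
|
|     image = preprocess(Image.open("photo.png")).unsqueeze(0)
|     text = clip.tokenize(["a diagram", "a dog", "a cat"])
|
|     with torch.no_grad():
|         logits_per_image, _ = model(image, text)
|         probs = logits_per_image.softmax(dim=-1)
|
|     print(probs)  # how well each caption matches the image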
| codetrotter wrote:
| And here is a short guide and a link to a Google Colab
| notebook that anyone can use to create their own AI-powered
| art using VQGAN+CLIP:
| https://sourceful.us/doc/935/introduction-to-vqganclip
___________________________________________________________________
(page generated 2021-08-25 23:01 UTC)