[HN Gopher] A Visual Guide to Vision Transformers
       ___________________________________________________________________
        
       A Visual Guide to Vision Transformers
        
       Author : md2rp
       Score  : 180 points
       Date   : 2024-04-16 14:00 UTC (9 hours ago)
        
 (HTM) web link (blog.mdturp.ch)
 (TXT) w3m dump (blog.mdturp.ch)
        
       | md2rp wrote:
       | A Visual Guide to Vision Transformers. This is a visual guide to
       | Vision Transformers (ViTs), a class of deep learning models that
       | have achieved state-of-the-art performance on image
       | classification tasks. Vision Transformers apply the transformer
       | architecture, originally designed for natural language processing
       | (NLP), to image data. This guide will walk you through the key
       | components of Vision Transformers in a scroll story format, using
       | visualizations and simple explanations to help you understand how
       | these models work and what the flow of data through the model
       | looks like.
        
         | bArray wrote:
         | Nice! A small piece of feedback: I would have the dimensions
         | mentioned in the text also annotated on the diagram. It wasn't
         | exactly clear how the input data was flattened for example.
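
To make the flattening step bArray asks about concrete, here is a minimal sketch in plain Python. It assumes the usual ViT patching scheme: an H x W x C image is cut into non-overlapping P x P patches, and each patch is flattened into a vector of length P * P * C. The function name `image_to_patches`, the row-major pixel order inside a patch, and the toy 4x4 image are illustrative assumptions, not taken from the guide.

```python
def image_to_patches(image, patch_size):
    """Split an H x W x C nested-list image into flattened patches.

    Returns (H / P) * (W / P) vectors, each of length P * P * C.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    flat.extend(image[row][col])  # append this pixel's channels
            patches.append(flat)
    return patches


# Hypothetical 4x4 image with 3 channels; each pixel stores its own (row, col, 0).
image = [[[r, c, 0] for c in range(4)] for r in range(4)]
patches = image_to_patches(image, patch_size=2)
print(len(patches), len(patches[0]))  # 4 patches, each of length 2 * 2 * 3 = 12
```

With a 2x2 patch size the 4x4x3 image becomes a sequence of 4 vectors of length 12, which is the shape a diagram annotation would need to show.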
        
           | byteknight wrote:
           | I would also add, as a 100% math idiot, that linear
           | transformations, and how the model performs them, are not
           | explained.
           | 
           | It's entirely plausible this is intended for someone more
           | "mathematical" than myself, but I appreciate the work regardless.
        
             | md2rp wrote:
             | Thanks for the feedback! I left it out intentionally but
             | probably worth thinking about doing a more fundamental
             | guide!
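
For readers with the same question as byteknight, a linear transformation in this context is a matrix multiplication plus a bias, y = xW + b, which maps a vector of one length to a vector of another length. The sketch below is plain Python with made-up numbers; the name `linear` and the values of `W` and `b` are illustrative assumptions (in a trained model the weights are learned), not something taken from the guide.

```python
def linear(x, weights, bias):
    """Compute y = x @ W + b for a vector x (length d_in) and W (d_in x d_out)."""
    d_out = len(bias)
    return [
        sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
        for j in range(d_out)
    ]


# Hypothetical example: project a 3-dimensional input down to 2 dimensions.
x = [1.0, 2.0, 3.0]
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
b = [0.5, -0.5]
y = linear(x, W, b)
print(y)  # [4.5, 4.5]
```

Each output entry is a weighted sum of all input entries, so the same operation can turn a flattened image patch into an embedding vector of whatever size the model uses.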
        
           | md2rp wrote:
           | Thanks for the feedback! Will add it in the revision!
        
       | challenger-derp wrote:
       | Very nice. I wish I could do this sort of scroll story in my
       | digital notes. Is this done with a JavaScript library?
        
         | md2rp wrote:
         | Yes, this was done with a combination of GSAP ScrollTrigger
         | (https://gsap.com/docs/v3/Plugins/ScrollTrigger/) and D3
         | (https://d3js.org/).
        
           | TuringTest wrote:
           | That kind of scroll is OK-ish for a background parallax
           | effect, or maybe some pretty fade-in/out effects while
           | elements scroll into view (without changing their relative
           | position in the page).
           | 
           | When it interferes with the main functionality of the page,
           | namely reading the content, it breaks accessibility,
           | distracts from understanding a difficult topic, makes the
           | content brittle against changes in the platform (different
           | browsers or future standard updates), and, as others pointed
           | out, makes it difficult or impossible to use alternative
           | presentations.
           | 
           | With most comments addressing the presentation rather than
           | the content, I think it's clear that it detracts from the
           | experience more than it helps.
        
       | tantalor wrote:
       | Stop scrollytelling! It's awful, nobody should do this.
        
         | 4chandaily wrote:
         | Agreed. My scroll wheel should scroll the page, not advance
         | slides or split birds or whatever else. If you need to do this
         | kind of information display, use buttons or a UI widget to
         | control it. Don't hijack the HID devices I use for accessibly
         | operating my computer.
         | 
         | This goes for Scroll Wheels, Scrollbars, the Back Button, the
         | Right Click Button, or any other standard input paradigm.
         | (please) Don't fuck with these! Some of us make use of
         | accessibility features, and messing with our interfaces makes
         | these break or behave in unexpected ways.
        
         | layer8 wrote:
         | This. You can't use reader mode, you can't save the page as a
         | PDF, you can't use PageUp/PageDown because you'll miss some
         | in-between state, and the scroll position where a certain image
         | is shown may not be the preferred one for reading the
         | corresponding text. And the JS will invariably break sooner or
         | later.
        
         | elicash wrote:
         | I'd be annoyed if my bank did this, or airlines, or anything
         | where I just need to get a task done.
         | 
         | For personal websites, I actually think individuality and fun
         | and creativity are good.
        
         | observationist wrote:
         | It's aggressively inaccessible. I don't know if it's an "I'm a
         | web designer, I know better" thing or what.
         | 
         | Web designers: Don't let form interfere with function. The
         | function of this page is to communicate information about
         | transformers. The form effectively prevents that from
         | happening. Don't do it. No, bad, stop.
        
       | SpaceManNabs wrote:
       | Lucas Beyer has a lot of references and material as well that I
       | recommend.
        
       | causal wrote:
       | I like this, but think there is some crucial motivation missing
       | in steps 10.1-10.3 regarding what query/key weights are and why
       | they're needed.
        
         | ThouYS wrote:
         | yes, same issue in all transformer tutorials
        
           | causal wrote:
           | The 2b1b video was the first to make it click for me
        
             | hotdogscout wrote:
             | You mean 3b1b (three blue one brown)?
        
               | causal wrote:
               | Ah that's right, miscounted the blues
        
           | lordswork wrote:
           | I suspect this is because most people (including people
           | writing these tutorials) don't have a strong grasp of this
           | piece either.
        
         | vikiomega9 wrote:
         | This post made sense to me:
         | https://teltam.github.io/posts/soft-dictionary-keys.html
         | 
         | It helps to think of kqv as a form of lookup.
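
The lookup view can be sketched in a few lines of plain Python. A normal dictionary returns the value whose key matches the query exactly; scaled dot-product attention instead scores the query against every key, turns the scores into weights with a softmax, and returns a weighted mix of all values. In a transformer the queries, keys, and values come from learned linear projections of the token embeddings; this sketch skips those weights and uses hand-picked vectors, and the names `softmax` and `soft_lookup` are illustrative assumptions.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_lookup(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, output


# Hypothetical keys and values; the query points strongly at the first key.
keys = [[10.0, 0.0], [0.0, 10.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
weights, output = soft_lookup([10.0, 0.0], keys, values)
print(weights, output)
```

Because the query lines up with the first key, nearly all of the weight lands on the first value, which is the "soft" counterpart of an exact dictionary hit. A query that sits between the two keys would return a blend of both values instead.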
        
       | lyapunova wrote:
       | To be honest, I actually really like the visual delivery here.
       | It's especially helpful for understanding what's going on with
       | computer vision problems. Please make more!
        
       ___________________________________________________________________
       (page generated 2024-04-16 23:01 UTC)