[HN Gopher] Don't Look Twice: Faster Video Transformers with Run...
       ___________________________________________________________________
        
       Don't Look Twice: Faster Video Transformers with Run-Length
       Tokenization
        
       Author : jasondavies
       Score  : 35 points
       Date   : 2024-11-16 00:11 UTC (22 hours ago)
        
 (HTM) web link (rccchoudhury.github.io)
 (TXT) w3m dump (rccchoudhury.github.io)
        
       | smusamashah wrote:
        | Isn't this like Differential Transformers, which work based
        | on differences?
        
         | Lerc wrote:
          | That was my feeling too, for the most part, but the run
          | length is itself a significant source of information, and
          | if it lets tokens be skipped, the model is essentially
          | gaining performance by working with a smaller but denser
          | form of the same information. My instinct is that run-
          | length is just the most basic case of a more generalized
          | method of storing token information across time and area,
          | so that the density of information per token is more even:
          | area and duration would be variable, but the token stream
          | would carry a series of tokens holding similar amounts of
          | semantic data.
         | 
          | I feel like this is very much like the early days of data
          | compression, where a few logical but somewhat ad-hoc
          | principles were investigated in advance of a more
          | sophisticated theory that integrates what is being
          | attempted, how to measure success, and how to recognize
          | pathways toward the optimal solution.
         | 
         | These papers are the foundations of that work.
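          | 
          | To make the run-length idea concrete, here is a minimal
          | sketch over patch tokens. The per-patch max-difference
          | test, `tol`, and the function name are my own illustrative
          | guesses, not the paper's actual implementation:
          | 
          |     import numpy as np
          |     
          |     def run_length_tokenize(frames, patch=16, tol=1e-3):
          |         """Drop patch tokens that repeat across frames.
          |     
          |         frames: (T, H, W). A patch that (nearly) matches
          |         the token that opened its current run extends that
          |         run instead of emitting a new token."""
          |         T, H, W = frames.shape
          |         ph, pw = H // patch, W // patch
          |         # (T, ph*pw, patch*patch) patch tokens
          |         toks = (frames[:, :ph * patch, :pw * patch]
          |                 .reshape(T, ph, patch, pw, patch)
          |                 .transpose(0, 1, 3, 2, 4)
          |                 .reshape(T, ph * pw, patch * patch))
          |         tokens, lengths, positions = [], [], []
          |         last = {}  # patch index -> index of its open run
          |         for t in range(T):
          |             for p in range(ph * pw):
          |                 run = last.get(p)
          |                 same = run is not None and np.abs(
          |                     toks[t, p] - tokens[run]).max() < tol
          |                 if same:
          |                     lengths[run] += 1  # extend the run
          |                 else:
          |                     last[p] = len(tokens)  # new run
          |                     tokens.append(toks[t, p])
          |                     lengths.append(1)
          |                     positions.append((t, p))
          |         return (np.stack(tokens), np.array(lengths),
          |                 positions)
          | 
          | On a static-camera clip most patches collapse into long
          | runs, so the token count the transformer actually attends
          | over shrinks roughly in proportion to how still the scene
          | is.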
        
         | ImageXav wrote:
          | As far as I can tell, though the core idea is the same
          | (focus on the differences), the implementation differs.
          | Differential transformers 'calculate attention scores as
          | the difference between two separate softmax attention
          | maps', so they must still process the redundant areas.
          | Run-length tokenization removes those tokens altogether,
          | which would significantly reduce compute. Very neat idea.
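          | 
          | For contrast, the quoted mechanism looks roughly like this
          | toy single-head sketch (shapes and `lam`, standing in for
          | the paper's learnable lambda, are illustrative):
          | 
          |     import numpy as np
          |     
          |     def softmax(x, axis=-1):
          |         e = np.exp(x - x.max(axis=axis, keepdims=True))
          |         return e / e.sum(axis=axis, keepdims=True)
          |     
          |     def diff_attention(q1, k1, q2, k2, v, lam=0.8):
          |         """Score map is the difference of two softmax
          |         attention maps, so common-mode 'noise' attention
          |         cancels. q*, k*: (n, d); v: (n, dv)."""
          |         d = q1.shape[-1]
          |         a1 = softmax(q1 @ k1.T / np.sqrt(d))
          |         a2 = softmax(q2 @ k2.T / np.sqrt(d))
          |         return (a1 - lam * a2) @ v
          | 
          | Note that both maps are still full n x n over all tokens,
          | redundant ones included, which is the contrast with
          | dropping tokens up front.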
         | 
          | However, I do think that background information can
          | sometimes be important. I reckon a mild improvement on
          | this model would be to keep the background in the first
          | frame, and perhaps refresh it every x frames, so that the
          | model gets better context cues; see the keyframe sketch
          | below. This would also more closely mirror how video
          | compression uses keyframes.
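          | 
          | As a hypothetical sketch on top of the run_length_tokenize
          | sketch upthread, the keyframe idea only changes the run-
          | extension test (keyframe_every is my own illustrative
          | parameter):
          | 
          |     import numpy as np
          |     
          |     def rlt_keyframes(frames, patch=16, tol=1e-3,
          |                       keyframe_every=30):
          |         """Like the earlier sketch, but every patch
          |         re-emits a token on keyframe boundaries,
          |         refreshing background context the way I-frames
          |         do in a compressed stream."""
          |         T, H, W = frames.shape
          |         ph, pw = H // patch, W // patch
          |         toks = (frames[:, :ph * patch, :pw * patch]
          |                 .reshape(T, ph, patch, pw, patch)
          |                 .transpose(0, 1, 3, 2, 4)
          |                 .reshape(T, ph * pw, patch * patch))
          |         tokens, lengths, last = [], [], {}
          |         for t in range(T):
          |             refresh = t % keyframe_every == 0  # I-frame
          |             for p in range(ph * pw):
          |                 run = last.get(p)
          |                 same = (not refresh and run is not None
          |                         and np.abs(toks[t, p]
          |                             - tokens[run]).max() < tol)
          |                 if same:
          |                     lengths[run] += 1
          |                 else:
          |                     last[p] = len(tokens)
          |                     tokens.append(toks[t, p])
          |                     lengths.append(1)
          |         return np.stack(tokens), np.array(lengths)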
        
       | robbiemitchell wrote:
       | For training, would it be useful to stabilize the footage first?
        
         | nairoz wrote:
          | I would guess yes. From my experience in video processing,
          | it is always better to stabilize when you can, because it
          | significantly reduces the number of unique tokens, which
          | would be even more useful for the present method. However,
          | you probably lose some generalization performance, and not
          | all videos can be stabilized.
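          | 
          | A crude sketch of why that helps, using pure-numpy phase
          | correlation against the first frame (real stabilizers also
          | model rotation and zoom; stabilize_to_first is my own
          | illustrative name):
          | 
          |     import numpy as np
          |     
          |     def stabilize_to_first(frames):
          |         """Estimate each frame's integer (dy, dx) shift
          |         relative to frame 0 via FFT phase correlation,
          |         then undo it, so static background patches line
          |         up again across frames."""
          |         ref = np.fft.fft2(frames[0])
          |         out = [frames[0]]
          |         for f in frames[1:]:
          |             F = np.fft.fft2(f)
          |             cross = ref * np.conj(F)
          |             corr = np.fft.ifft2(
          |                 cross / (np.abs(cross) + 1e-8)).real
          |             dy, dx = np.unravel_index(np.argmax(corr),
          |                                       corr.shape)
          |             # wrap shifts into the signed range
          |             if dy > corr.shape[0] // 2:
          |                 dy -= corr.shape[0]
          |             if dx > corr.shape[1] // 2:
          |                 dx -= corr.shape[1]
          |             out.append(np.roll(f, (dy, dx), axis=(0, 1)))
          |         return np.stack(out)
          | 
          | Feeding the stabilized frames into the earlier run-length
          | sketch should make the drop in unique tokens directly
          | measurable on a panning clip.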
        
         | FatalLogic wrote:
         | Stabilization appears to be a subset of a literally wider, but
         | more rewarding, challenge: reconstructing the whole area that
         | is scanned by the camera. It could be better to work on that
         | challenge, not on simple stabilization.
         | 
          | That's similar to how the human visual system 'paints' a
          | coherent scene from quite a narrow high-resolution field
          | of view, using educated guesses and assumptions.
        
       ___________________________________________________________________
       (page generated 2024-11-16 23:00 UTC)