[HN Gopher] Don't Look Twice: Faster Video Transformers with Run...
___________________________________________________________________
Don't Look Twice: Faster Video Transformers with Run-Length
Tokenization
Author : jasondavies
Score : 35 points
Date : 2024-11-16 00:11 UTC (22 hours ago)
(HTM) web link (rccchoudhury.github.io)
(TXT) w3m dump (rccchoudhury.github.io)
| smusamashah wrote:
| Isn't this like Differential Transformers, which work based on
| differences?
| Lerc wrote:
| That was my feeling too for the most part, but the run length is
| a significant source of information in itself: if it enables
| tokens to be skipped, it essentially gains performance by working
| with a smaller but denser form of the same information. My
| instinct is that run-length would be just the most basic case of
| a more generalized method for storing token information across
| both time and area, one that evens out the density of information
| per token: the area and duration would be variable, but the token
| stream would contain a series of tokens carrying similar
| quantities of semantic data.
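|
| As a minimal sketch of that basic case (the function name and the
| similarity threshold are my own, not from the paper): compare
| each spatial patch against the start of its current run, emit one
| token per run, and record how long the run lasted.
|
|   import numpy as np
|
|   def run_length_tokenize(tokens, threshold=0.1):
|       """tokens: (T, P, D) patch embeddings over T frames.
|       Returns one token per run plus the run lengths."""
|       T, P, D = tokens.shape
|       kept, lengths = [], []
|       for p in range(P):  # each spatial patch position
|           start = 0
|           for t in range(1, T + 1):
|               # close the run when the patch changes, or at the end
|               if t == T or np.linalg.norm(
|                       tokens[t, p] - tokens[start, p]) > threshold:
|                   kept.append(tokens[start, p])
|                   lengths.append(t - start)
|                   start = t
|       return np.stack(kept), np.array(lengths)
|
| The lengths can then be fed back in (e.g. as an extra encoding on
| each kept token) so the model still knows how long each token
| persisted, while attention cost shrinks with the number of runs.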
|
| I feel like this is very much like the early days of data
| compression, where a few logical but kind of ad-hoc principles
| are being investigated in advance of a more sophisticated
| theory that integrates what is being attempted, how to identify
| success, and how to recognize pathways that move towards the
| optimal solution.
|
| These papers are the foundations of that work.
| ImageXav wrote:
| As far as I can tell, though the core idea is the same (focus on
| the differences), the implementation is different. Differential
| Transformers 'calculate attention scores as the difference
| between two separate softmax attention maps', so they must still
| process the redundant areas. Run-length tokenization removes
| those tokens altogether, which would significantly reduce
| compute. Very neat idea.
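|
| For reference, that difference-of-softmaxes step looks roughly
| like this (shapes simplified; in the actual paper lambda is
| learned and re-parameterized per layer):
|
|   import numpy as np
|
|   def softmax(x):
|       e = np.exp(x - x.max(axis=-1, keepdims=True))
|       return e / e.sum(axis=-1, keepdims=True)
|
|   def diff_attention(x, Wq1, Wk1, Wq2, Wk2, Wv, lam=0.5):
|       # Two attention maps over the FULL token set; subtracting
|       # them cancels common-mode attention noise, but every
|       # (redundant) token is still processed.
|       d = Wq1.shape[1]
|       a1 = softmax((x @ Wq1) @ (x @ Wk1).T / np.sqrt(d))
|       a2 = softmax((x @ Wq2) @ (x @ Wk2).T / np.sqrt(d))
|       return (a1 - lam * a2) @ (x @ Wv)
|
| Run-length tokenization instead shortens the sequence itself, so
| the quadratic attention cost drops before any attention is done.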
|
| However, I do think that background information can sometimes be
| important. I reckon a mild improvement on this model would be to
| keep the background in the first frame, and perhaps every x
| frames after that, so that the model gets better context cues.
| This would also more closely replicate how video compression
| handles keyframes.
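|
| Hypothetically, that could be as simple as forcing the keep-mask
| open every K frames, I-frame style (keyframe_interval is my own
| knob, nothing from the paper):
|
|   import numpy as np
|
|   def keep_mask(tokens, threshold=0.1, keyframe_interval=8):
|       """Boolean mask over (T, P) tokens: keep changed patches,
|       plus every patch at periodic 'keyframes' for context."""
|       T, P, D = tokens.shape
|       diffs = np.linalg.norm(np.diff(tokens, axis=0), axis=-1)
|       mask = np.ones((T, P), dtype=bool)
|       mask[1:] = diffs > threshold           # only changed patches
|       mask[::keyframe_interval] = True       # full keyframes
|       return mask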
| robbiemitchell wrote:
| For training, would it be useful to stabilize the footage first?
| nairoz wrote:
| I guess yes. Having worked on video processing, it's always
| better if you can stabilize, because it significantly reduces
| the number of unique tokens; that would be even more useful for
| the present method. However, you probably lose some
| generalization performance, and not all videos can be
| stabilized.
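|
| For what it's worth, a rough pre-processing pass along those
| lines with OpenCV (untested sketch; real stabilizers smooth the
| camera trajectory instead of locking everything to frame 0):
|
|   import cv2
|
|   def stabilize(frames):
|       """Warp each frame onto the first one so that static
|       background patches stay put across time."""
|       ref = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
|       pts = cv2.goodFeaturesToTrack(ref, maxCorners=200,
|                                     qualityLevel=0.01,
|                                     minDistance=30)
|       out = [frames[0]]
|       for f in frames[1:]:
|           gray = cv2.cvtColor(f, cv2.COLOR_BGR2GRAY)
|           nxt, status, _ = cv2.calcOpticalFlowPyrLK(ref, gray,
|                                                     pts, None)
|           ok = status.ravel() == 1
|           m, _ = cv2.estimateAffinePartial2D(nxt[ok], pts[ok])
|           h, w = ref.shape
|           out.append(cv2.warpAffine(f, m, (w, h)))
|       return out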
| FatalLogic wrote:
| Stabilization appears to be a subset of a literally wider, but
| more rewarding, challenge: reconstructing the whole area that is
| scanned by the camera. It could be better to work on that
| challenge rather than on simple stabilization.
|
| That's similar to how the human visual system 'paints' a
| coherent scene from quite a narrow field of high-resolution
| view, filling in the rest with educated guesses and assumptions.
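|
| OpenCV's stitcher gives a taste of that reconstruction for
| mostly-panning footage (the frame filenames here are made up;
| video with parallax is a much harder problem):
|
|   import cv2
|
|   # Stitch sampled frames into one mosaic of the scanned area.
|   frames = [cv2.imread(f"frame_{i:04d}.png")
|             for i in range(0, 300, 30)]
|   stitcher = cv2.Stitcher.create(cv2.Stitcher_SCANS)
|   status, mosaic = stitcher.stitch(frames)
|   if status == cv2.Stitcher_OK:
|       cv2.imwrite("mosaic.png", mosaic)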
___________________________________________________________________
(page generated 2024-11-16 23:00 UTC)