[HN Gopher] Vid2Seq: A pretrained visual language model for desc...
       ___________________________________________________________________
        
       Vid2Seq: A pretrained visual language model for describing multi-
       event videos
        
       Author : og_kalu
       Score  : 47 points
       Date   : 2023-03-17 19:24 UTC (3 hours ago)
        
 (HTM) web link (ai.googleblog.com)
 (TXT) w3m dump (ai.googleblog.com)
        
       | netdur wrote:
        | one of those cool Google projects that doesn't take off because
        | it's so hard to run and integrate with other cool things...
        
         | philjohn wrote:
          | If we take the less cynical view, it'll end up in YouTube as
          | audio description for people with visual impairments.
        
         | TaylorAlexander wrote:
         | This isn't really a "google project" in the way I think about
         | that term, but it's a research project. Google's research is
         | constantly advancing and when things get far enough along they
         | do tend to get used in production. Individual research papers
         | are just a step along the way. This research seems useful for
          | training video generation systems built on transformers, and
          | especially multimodal systems. Imagine you have a robot that
         | needs to understand the world around it. It needs to interpret
         | text input (likely as voice) but it also needs to understand
         | complex scenes around it. If you can get a system to accurately
         | describe YouTube videos (a captive data set) then it should
         | also be able to understand a live video feed on a robot. That's
         | an important part of a robot. But it is not in itself a product
         | or notable project.
        
       | sdwr wrote:
       | Had this idea about 5 years ago, seems like it might be viable
       | now, would love to see a video analyzer that creates relationship
       | graphs using sentiment analysis. Ex. who responds to whom, what
       | their tone is, how often. Hopefully with modern methods, it
       | wouldn't take too many examples to pick out the dimensionality of
       | expression (cheating out vs in, loudness + arousal, whining on
       | the spectrum of playful to hurt, confidence, clarity)
       | 
       | Could test against TV shows and see if it gets an understanding
       | of the social dynamics.
       | 
        | Plus it could uncover a lot of the editing technique; I forget
        | what the term is, when they create context from unrelated scenes
        | by cutting from one to the other.
       | 
       | Would also pick up the general plot formula pretty quickly by
       | mapping out the relative intensity and direction (action, tense,
       | playful, romantic) of scenes.
       | 
        | I remember reading about a startup that did this, or something
        | similar, for TV shows and movies a while back in the New Yorker;
        | the idea was that they could predict how well a show would do
        | from a pilot or even the script.
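        | 
        | A minimal sketch of the kind of relationship graph I mean (the
        | class, names, and tone scores here are all made up for
        | illustration; any sentiment analyzer could supply the scores):
        | 
        |   from collections import defaultdict
        | 
        |   class RelationshipGraph:
        |       def __init__(self):
        |           # (speaker, addressee) -> list of tone scores in [-1, 1]
        |           self.edges = defaultdict(list)
        | 
        |       def add_exchange(self, speaker, addressee, tone):
        |           """Record one 'speaker responds to addressee' event."""
        |           self.edges[(speaker, addressee)].append(tone)
        | 
        |       def summary(self):
        |           """Who responds to whom, how often, and with what
        |           average tone."""
        |           return {pair: (len(t), sum(t) / len(t))
        |                   for pair, t in self.edges.items()}
        | 
        |   g = RelationshipGraph()
        |   g.add_exchange("Alice", "Bob", 0.6)
        |   g.add_exchange("Bob", "Alice", -0.2)
        |   g.add_exchange("Alice", "Bob", 0.4)
        |   print(g.summary())
        |   # {('Alice', 'Bob'): (2, 0.5), ('Bob', 'Alice'): (1, -0.2)}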
        
         | mdswanson wrote:
         | Some of this: https://vi.microsoft.com/en-us
        
         | groestl wrote:
         | > when they create context from unrelated scenes by cutting
         | from one to the other.
         | 
         | Do you mean juxtaposition?
        
       | og_kalu wrote:
        | A language model augmented with special time tokens, letting it
        | predict event boundaries and textual descriptions jointly in the
        | same output sequence.
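        | 
        | A minimal sketch of the idea, not the actual Vid2Seq code: the
        | 100-bin quantization, the <time_k> token names, the helper
        | functions, and the example events are all assumptions for
        | illustration.
        | 
        |   N_TIME_BINS = 100  # assumed: timestamps quantized into relative bins
        | 
        |   def time_token(seconds: float, duration: float) -> str:
        |       """Map an absolute timestamp to a discrete time token."""
        |       idx = min(int(seconds / duration * N_TIME_BINS), N_TIME_BINS - 1)
        |       return f"<time_{idx}>"
        | 
        |   def serialize_events(events, duration):
        |       """One target sequence: start token, end token, caption,
        |       for each event in temporal order."""
        |       parts = []
        |       for start, end, caption in sorted(events):
        |           parts.append(f"{time_token(start, duration)} "
        |                        f"{time_token(end, duration)} {caption}")
        |       return " ".join(parts)
        | 
        |   events = [(12.0, 30.5, "a chef chops onions"),
        |             (95.0, 140.0, "the dish is plated and served")]
        |   print(serialize_events(events, duration=200.0))
        |   # <time_6> <time_15> a chef chops onions <time_47> <time_70>
        |   # the dish is plated and served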
        
       | zone411 wrote:
        | I believe this type of video captioning will be able to fill the
        | gaps in LLMs' common-sense knowledge and understanding of the
        | world and its physics. It should also be useful for robots.
        
       ___________________________________________________________________
       (page generated 2023-03-17 23:00 UTC)