[HN Gopher] Vid2Seq: A pretrained visual language model for
describing multi-event videos
___________________________________________________________________
Vid2Seq: A pretrained visual language model for describing multi-
event videos
Author : og_kalu
Score : 47 points
Date : 2023-03-17 19:24 UTC (3 hours ago)
(HTM) web link (ai.googleblog.com)
(TXT) w3m dump (ai.googleblog.com)
| netdur wrote:
| One of those cool Google projects that won't take off because
| it's so hard to run and integrate with other cool things...
| philjohn wrote:
| Taking the less cynical view, it'll end up in YouTube as audio
| descriptions for people with visual impairments.
| TaylorAlexander wrote:
| This isn't really a "google project" in the way I think about
| that term; it's a research project. Google's research is
| constantly advancing, and when things get far enough along they
| do tend to get used in production. Individual research papers
| are just a step along the way. This research seems useful for
| training video generation systems like transformers, and
| especially multimodal systems. Imagine you have a robot that
| needs to understand the world around it. It needs to interpret
| text input (likely as voice), but it also needs to understand
| complex scenes around it. If you can get a system to accurately
| describe YouTube videos (a captive data set), then it should
| also be able to understand a live video feed on a robot. That's
| an important capability for a robot, but it is not in itself a
| product or a notable project.
| sdwr wrote:
| Had this idea about 5 years ago, and it seems like it might be
| viable now: I'd love to see a video analyzer that builds
| relationship graphs using sentiment analysis, e.g. who responds
| to whom, what their tone is, and how often (sketch below).
| Hopefully with modern methods it wouldn't take too many
| examples to pick out the dimensionality of expression (cheating
| out vs. in, loudness + arousal, whining on the spectrum of
| playful to hurt, confidence, clarity).
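|
| Roughly, assuming you already had diarized utterances with
| per-line tone scores from a sentiment model (just a sketch;
| the speaker names and scores here are hypothetical), the
| who-responds-to-whom part is simple edge bookkeeping:
|
|   from collections import defaultdict
|
|   # Hypothetical input: utterances in screen order, each with
|   # a speaker label and a tone score in [-1, 1].
|   utterances = [
|       {"speaker": "alice", "tone": 0.6},
|       {"speaker": "bob",   "tone": -0.3},
|       {"speaker": "alice", "tone": 0.1},
|   ]
|
|   # Directed edge (responder -> previous speaker): count
|   # replies and accumulate tone, so each edge carries both
|   # frequency and average mood.
|   counts = defaultdict(int)
|   tones = defaultdict(float)
|   for prev, cur in zip(utterances, utterances[1:]):
|       if cur["speaker"] != prev["speaker"]:
|           edge = (cur["speaker"], prev["speaker"])
|           counts[edge] += 1
|           tones[edge] += cur["tone"]
|
|   for edge, n in counts.items():
|       print(edge, "replies:", n, "avg tone:", tones[edge] / n)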
|
| Could test against TV shows and see if it gets an understanding
| of the social dynamics.
|
| Plus it could uncover a lot of editing technique. I forget what
| the term is, where they create context from unrelated scenes by
| cutting from one to the other.
|
| Would also pick up the general plot formula pretty quickly by
| mapping out the relative intensity and direction (action, tense,
| playful, romantic) of scenes.
|
| I remember reading in the New Yorker a while back about a
| startup that did this, or something similar, for TV shows and
| movies; the idea was that they could predict how well a show
| would do from a pilot or even the script.
| mdswanson wrote:
| Some of this: https://vi.microsoft.com/en-us
| groestl wrote:
| > when they create context from unrelated scenes by cutting
| from one to the other.
|
| Do you mean juxtaposition?
| og_kalu wrote:
| It's a language model augmented with special time tokens, so it
| predicts event boundaries and textual descriptions in the same
| output sequence.
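|
| Something like this, as a sketch of the output format (not the
| actual Vid2Seq code; the bin count and token names here are
| assumptions): timestamps get quantized into special time
| tokens that are interleaved with the caption words, so a
| single decoder pass emits boundaries and text together.
|
|   N_BINS = 100  # assumed number of relative time bins
|
|   def time_token(t_sec, video_len_sec):
|       """Map a timestamp to one of N_BINS special tokens."""
|       b = min(int(t_sec / video_len_sec * N_BINS), N_BINS - 1)
|       return f"<time_{b}>"
|
|   def target_sequence(events, video_len_sec):
|       """events: list of (start_sec, end_sec, caption)."""
|       parts = []
|       for start, end, caption in events:
|           parts += [time_token(start, video_len_sec),
|                     time_token(end, video_len_sec),
|                     caption]
|       return " ".join(parts)
|
|   print(target_sequence(
|       [(0.0, 12.5, "a man opens a laptop"),
|        (12.5, 40.0, "he types and talks to the camera")],
|       video_len_sec=60.0))
|   # <time_0> <time_20> a man opens a laptop <time_20> ...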
| zone411 wrote:
| I believe this type of video captioning will be able to fill
| the gaps in LLMs' common-sense knowledge and understanding of
| the world and its physics. It should also be useful for robots.
___________________________________________________________________
(page generated 2023-03-17 23:00 UTC)