[HN Gopher] Show HN: Search inside YouTube videos using natural ...
___________________________________________________________________
Show HN: Search inside YouTube videos using natural language
queries
Author : vladoh
Score : 246 points
Date : 2021-02-12 21:29 UTC (1 day ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| mncharity wrote:
| Years ago, when I prototyped an orders-of-magnitude physical-
| properties playground web app, I found the development bottleneck
| was searching for video clips and images. For example, find clips
| showing the heartbeat of a
| {goldfish,mouse,cat,dog,child,adult,horse,elephant,whale}, for a
| <kg - body mass - metabolic/heart rate> association (metabolic
| rate scales with body mass). Jiggling an oom kg scale, maybe
| you're shown a cat or mouse, and maybe a mouse heart patter, and
| whale's slow swish. Providing a massive (sorry) hidden
| curriculum. They all exist on youtube, and might be fair use for
| OER content, but finding them was not plausible. And still isn't,
| even with commercial use of stock videos.
|
| In the 1950's, the first Powers of Ten zoom book was hand drawn
| from books over years. Around 1980, a PoT film and book could use
| photos, but still good people made mistakes. Now creating a PoT
| zoom book can be homework. A video a school project. An XR a
| professional project. Technology, media search, acquisition, and
| handling costs, throttles science education content.
|
| Necessary but not sufficient, of course. The first book might
| have been imagined and created earlier, but wasn't. 1950's
| astronomy textbooks needn't have had the color of the Sun wrong,
| and the same now in 2020. Though, anticipated difficulty of
| creation does throttle imagination...
|
| If OP search were deployed on youtube, and fair use in its
| current form was allowed to survive, providing a historical step-
| change in the abundance and accessibility of reusable content,
| how might you imagine using that?
| vladoh wrote:
| I think we will get there soon! CLIP is a new model that OpenAI
| published in January and I'm sure Google is working on similar
| technology, which can be used for both video and image search.
| [deleted]
| vladoh wrote:
| This is a demonstration of the abilities of OpenAI's CLIP neural
| network. The tool will download a YouTube video, extract frames
| at regular intervals and precompute the feature vectors of each
| frame using CLIP.
|
| You can then use natural language search queries to find a
| particular frame of the video. The results are really amazing in
| my opinion...
|
| If you want to experiment with it yourself, I prepared a Colab
| notebook that can easily be run:
| https://colab.research.google.com/github/haltakov/natural-la...
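The search step described above can be sketched as a nearest-neighbour lookup over precomputed frame vectors. This is a minimal illustration, not the tool's actual code: the 3-dimensional vectors are toy stand-ins for real CLIP features, the helper names `normalize` and `search` are hypothetical, and cosine similarity is assumed as the score.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def search(query_vec, frame_vecs, top_k=3):
    """Return (frame_index, score) pairs for the best-matching frames."""
    q = normalize(query_vec)
    scores = [
        (i, sum(a * b for a, b in zip(q, normalize(f))))
        for i, f in enumerate(frame_vecs)
    ]
    # Highest similarity first.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]

# Toy stand-ins (assumption): real features would come from the CLIP
# image encoder (frames) and text encoder (query).
frame_features = [
    [1.0, 0.0, 0.0],  # frame 0
    [0.0, 1.0, 0.0],  # frame 1
    [0.7, 0.7, 0.0],  # frame 2
]
query_features = [0.9, 0.1, 0.0]

best = search(query_features, frame_features, top_k=2)
for index, score in best:
    print(f"frame {index}: similarity {score:.3f}")
```

The frame vectors only need to be computed once per video; each new query costs one text encoding plus the similarity pass.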
| ramraj07 wrote:
| This is amazing. I'm going to get this running on my Dropbox.
| Curious what it gets out of scanned documents as well.
| vladoh wrote:
| There is one caveat to be aware of - the image is cropped to
| a square in the center and scaled down to 224x224. So small
| details will be lost, for example if you want to run it on
| scanned documents. Photos work great though.
|
| I tried it on the 2M photos from the Unsplash dataset:
| https://github.com/haltakov/natural-language-image-search
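To see why small details get lost, here is a rough sketch of the crop-and-scale arithmetic described above. The function names are hypothetical and the example resolution is an assumption; only the center square crop and the 224x224 target come from the comment.

```python
def center_crop_box(width, height):
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side

def downscale_factor(width, height, target=224):
    """How many source pixels collapse into one output pixel per axis."""
    return min(width, height) / target

# Example input size (assumption): a 1920x1080 video frame.
box = center_crop_box(1920, 1080)
factor = downscale_factor(1920, 1080)
print(box)               # region that survives the crop
print(round(factor, 2))  # linear shrink factor after the crop
```

Anything outside the crop box is discarded, and whatever remains is shrunk by the printed factor, which is why fine print in a scanned page is unlikely to survive.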
| Crazyontap wrote:
| Can it work for more advanced keywords like, say, "traffic
| violation", where it spots a car jumping a red light or a
| pedestrian not using a crosswalk, etc.?
|
| It could be very useful to help with law enforcement.
| vladoh wrote:
| I think it can. However, you will likely need a bigger model.
| Currently, OpenAI shares only their small model and I hope
| they will soon release bigger ones!
| canada_dry wrote:
| Great demo.
|
| Wondering whether it would be more efficient to extract only
| frames where the content has changed (e.g. over a threshold
| and/or all I-frames)?
|
| Also, could this be used to identify event type in videos? I'd
| love to run my 25 years of home videos through this and have it
| annotate: "Christmas, birthday, park, camping...".
| vladoh wrote:
| Yes, this is definitely possible. You can maybe try computing
| some kind of image distance between frames or some keyframe
| extraction.
|
| Once you compute the features, the search is very efficient!
| I tried it for searching in the 2M photos dataset from
| Unsplash and it takes like 2-3 seconds:
| https://github.com/haltakov/natural-language-image-search
|
| I plan to run my personal photos through it :)
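The "image distance between frames" idea could look roughly like the sketch below. It is an illustration under stated assumptions, not what the tool does: frames are modelled as flat lists of grayscale pixel values, the distance is a mean absolute difference, and the threshold of 10.0 and the function names are made up for the example.

```python
def mean_abs_diff(frame_a, frame_b):
    """Average per-pixel absolute difference between two equal-sized frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)

def select_changed_frames(frames, threshold=10.0):
    """Keep frame 0, then every frame that differs enough from the last kept one."""
    if not frames:
        return []
    kept = [0]
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[kept[-1]], frames[i]) > threshold:
            kept.append(i)
    return kept

# Four tiny 2x2 "frames": two near-identical dark ones, then two bright ones.
frames = [
    [10, 10, 10, 10],
    [11, 10, 9, 10],
    [200, 190, 210, 205],
    [201, 190, 210, 204],
]
kept = select_changed_frames(frames)
print(kept)
```

Comparing against the last kept frame rather than the immediately preceding one means slow gradual changes still trigger a new keyframe eventually.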
| mockingbirdy wrote:
| Awesome! I'm currently working on the exact same thing (but
| with OCR added). Thank you for releasing this.
| canada_dry wrote:
| > the exact same thing (but with OCR added)
|
| Hmmm... what does "with OCR added" mean? If there is text in
| the video (e.g. street sign) that it can also be searched??
| mockingbirdy wrote:
| No, that wouldn't work too well. It's for YouTubers who
| stream their desktop screens and I need to extract some
| information to automatically process it. The desktop
| streams always look very similar so I don't need advanced
| AI/neural nets to extract that.
| mandeepj wrote:
| > The tool will download a YouTube video, extract frames at
| regular intervals
|
| That should be able to scale well :-)
| woko wrote:
| For info, the same tool works well with 2 million images
| found in the Unsplash dataset [1]. Features only have to be
| computed once for the dataset, and only the feature vector
| for the user query has to be computed on the fly. Then
| matching features can be done in a manner that scales well.
|
| So, the present tool does not scale because the videos are
| part of the user query, but a company with easy access to
| the videos and the computational power to pre-encode the
| frames as features could create a search engine based on
| CLIP.
|
| [1] https://github.com/haltakov/natural-language-image-search
| vladoh wrote:
| Thanks for sharing! :)
|
| Yes, the feature computation on the images has to be done
| only once and the representation is very efficient - 512
| float16 values per image.
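As a back-of-the-envelope check of how compact that representation is, the arithmetic can be written out as below. The 512 float16 values per image come from the comment; the 2,000,000 image count is the rounded "2M" figure mentioned in the thread, and the function name is hypothetical.

```python
FEATURE_DIM = 512        # values per image, as stated in the thread
BYTES_PER_FLOAT16 = 2    # a float16 occupies 16 bits

def index_size_bytes(num_images, dim=FEATURE_DIM, bytes_per_value=BYTES_PER_FLOAT16):
    """Raw size of the feature matrix, ignoring any file-format overhead."""
    return num_images * dim * bytes_per_value

per_image = index_size_bytes(1)
unsplash_total = index_size_bytes(2_000_000)  # rounded "2M" photos (assumption)

print(f"{per_image} bytes per image")
print(f"{unsplash_total / 1e9:.3f} GB for 2M images")
```

At about 1 KB per image, the whole feature matrix for a dataset of that size fits in the memory of an ordinary machine.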
| woko wrote:
| Yes, I know. :D Your previous project with Unsplash made
| me try a similar approach [1] for banners of video games
| on Steam.
|
| [1] https://github.com/woctezuma/steam-image-search
| vladoh wrote:
| Cool application! I wasn't aware that there are so many
| images available on Steam...
| woko wrote:
| Yes, Steam has grown a lot. Last I checked, there were
| ~50k apps.
|
| Edit: 50,630 apps according to
| https://www.gamedatacrunch.com/
|
| As I focused on vertical banners, the list was smaller
| (~30k apps). This is equivalent to the *lite* version of
| Unsplash's dataset.
| Abishek_Muthian wrote:
| Excellent work.
|
| If it could take an image set as input, then perhaps we could use
| this to identify ourselves in a random Internet video, e.g. a
| lengthy tourist video in which you suspect you could have been
| captured because you were there at that place on that day.
|
| There are people already looking for such a solution (I've added
| the link to that discussion on my profile).
| vladoh wrote:
| I think this is a valuable application, but I don't think
| CLIP is well suited for it. The power of CLIP comes from
| training a model to jointly "understand" text and images. If
| you are looking at identifying a particular person there are
| more suitable designs for face recognition.
| pininja wrote:
| This is very cool! Does this produce an occurrence index by any
| chance? It would be neat to explore a word map of a video.
| vladoh wrote:
| I pushed a small update and you will now see a heatmap
| displaying the score of the search query for each frame.
| vladoh wrote:
| Not yet, but I had this idea as well. You basically get a
| score describing how well a phrase is matching each of the
| images so it won't be difficult to do. I'll look into that!
| fireattack wrote:
| Just a heads up: in the demo's setup block, it installs pytorch
| 1.7.0 + cu101 from CLIP (since it requires them) and then
| immediately uninstalls it to re-install pytorch 1.7.1 by the
| next command, which takes at least 5 minutes. If we don't
| really need 1.7.1, we can save some time by removing the manual
| pytorch installation line.
| vladoh wrote:
| Yes, I know that this is a bit slow. The problem is you
| really need 1.7.1, because 1.7.0 leads to some strange issues
| and broken results:
|
| https://github.com/openai/CLIP/issues/13#issuecomment-771143...
| fireattack wrote:
| Ah, got it. I just noticed 1.7.1 is already in CLIP's
| requirements.txt, weird that colab would still install
| 1.7.0 to begin with.
|
| Edit: I just realized 1.7.0 just came with colab, not
| installed by CLIP.
| vladoh wrote:
| Yes, this is the actual problem...
| akinhwan wrote:
| Wow what sort of business ideas do you think could come of
| this?
| purplecats wrote:
| notebook isn't loading for me but this seems really cool
| vladoh wrote:
| Hmm, this is strange... The Colab notebook should load even if
| you are not logged in with your Google account (you will need
| to log in if you want to run it, though).
| gbennett71 wrote:
| Downloading YouTube videos is a Terms of Use violation, no?
| vladoh wrote:
| I haven't read the Terms of Use in detail, but I guess it
| depends on what you do with the data. I actually don't store,
| distribute or sell the videos, so I hope it is OK in the scope
| of such a personal project :)
| hansel_der wrote:
| lol no.
|
| believe it or not, but your browser/tv has to download the video
| in order to show it.
| amelius wrote:
| I feel like I've been training most of their road/traffic data
| through Google's much hated captcha.
| vladoh wrote:
| You feel like this, because you are right :) Just search online
| for "google captcha self driving".
| jameshart wrote:
| I'm curious: why hasn't google done this already?
|
| Seems like a key part of 'organizing all the world's
| information', right? Making YouTube searchable would open up
| their massive content library to discovery, and enable people to
| find content without having to rely on The Algorithm guessing
| what they might want to see...
|
| In general the switch from users going to the internet and asking
| it to find them certain things, to going to the internet and
| having things selected for them and offered up as
| 'recommendations' seems like a key shift in how the internet
| functions to disseminate ideas...
| vladoh wrote:
| CLIP is quite a new model published by OpenAI in January. Their
| work is novel and pushed the state-of-the-art in this area by a
| lot. I'm sure that Google is also working on similar
| applications.
|
| And I think they already have something similar. Recently, I've
| seen search results on Google that point you to a specific time
| of a YouTube video...
| suyash wrote:
| The problem still exists that you have to provide it the YouTube
| video to search within. It would be nice if there were a tool to
| search across all of YouTube.
| woko wrote:
| Which would require easy access to all of the videos, which
| only Google/YouTube itself has.
|
| Many nice things could be done, but the platform (or the data
| owner) has all the power in its hands.
| vladoh wrote:
| Yeah, indexing all YouTube videos is surely possible, but out
| of the scope of a personal project :)
___________________________________________________________________
(page generated 2021-02-13 23:02 UTC)