[HN Gopher] Show HN: Search inside YouTube videos using natural ...
       ___________________________________________________________________
        
       Show HN: Search inside YouTube videos using natural language
       queries
        
       Author : vladoh
       Score  : 246 points
       Date   : 2021-02-12 21:29 UTC (1 day ago)
        
 (HTM) web link (github.com)
        
       | mncharity wrote:
       | Years ago, when I prototyped an orders-of-magnitude physical-
       | properties playground web app, I found the development bottleneck
       | was searching for video clips and images. For example, find clips
       | showing the heartbeat of a
       | {goldfish,mouse,cat,dog,child,adult,horse,elephant,whale}, for a
       | <kg - body mass - metabolic/heart rate> association (metabolic
       | rate scales with body mass). Jiggling an oom kg scale, maybe
       | you're shown a cat or mouse, and maybe a mouse heart patter, and
       | whale's slow swish. Providing a massive (sorry) hidden
       | curriculum. They all exist on youtube, and might be fair use for
       | OER content, but finding them was not plausible. And still isn't,
       | even with commercial use of stock videos.
       | 
       | In the 1950's, the first Powers of Ten zoom book was hand drawn
       | from books over years. Around 1980, a PoT film and book could use
       | photos, but still good people made mistakes. Now creating a PoT
       | zoom book can be homework. A video a school project. An XR a
       | professional project. Technology, media search, acquisition, and
       | handling costs throttle science education content.
       | 
       | Necessary but not sufficient, of course. The first book might
       | have been imagined and created earlier, but wasn't. 1950's
       | astronomy textbooks needn't have had the color of the Sun wrong,
       | and the same now in 2020. Though, anticipated difficulty of
       | creation does throttle imagination...
       | 
       | If OP search were deployed on youtube, and fair use in its
       | current form was allowed to survive, providing a historical step-
       | change in the abundance and accessibility of reusable content,
       | how might you imagine using that?
        
         | vladoh wrote:
         | I think we will get there soon! CLIP is a new model that OpenAI
         | published in January and I'm sure Google is working on similar
         | technology, which can be used for both video and image search.
        
       | [deleted]
        
       | vladoh wrote:
       | This is a demonstration of the abilities of OpenAI's CLIP neural
       | network. The tool will download a YouTube video, extract frames
       | at regular intervals and precompute the feature vectors of each
       | frame using CLIP.
       | 
       | You can then use natural language search queries to find a
       | particular frame of the video. The results are really amazing in
       | my opinion...
       | 
       | If you want to experiment with it yourself, I prepared a Colab
       | notebook that can easily be run:
       | https://colab.research.google.com/github/haltakov/natural-la...
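The search step described above can be sketched as follows. This is a minimal illustration, not the tool's actual code: it assumes the frame and query feature vectors have already been computed by CLIP, and it uses toy 3-dimensional vectors in their place. The names `normalize` and `search_frames` are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def search_frames(frame_features, query_features, top_k=3):
    """Return (frame_index, score) pairs for the top_k best-matching frames."""
    query = normalize(query_features)
    scores = []
    for index, features in enumerate(frame_features):
        frame = normalize(features)
        scores.append((index, sum(a * b for a, b in zip(frame, query))))
    # Highest cosine similarity first
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Toy vectors stand in for real CLIP embeddings (assumption for illustration)
frame_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query_features = [0.9, 0.1, 0.0]
results = search_frames(frame_features, query_features, top_k=2)
print(results)
```

With frames sampled at a fixed interval, the returned frame index multiplied by that interval gives the approximate position in the video.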
        
         | ramraj07 wrote:
         | This is amazing. I'm going to get this running on my Dropbox.
         | Curious what it gets out of scanned documents as well.
        
           | vladoh wrote:
           | There is one caveat to be aware of - the image is cropped to
           | a square in the center and scaled down to 224x224. So small
           | details will be lost, for example if you want to run it on
           | scanned documents. Photos work great though.
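To see why small details get lost, here is a rough sketch of the center-crop arithmetic described above. The helper names are hypothetical, and the 2480x3508 page size (an A4 scan at 300 dpi) is only an example input.

```python
def center_square_crop_box(width, height):
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)


def downscale_factor(width, height, target=224):
    """How many source pixels end up in one output pixel along each axis."""
    return min(width, height) / target


# Example input: an A4 page scanned at 300 dpi (assumption for illustration)
box = center_square_crop_box(2480, 3508)
factor = downscale_factor(2480, 3508)
print(box, round(factor, 1))
```

For this page the crop drops 514 rows at the top and at the bottom, and each output pixel covers roughly an 11x11 block of the scan, which is far too coarse for printed text.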
           | 
           | I tried it on the 2M photos from the Unsplash dataset:
           | https://github.com/haltakov/natural-language-image-search
        
         | Crazyontap wrote:
         | Can it work for more advanced keywords like, say, "traffic
         | violation", where it spots a car jumping a red light or a
         | pedestrian not using a crosswalk, etc.?
         | 
         | It could be very useful to help with law enforcement.
        
           | vladoh wrote:
           | I think it can. However, you will likely need a bigger model.
           | Currently, OpenAI shares only their small model and I hope
           | they will soon release bigger ones!
        
         | canada_dry wrote:
         | Great demo.
         | 
         | Wondering whether it would be more efficient to extract only
         | frames where the content has changed (e.g. over a threshold
         | and/or all I-frames)?
         | 
         | Also, could this be used to identify event type in videos? I'd
         | love to run my 25 years of home videos through this and have it
         | annotate: "Christmas, birthday, park, camping...".
        
           | vladoh wrote:
           | Yes, this is definitely possible. You can maybe try computing
           | some kind of image distance between frames or some keyframe
           | extraction.
           | 
           | Once you compute the features, the search is very efficient!
           | I tried it for searching in the 2M photos dataset from
           | Unsplash and it takes like 2-3 seconds:
           | https://github.com/haltakov/natural-language-image-search
           | 
           | I plan to run my personal photos through it :)
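The "image distance between frames" idea mentioned above could look roughly like this. It is a sketch under stated assumptions: frames are flat lists of grayscale pixel values, the distance is a plain mean absolute difference, and the threshold of 20 is arbitrary. Both function names are hypothetical.

```python
def mean_abs_difference(frame_a, frame_b):
    """Average per-pixel absolute difference between two equally sized frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def select_changed_frames(frames, threshold):
    """Keep the first frame, then every frame that differs enough from the last kept one."""
    if not frames:
        return []
    kept = [0]
    for index in range(1, len(frames)):
        if mean_abs_difference(frames[kept[-1]], frames[index]) > threshold:
            kept.append(index)
    return kept


# Tiny 4-pixel "frames" stand in for decoded video frames (assumption)
frames = [
    [10, 10, 10, 10],
    [11, 10, 10, 9],       # nearly identical to frame 0
    [200, 200, 200, 200],  # scene change
    [201, 199, 200, 200],  # nearly identical to frame 2
    [50, 50, 50, 50],      # another scene change
]
kept = select_changed_frames(frames, threshold=20)
print(kept)
```

Only the kept frames would then need to be encoded, which reduces the feature computation for videos with long static shots.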
        
         | mockingbirdy wrote:
         | Awesome! I'm currently working on the exact same thing (but
         | with OCR added). Thank you for releasing this.
        
           | canada_dry wrote:
           | > the exact same thing (but with OCR added)
           | 
           | Hmmm... what does "with OCR added" mean? That text in the
           | video (e.g. a street sign) can also be searched?
        
             | mockingbirdy wrote:
             | No, that wouldn't work too well. It's for YouTubers who
             | stream their desktop screens and I need to extract some
             | information to automatically process it. The desktop
             | streams always look very similar so I don't need advanced
             | AI/neural nets to extract that.
        
         | mandeepj wrote:
         | > The tool will download a YouTube video, extract frames at
         | regular intervals
         | 
         | That should be able to scale well :-)
        
           | woko wrote:
           | For info, the same tool works well with 2 million images
           | found in the Unsplash dataset [1]. Features only have to be
           | computed once for the dataset, and only the feature vector
           | for the user query has to be computed on the fly. Then
           | matching features can be done in a manner that scales well.
           | 
           | So, the present tool does not scale because the videos are
           | part of the user query, but a company with easy access to
           | the videos and the computational power to pre-encode the
           | frames as features could create a search engine based on
           | CLIP.
           | 
           | [1] https://github.com/haltakov/natural-language-image-search
        
             | vladoh wrote:
             | Thanks for sharing! :)
             | 
             | Yes, the feature computation on the images has to be done
             | only once and the representation is very efficient - 512
             | float16 values per image.
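A quick back-of-the-envelope check of that representation size, assuming 2 bytes per float16 value and no per-image overhead such as IDs or file paths:

```python
FEATURES_PER_IMAGE = 512  # from the comment above
BYTES_PER_FLOAT16 = 2     # a float16 value occupies 16 bits


def index_size_bytes(num_images):
    """Raw size of the feature matrix, ignoring any storage overhead."""
    return num_images * FEATURES_PER_IMAGE * BYTES_PER_FLOAT16


per_image = index_size_bytes(1)
unsplash_full = index_size_bytes(2_000_000)
print(per_image, unsplash_full, round(unsplash_full / 2**30, 2))
```

That is 1 KiB per image and about 1.9 GiB for 2 million images, small enough to hold in memory on an ordinary machine.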
        
               | woko wrote:
               | Yes, I know. :D Your previous project with Unsplash made
               | me try a similar approach [1] for banners of video games
               | on Steam.
               | 
               | [1] https://github.com/woctezuma/steam-image-search
        
               | vladoh wrote:
               | Cool application! I wasn't aware that there are so many
               | images available on Steam...
        
               | woko wrote:
               | Yes, Steam has grown a lot. Last I checked, there were
               | ~50k apps.
               | 
               | Edit: 50,630 apps according to
               | https://www.gamedatacrunch.com/
               | 
               | As I focused on vertical banners, the list was smaller
               | (~30k apps). This is equivalent to the *lite* version of
               | Unsplash's dataset.
        
         | Abishek_Muthian wrote:
         | Excellent work.
         | 
         | If it could take an image set as input, then perhaps we could use
         | this to identify ourselves in a random Internet video, e.g. a
         | lengthy tourist video in which you suspect you could have been
         | captured as you were there at that place on that day.
         | 
         | There are people already looking for such a solution (I've added
         | the link to that discussion on my profile).
        
           | vladoh wrote:
           | I think this is a valuable application, but I don't think
           | CLIP is well suited for it. The power of CLIP comes from
           | training a model to jointly "understand" text and images. If
           | you are looking at identifying a particular person there are
           | more suitable designs for face recognition.
        
         | pininja wrote:
         | This is very cool! Does this produce an occurrence index by any
         | chance? It would be neat to explore a word map of a video.
        
           | vladoh wrote:
           | I pushed a small update and you will now see a heatmap
           | displaying the score of the search query for each frame.
        
           | vladoh wrote:
           | Not yet, but I had this idea as well. You basically get a
           | score describing how well a phrase is matching each of the
           | images so it won't be difficult to do. I'll look into that!
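Since each frame gets a score for the query, a per-frame heatmap can be approximated even in plain text. The sketch below is not the tool's implementation: the scores are made-up example values, and `text_heatmap` is a hypothetical helper that maps each score to a character of increasing density.

```python
SHADES = " .:-=+*#%@"  # low score on the left, high score on the right


def text_heatmap(scores):
    """Render one character per frame, scaled between the lowest and highest score."""
    low, high = min(scores), max(scores)
    span = (high - low) or 1.0  # avoid division by zero when all scores are equal
    cells = []
    for score in scores:
        level = int((score - low) / span * (len(SHADES) - 1))
        cells.append(SHADES[level])
    return "".join(cells)


# Made-up per-frame similarity scores for one query (assumption)
scores = [0.18, 0.19, 0.31, 0.27, 0.18]
row = text_heatmap(scores)
print(repr(row))
```

The densest characters mark the frames that match the query best, so peaks in the row point to the interesting parts of the video.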
        
         | fireattack wrote:
         | Just a heads up: in the demo's setup block, it installs pytorch
         | 1.7.0 + cu101 from CLIP (since it requires them) and then
         | immediately uninstalls it to re-install pytorch 1.7.1 by the
         | next command, which takes at least 5 minutes. If we don't
         | really need 1.7.1, we can save some time by removing the manual
         | pytorch installation line.
        
           | vladoh wrote:
           | Yes, I know that this is a bit slow. The problem is you
           | really need 1.7.1, because 1.7.0 leads to some strange issues
           | and broken results:
           | 
           | https://github.com/openai/CLIP/issues/13#issuecomment-771143...
        
             | fireattack wrote:
             | Ah, got it. I just noticed 1.7.1 is already in CLIP's
             | requirements.txt, weird that colab would still install
             | 1.7.0 to begin with.
             | 
             | Edit: I just realized 1.7.0 just came with colab, not
             | installed by CLIP.
        
               | vladoh wrote:
               | Yes, this is the actual problem...
        
         | akinhwan wrote:
         | Wow what sort of business ideas do you think could come of
         | this?
        
       | purplecats wrote:
       | notebook isn't loading for me but this seems really cool
        
         | vladoh wrote:
         | Hmm, this is strange... The Colab notebook should load even if
         | you are not logged in with your Google account (you will need
         | to log in if you want to run it, though).
        
       | gbennett71 wrote:
       | Downloading YouTube videos is a Terms of Use violation, no?
        
         | vladoh wrote:
         | I haven't read the Terms of Use in detail, but I guess it
         | depends on what you do with the data. I actually don't store,
         | distribute or sell the videos, so I hope it is OK in the scope
         | of such a personal project :)
        
         | hansel_der wrote:
         | lol no.
         | 
         | believe it or not, but your browser/tv has to download the video
         | in order to show it.
        
       | amelius wrote:
       | I feel like I've been training most of their road/traffic data
       | through Google's much hated captcha.
        
         | vladoh wrote:
         | You feel like this, because you are right :) Just search online
         | for "google captcha self driving".
        
       | jameshart wrote:
       | I'm curious: why hasn't google done this already?
       | 
       | Seems like a key part of 'organizing all the world's
       | information', right? Making YouTube searchable would open up
       | their massive content library to discovery, and enable people to
       | find content without having to rely on The Algorithm guessing
       | what they might want to see...
       | 
       | In general the switch from users going to the internet and asking
       | it to find them certain things, to going to the internet and
       | having things selected for them and offered up as
       | 'recommendations' seems like a key shift in how the internet
       | functions to disseminate ideas...
        
         | vladoh wrote:
         | CLIP is quite a new model published by OpenAI in January. Their
         | work is novel and pushed the state-of-the-art in this area by a
         | lot. I'm sure that Google is also working on similar
         | applications.
         | 
         | And I think they already have something similar. Recently, I've
         | seen search results on Google that point you to a specific time
         | of a YouTube video...
        
       | suyash wrote:
       | The problem still exists that you have to provide it the YouTube
       | video to search within, would be nice if there was a tool to
       | search across all of YouTube.
        
         | woko wrote:
         | Which would require easy access to all of the videos, which
         | only Google/YouTube itself has.
         | 
         | Many nice things could be done, but the platform (or the data
         | owner) has all the power in its hands.
        
           | vladoh wrote:
           | Yeah, indexing all YouTube videos is surely possible, but out
           | of the scope of a personal project :)
        
       ___________________________________________________________________
       (page generated 2021-02-13 23:02 UTC)