[HN Gopher] Self-Supervised Video Object Segmentation by Motion ...
       ___________________________________________________________________
        
       Self-Supervised Video Object Segmentation by Motion Grouping
        
       Author : Hard_Space
       Score  : 178 points
       Date   : 2021-04-17 06:21 UTC (16 hours ago)
        
 (HTM) web link (charigyang.github.io)
 (TXT) w3m dump (charigyang.github.io)
        
       | boromi wrote:
       | The official code is a README.md file ??
       | https://github.com/charigyang/motiongrouping
        
         | weidi_xie wrote:
         | will be online soon.
        
       | ttty wrote:
       | Animals have evolved highly functional visual systems to
       | understand motion, assisting perception even under complex
       | environments. In this paper, we work towards developing a
       | computer vision system able to segment objects by exploiting
       | motion cues, i.e. motion segmentation. We make the following
       | contributions: First, we introduce a simple variant of the
       | Transformer to segment optical flow frames into primary objects
       | and the background. Second, we train the architecture in a self-
       | supervised manner, i.e. without using any manual annotations.
       | Third, we analyze several critical components of our method and
       | conduct thorough ablation studies to validate their necessity.
       | Fourth, we evaluate the proposed architecture on public
       | benchmarks (DAVIS2016, SegTrackv2, and FBMS59). Despite using
       | only optical flow as input, our approach achieves superior
       | results compared to previous state-of-the-art self-supervised
       | methods, while being an order of magnitude faster. We
       | additionally evaluate on a challenging camouflage dataset (MoCA),
       | significantly outperforming the other self-supervised approaches,
       | and comparing favourably to the top supervised approach,
       | highlighting the importance of motion cues, and the potential
       | bias towards visual appearance in existing video segmentation
       | models.
        
         | mkl wrote:
         | That's just the abstract of the linked article.
        
       | seesawtron wrote:
        | Does anyone in this field of "flow"-based segmentation know
        | how the performance would be if there were multiple moving
        | objects in the same scene? The examples in this paper seem
        | limited to one moving object (a cluster of moving pixels) as
        | opposed to a relatively static background.
        
         | g_airborne wrote:
          | In terms of accuracy, the authors mention it as a
          | limitation, so it could well be a problem.
          | 
          | In terms of runtime, it should not matter. Generally
          | speaking though, the overhead of optical flow is often
          | overlooked. For video DL applications, the optical flow
          | calculation often takes more time than inference itself. For
          | academic purposes, datasets are often already preprocessed,
          | and the optical flow runtime is not mentioned. Doing real-
          | time video analysis with optical flow is quite impractical,
          | though.
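          | 
          | A quick way to put a number on that (classical Farneback
          | flow via OpenCV, purely as illustration; learned flow
          | estimators have very different cost profiles):
          | 
          |     import time
          |     import cv2
          |     import numpy as np
          | 
          |     # two synthetic grayscale frames, DAVIS-like resolution
          |     prev = np.random.randint(0, 255, (480, 854), dtype=np.uint8)
          |     curr = np.random.randint(0, 255, (480, 854), dtype=np.uint8)
          | 
          |     t0 = time.perf_counter()
          |     flow = cv2.calcOpticalFlowFarneback(
          |         prev, curr, None, pyr_scale=0.5, levels=3, winsize=15,
          |         iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
          |     t1 = time.perf_counter()
          |     print(f"flow: {(t1 - t0) * 1000:.1f} ms per frame pair")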
        
           | weidi_xie wrote:
            | The time for computing optical flow is indeed a
            | bottleneck, and some researchers in the community are
            | working on that:
            | - https://arxiv.org/abs/2103.04524
            | - https://arxiv.org/pdf/2103.17271.pdf
        
         | amcoastal wrote:
          | It's an active field of research. Optical flow algorithms
          | are typically used to create vector fields and have shown
          | promise in fluid flows
          | (https://www.mdpi.com/2072-4292/13/4/690/htm). However, a
          | deep learning version of this on real-life data has yet to
          | be accomplished. If anyone reading this knows otherwise,
          | plz link :D
        
           | CyberDildonics wrote:
           | I'm not sure what using the vectors for fluids has to do with
           | this or what the parent asked. Fluids can use vectors,
           | optical flow can produce vectors, but that is about it.
           | 
           | This person asked about clustering and segmenting into more
           | than two separate groups.
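            | 
            | A toy version of that multi-group question (a hypothetical
            | pipeline, not what the paper does): cluster the per-pixel
            | flow vectors into k motion groups with k-means.
            | 
            |     import numpy as np
            |     from sklearn.cluster import KMeans
            | 
            |     H, W = 64, 64
            |     flow = np.zeros((H, W, 2), dtype=np.float32)  # static bg
            |     flow[10:30, 10:30] = (3.0, 0.0)   # object A moves right
            |     flow[40:60, 35:60] = (0.0, -2.0)  # object B moves up
            | 
            |     labels = KMeans(n_clusters=3, n_init=10).fit_predict(
            |         flow.reshape(-1, 2))
            |     segmentation = labels.reshape(H, W)  # three motion groups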
        
       | forgingahead wrote:
        | Fantastic. The ability to train without laborious
        | labelling/annotations can really help produce effective models
        | appropriate for real-world use cases.
       | 
       | No code as yet, but looking forward to having it released and
       | playing with it: https://github.com/charigyang/motiongrouping
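        | 
        | In the meantime, my mental model from the abstract is a slot-
        | attention-style module that groups optical-flow pixels and is
        | trained by reconstructing the flow itself, so no labels are
        | needed. A minimal sketch of that idea (my own guess, not the
        | official code):
        | 
        |     import torch
        |     import torch.nn as nn
        |     import torch.nn.functional as F
        | 
        |     class MotionGrouping(nn.Module):
        |         def __init__(self, num_slots=2, dim=64, iters=3):
        |             super().__init__()
        |             self.slots = nn.Parameter(torch.randn(num_slots, dim))
        |             self.enc = nn.Conv2d(2, dim, 3, padding=1)  # flow: (u, v)
        |             self.to_q = nn.Linear(dim, dim)
        |             self.to_kv = nn.Linear(dim, 2 * dim)
        |             self.gru = nn.GRUCell(dim, dim)
        |             self.dec = nn.Linear(dim, 2)  # per-slot flow prediction
        |             self.iters = iters
        | 
        |         def forward(self, flow):  # flow: (B, 2, H, W)
        |             B, _, H, W = flow.shape
        |             feats = self.enc(flow).flatten(2).transpose(1, 2)
        |             k, v = self.to_kv(feats).chunk(2, dim=-1)
        |             slots = self.slots.expand(B, -1, -1)
        |             for _ in range(self.iters):
        |                 q = self.to_q(slots)
        |                 # softmax over slots: slots compete for pixels
        |                 attn = torch.softmax(
        |                     q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=1)
        |                 attn = attn / attn.sum(dim=-1, keepdim=True)
        |                 updates = attn @ v  # (B, num_slots, dim)
        |                 d = updates.shape[-1]
        |                 slots = self.gru(updates.reshape(-1, d),
        |                                  slots.reshape(-1, d)).view(B, -1, d)
        |             # the attention masks are the segmentation;
        |             # each slot decodes a flow value for its pixels
        |             recon = torch.einsum('bsp,bsc->bpc', attn, self.dec(slots))
        |             masks = attn.argmax(dim=1).view(B, H, W)
        |             return recon.transpose(1, 2).view(B, 2, H, W), masks
        | 
        |     flow = torch.randn(1, 2, 32, 32)
        |     recon, masks = MotionGrouping()(flow)
        |     loss = F.mse_loss(recon, flow)  # self-supervised: no labels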
       | 
        | *Edit: It will be interesting to see if this transfers from
        | video to images as well. Much of the current video AI work
        | stems from the image foundation, i.e. split a video into its
        | constituent frames and run detection models on those images.
        | But image detection/segmentation models treat each frame as
        | independent, so parsing video this way is needlessly wasteful
        | - sequential video frames in the same scene are more alike
        | than different.
       | 
       | If good segmentation models for video can be more easily trained
       | using this method, then it would be interesting if they can also
       | be applied accurately to still images, since a snapshot of a
       | video is a single image anyway.
        
       | binaryzeitgeist wrote:
       | This is very similar to work by Curious AI done a couple years
       | ago, although it didn't work on high res videos.
       | 
       | Tagger: Deep Unsupervised Perceptual Grouping
       | 
       | https://arxiv.org/abs/1606.06724
        
         | jiofih wrote:
         | The difference is that the new work is based completely on
         | optical flow (i.e. movement) input, while this one is based
         | on.. something
        
         | weidi_xie wrote:
          | Indeed. The main difference is that they work in RGB space,
          | and the dataset is a bit toy-ish (no offence), as the
          | networks simply need to separate the objects either by color
          | or by a regular texture pattern.
          | 
          | What is proposed in this motion grouping paper is more at
          | the idea level. The observation is that objects in natural
          | videos or images have very complicated textures, so there is
          | no reason a network could group those pixels together if no
          | supervision is provided.
          | 
          | However, in motion space, pixels that move together form a
          | homogeneous field, and, from psychology, we know that the
          | parts of an object tend to move together.
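          | 
          | A tiny numeric illustration of that point (my toy example,
          | not from the paper): in RGB the object pixels are
          | heterogeneous because of texture, but in flow space they are
          | one homogeneous field, so even a plain threshold recovers
          | the object.
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          |     H, W = 32, 32
          |     rgb = rng.random((H, W, 3))    # textured: RGB alone says little
          |     flow = np.zeros((H, W, 2))
          |     flow[8:24, 8:24] = (2.0, 1.0)  # object pixels all move together
          | 
          |     mask = np.linalg.norm(flow, axis=-1) > 0.5
          |     print(mask.sum())  # 256: the full 16x16 object recovered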
        
       | jonplackett wrote:
        | Amazing! I always thought it must be a huge disadvantage that
        | computer vision only has one frame to work with (and only one
        | camera vs two eyes). Often, as a human, you stare at a scene
        | for a while when trying to spot something tricky, like a bird
        | in the trees, and then find it when it moves. Or you move your
        | head side to side to create movement. It would be interesting
        | to see a comparison with what the best single-frame
        | segmentation could detect in one of these scenes.
        
         | [deleted]
        
         | weidi_xie wrote:
          | If you check Table 3, you can see the comparison with some
          | of the top models for unsupervised video segmentation (the
          | task), which were nevertheless trained with supervised
          | learning, e.g. COSNet, MATNet. They perform reasonably well
          | on MoCA, but they were all trained with massive amounts of
          | manual segmentation annotation, which is typically not
          | scalable.
          | 
          | The proposed self-supervised approach is comparable to those
          | top methods, even without using RGB or any manual
          | annotations.
        
       | rasz wrote:
        | Reminds me of when people wanted to use the RasPi for
        | ZoneMinder, but it turned out to be too slow for the
        | traditional Motion application. They turned to the hardware
        | h264 encoder, which produces nice motion vectors as a
        | byproduct, and we got motion detection for free by leveraging
        | the hardware encoder:
        | https://billw2.github.io/pikrellcam/pikrellcam.html
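        | 
        | You can inspect the same kind of codec motion vectors with
        | ffmpeg's export_mvs flag (decoder side rather than the Pi's
        | encoder, but the same idea), e.g. driven from Python:
        | 
        |     import subprocess
        | 
        |     # overlay P-/B-frame motion vectors on the decoded video
        |     subprocess.run([
        |         "ffmpeg", "-flags2", "+export_mvs", "-i", "input.mp4",
        |         "-vf", "codecview=mv=pf+bf+bb",
        |         "mv_overlay.mp4",
        |     ], check=True)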
        
         | rijoja wrote:
         | Wow!
        
         | ulnarkressty wrote:
         | This is also one of the tricks used to get real-time 3D
         | reconstruction on the Oculus Quest.
         | 
         | https://research.fb.com/wp-content/uploads/2020/07/Passthrou...
        
       | watersb wrote:
        | I thought that MP4 did something like this, but then I suppose
        | (after working through that great image seam-carving article)
        | that the codec inside MP4 is doing motion-compensated inter-
        | frame prediction on blocks. Hmm. So it "knows" that the pixels
        | are grouped into regions that move or not -- but it doesn't
        | care. You can't ask an MP4 video codec to recognize objects.
        
       | jeeeb wrote:
        | I've recently been looking (preliminarily) at using movement
        | data to improve object detection performance on video frames.
        | 
        | The primary challenge I can think of is that the different
        | network structure required makes it difficult to transfer-
        | learn from a feature extraction backbone pretrained on
        | ImageNet etc. (one workaround is sketched below).
        | 
        | This looks interesting. Are there any other good works in this
        | area?
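        | 
        | One standard trick for that backbone problem (not from this
        | paper): "inflate" the first conv of an ImageNet-pretrained
        | backbone so it accepts RGB + flow, initialising the two extra
        | input channels from the mean of the pretrained RGB filters.
        | 
        |     import torch
        |     import torch.nn as nn
        |     from torchvision.models import resnet18
        | 
        |     model = resnet18(pretrained=True)
        |     old = model.conv1  # Conv2d(3, 64, kernel_size=7, ...)
        |     new = nn.Conv2d(5, 64, kernel_size=7, stride=2,
        |                     padding=3, bias=False)
        |     with torch.no_grad():
        |         new.weight[:, :3] = old.weight  # keep the RGB filters
        |         new.weight[:, 3:] = old.weight.mean(1, keepdim=True)
        |     model.conv1 = new
        | 
        |     x = torch.randn(1, 5, 224, 224)  # RGB (3) + flow (2)
        |     out = model(x)  # rest of the backbone is unchanged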
        
       | eschneider wrote:
       | Huh. We shipped commercial products based on this sort of thing
       | years ago...
        
         | krisoft wrote:
         | Did you publish the method?
        
         | weidi_xie wrote:
         | nice
        
       | rembicilious wrote:
       | Wow! The title is perfectly succinct. I have never heard of this
       | concept before and upon reading the title realized that many
       | powerful techniques will be based upon it. Perhaps several
       | already are. It seems such an "obvious" solution, but I can't say
       | that it would have ever occurred to me. I don't work in ML, but
       | fascinating little gems like this are what keeps me coming back
        | to HN.
        
       | [deleted]
        
       | strogonoff wrote:
       | Related concept: event camera[0]. Instead of capturing light
       | intensities at every pixel of the sensor every frame, every pixel
       | captures _changes in intensity_ as they occur.
       | 
       | [0] https://en.wikipedia.org/wiki/Event_camera
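        | 
        | A crude frame-difference approximation of that idea (real
        | sensors emit events asynchronously per pixel; this only shows
        | the thresholded log-intensity change):
        | 
        |     import numpy as np
        | 
        |     def events_from_frames(prev, curr, threshold=0.2):
        |         d = (np.log1p(curr.astype(np.float32))
        |              - np.log1p(prev.astype(np.float32)))
        |         events = np.zeros_like(d, dtype=np.int8)
        |         events[d > threshold] = 1    # brightness went up
        |         events[d < -threshold] = -1  # brightness went down
        |         return events                # 0 = silent pixel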
        
       ___________________________________________________________________
       (page generated 2021-04-17 23:01 UTC)