[HN Gopher] Self-Supervised Video Object Segmentation by Motion ...
___________________________________________________________________
Self-Supervised Video Object Segmentation by Motion Grouping
Author : Hard_Space
Score : 178 points
Date : 2021-04-17 06:21 UTC (16 hours ago)
(HTM) web link (charigyang.github.io)
(TXT) w3m dump (charigyang.github.io)
| boromi wrote:
| The official code is a README.md file ??
| https://github.com/charigyang/motiongrouping
| weidi_xie wrote:
| will be online soon.
| ttty wrote:
| Animals have evolved highly functional visual systems to
| understand motion, assisting perception even under complex
| environments. In this paper, we work towards developing a
| computer vision system able to segment objects by exploiting
| motion cues, i.e. motion segmentation. We make the following
| contributions: First, we introduce a simple variant of the
| Transformer to segment optical flow frames into primary objects
| and the background. Second, we train the architecture in a self-
| supervised manner, i.e. without using any manual annotations.
| Third, we analyze several critical components of our method and
| conduct thorough ablation studies to validate their necessity.
| Fourth, we evaluate the proposed architecture on public
| benchmarks (DAVIS2016, SegTrackv2, and FBMS59). Despite using
| only optical flow as input, our approach achieves superior
| results compared to previous state-of-the-art self-supervised
| methods, while being an order of magnitude faster. We
| additionally evaluate on a challenging camouflage dataset (MoCA),
| significantly outperforming the other self-supervised approaches,
| and comparing favourably to the top supervised approach,
| highlighting the importance of motion cues, and the potential
| bias towards visual appearance in existing video segmentation
| models.
| mkl wrote:
| That's just the abstract of the linked article.
| seesawtron wrote:
| Does anyone in this field of "flow"-based segmentation know how
| the performance would hold up if there were multiple moving
| objects in the same scene? The examples in this paper seem
| limited to one moving object (a cluster of moving pixels) as
| opposed to a relatively static background.
| g_airborne wrote:
| In terms of accuracy, the authors mention it as a limitation,
| so probably it could be a problem.
|
| In terms of runtime, it should not matter. Generally speaking
| though, the overhead of optical flow is often overlooked. For
| video DL applications, optical flow calculation often takes
| more time than inference itself. For academic purposes,
| datasets are often already preprocessed and the optical flow
| runtime is not mentioned. Doing real time video analysis with
| optical flow is quite impractical though.
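To make that overhead concrete, here is a minimal sketch of why dense flow estimation is costly: even the most naive formulation, exhaustive block matching, does a full search-window scan per patch. This is an illustration only (not RAFT, Farnebäck, or any production algorithm) on a tiny synthetic frame pair:

```python
import numpy as np

def block_matching_flow(prev, curr, block=8, search=4):
    """Naive dense optical flow via exhaustive block matching.

    For each block x block patch in `prev`, find the displacement
    (within +/- `search` pixels) that best matches `curr` by sum of
    absolute differences (SAD). The cost is O(pixels * window^2)
    before any sub-pixel refinement, which is why flow often costs
    more than the downstream network inference.
    """
    h, w = prev.shape
    flow = np.zeros((h // block, w // block, 2), dtype=np.int64)
    for by in range(h // block):
        for bx in range(w // block):
            y, x = by * block, bx * block
            patch = prev[y:y + block, x:x + block].astype(np.int64)
            best, best_dv = np.inf, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    if yy < 0 or xx < 0 or yy + block > h or xx + block > w:
                        continue
                    cand = curr[yy:yy + block, xx:xx + block].astype(np.int64)
                    sad = np.abs(patch - cand).sum()
                    if sad < best:
                        best, best_dv = sad, (dy, dx)
            flow[by, bx] = best_dv
    return flow

# Toy frame pair: a bright square moves 2 pixels to the right.
prev = np.zeros((32, 32), dtype=np.uint8)
curr = np.zeros((32, 32), dtype=np.uint8)
prev[8:16, 8:16] = 255
curr[8:16, 10:18] = 255
flow = block_matching_flow(prev, curr)
print(flow[1, 1])  # block containing the square: displacement (dy, dx) = (0, 2)
```

Real estimators avoid the exhaustive search with pyramids or learned matching, but the per-pixel search structure is what makes flow the bottleneck in real-time pipelines.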
| weidi_xie wrote:
| The time for computing optical flow is indeed a bottleneck,
| and some researchers in the community are working on that:
| - https://arxiv.org/abs/2103.04524
| - https://arxiv.org/pdf/2103.17271.pdf
| amcoastal wrote:
| It's an active field of research. Optical flow algorithms are
| typically used to create vector fields and have shown promise
| in fluid flows (https://www.mdpi.com/2072-4292/13/4/690/htm).
| However, a deep learning version of this on real-life data has
| yet to be accomplished. If anyone reading this knows otherwise,
| plz link :D
| CyberDildonics wrote:
| I'm not sure what using the vectors for fluids has to do with
| this or what the parent asked. Fluids can use vectors,
| optical flow can produce vectors, but that is about it.
|
| This person asked about clustering and segmenting into more
| than two separate groups.
| forgingahead wrote:
| Fantastic. The ability to train without laborious
| labelling/annotation can really help produce effective models
| appropriate for real-world use cases.
|
| No code as yet, but looking forward to having it released and
| playing with it: https://github.com/charigyang/motiongrouping
|
| *Edit: It will be interesting to see if this works in the Video
| > Image direction as well. Much of the current video AI work
| stems from the image foundation, i.e. split a video into
| constituent frames and run detection models on those images. But
| the image detection/segmentation models treat each image as
| independent, and so the process when parsing video this way is
| unnecessarily complex - sequential video frames in the same
| scene are more alike than different.
|
| If good segmentation models for video can be more easily trained
| using this method, then it would be interesting if they can also
| be applied accurately to still images, since a snapshot of a
| video is a single image anyway.
| binaryzeitgeist wrote:
| This is very similar to work by Curious AI done a couple years
| ago, although it didn't work on high res videos.
|
| Tagger: Deep Unsupervised Perceptual Grouping
|
| https://arxiv.org/abs/1606.06724
| jiofih wrote:
| The difference is that the new work is based completely on
| optical flow (i.e. movement) input, while this one is based
| on.. something
| weidi_xie wrote:
| Indeed, the main difference is that they work in RGB space, and
| the dataset is a bit toy-ish (no offence), as the networks
| simply need to separate the objects either by color or by a
| regular texture pattern.
|
| What is proposed in this motion grouping paper is more of an
| idea-level contribution: objects in natural videos or images
| have very complicated textures, so there is no reason a network
| could group those pixels together if no supervision is
| provided.
|
| However, in motion space, pixels moving together form a
| homogeneous field, and luckily, from psychology, we know that
| the parts of an object tend to move together.
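As an illustration of that grouping idea (not the paper's Transformer architecture), here is a minimal sketch: given a flow field where one region shares a common displacement, even plain 2-cluster k-means on the raw flow vectors separates the moving object from the background. The synthetic flow field below is an assumption for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic flow field: static background (near-zero flow plus
# sensor jitter) and an object whose pixels share one displacement.
h, w = 64, 64
flow = rng.normal(0.0, 0.05, size=(h, w, 2))  # background jitter
flow[20:40, 20:40] += np.array([3.0, 0.0])    # object moves 3 px down

def two_means(vecs, iters=20):
    """Plain 2-cluster k-means on flow vectors -- the simplest
    possible "motion grouping", not the paper's method."""
    centers = np.array([vecs.min(axis=0), vecs.max(axis=0)])
    for _ in range(iters):
        # Assign each vector to its nearest center.
        d = np.linalg.norm(vecs[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute centers as cluster means.
        for k in range(2):
            if (labels == k).any():
                centers[k] = vecs[labels == k].mean(axis=0)
    return labels

labels = two_means(flow.reshape(-1, 2)).reshape(h, w)
# Pixels that move together land in the same cluster: the 20x20
# moving region comes out as one segment, the rest as the other.
fg = labels[30, 30]
print(int((labels == fg).sum()))  # 400, the moving region
```

This only handles one foreground group against a static background, which is also roughly the regime the paper's examples cover; multiple independently moving objects would need more clusters (and a way to choose how many).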
| jonplackett wrote:
| Amazing! I always thought it must be a huge disadvantage that
| computer vision only has one frame to work with (and only one
| camera vs 2 eyes). Often as a human you stare at a scene for a
| while when trying to spot something tricky, like a bird in the
| trees, and then find it when it moves. Or you move your head
| side to side to create movement. It would be interesting to see
| a comparison with what the best single-frame segmentation could
| detect in one of these scenes.
| weidi_xie wrote:
| If you check Table 3, you can see the comparison with some of
| the top unsupervised video segmentation models trained with
| supervised learning, e.g. COSNet, MATNet. They perform
| reasonably well on MoCA, but they were all trained with massive
| manual segmentation annotations, which is typically not
| scalable.
|
| The proposed self-supervised approach is comparable to those
| top methods, even without using RGB or any manual annotations.
| rasz wrote:
| Reminds me of when people wanted to use a RasPi for ZoneMinder,
| but it turned out to be too slow for the traditional Motion
| application. They turned to the hardware h264 encoder, which
| produces nice motion vectors as a side effect, and we got
| motion detection for free by leveraging the hardware encoder:
| https://billw2.github.io/pikrellcam/pikrellcam.html
| rijoja wrote:
| Wow!
| ulnarkressty wrote:
| This is also one of the tricks used to get real-time 3D
| reconstruction on the Oculus Quest.
|
| https://research.fb.com/wp-content/uploads/2020/07/Passthrou...
| watersb wrote:
| I thought that MP4 did something like this, but then I suppose
| (after working through that great image seam-carving article)
| that MP4 is doing inter-frame diff chunking. Hmm. So it "knows"
| that the pixels are grouped into regions that move or not -- but
| doesn't care. You can't ask an MP4 video codec to recognize
| objects.
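Right: a minimal sketch of what a codec actually does with those moving regions is motion-compensated prediction. The block and motion vector below are hypothetical; the point is that the vector exists only to shrink the encoded residual, never to label an object:

```python
import numpy as np

# Codec-style motion compensation: predict a block in the current
# frame by copying a displaced block from the previous frame, then
# encode only the (hopefully tiny) residual.
prev = np.zeros((16, 16), dtype=np.int64)
curr = np.zeros((16, 16), dtype=np.int64)
prev[4:8, 4:8] = 200
curr[4:8, 6:10] = 200  # same content, shifted 2 px to the right

mv = (0, 2)  # motion vector the encoder would have found
cblk = curr[4:8, 6:10]
# Prediction: fetch the block from `prev` at the position the
# motion vector points back to.
pred = prev[4 - mv[0]:8 - mv[0], 6 - mv[1]:10 - mv[1]]
residual = cblk - pred
print(int(np.abs(residual).sum()))  # 0: fully explained by motion
```

The encoder stores `mv` plus the residual, which compresses well; it throws the grouping information away as soon as the frame is reconstructed.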
| jeeeb wrote:
| I've been looking (preliminarily) recently at using movement data
| to improve object detection performance on video frames.
|
| The primary challenge I can think of is that the different
| network structure required makes it difficult to transfer-learn
| from a feature extraction backbone trained on ImageNet etc.
|
| This looks interesting. Are there any other good works in this
| area?
| eschneider wrote:
| Huh. We shipped commercial products based on this sort of thing
| years ago...
| krisoft wrote:
| Did you publish the method?
| weidi_xie wrote:
| nice
| rembicilious wrote:
| Wow! The title is perfectly succinct. I have never heard of this
| concept before and upon reading the title realized that many
| powerful techniques will be based upon it. Perhaps several
| already are. It seems such an "obvious" solution, but I can't say
| that it would have ever occurred to me. I don't work in ML, but
| fascinating little gems like this are what keeps me coming back
| to HN
| strogonoff wrote:
| Related concept: event camera[0]. Instead of capturing light
| intensities at every pixel of the sensor every frame, every pixel
| captures _changes in intensity_ as they occur.
|
| [0] https://en.wikipedia.org/wiki/Event_camera
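A rough sketch of that per-pixel behaviour (a simplified model of an event camera, not any vendor's actual sensor logic): each pixel emits a timestamped event whenever its log-intensity drifts past a threshold from the level it last reported, and stays silent otherwise:

```python
import numpy as np

def frames_to_events(frames, threshold=0.2):
    """Emit (t, y, x, polarity) events when a pixel's log-intensity
    changes by more than `threshold` since that pixel's last event.
    A simplified per-pixel model of an event camera."""
    ref = np.log1p(frames[0].astype(np.float64))
    events = []
    for t, frame in enumerate(frames[1:], start=1):
        cur = np.log1p(frame.astype(np.float64))
        diff = cur - ref
        ys, xs = np.nonzero(np.abs(diff) >= threshold)
        for y, x in zip(ys, xs):
            events.append((t, int(y), int(x), 1 if diff[y, x] > 0 else -1))
            ref[y, x] = cur[y, x]  # pixel resets to its new level
    return events

# Two frames: one pixel brightens, everything else is static,
# so exactly one positive-polarity event fires.
f0 = np.full((4, 4), 10, dtype=np.uint8)
f1 = f0.copy()
f1[2, 3] = 200
events = frames_to_events([f0, f1])
print(events)  # [(1, 2, 3, 1)]
```

Static pixels produce no data at all, which is exactly why this representation pairs naturally with motion-based segmentation: the sensor output is already "pixels that moved".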
___________________________________________________________________
(page generated 2021-04-17 23:01 UTC)