[HN Gopher] Label a Dataset with a Few Lines of Code
___________________________________________________________________
Label a Dataset with a Few Lines of Code
Author : ulrikhansen54
Score : 17 points
Date   : 2021-01-18 21:21 UTC (1 hour ago)
(HTM) web link (eric-landau.medium.com)
(TXT) w3m dump (eric-landau.medium.com)
| Imnimo wrote:
| I'm not really convinced this would work in practice. The trick
| seems to depend on the fact that the dataset is a sequence of
| frames of the same object shot from slightly different angles.
| But that's a terrible dataset - it might work for training a toy
| proof-of-concept, but if you actually wanted to do calorie
| estimation in the wild, you'd need a much more varied (and
| larger) training set. And once you have that, you lose the
| properties that made this labelling approach viable in the first
| place.
| gharman wrote:
| This reminds me of Snorkel (though it's unclear from the article
| whether they're using Snorkel's trick of aggregating many weak
| heuristics). It can be made to work even in the real world. The
| rub is that coming up with these programmatic labelers is easier
| said than done, especially for complex data.
|
| It works well if a domain expert can state a rule without
| "cheating" by looking at the data first, like "put a box around
| round red objects because those are always apples". But in
| practice people tend to cheat and look at the data first, and
| you end up with humans trying to emulate ML, poorly.
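The labeling-function idea gharman describes can be sketched in a few lines. This is a minimal illustration, not code from the article or from Snorkel itself: the fruit features, the rules, and the plain majority vote (Snorkel instead learns a label model over the votes) are all assumptions made up for the example.

```python
# Snorkel-style weak supervision sketch: several cheap heuristic
# "labeling functions" vote on each example, and a majority vote
# over the functions that fire produces a noisy training label.

ABSTAIN, APPLE, NOT_APPLE = -1, 1, 0

def lf_round_and_red(item):
    # Heuristic: round red objects are (almost) always apples.
    if item["shape"] == "round" and item["color"] == "red":
        return APPLE
    return ABSTAIN

def lf_elongated(item):
    # Heuristic: elongated objects (bananas, etc.) are not apples.
    if item["shape"] == "elongated":
        return NOT_APPLE
    return ABSTAIN

def lf_small_green(item):
    # Heuristic: small green objects (grapes, limes) are not apples.
    if item["color"] == "green" and item["size_cm"] < 5:
        return NOT_APPLE
    return ABSTAIN

LABELING_FUNCTIONS = [lf_round_and_red, lf_elongated, lf_small_green]

def weak_label(item):
    """Majority vote over the labeling functions that fire."""
    votes = [v for v in (lf(item) for lf in LABELING_FUNCTIONS)
             if v != ABSTAIN]
    if not votes:
        return ABSTAIN  # no heuristic fired; route to a human
    return max(set(votes), key=votes.count)

items = [
    {"shape": "round", "color": "red", "size_cm": 8},
    {"shape": "elongated", "color": "yellow", "size_cm": 18},
    {"shape": "round", "color": "orange", "size_cm": 7},
]
labels = [weak_label(i) for i in items]
# labels == [APPLE, NOT_APPLE, ABSTAIN]
```

The abstain path is the important design choice: items no heuristic covers fall back to human labelling instead of getting a forced guess.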
| eric_landau wrote:
| Definitely easier said than done, but the process at least
| makes labelling interesting. Sometimes you run into roadblocks
| where you can't get past having a human do some element of the
| labelling. But once you have a few algorithmic strategies that
| work reasonably well on a representative sample of your data,
| you can usually scale them pretty effectively to the rest of
| your data.
| eric_landau wrote:
| Hi Imnimo, I wrote the article and definitely understand your
| concerns. The point is not that the specific steps I took will
| work in general for most datasets, but rather the overall idea
| of taking a more data-science-y approach to labelling instead
| of just blindly throwing your data at a workforce.
|
| A more varied dataset will require additional strategies. We
| have done this type of thing with various datasets and what
| normally works is a combination of some vertical models,
| heuristics specific to the dataset, classical computer vision
| techniques, and some human label seeding/correction.
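One of the strategies mentioned above, human label seeding plus algorithmic propagation, can be sketched roughly as follows. Everything here is an illustrative assumption (the 2-D feature vectors, the nearest-neighbor rule, the distance threshold); a real pipeline would propagate through model embeddings with calibrated confidences.

```python
# Human label seeding sketch: humans label a small seed set, a
# nearest-neighbor rule propagates those labels to unlabeled items,
# and anything too far from every seed goes back to a human.
import math

def propagate(seed, unlabeled, max_dist=1.0):
    """Assign each unlabeled feature vector the label of its
    nearest seed example; items farther than max_dist from every
    seed are queued for human labelling instead."""
    auto, needs_human = {}, []
    for idx, x in enumerate(unlabeled):
        nearest_label, nearest_d = None, float("inf")
        for feats, label in seed:
            d = math.dist(x, feats)
            if d < nearest_d:
                nearest_label, nearest_d = label, d
        if nearest_d <= max_dist:
            auto[idx] = nearest_label      # confident: auto-label
        else:
            needs_human.append(idx)        # uncertain: human queue
    return auto, needs_human

# Two human-labeled seeds, three unlabeled items.
seed = [((0.0, 0.0), "apple"), ((5.0, 5.0), "banana")]
unlabeled = [(0.2, 0.1), (4.9, 5.2), (10.0, 10.0)]
auto, needs_human = propagate(seed, unlabeled)
# auto == {0: "apple", 1: "banana"}, needs_human == [2]
```

The correction loop then closes by feeding the human answers for `needs_human` back into the seed set and re-running.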
| Q6T46nT668w6i3m wrote:
| A common mistake in applied computer vision is to use a
| classical method (e.g. distance-based watershed) to buoy your
| training set. You'll end up with a computationally expensive
| method (e.g. a region-based convolutional neural network)
| that's a poor replication of the classical method. The major
| advantage of learning-based methods is to go _beyond_ classical
| performance and make inferences comparable to the manually
| annotated image. There's no shortcut.
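For reference, the distance-based step such classical pipelines build on looks roughly like this, assuming numpy and scipy are available. The toy image and the 50%-of-maximum threshold are made up for illustration; a full pipeline would then grow these markers with an actual watershed (e.g. scipy.ndimage.watershed_ift or skimage.segmentation.watershed).

```python
# Distance-transform marker extraction: two touching disks form one
# connected foreground component, but thresholding the Euclidean
# distance transform separates a core marker for each object.
import numpy as np
from scipy import ndimage

# Binary mask: two radius-7 disks that touch at a one-pixel neck.
yy, xx = np.mgrid[0:20, 0:40]
disk1 = (yy - 10) ** 2 + (xx - 12) ** 2 <= 49
disk2 = (yy - 10) ** 2 + (xx - 26) ** 2 <= 49
mask = disk1 | disk2

# The raw mask is a single connected component...
_, n_components = ndimage.label(mask)

# ...but distance-to-background is small at the neck, so keeping
# only pixels deep inside an object yields one marker per disk.
dist = ndimage.distance_transform_edt(mask)
markers, n_markers = ndimage.label(dist > 0.5 * dist.max())
# n_components == 1, n_markers == 2
```

This is exactly the kind of output the comment warns about using as training labels: the network can only learn to imitate the geometry of the distance transform, not to exceed it.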
| florin4- wrote:
| If you could just do "algorithmic labelling", why do you need
| to go to all that trouble of making a dataset and training a
| model in the first place? Why not just use this "algorithmic
| labelling" thing?
|
| Because that's not how this works; that's not how any of this
| works.
___________________________________________________________________
(page generated 2021-01-18 23:00 UTC)