[HN Gopher] A better way to build ML: Why you should be using Active Learning
___________________________________________________________________
A better way to build ML: Why you should be using Active Learning
Author : razcle
Score : 95 points
Date : 2021-02-04 16:38 UTC (6 hours ago)
(HTM) web link (humanloop.com)
(TXT) w3m dump (humanloop.com)
| andy99 wrote:
| I have a suggestion about the first plot you show in the writeup.
| From what I can see, it is based on a finite pool of data and so
| it undersells active learning: performance shoots up as AL finds
| the interesting points, but then the curve flattens and is less
| steep than the random curve as the "boring" points get added. It
| would be nice to see the same curve for a bigger training pool
| where AL was able to get to a target accuracy without running out
| of valuable training points. I suspect that would make the
| difference between the two curves much more stark. As it is, it
| just looks like AL does better in the very-low-data regime, but
| to get to high accuracy you need to use the whole dataset
| anyway, so it's a wash between AL and random.
| razcle wrote:
| Yeah, I think that's a good point. I'm actually planning to do
| a follow-up post that is a case study with some real-world
| data, and the plots in that are much more like what you
| describe.
|
| I may update the post. Thanks!
| dexter89_kp3 wrote:
| What are your thoughts on synthetic data vs active learning?
|
| For some domains, with privacy concerns or rarity of objects,
| getting labelled data for deep learning is challenging.
|
| There is decent research on sim2real, i.e. transferring models
| trained on synthetic data to real-world applications:
| https://arxiv.org/pdf/1703.06907.pdf
| razcle wrote:
| TL;DR: I think it's complementary.
|
| Synthetic data is particularly valuable when even the
| unlabelled data is expensive to obtain. For example if you want
| to train a driverless car, you may never see an ambulance
| driving at night in the rain even if you drive for thousands of
| miles. In that case, being able to synthesise data makes a lot
| of sense and we have lots of tools for computer graphics that
| make this easy.
|
| Synthetic data can also be useful to share data when there are
| privacy concerns but my own feeling here is that there are
| better approaches to privacy preservation, like federated
| learning and learning via randomised response
| (https://arxiv.org/pdf/2001.04942.pdf).
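|
| (For anyone unfamiliar, a rough sketch of the classic
| randomised-response idea for a single binary label; purely
| illustrative, the linked paper does something more
| sophisticated:)
|
|     import random
|
|     def randomised_response(true_label, p_truth=0.75):
|         # With probability p_truth report the true label,
|         # otherwise report a coin flip. Any one response is
|         # plausibly deniable, but the population rate of
|         # true labels can still be recovered.
|         if random.random() < p_truth:
|             return true_label
|         return random.random() < 0.5
|
|     def estimate_true_rate(responses, p_truth=0.75):
|         # Invert the noise: observed =
|         #   p_truth * true + (1 - p_truth) * 0.5
|         observed = sum(responses) / len(responses)
|         return (observed - (1 - p_truth) * 0.5) / p_truth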
|
| In general though, outside of some vision applications, I'm
| pretty sceptical of synthetic data. For synthetic data to work
| well, you need a really good class-conditional generator, e.g.
| "generate a tweet that is a negative-sentiment review". But if
| you have a sufficiently good model to do this, then you can
| probably use that model to solve your classification task
| anyway.
|
| For most settings, I think synthetic data will work for data
| augmentation as a regulariser but will not be a substitute for
| all labelled data.
|
| For the labelled data, active learning should still help.
| lwhsiao wrote:
| Hi Mike,
|
| Can you talk about the tradeoffs or relationship between active
| learning and weak supervision from your point of view?
| peadarohaodha wrote:
| Adding to what Raza said - a consideration for both active
| learning and weak supervision is the need to construct a gold-
| standard labelled dataset for model validation and testing
| purposes. At Humanloop, in addition to collecting training
| data, we are also using a form of active learning to speed up
| the collection of an unbiased test data set.
|
| Another consideration on the weak supervision side (for the
| Snorkel-style approach of labelling functions) is that creating
| labelling functions can be a relatively technical task, which
| may not be well suited to non-technical domain-expert
| annotators providing feedback to the model.
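|
| (For context, a labelling function is just a small program that
| votes for a label or abstains. A toy sketch of the idea for a
| hypothetical review-sentiment task - illustrative, not
| Snorkel's actual API:)
|
|     ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1
|
|     def lf_mentions_refund(text):
|         # Weak heuristic: asking about refunds usually
|         # signals a negative review; abstain otherwise.
|         return NEGATIVE if "refund" in text.lower() else ABSTAIN
|
|     def lf_five_stars(text):
|         return POSITIVE if "5 stars" in text.lower() else ABSTAIN
|
| Toy versions are easy, but real labelling functions often need
| regexes, parsers or knowledge-base lookups, which is where the
| "relatively technical" point bites for non-technical annotators.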
| razcle wrote:
| Hi lwhsiao,
|
| Raza here, author of the post.
|
| My high-level answer is weak-labelling overcomes cold starts
| and active learning helps with the last mile.
|
| More detail:
|
| We see weak labelling as very complementary to active learning.
| By using labelling functions, you can quickly overcome the cold
| start problem and also better leverage external resources like
| knowledge bases.
|
| But most of the work in training ML systems comes in getting
| the last few percentage points of performance: going from good
| to good enough. This is where active learning can
| really shine because it guides you as to what data you really
| need to move model performance.
|
| At Humanloop, we've started with active learning tools but are
| also doing a lot of work on weak labelling.
| lwhsiao wrote:
| Thanks for the detailed response :)
| nailer wrote:
| Hey Luke! I'm a full stack / devops person so this is going to
| be high level, but I'm getting a couple of my ML researcher
| colleagues in for a deeper dive now.
|
| Short version:
|
| 1) we can get to the same level of accuracy with around 10% of
| the data points. Getting (and managing) a big enough data set
| to train a supervised learning model is the biggest thing
| slowing down ML deployments.
|
| 2) the model contacts a human when it can't label a data point
| with a high degree of confidence. You'll never have people with
| a bunch of specialist knowledge being asked to perform mundane
| data labelling tasks.
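|
| Roughly, point 2 looks like this (a simplified sketch assuming
| a scikit-learn-style classifier - not our actual code):
|
|     import numpy as np
|
|     THRESHOLD = 0.9  # illustrative confidence cut-off
|
|     def route(model, texts):
|         # Class probabilities, shape (n_samples, n_classes).
|         probs = model.predict_proba(texts)
|         conf = probs.max(axis=1)
|         auto = [(t, int(p.argmax()))
|                 for t, p in zip(texts, probs)
|                 if p.max() >= THRESHOLD]
|         # Low-confidence points go to a human annotator.
|         needs_human = [t for t, c in zip(texts, conf)
|                        if c < THRESHOLD]
|         return auto, needs_human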
| nailer wrote:
| Mike from Humanloop here - if you're interested in active
| learning we'll be around on this thread. Also, we're looking
| for full-stack SW engineers and ML engineers -
| https://news.ycombinator.com/item?id=25992607
| nmca wrote:
| Hey Mike - I did some work on an industry active learning
| system a few years ago. The high level finding was that
| transfer & nonparametric methods were _huge_ wins, but online
| uncertainty and "real active learning" hardly worked at all
| and were super complicated (at least in caffe1 anyway lol).
|
| Can you point to any big breakthroughs that have helped in
| recent years? Linear Hypermodels
| (https://arxiv.org/abs/2006.07464) seem promising, but that
| original experience has left me with some healthy skepticism.
| razcle wrote:
| Hi NMCA, I'm Raza one of the founders of Humanloop. I totally
| agree that transfer learning is one of the best strategies
| for data efficiency and it's pretty common to see people
| start from large pre-trained models like BERT. Active
| learning then provides an additional benefit, especially when
| labels are expensive. For example we've worked with teams
| where lawyers have to annotate.
|
| In terms of breakthroughs in recent years, some things I'd
| point to would be BALD (https://arxiv.org/abs/1112.5745) and
| its
| applications in deep learning as well. There has also been
| progress in coreset methods
| (https://openreview.net/forum?id=H1aIuk-RW).
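|
| (The core-set idea is essentially a k-center problem over the
| model's feature space; a toy sketch of the greedy k-center
| heuristic that family of methods builds on - illustrative, not
| the exact algorithm from the paper:)
|
|     import numpy as np
|
|     def k_center_greedy(feats, labelled_idx, k):
|         # Distance from every point to its nearest
|         # already-labelled point.
|         d = np.linalg.norm(
|             feats[:, None, :] - feats[labelled_idx][None, :, :],
|             axis=-1).min(axis=1)
|         picked = []
|         for _ in range(k):
|             # Greedily pick the point farthest from
|             # everything selected so far, covering the space.
|             i = int(np.argmax(d))
|             picked.append(i)
|             d = np.minimum(
|                 d, np.linalg.norm(feats - feats[i], axis=-1))
|         return picked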
|
| I think that you're right that it used to be much too hard to
| get active learning to work. Part of what we're trying to do
| is make it easy enough that it's worth the benefits.
| razcle wrote:
| I'd also point out that people always focus on just the
| labelling savings from active learning but there are other
| benefits in practice too: 1) faster feedback on model
| performance during the annotation process and 2) better
| engagement from the annotators as they can see the benefit
| of their work.
| andy99 wrote:
| I'd add that there is a deep connection between active
| learning and understanding the "domain of expertise" of a
| model, for example what inputs are ambiguous or low
| confidence, and which are out of distribution. E.g. BALD
| is a form of out-of-distribution detection - a point with
| high disagreement is not only useful to add to the
| training pool, it is a point for which the current model
| has no business making a prediction.
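|
| Concretely, the BALD score is the mutual information
| between a point's prediction and the model parameters. A
| rough sketch using MC dropout as the approximate
| posterior (illustrative; assumes a PyTorch classifier
| with dropout layers):
|
|     import torch
|
|     def bald_scores(model, x, n_samples=20):
|         # Keep dropout active at inference time so each
|         # forward pass is an approximate posterior sample.
|         model.train()
|         with torch.no_grad():
|             probs = torch.stack([
|                 torch.softmax(model(x), dim=-1)
|                 for _ in range(n_samples)
|             ])                          # (S, N, C)
|         mean = probs.mean(dim=0)        # (N, C)
|         eps = 1e-12
|         # Entropy of the mean prediction (total).
|         h_mean = -(mean * (mean + eps).log()).sum(-1)
|         # Mean per-sample entropy (aleatoric part).
|         h_each = -(probs * (probs + eps).log()).sum(-1).mean(0)
|         # High score = samples disagree = epistemic.
|         return h_mean - h_each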
| peadarohaodha wrote:
| Adding to what Raza has said - to your point on "real active
| learning" hardly working, I would be interested to hear what
| approaches you took. We've found that the quality of the
| uncertainty estimate for your model is quite important for
| active learning to work well in practice, so applying good
| approximations for the model uncertainty of modern-sized
| transformer models (like BERT) is an important consideration.
| nmca wrote:
| IIRC we were using a Bayesian classification model on top
| of fixed pretrained features from transfer, something
| along the lines of refitting a GP every time the number of
| classes changed. This was images as opposed to text, and
| after an epoch classification was ~ok, but during training
| (e.g. the active bit) we didn't see much benefit.
| UnpossibleJim wrote:
| Hi Mike, let me know if I'm getting too into specifics for
| casual conversation. You mentioned a smaller dataset for
| training and I'm curious as to how much smaller. Like, for
| static image recognition, what type of dataset trade-off are we
| looking at? Also, randomly, is there a compute-cost trade-off
| or is it just a smarter process?
|
| I read the site, and no one likes cleaning data (no one), and
| answering questions from a "toddler machine" (for lack of a
| better term) doesn't sound as bad, but I was curious what
| potential trade-offs there might be.
| talolard wrote:
| I wrote a blog post a few years ago about possible downsides
| to active learning.
|
| https://www.lighttag.io/blog/active-learning-optimization-
| is...
|
| To be fair to the Humanloop folks, the criticisms probably
| don't apply to the kinds of models they're using (modern
| transformers).
| peadarohaodha wrote:
| We've found you can get an order of magnitude improvement in
| the amount of labelled data you need - but there is some
| variance based on the difficulty of the problem. Because you
| are retraining the model in tandem with the data labelling
| process, there is additional compute associated with an active-
| learning-powered labelling process versus just selecting the
| data at random to label next. But this additional compute cost
| is almost always outweighed by the saving of human time spent
| labelling.
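|
| For a sense of where that extra compute goes, the basic loop
| is roughly the following (a toy sketch with scikit-learn - not
| our production system, and it assumes the random seed batch
| happens to contain every class):
|
|     import numpy as np
|     from sklearn.linear_model import LogisticRegression
|
|     def al_loop(X_pool, oracle, batch=10, rounds=20):
|         rng = np.random.default_rng(0)
|         seed = rng.choice(len(X_pool), batch, replace=False)
|         labelled = {int(i): oracle(X_pool[i]) for i in seed}
|         for _ in range(rounds):
|             ids = list(labelled)
|             model = LogisticRegression(max_iter=1000)
|             model.fit(X_pool[ids], [labelled[i] for i in ids])
|             rest = [i for i in range(len(X_pool))
|                     if i not in labelled]
|             probs = model.predict_proba(X_pool[rest])
|             # Margin between the top two classes: a small
|             # margin means the model is unsure of the point.
|             srt = np.sort(probs, axis=1)
|             margin = srt[:, -1] - srt[:, -2]
|             query = np.array(rest)[np.argsort(margin)[:batch]]
|             # Human labels the most ambiguous points, then
|             # the model is retrained on the grown set.
|             labelled.update(
|                 {int(i): oracle(X_pool[i]) for i in query})
|         return model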
| talolard wrote:
| I have a question on the compute aspect regarding your
| business model, hope I'm not being too nosy...
|
| I tried HL, the experience was stellar (well done!) and it
| made me think...
|
| To get AL working with a great user experience you need
| quite a bit of compute. How are you thinking about your
| margins, e.g. the cost to produce what you're offering
| versus what customers will pay for it?
| peadarohaodha wrote:
| Thanks for the feedback! It's a good question re compute.
| There are some fun engineering and ML research challenges
| that we are constantly iterating on that are related to
| this. A few examples:
|
| * How to most efficiently share compute resources in a
| JIT manner (e.g. GPU memory) during model serving for
| both training and inference (where the use case and
| privacy requirements permit)
|
| * How to construct model training algorithms that
| operate effectively in a more online manner (so you
| don't have to retrain on the whole dataset when you see
| new examples)
|
| * How to significantly reduce the model footprint (in
| terms of memory and FLOPs) of modern deep transformer
| models, given they are highly over-parameterised and can
| contain a lot of redundancy
|
| This stuff helps us a lot on the margins point!
| nicoburns wrote:
| Active learning sounds like a step closer to how humans and other
| animals learn: iteratively with continuous feedback.
| uoaei wrote:
| In the sense of a personally-curated curriculum, yes.
| anonymouse008 wrote:
| Ha! This is amazing -- we did a similar process for an EEG
| research project, and it was stellar (working memory and learning
| curves)! Until now, I didn't have the right words to articulate
| what we did - so thank you for the incantation!
| throwaway3699 wrote:
| I'm curious - what was the project about? I'm super interested
| in combining EEG and these active learning techniques.
| anonymouse008 wrote:
| Super cool. Let's connect - email in profile
| dr_dshiv wrote:
| https://www.digitaltrends.com/features/ai-identifies-
| songs-b...
|
| Ima gonna hit u up too :)
| realradicalwash wrote:
| Nice to see some active learning around here. To add a data point
| from a less successful story:
|
| In one of our research projects, we used AL to improve part-of-
| speech prediction, inspired by work by Rehbein and Ruppenhofer,
| e.g. https://www.aclweb.org/anthology/P17-1107/
|
| Our database was a corpus of scientific English from the 17th
| century to the present, and for our data and situation we found
| that choosing the right tool/model and having the right
| training data were the most important things. Once that was in
| place, active learning did
| not, unfortunately, add that much. For different tools/settings,
| we got about +/-0.2% in accuracy for checking 200k tokens and
| only correcting 400 of them.
|
| Maybe one problem was that AL was only triggered when a majority
| vote was inconclusive. Also, we used it on top of
| individualised gold-standard training data. I guess things can
| look different if you don't have a gold standard to start with,
| and if you have better computational resources: our oracles
| spent quite some time waiting, which is why we reorganised the
| original design to process batches of corrections.
|
| As so often, those null results were hard to publish :|
|
| Either way, I thought I'd share our experiences. Your work sounds
| really cool, best of luck!
| jll29 wrote:
| I would agree - active learning is a neat idea, but while it
| gets up the learning curve quicker that does not necessarily
| correspond to saving data in practice, for two reasons.
|
| First, a lot of the AL papers use _simulation_ scenarios rather
| than production scenarios, i.e. there is already more training
| data available, it just gets withheld. Obviously, if you
| already have that data, you have already spent the effort
| annotating it, so there can't have been any saving.
|
| Second, you always want to annotate more data than you have as
| long as the learning curve isn't flat, so it's not about how
| quickly you get up the curve, but about whether you should keep
| annotating or whether a flattening learning curve suggests you
| have reached the area of diminishing returns.
|
| There are many sampling strategies like balancing exploration &
| exploitation, expected model change, expected error reduction,
| exponentiated gradient exploration, uncertainty sampling, query
| by committee, querying from diverse subspaces/partitions,
| variance reduction, conformal predictors or mismatch-first
| farthest-traversal, and there isn't a theory to pick the best
| one given what you know (I've mostly heard people play with
| uncertainty sampling or query by committee in academia, but
| nobody in industry I know has told me they use AL).
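|
| To make one of these concrete, query by committee with vote
| entropy looks roughly like this (a toy sketch):
|
|     import numpy as np
|
|     def vote_entropy(committee, X):
|         # Each committee member votes for a class on every
|         # unlabelled point; shape (n_members, n_points).
|         votes = np.stack([m.predict(X) for m in committee])
|         n_members = votes.shape[0]
|         scores = []
|         for col in votes.T:          # votes on one point
|             _, counts = np.unique(col, return_counts=True)
|             p = counts / n_members   # vote distribution
|             scores.append(-(p * np.log(p)).sum())
|         # High entropy = most disagreement = query first.
|         return np.array(scores)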
| natch wrote:
| It's hard to find articles like this that give a glimpse into
| what is used by larger shops doing ML. I take this one with a
| grain of salt due to the source being a vendor, but still it is
| generous with the amount of detail, and it even mentions
| some alternative solutions for cases that might fit them, so
| that is really appreciated.
|
| The pros working in big shops who write these tend to overlook
| the tiny use cases such as apps that recognize a cat coming
| through a cat door (as opposed to a raccoon), which can get by
| with minuscule training data.
|
| There's a lot of discussion of "big data" but small data is
| amazingly powerful too. I wish there were more bridging of these
| two worlds -- to have tools that deal with the needs of small
| data, without the assumption that training a model takes days or
| months, and on the other side, to have the big data world share
| more insights about how they manage their data for the big cases.
| There is a ton of info out there but what I find lacking is info
| about how labeling and tagging is managed on a large scale (I'm
| interested in both, big and small, as well as medium). Maybe I'm
| just missing something. This article gave some good clues --
| thanks!
| fttx_ wrote:
| I agree. I've actually been working on something along these
| lines[0], albeit with a focus on marketing analytics. In my
| experience everyone in marketing cares about dashboards and
| metrics _a lot_, but outside of larger shops almost no one is
| doing any real analysis, and even simple tools like linear
| regression could have a big impact.
|
| Early days but we're looking to onboard a few more customers to
| help guide our roadmap.
|
| [0] https://ripbase.com
| nailer wrote:
| I feel that too - I joined HL just a couple of weeks ago (from
| Unix/webdev tech focus) and it's been a lot of learning so far.
| I'm going to do a little research (and write a blog post) into
| the specific case of 'too subtle for a regex' aimed at general
| webdev folk who have a problem to solve rather than people that
| already want to use ML.
| rocauc wrote:
| Nice read.
|
| Can you shed some light on what you think are the most valuable
| methods for identifying high entropy examples for the model to
| learn faster? I'm familiar with Pool-Based Sampling, Stream-Based
| Selective Sampling, Membership Query Synthesis[1], but less
| certain which techniques are most useful in NLP.
|
| [1] https://blog.roboflow.com/what-is-active-learning/
| razcle wrote:
| So entropy-based active learning methods are an example of
| pool-based sampling. Even within pool-based sampling there are
| a few different techniques.
|
| Entropy selection for pool-based methods looks at the model's
| output probability for each prediction over the unlabelled
| dataset, calculates the entropy of those distributions (in
| classification this is a bit like looking for the most uniform
| predictive distributions), and prioritises the highest-entropy
| points.
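|
| In code it's just a few lines (a sketch, assuming a model
| with a scikit-learn-style predict_proba):
|
|     import numpy as np
|
|     def entropy_select(model, X_unlabelled, k=100):
|         # Predictive distribution per point, shape (N, C).
|         probs = model.predict_proba(X_unlabelled)
|         # Most uniform distributions have highest entropy.
|         ent = -(probs * np.log(probs + 1e-12)).sum(axis=1)
|         # Indices of the k most uncertain points to label.
|         return np.argsort(ent)[::-1][:k]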
|
| Entropy-based active learning works OK but doesn't distinguish
| uncertainty that comes from a lack of knowledge (epistemic
| uncertainty) from noise. Techniques like Bayesian Active
| Learning by Disagreement (BALD) can do better. :)
| rocauc wrote:
| Makes a lot of sense, thanks. I'll need to dig deeper into
| Bayesian active learning techniques.
| porphyra wrote:
| A more detailed and technical writeup on the benefits of active
| learning: You should try active learning -
| https://medium.com/aquarium-learning/you-should-try-active-l...
|
| Also, Aquarium Learning is just awesome. Super slick.
| andrewmutz wrote:
| Spam filter is an interesting choice of motivating example, since
| usually it is your users labeling the data, rather than something
| that happens during the R&D process. You _could_ try to use
| active learning but I'm not sure the users would like that
| product experience.
| razcle wrote:
| Great point Andrew. I was shooting for an easily digestible
| example rather than a realistic one.
|
| Some examples that we've actually worked on/are working on:
|
| * Contract classification
|
| * Content moderation
|
| * NER
|
| * Customer review understanding
|
| * Support ticket routing
___________________________________________________________________
(page generated 2021-02-04 23:01 UTC)