[HN Gopher] A better way to build ML: Why you should be using Active Learning
       ___________________________________________________________________
        
       A better way to build ML: Why you should be using Active Learning
        
       Author : razcle
       Score  : 95 points
       Date   : 2021-02-04 16:38 UTC (6 hours ago)
        
 (HTM) web link (humanloop.com)
 (TXT) w3m dump (humanloop.com)
        
       | andy99 wrote:
       | I have a suggestion about the first plot you show in the writeup.
       | From what I can see, it is based on a finite pool of data and so
       | it undersells active learning: performance shoots up as AL finds
       | the interesting points, but then the curve flattens and is less
       | steep than the random curve as the "boring" points get added. It
       | would be nice to see the same curve for a bigger training pool
       | where AL was able to get to a target accuracy without running out
       | of valuable training points. I suspect that would make the
       | difference between the two curves much more stark. As it is, it
       | just looks like AL does better for very low data but to get to
       | high accuracy you need to use the whole dataset anyway so it's a
       | wash between AL and random.
        
         | razcle wrote:
          | Yeah, I think that's a good point. I'm actually planning a
          | follow-up post that is a case study with some real-world data,
          | and the plots in that are much more like what you describe.
          | 
          | I may update the post. Thanks!
        
       | dexter89_kp3 wrote:
        | What are your thoughts on synthetic data vs active learning?
       | 
       | For some domains, with privacy concerns or rarity of objects,
       | getting labelled data for deep learning is challenging.
       | 
        | There is decent research on sim2real, i.e. transferring models
        | trained on synthetic data to real-world applications:
        | https://arxiv.org/pdf/1703.06907.pdf
        
         | razcle wrote:
          | TL;DR: I think it's complementary.
         | 
         | Synthetic data is particularly valuable when even the
         | unlabelled data is expensive to obtain. For example if you want
         | to train a driverless car, you may never see an ambulance
         | driving at night in the rain even if you drive for thousands of
         | miles. In that case, being able to synthesise data makes a lot
         | of sense and we have lots of tools for computer graphics that
         | make this easy.
         | 
         | Synthetic data can also be useful to share data when there are
         | privacy concerns but my own feeling here is that there are
         | better approaches to privacy preservation, like federated
         | learning and learning via randomised response
         | (https://arxiv.org/pdf/2001.04942.pdf).
         | 
          | In general though, outside of some vision applications, I'm
          | pretty sceptical of synthetic data. For synthetic data to work
          | well, you need a really good class-conditional generator, e.g.
          | "generate a tweet that is a negative-sentiment review". But if
          | you have a sufficiently good model to do this, then you can
          | probably use that model to solve your classification task
          | anyway.
         | 
         | For most settings, I think synthetic data will work for data
         | augmentation as a regulariser but will not be a substitute for
         | all labelled data.
         | 
         | For the labelled data, active learning should still help.
        
       | lwhsiao wrote:
       | Hi Mike,
       | 
       | Can you talk about the tradeoffs or relationship between active
       | learning and weak supervision from your point of view?
        
         | peadarohaodha wrote:
          | Adding to what Raza said - a consideration for both active
          | learning and weak supervision is the need to construct a gold-
          | standard labelled dataset for model validation and testing
          | purposes. At Humanloop, in addition to collecting training
          | data, we are also using a form of active learning to speed up
          | the collection of an unbiased test data set.
          | 
          | Another consideration on the weak supervision side (for the
          | Snorkel-style approach of labelling functions) is that creating
          | labelling functions can be a relatively technical task, which
          | may not be well suited to the non-technical domain experts who
          | provide feedback to the model.
        
         | razcle wrote:
         | Hi lwhsiao,
         | 
         | Raza here, author of the post.
         | 
          | My high-level answer: weak labelling overcomes the cold start,
          | and active learning helps with the last mile.
         | 
         | More detail:
         | 
          | We see weak supervision as very complementary to active learning.
         | By using labelling functions, you can quickly overcome the cold
         | start problem and also better leverage external resources like
         | knowledge bases.
         | 
          | But much of the work in training ML systems comes in getting
          | the last few percentage points of performance: going from good
          | to good enough. This is where active learning really shines,
          | because it guides you to the data you actually need to move
          | model performance.
         | 
         | At Humanloop, we've started with active learning tools but are
         | also doing a lot of work on weak labelling.
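          | 
          | To make the labelling-function idea concrete, here is a
          | minimal sketch of Snorkel-style weak labelling via majority
          | vote (the functions are hypothetical examples for a spam
          | task, not our API):
          | 
          |     import numpy as np
          | 
          |     ABSTAIN = -1
          | 
          |     # Hypothetical labelling functions for a spam task.
          |     def lf_contains_link(text):
          |         return 1 if "http" in text else ABSTAIN
          | 
          |     def lf_mentions_invoice(text):
          |         return 1 if "invoice" in text.lower() else ABSTAIN
          | 
          |     def weak_labels(texts, lfs, n_classes=2):
          |         # Majority vote over labelling functions;
          |         # ABSTAIN where no function fires.
          |         votes = np.array([[lf(t) for lf in lfs]
          |                           for t in texts])
          |         out = []
          |         for row in votes:
          |             valid = row[row != ABSTAIN]
          |             counts = np.bincount(valid, minlength=n_classes)
          |             out.append(int(counts.argmax())
          |                        if counts.sum() else ABSTAIN)
          |         return np.array(out)
          | 
          | The weakly labelled points bootstrap a first model; active
          | learning then decides which points deserve a human label.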
        
           | lwhsiao wrote:
           | Thanks for the detailed response :)
        
         | nailer wrote:
          | Hey Luke! I'm a full stack / devops person, so this is going
          | to be high level, but I'm getting a couple of my ML researcher
          | colleagues in for a deeper dive now.
         | 
         | Short version:
         | 
          | 1) We can get to the same level of accuracy with around 10% of
          | the data points. Getting (and managing) a big enough dataset
          | to train a supervised learning model is the biggest thing
          | slowing down ML deployments.
          | 
          | 2) The model contacts a human when it can't label a data point
          | with a high degree of confidence. You'll never have people
          | with a bunch of specialist knowledge being asked to perform
          | mundane data labelling tasks.
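          | 
          | Roughly, the routing logic looks like this (a minimal sketch
          | of the idea, not our actual API):
          | 
          |     import numpy as np
          | 
          |     def route(probs, threshold=0.9):
          |         # probs: (n_samples, n_classes) model probabilities.
          |         # Auto-accept confident predictions; escalate the
          |         # rest to a human annotator.
          |         confidence = probs.max(axis=1)
          |         auto = np.where(confidence >= threshold)[0]
          |         to_human = np.where(confidence < threshold)[0]
          |         return auto, to_human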
        
       | nailer wrote:
       | Mike from Humanloop here - if you're interested in active
       | learning we'll be around on this thread, also we're looking for
       | fullstack SW engineers and ML engineers -
       | https://news.ycombinator.com/item?id=25992607
        
         | nmca wrote:
         | Hey Mike - I did some work on an industry active learning
         | system a few years ago. The high level finding was that
         | transfer & nonparametric methods were _huge_ wins, but online
          | uncertainty and "real active learning" hardly worked at all
         | and were super complicated (at least in caffe1 anyway lol).
         | 
         | Can you point to any big breakthroughs that have helped in
         | recent years? Linear Hypermodels
         | (https://arxiv.org/abs/2006.07464) seem promising, but that
         | original experience has left me with some healthy skepticism.
        
           | razcle wrote:
           | Hi NMCA, I'm Raza one of the founders of Humanloop. I totally
           | agree that transfer learning is one of the best strategies
           | for data efficiency and it's pretty common to see people
           | start from large pre-trained models like BERT. Active
           | learning then provides an additional benefit, especially when
           | labels are expensive. For example we've worked with teams
           | where lawyers have to annotate.
           | 
            | In terms of breakthroughs in recent years, some things I'd
            | point to would be BALD (https://arxiv.org/abs/1112.5745) and
            | its applications in deep learning. There has also been
           | progress in coreset methods
           | (https://openreview.net/forum?id=H1aIuk-RW).
           | 
            | I think you're right that it used to be much too hard to
            | get active learning to work. Part of what we're trying to do
            | is make it easy enough that the benefits are worth it.
        
             | razcle wrote:
              | I'd also point out that people always focus on just the
              | labelling savings from active learning, but there are other
              | benefits in practice too: 1) faster feedback on model
              | performance during the annotation process, and 2) better
              | engagement from the annotators, as they can see the benefit
              | of their work.
        
               | andy99 wrote:
               | I'd add that there is a deep connection between active
               | learning and understanding the "domain of expertise" of a
               | model, for example what inputs are ambiguous or low
                | confidence, and which are out of distribution. E.g. BALD
                | is a form of out-of-distribution detection - a point with
                | high disagreement is not only useful to add to the
                | training pool, it is a point for which the current model
                | has no business making a prediction.
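                | 
                | For reference, BALD scores a point by the mutual
                | information between its prediction and the model
                | parameters. A rough sketch using MC dropout (my
                | own illustration, assuming T stochastic forward
                | passes):
                | 
                |     import numpy as np
                | 
                |     def bald(mc_probs, eps=1e-12):
                |         # mc_probs: (T, n, c) probabilities from
                |         # T dropout forward passes.
                |         mean = mc_probs.mean(axis=0)
                |         h_mean = -(mean
                |                    * np.log(mean + eps)).sum(-1)
                |         mean_h = -(mc_probs
                |                    * np.log(mc_probs + eps)
                |                    ).sum(-1).mean(0)
                |         # High score = the ensemble disagrees,
                |         # i.e. epistemic uncertainty / OOD-ish.
                |         return h_mean - mean_h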
        
           | peadarohaodha wrote:
            | Adding to what Raza has said - on your point about "real
            | active learning" hardly working, I would be interested to
            | hear what approaches you took. We've found that the quality
            | of your model's uncertainty estimate is quite important for
            | active learning to work well in practice. So applying good
            | approximations of the model uncertainty to modern-sized
            | transformer models (like BERT) is an important consideration.
        
             | nmca wrote:
              | IIRC we were using a Bayesian classification model on top
              | of fixed pretrained features from transfer, something
              | along the lines of refitting a GP every time the number of
              | classes changed. This was images as opposed to text, and
              | after an epoch classification was ~OK, but during training
              | (i.e. the active bit) we didn't see much benefit.
        
         | UnpossibleJim wrote:
          | Hi Mike, let me know if I'm getting too into specifics for
          | casual conversation. You mentioned a smaller dataset for
          | training and I'm curious as to how much smaller. Like, for
          | static image recognition, what kind of dataset trade-off are
          | we looking at? Also, randomly, is there a compute cost trade-
          | off or is it just a smarter process?
         | 
         | I read the site, and no one likes cleaning data (no one), and
         | answering questions from a "toddler machine" (for lack of a
         | better term) doesn't sound as bad, but I was curious what
         | potential trade offs there might be.
        
           | talolard wrote:
           | I wrote a blog post a few years ago about possible downsides
           | to active learning.
           | 
           | https://www.lighttag.io/blog/active-learning-optimization-
           | is...
           | 
            | To be fair to the Humanloop folks, the criticisms probably
            | don't apply to the kinds of models they're using (modern
            | transformers).
        
           | peadarohaodha wrote:
            | We've found you can get an order of magnitude improvement in
            | the amount of labelled data you need - but there is some
            | variance based on the difficulty of the problem. Because you
            | are retraining the model in tandem with the data labelling
            | process, there is additional compute associated with an
            | active-learning-powered labelling process versus just
            | selecting the data to label next at random. But this
            | additional compute is almost always outweighed by the saving
            | in human time spent labelling.
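            | 
            | In pseudocode, the tandem loop looks roughly like this (a
            | scikit-learn sketch for illustration, not our production
            | stack):
            | 
            |     import numpy as np
            |     from sklearn.linear_model import LogisticRegression
            | 
            |     def al_loop(X_pool, oracle, X, y, rounds=10, k=20):
            |         for _ in range(rounds):
            |             model = LogisticRegression(max_iter=1000)
            |             model.fit(X, y)
            |             # Least-confident sampling on the pool.
            |             conf = model.predict_proba(X_pool).max(axis=1)
            |             picks = np.argsort(conf)[:k]
            |             y_new = [oracle(x) for x in X_pool[picks]]
            |             X = np.vstack([X, X_pool[picks]])
            |             y = np.concatenate([y, y_new])
            |             X_pool = np.delete(X_pool, picks, axis=0)
            |         return model.fit(X, y)
            | 
            | Each retrain costs compute, but it is what steers the next
            | batch of labelling effort.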
        
             | talolard wrote:
              | I have a question on the compute aspect regarding your
              | business model, hope I'm not being too nosy...
              | 
              | I tried HL, the experience was stellar (well done!) and it
              | made me think...
             | 
             | To get AL working with a great user experience you need
             | quite a bit of compute. How are you thinking about your
             | margins, e.g the cost to produce what you're offering
             | versus what customers will pay for it?
        
               | peadarohaodha wrote:
                | Thanks for the feedback! It's a good question re
                | compute. There are some fun engineering and ML research
                | challenges related to this that we are constantly
                | iterating on. A few examples:
                | 
                | - How to most efficiently share compute resources in a
                | JIT manner (e.g. GPU memory) during model serving, for
                | both training and inference (where the use case and
                | privacy requirements permit)
                | 
                | - How to construct model training algorithms that
                | operate effectively in a more online manner (so you
                | don't have to retrain on the whole dataset when you see
                | new examples)
                | 
                | - How to significantly reduce the model footprint (in
                | terms of memory and FLOPs) of modern deep transformer
                | models, given they are highly over-parameterised and can
                | contain a lot of redundancy
                | 
                | This stuff helps us a lot on the margins point!
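                | 
                | On the online-training point, think of something in
                | the spirit of scikit-learn's partial_fit (a stand-in
                | for illustration, not what we actually run):
                | 
                |     from sklearn.linear_model import SGDClassifier
                | 
                |     model = SGDClassifier(loss="log_loss")
                | 
                |     def on_new_labels(X_batch, y_batch):
                |         # Update on the new batch only, instead of
                |         # retraining on the whole dataset.
                |         model.partial_fit(X_batch, y_batch,
                |                           classes=[0, 1])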
        
       | nicoburns wrote:
       | Active learning sounds like a step closer to how humans and other
       | animals learn: iteratively with continuous feedback.
        
         | uoaei wrote:
         | In the sense of a personally-curated curriculum, yes.
        
       | anonymouse008 wrote:
       | Ha! This is amazing -- we did a similar process for an EEG
       | research project, and it was stellar (working memory and learning
       | curves)! Until now, I didn't have the right words to articulate
       | what we did - so thank you for the incantation!
        
         | throwaway3699 wrote:
         | I'm curious - what was the project about? I'm super interested
         | in combining EEG and these active learning techniques.
        
           | anonymouse008 wrote:
           | Super cool. Let's connect - email in profile
        
             | dr_dshiv wrote:
             | https://www.digitaltrends.com/features/ai-identifies-
             | songs-b...
             | 
             | Ima gonna hit u up too :)
        
       | realradicalwash wrote:
       | Nice to see some active learning around here. To add a data point
       | from a less successful story:
       | 
       | In one of our research projects, we used AL to improve part-of-
       | speech prediction, inspired by work by Rehbein and Ruppenhofer,
       | e.g. https://www.aclweb.org/anthology/P17-1107/
       | 
        | Our database was a corpus of scientific English from the 17th
        | century to now, and for our data and situation, we found that
        | choosing the right tool/model and having the right training data
        | were the most important things. Once that was in place, active
        | learning did not, unfortunately, add that much. For different
        | tools/settings, we got about +/-0.2% in accuracy for checking
        | 200k tokens and only correcting 400 of them.
       | 
        | Maybe one problem was that AL was only triggered when a majority
        | vote was inconclusive. Also, we used it on top of individualised
        | gold-standard training data. I guess things can look different
        | if you don't have a gold standard to start with. And if you have
        | better computational resources: our oracles spent quite some
        | time waiting, which is why we reorganised the original design to
        | process batches of corrections.
       | 
        | As so often happens, those null results were hard to publish :|
       | 
       | Either way, I thought I'd share our experiences. Your work sounds
       | really cool, best of luck!
        
         | jll29 wrote:
          | I would agree - active learning is a neat idea, but while it
          | gets up the learning curve quicker, that does not necessarily
          | correspond to saving data in practice, for two reasons.
         | 
          | First, a lot of the AL papers use _simulation_ scenarios
          | rather than production scenarios, i.e. there is already more
          | training data available, it just gets withheld. Obviously, if
          | you already have more data, you have already spent on
          | annotating it, too, so there can't have been any saving.
         | 
          | Second, you always want to annotate more data than you have as
          | long as the learning curve isn't flat, so it's not about how
          | quickly you get up the curve; it's about whether you should
          | keep annotating or whether a flattening learning curve
          | suggests you have reached the region of diminishing returns.
         | 
          | There are many sampling strategies - balancing exploration &
          | exploitation, expected model change, expected error reduction,
          | exponentiated gradient exploration, uncertainty sampling,
          | query by committee, querying from diverse
          | subspaces/partitions, variance reduction, conformal
          | predictors, and mismatch-first farthest-traversal - and there
          | isn't a theory for picking the best one given what you know
          | (I've mostly heard of people playing with uncertainty sampling
          | or query by committee in academia, but nobody in industry I
          | know has told me they use AL).
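          | 
          | To make one of those concrete, here is a toy sketch of query
          | by committee via vote entropy (my own illustration):
          | 
          |     import numpy as np
          | 
          |     def vote_entropy(preds, n_classes):
          |         # preds: (n_models, n_samples) hard labels from
          |         # the committee.
          |         scores = np.zeros(preds.shape[1])
          |         for c in range(n_classes):
          |             frac = (preds == c).mean(axis=0)
          |             scores -= frac * np.log(np.clip(frac, 1e-12,
          |                                             None))
          |         # High entropy = the committee disagrees = worth
          |         # sending to an annotator.
          |         return scores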
        
       | natch wrote:
        | It's hard to find articles like this that give a glimpse into
        | what larger shops doing ML actually use. I take this one with a
        | grain of salt, since the source is a vendor, but it is still
        | generous with detail and even mentions some alternative
        | solutions for the cases they might fit, which is really
        | appreciated.
       | 
       | The pros working in big shops who write these tend to overlook
       | the tiny use cases such as apps that recognize a cat coming
       | through a cat door (as opposed to a raccoon) which can get by
       | with minuscule training.
       | 
       | There's a lot of discussion of "big data" but small data is
        | amazingly powerful too. I wish there were more bridging of these
       | two worlds -- to have tools that deal with the needs of small
       | data, without the assumption that training a model takes days or
       | months, and on the other side, to have the big data world share
       | more insights about how they manage their data for the big cases.
       | There is a ton of info out there but what I find lacking is info
       | about how labeling and tagging is managed on a large scale (I'm
       | interested in both, big and small, as well as medium). Maybe I'm
       | just missing something. This article gave some good clues --
       | thanks!
        
         | fttx_ wrote:
         | I agree. I've actually been working on something along these
         | lines[0], albeit with a focus on marketing analytics. In my
          | experience, everyone in marketing cares about dashboards and
          | metrics _a lot_, but outside of larger shops almost no one is
          | doing any real analysis, and even simple tools like linear
          | regression could have a big impact.
         | 
         | Early days but we're looking to onboard a few more customers to
         | help guide our roadmap.
         | 
         | [0] https://ripbase.com
        
         | nailer wrote:
         | I feel that too - I joined HL just a couple of weeks ago (from
          | a Unix/webdev tech focus) and it's been a lot of learning so far.
         | I'm going to do a little research (and write a blog post) into
         | the specific case of 'too subtle for a regex' aimed at general
         | webdev folk who have a problem to solve rather than people that
         | already want to use ML.
        
       | rocauc wrote:
       | Nice read.
       | 
       | Can you shed some light on what you think are the most valuable
       | methods for identifying high entropy examples for the model to
        | learn faster? I'm familiar with Pool-Based Sampling, Stream-
        | Based Selective Sampling, and Membership Query Synthesis[1], but
        | am less certain which techniques are most useful in NLP.
       | 
       | [1] https://blog.roboflow.com/what-is-active-learning/
        
         | razcle wrote:
          | So entropy-based active learning methods are an example of
          | pool-based sampling. Even within pool-based sampling there are
          | a few different techniques.
          | 
          | Entropy selection for pool-based methods looks at the output
          | probability for each prediction of the model on the unlabelled
          | dataset. It then calculates the entropy of each predictive
          | distribution (in classification this is a bit like looking for
          | the most uniform predictive distributions) and prioritises the
          | highest-entropy points.
          | 
          | Entropy-based active learning works OK, but it doesn't
          | distinguish uncertainty that comes from a lack of knowledge
          | (epistemic uncertainty) from noise. Techniques like Bayesian
          | Active Learning by Disagreement (BALD) can do better. :)
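          | 
          | In code, entropy selection is just a few lines (a sketch,
          | assuming you already have the pool probabilities):
          | 
          |     import numpy as np
          | 
          |     def entropy_pick(probs, k):
          |         # probs: (n_samples, n_classes) on the unlabelled
          |         # pool; returns the k most-uniform predictions.
          |         h = -(probs * np.log(probs + 1e-12)).sum(axis=1)
          |         return np.argsort(-h)[:k]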
        
           | rocauc wrote:
           | Makes a lot of sense, thanks. I'll need to dig deeper into
           | Bayesian active learning techniques.
        
       | porphyra wrote:
       | A more detailed and technical writeup on the benefits of active
       | learning: You should try active learning -
       | https://medium.com/aquarium-learning/you-should-try-active-l...
       | 
       | Also, Aquarium Learning is just awesome. Super slick.
        
       | andrewmutz wrote:
       | Spam filter is an interesting choice of motivating example, since
       | usually it is your users labeling the data, rather than something
        | that happens during the R&D process. You _could_ try to use
        | active learning, but I'm not sure the users would like that
        | product experience.
        
         | razcle wrote:
          | Great point, Andrew. I was shooting for an easily digestible
         | example rather than a realistic one.
         | 
          | Some examples that we've actually worked on/are working on:
          | 
          | * Contract classification
          | 
          | * Content moderation
          | 
          | * NER
          | 
          | * Customer review understanding
          | 
          | * Support ticket routing
        
       ___________________________________________________________________
       (page generated 2021-02-04 23:01 UTC)