[HN Gopher] Launch HN: Dioptra (YC W22) - Improve ML models by i...
       ___________________________________________________________________
        
       Launch HN: Dioptra (YC W22) - Improve ML models by improving their
       training data
        
        Hi HN! We're Pierre, Jacques, and Farah from Dioptra
        (https://dioptra.ai). Dioptra tracks ML metrics to identify model
        error patterns and suggest the best data curation strategy to fix
        them.

        We've seen a paradigm shift in ML in recent years: the "code" has
        become a commodity, since many powerful ML models are open source
        today. The real challenge is to grow and curate quality data. This
        raises the need for new data-centric tools: IDEs, debuggers,
        monitoring. Dioptra is a data-centric tool that helps debug models
        and fix them by systematically curating and growing the best data,
        at scale.
        We experienced this problem first-hand while deploying and
        retraining models. Once a model was in production, maintenance was
        a huge pain. First, it was hard to assess model performance.
        Accessing the right production data for diagnosis was complicated.
        We had to build custom scripts to connect to DBs, download
        production data (Compliance, look the other way!) and analyze it.

        Second, it was hard to translate the diagnosis into concrete next
        steps: finding the best data to fix and retrain the model. It
        required another set of scripts to sample new data, label it, and
        retrain. With a large enough labeling budget we were able to
        improve our models, but it wasn't optimal: labeling is expensive,
        and random data sampling doesn't yield the best results. And since
        the process relied on our individual domain expertise (aka gut
        feelings), it was inconsistent from one data scientist to the next
        and didn't scale.
        We talked to a couple hundred ML practitioners who helped us
        validate and refine our thinking (we thank every single one of
        them!). For example, one NLP team had to read more than 10 long
        legal contracts per week per person, with the goal of tracking any
        model errors. Once a month, they synthesized an Excel sheet to
        detect error patterns. Once a pattern was detected, they had to
        read even more contracts to build their retraining dataset! There
        were multiple issues with that process. First, the assessment of
        errors was subjective, since it depended on individual
        interpretations of the legal language. Second, the sourcing of
        retraining data was time-consuming and anecdotal. Finally, they
        had to spend a lot of time coaching new members to minimize
        subjectivity.

        Processes like this highlight how model improvement needs to be
        less anecdotal and more systematic. A related problem is the lack
        of tooling, which puts a huge strain on ML teams that are
        constantly asked to innovate and take on new projects.
        Dioptra computes a comprehensive set of metrics to give ML teams a
        full view of their model and detect failure modes. Teams can
        objectively prioritize their efforts based on the impact of each
        error pattern. They can also slice and dice to root-cause errors,
        zero in on faulty data, and visualize it. What used to take days
        of reading can now be done in a couple of hours. Teams can then
        quality-check and curate the best data for retraining using our
        embedding similarity search or active learning techniques. They
        can easily understand, customize, and systematically engineer
        their data curation strategy with our automation APIs in order to
        get the best model at each iteration and stay on top of the latest
        production patterns. Additionally, Dioptra fits within any ML
        stack; we have native integrations with major deep learning
        frameworks.

        Some of our customers reduced their data ops costs by 30%. Others
        improved their model accuracy by 20% in one retraining cycle
        thanks to Dioptra.
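        To make the curation idea concrete, here is a minimal sketch of
        embedding-similarity search for retraining data (illustrative
        Python with made-up names, not Dioptra's actual API): given
        embeddings of samples the model got wrong and embeddings of an
        unlabeled pool, pick the pool samples closest to the error cluster
        and send only those to labeling.

          import numpy as np

          def curate_by_similarity(error_embs, pool_embs, budget=100):
              """Indices of unlabeled samples most similar to known errors."""
              # L2-normalize so the dot product is cosine similarity
              err = error_embs / np.linalg.norm(error_embs, axis=1, keepdims=True)
              pool = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
              # similarity of each pool sample to its nearest error example
              sims = (pool @ err.T).max(axis=1)
              # label only the `budget` most similar samples, not a random draw
              return np.argsort(-sims)[:budget]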
        Active Learning, which has been around for a while but stayed
        fairly obscure until recently, makes intentional retraining
        possible. This approach has been validated by ML organizations
        like Tesla, Cruise, and Waymo, and recently other companies like
        Pinterest started building similar infrastructure. However, it is
        costly to build and requires specialized skills. We want to make
        it accessible to everybody.
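        The core of a basic active-learning loop is "label what the model
        is least sure about." A minimal uncertainty-sampling sketch in
        that spirit (illustrative only; real pipelines layer diversity,
        similarity search, and business rules on top):

          import numpy as np

          def uncertainty_sample(pred_probs, budget=100):
              """Pick the unlabeled samples the model is least confident on."""
              # entropy of each predicted class distribution; higher = less sure
              entropy = -np.sum(pred_probs * np.log(pred_probs + 1e-12), axis=1)
              # send the most uncertain samples to annotation, then retrain
              return np.argsort(-entropy)[:budget]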
        We created an interactive demo for HN:
        https://capture.navattic.com/cl4hciffr2881909mv2qrlsc9g

        Please share any feedback and thoughts. Thanks for reading!
        
       Author : farahg
       Score  : 37 points
       Date   : 2022-06-21 16:00 UTC (7 hours ago)
        
       | kkouddous wrote:
       | We've been trying to implement an active-learning retraining loop
       | for our critical NLP models for Koko but have never found the
        | time to prioritize the work, as it was a multi-sprint level of
        | effort. We've been working with them for a few weeks and we are
        | seeing meaningful performance improvements with our
       | models. I highly recommend trying them out.
        
         | nshm wrote:
          | For many domains, active learning is actually not that
          | efficient. The promise is that you label only a subset of the
          | data and train a model of the same accuracy on it. The reality
          | is that to estimate the long tail properly you need all the
          | data points in the training set, not just a subset.
          | 
          | Consider the simple language-model case. To learn some
          | specific phrases you need to see them in training, and the
          | phrases of interest are rare (usually 1-2 occurrences per
          | terabyte of data). You simply cannot select just half.
          | 
          | Semi-supervised and self-supervised learning are more
          | reasonable and widely used. You still use all the data for
          | training; you just don't annotate it manually.
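          | 
          | (To put rough, made-up numbers on that point: if a phrase of
          | interest appears only twice in the corpus and you keep a
          | random 50% subset, there is a 25% chance it disappears from
          | training entirely.)
          | 
          |     # illustrative arithmetic, not from the comment above
          |     k, p = 2, 0.5                  # occurrences, keep-probability
          |     prob_all_kept = p ** k         # 0.25: both occurrences survive
          |     prob_all_lost = (1 - p) ** k   # 0.25: the phrase vanishes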
        
           | parnoux wrote:
            | You are right. Being able to learn good feature
            | representations through SSL is very powerful. We leverage
            | such representations to perform tasks like semantic search
            | in order to tackle problems like long-tail sampling. We have
            | seen pretty good results mining for edge cases. Let me know
            | if you'd like to chat about it.
        
       | wanderingmind wrote:
        | This is an interesting problem to solve. For the sake of better
        | understanding, can OP or someone else here suggest research
        | papers or code that describe similar approaches to detecting and
        | removing outlier data by analyzing the embedding space?
        
       | fxtentacle wrote:
       | I feel like your starting assumption already diverges from my
       | world.
       | 
       | > "code" has become a commodity: many powerful ML models are open
       | source today. The real challenge is to grow and curate quality
       | data.
       | 
        | The main recent improvements in translation and speech
        | recognition were all new mathematical methods that let us use
        | uncurated random data and still get good results. CTC loss
        | allows using unaligned text as ground truth. wav2vec allows
        | pre-training on raw audio without any transcripts. OPUS is
        | basically a processing pipeline for generating bilingual
        | corpora. Word embeddings allow pre-training on random
        | monolingual text. We've also seen a lot of one-shot and
        | zero-shot methods. Plus, XLS-R was all about transfer learning
        | to reuse knowledge from cheap, abundant data in
        | resource-constrained environments.
       | 
       | My prediction for the future of ML would be that we'll soon need
       | so little training data that a single person can label it on a
       | weekend.
       | 
       | On the other hand, I know first-hand that almost nobody can use
       | the "open source" ML model and deploy it effectively in
       | production. Doing so requires sparsity, quantization,
       | conversions, and in many cases l33t C++ skillz to implement
       | optimized fused SSE ops so that a model will run with decent
       | speed on cheap CPU hardware rather than mandating expensive GPU
       | servers.
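        | 
        | For reference, the "easy" end of that deployment work looks
        | something like post-training dynamic quantization (PyTorch used
        | here purely as an example; the hard part is everything this
        | snippet doesn't cover):
        | 
        |     import torch
        | 
        |     # toy model standing in for a real network
        |     model = torch.nn.Sequential(
        |         torch.nn.Linear(512, 512), torch.nn.ReLU(),
        |         torch.nn.Linear(512, 10))
        | 
        |     # convert Linear weights to int8 for faster CPU inference;
        |     # accuracy should be re-checked after quantization
        |     quantized = torch.quantization.quantize_dynamic(
        |         model, {torch.nn.Linear}, dtype=torch.qint8)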
        
         | parnoux wrote:
          | I don't think our assumptions are so far apart. The methods
          | you mentioned made it from research to the open source
          | community fairly quickly. In fact, most companies rely on this
          | kind of open research to develop their models. In a lot of use
          | cases, it has become more about finding the right data than
          | about improving the model code. (I like Andrew Ng's thoughts
          | on this: https://datacentricai.org/) At the same time, there
          | are still a lot of unsolved engineering challenges in the code
          | when it comes to productionizing models, especially for
          | real-time speech transcription.
         | 
         | And we agree with your prediction. That's why we started
         | Dioptra: to come up with a systematic way to curate high
         | quality data so you can annotate just the data that matters.
        
       | krapht wrote:
       | This looks like a dashboard version of my own Python scripts that
       | I use to evaluate training data.
       | 
       | ** reads more into the blurb**
       | 
       | Use our APIs to improve your models! Blech, not interested
       | anymore. I can't hand out access to a 3rd party server for our
       | data for legal reasons. Your startup can't afford to jump through
       | the hoops to sell to my employer either, not that I would even
       | recommend the app right now without trialing it myself.
       | 
       | I need something I can afford to purchase individually through my
       | manager's discretionary budget (or myself) and run on our own
       | servers. Most ML startups are SAAS and fail this test for me.
        
         | parnoux wrote:
          | I don't know how tight your legal restrictions are, but we
          | work from metadata only. We don't need your text / images /
          | audio, just their embeddings and a few other things. We are
          | also working on a self-hosted version. Out of curiosity, what
          | would you expect in terms of pricing?
        
         | carom wrote:
         | (Not OP) Out of curiosity, how much is your manager's
         | discretionary budget?
         | 
         | I feel like any own-infrastructure per-seat license is going to
         | be way beyond that. Maybe as a marketplace app. [1]
         | 
         | 1. https://aws.amazon.com/marketplace/
        
       ___________________________________________________________________
       (page generated 2022-06-21 23:01 UTC)