[HN Gopher] Launch HN: Dioptra (YC W22) - Improve ML models by i...
___________________________________________________________________
Launch HN: Dioptra (YC W22) - Improve ML models by improving their
training data
Hi HN! We're Pierre, Jacques, and Farah from Dioptra (https://dioptra.ai). Dioptra tracks ML metrics to identify model error patterns and suggest the best data curation strategy to fix them.

We've seen a paradigm shift in ML in recent years: the "code" has become a commodity, since many powerful ML models are open source today. The real challenge is to grow and curate quality data. This raises the need for new data-centric tools: IDEs, debuggers, monitoring. Dioptra is a data-centric tool that helps debug models and fix them by systematically curating and growing the best data, at scale.

We experienced this problem first hand, deploying and retraining models. Once a model was in production, maintenance was a huge pain. First, it was hard to assess model performance. Accessing the right production data to diagnose issues was complicated. We had to build custom scripts to connect to DBs, download production data (Compliance, look the other way!) and analyze it. Second, it was hard to translate the diagnosis into concrete next steps: finding the best data to fix and retrain the model. It required another set of scripts to sample new data, label it and retrain. With a large enough labeling budget, we were able to improve our models, but it wasn't optimal: labeling is expensive, and random data sampling doesn't yield the best results. And since the process relied on our individual domain expertise (aka gut feelings), it was inconsistent from one data scientist to the next and not scalable.

We talked to a couple hundred ML practitioners who helped us validate and refine our thinking (we thank every single one of them!). For example, one NLP team had to read more than 10 long legal contracts per week per person. The goal was to track any model errors. Once a month, they synthesized an Excel sheet to detect patterns of errors. Once detected, they had to read more contracts to build their retraining dataset! There were multiple issues with that process. First, the assessment of errors was subjective since it depended on individual interpretations of the legal language. Second, the sourcing of retraining data was time consuming and anecdotal. Finally, they had to spend a lot of time coaching new members to minimize subjectivity. Processes like this highlight how model improvement needs to be less anecdotal and more systematic. A related problem is lack of tooling, which puts a huge strain on ML teams that are constantly asked to innovate and take on new projects.

Dioptra computes a comprehensive set of metrics to give ML teams a full view of their model and detect failure modes. Teams can objectively prioritize their efforts based on the impact of each error pattern. They can also slice and dice to root-cause errors, zero in on faulty data, and visualize it. What used to take days of reading can now be done in a couple of hours. Teams can then quality-check and curate the best data for retraining using our embedding similarity search or active learning techniques. They can easily understand, customize and systematically engineer their data curation strategy with our automation APIs in order to get the best model at each iteration and stay on top of the latest production patterns.
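(To make the "slice and dice, then zero in on the faulty data" step concrete, here is a minimal generic sketch in pandas; the column names and data are made up for illustration and this is not Dioptra's actual API.)

    import pandas as pd

    # Hypothetical evaluation log: one row per prediction, with a
    # metadata column to slice on (document type) and a correctness
    # flag from comparing predictions to ground truth.
    df = pd.DataFrame({
        "doc_type": ["lease", "lease", "nda", "nda", "msa", "msa"],
        "correct":  [True, False, True, True, False, False],
    })

    # Accuracy and volume per slice, worst slices first.
    per_slice = (
        df.groupby("doc_type")["correct"]
          .agg(accuracy="mean", count="size")
          .sort_values("accuracy")
    )

    # Zero in on the faulty examples inside the worst slice, e.g. to
    # review them or route them to relabeling.
    worst_slice = per_slice.index[0]
    faulty = df[(df["doc_type"] == worst_slice) & (~df["correct"])]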
Additionally, Dioptra fits within any ML stack. We have native integrations with major deep learning frameworks. Some of our customers reduced their data ops costs by 30%. Others improved their model accuracy by 20% in one retraining cycle thanks to Dioptra.

Active learning, which has been around for a while but stayed fairly niche until recently, makes intentional retraining possible. This approach has been validated by ML organizations like Tesla, Cruise and Waymo. Recently, other companies like Pinterest started building similar infrastructure. However, it is costly to build and requires specialized skills. We want to make it accessible to everybody.

We created an interactive demo for HN: https://capture.navattic.com/cl4hciffr2881909mv2qrlsc9g

Please share any feedback and thoughts. Thanks for reading!
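(For readers who haven't used active learning: the core of an uncertainty-sampling cycle is small enough to sketch. This is a generic example with synthetic data and scikit-learn, not Dioptra's workflow.)

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical pools: a small labeled set, a large unlabeled one.
    rng = np.random.default_rng(0)
    X_labeled = rng.normal(size=(200, 16))
    y_labeled = (X_labeled[:, 0] > 0).astype(int)
    X_unlabeled = rng.normal(size=(10_000, 16))

    LABEL_BUDGET = 100  # examples we can afford to annotate this cycle

    model = LogisticRegression().fit(X_labeled, y_labeled)

    # Uncertainty sampling: request labels for the points the current
    # model is least sure about (predicted probability closest to 0.5).
    proba = model.predict_proba(X_unlabeled)[:, 1]
    uncertainty = 1.0 - np.abs(proba - 0.5) * 2.0
    to_label = np.argsort(uncertainty)[-LABEL_BUDGET:]

    # After annotation, X_unlabeled[to_label] and the new labels are
    # appended to the labeled pool and the model is retrained.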
Author : farahg
Score : 37 points
Date : 2022-06-21 16:00 UTC (7 hours ago)
| kkouddous wrote:
| We've been trying to implement an active-learning retraining loop
| for our critical NLP models for Koko but have never found the
| time to prioritize the work, as it was a multi-sprint level of
| effort. We've been working with them for a few weeks and we are
| seeing meaningful performance improvements with our models. I
| highly recommend trying them out.
| nshm wrote:
| For many domains, active learning is actually not that efficient.
| The promise is that you label only a subset of the data and train
| the model on it with the same accuracy. The reality is that in
| order to estimate the long tail properly you need all the data
| points in the training set, not just a subset.
|
| Consider a simple language model case. In order to learn some
| specific phrases you need to see them in the training data, and
| phrases of interest are rare (usually 1-2 cases per terabyte of
| data). You simply cannot select half.
|
| Semi-supervised learning and self-supervised learning are more
| reasonable and widely used. You still consider all the data for
| training. You just don't annotate it manually.
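| (A minimal pseudo-labeling sketch of that idea, with synthetic
| data standing in for a real corpus: the whole unlabeled pool stays
| in the loop and the model, not a human, supplies the extra labels.)
|
|     import numpy as np
|     from sklearn.linear_model import LogisticRegression
|
|     rng = np.random.default_rng(1)
|     X_labeled = rng.normal(size=(500, 8))
|     y_labeled = (X_labeled.sum(axis=1) > 0).astype(int)
|     X_unlabeled = rng.normal(size=(50_000, 8))  # all of it is kept
|
|     base = LogisticRegression().fit(X_labeled, y_labeled)
|
|     # Keep every unlabeled point the model is confident about and
|     # treat its prediction as the label.
|     confidence = base.predict_proba(X_unlabeled).max(axis=1)
|     keep = confidence > 0.95
|     X_all = np.vstack([X_labeled, X_unlabeled[keep]])
|     y_all = np.concatenate([y_labeled,
|                             base.predict(X_unlabeled[keep])])
|
|     model = LogisticRegression().fit(X_all, y_all)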
| parnoux wrote:
| You are right. Being able to learn good feature representations
| through SSL is very powerful. We leverage such representations to
| perform tasks like semantic search to tackle problems like long
| tail sampling. We have seen pretty good results mining for edge
| cases. Let me know if you'd like to chat about it.
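| (A rough sketch of that kind of embedding similarity search with
| generic scikit-learn nearest neighbors; the file names here are
| hypothetical and this is not Dioptra's own tooling.)
|
|     import numpy as np
|     from sklearn.neighbors import NearestNeighbors
|
|     # Embeddings for the production pool and for a handful of
|     # known edge cases we want more examples of (any encoder works).
|     pool_emb = np.load("pool_embeddings.npy")       # shape (N, d)
|     edge_emb = np.load("edge_case_embeddings.npy")  # shape (k, d)
|
|     index = NearestNeighbors(n_neighbors=50, metric="cosine")
|     index.fit(pool_emb)
|
|     # Pull each edge case's nearest neighbours out of the pool;
|     # the union is the candidate batch to send to labeling.
|     _, neighbor_ids = index.kneighbors(edge_emb)
|     candidates = np.unique(neighbor_ids.ravel())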
| wanderingmind wrote:
| This is an interesting problem to solve. For the sake of better
| understanding, can OP or someone else here suggest research
| papers or code that describe similar approaches to detecting and
| removing outlier data by analyzing the embedding space?
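| (One common baseline is to flag points whose average distance to
| their nearest neighbours in embedding space is unusually large;
| scikit-learn's LocalOutlierFactor is a refinement of the same
| idea. A rough sketch, with a hypothetical embeddings file:)
|
|     import numpy as np
|     from sklearn.neighbors import NearestNeighbors
|
|     emb = np.load("embeddings.npy")  # (N, d), from any encoder
|
|     # Score each point by its mean distance to its k nearest
|     # neighbours; isolated points get large scores.
|     k = 10
|     nn = NearestNeighbors(n_neighbors=k + 1).fit(emb)
|     distances, _ = nn.kneighbors(emb)
|     scores = distances[:, 1:].mean(axis=1)  # column 0 is the point itself
|
|     # Flag, say, the top 1% as candidate outliers for manual review.
|     threshold = np.quantile(scores, 0.99)
|     outlier_ids = np.where(scores > threshold)[0]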
| fxtentacle wrote:
| I feel like your starting assumption already diverges from my
| world.
|
| > "code" has become a commodity: many powerful ML models are open
| source today. The real challenge is to grow and curate quality
| data.
|
| The main recent improvements in translation and speech
| recognition were all new mathematical methods that enable us to
| use uncurated random data and still get good results. CTC loss
| allows using unaligned text as ground truth. wav2vec allows using
| audio without text for pre-training. OPUS is basically a
| processing pipeline for generating bilingual corpora. Word
| embeddings allow using random monolingual text for pre-training.
| We've also seen a lot of one-shot or zero-shot methods. Plus
| XLS-R was all about transfer learning to reuse knowledge from
| cheap abundant data for resource-constrained environments.
|
| My prediction for the future of ML would be that we'll soon need
| so little training data that a single person can label it on a
| weekend.
|
| On the other hand, I know first-hand that almost nobody can take
| an "open source" ML model and deploy it effectively in
| production. Doing so requires sparsity, quantization,
| conversions, and in many cases l33t C++ skillz to implement
| optimized fused SSE ops so that a model will run with decent
| speed on cheap CPU hardware rather than mandating expensive GPU
| servers.
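| (To make the deployment point concrete: post-training dynamic
| quantization in PyTorch is one of the simpler levers, shrinking
| Linear-layer weights to int8 for faster CPU inference. The model
| below is a stand-in; the fused-op work described above goes far
| beyond this.)
|
|     import torch
|     import torch.nn as nn
|
|     # Hypothetical trained model standing in for a real network.
|     model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(),
|                           nn.Linear(512, 10))
|     model.eval()
|
|     # Weights of Linear layers are stored as int8 and dequantized
|     # on the fly at inference time.
|     quantized = torch.quantization.quantize_dynamic(
|         model, {nn.Linear}, dtype=torch.qint8
|     )
|
|     with torch.no_grad():
|         out = quantized(torch.randn(1, 512))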
| parnoux wrote:
| I don't think our assumptions are so far apart. The methods
| you mentioned made it from research to the open source
| community fairly quickly. In fact, most companies rely on this
| kind of open research to develop their models. In a lot of use
| cases, it has become more about finding the right data than
| improving the model code. (I like Andrew Ng's thoughts on this:
| https://datacentricai.org/) At the same time, there are still a
| lot of unsolved engineering challenges with the code when it
| comes to productionizing models, especially for real-time
| speech transcription.
|
| And we agree with your prediction. That's why we started
| Dioptra: to come up with a systematic way to curate high
| quality data so you can annotate just the data that matters.
| krapht wrote:
| This looks like a dashboard version of my own Python scripts that
| I use to evaluate training data.
|
| **reads more into the blurb**
|
| Use our APIs to improve your models! Blech, not interested
| anymore. I can't hand out access to a 3rd party server for our
| data for legal reasons. Your startup can't afford to jump through
| the hoops to sell to my employer either, not that I would even
| recommend the app right now without trialing it myself.
|
| I need something I can afford to purchase individually through my
| manager's discretionary budget (or myself) and run on our own
| servers. Most ML startups are SaaS and fail this test for me.
| parnoux wrote:
| I don't know how tight your legal restrictions are, but we work
| from metadata only. We don't need your text / images / audio. We
| just need their embeddings and a bit of other metadata. And we
| are working on a self-hosted version as well. Out of curiosity,
| what would you expect in terms of pricing?
| carom wrote:
| (Not OP) Out of curiosity, how much is your manager's
| discretionary budget?
|
| I feel like any own-infrastructure per-seat license is going to
| be way beyond that. Maybe as a marketplace app. [1]
|
| 1. https://aws.amazon.com/marketplace/
___________________________________________________________________
(page generated 2022-06-21 23:01 UTC)