[HN Gopher] How to train large deep learning models as a startup
___________________________________________________________________
How to train large deep learning models as a startup
Author : dylanbfox
Score : 210 points
Date : 2021-10-07 14:45 UTC (8 hours ago)
(HTM) web link (www.assemblyai.com)
(TXT) w3m dump (www.assemblyai.com)
| sandGorgon wrote:
| the hardest part here is horizontal scaling. OpenAI handrolled
| its own MPI+SSH stack (https://openai.com/blog/scaling-
| kubernetes-to-7500-nodes/)
|
| I wonder what the state of the art for horizontal scaling is
| here... preferably on Kubernetes.
|
| PyTorch is tricky to integrate (using TorchElastic). You could
| use Dask or Ray Distributed. TensorFlow has its own mechanism
| that doesn't play nice with Kubernetes.
|
| How are others doing it?
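|
| For context, the vanilla PyTorch starting point (before layering
| on TorchElastic, Ray, etc.) is DistributedDataParallel launched
| with torchrun. A minimal sketch, not Kubernetes-specific, with a
| placeholder model and assuming the launcher sets up the workers:
|
|     # launch e.g.: torchrun --nnodes=2 --nproc_per_node=8 train.py
|     import os
|     import torch
|     import torch.distributed as dist
|     from torch import nn
|     from torch.nn.parallel import DistributedDataParallel as DDP
|
|     dist.init_process_group("nccl")  # env vars come from the launcher
|     local_rank = int(os.environ["LOCAL_RANK"])
|     torch.cuda.set_device(local_rank)
|
|     model = DDP(nn.Linear(128, 10).cuda(), device_ids=[local_rank])
|     # ...normal training loop; DDP all-reduces grads across workers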
| gonab wrote:
| Use a small network, train it on a local GPU
| [deleted]
| endisneigh wrote:
| If you wanted to do something like "OK Google" with AssemblyAI
| would you have to transcribe everything and then process the
| substring "OK Google" on the application layer (and therefore
| incur all of the cost of listening constantly)?
|
| It'd be cool if there was the ability to train a phrase locally
| on your own premises and then use that to begin the real
| transcription.
|
| This probably wouldn't be super difficult to build, but I was
| wondering if it was available (I didn't see anything at a glance).
| dylanbfox wrote:
| Great question. This is technically referred to as "Wake Word
| Detection". You run a really small model locally that processes
| 500ms (for example) of audio at a time through a lightweight CNN
| or RNN. The idea here is that it's just binary classification
| (vs. actual speech recognition).
|
| There are some open source libraries that make this relatively
| easy:
|
| - https://github.com/Kitt-AI/snowboy (looks to be shut down now)
| - https://github.com/cmusphinx/pocketsphinx
|
| This avoids having to stream audio 24x7 to a cloud model, which
| would be super expensive. That being said, I'm pretty sure what
| Alexa does, for example, is send any positive wake word to a
| cloud model (that is bigger and more accurate) to verify the
| prediction of the local wake word detection model, AFAIK.
|
| Once the local model detects the wake word - that's when you
| start streaming to an accurate cloud-based transcription model
| like Assembly, to minimize costs!
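|
| For a sense of how small these models can be, here's a minimal
| sketch of a wake-word classifier over ~500ms windows of audio
| features (the architecture, feature sizes, and threshold are just
| illustrative assumptions, not anyone's production model):
|
|     import torch
|     import torch.nn as nn
|
|     class WakeWordNet(nn.Module):
|         """Tiny CNN: classifies a short window of log-mel
|         features as wake word vs. not."""
|         def __init__(self, n_mels=40, n_frames=50):
|             super().__init__()
|             self.net = nn.Sequential(
|                 nn.Conv2d(1, 16, kernel_size=3, padding=1),
|                 nn.ReLU(),
|                 nn.MaxPool2d(2),
|                 nn.Conv2d(16, 32, kernel_size=3, padding=1),
|                 nn.ReLU(),
|                 nn.AdaptiveAvgPool2d(1),
|                 nn.Flatten(),
|                 nn.Linear(32, 1),  # single logit: wake word or not
|             )
|
|         def forward(self, x):  # x: (batch, 1, n_mels, n_frames)
|             return self.net(x)
|
|     model = WakeWordNet().eval()
|     window = torch.randn(1, 1, 40, 50)  # stand-in for real features
|     with torch.no_grad():
|         is_wake = torch.sigmoid(model(window)).item() > 0.5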
| asdff wrote:
| Bose used to have some pre-internet system that recognized the
| song you liked to play right after another song (like in a
| random shuffle), attempted to learn what you liked to hear, and
| queued up the song you were likely to skip to anyway. No idea
| how they pulled it off, since this must have been on hardware
| from 15 years ago, IIRC.
| tintinmovie wrote:
| Ah yes, Bose uMusic. According to the manual, it extracts 30
| feature points from each song to define your preferences.
|
| uMusic patent:
| https://patents.google.com/patent/CN1637743A/en
|
| Further reading: http://products.bose.com/pdf/customer_servic
| e/owners/uMusic_...
| debbiedowner wrote:
| This is actually a much simpler task than ASR, and you can
| easily train it on a normal CPU.
|
| The best do-it-yourself instructions are in a book called
| TinyML.
|
| Compared to super deep transformers, you'll find that deployed
| WW detectors are as simple as SVMs or 2-layer NNs.
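|
| To illustrate how simple that can get, here's a sketch of an SVM
| wake-word detector over flattened MFCC-style features (the
| features and labels below are random placeholders; real training
| data would be labeled positive/negative audio clips):
|
|     import numpy as np
|     from sklearn.svm import SVC
|
|     # pretend each clip is a 40x50 MFCC matrix, flattened
|     X = np.random.randn(200, 40 * 50)
|     y = np.random.randint(0, 2, 200)   # 1 = wake word, 0 = background
|
|     clf = SVC(kernel="rbf").fit(X, y)  # trains in seconds on a CPU
|     print(clf.predict(X[:5]))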
| mdda wrote:
| The search term you're looking for is "Keyword Spotting" (or
| "Wake Word Detection") - and that's what's implemented locally
| for ~embedded devices that sit and wait for something relevant
| to come along so that they know when to start sending data up
| to the mothership (or even turn on additional higher-power
| cores locally).
|
| Here's an example repo that might be interesting (from initial
| impressions, though there are many more out there) :
| https://github.com/vineeths96/Spoken-Keyword-Spotting
| stayfrosty420 wrote:
| in my experience it's often more like "just use linear regression
| and tell everyone you're using AI"
| cardosof wrote:
| That's for structured data; for unstructured data it's more like
| "create a NN and stack more layers until you have your MVP".
| andyxor wrote:
| It's more like trying different off-the-shelf models on some
| sample of data until the performance is somewhat acceptable.
|
| Who even trains models from scratch these days, unless you're
| Google? At most you do some fine-tuning.
| jszymborski wrote:
| > "create a NN and stack more layers until you have your MVP"
|
| I mean, that's a pretty good principled approach to a lot of
| ML problems.
| r-zip wrote:
| I think you have a different definition of "principled"
| from most people.
| jszymborski wrote:
| I'm very curious as to what part of that process is not
| explained by the principles by which we understand neural
| networks to work.
|
| I invite the possibility I've gone this long
| misunderstanding the definition of "principled" in this
| context.
| cardosof wrote:
| Only because currently ML is more alchemy than engineering.
| We mix stuff until we make gold while we can't explain why
| more parameters generalize better instead of overfitting.
| potatoman22 wrote:
| Lol, very true haha. In actuality, I don't think most NNs are
| any more 'AI' than simpler models. The definition of AI is
| fleeting, though.
| Barrin92 wrote:
| The serious tip here is to go with gradient boosting, which very
| often works so well that it hardly makes a difference.
| apl wrote:
| Several hints here are severely outdated.
|
| For instance, never train a model in end-to-end FP16. Use mixed
| precision, either via native TF/PyTorch or as a freebie when
| using TF32 on A100s. This'll ensure that only suitable ops are
| run with lower precision; no need to fiddle with anything. Also,
| PyTorch DDP in multi-node regimes hasn't been slower or less
| efficient than Horovod in ages.
|
| Finally, buying a local cluster of TITAN Xs is an outright weird
| recommendation for massive models. VRAM limitations alone make
| this a losing proposition.
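|
| For PyTorch, the native mixed-precision route is torch.cuda.amp.
| A minimal sketch (tiny dummy model and data; assumes a CUDA GPU
| is available):
|
|     import torch
|     from torch import nn
|     from torch.cuda.amp import GradScaler, autocast
|
|     device = "cuda"
|     model = nn.Linear(128, 10).to(device)
|     optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
|     loss_fn = nn.CrossEntropyLoss()
|     scaler = GradScaler()
|
|     for step in range(100):
|         x = torch.randn(32, 128, device=device)        # dummy batch
|         y = torch.randint(0, 10, (32,), device=device)
|         optimizer.zero_grad()
|         with autocast():              # suitable ops in FP16, rest FP32
|             loss = loss_fn(model(x), y)
|         scaler.scale(loss).backward() # loss scaling avoids underflow
|         scaler.step(optimizer)
|         scaler.update()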
| Reubend wrote:
| This is an excellent article, which does a good job of detailing
| several factors involved here. But while it does suggest several
| ways to reduce the cost of training models, I'm left with a huge
| question at the end.
|
| How much does it ultimately cost to train a model at this size,
| and is it feasible to do without VS funding (and cloud credits)?
| dylanbfox wrote:
| Author here. Thanks for your comments!
|
| In general - this is expensive stuff. Training big, accurate
| models just requires a lot of compute, and there is a "barrier
| to entry" with respect to costs, even if you're able to get
| those costs down. I think it's similar to how startups can't
| really get into the aerospace industry unless they raise lots of
| funding (e.g., Boom Supersonic).
|
| Practically speaking though, for startups without funding, or
| access to cloud credits, my advice would be to just train the
| best model you can, with the compute resources you have
| available. Try to close your first customer with an "MVP"
| model. Even if your model is not good enough for most customers
| - you can close one, get some incremental revenue, and keep
| iterating.
|
| When we first started (2017), I trained models that were ~1/10
| the size of our current models on a few K80s in AWS. Those
| models were much worse than our models today, but they helped us
| make incremental progress to get to where we are now.
| etrain wrote:
| Check out Determined https://github.com/determined-ai/determined
| to help manage this kind of work at scale: Determined leverages
| Horovod under the hood, automatically manages cloud resources,
| can get you up on spot instances, T4s, etc., and will work on
| your local cluster as well. It gives you additional features
| like experiment management, scheduling, profiling, model
| registry, advanced hyperparameter tuning, etc.
|
| Full disclosure: I'm a founder of the project.
| PickleAI wrote:
| Thanks for going open source!
| dylanbfox wrote:
| Interesting. How do you guys manage spot interruptions when
| training on spot instances?
| etrain wrote:
| Users expose their model to our Trial API
| (https://docs.determined.ai/latest/topic-guides/model-
| definit...); the base class then implements a training loop
| (which can be enhanced with user-supplied callbacks, metrics,
| etc.) that has a whole bunch of bells and whistles: easy
| distributed (multi-GPU and multi-node) training, automatic
| checkpointing, fault tolerance, etc.
|
| Concretely, the system is regularly taking checkpoints (which
| include model weights and optimizer state) and so if the
| spots disappear (as they do), the system has enough
| information to resume from where things were last
| checkpointed when resources become available again.
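|
| The same idea in plain PyTorch, for anyone rolling their own
| spot-instance recovery - a minimal sketch with an illustrative
| checkpoint path and a dummy model/training step:
|
|     import os
|     import torch
|     from torch import nn
|
|     CKPT = "checkpoint.pt"
|     model = nn.Linear(128, 10)
|     optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
|     start_step = 0
|
|     if os.path.exists(CKPT):          # resume after an interruption
|         state = torch.load(CKPT)
|         model.load_state_dict(state["model"])
|         optimizer.load_state_dict(state["optimizer"])
|         start_step = state["step"] + 1
|
|     for step in range(start_step, 1000):
|         loss = model(torch.randn(32, 128)).sum()   # dummy step
|         optimizer.zero_grad(); loss.backward(); optimizer.step()
|         if step % 100 == 0:           # save weights + optimizer state
|             torch.save({"model": model.state_dict(),
|                         "optimizer": optimizer.state_dict(),
|                         "step": step}, CKPT)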
| shoo wrote:
| Tangent: I would dearly love to read a similar article focusing
| on practical advice for industrial applications of statistical
| modelling, probabilistic programming & Bayesian inference.
| freshthought wrote:
| Does anyone use this? How does AssemblyAI compare to Google's? We
| are considering adding speech recognition to a small part of our
| product.
| singularity2001 wrote:
| Maybe relevant in context: you can now use Siri offline
| transcription inside your apps. (for free)
| trowngon wrote:
| I believe most people have already moved to offline engines. No
| need to send the data to some random guys like this Assembly:
| Nemo Conformer from Nvidia, Robust Wav2Vec from Facebook, Vosk -
| there are a dozen options. And the cost is $0.01 per hour, not
| $0.89 per hour like here.
|
| Another advantage is that you can do more custom things - add
| words to the vocabulary, detect speakers with biometric
| features, detect emotions.
| endisneigh wrote:
| without talking about accuracy any comparison is meaningless.
| trowngon wrote:
| You don't even need to compare accuracy; you can just check
| the technology. The Facebook model is trained on 256 GPU cards,
| and you can fine-tune it to your domain in a day or two. The
| release was 2 months ago. There is no way any cloud startup can
| have something better in production given that they have access
| to just 4 Titan cards.
| johnsonap wrote:
| We run tens of thousands of hours of audio through Assembly AI
| each day. We did a boatload of benchmarking on manually
| transcribed audio when we decided to use them, and they were by
| far the best across the usual suspects (Amazon, etc.) and
| against smaller startups. They've only gotten better in the 2-3
| years we've been using them.
| 6gvONxR4sf7o wrote:
| This doesn't answer the question at all, but huggingface also
| has some decent ASR models available.
| nshm wrote:
| Huggingface ASR models are not really recommended. The simple
| fact that they don't use a beam decoder with an LM makes them
| much less accurate for practical applications. If you compare
| them to setups like Nemo + pyctcdecode, they will be 30% less
| accurate.
|
| Also, most of the models there are undertrained.
| makaimc wrote:
| I used both Google's speech-to-text APIs and Assembly's APIs, as
| well as some other ones, to build Twilio Voice phone calling
| applications. The out-of-the-box accuracy was way better with
| Assembly, and it's far easier to quickly customize the language
| model for higher accuracy in specific domains (for example,
| programming language keywords). Generally I avoid using Google
| APIs whenever possible, since they always seem overly
| complicated to get started with and have incomplete
| documentation, even when I'm working in Python, which should be
| one of the better-supported languages.
| rememberlenny wrote:
| I would strongly advise against using Google's ML apis.
|
| First, at my company Milk Video, we are huge fans of Assembly
| AI. The quality, speed, and cost of their transcription are
| galaxies beyond the competition.
|
| Having worked at machine-learning-focused companies for a few
| years, I have been researching this exact question. I'm curious
| how I can better forecast the amount of ML talent I should
| expect to build into our team (we are a seed-stage company),
| and how much I can confidently outsource to best-in-class
| vendors.
|
| A lot of the ML services we use now are utilities that we don't
| want to manage (speech-to-text, video content processing, etc),
| and also want to see improve. We took a lot of time to decide
| who we outsource these things to, like working with AssemblyAI,
| because we were very conscious of the pace of improvement in
| speech-to-text quality.
|
| When we were comparing products, the most important questions
| were:
|
| 1. How accurate is the speech-to-text API?
|
| 1.a Word error rate
|
| 1.b Time attributed to start/end of each word
|
| 2. How fast does it process our content?
|
| 3. How much does it cost?
|
| AssemblyAI was the only tool that used modern web patterns (i.e.
| not Google's horrible API or other non-tech companies trying to
| provide transcript services), which made it easy to integrate
| with in a short Sunday morning. The API is also surprisingly
| better than other speech-to-text services, because it's trained
| for the kind of audio/video content being produced today
| (instead of old call center data, or perfect audio from
| studio-grade media).
|
| Google's API forced you to manage your asset hosting in GCP and
| handle tons of unnecessary configuration around auth/file
| access/identity, and it's insanely slow/inaccurate. Some other
| transcription services we used were embarrassingly horrible
| from a developer experience perspective, in that they also
| required you to actually talk to a person before giving you
| access.
|
| The reason Assembly is so great is that you can literally make
| an API request with a media file URL (video or audio), and
| _boom_ , you get a nice, intuitive JSON-formatted transcript
| response. You can also add params to get speakers, topic
| analysis, and personal information detection - it's just a
| matter of changing the payload in the first API request.
|
| I'm very passionate about this because I spent so much time
| fighting previously implemented transcript services, and want
| to help anyone avoid the pain because Assembly really does it
| correctly.
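|
| For reference, the request/poll flow looks roughly like this -
| paraphrased from memory, so treat the endpoint and field names
| as assumptions and check Assembly's docs for the exact details:
|
|     import time
|     import requests
|
|     API_KEY = "..."                    # your AssemblyAI API key
|     HEADERS = {"authorization": API_KEY}
|     BASE = "https://api.assemblyai.com/v2/transcript"
|
|     # 1. submit a publicly reachable media URL for transcription
|     job = requests.post(BASE, headers=HEADERS, json={
|         "audio_url": "https://example.com/call.mp3",
|         "speaker_labels": True,        # optional extras via params
|     }).json()
|
|     # 2. poll until the transcript is ready
|     while True:
|         result = requests.get(f"{BASE}/{job['id']}",
|                               headers=HEADERS).json()
|         if result["status"] in ("completed", "error"):
|             break
|         time.sleep(5)
|
|     print(result.get("text"))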
| subpar wrote:
| How good is their speaker labeling? We've been using the
| Google API but their diarization has been basically unusable
| for our application (transcripts of group conversations).
| dylanbfox wrote:
| Dylan from Assembly here. If you want to send me one of
| your audio files (my email is in my profile) I'd be happy
| to send you back the diarized results from our API.
|
| You can also sign up for a free account and test from the
| dashboard, without having to write any code, if that's easier.
|
| Other than lots of crosstalk in your group conversations - is
| there anything else challenging about your audio (e.g., distance
| from microphones, background noise, etc.)?
| dylanbfox wrote:
| Dylan from Assembly here. Most of our customers have actually
| switched over to us from Google - this Launch HN from a YC
| startup that uses our API goes into a bit more detail if you're
| interested:
|
| https://news.ycombinator.com/item?id=26251322
|
| My email is in my profile if you want to reach out to chat
| more!
| PickleAI wrote:
| We use AssemblyAI at our YC startup https://pickleai.com for
| our transcripts, and deploy our own sentiment and summary models
| to help users take more efficient notes on Zoom calls! Super
| happy with them!
| monkeydust wrote:
| Also curious, are there any 'independent' performance
| benchmarks in this space?
| dylanbfox wrote:
| This is tricky. The de facto metric for evaluating an ASR model
| is Word Error Rate (WER). But results can vary widely depending
| on the pre-processing that's done (or not done) to the
| transcription text before calculating the WER.
|
| For example, if you take the WER of "I live in New York" against
| "i live in new york", the WER would be 60%, because every
| differently-capitalized word counts as a substitution.
|
| This is why public WER results vary so widely.
|
| We publish our own WER results and normalize the human and
| automatic transcription text as much as possible to get as
| close to "true" numbers as possible. But in reality, we see a
| lot of people comparing ASR services simply by doing diffs of
| transcripts.
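|
| To make this concrete, here's a minimal WER calculation (a
| simple word-level Levenshtein distance; real evaluations also
| normalize punctuation, numbers, etc.):
|
|     def wer(ref, hyp):
|         """Word error rate = word-level edit distance / ref length."""
|         r, h = ref.split(), hyp.split()
|         d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
|         for i in range(len(r) + 1):
|             d[i][0] = i
|         for j in range(len(h) + 1):
|             d[0][j] = j
|         for i in range(1, len(r) + 1):
|             for j in range(1, len(h) + 1):
|                 cost = 0 if r[i - 1] == h[j - 1] else 1
|                 d[i][j] = min(d[i - 1][j] + 1,         # deletion
|                               d[i][j - 1] + 1,         # insertion
|                               d[i - 1][j - 1] + cost)  # substitution
|         return d[len(r)][len(h)] / len(r)
|
|     print(wer("I live in New York", "i live in new york"))  # 0.6
|     norm = str.lower
|     print(wer(norm("I live in New York"),
|               norm("i live in new york")))                  # 0.0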
| cleverpebble wrote:
| I definitely enjoyed reading your article!
|
| Did you play around with any AI-specific accelerators (e.g.
| TPUs)?
| Looking at some basic cost analysis from a stranger on the
| Internet - https://medium.com/bigdatarepublic/cost-comparison-of-
| deep-l... - you can probably get a decent price reduction in
| training, especially using preemptible instances (and perhaps a
| better pricing contract with Google/AWS).
|
| It's kind of crazy how the shortage of GPUs is affecting pricing
| on physical devices. My RTX Titan I bought in 2019 for $2,499
| runs almost $5k on Amazon and is in short supply. The Titan V
| option you linked (although I think there's a typo, because you
| referred to it as a Titan X) is an option - but it is still
| super overpriced for its performance. Of course, this will
| probably settle down in the next year or two, and by then there
| will be new GPUs with ~2-4x flops/$ compared to the V100/A100.
| Learnedvector wrote:
| Last I checked (a year or two ago), PyTorch support for TPUs was
| atrocious. Has it gotten any better?
| ypcx wrote:
| https://github.com/pytorch/xla/
| tubby12345 wrote:
| PyTorch XLA is a mature backend. In fact, several other
| accelerators support PyTorch by lowering from XLA.
| kettleballroll wrote:
| At these sizes, TPUs would definitely be the way to go, and
| would likely be a lot cheaper (and potentially faster) than
| GPUs.
| peter_retief wrote:
| 500 million parameters seems like a lot; are there no
| duplications or redundancies that could reduce the parameter
| count? One could also use batches of data. Seems very expensive!
| cagataygurturk wrote:
| Aren't preemptible/spot instances a way of dramatically reducing
| the public cloud cost, if the training jobs are designed to be
| resumable/resilient to interruptions? Most providers also offer
| GPUs with this pricing model.
| hedgehog wrote:
| One thing to note on the "Train with lower precision" tip: on
| newer hardware with TF32 support, that gives you much of the
| speedup of FP16 without being as finicky. It doesn't save
| memory, but it's still useful. Automatic in PyTorch, not sure in
| TensorFlow:
|
| https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-...
|
| This is mostly important because these settings can significantly
| affect the price/perf evaluation for your specific model & the
| available hardware.
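|
| For reference, the PyTorch knobs look like this (defaults have
| shifted between releases, so it's worth checking them explicitly
| rather than assuming):
|
|     import torch
|
|     # TF32 runs FP32 matmuls/convolutions on Ampere tensor cores
|     # with reduced mantissa precision - same memory, big speedup.
|     torch.backends.cuda.matmul.allow_tf32 = True
|     torch.backends.cudnn.allow_tf32 = True
|     print(torch.backends.cuda.matmul.allow_tf32,
|           torch.backends.cudnn.allow_tf32)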
| [deleted]
| visarga wrote:
| > that still adds up to $2,451,526.58 to run 1,024 A100 GPUs for
| 34 days
|
| Salary costs are probably even higher than compute costs.
| Automatic Speech Recognition is an industrial-scale application;
| it costs a lot to train, but so do many other projects in
| different fields. How expensive is a plane or a ship? How much
| can a single building cost? A rocket launch?
| dylanbfox wrote:
| > Salary costs are probably even higher than compute costs.
|
| Yes exactly. Managing that much compute requires many humans!
| dinvlad wrote:
| I wouldn't be so sure :-)
| 6gvONxR4sf7o wrote:
| In what way are salary costs higher? This is on the order of 10
| of their people's annual salaries. This is for a single
| training run (meaning overall compute costs are higher), and it
| isn't the only thing those ten or so people would have done
| that year (also meaning overall compute costs are higher).
| cpill wrote:
| Yeah, but a cluster running the resulting model can transcribe
| thousands of hours of speech per second, 24/7, with a fixed
| accuracy. What can 10 humans do?
| somebodythere wrote:
| This is an entire seed round's worth of money on an operational
| expenditure.
| aledalgrande wrote:
| > How to train large deep learning models as a startup
|
| How to train large deep learning models at a well-funded
| startup*
|
| Everything described here is absolutely not affordable by
| bootstrappers and startups with little funding, unless the model
| to train is not that deep.
| m_ke wrote:
| As a bootstrapper, I camped all night outside of Best Buy to get
| some 3090s.
|
| Other tips not mentioned in the article:
|
| 1. Tune your hyperparameters on a subset of the data.
|
| 2. Validate new methods with smaller models on public datasets.
|
| 3. Fine-tune models instead of training from scratch (either
| public models or ones you previously trained) - see the sketch
| below.
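|
| As a rough illustration of point 3 - load a public pretrained
| model, freeze the backbone, and train only a new head
| (torchvision's resnet18 and the 5-class head are just stand-ins
| for whatever fits your domain):
|
|     import torch
|     from torch import nn
|     from torchvision import models
|
|     model = models.resnet18(pretrained=True)   # pretrained weights
|     for p in model.parameters():               # freeze the backbone
|         p.requires_grad = False
|     model.fc = nn.Linear(model.fc.in_features, 5)  # new 5-class head
|
|     optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
|     loss_fn = nn.CrossEntropyLoss()
|     x = torch.randn(8, 3, 224, 224)            # dummy batch
|     y = torch.randint(0, 5, (8,))
|     loss = loss_fn(model(x), y)
|     optimizer.zero_grad(); loss.backward(); optimizer.step()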
| aledalgrande wrote:
| Great hacks, although you have to be aware of the trade-offs:
|
| 1. if you choose the wrong subset, you'll end up in a
| non-optimal local minimum
|
| 2. you still risk dead ends when scaling the model up, and you
| lengthen the time it takes to find that out
|
| 3. a lot of public models are trained on inaccurate datasets,
| so beware
|
| Overall you have to start somewhere though, and your points
| are still valid.
| aabaker99 wrote:
| 1. Gradient descent almost always finds a non-optimal local
| minimum (it is not guaranteed to find a global minimum).
| m_ke wrote:
| 1. The small subset is to test that your training pipeline
| works and converges to near-0 loss (see the sketch below).
|
| 2. Sure, but for most new hacks like mixup, RandAugment, etc.,
| the results usually transfer over. The problem with deep
| learning is that most of the new results don't replicate, so
| it's good to have a way to quickly validate things.
|
| 3. The lower-level features are usually pretty data-agnostic
| and transfer well to new tasks.
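|
| Point 1 is the classic "overfit a single small batch" sanity
| check: if the loss won't go to ~0 on a handful of examples,
| something in the pipeline is broken. A minimal sketch with a
| dummy model and data:
|
|     import torch
|     from torch import nn
|
|     model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(),
|                           nn.Linear(64, 2))
|     optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
|     loss_fn = nn.CrossEntropyLoss()
|
|     x = torch.randn(16, 20)              # one tiny, fixed batch
|     y = torch.randint(0, 2, (16,))
|
|     for step in range(500):              # should drive loss near zero
|         loss = loss_fn(model(x), y)
|         optimizer.zero_grad(); loss.backward(); optimizer.step()
|     print(loss.item())                   # expect something close to 0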
| kartayyar wrote:
| TL;DR of the top two points: "get accepted to YC and use cloud
| credits" and "use dedicated servers from Cirrascale".
|
| Saved you a click.
| elmomle wrote:
| Excellent and informative article--and a good bit of brand-
| building, I might say :-). One thing I'd love to see more writing
| about is prototyping and iterative development in these contexts
| --deep NNs are notoriously hard to get "right", and there seems
| to be a constant tension between model architecting, tuning
| hyperparameters, etc.--for example, you presumably don't want to
| have to wait a couple of weeks (and burn through thousands of
| dollars) seeing if one choice of hyperparameters works well for
| your chosen architecture.
|
| Of course, some development practices, such as ensuring that your
| loss function works in a basic sense, are covered in many places.
| But I'd love to see more in-depth coverage of architecture
| development & development best practices. Does anyone know of any
| particularly good resources / discussions there?
| mkolodny wrote:
| This is an awesome blog post by Andrej Karpathy (the Director
| of AI at Tesla) about his recipe for training neural networks:
| https://karpathy.github.io/2019/04/25/recipe/
___________________________________________________________________
(page generated 2021-10-07 23:00 UTC)