[HN Gopher] How to train large deep learning models as a startup
       ___________________________________________________________________
        
       How to train large deep learning models as a startup
        
       Author : dylanbfox
       Score  : 210 points
       Date   : 2021-10-07 14:45 UTC (8 hours ago)
        
 (HTM) web link (www.assemblyai.com)
 (TXT) w3m dump (www.assemblyai.com)
        
       | sandGorgon wrote:
       | the hardest part here is horizontal scaling. OpenAI handrolled
       | its own MPI+SSH stack (https://openai.com/blog/scaling-
       | kubernetes-to-7500-nodes/)
       | 
       | I wonder what the state of the art for horizontal scaling is
       | here... preferably on Kubernetes.
       | 
       | PyTorch is tricky to integrate (using TorchElastic). You could
       | use Dask or Ray Distributed. TensorFlow has its own mechanism
       | that doesn't play nice with Kubernetes.
       | 
       | How are others doing it?
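       | 
       | (For context, a minimal sketch of the plain PyTorch DDP pattern
       | that all of these tools wrap - launched with torchrun, names
       | illustrative:)
       | 
       |   # Minimal DDP sketch; launch with:
       |   #   torchrun --nproc_per_node=4 train.py
       |   import torch
       |   import torch.distributed as dist
       |   from torch.nn.parallel import DistributedDataParallel as DDP
       | 
       |   def main():
       |       # torchrun sets MASTER_ADDR/RANK/WORLD_SIZE env vars
       |       dist.init_process_group(backend="nccl")
       |       torch.cuda.set_device(dist.get_rank()
       |                             % torch.cuda.device_count())
       |       net = torch.nn.Linear(512, 512).cuda()  # stand-in model
       |       net = DDP(net, device_ids=[torch.cuda.current_device()])
       |       opt = torch.optim.SGD(net.parameters(), lr=0.01)
       |       for step in range(100):
       |           x = torch.randn(32, 512).cuda()
       |           loss = net(x).pow(2).mean()  # dummy loss
       |           opt.zero_grad()
       |           loss.backward()  # grads all-reduced across workers
       |           opt.step()
       |       dist.destroy_process_group()
       | 
       |   if __name__ == "__main__":
       |       main()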
        
       | gonab wrote:
       | Use a small network and train it on a local GPU
        
       | [deleted]
        
       | endisneigh wrote:
       | If you wanted to do something like "OK Google" with AssemblyAI,
       | would you have to transcribe everything and then process the
       | substring "OK Google" on the application layer (and therefore
       | incur all of the cost of listening constantly)?
       | 
       | It'd be cool if there was the ability to train a phrase locally
       | on your own premises and then use that to begin the real
       | transcription.
       | 
       | This probably wouldn't be super difficult to build, but I was
       | wondering if it was available (I didn't see anything at a
       | glance).
        
         | dylanbfox wrote:
         | Great question. This is technically referred to as "Wake Word
         | Detection". You run a really small model locally that is just
         | processing 500ms (for example) of audio at a time through a
         | lightweight CNN or RNN. The idea here is that it's just binary
         | classification (vs actual speech recognition).
         | 
         | There are some open source libraries that make this relatively
         | easy:
         | 
         | - https://github.com/Kitt-AI/snowboy (looks to be shut down now)
         | - https://github.com/cmusphinx/pocketsphinx
         | 
         | This avoids having to stream audio 24x7 to a cloud model,
         | which would be super expensive. That being said, I'm pretty
         | sure what Alexa does, for example, is send any positive wake
         | word to a cloud model (that is bigger and more accurate) to
         | verify the prediction of the local wake word detection model.
         | 
         | Once you're confident you have a positive wake word detection
         | - that's when you start streaming to an accurate cloud-based
         | transcription model like Assembly to minimize costs!
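         | 
         | A minimal sketch of that local model, assuming PyTorch and
         | illustrative feature shapes (say, 40 MFCC coefficients x 50
         | frames for a 500ms window):
         | 
         |   # Tiny wake word classifier: binary "wake word vs. not"
         |   # over short audio feature windows.
         |   import torch
         |   import torch.nn as nn
         | 
         |   class WakeWordCNN(nn.Module):
         |       def __init__(self):
         |           super().__init__()
         |           self.net = nn.Sequential(
         |               nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
         |               nn.MaxPool2d(2),
         |               nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
         |               nn.AdaptiveAvgPool2d(1), nn.Flatten(),
         |               nn.Linear(32, 1),  # one logit: wake word?
         |           )
         | 
         |       def forward(self, x):  # x: (batch, 1, 40, 50)
         |           return self.net(x)
         | 
         |   model = WakeWordCNN()
         |   window = torch.randn(1, 1, 40, 50)  # one 500ms window
         |   print(torch.sigmoid(model(window)).item())  # P(wake word)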
        
         | asdff wrote:
         | Bose used to have some pre-internet system that recognized the
         | song you liked to play right after another song (like in a
         | random shuffle) and attempted to learn what you liked to hear,
         | and queue up the song you were likely to skip to anyway. No
         | idea how they pulled it off since this must have been on
         | hardware from 15 years ago iirc.
        
           | tintinmovie wrote:
           | Ah yes, Bose uMusic. According to the manual, it extracts
           | 30 feature points from the songs to define your preference.
           | 
           | uMusic patent:
           | https://patents.google.com/patent/CN1637743A/en
           | 
           | Further reading: http://products.bose.com/pdf/customer_servic
           | e/owners/uMusic_...
        
         | debbiedowner wrote:
         | This is actually a much simpler task than ASR, and you can
         | easily train one on a normal CPU.
         | 
         | The best do-it-yourself instructions are in a book called
         | TinyML.
         | 
         | Compared to super deep transformers, you'll find that deployed
         | WW detectors are as simple as SVMs or 2-layer NNs.
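         | 
         | A sketch of that level of simplicity, assuming scikit-learn
         | and pre-computed feature vectors (all names and shapes
         | illustrative):
         | 
         |   # Wake word detection as plain binary classification on
         |   # fixed-size features (e.g. flattened MFCCs); this trains
         |   # in seconds on a laptop CPU.
         |   import numpy as np
         |   from sklearn.svm import SVC
         | 
         |   rng = np.random.default_rng(0)
         |   X = rng.normal(size=(2000, 200))   # stand-in features
         |   y = rng.integers(0, 2, size=2000)  # 1 = wake word
         |   clf = SVC(kernel="rbf").fit(X[:1500], y[:1500])
         |   print("accuracy:", clf.score(X[1500:], y[1500:]))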
        
         | mdda wrote:
         | The search term you're looking for is "Keyword Spotting" (or
         | "Wake Word Detection") - and that's what's implemented locally
         | for ~embedded devices that sit and wait for something relevant
         | to come along so that they know when to start sending data up
         | to the mothership (or even turn on additional higher-power
         | cores locally).
         | 
         | Here's an example repo that might be interesting (from initial
         | impressions, though there are many more out there) :
         | https://github.com/vineeths96/Spoken-Keyword-Spotting
        
       | stayfrosty420 wrote:
       | in my experience it's often more like "just use linear regression
       | and tell everyone you're using AI"
        
         | cardosof wrote:
         | That's for structured data; for unstructured data it's more
         | like "create a NN and stack more layers until you have your
         | MVP"
        
           | andyxor wrote:
           | it's more like: try different off-the-shelf models on some
           | sample of data until the performance is somewhat acceptable.
           | 
           | Unless you're Google, who even trains models from scratch
           | these days? At most you do some fine-tuning.
        
           | jszymborski wrote:
           | > "create a NN and stack more layers until you have your MVP"
           | 
           | I mean, that's a pretty good principled approach to a lot of
           | ML problems.
        
             | r-zip wrote:
             | I think you have a different definition of "principled"
             | from most people.
        
               | jszymborski wrote:
               | I'm very curious as to what part of that process is not
               | explained by the principles by which we understand neural
               | networks to work.
               | 
               | I invite the possibility I've gone this long
               | misunderstanding the definition of "principled" in this
               | context.
        
             | cardosof wrote:
             | Only because ML is currently more alchemy than
             | engineering. We mix stuff until we make gold, yet we can't
             | explain why more parameters generalize better instead of
             | overfitting.
        
         | potatoman22 wrote:
         | Lol, very true haha. In actuality, I don't think most NNs are
         | any more 'AI' than simpler models. The definition of AI is
         | fleeting, though.
        
         | Barrin92 wrote:
         | the serious tip here is to go with gradient boosting, which
         | very often works so well that it hardly makes a difference
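         | 
         | A minimal sketch, assuming scikit-learn (dataset just a
         | placeholder):
         | 
         |   # Gradient boosting on tabular data; little tuning needed
         |   # to get a strong baseline.
         |   from sklearn.datasets import load_breast_cancer
         |   from sklearn.ensemble import GradientBoostingClassifier
         |   from sklearn.model_selection import train_test_split
         | 
         |   X, y = load_breast_cancer(return_X_y=True)
         |   X_tr, X_te, y_tr, y_te = train_test_split(
         |       X, y, random_state=0)
         |   clf = GradientBoostingClassifier(
         |       n_estimators=200).fit(X_tr, y_tr)
         |   print("test accuracy:", clf.score(X_te, y_te))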
        
       | apl wrote:
       | Several hints here are severely outdated.
       | 
       | For instance, never train a model in end-to-end FP16. Use mixed
       | precision, either via native TF/PyTorch or as a freebie when
       | using TF32 on A100s. This'll ensure that only suitable ops are
       | run with lower precision; no need to fiddle with anything. Also,
       | PyTorch DDP in multi-node regimes hasn't been slower or less
       | efficient than Horovod in ages.
       | 
       | Finally, buying a local cluster of TITAN Xs is an outright weird
       | recommendation for massive models. VRAM limitations alone make
       | this a losing proposition.
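       | 
       | The PyTorch-native version of that advice is only a few lines;
       | a sketch using the torch.cuda.amp API:
       | 
       |   # Mixed precision sketch: autocast runs FP16-safe ops in
       |   # half precision; GradScaler guards against underflow.
       |   import torch
       | 
       |   model = torch.nn.Linear(1024, 1024).cuda()
       |   opt = torch.optim.SGD(model.parameters(), lr=0.01)
       |   scaler = torch.cuda.amp.GradScaler()
       | 
       |   for step in range(100):
       |       x = torch.randn(64, 1024).cuda()
       |       opt.zero_grad()
       |       with torch.cuda.amp.autocast():
       |           loss = model(x).pow(2).mean()  # dummy loss
       |       scaler.scale(loss).backward()  # scale vs. underflow
       |       scaler.step(opt)   # unscales; skips step on inf/nan
       |       scaler.update()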
        
       | Reubend wrote:
       | This is an excellent article, which does a good job of detailing
       | several factors involved here. But while it does suggest several
       | ways to reduce the cost of training models, I'm left with a
       | huge question at the end.
       | 
       | How much does it ultimately cost to train a model at this size,
       | and is it feasible to do without VS funding (and cloud credits)?
        
         | dylanbfox wrote:
         | Author here. Thanks for your comments!
         | 
         | In general - this is expensive stuff. Training big, accurate
         | models just requires a lot of compute, and there is a "barrier
         | to entry" wrt costs, even if you're able to get those costs
         | down. I think it's similar to startups not really being able to
         | get into the aerospace industry unless they raise lots of
         | funding (e.g., Boom Supersonic).
         | 
         | Practically speaking though, for startups without funding, or
         | access to cloud credits, my advice would be to just train the
         | best model you can, with the compute resources you have
         | available. Try to close your first customer with an "MVP"
         | model. Even if your model is not good enough for most customers
         | - you can close one, get some incremental revenue, and keep
         | iterating.
         | 
         | When we first started (2017), I trained models that were ~1/10
         | the size of our current models on a few K80s in AWS. These
         | models were much worse compared to our models today, but they
         | helped us make incremental progress to get to where we are now.
        
       | etrain wrote:
       | Check out Determined https://github.com/determined-ai/determined
       | to help manage this kind of work at scale: Determined leverages
       | Horovod under the hood, automatically manages cloud resources and
       | can get you up on spot instances, T4s, etc., and will work on
       | your local cluster as well. Gives you additional features like
       | experiment management, scheduling, profiling, model registry,
       | advanced hyperparameter tuning, etc.
       | 
       | Full disclosure: I'm a founder of the project.
        
         | PickleAI wrote:
         | Thanks for going open source!
        
         | dylanbfox wrote:
         | Interesting. How do you guys manage spot interruptions when
         | training on spot instances?
        
           | etrain wrote:
           | Users expose their model to our Trial API
           | (https://docs.determined.ai/latest/topic-guides/model-
           | definit...), the base class then implements a training loop
           | (which can be enhanced with user-supplied callbacks, metrics,
           | etc.) that has a whole bunch of bells and whistles. Easy
           | distributed (multi-GPU and multi-node) training, automatic
           | checkpointing, fault tolerance, etc.
           | 
           | Concretely, the system is regularly taking checkpoints (which
           | include model weights and optimizer state) and so if the
           | spots disappear (as they do), the system has enough
           | information to resume from where things were last
           | checkpointed when resources become available again.
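           | 
           | (Not Determined's internals, but the generic pattern being
           | described looks roughly like this sketch:)
           | 
           |   # Checkpoint/resume for interruptible (spot) training:
           |   # persist model + optimizer state, reload on restart.
           |   import os
           |   import torch
           | 
           |   model = torch.nn.Linear(128, 128)
           |   opt = torch.optim.Adam(model.parameters())
           |   path = "ckpt.pt"  # durable storage in practice
           | 
           |   start = 0
           |   if os.path.exists(path):  # resume after preemption
           |       ckpt = torch.load(path)
           |       model.load_state_dict(ckpt["model"])
           |       opt.load_state_dict(ckpt["opt"])
           |       start = ckpt["step"] + 1
           | 
           |   for step in range(start, 1000):
           |       loss = model(torch.randn(32, 128)).pow(2).mean()
           |       opt.zero_grad(); loss.backward(); opt.step()
           |       if step % 100 == 0:  # checkpoint regularly
           |           torch.save({"model": model.state_dict(),
           |                       "opt": opt.state_dict(),
           |                       "step": step}, path)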
        
       | shoo wrote:
       | tangent: I would dearly love to read a similar article focusing
       | on practical advice on industrial application of statistical
       | modelling, probabilistic programming & Bayesian inference
        
       | freshthought wrote:
       | Does anyone use this? How does AssemblyAI compare to Google's? We
       | are considering adding speech recognition to a small part of our
       | product.
        
         | singularity2001 wrote:
         | Maybe relevant in context: you can now use Siri offline
         | transcription inside your apps (for free).
        
         | trowngon wrote:
         | I believe most people have already moved to offline engines.
         | No need to send the data to some random guys like this
         | Assembly. Nemo Conformer from Nvidia, Robust Wav2Vec from
         | Facebook, Vosk. There are a dozen options. And the cost is
         | $0.01 per hour, not $0.89 per hour like here.
         | 
         | Another advantage is that you can do more custom things - add
         | words to vocabulary, detect speakers with biometric features,
         | detect emotions.
        
           | endisneigh wrote:
           | Without talking about accuracy, any comparison is
           | meaningless.
        
             | trowngon wrote:
             | You don't even need to compare accuracy; you can just
             | check the technology. Facebook's model is trained on 256
             | GPU cards and you can fine-tune it to your domain in a day
             | or two. The release was 2 months ago. There is no way any
             | cloud startup can have something better in production
             | given they have access to just 4 Titan cards.
        
         | johnsonap wrote:
         | We run tens of thousands of hours of audio through Assembly AI
         | each day. We did a boatload of benchmarking on manually
         | transcribed audio when we decided to use them and they were by
         | far the best across the usual suspects (Amazon, etc.) and
         | against smaller startups. They've only gotten better in the 2-3
         | years we've been using them
        
         | 6gvONxR4sf7o wrote:
         | This doesn't answer the question at all, but huggingface also
         | has some decent ASR models available.
        
           | nshm wrote:
           | Huggingface ASR models are not really recommended. The
           | simple fact that they don't use a beam decoder with an LM
           | makes them much less accurate for practical applications. If
           | you compare them to setups like Nemo + pyctcdecode, they
           | will be 30% less accurate.
           | 
           | Also, most of the models there are undertrained.
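           | 
           | For reference, the pyctcdecode side of that setup is small;
           | a sketch with a toy vocabulary and a placeholder LM path:
           | 
           |   # Beam search decoding with an n-gram LM; plain argmax
           |   # decoding skips the LM, which is the gap described.
           |   import numpy as np
           |   from pyctcdecode import build_ctcdecoder
           | 
           |   labels = ["", " ", "a", "b", "c"]  # toy CTC vocabulary
           |   decoder = build_ctcdecoder(
           |       labels,
           |       kenlm_model_path="lm.arpa",  # word-level n-gram LM
           |   )
           |   # logits: (time, vocab) log-probabilities from the
           |   # acoustic model; random here just for shape
           |   logits = np.log(np.random.dirichlet(np.ones(5), 100))
           |   print(decoder.decode(logits))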
        
         | makaimc wrote:
         | I used both Google's speech-to-text APIs and Assembly's APIs as
         | well as some other ones to build Twilio Voice phone calling
         | applications. The out-of-the-box accuracy was way better with
         | Assembly and it's far easier to quickly customize the language
         | model for higher accuracy in specific domains (for example
         | programming language keywords). Generally I avoid using Google
         | APIs whenever possible since they always seem overly
         | complicated to get started with and have incomplete
         | documentation even when I'm working in Python which should be
         | one of the better supported languages.
        
         | rememberlenny wrote:
         | I would strongly advise against using Google's ML APIs.
         | 
         | First, at my company Milk Video, we are huge fans of Assembly
         | AI. The quality, speed, and cost of their transcription are
         | galaxies beyond the competition.
         | 
         | Having worked in machine-learning-focused companies for a few
         | years, I have been researching this exact question. I'm
         | curious how I can better forecast the amount of ML talent I
         | should expect to build into our team (we are a seed-stage
         | company), and how much I can confidently outsource to
         | best-in-class providers.
         | 
         | A lot of the ML services we use now are utilities that we
         | don't want to manage (speech-to-text, video content
         | processing, etc.) but also want to see improve. We took a lot
         | of time to decide
         | who we outsource these things to, like working with AssemblyAI,
         | because we were very conscious of the pace of improvement in
         | speech-to-text quality.
         | 
         | When we were comparing products, the most important questions
         | were:
         | 
         | 1. How accurate is the speech-to-text API
         | 
         | 1.a Word error rate
         | 
         | 1.b Time attributed to start/end word
         | 
         | 2. How fast does it process our content
         | 
         | 3. How much does it cost
         | 
         | AssemblyAI was the only tool that used modern web patterns
         | (i.e. not Google's horrible API or other non-tech companies
         | trying to provide transcript services) that made it easy to
         | integrate with in a short Sunday morning. The API is also
         | surprisingly better than other speech-to-text services, because
         | it's trained for the kind of audio/video content being produced
         | today (instead of old call center data, or perfect audio from
         | studio-grade media).
         | 
         | Google's API forced you to manage your asset hosting in GCP,
         | handle tons of unnecessary configuration around auth/file
         | access/identity, and it's insanely slow/inaccurate. Some other
         | transcription services we used were embarrassingly horrible
         | from a developer experience perspective, in that they also
         | required you to actually talk to a person before giving you
         | access.
         | 
         | The reason Assembly is so great is that you can literally make
         | an API request with a media file URL (video or audio), and
         | _boom_, you get a nice, intuitive JSON-formatted transcript
         | response. You can also add params to get speaker labels, topic
         | analysis, and personal information detection - it's just a
         | matter of changing the payload in the first API request.
         | 
         | I'm very passionate about this because I spent so much time
         | fighting previously implemented transcript services, and want
         | to help anyone avoid the pain because Assembly really does it
         | correctly.
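         | 
         | The whole flow is a couple of requests; a sketch (key, file
         | URL, and polling interval are placeholders):
         | 
         |   # Submit a media URL for transcription, then poll for the
         |   # finished transcript.
         |   import time
         |   import requests
         | 
         |   headers = {"authorization": "YOUR_API_KEY"}
         |   base = "https://api.assemblyai.com/v2/transcript"
         |   job = requests.post(base, headers=headers, json={
         |       "audio_url": "https://example.com/talk.mp4",
         |       "speaker_labels": True,  # extras via the same payload
         |   }).json()
         |   while True:  # poll until processing finishes
         |       r = requests.get(f"{base}/{job['id']}",
         |                        headers=headers).json()
         |       if r["status"] in ("completed", "error"):
         |           break
         |       time.sleep(5)
         |   print(r.get("text"))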
        
           | subpar wrote:
           | How good is their speaker labeling? We've been using the
           | Google API but their diarization has been basically unusable
           | for our application (transcripts of group conversations).
        
             | dylanbfox wrote:
             | Dylan from Assembly here. If you want to send me one of
             | your audio files (my email is in my profile) I'd be happy
             | to send you back the diarized results from our API.
             | 
             | You can also signup for a free account and test from the
             | dashboard without having to write any code if that's
             | easier.
             | 
             | Other than lots of crosstalk in your group conversations -
             | is there anything else challenging about your audio (eg,
             | distance from microphones, background noise, etc?)
        
         | dylanbfox wrote:
         | Dylan from Assembly here. Most of our customers have actually
         | switched over to us from Google - this Launch HN from a YC
         | startup that uses our API goes into a bit more detail if you're
         | interested:
         | 
         | https://news.ycombinator.com/item?id=26251322
         | 
         | My email is in my profile if you want to reach out to chat
         | more!
        
         | PickleAI wrote:
         | We use AssemblyAI at our YC startup https://pickleai.com for
         | our transcripts and deploy our own sentiment and summary models
         | to help users take more efficient notes on Zoom calls! Super
         | happy with them!
        
         | monkeydust wrote:
         | Also curious, are there any 'independent' performance
         | benchmarks in this space?
        
           | dylanbfox wrote:
           | This is tricky. The de facto metric to evaluate an ASR model
           | is Word Error Rate (WER). But results can vary widely
           | depending on the pre-processing that's done (or not done) to
           | transcription text before calculating a WER.
           | 
           | For example, if you take the WER of "I live in New York" vs
           | "i live in new york", the WER would be 60% because you're
           | comparing a capitalized version to an uncapitalized version.
           | 
           | This is why public WER results vary so widely.
           | 
           | We publish our own WER results and normalize the human and
           | automatic transcription text as much as possible to get
           | close to "true" numbers. But in reality, we see a
           | lot of people comparing ASR services simply by doing diffs of
           | transcripts.
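           | 
           | For the curious, WER is just word-level edit distance
           | divided by reference length; a minimal sketch reproducing
           | the example above:
           | 
           |   # WER = (substitutions + insertions + deletions)
           |   #       / reference word count
           |   def wer(ref: str, hyp: str) -> float:
           |       r, h = ref.split(), hyp.split()
           |       d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
           |       for i in range(len(r) + 1):
           |           d[i][0] = i
           |       for j in range(len(h) + 1):
           |           d[0][j] = j
           |       for i in range(1, len(r) + 1):
           |           for j in range(1, len(h) + 1):
           |               sub = 0 if r[i-1] == h[j-1] else 1
           |               d[i][j] = min(d[i-1][j] + 1,    # deletion
           |                             d[i][j-1] + 1,    # insertion
           |                             d[i-1][j-1] + sub)
           |       return d[len(r)][len(h)] / len(r)
           | 
           |   print(wer("I live in New York",
           |             "i live in new york"))  # 0.6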
        
       | cleverpebble wrote:
       | I definitely enjoyed reading your article!
       | 
       | Did you play around with any AI-specific accelerators (eg TPUs)?
       | 
       | Looking at some basic cost analysis from a stranger on the
       | Internet - https://medium.com/bigdatarepublic/cost-comparison-of-
       | deep-l... - you can probably get a decent price reduction in
       | training, especially using preemptible instances (and perhaps a
       | better pricing contract with Google/AWS)
       | 
       | It's kind of crazy how the shortage of GPUs is affecting pricing
       | on physical devices. The RTX Titan I bought in 2019 for $2,499
       | runs almost $5k on Amazon and is in short supply. The Titan V
       | option you linked (although I think there's a typo because you
       | referred to it as a Titan X) is an option - but it is still
       | super overpriced for its performance. Of course, this will
       | probably settle down in the next year or two, and by then there
       | will be new GPUs that are ~2-4x flop/$ compared to the
       | V100/A100.
        
         | Learnedvector wrote:
         | Last I checked (a year or two ago), PyTorch support for TPUs
         | was atrocious. Has it gotten any better?
        
           | ypcx wrote:
           | https://github.com/pytorch/xla/
        
           | tubby12345 wrote:
           | PyTorch XLA is a mature backend. In fact, several other
           | accelerators support PyTorch by lowering from XLA.
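           | 
           | A sketch of the torch_xla usage, assuming a TPU host with
           | torch_xla installed (model is a stand-in):
           | 
           |   # PyTorch on TPU: the device swap is most of the change.
           |   import torch
           |   import torch_xla.core.xla_model as xm
           | 
           |   device = xm.xla_device()  # a TPU core
           |   model = torch.nn.Linear(128, 128).to(device)
           |   opt = torch.optim.SGD(model.parameters(), lr=0.01)
           |   x = torch.randn(32, 128, device=device)
           |   loss = model(x).pow(2).mean()
           |   loss.backward()
           |   # steps the optimizer and materializes the XLA graph
           |   xm.optimizer_step(opt, barrier=True)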
        
         | kettleballroll wrote:
         | At these sizes, TPUs would definitely be the way to go, and
         | would likely be a lot cheaper (and potentially faster) than
         | GPUs.
        
       | peter_retief wrote:
       | 500 million parameters seems like a lot - aren't there
       | duplications or redundancies that could reduce the parameter
       | count? One could also use batches of data. Seems very expensive!
        
       | cagataygurturk wrote:
       | Aren't preemptible/spot instances a way of dramatically reducing
       | the public cloud cost if the training jobs are designed to be
       | resumable/resilient to interruptions? Most providers also offer
       | GPUs with this pricing model.
        
       | hedgehog wrote:
       | One thing to note on the "Train with lower precision" point: on
       | newer hardware with TF32 support, that gives you much of the
       | speedup of FP16 without being as finicky. It doesn't save
       | memory, but it's still useful. Automatic in PyTorch, not sure
       | about TensorFlow:
       | 
       | https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-...
       | 
       | This is mostly important because these settings can significantly
       | affect the price/perf evaluation for your specific model & the
       | available hardware.
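       | 
       | (For reference, the PyTorch knobs - both already defaulting to
       | True on Ampere GPUs when this was written:)
       | 
       |   import torch
       | 
       |   # run matmuls and cuDNN convolutions in TF32
       |   torch.backends.cuda.matmul.allow_tf32 = True
       |   torch.backends.cudnn.allow_tf32 = True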
        
       | [deleted]
        
       | visarga wrote:
       | > that still adds up to $2,451,526.58 to run 1,024 A100 GPUs for
       | 34 days
       | 
       | Salary costs are probably even higher than compute costs.
       | Automatic Speech Recognition is an industrial-scale
       | application; it costs a lot to train, but so do many other
       | projects in
       | different fields. How expensive is a plane or a ship? How much
       | can a single building cost? A rocket launch?
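       | 
       | (For scale: 1,024 GPUs x 34 days x 24 hours = 835,584
       | A100-hours, so the quoted figure works out to roughly $2.93 per
       | A100-hour.)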
        
         | dylanbfox wrote:
         | > Salary costs are probably even higher than compute costs.
         | 
         | Yes exactly. Managing that much compute requires many humans!
        
           | dinvlad wrote:
           | I wouldn't be so sure :-)
        
         | 6gvONxR4sf7o wrote:
         | In what way are salary costs higher? This is on the order of 10
         | of their people's annual salaries. This is for a single
         | training run (meaning overall compute costs are higher), and it
         | isn't the only thing those ten or so people would have done
         | that year (also meaning overall compute costs are higher).
        
           | cpill wrote:
           | yeah, but a cluster running the resulting model can
           | transcribe thousands of hours of speech per second, 24/7,
           | with a fixed accuracy. What can 10 humans do?
        
         | somebodythere wrote:
         | This is an entire seed round's worth of money on an operational
         | expenditure.
        
       | aledalgrande wrote:
       | > How to train large deep learning models as a startup
       | 
       | How to train large deep learning models at a well-funded
       | startup*
       | 
       | Everything described here is absolutely not affordable for
       | bootstrappers and startups with little funding, unless the model
       | to train is not that deep.
        
         | m_ke wrote:
         | As a bootstrapper, I camped all night outside of Best Buy to
         | get some 3090s.
         | 
         | Other tips not mentioned in the article:
         | 
         | 1. Tune your hyperparameters on a subset of the data.
         | 
         | 2. Validate new methods with smaller models on public datasets.
         | 
         | 3. Fine-tune models instead of training from scratch (either
         | public models or your previously trained ones) - see the
         | sketch below.
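         | 
         | A sketch of tip 3, assuming torchvision and a stand-in
         | 10-class task:
         | 
         |   # Start from a public pretrained model and only retrain
         |   # the task head.
         |   import torch
         |   import torchvision
         | 
         |   model = torchvision.models.resnet18(pretrained=True)
         |   for p in model.parameters():
         |       p.requires_grad = False  # freeze the backbone
         |   model.fc = torch.nn.Linear(model.fc.in_features, 10)
         |   opt = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
         |   x = torch.randn(8, 3, 224, 224)  # dummy batch
         |   y = torch.randint(0, 10, (8,))
         |   loss = torch.nn.functional.cross_entropy(model(x), y)
         |   loss.backward()
         |   opt.step()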
        
           | aledalgrande wrote:
           | Great hacks, although you have to be aware of the trade-offs:
           | 
           | 1. if you choose the wrong subset, you'll find a non-optimal
           | local minimum
           | 
           | 2. you still risk dead ends when scaling up the model, and
           | lengthen the time to finding that out
           | 
           | 3. a lot of public models are made from inaccurate datasets,
           | so beware
           | 
           | Overall you have to start somewhere though, and your points
           | are still valid.
        
             | aabaker99 wrote:
             | 1. Gradient descent almost always finds a non-optimal
             | local minimum (it is not guaranteed to find a global
             | minimum).
        
             | m_ke wrote:
             | 1. The small subset is to test that your training pipeline
             | works and converges to near-0 loss (see the sketch below).
             | 
             | 2. Sure, but for most new hacks like mixup, RandAugment,
             | etc., the results usually transfer over. The problem with
             | deep learning is that most of the new results don't
             | replicate, so it's good to have a way to quickly validate
             | things.
             | 
             | 3. The lower-level features are usually pretty
             | data-agnostic and transfer well to new tasks.
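             | 
             | A minimal sketch of that sanity check (shapes and model
             | illustrative):
             | 
             |   # Overfit one fixed batch; a healthy pipeline should
             |   # drive the loss to ~0. If it can't, something in the
             |   # model/loss/optimizer wiring is broken.
             |   import torch
             | 
             |   model = torch.nn.Sequential(
             |       torch.nn.Linear(20, 64), torch.nn.ReLU(),
             |       torch.nn.Linear(64, 2))
             |   opt = torch.optim.Adam(model.parameters(), lr=1e-3)
             |   x = torch.randn(16, 20)          # one fixed batch
             |   y = torch.randint(0, 2, (16,))
             |   for step in range(500):
             |       loss = torch.nn.functional.cross_entropy(
             |           model(x), y)
             |       opt.zero_grad(); loss.backward(); opt.step()
             |   print(loss.item())  # should be near 0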
        
       | kartayyar wrote:
       | TLDR of the top two points: "get accepted to YC and use cloud
       | credits" and "use dedicated servers from Cirrascale".
       | 
       | Saved you a click.
        
       | elmomle wrote:
       | Excellent and informative article--and a good bit of brand-
       | building, I might say :-). One thing I'd love to see more writing
       | about is prototyping and iterative development in these contexts
       | --deep NNs are notoriously hard to get "right", and there seems
       | to be a constant tension between model architecting, tuning
       | hyperparameters, etc.--for example, you presumably don't want to
       | have to wait a couple of weeks (and burn through thousands of
       | dollars) seeing if one choice of hyperparameters works well for
       | your chosen architecture.
       | 
       | Of course, some development practices, such as ensuring that your
       | loss function works in a basic sense, are covered in many places.
       | But I'd love to see more in-depth coverage of architecture
       | development & development best practices. Does anyone know of any
       | particularly good resources / discussions there?
        
         | mkolodny wrote:
         | This is an awesome blog post by Andrej Karpathy (the Director
         | of AI at Tesla) about his recipe for training neural networks:
         | https://karpathy.github.io/2019/04/25/recipe/
        
       ___________________________________________________________________
       (page generated 2021-10-07 23:00 UTC)