[HN Gopher] Show HN: SpotML - Managed ML Training on Cheap AWS/G...
       ___________________________________________________________________
        
       Show HN: SpotML - Managed ML Training on Cheap AWS/GCP Spot
       Instances
        
       Author : vishnukool
       Score  : 101 points
       Date   : 2021-10-03 15:57 UTC (7 hours ago)
        
 (HTM) web link (spotml.io)
 (TXT) w3m dump (spotml.io)
        
       | Gatesyp wrote:
       | Read through the homepage, but not entirely sure --
       | 
       | Why not just train on Spot Instances with a retry implemented?
       | 
        | I see that SpotML has a configurable fallback to On-Demand
        | instances, and perhaps their value prop is that it saves the
        | state of your run up to the interruption + resumes it on the
        | On-Demand instance, but why not just set a retry on the Spot
        | Instance if it's interrupted?
       | 
       | I'm failing to see what is different about SpotML vs Metaflow's
       | @retry decorator and using AWS Batch:
       | https://docs.metaflow.org/metaflow/failures#retrying-tasks-w...
       | 
        | If you're still in the comments, Vishnu, I'd love to hear your
        | thoughts.
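        | 
        | For reference, a minimal Metaflow sketch of that retry-on-Batch
        | pattern (the step body and decorator arguments here are only
        | illustrative, not something SpotML prescribes):
        | 
        |     from metaflow import FlowSpec, batch, retry, step
        | 
        |     class TrainFlow(FlowSpec):
        | 
        |         @step
        |         def start(self):
        |             self.next(self.train)
        | 
        |         # Run on AWS Batch (e.g. a spot-backed compute
        |         # environment) and retry if the instance is reclaimed.
        |         @retry(times=3)
        |         @batch(gpu=1, memory=32000)
        |         @step
        |         def train(self):
        |             # hypothetical call to your training entry point,
        |             # e.g. train_model(resume_from_checkpoint=True)
        |             self.next(self.end)
        | 
        |         @step
        |         def end(self):
        |             pass
        | 
        |     if __name__ == "__main__":
        |         TrainFlow()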
        
         | vishnukool wrote:
         | Interesting, thanks, we weren't aware of Metaflow.
         | 
          | I've read through the docs; the one difference that comes to
          | my mind is the automatic fallback to On-Demand and resuming
          | back to spot when available. I can't readily see a way to do
          | this yet in Metaflow, but it's possible I've missed something.
        
       | mjaques wrote:
       | I'd like to point out that this seems extremely similar to Nimbo
       | (https://nimbo.sh), to the extent that even some of the terminal
       | messages are exactly the same, and even parts of the docs are
       | copy pasted. E.g:
       | 
       | Nimbo docs: "In order to run this job on Nimbo, all you need is
       | one tiny config file and a Conda environment file (to set the
       | remote environment), and Nimbo does the following for you:"
       | 
       | SpotML docs: "In order to run this job on SpotML, all you need is
       | one tiny config file and a Docker file (to set the remote
       | environment), and SpotML does the following for you:"
       | 
       | Make of that what you will :).
        
         | vishnukool wrote:
          | Yes, we liked the elegance of both the tool and the docs, so
          | it's very much inspired by it. I must also give credit to
          | another great tool, https://spotty.cloud/, from which this
          | project was adapted.
        
           | ShamelessC wrote:
           | Optics on direct copying without attribution aren't great for
           | trust in open source software. Count me out and thanks for
           | pointing me to the place where people are _actually_ working
           | in the open/cooperatively.
        
             | nerdponx wrote:
             | Docs are also still copyrighted works. Copying them
             | verbatim might be violating the rights of the original
             | authors.
        
               | vishnukool wrote:
                | Thanks for pointing it out; we realize our mistake here.
                | We also should've given proper attribution. We'll be
                | correcting this.
        
               | avatar042 wrote:
               | Seems like Nimbo (https://nimbo.sh) has a Business Source
               | License (https://github.com/nimbo-
               | sh/nimbo/blob/master/LICENSE), so you might want to check
               | with them regarding licensing terms for a startup that is
               | using their code and/or docs in "production"?
               | 
               | Otherwise, this idea is interesting and probably
               | generalizable to other applications. Maybe it's not
               | crystal clear to me, but what are the advantages of your
               | service over existing solutions such as Nimbo and Spotty?
               | FWIW it might be worthwhile adding this to your website.
               | 
               | Good luck!
        
               | vishnukool wrote:
                | Thanks, makes sense. It doesn't use any "code" from
                | Nimbo. The documentation and the design simplicity of
                | the tool were the things that appealed to us and that we
                | adopted. The project itself was forked from Spotty,
                | which has an MIT license.
                | 
                | The biggest things missing in the open source options
                | were monitoring of the training job and automatic
                | recovery from spot interruptions, which SpotML does.
        
             | [deleted]
        
       | oakfr wrote:
        | Does it apply to jobs running on multiple instances, e.g. using
        | Dask?
        
         | vishnukool wrote:
          | Good question; I've heard other engineers ask for this. The
          | MVP version doesn't yet handle multiple instances, but it's on
          | our roadmap.
        
           | boulos wrote:
           | When you do get there, consider the Bulk Insert API on GCP
           | and "Spot Fleets" on AWS (both make it easier for the
           | provider to satisfy the entire request in one go).
        
             | vishnukool wrote:
             | Hmm.. makes sense, yes. thanks!
        
       | tyingq wrote:
       | Looks very useful. One suggestion...the before/after swipe image
       | is great for showing how it works, but not why you would use it.
       | Might be helpful to overlay the ascending cost line, which would
       | be steep in the "before", but gradual in the "after".
        
         | vishnukool wrote:
         | Thanks for the suggestion, yes makes sense.
        
       | JZL003 wrote:
        | I'm interested in how they 'hibernate'/save the state of the
        | instances within the shutdown time limit. I was also looking
        | into this for myself; there are ways of using Docker to save the
        | in-process memory a la hibernate, which would work well with
        | this. But, especially for GCP, where you only get ~60 seconds
        | between the shutdown signal and the hard stop, I was worried
        | that it wouldn't save fast enough. I often work on pretty
        | high-RAM instances and thought even saving from RAM to disk
        | would take too long for 150-300GB of RAM in use.
        | 
        | I hadn't heard of Nimbo; maybe I can read how they're doing it
        | since it's open source. Does anyone have any idea how they're
        | saving state so fast (NVMe SSDs?)
        
         | vishnukool wrote:
          | It uses a mounted EBS (Elastic Block Store) volume, so all the
          | checkpoints, data, etc. are already in persistent storage. The
          | volume is simply re-attached to the next spot/on-demand
          | instance after an interruption.
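          | 
          | A rough boto3 sketch of that re-attach step (the volume and
          | instance IDs below are placeholders, not SpotML's actual
          | code):
          | 
          |     import boto3
          | 
          |     ec2 = boto3.client("ec2", region_name="us-east-1")
          |     VOLUME_ID = "vol-0123456789abcdef0"  # checkpoint volume
          | 
          |     # Detach from the interrupted spot instance and wait
          |     # until the volume is free to attach again.
          |     ec2.detach_volume(VolumeId=VOLUME_ID)
          |     ec2.get_waiter("volume_available").wait(
          |         VolumeIds=[VOLUME_ID])
          | 
          |     # Attach to the replacement spot/on-demand instance.
          |     ec2.attach_volume(
          |         VolumeId=VOLUME_ID,
          |         InstanceId="i-0fedcba9876543210",
          |         Device="/dev/sdf",
          |     )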
        
           | JZL003 wrote:
            | Cool, yeah, that makes sense. It makes total sense for ML
            | where you just need to run over epochs; it's less clear for
            | other workloads.
            | 
            | After looking around I'm thinking more about CRIU/Docker
            | suspend. The Google stars aligned and I found
            | https://github.com/checkpoint-restore/criu-image-streamer +
            | https://linuxplumbersconf.org/event/7/contributions/641/atta...
            | which actually seems perfect. I wonder how fast it is.
            | 
            | Edit: Also no GPU support AFAIK, but
            | https://github.com/twosigma/fastfreeze looks really nice and
            | turnkey. I wonder if, by writing to a fast persistent disk, I
            | can get a higher maximum RAM than over the network.
            | 
            | (Or, hacking on a checkpoint idea, have a daemon periodically
            | 'checkpoint' other programs, so even if it's too slow for the
            | 60 seconds, revert to the last checkpoint. Even an rsync-like
            | approach where only the changes are sent.)
        
         | JZL003 wrote:
          | Oh, I didn't see much in Nimbo with a quick glance, but reading
          | more closely:
          | 
          | > We immediately resume training after interruptions, using
          | the last model checkpoint via persistent EBS volume.
          | 
          | Makes sense, just save checkpoints to disk. What I'm doing is
          | more CPU-bound and not straight ML, so it's less easily
          | checkpointed, sadly. Cool though, it's worth jumping through
          | hoops for a 70% reduction.
        
       | dexter89_kp3 wrote:
       | This looks awesome!
       | 
        | Could you give an approximate cost of fine-tuning a model like
        | BERT or even a GAN with this system?
        | 
        | Just want to get a sense of the cost of using the system.
        
         | vishnukool wrote:
          | Sure. In our own startup, we used to spend roughly $1000 to
          | train a StyleGAN model for a class, plus additional latent
          | space manipulation models. In the early days we recklessly
          | wasted a lot of our AWS credits during experimentation. But
          | later on, with spot instances, we were able to bring it down
          | to $250-$300 per category class, which was a more bearable
          | cost.
        
       | vishnukool wrote:
        | We built a tool called SpotML to make training on AWS/GCP
        | cheaper.
        | 
        | Spot Instances are 70% cheaper than On-Demand instances but are
        | prone to interruptions. We mitigate the downside of these
        | interruptions through persistence features, including an
        | optional fallback to On-Demand instances, so you can optimize
        | workflows according to your budget and time constraints.
        | 
        | History: We were working on a neural rendering startup that
        | needed a lot of GAN training, which was getting very expensive.
        | We were blowing roughly $1000 to train a single category class.
        | Training on Spot Instances was cheaper, but still a mess: it
        | needed a lot of hand-holding/devops work to make it usable. So
        | we built SpotML to automate a lot of things.
        | 
        | Posting it here to see if people find this helpful, so that we
        | can open it up to the larger community.
        
         | oakfr wrote:
          | This looks like an obvious approach (in hindsight, of course)
          | to a general problem. I love your pivot.
         | 
         | Congrats on the idea and godspeed. You'll probably have a lot
         | of interest if you execute well.
        
           | vishnukool wrote:
           | Thank you!
        
         | zachthewf wrote:
          | Really excited to try it out. I've had a heck of a time
          | setting up spot instance training on SageMaker in the past, so
          | simplification efforts are much appreciated.
        
           | vishnukool wrote:
            | Curious, did you eventually start using SageMaker with spot
            | instances, or did you give up on it? Also, what would you say
            | were the biggest pain points with SageMaker?
        
         | talolard wrote:
         | This looks useful, I like the pricing of free/$9.99 a month.
         | 
          | No one asked, but it's HN so I'll say it anyway. I think it's a
          | questionable "VC business" but a great business for 1-2 people.
          | The road from this to an enterprise sales motion, or even a
          | $10K/year contract, is hard for me to imagine. At some point it
          | becomes cost-effective for my org to build this functionality
          | in house.
          | 
          | However, as a hobbyist / single dev / small team, $120/year is
          | a no-brainer after the first $2-3K I spend on GPUs by mistake.
          | As you know, setting up spots when I just want to get shit done
          | is a pain and I'll gladly pay you (a little) to make that go
          | away.
         | 
         | Still no one asked but... One thing that plays to your
         | advantage is that the current price point is something anyone
         | in your user group can buy on their own, and there are a lot of
         | us / enough to make a nice business out of.
         | 
         | Good luck!
        
           | vishnukool wrote:
            | Thanks, yeah. At this stage we really just want to validate
            | whether this is a real problem in the ML community. Down the
            | line, I suppose, as we scale to handle multi-instance
            | training and other use cases, we could probably charge more,
            | say a % of the cost savings in training.
        
         | ignoramous wrote:
         | Neat. Congratulations on the launch, Vishnu!
         | 
         | Apart from the fact that it could deploy to both GCP and AWS,
         | what does it do differently than AWS Batch [0]?
         | 
         | When we had a similar problem, we ran jobs on spots with AWS
         | Batch and it worked nicely enough.
         | 
         | Some suggestions (for a later date):
         | 
         | 1a. Add built-in support for Ray [1] (you'd essentially be then
         | competing with Anyscale, which _is_ a VC funded startup, just
         | to contrast it with another comment on this thread) and dbt
         | [2].
         | 
          | 1b. Or: Support deploying coin miners (might help widen the
          | product's reach, and stand it up against the likes of
          | ConsenSys).
          | 
          | 2. Get in front of the very many cost optimisation consultants
          | out there, like the Duckbill Group.
         | 
         | If I may, where are you building this product from? And how
         | many are on the team? Thanks.
         | 
         | [0] https://aws.amazon.com/batch/use-cases/
         | 
         | [1] https://ray.io/
         | 
         | [2] https://getdbt.com/
        
           | vishnukool wrote:
            | Thank you! Interesting, we actually tried AWS Batch
            | ourselves. 1) How were you able to handle spot interruptions
            | and resuming from the latest checkpoint? 2) Not to mention
            | fallback to On-Demand on spot interruptions. 3) Switching
            | back to spot from On-Demand would also need an additional
            | process to be set up.
            | 
            | Also, I'm not sure how straightforward it is to detach/attach
            | persistent volumes to retain data across different spot
            | interruptions? The latter can be done, but it's the same rote
            | work each time you want to train something new.
            | 
            | Also, thanks for the suggestions! We're a team of 2 right
            | now; I used to be in the Bay Area but am in Mexico
            | temporarily.
        
             | ignoramous wrote:
             | 1. Spot interruptions didn't matter much as AWS Batch looks
              | for spots with low interruption probability. Auto retries
              | kicked in whenever those did get interrupted.
             | 
             | 2. Checkpointing was a pain (we relied mostly on AWS
             | Batch's JobState and S3, not ideal), but the current
             | capability to mount EFS (Elastic Filesystem) looks like it
             | would solve this?
             | 
              | 3. No hot-swapping of on-demand with spot and vice versa.
             | Interestingly, ALB (Application Load Balancer) supports
             | such mixed EC2 configurations (AWS Batch doesn't).
        
         | cpalumbo wrote:
         | Ha! That's interesting. We use AWS instances for model
         | development and costs are definitely an issue. Sending this to
         | my team. Good luck!
        
       | barefeg wrote:
       | How does it compare with grid.ai?
        
         | vishnukool wrote:
          | Thanks for this, I wasn't aware of it. From reading the docs, I
          | don't see whether Grid automatically handles spot interruptions
          | and resumes from the last checkpoint, which was the main focus
          | of our internal tool.
          | 
          | Have you used it, btw? And what has your experience been with
          | Grid?
        
       | hallqv wrote:
       | Cool!
        
       | [deleted]
        
       | ShamelessC wrote:
       | This looks cool. How much SpotML-specific code is needed to
       | target SpotML? I'm assuming it doesn't just magically detect the
       | training loop in an existing pytorch codebase and needs hooks
       | implemented for resume, inference, error handling, etc.
        
         | vishnukool wrote:
          | We tried to keep it minimal. All you need to do is specify the
          | checkpoint filename format in the spotml.yml file so it can
          | resume from the last checkpoint. So let's say your checkpoint
          | files are saved as ckpt00.pt, ckpt01.pt, ckpt03.pt and so on.
          | You can configure a checkpoint filename regex like
          | ^ckpt[0-9]{2}$ and SpotML resumes by picking the latest match.
          | 
          | To detect whether the training process is still running or has
          | errored out, it registers the training command's PID when
          | launching the task and then monitors that PID for completion.
          | It also registers and monitors the instance state itself to
          | check for interruptions and to handle resuming.
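          | 
          | Conceptually, the resume step boils down to "pick the newest
          | file matching the configured pattern". A small sketch (not
          | SpotML's actual code; the mount path is hypothetical and the
          | pattern here includes the .pt extension for illustration):
          | 
          |     import os
          |     import re
          | 
          |     CKPT_DIR = "/mnt/checkpoints"  # hypothetical EBS mount
          |     CKPT_RE = re.compile(r"ckpt[0-9]{2}\.pt$")
          | 
          |     def latest_checkpoint(path=CKPT_DIR):
          |         names = [f for f in os.listdir(path)
          |                  if CKPT_RE.match(f)]
          |         if not names:
          |             return None  # fresh run, nothing to resume
          |         # ckpt03.pt sorts after ckpt01.pt, etc.
          |         return os.path.join(path, max(names))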
        
           | ShamelessC wrote:
           | Oh wow that's rather simple and works with more
           | configurations than I expected. Thanks.
        
       | boulos wrote:
       | Disclosure: I used to work for GCP and launched Preemptible VMs.
       | 
       | Congrats! Can I suggest charging more?
       | 
        | IIUC, your business plan is $10/month for the _company_,
        | regardless of the number of users?
       | 
       | You probably save a company that $10 in a day or less for one
       | GPU: One A100 is ~$3/hr on demand and ~$.90/hr as Preemptible,
       | saving over $2/hr.
       | 
       | Said another way, your pitch is to recover a lot of the 70%
       | discount that they aren't going to do themselves. If you were a
       | managed training service, you could pitch yourself as "half the
       | price of AWS or GCP" and keep the 20%+ margin with both parties
        | being happy. (The problem is that pass-through billing makes that
        | obvious; you need to support lots of bucket security and IAM
        | controls, etc.)
       | 
       | Fwiw, I would also branch out into inference! Preemptible and
       | Spot T4s are commonly used for heavy image models, but many
       | people pay full price. Inference that takes X ms can easily be
       | handled "without errors" in the shutdown time. The risk is
       | handling all the capacity swings.
        
         | vishnukool wrote:
          | Thanks for the feedback. Yeah, at this stage we really just
          | wanted to find out if we could build something useful for the
          | community. Agreed on the pricing suggestion.
          | 
          | Also, interesting point about inference. I'm not sure, though,
          | how common it is for companies to need GPUs for inference. If
          | you can have a CPU-based inference model, which I thought was
          | most common, it's probably not a big use case?
        
           | boulos wrote:
           | I've got some comment somewhere on HN that says exactly that
           | "try CPU inference first, it's pretty good".
           | 
           | The need to reach for a T4 comes when someone is doing a big
           | model on images or video _and_ wants sub-second response
           | time. (Think some of the stuff on Snapchat, etc.)
        
       | twistedpair wrote:
        | I'm not sure what this provides over what GCP already offers.
        | Four years ago I switched my co's ML training to GKE (Google
        | Kubernetes Engine) using a cluster of preemptible nodes.
       | 
       | All you need to do is schedule your jobs (just call the
       | Kubernetes API and schedule your container to run). In the rare
       | case the node gets preempted, your job will be restarted and
       | restored by Kubernetes. Let your node pool scale to near zero
       | when not in use and get billed by the _second_ of compute used.
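        | 
        | Roughly, with the official kubernetes Python client (the image,
        | job name and resource numbers are placeholders; the node label
        | is the standard GKE preemptible one):
        | 
        |     from kubernetes import client, config
        | 
        |     config.load_kube_config()
        | 
        |     job = client.V1Job(
        |         api_version="batch/v1",
        |         kind="Job",
        |         metadata=client.V1ObjectMeta(name="train-job"),
        |         spec=client.V1JobSpec(
        |             # Let Kubernetes re-create the pod after preemption.
        |             backoff_limit=10,
        |             template=client.V1PodTemplateSpec(
        |                 spec=client.V1PodSpec(
        |                     restart_policy="Never",
        |                     node_selector={
        |                         "cloud.google.com/gke-preemptible": "true"
        |                     },
        |                     containers=[client.V1Container(
        |                         name="trainer",
        |                         image="gcr.io/my-proj/trainer:latest",
        |                         command=["python", "train.py",
        |                                  "--resume"],
        |                     )],
        |                 ),
        |             ),
        |         ),
        |     )
        |     client.BatchV1Api().create_namespaced_job(
        |         namespace="default", body=job)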
        
         | [deleted]
        
         | boulos wrote:
         | Reminder: this is Show HN, so please try to be constructive.
         | 
         | However, there's at least a couple of things that matter here
         | that aren't covered by "just use a preemptible node pool":
         | 
         | * SpotML configures checkpoints (yes this is easy, but next
         | point)
         | 
         | * SpotML sends those checkpoints to a persistent volume (by
         | default in GKE, you would not use a cluster-wide persistent
         | volume claim, and instead only have a local ephemeral one,
         | losing your checkpoint)
         | 
         | * SpotML seems to have logic around "retry on preemptible, and
         | then switch to on-demand if needed" (you could do this on GKE
         | by just having two pools, but it won't be as "directed")
        
           | sitkack wrote:
           | Looks like SpotML is a fork of https://github.com/nimbo-
           | sh/nimbo and https://spotty.cloud/
           | 
           | This is a hustle to gauge interest (and collect emails) in a
           | service that is a clone of nimbo.
        
       | eyesasc wrote:
       | This looks promising! Will definitely try it out for my next
       | passion project.
        
       | sknzl wrote:
        | How does this work? From the documentation it's clear that AWS
        | credentials are required. Which permissions are necessary? This
        | leads me to assume that the SpotML CLI uses boto3 to create the
        | necessary resources (EBS volumes, spot instances, an S3 bucket)
        | on my AWS account. If this is the case, how does billing work if
        | this is "just" a CLI?
        
         | vishnukool wrote:
          | That's right, it will need AWS credentials with access to
          | create EBS volumes, an S3 bucket, spawn instances, etc. In
          | addition to the "CLI", a cloud service constantly monitors the
          | progress of the jobs (by registering the PID when launching
          | them) and the instance states. So billing will be based on the
          | hours of training run and the $ saved.
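          | 
          | Roughly, the kind of IAM actions involved looks something like
          | this (illustrative only, not an exact or exhaustive list of
          | the permissions we document):
          | 
          |     import json
          | 
          |     policy = {
          |         "Version": "2012-10-17",
          |         "Statement": [{
          |             "Effect": "Allow",
          |             "Action": [
          |                 "ec2:RunInstances",
          |                 "ec2:TerminateInstances",
          |                 "ec2:DescribeInstances",
          |                 "ec2:RequestSpotInstances",
          |                 "ec2:CreateVolume",
          |                 "ec2:AttachVolume",
          |                 "ec2:DetachVolume",
          |                 "s3:CreateBucket",
          |                 "s3:PutObject",
          |                 "s3:GetObject",
          |                 "s3:ListBucket",
          |             ],
          |             "Resource": "*",
          |         }],
          |     }
          |     print(json.dumps(policy, indent=2))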
        
       | site-packages1 wrote:
       | This is really cool!
       | 
        | I used to have these same pains. My trick with spot instances has
        | been to set my maximum price to the price of a regular instance
        | of that class or higher, and to sync weights to S3 in the
        | background on every save. The former is a parameter when starting
        | the instance in the console or Terraform; the latter is basically
        | a do...while loop. I've noticed that one often gets booted from a
        | spot instance, causing an interruption, because the market price
        | increases a few cents above the "70% savings price." Increasing
        | the maximum to the on-demand price is basically free money
        | because you don't get booted often, and your max price is the
        | regular on-demand price.
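        | 
        | For context, the "former" is just one parameter on the launch
        | call; a boto3 sketch (the AMI, instance type and price are
        | placeholders):
        | 
        |     import boto3
        | 
        |     ec2 = boto3.client("ec2")
        |     ec2.run_instances(
        |         ImageId="ami-0123456789abcdef0",
        |         InstanceType="p3.2xlarge",
        |         MinCount=1,
        |         MaxCount=1,
        |         InstanceMarketOptions={
        |             "MarketType": "spot",
        |             "SpotOptions": {
        |                 # Cap the bid at roughly the on-demand rate.
        |                 "MaxPrice": "3.06",
        |                 "SpotInstanceType": "persistent",
        |                 "InstanceInterruptionBehavior": "stop",
        |             },
        |         },
        |     )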
       | 
        | This has seemed to mitigate about all of the spot downsides (like
        | interruptions or losing data) because you don't easily get kicked
        | off the spot instance unless there's a run on machines, and the
        | prices rarely fluctuate that much (at least for the higher-end p3
        | instances). This has seemed to prevent data loss and protect
        | against the downside risk by setting the instance to a knowable
        | max price. There are cases where the spot price goes higher than
        | the on-demand price, so you still get booted once in a while, but
        | it's very infrequent, as you can imagine most people don't want
        | to spend more for spot than on-demand.
       | 
        | Anecdotally, I still average out to getting the vast majority of
        | the spot savings with this method, with very few interruptions.
        | Looking at SpotML, it seems to be a lot of tooling that achieves
        | these same goals (assuming one would be interrupted when a spot
        | instance dies and moves to full-freight on-demand with SpotML),
        | which makes SpotML's solution feel very over-engineered to me,
        | given that the majority of what SpotML provides can be had with a
        | simple maximum-price parameter change when spinning up a spot
        | instance.
       | 
       | I would be very interested in using anything that doesn't have
       | great overhead and saves money. Our bill seems "big" to me (but I
       | realize it may be small to many others), so even these small
       | savings add up. Would you compare the potential benefits of
       | SpotML to the method I described above?
        
         | vishnukool wrote:
          | Thanks for the feedback. The biggest upside is when you have a
          | long-running training job (say hours or days) and the spot
          | training is interrupted. You probably don't want to manually
          | monitor for the next available spot instance and kick-start the
          | training; SpotML takes care of that part. Also, optionally, you
          | can configure it to resume on an On-Demand instance after an
          | interruption, until the next spot instance is available. In
          | essence, we try to make i) creating buckets/EBS volumes, ii)
          | the code to save to S3 in a loop, and iii) monitoring for
          | interruptions and resuming from a checkpoint easy.
        
         | a-dub wrote:
          | i always envisioned that if i ever did this, i would set the
          | spot price to the equivalent of infinity/max and then monitor
          | it myself, terminating the instance myself if i determined that
          | continuing to run at the spot price was more expensive.
          | 
          | why? sure, you don't want to pay more than the on-demand price,
          | but iirc spot prices often spike very momentarily. so the
          | question becomes whether the cost summed over the spike time at
          | the higher spot price exceeds the boot/shutdown/migration
          | overhead at the on-demand price.
         | 
         | but i've never actually tried it so...
        
           | vishnukool wrote:
            | This is interesting. From what I've read, AWS no longer
            | recommends setting spot instance max prices, so that they can
            | manage it themselves. I wonder if there's an actual advantage
            | in avoiding spot interruptions by setting the spot price even
            | higher than on-demand.
        
             | a-dub wrote:
             | yeah true. it's an interesting question that involves not
             | only startup/shutdown of instance/os overheads (i suppose
             | they've probably thought about this) but also overhead for
             | checkpoint/restart of your application.
             | 
             | i also suppose this way of thinking about it comes from
             | thinking around how to minimize cost from a purely
             | mathematical standpoint. when you think about how and why
             | the spot market is operated, and what those short term
             | spikes may actually be, it may run counter to the intended
             | purpose. (cheap capacity that they may recall at any time,
             | because capacity is actually fixed)
             | 
             | funny how market based approaches can gamify things
             | sufficiently that sometimes they obscure the underlying
             | intention or purpose of having a market in the first place.
        
         | electroly wrote:
         | FWIW your "trick" has been the default for years. Unless you
         | specify otherwise, the default max price is the on-demand
         | price. They changed this back in 2017. Maximum price isn't
         | really used by anyone any more.
         | https://aws.amazon.com/blogs/compute/new-amazon-ec2-spot-pri...
         | 
         | We spend about $10k/month on spot instances and I don't specify
         | any max price. The way to avoid terminations is just to make
         | sure you spread your workload over a large number of instance
         | types and availability zones.
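          | 
          | For example, with EC2 Fleet the diversification is just a list
          | of overrides (the launch template ID, subnets and instance
          | types below are placeholders):
          | 
          |     import boto3
          | 
          |     ec2 = boto3.client("ec2")
          |     ec2.create_fleet(
          |         Type="instant",
          |         SpotOptions={
          |             "AllocationStrategy": "capacity-optimized"},
          |         TargetCapacitySpecification={
          |             "TotalTargetCapacity": 1,
          |             "DefaultTargetCapacityType": "spot",
          |         },
          |         LaunchTemplateConfigs=[{
          |             "LaunchTemplateSpecification": {
          |                 "LaunchTemplateId": "lt-0123456789abcdef0",
          |                 "Version": "$Latest",
          |             },
          |             # Several types across subnets/AZs lowers the
          |             # chance of interruption.
          |             "Overrides": [
          |                 {"InstanceType": "p3.2xlarge",
          |                  "SubnetId": "subnet-aaaa1111"},
          |                 {"InstanceType": "p3.8xlarge",
          |                  "SubnetId": "subnet-bbbb2222"},
          |                 {"InstanceType": "g4dn.12xlarge",
          |                  "SubnetId": "subnet-cccc3333"},
          |             ],
          |         }],
          |     )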
        
       ___________________________________________________________________
       (page generated 2021-10-03 23:00 UTC)