[HN Gopher] Show HN: SpotML - Managed ML Training on Cheap AWS/G...
___________________________________________________________________
Show HN: SpotML - Managed ML Training on Cheap AWS/GCP Spot
Instances
Author : vishnukool
Score : 101 points
Date : 2021-10-03 15:57 UTC (7 hours ago)
(HTM) web link (spotml.io)
(TXT) w3m dump (spotml.io)
| Gatesyp wrote:
| Read through the homepage, but not entirely sure --
|
| Why not just train on Spot Instances with a retry implemented?
|
| I see that SpotML has a configurable fall back to On-Demand
| instances, and perhaps their value prop is that it saves the
| state of your run up to the interruption + resumes it on the On-
| Demand instance, but why not just set a retry on the Spot
| Instance if it's interrupted?
|
| I'm failing to see what is different about SpotML vs Metaflow's
| @retry decorator and using AWS Batch:
| https://docs.metaflow.org/metaflow/failures#retrying-tasks-w...
|
| If you're still in the comments, Vishnu, would love to hear
| your thoughts.
| vishnukool wrote:
| Interesting, thanks, we weren't aware of Metaflow.
|
| I've read through the docs; the one difference that comes to
| mind is the automatic fallback to On-Demand and resuming back
| to spot when available. I can't readily see a way to do this
| yet in Metaflow, but it's possible I've missed something.
| mjaques wrote:
| I'd like to point out that this seems extremely similar to Nimbo
| (https://nimbo.sh), to the extent that even some of the terminal
| messages are exactly the same, and even parts of the docs are
| copy pasted. E.g:
|
| Nimbo docs: "In order to run this job on Nimbo, all you need is
| one tiny config file and a Conda environment file (to set the
| remote environment), and Nimbo does the following for you:"
|
| SpotML docs: "In order to run this job on SpotML, all you need is
| one tiny config file and a Docker file (to set the remote
| environment), and SpotML does the following for you:"
|
| Make of that what you will :).
| vishnukool wrote:
| Yes, we liked the elegance of both the tool and the docs, so
| it's very much inspired by it. I must also give credit to
| another great tool, https://spotty.cloud/, from which this
| project was adapted.
| ShamelessC wrote:
| Optics on direct copying without attribution aren't great for
| trust in open source software. Count me out and thanks for
| pointing me to the place where people are _actually_ working
| in the open/cooperatively.
| nerdponx wrote:
| Docs are also still copyrighted works. Copying them
| verbatim might be violating the rights of the original
| authors.
| vishnukool wrote:
| Thanks for pointing it out; we realize our mistake here.
| We also should've done proper attribution. Will be
| correcting this.
| avatar042 wrote:
| Seems like Nimbo (https://nimbo.sh) has a Business Source
| License (https://github.com/nimbo-sh/nimbo/blob/master/LICENSE),
| so you might want to check with them regarding licensing
| terms for a startup that is using their code and/or docs in
| "production"?
|
| Otherwise, this idea is interesting and probably
| generalizable to other applications. Maybe it's not
| crystal clear to me, but what are the advantages of your
| service over existing solutions such as Nimbo and Spotty?
| FWIW it might be worthwhile adding this to your website.
|
| Good luck!
| vishnukool wrote:
| Thanks, makes sense. It doesn't use any "code" from
| Nimbo. The documentation and the design simplicity of the
| tool were what appealed to us and what we adopted. The
| project itself was forked from Spotty, which has an MIT
| license.
|
| The biggest advantage missing in the open source options
| was monitoring of the training job and auto recovery from
| spot interruptions, which SpotML does.
| [deleted]
| oakfr wrote:
| Does it apply to jobs running on multiple instances, e.g. using
| Dask?
| vishnukool wrote:
| Good question, I've heard other engineers ask for this. The MVP
| version doesn't yet handle multiple instances, but it's on our
| roadmap.
| boulos wrote:
| When you do get there, consider the Bulk Insert API on GCP
| and "Spot Fleets" on AWS (both make it easier for the
| provider to satisfy the entire request in one go).
| vishnukool wrote:
| Hmm, makes sense, yes. Thanks!
| tyingq wrote:
| Looks very useful. One suggestion...the before/after swipe image
| is great for showing how it works, but not why you would use it.
| Might be helpful to overlay the ascending cost line, which would
| be steep in the "before", but gradual in the "after".
| vishnukool wrote:
| Thanks for the suggestion, yes makes sense.
| JZL003 wrote:
| I'm interested in how they 'hibernate'/save the state of the
| instances within the shutdown time limit. I was also looking into
| this for myself, there are ways of using docker to save the in-
| process memory a-la hibernate, which would work well with this.
| But, especially for GCP where you only get ~60 seconds between
| the shutdown signal and hard stop, I was worried that it wouldn't
| save it fast enough. I often work on pretty high-RAM instances
| and thought even saving from RAM to disk would take too long
| for 150-300 GB of RAM.
|
| I hadn't heard of Nimbo; maybe I can read how they're doing it
| since it's open source. Does anyone have any idea how they're
| saving state so fast (NVMe SSDs?)
| vishnukool wrote:
| It uses a mounted EBS (Elastic Block Store) volume, so all the
| checkpoints, data, etc. are already in persistent storage. This
| is simply re-attached to the next spot/On-Demand instance after
| interruption.
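|
| Roughly, the re-attach step is just the standard EC2 volume
| dance. A boto3 sketch of that part (untested, and the IDs here
| are placeholders, not anything SpotML-specific):
|
|     import boto3
|     from botocore.exceptions import ClientError
|
|     ec2 = boto3.client("ec2", region_name="us-east-1")
|     VOLUME_ID = "vol-0123456789abcdef0"      # persistent EBS volume
|     NEW_INSTANCE_ID = "i-0123456789abcdef0"  # replacement instance
|
|     # Detach from the interrupted instance; ignore the error if
|     # the instance is already gone and the volume is free.
|     try:
|         ec2.detach_volume(VolumeId=VOLUME_ID)
|     except ClientError:
|         pass
|     ec2.get_waiter("volume_available").wait(VolumeIds=[VOLUME_ID])
|
|     # Attach to the new instance; training then resumes from the
|     # checkpoints already sitting on the volume.
|     ec2.attach_volume(VolumeId=VOLUME_ID,
|                       InstanceId=NEW_INSTANCE_ID,
|                       Device="/dev/sdf")
|     ec2.get_waiter("volume_in_use").wait(VolumeIds=[VOLUME_ID])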
| JZL003 wrote:
| Cool, yeah, that makes total sense for ML where you just need
| to run over epochs; less clear for other workloads.
|
| After looking around I'm thinking more about CRIU/docker
| suspend. The Google stars aligned and I found this
| https://github.com/checkpoint-restore/criu-image-streamer +
| https://linuxplumbersconf.org/event/7/contributions/641/atta...
| which actually seems perfect. I wonder how fast it is.
|
| Edit: Also no GPU support AFAIK but
| https://github.com/twosigma/fastfreeze looks really nice,
| turnkey. I wonder, if I write to a fast persistent disk, whether
| I can handle a higher maximum RAM than going over the network.
|
| (Or, hacking on a checkpoint idea: have a daemon periodically
| 'checkpoint' other programs, so that even if a dump is too slow
| for the 60 seconds, you can revert to the last checkpoint. Even
| an rsync-like approach where you only send the changes.)
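|
| A rough sketch of that daemon idea with plain CRIU (untested;
| CRIU can't checkpoint GPU state, and the PID/paths are just
| placeholders):
|
|     import pathlib, subprocess, time
|
|     PID = 12345                        # process to checkpoint
|     DUMP_DIR = pathlib.Path("/mnt/persist/criu")  # on the EBS disk
|
|     # Snapshot periodically but leave the process running; after
|     # a preemption, restore the newest complete snapshot with
|     # `criu restore -D <dir> --shell-job` on the next instance.
|     while True:
|         snap = DUMP_DIR / time.strftime("%Y%m%dT%H%M%S")
|         snap.mkdir(parents=True, exist_ok=True)
|         subprocess.run(
|             ["criu", "dump", "-t", str(PID), "-D", str(snap),
|              "--leave-running", "--shell-job"],
|             check=True,
|         )
|         time.sleep(300)   # every 5 minutes; tune to dump duration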
| JZL003 wrote:
| Oh I didn't see much in Nimbo with a quick glance but reading
| more closely > We immediately resume training after
| interruptions, using the last model checkpoint via persistent
| EBS volume.
|
| Makes sense, just save checkpoints to disk. What I'm doing is
| more CPU-bound and not straight ML, so less easily
| checkpointed, sadly. Cool though, it's worth jumping through
| hoops for a 70% reduction.
| dexter89_kp3 wrote:
| This looks awesome!
|
| Could you give an approximate cost of fine-tuning a model like
| BERT or even a GAN with this system?
|
| Just want to get a sense of the cost of using the system.
| vishnukool wrote:
| Sure. In our own startup, we used to spend roughly $1000 for
| training a StyleGAN model for a class, and then additional
| latent space manipulation models. In the early days we
| recklessly wasted a lot of our AWS credits during
| experimentation. But later on, with spot instances, we were
| able to bring it down to $250 to $300 per category-class
| training, which was a more bearable cost.
| vishnukool wrote:
| We built a tool called SpotML to make training on AWS/GCP
| cheaper.
|
| Spot Instances are 70% cheaper than On-Demand instances but are
| prone to interruptions. We mitigate the downside of these
| interruptions through the use of persistence features, including
| optional fallback to On-Demand instances. So you can optimize
| workflows according to your budget and time constraints.
|
| History: We were working on a neural rendering startup that
| needed a lot of GAN training, which was getting very expensive.
| We were blowing roughly $1000 to train a single category class.
| Training on spot instances was cheaper, but still a mess: it
| needed a lot of hand-holding/devops work to make it usable. So
| we built SpotML to automate a lot of things.
|
| Posting it here to see if the community finds this helpful, so
| that we can open it up to the larger community.
| oakfr wrote:
| This looks like an obvious approach (in hindsight, of course)
| to a general problem. I love your pivot.
|
| Congrats on the idea and godspeed. You'll probably have a lot
| of interest if you execute well.
| vishnukool wrote:
| Thank you!
| zachthewf wrote:
| Really excited to try it out. I've had a heck of a time setting
| up spot instance training on SageMaker in the past, so
| simplification efforts are much appreciated.
| vishnukool wrote:
| Curious, did you eventually start using SageMaker with spot
| instances or did you give up on it? Also, what would you say
| were the biggest pain points with SageMaker?
| talolard wrote:
| This looks useful, I like the pricing of free/$9.99 a month.
|
| No one asked, but it's HN so I'll say it anyway. I think it's
| a questionable "VC business" but a great business for 1-2
| people. The road from this to an enterprise sales motion, or
| even a $10K/year contract, is hard for me to imagine. At some
| point, it becomes cost effective for my org to build this
| functionality in house.
|
| However, as a hobbyist / single dev / small team, $120/year is
| a no-brainer after the first $2-3K I spend on GPUs by mistake.
| As you know, setting up spots when I just want to get shit done
| is a pain and I'll gladly pay you (a little) to make that go
| away.
|
| Still no one asked but... One thing that plays to your
| advantage is that the current price point is something anyone
| in your user group can buy on their own, and there are a lot of
| us / enough to make a nice business out of.
|
| Good luck!
| vishnukool wrote:
| Thanks, yeah. At this stage we really just want to validate
| whether this is a real problem in the ML community. Down the
| line, I suppose, as we scale to handle multi-instance training
| and other use cases, we could probably charge more, say a % of
| the cost savings in training.
| ignoramous wrote:
| Neat. Congratulations on the launch, Vishnu!
|
| Apart from the fact that it could deploy to both GCP and AWS,
| what does it do differently than AWS Batch [0]?
|
| When we had a similar problem, we ran jobs on spots with AWS
| Batch and it worked nicely enough.
|
| Some suggestions (for a later date):
|
| 1a. Add built-in support for Ray [1] (you'd essentially be then
| competing with Anyscale, which _is_ a VC funded startup, just
| to contrast it with another comment on this thread) and dbt
| [2].
|
| 1b. Or: Support deploying coin miners (might help widen the
| product's reach, and stand it up against the likes of
| ConsenSys).
|
| 2. Get in front of the very many cost optimisation consultants
| out there, like the Duckbill Group.
|
| If I may, where are you building this product from? And how
| many are on the team? Thanks.
|
| [0] https://aws.amazon.com/batch/use-cases/
|
| [1] https://ray.io/
|
| [2] https://getdbt.com/
| vishnukool wrote:
| Thank you! Interesting, we actually tried AWS Batch ourselves.
| 1) How were you able to handle spot interruptions and resume
| from the latest checkpoint? 2) Not to mention fallback to
| On-Demand on spot interruptions. 3) Then switching back to spot
| from On-Demand would also need an additional process to be set
| up.
|
| Also I'm not sure how straightforward it is to detach/attach
| persistent volumes to retain data across different spot
| interruptions? The latter can be done, but it's just the same
| rote work each time you wanna train something new.
|
| Also thanks for the suggestions! We're a team of 2 right now;
| I used to be in the Bay Area but am in Mexico temporarily.
| ignoramous wrote:
| 1. Spot interruptions didn't matter much as AWS Batch looks
| for spots with low interruption probability. Auto retries
| kicked in whenever those did get interrupted.
|
| 2. Checkpointing was a pain (we relied mostly on AWS
| Batch's JobState and S3, not ideal), but the current
| capability to mount EFS (Elastic Filesystem) looks like it
| would solve this?
|
| 3. No hot swapping on-demand with spot and vice versa.
| Interestingly, ALB (Application Load Balancer) supports
| such mixed EC2 configurations (AWS Batch doesn't).
| cpalumbo wrote:
| Ha! That's interesting. We use AWS instances for model
| development and costs are definitely an issue. Sending this to
| my team. Good luck!
| barefeg wrote:
| How does it compare with grid.ai?
| vishnukool wrote:
| Thanks for this, I wasn't aware of it. From reading the docs,
| I don't see anywhere whether Grid automatically handles spot
| interruptions to resume from the last checkpoint, which was the
| main focus of our internal tool.
|
| Have you used it, btw? And what has your experience been with
| Grid?
| hallqv wrote:
| Cool!
| [deleted]
| ShamelessC wrote:
| This looks cool. How much SpotML-specific code is needed to
| target SpotML? I'm assuming it doesn't just magically detect the
| training loop in an existing pytorch codebase and needs hooks
| implemented for resume, inference, error handling, etc.
| vishnukool wrote:
| We tried to keep it minimal. All you need to do is specify the
| format for resuming from the last checkpoint in the spotml.yml
| file. So let's say your checkpoint files are saved as ckpt00.pt,
| ckpt01.pt, ckpt03.pt and so on. You can configure the checkpoint
| filename regex, e.g. ^ckpt[0-9]{2}$, and SpotML resumes by
| picking the latest of them.
|
| For detecting whether the training process is still running or
| has errored out, it registers the training command's PID when
| launching the task and then monitors that PID for completion.
| It also registers and monitors the instance state itself to
| check for interruptions and resuming.
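|
| The "pick the latest matching checkpoint" part is roughly this
| kind of logic (illustrative only, not our actual code; note the
| pattern here includes the .pt suffix):
|
|     import os, re
|
|     CKPT_DIR = "/mnt/checkpoints"  # the mounted persistent volume
|     PATTERN = re.compile(r"^ckpt(\d{2})\.pt$")
|
|     def latest_checkpoint(ckpt_dir=CKPT_DIR):
|         matches = [(int(m.group(1)), name)
|                    for name in os.listdir(ckpt_dir)
|                    if (m := PATTERN.match(name))]
|         if not matches:
|             return None  # fresh run, nothing to resume from
|         return os.path.join(ckpt_dir, max(matches)[1])
|
|     # On resume, the training command is then launched pointing
|     # at latest_checkpoint().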
| ShamelessC wrote:
| Oh wow that's rather simple and works with more
| configurations than I expected. Thanks.
| boulos wrote:
| Disclosure: I used to work for GCP and launched Preemptible VMs.
|
| Congrats! Can I suggest charging more?
|
| IIUC, your business plan is $10/month for the _company_,
| regardless of number of users?
|
| You probably save a company that $10 in a day or less for one
| GPU: One A100 is ~$3/hr on demand and ~$.90/hr as Preemptible,
| saving over $2/hr.
|
| Said another way, your pitch is to recover a lot of the 70%
| discount that they aren't going to do themselves. If you were a
| managed training service, you could pitch yourself as "half the
| price of AWS or GCP" and keep the 20%+ margin with both parties
| being happy. (The problem is that pass-through billing makes
| that obvious, and you need to support lots of bucket security
| and IAM controls, etc.)
|
| Fwiw, I would also branch out into inference! Preemptible and
| Spot T4s are commonly used for heavy image models, but many
| people pay full price. Inference that takes X ms can easily be
| handled "without errors" in the shutdown time. The risk is
| handling all the capacity swings.
| vishnukool wrote:
| Thanks for the feedback, yeah at this stage we really just
| wanted to find out if we could build something useful for the
| community. Agreed on the pricing suggestion.
|
| Also, interesting point about inference. I'm not sure, though,
| how common it is for companies to need GPUs for inference.
| Because if you can have a CPU-based inference model, which I
| thought was most common, it's probably not a big use case?
| boulos wrote:
| I've got some comment somewhere on HN that says exactly that:
| "try CPU inference first, it's pretty good".
|
| The need to reach for a T4 comes when someone is doing a big
| model on images or video _and_ wants sub-second response
| time. (Think some of the stuff on Snapchat, etc.)
| twistedpair wrote:
| I'm not sure what this provides over what GCP already offers. 4
| years ago I switched my co's ML training to use GKE (Google
| Kubernetes Engine) using a cluster of preemptible nodes.
|
| All you need to do is schedule your jobs (just call the
| Kubernetes API and schedule your container to run). In the rare
| case the node gets preempted, your job will be restarted and
| restored by Kubernetes. Let your node pool scale to near zero
| when not in use and get billed by the _second_ of compute used.
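|
| For anyone who hasn't set this up, the manifest is roughly a
| normal Job pinned to the preemptible pool (sketch; the names,
| image and PVC are placeholders, and your training code has to
| implement the resume flag itself):
|
|     apiVersion: batch/v1
|     kind: Job
|     metadata:
|       name: train-model
|     spec:
|       backoffLimit: 20          # keep retrying after preemptions
|       template:
|         spec:
|           restartPolicy: OnFailure
|           nodeSelector:
|             cloud.google.com/gke-preemptible: "true"
|           containers:
|           - name: trainer
|             image: gcr.io/my-project/trainer:latest
|             args: ["--resume-latest"]   # your code handles resume
|             volumeMounts:
|             - name: checkpoints
|               mountPath: /checkpoints
|           volumes:
|           - name: checkpoints
|             persistentVolumeClaim:
|               claimName: checkpoints-pvc  # survives the node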
| [deleted]
| boulos wrote:
| Reminder: this is Show HN, so please try to be constructive.
|
| However, there are at least a couple of things that matter
| here that aren't covered by "just use a preemptible node pool":
|
| * SpotML configures checkpoints (yes this is easy, but next
| point)
|
| * SpotML sends those checkpoints to a persistent volume (by
| default in GKE, you would not use a cluster-wide persistent
| volume claim, and instead only have a local ephemeral one,
| losing your checkpoint)
|
| * SpotML seems to have logic around "retry on preemptible, and
| then switch to on-demand if needed" (you could do this on GKE
| by just having two pools, but it won't be as "directed")
| sitkack wrote:
| Looks like SpotML is a fork of
| https://github.com/nimbo-sh/nimbo and https://spotty.cloud/
|
| This is a hustle to gauge interest (and collect emails) in a
| service that is a clone of nimbo.
| eyesasc wrote:
| This looks promising! Will definitely try it out for my next
| passion project.
| sknzl wrote:
| How does this work? From the documentation it's clear that AWS
| credentials are required. Which permissions are necessary? This
| leads me to assume that the SpotML CLI uses boto3 to create the
| necessary resources (EBS volume, spot instances, S3 bucket) on
| my AWS account. If this is the case, how does billing work if
| this is "just" a CLI?
| vishnukool wrote:
| That's right, it will need AWS credentials with access to
| create EBS volumes, S3 buckets, spawn instances, etc. In
| addition to the CLI, a cloud service constantly monitors the
| progress of the jobs (by registering the PID when launching
| them) and the instance states. So billing will be based on the
| hours of training run and the $ saved.
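|
| As a ballpark, the credentials boil down to EC2 + EBS + S3
| permissions. A rough sketch of creating that kind of policy
| with boto3 (illustrative only; the actual list may differ, and
| you'd want to scope the resources tighter):
|
|     import json
|     import boto3
|
|     policy = {
|         "Version": "2012-10-17",
|         "Statement": [{
|             "Effect": "Allow",
|             "Action": [
|                 "ec2:RunInstances", "ec2:TerminateInstances",
|                 "ec2:DescribeInstances",
|                 "ec2:RequestSpotInstances",
|                 "ec2:DescribeSpotInstanceRequests",
|                 "ec2:CreateVolume", "ec2:AttachVolume",
|                 "ec2:DetachVolume", "ec2:DescribeVolumes",
|                 "s3:CreateBucket", "s3:ListBucket",
|                 "s3:GetObject", "s3:PutObject",
|             ],
|             "Resource": "*",
|         }],
|     }
|
|     boto3.client("iam").create_policy(
|         PolicyName="spotml-training",   # placeholder name
|         PolicyDocument=json.dumps(policy),
|     )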
| site-packages1 wrote:
| This is really cool!
|
| I used to have these same pains. My trick with spot instances has
| been to set my maximum price to the price of a regular instance
| of that class or higher and to sync weights to S3 in the
| background on every save. The former is a parameter when starting
| the instance in the console or Terraform; the latter is basically
| a do...while loop. I've noticed that often one gets booted from a
| spot instance causing interruption because the market price
| increases a few cents above the "70% savings price." Increasing
| the maximum to on demand is basically free money because you
| don't get booted often, and your max price is the regular on
| demand price.
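|
| The background sync really is just a loop; something like this
| (sketch, the bucket and paths are made up):
|
|     import subprocess, time
|
|     # Push new/changed checkpoint files to S3 every minute.
|     # "aws s3 sync" only uploads what changed, so it's cheap
|     # between saves.
|     while True:
|         subprocess.run(
|             ["aws", "s3", "sync", "/home/ubuntu/checkpoints",
|              "s3://my-training-bucket/checkpoints"],
|             check=False,  # survive a transient network blip
|         )
|         time.sleep(60)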
|
| This has seemed to mitigate about all of the spot downsides
| (like interruptions or losing data) because you don't easily
| get kicked out of the spot unless there's a run on machines,
| and the prices rarely fluctuate that much (at least for the
| higher-end p3 instances). This has seemed to prevent data loss
| and to protect the downside risk by setting the instance to a
| knowable max price. There are times when the spot price goes
| higher than the on-demand price, so you still get booted once
| in a while, but it's very infrequent since, as you can imagine,
| most people don't want to spend more for spot than on-demand.
|
| Anecdotally, I still average out to getting the vast majority
| of the spot savings with this method, with very few
| interruptions. Looking at SpotML, it seems to be a lot of
| tooling that achieves these same goals (assuming one would be
| interrupted when a spot dies and moves to full-freight
| on-demand with SpotML), which makes SpotML's solution feel very
| over-engineered to me, given that the majority of what SpotML
| provides can be had with a simple maximum-cost parameter change
| when spinning up a spot instance.
|
| I would be very interested in using anything that doesn't have
| great overhead and saves money. Our bill seems "big" to me (but I
| realize it may be small to many others), so even these small
| savings add up. Would you compare the potential benefits of
| SpotML to the method I described above?
| vishnukool wrote:
| Thanks for the feedback. The biggest upside is that if you have
| a long-running training run (say hours or days) and the spot
| training is interrupted, you probably don't want to manually
| monitor for the next available spot instance and kick-start the
| training again. SpotML takes care of that part. Also,
| optionally, you can configure it to resume with an On-Demand
| instance on interruption until the next spot instance is
| available. In essence, we try to make i) creating buckets/EBS,
| ii) the code to save to S3 in a loop, and iii) monitoring for
| interruptions and resuming from checkpoints easy.
| a-dub wrote:
| i always envisioned that if i ever did this, i would set the
| spot price to the equivalent of infinity/max and then monitor
| it myself, terminating the instance myself if i determined that
| continuing to run at the spot price was more expensive.
|
| why? sure, you don't want to pay more than the on demand price,
| but iirc spot prices often spike very momentarily. so the
| question becomes whether the sum over spike time cost at the
| higher spot price exceeds the time to boot/shutdown/migration
| overhead at the on demand price.
|
| but i've never actually tried it so...
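|
| something like this is what i had in mind (never tried it; the
| numbers/ids are made up, and it ignores the migration-overhead
| comparison entirely):
|
|     import boto3
|
|     ec2 = boto3.client("ec2", region_name="us-east-1")
|     ON_DEMAND_PRICE = 3.06           # $/hr, looked up out of band
|     INSTANCE_ID = "i-0123456789abcdef0"
|
|     # check the current spot price; if running at it no longer
|     # beats on-demand, pull the plug ourselves
|     hist = ec2.describe_spot_price_history(
|         InstanceTypes=["p3.2xlarge"],
|         ProductDescriptions=["Linux/UNIX"],
|         MaxResults=1,
|     )
|     spot_price = float(hist["SpotPriceHistory"][0]["SpotPrice"])
|     if spot_price > ON_DEMAND_PRICE:
|         ec2.terminate_instances(InstanceIds=[INSTANCE_ID])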
| vishnukool wrote:
| This is interesting. From what I've read, AWS no longer
| recommends setting spot instance prices, so that they can
| manage it themselves. I wonder if there's an actual advantage
| to avoiding spot interruptions by setting the spot price even
| higher than On-Demand.
| a-dub wrote:
| yeah true. it's an interesting question that involves not
| only startup/shutdown of instance/os overheads (i suppose
| they've probably thought about this) but also overhead for
| checkpoint/restart of your application.
|
| i also suppose this way of thinking about it comes from
| thinking around how to minimize cost from a purely
| mathematical standpoint. when you think about how and why
| the spot market is operated, and what those short term
| spikes may actually be, it may run counter to the intended
| purpose. (cheap capacity that they may recall at any time,
| because capacity is actually fixed)
|
| funny how market based approaches can gamify things
| sufficiently that sometimes they obscure the underlying
| intention or purpose of having a market in the first place.
| electroly wrote:
| FWIW your "trick" has been the default for years. Unless you
| specify otherwise, the default max price is the on-demand
| price. They changed this back in 2017. Maximum price isn't
| really used by anyone any more.
| https://aws.amazon.com/blogs/compute/new-amazon-ec2-spot-pri...
|
| We spend about $10k/month on spot instances and I don't specify
| any max price. The way to avoid terminations is just to make
| sure you spread your workload over a large number of instance
| types and availability zones.
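|
| With an EC2 Fleet request that's roughly (sketch; the launch
| template name and instance types are placeholders):
|
|     import boto3
|
|     ec2 = boto3.client("ec2", region_name="us-east-1")
|
|     # Ask for spot capacity spread over several instance types
|     # and AZs; capacity-optimized steers toward the pools least
|     # likely to be interrupted.
|     ec2.create_fleet(
|         Type="request",
|         TargetCapacitySpecification={
|             "TotalTargetCapacity": 4,
|             "DefaultTargetCapacityType": "spot",
|         },
|         SpotOptions={"AllocationStrategy": "capacity-optimized"},
|         LaunchTemplateConfigs=[{
|             "LaunchTemplateSpecification": {
|                 "LaunchTemplateName": "training-node",
|                 "Version": "$Latest",
|             },
|             "Overrides": [
|                 {"InstanceType": "p3.2xlarge",
|                  "AvailabilityZone": "us-east-1a"},
|                 {"InstanceType": "p3.2xlarge",
|                  "AvailabilityZone": "us-east-1b"},
|                 {"InstanceType": "g4dn.xlarge",
|                  "AvailabilityZone": "us-east-1a"},
|                 {"InstanceType": "g4dn.xlarge",
|                  "AvailabilityZone": "us-east-1b"},
|             ],
|         }],
|     )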
___________________________________________________________________
(page generated 2021-10-03 23:00 UTC)