[HN Gopher] Replicate vs. Fly GPU cold-start latency
___________________________________________________________________
Replicate vs. Fly GPU cold-start latency
Author : venkii
Score : 44 points
Date : 2024-02-17 18:03 UTC (4 hours ago)
(HTM) web link (venki.dev)
(TXT) w3m dump (venki.dev)
| timenova wrote:
| Is the 100 MB model being downloaded from HuggingFace on Fly too?
|
| I ask this because Fly has immutable Docker containers which
| wouldn't store any data unless you use Fly Volumes. So it could
| be that Fly is downloading the 100MB model each time it cold-
| boots.
|
| If that's the case, a multi-stage Dockerfile could help bundle
| the model in, and perhaps reduce cold-boot time even further.
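|
| Something like this at build time, for example (an untested
| sketch - the repo id and paths below are placeholders, not the
| actual model from the article):
|
|     # download_weights.py -- run via `RUN python download_weights.py`
|     # during `docker build`, so the weights end up baked into the
|     # image instead of being fetched from HuggingFace on every boot.
|     from huggingface_hub import snapshot_download
|
|     snapshot_download(
|         repo_id="some-org/some-100mb-model",  # placeholder repo id
|         local_dir="/app/weights",  # path the app loads at runtime
|     )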
| Szpadel wrote:
| Why would a multi-stage build make any difference? Isn't
| multi-stage just an implementation detail of the resulting
| image? AFAIK only build caching could benefit; pulling the image
| should be unaltered with or without a multi-stage build.
| timenova wrote:
| I am guessing it would be possible to download the model from
| HuggingFace in the build step. I'm not sure though. Plus the Cog
| image is 14GB in size (mentioned in the article); I hope there's
| a way to reduce that.
| Szpadel wrote:
| Sure, but I still don't understand why multi-stage should make
| any difference, because at the end it's just flattened into a
| layer by the copy between stages.
|
| So the size problem is exactly the same: a RUN with curl will
| produce a layer of identical size to a COPY --from=stage layer.
|
| Am I missing something?
|
| The only benefit I can see is build cache reuse, so the download
| is independent of the code build and you won't redownload 14G
| when you change code - is that what you had in mind?
| timenova wrote:
| You're right! Thanks for pointing this out.
|
| I don't know Docker that well. I literally figured it out as I
| went along while deploying on Fly...
| Szpadel wrote:
| No problem.
|
| I didn't want to sound snarky - I thought I might just not be
| aware of some cool Docker optimization hack :)
| simonw wrote:
| Fly recommend you use one of their mountable volumes for large
| model files rather than building them into the Docker image -
| you get faster cold starts that way, plus I think the images
| have size limits.
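|
| Roughly like this (just a sketch, assuming a volume mounted at
| /models - the repo id and paths are made up):
|
|     # On boot: reuse weights already cached on the Fly volume if
|     # present, otherwise download them once. Subsequent cold
|     # starts then skip the download entirely.
|     from pathlib import Path
|     from huggingface_hub import snapshot_download
|
|     WEIGHTS_DIR = Path("/models/some-100mb-model")  # volume path
|
|     if not WEIGHTS_DIR.exists():
|         snapshot_download(
|             repo_id="some-org/some-100mb-model",  # placeholder
|             local_dir=str(WEIGHTS_DIR),
|         )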
| venkii wrote:
| Yes, correct: the 100MB model is being downloaded on every boot!
|
| I tested it that way initially because it's the most naive
| implementation. The right implementation would bundle it in.
|
| But I ended up primarily reporting timings that stop counting as
| soon as control is handed over to user-generated code, since
| that's the number you care about the most.
| timenova wrote:
| That's fair.
|
| Perhaps a good future idea would be to benchmark between
| bundling it in the Docker image, vs. using Fly Volumes as
| Simon suggested in a sibling comment.
| iambateman wrote:
| To make sure I understand... this would provide a private API
| endpoint for a developer to call an LLM in a serverless way?
|
| They could call it and just pay for the time spent, rather than
| for a persistent server.
| dartos wrote:
| Sure, there are platforms that do this (Replicate).
|
| The issue is cold start time for custom models. It takes time to
| pull in dependencies and the model, and to load the model into
| memory.
|
| It's difficult because someone has to pay for the cold start
| time.
| moscicky wrote:
| Replicate has really long boot times for custom models - 2-3
| minutes if you are lucky and up to 30 minutes if they are having
| problems.
|
| While we loved the dev experience, we just couldn't make it work
| with frequent switching between models / LoRA weights.
|
| We switched to beam (https://www.beam.cloud) and it's so much
| better. Their cold start times are consistently small, and they
| provide a caching layer for model files, i.e. volumes, which
| makes switching between models a breeze.
|
| Beam also has a much better pricing policy. For custom models on
| Replicate you pay for boot times (which are very long!), so you
| end up paying a lot of $ for a single request.
|
| With beam you only pay for inference and idle time.
| bfirsh wrote:
| Founder of Replicate here. Our cold boots do suck (see my other
| comment), but you aren't charged for the boot time on
| Replicate, just the time that your `setup()` function runs.
|
| Incentives are aligned for us to make it better. :)
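|
| For context, a Cog predictor is split roughly like this (a
| minimal sketch - the fake "model" below is just a stand-in for a
| real weights load in setup()):
|
|     from cog import BasePredictor, Input
|
|     class Predictor(BasePredictor):
|         def setup(self):
|             # One-time work per boot: load weights into memory.
|             # A real predictor would load actual weights here.
|             self.model = lambda text: text.upper()  # stand-in
|
|         def predict(
|             self, text: str = Input(description="Input text")
|         ) -> str:
|             # Per-request work.
|             return self.model(text)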
| treesciencebot wrote:
| Just as a top-level disclaimer, I'm working at one of the
| companies in "this" space (serverless GPU compute), so take
| anything I say with a grain of salt.
|
| This is one of the things we (at https://fal.ai) are working
| very hard to solve. Because of ML workloads and their multi-GB
| environments (torch, all those cuda/cudnn libraries, and
| anything else they pull in), it is a real challenge just to get
| the container to start in a reasonable time frame. We had to
| write our own shared Python virtual environment runtime using
| SquashFS, distributed through a peer-to-peer caching system, to
| bring it down to the sub-second mark.
|
| After the container boots, there is the aspect of storing model
| weights, which is IMHO less challenging since they are just big
| blobs of data (compared to Python environments, where there are
| thousands of smaller files, each of which might be read
| sequentially and incur a really major latency penalty).
| Distributing them once we had the system above was super easy,
| since just like the squashfs'd virtual environments they are
| immutable data blobs.
|
| We are also starting to play with GPUDirect on some of our bare
| metal clusters, and we hopefully plan to expose it to our
| customers; it is especially important if your model is 40GB or
| larger. At that point you are technically operating at PCIe/SXM
| speeds, which works out to ~2-3 seconds for a model of that
| size.
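|
| For the curious, the back-of-the-envelope math behind that last
| number (assuming ~15 GB/s of effective host-to-GPU bandwidth,
| which is roughly PCIe gen3/gen4 x16 territory - the exact figure
| depends on the link):
|
|     model_size_gb = 40
|     bandwidth_gb_s = 15  # assumed effective host-to-GPU throughput
|     print(f"~{model_size_gb / bandwidth_gb_s:.1f} s")  # ~2.7 s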
| jonnycoder wrote:
| I wrote a review about Replicate last week; the cog I was using,
| insanely-fast-whisper, had boot times exceeding 4 minutes. I
| wish there was more we could observe to find out the cause of
| the slow startup times. I suspected it was dependencies.
|
| https://open.substack.com/pub/jonolson/p/replicatecom-review...
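|
| If it is dependencies, even something this crude inside the
| container would show where the time goes (a rough sketch - torch
| is just an example of a heavy import):
|
|     # Crude startup profiling: time the heavy imports and the
|     # model load separately to see where a slow boot is spent.
|     import time
|
|     t0 = time.perf_counter()
|     import torch  # example heavy dependency
|     print(f"import torch: {time.perf_counter() - t0:.1f}s")
|
|     t0 = time.perf_counter()
|     # model = ...  # placeholder: load weights here, time it too
|     print(f"model load: {time.perf_counter() - t0:.1f}s")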
| mardifoufs wrote:
| Cold start is super bad on Azure Machine Learning endpoints, at
| least it was when we tried to use it a few months ago - even
| before it gets to the environment loading step. It seems like
| even these results are better than what we got on AML, so it's
| impressive imo!
| harrisonjackson wrote:
| I spent a couple months hacking on a dreambooth product that let
| users train a model on their own photos and then generate new
| images w/ presets or their own prompts.
|
| The main costs were:
|
| - gpu time for training
|
| - gpu time for inference
|
| - storage costs for the users' models
|
| - egress fees to download model
|
| I ended up using banana.dev and runpod.io for the serverless
| gpus. Both were great, easy to hook into, and highly
| customizable.
|
| I spent a bunch of time trying to optimize download speed, egress
| fees, gpu spot pricing, gpu location, etc.
|
| R2 is cheaper than s3 - free egress! But the download speeds were
| MUCH worse than s3 - enough that it ended up not even being
| competitive.
|
| It was frequently cheaper to use more expensive GPUs w/ better
| location and network speeds. That factored more into the pricing
| than how long the actual inference took on each instance.
|
| Likewise, if your most important metric is time from boot to
| starting inference then network access might be the limiting
| factor.
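|
| The speed comparison itself is easy to reproduce (a sketch - the
| URLs below are placeholders for presigned links to the same
| object on each provider):
|
|     # Measure effective download throughput for the same file
|     # hosted on different object stores.
|     import time
|     import urllib.request
|
|     urls = {
|         "r2": "https://example.com/model-on-r2.bin",  # placeholder
|         "s3": "https://example.com/model-on-s3.bin",  # placeholder
|     }
|
|     for name, url in urls.items():
|         t0 = time.perf_counter()
|         with urllib.request.urlopen(url) as resp:
|             size = len(resp.read())
|         elapsed = time.perf_counter() - t0
|         print(f"{name}: {size / elapsed / 1e6:.0f} MB/s")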
| hantusk wrote:
| Not affiliated, but I'm a happy modal.com user - it has very
| fast cold starts for the few demos I run with them.
| bfirsh wrote:
| Founder of Replicate here. Yeah, our cold boots suck.
|
| Here's what we're doing:
|
| - Fine-tuned models boot fast: https://replicate.com/blog/fine-
| tune-cold-boots
|
| - We've optimized how weights are loaded in GPU memory for some
| of the models we maintain, and we're going to open this up to all
| custom models soon.
|
| - We're going to be distributing images as individual files
| rather than as image layers, which is much more efficient.
|
| Although our cold boots do suck, the comparison in this blog
| post is apples to oranges, because Fly machines are much lower
| level than Replicate models.
|
| In the blog post, it seems to be using a stopped Fly machine,
| which has already pulled the Docker image onto a node. When it
| starts, all it's doing is starting the Docker image.
|
| On Replicate, the models auto-scale on a cluster. The model could
| be running anywhere in our cluster so we have to pull the image
| to that node when it starts.
|
| Something funny seems to be going on with the latency too. Our
| round-trip latency is about 200ms for a similar model. Would be
| curious to see the methodology, or maybe something was broken on
| our end.
|
| But we do acknowledge the problem. It's going to get better soon.
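|
| If anyone wants to compare numbers, a simple way to measure the
| round trip is just wall-clock time around one request (a sketch,
| not our internal benchmark - the endpoint and payload below are
| placeholders, and this assumes a warm model):
|
|     import json
|     import time
|     import urllib.request
|
|     req = urllib.request.Request(
|         "https://example.com/predict",  # placeholder endpoint
|         data=json.dumps({"input": {"text": "hi"}}).encode(),
|         headers={"Content-Type": "application/json"},
|     )
|
|     t0 = time.perf_counter()
|     with urllib.request.urlopen(req) as resp:
|         resp.read()
|     print(f"round trip: {(time.perf_counter() - t0) * 1000:.0f} ms")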
___________________________________________________________________
(page generated 2024-02-17 23:00 UTC)