[HN Gopher] Replicate vs. Fly GPU cold-start latency
       ___________________________________________________________________
        
       Replicate vs. Fly GPU cold-start latency
        
       Author : venkii
       Score  : 44 points
       Date   : 2024-02-17 18:03 UTC (4 hours ago)
        
 (HTM) web link (venki.dev)
 (TXT) w3m dump (venki.dev)
        
       | timenova wrote:
       | Is the 100 MB model being downloaded from HuggingFace on Fly too?
       | 
       | I ask this because Fly has immutable Docker containers which
       | wouldn't store any data unless you use Fly Volumes. So it could
       | be that Fly is downloading the 100MB model each time it cold-
       | boots.
       | 
        | If that's the case, a multi-stage Dockerfile could help bundle
        | the model in, and perhaps reduce cold-boot time even further.
        
         | Szpadel wrote:
          | Why would multi-stage make any difference? Isn't multi-stage
          | just an implementation detail of the resulting image? AFAIK
          | only build caching could benefit; pulling the image should be
          | unchanged with or without a multi-stage build.
        
           | timenova wrote:
            | I am guessing it would be possible to download the model
            | from HuggingFace in the build step. I'm not sure, though.
            | Plus, the Cog image is 14 GB (mentioned in the article); I
            | hope there's a way to reduce that.
        
             | Szpadel wrote:
              | Sure, but I still don't understand why multi-stage
              | should make any difference, because at the end it's just
              | flattened into a layer by the copy between stages.
              | 
              | So the size problem is exactly the same: a RUN with curl
              | produces a layer of the same size as a COPY --from=stage
              | layer.
              | 
              | Am I missing something?
              | 
              | The only benefit I can see is build-cache reuse: the
              | download is independent of building the code, so you
              | won't re-download 14 GB when you change code. Is that
              | what you had in mind?
        
               | timenova wrote:
               | You're right! Thanks for pointing this out.
               | 
               | I don't know Docker that well. I literally figured it out
               | as I went along to deploy on Fly...
        
               | Szpadel wrote:
               | No problem.
               | 
                | I didn't want to sound snarky; I thought I wasn't
                | aware of some cool Docker optimization hack :)
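
        A minimal sketch of the build-cache idea discussed above: fetch
        the weights in their own build step so that later code changes
        reuse the cached 14 GB layer instead of re-downloading it. The
        repo id and target path are placeholders, and it assumes
        huggingface_hub is installed in the build environment.

            # fetch_weights.py - run from its own Dockerfile step (e.g. a
            # dedicated RUN layer) before the application code is copied in.
            import os
            from huggingface_hub import snapshot_download

            MODEL_REPO = "org/some-100mb-model"   # placeholder repo id
            TARGET_DIR = "/opt/model"             # placeholder path baked into the image

            def fetch_weights() -> str:
                os.makedirs(TARGET_DIR, exist_ok=True)
                # Download every file in the repo at build time, so nothing
                # has to be fetched from HuggingFace on cold boot.
                return snapshot_download(repo_id=MODEL_REPO, local_dir=TARGET_DIR)

            if __name__ == "__main__":
                print(f"weights cached at {fetch_weights()}")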
        
             | simonw wrote:
              | Fly recommend you use one of their mountable volumes for
              | large model files rather than building them into the
              | Docker image - you get faster cold starts that way, plus
              | I think the images have size limits.
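
        A rough sketch of the volume approach described above, assuming
        a Fly Volume mounted at /data (the mount point, repo id, and
        directory layout are placeholders): weights are downloaded once
        onto the volume and read straight from disk on later cold boots.

            import os
            from huggingface_hub import snapshot_download

            VOLUME_DIR = "/data/models"           # placeholder: Fly Volume mount point
            MODEL_REPO = "org/some-100mb-model"   # placeholder repo id

            def model_path() -> str:
                local = os.path.join(VOLUME_DIR, MODEL_REPO.replace("/", "--"))
                if not os.path.isdir(local):
                    # First boot against this volume: populate the cache once.
                    os.makedirs(local, exist_ok=True)
                    snapshot_download(repo_id=MODEL_REPO, local_dir=local)
                # Later cold boots skip the network and load from the volume.
                return local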
        
         | venkii wrote:
          | Yes, correct: the 100 MB is being downloaded on every boot!
          | 
          | I tested it that way initially because it's the most naive
          | implementation. The right implementation would bundle it in.
          | 
          | But I ended up primarily reporting timings that stop counting
          | as soon as control is handed over to user-generated code,
          | since that's the number you care about the most.
        
           | timenova wrote:
           | That's fair.
           | 
            | Perhaps a good follow-up would be to benchmark bundling it
            | into the Docker image vs. using Fly Volumes, as Simon
            | suggested in a sibling comment.
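
        For reference, one way to capture the "control handed to user
        code" instant discussed above (not necessarily the article's
        method) is to timestamp the very first statement of the app's
        entrypoint and compare it with the time the client issued the
        start request:

            # entrypoint.py - hypothetical user code; the first statement
            # records when the platform actually handed control over,
            # before any heavy imports or model downloads happen.
            import time

            USER_CODE_STARTED = time.time()   # epoch seconds at hand-off

            def platform_overhead(start_request_epoch: float) -> float:
                # Everything between the client asking for a machine and this
                # file beginning to execute: roughly the point at which the
                # article's timings stop counting.
                return USER_CODE_STARTED - start_request_epoch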
        
       | iambateman wrote:
       | To make sure I understand...this would provide a private API
       | endpoint for a developer to call an LLM model in a serverless
       | way?
       | 
        | They could call it and just pay for the time spent, not for a
        | persistent server.
        
         | dartos wrote:
          | Sure, there are platforms that do this (Replicate).
          | 
          | The issue is cold-start time for custom models. It takes time
          | to pull in the dependencies and the model, and to load the
          | model into memory.
         | 
         | It's difficult because someone has to pay for the cold start
         | time.
        
       | moscicky wrote:
        | Replicate has really long boot times for custom models - 2-3
        | minutes if you are lucky, and up to 30 minutes if they are
        | having problems.
        | 
        | While we loved the dev experience, we just couldn't make it
        | work with frequent switching between models / LoRA weights.
        | 
        | We switched to Beam (https://www.beam.cloud) and it's so much
        | better. Their cold-start times are consistently short, and they
        | provide a caching layer for model files (i.e. volumes) which
        | makes switching between models a breeze.
        | 
        | Beam also has a much better pricing policy. For custom models
        | on Replicate you pay for boot time (which is very long!), so
        | you end up paying a lot for a single request.
        | 
        | With Beam you only pay for inference and idle time.
        
         | bfirsh wrote:
         | Founder of Replicate here. Our cold boots do suck (see my other
         | comment), but you aren't charged for the boot time on
         | Replicate, just the time that your `setup()` function runs.
         | 
         | Incentives are aligned for us to make it better. :)
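
        For context on the `setup()` distinction: in Cog, weight loading
        conventionally lives in `setup()`, which runs once per boot,
        while `predict()` handles each request. A minimal sketch (the
        model name and pipeline are placeholders, not anything specific
        to Replicate's billing):

            from cog import BasePredictor, Input

            class Predictor(BasePredictor):
                def setup(self):
                    # Runs once when the container boots; per the comment
                    # above, this is the portion of the boot that is billed.
                    from transformers import pipeline  # placeholder framework
                    self.pipe = pipeline("text-classification",
                                         model="org/some-100mb-model")

                def predict(self, text: str = Input(description="Input text")) -> str:
                    # Runs per request, with the model already in memory.
                    return self.pipe(text)[0]["label"]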
        
       | treesciencebot wrote:
       | Just as a top-level disclaimer, I'm working at one of the
       | companies in "this" space (serverless GPU compute) so take
       | anything I say with a grain of salt.
       | 
        | This is one of the things we (at https://fal.ai) are working
        | very hard to solve. Because of ML workloads and their multi-GB
        | environments (torch, all those cuda/cudnn libraries, and
        | anything else they pull in), it is a real challenge just to get
        | the container to start in a reasonable time frame. We had to
        | write our own shared Python virtual environment runtime using
        | SquashFS, distributed through a peer-to-peer caching system, to
        | bring it down to the sub-second mark.
        | 
        | After the container boots, there is the aspect of storing model
        | weights, which IMHO is less challenging since it is just big
        | blobs of data (compared to Python environments, where there are
        | thousands of smaller files, each of which might be sequentially
        | read and incur a really major latency penalty). Distributing
        | them once we had the system above was super easy since, just
        | like the squashfs'd virtual environments, they are immutable
        | data blobs.
        | 
        | We are also starting to play with GPUDirect on some of our bare
        | metal clusters and hopefully planning to expose it to our
        | customers, which is especially important if your model is 40 GB
        | or larger. At that point, you are technically operating at
        | PCIe/SXM speeds, which is ~2-3 seconds for a model of that
        | size.
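
        Back-of-the-envelope for the 40 GB figure above, assuming the
        bytes are already in host RAM and using rough, assumed link
        bandwidths (not fal.ai measurements):

            # Rough host-to-GPU transfer-time estimate for large weights.
            MODEL_GB = 40
            LINKS_GB_PER_S = {
                "PCIe Gen4 x16 (practical)": 25.0,   # assumed bandwidth
                "PCIe Gen5 x16 (practical)": 50.0,   # assumed bandwidth
            }

            for name, bw in LINKS_GB_PER_S.items():
                print(f"{name}: {MODEL_GB / bw:.1f} s for {MODEL_GB} GB")
            # ~1.6 s and ~0.8 s respectively; with filesystem and driver
            # overhead that lands in the 2-3 s ballpark quoted above.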
        
       | jonnycoder wrote:
        | I wrote a review of Replicate last week, and the cog I was
        | using, insanely-fast-whisper, had boot times exceeding 4
        | minutes. I wish there was more we could observe to find out the
        | cause of the slow start-up times. I suspected it was the
        | dependencies.
       | 
       | https://open.substack.com/pub/jonolson/p/replicatecom-review...
        
       | mardifoufs wrote:
        | Cold start is super bad on Azure Machine Learning endpoints, at
        | least it was when we tried to use it a few months ago - even
        | before it gets to the environment-loading step. It seems like
        | even these results are better than what we got on AML, so it's
        | impressive imo!
        
       | harrisonjackson wrote:
       | I spent a couple months hacking on a dreambooth product that let
       | users train a model on their own photos and then generate new
       | images w/ presets or their own prompts.
       | 
       | The main costs were:
       | 
       | - gpu time for training
       | 
       | - gpu time for inference
       | 
       | - storage costs for the users' models
       | 
       | - egress fees to download model
       | 
       | I ended up using banana.dev and runpod.io for the serverless
       | gpus. Both were great, easy to hook into, and highly
       | customizable.
       | 
       | I spent a bunch of time trying to optimize download speed, egress
       | fees, gpu spot pricing, gpu location, etc.
       | 
        | R2 is cheaper than S3 - free egress! But the download speeds
        | were MUCH worse than S3 - enough that it ended up not even
        | being competitive.
       | 
       | It was frequently cheaper to use more expensive GPUs w/ better
       | location and network speeds. That factored more into the pricing
       | than how long the actual inference took on each instance.
       | 
        | Likewise, if your most important metric is time from boot to
        | starting inference, then network access might be the limiting
        | factor.
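
        A sketch of the pricing point above: when fetching the model
        dominates the billed time, a pricier GPU with a faster network
        can be cheaper per request than a cheap GPU on a slow link. The
        hourly rates, speeds, and model size are illustrative
        assumptions:

            # Per-request cost = hourly rate * (download time + inference time),
            # assuming the model is fetched on every cold start.
            MODEL_MB = 5 * 1024        # assumed 5 GB fine-tuned model
            INFERENCE_S = 10           # assumed time for the job itself

            def request_cost(usd_per_hour: float, download_mb_s: float) -> float:
                billed_s = MODEL_MB / download_mb_s + INFERENCE_S
                return usd_per_hour * billed_s / 3600

            print(f"cheap GPU, slow link:   ${request_cost(1.20, 30):.3f}")
            print(f"pricier GPU, fast link: ${request_cost(2.50, 300):.3f}")
            # With these made-up numbers the download dominates, so the more
            # expensive instance works out cheaper per request (~$0.06 vs ~$0.02).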
        
       | hantusk wrote:
        | Not affiliated, but I'm a happy modal.com user; it has very
        | fast cold starts for the few demos I run with them.
        
       | bfirsh wrote:
       | Founder of Replicate here. Yeah, our cold boots suck.
       | 
       | Here's what we're doing:
       | 
       | - Fine-tuned models boot fast: https://replicate.com/blog/fine-
       | tune-cold-boots
       | 
       | - We've optimized how weights are loaded in GPU memory for some
       | of the models we maintain, and we're going to open this up to all
       | custom models soon.
       | 
       | - We're going to be distributing images as individual files
       | rather than as image layers, which makes it much more efficient.
       | 
        | Although our cold boots do suck, the comparison in this blog
        | post is apples to oranges, because Fly machines are much lower
        | level than Replicate models.
        | 
        | The blog post seems to be using a stopped Fly machine, which
        | has already pulled the Docker image onto a node. When it
        | starts, all it's doing is starting the Docker image.
        | 
        | On Replicate, models auto-scale on a cluster. A model could be
        | running anywhere in our cluster, so we have to pull the image
        | to that node when it starts.
       | 
       | Something funny seems to be going on with the latency too. Our
       | round-trip latency is about 200ms for a similar model. Would be
       | curious to see the methodology, or maybe something was broken on
       | our end.
       | 
       | But we do acknowledge the problem. It's going to get better soon.
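
        For anyone wanting to reproduce the latency numbers under
        discussion, a minimal client-side measurement using the
        Replicate Python client (the model identifier and input are
        placeholders; this captures end-to-end wall time as seen by the
        caller):

            import time
            import replicate  # assumes REPLICATE_API_TOKEN is set

            MODEL = "owner/model-name:versionhash"   # placeholder identifier

            start = time.monotonic()
            output = replicate.run(MODEL, input={"text": "hello"})
            elapsed = time.monotonic() - start

            # On a cold start this includes queueing, boot, and setup(); on a
            # warm model it approximates the round-trip latency mentioned above.
            print(f"end-to-end: {elapsed:.2f} s -> {output!r}")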
        
       ___________________________________________________________________
       (page generated 2024-02-17 23:00 UTC)