[HN Gopher] Launch HN: Outerport (YC S24) - Instant hot-swapping...
___________________________________________________________________
Launch HN: Outerport (YC S24) - Instant hot-swapping for AI model
weights
Hi HN! We're Towaki and Allen, and we're building Outerport
(https://outerport.com), a distribution network for AI model
weights that enables 'hot-swapping' of AI models to save on GPU
costs. 'Hot-swapping' lets you serve different models on the same
GPU machine with only ~2 second swap times (~150x faster than
baseline). You can see this in action in a live demo, where you can
try the same prompts on different open source large language
models, at https://hotswap.outerport.com, and see the docs at
https://docs.outerport.com.

Running AI models on the cloud is expensive. Outerport came from
our own experience building AI services and struggling with the
cost. Cloud GPUs are charged by the amount of time used. A long
start-up time (from loading models into GPU memory) means that to
serve requests quickly, we need to acquire extra GPUs with models
pre-loaded for spare capacity (i.e. 'overprovision'). The time
spent loading models also adds to the cost. Both lead to
inefficient use of expensive hardware.

The long start-up times are caused by how
massive modern AI models are, particularly large language models.
These models are often several gigabytes to terabytes in size.
Their sizes continue to grow as models evolve, exacerbating the
problem. GPU capacity also needs to adapt dynamically to demand,
which complicates things further. Starting up a new GPU machine is
time consuming, and sending a large model to it is also time
consuming. Traditional container-based solutions and orchestration
systems (like Docker and Kubernetes) are not optimized for these
large, storage-intensive AI models; they are designed for smaller,
more numerous containerized applications (usually 50MB to 1GB in
size). There needs to be a solution designed specifically for model
weights (floating point arrays) running on GPUs, one that can take
advantage of things like layer sharing, caching, and compression.

We made Outerport, a specialized system to manage and deploy AI
models, to solve these problems and help save GPU costs. Outerport
is a caching system for model weights, allowing read-only models to
be cached in pinned RAM for fast loading into GPU. Outerport is also
hierarchical, maintaining a cache across S3 to local SSD to RAM to
GPU memory, optimizing for reduced data transfer costs and load
balancing. Within Outerport, models are managed by a dedicated
daemon process that handles transfers to the GPU, loads models from
the registry, and orchestrates the 'hot-swapping' of multiple models
on one machine.
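
To make the hierarchy concrete, here is a rough sketch of the
tiering logic in Python (heavily simplified and not our actual
implementation; the cache path, bucket name, and file layout below
are made up for illustration):

    import os
    import boto3
    import torch

    CACHE_DIR = "/var/cache/models"   # local SSD tier (path is made up)

    def get_model_weights(model_id, bucket="example-model-bucket"):
        """S3 -> local SSD -> pinned RAM; each tier is only hit on a miss.
        Assumes the checkpoint is a flat dict of tensors saved with torch.save."""
        local_path = os.path.join(CACHE_DIR, f"{model_id}.pt")
        if not os.path.exists(local_path):                    # SSD miss
            boto3.client("s3").download_file(bucket, f"{model_id}.pt", local_path)
        state = torch.load(local_path, map_location="cpu")    # SSD -> RAM
        # Pin the host copy so RAM -> GPU copies can use fast, async DMA.
        return {k: v.pin_memory() for k, v in state.items()}

    def to_gpu(pinned_state, device="cuda"):
        """RAM -> GPU; non_blocking=True overlaps copies with other work."""
        return {k: v.to(device, non_blocking=True)
                for k, v in pinned_state.items()}

Pinned (page-locked) memory is what makes the last hop fast: the
GPU can DMA directly out of it instead of going through an extra
pageable-memory staging copy, and the transfer can overlap with
other work.
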
'Hot-swapping' lets you provision a single GPU machine to be
'multi-tenant', so that multiple services with different models can
run on the same machine. For example, this makes it possible to A/B
test two different models, or to serve a text generation endpoint
and an image generation endpoint from the same machine.
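
As a toy illustration of what the swap itself looks like (again
heavily simplified; the class below is illustrative and not our
actual implementation):

    import torch

    class HotSwapPool:
        """Toy hot-swap pool: every model's weights stay pinned in CPU
        RAM, and only the active model's weights occupy GPU memory."""

        def __init__(self, models, device="cuda"):
            self.device = device
            self.models = {name: m.eval() for name, m in models.items()}
            # One-time copy of every weight into page-locked host memory.
            self.pinned = {
                name: {k: v.detach().cpu().pin_memory()
                       for k, v in m.state_dict().items()}
                for name, m in self.models.items()
            }
            self.active = None

        @torch.no_grad()
        def activate(self, name):
            if name == self.active:
                return self.models[name]
            if self.active is not None:
                # Evict the old model by pointing its tensors back at the
                # pinned host copies; its GPU memory can then be reclaimed.
                self._rebind(self.active, self.pinned[self.active])
                torch.cuda.empty_cache()
            # Fast host-to-device copies straight out of the pinned cache.
            gpu_state = {k: v.to(self.device, non_blocking=True)
                         for k, v in self.pinned[name].items()}
            self._rebind(name, gpu_state)
            torch.cuda.synchronize()
            self.active = name
            return self.models[name]

        def _rebind(self, name, tensors):
            model = self.models[name]
            named = list(model.named_parameters()) + list(model.named_buffers())
            for key, t in named:
                if key in tensors:
                    t.data = tensors[key]

A request router then just calls pool.activate(name) before
dispatching each request, and everything that is not active stays
out of GPU memory.
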
We have been busy running simulations to determine the cost
reductions we can get by using this multi-model service scheme
instead of multiple single-model services. Our initial simulation
results show that we can achieve a 40% reduction in GPU running
time costs. This improvement comes from the multi-model service's
ability to smooth out peaks in traffic, enabling more effective
horizontal scaling. Overall, less time is wasted on acquiring
additional machines and on model loading, significantly saving
costs.
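
To see the intuition behind the traffic-smoothing effect, here is a
toy back-of-the-envelope version of the idea (the numbers and
distributions below are made up, and this is much cruder than the
simulations we actually ran):

    import numpy as np

    rng = np.random.default_rng(0)
    HOURS = 24 * 7            # one week, hourly buckets
    N_SERVICES = 5            # five model services (made-up setup)
    REQS_PER_GPU = 100        # requests/hour one GPU can serve (assumed)

    # Bursty demand per service, each peaking at a different time of day.
    hours = np.arange(HOURS)
    base = rng.uniform(50, 200, N_SERVICES)     # average requests/hour
    phase = rng.uniform(0, 24, N_SERVICES)      # hour of peak per service
    lam = base * (1 + 0.8 * np.sin(2 * np.pi * (hours[:, None] - phase) / 24))
    demand = rng.poisson(lam)                   # shape (HOURS, N_SERVICES)

    # One model per deployment: each service provisions for its own peak.
    gpus_separate = sum(int(np.ceil(demand[:, i].max() / REQS_PER_GPU))
                        for i in range(N_SERVICES))

    # Multi-model deployment: provision for the peak of the combined load,
    # since hot-swapping lets any GPU serve any of the models.
    gpus_pooled = int(np.ceil(demand.sum(axis=1).max() / REQS_PER_GPU))

    print("GPUs for separate single-model deployments:", gpus_separate)
    print("GPUs for one pooled multi-model deployment:", gpus_pooled)

Because the per-model peaks do not line up, the peak of the combined
load is well below the sum of the individual peaks, which is where
most of the savings come from (on top of avoiding repeated model
loads).
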
Our hypothesis is that the cost savings are substantial enough to
make a viable business while still saving customers significant
amounts of money. We think there are lots of exciting directions to
take from here--from more sophisticated compression algorithms to
providing a central platform for model management and governance.
Towaki worked on ML systems and model compression at NVIDIA, and
Allen used to do research in operations research, which is why we're
so excited about this problem as something that combines both
fields.

We're super excited to share Outerport with you all. We also intend
to release as much of this as possible under an open-core model when
we're ready. We would love to know what you think--and to hear about
your experiences working on this or related problems, or any other
ideas you might have!
Author : tovacinni
Score : 53 points
Date : 2024-08-21 16:55 UTC (6 hours ago)
| dbmikus wrote:
| This is very cool! Most of the work I've seen on reducing
| inference costs has been via things like LoRAX that lets multiple
| fine-tunes share the same underlying base model.
|
| Do you imagine Outerport being a better fit for OSS model hosts
| like Replicate, Anyscale, etc. or for companies that are trying
| to host multiple models themselves?
|
| The use case you mentioned speaks more to the latter, but it seems
| like the value at scale is with model-hosting-as-a-service
| companies.
| tovacinni wrote:
| Thanks!
|
| I think both are fits- we've gotten interest from both types of
| companies, and our first customer is an "OSS model host".
|
| Our 40% savings result is also specifically for the case of five
| model services, so there could be non-trivial cost reduction
| even with a reasonably small number of models.
| samstave wrote:
| Could you craft a model-weight as a preamble to a prompt? So
| you could submit prompts through a layer which pre-warms the
| model weights for you based on the prompt - then take the
| output into some next step in your workflow and apply a new
| weight preamble depending on what the next phase is?
|
| Like, for a particular portion of the workflow - assume some
| crawler of weird insurance-claims data at scale - and you
| want particular weights for the aspects of certain logic that
| you're running to search for fraud.
| tovacinni wrote:
| That's a super neat idea- we should in fact be able to use
| this same system to support the orchestration of a 'system
| prompt caching' sort of thing (across deployments). I'll
| put this on my 'things to hack on' list :)
| harrisonjackson wrote:
| > Outerport is a caching system for model weights, allowing read-
| only models to be cached in pinned RAM for fast loading into GPU.
| Outerport is also hierarchical, maintaining a cache across S3 to
| local SSD to RAM to GPU memory, optimizing for reduced data
| transfer costs and load balancing.
|
| This is really cool. Are the costs to run this mainly storage or
| how much compute is actually tied up in it?
|
| The time/cost to download models on a GPU cloud instance really
| adds up when you are paying per second.
| tovacinni wrote:
| Thanks! If you mean the costs for users of Outerport, it'll be
| a subscription model for our hosted registry (with a limit on
| storage / S3 egress) and a license model for self-hosting the
| registry. So mainly storage, since the idea is also to minimize
| the egress costs associated with the compute tied up in it!
| bravura wrote:
| Do all variations of the model need to have the same
| architecture?
|
| Or can they be different types of models with different numbers
| of layers, etc.?
| tovacinni wrote:
| Variants do not have to be the same architecture- the demo
| (https://hotswap.outerport.com/) runs on a couple of different
| open source architectures.
|
| That being said, there is some smart caching / hashing on
| layers, so that if you do have models that are similar (e.g. a
| fine-tuned model where only some layers were changed), it'll
| minimize storage and transfer by reusing those weights.
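|
| The gist of the layer reuse (hugely simplified, and not our
| actual code) is content-addressing each weight tensor, roughly
| like:
|
|     import hashlib
|     import torch
|
|     def layer_digest(tensor):
|         # Hash the raw bytes of a weight tensor so identical layers
|         # (e.g. untouched base-model layers in a fine-tune) map to
|         # the same entry. Assumes a dtype that numpy understands.
|         t = tensor.detach().cpu().contiguous()
|         return hashlib.sha256(t.numpy().tobytes()).hexdigest()
|
|     def dedup(state_dicts):
|         # Store each unique layer blob once; every model keeps a
|         # manifest of layer name -> digest.
|         store, manifests = {}, {}
|         for model_name, sd in state_dicts.items():
|             manifest = {}
|             for layer_name, tensor in sd.items():
|                 digest = layer_digest(tensor)
|                 store.setdefault(digest, tensor)
|                 manifest[layer_name] = digest
|             manifests[model_name] = manifest
|         return store, manifests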
___________________________________________________________________
(page generated 2024-08-21 23:00 UTC)