[HN Gopher] Launch HN: Outerport (YC S24) - Instant hot-swapping...
       ___________________________________________________________________
        
       Launch HN: Outerport (YC S24) - Instant hot-swapping for AI model
       weights
        
       Hi HN! We're Towaki and Allen, and we're building Outerport
       (https://outerport.com), a distribution network for AI model
       weights that enables 'hot-swapping' of AI models to save on GPU
       costs.

       'Hot-swapping' lets you serve different models on the same GPU
       machine with only ~2 second swap times (~150x faster than
       baseline). You can see this in action in a live demo, where you
       can try the same prompts on different open-source large language
       models, at https://hotswap.outerport.com, and you can read the
       docs at https://docs.outerport.com.

       Running AI models in the cloud is expensive. Outerport came out
       of our own experience building AI services and struggling with
       the cost.

       Cloud GPUs are billed by time used. A long start-up time (from
       loading models into GPU memory) means that, to serve requests
       quickly, we need to acquire extra GPUs with models pre-loaded as
       spare capacity (i.e. 'overprovision'). The time spent loading
       models also adds to the bill. Both lead to inefficient use of
       expensive hardware.
       The long start-up times are caused by how massive modern AI
       models are, particularly large language models. These models are
       often several gigabytes to terabytes in size, and they keep
       growing as models evolve, which exacerbates the issue.

       GPU capacity also needs to adapt dynamically to demand, which
       complicates things further. Starting up a new machine with
       another GPU is time-consuming, and sending a large model to it
       is time-consuming too.

       Traditional container-based solutions and orchestration systems
       (like Docker and Kubernetes) are not optimized for these large,
       storage-intensive AI models: they are designed for smaller, more
       numerous containerized applications (usually 50MB to 1GB in
       size). There needs to be a solution designed specifically for
       model weights (floating-point arrays) running on GPUs, one that
       can take advantage of things like layer sharing, caching, and
       compression.
       We made Outerport, a specialized system for managing and
       deploying AI models, to solve these problems and help cut GPU
       costs.

       Outerport is a caching system for model weights: read-only
       models are cached in pinned RAM for fast loading onto the GPU.
       It is also hierarchical, maintaining a cache across S3, local
       SSD, RAM, and GPU memory, optimized to reduce data transfer
       costs and to balance load.
       Within Outerport, models are managed by a dedicated daemon
       process that handles transfers to the GPU, loads models from the
       registry, and orchestrates the 'hot-swapping' of multiple models
       on one machine.
       'Hot-swapping' lets you provision a single GPU machine as
       'multi-tenant', so that multiple services with different models
       can run on the same machine. For example, this can facilitate
       A/B testing of two different models, or running a text
       generation endpoint and an image generation endpoint on the same
       machine.
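
       As a rough approximation of the swap mechanic in plain PyTorch
       (a sketch under our own assumptions; the real system does this
       through the daemon, across processes):

           # Two models staged read-only in pinned RAM, and one
           # GPU-resident slot that is overwritten on demand.
           import time
           import torch
           import torch.nn as nn

           def stage(model: nn.Module) -> dict:
               # Keep a read-only copy of the weights in pinned host RAM.
               return {k: v.detach().cpu().pin_memory()
                       for k, v in model.state_dict().items()}

           # Stand-ins for real multi-GB models (same shape for simplicity).
           cache = {"a": stage(nn.Linear(4096, 4096)),
                    "b": stage(nn.Linear(4096, 4096))}
           active = nn.Linear(4096, 4096).to("cuda")  # the GPU-resident slot

           def hot_swap(name: str) -> None:
               start = time.perf_counter()
               # Pinned-RAM -> GPU copies, written into the resident module.
               active.load_state_dict({k: v.to("cuda", non_blocking=True)
                                       for k, v in cache[name].items()})
               torch.cuda.synchronize()
               print(f"swapped to {name} in {time.perf_counter() - start:.3f}s")

           hot_swap("a")  # serve traffic for model A...
           hot_swap("b")  # ...then swap in model B on demand
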
       We have been busy running simulations to estimate the cost
       reduction from this multi-model service scheme compared with
       multiple single-model services. Our initial simulation results
       show that we can achieve a 40% reduction in GPU running-time
       costs. The improvement comes from the multi-model service's
       ability to smooth out traffic peaks, which enables more
       effective horizontal scaling. Overall, less time is wasted
       acquiring additional machines and loading models, which saves
       significant cost. Our hypothesis is that the savings are
       substantial enough to make a viable business while still saving
       customers significant amounts of money.
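
       As a toy illustration of the peak-smoothing effect (made-up
       numbers, not our simulation):

           # Two services that peak at different times of day.
           demand_a = [1, 1, 4, 4, 1, 1]  # GPUs needed per time slot
           demand_b = [4, 4, 1, 1, 4, 4]

           # Separate single-model fleets: slow model loads force each
           # service to hold its own peak capacity all day.
           separate = (max(demand_a) + max(demand_b)) * len(demand_a)  # 48

           # A shared multi-model fleet with ~2s swaps can track the
           # combined demand slot by slot instead.
           pooled = sum(a + b for a, b in zip(demand_a, demand_b))     # 30

           print(f"GPU-slots saved: {1 - pooled / separate:.0%}")      # 38%
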
       We think there are lots of exciting directions to take from
       here, from more sophisticated compression algorithms to a
       central platform for model management and governance. Towaki
       worked on ML systems and model compression at NVIDIA, and Allen
       used to do research in operations research, which is why we're
       so excited about this problem: it combines both.

       We're super excited to share Outerport with you all. We also
       intend to release as much of this as possible under an open-core
       model when we're ready. We would love to know what you think,
       and to hear about your experiences with this or related
       problems, or any other ideas you might have!
        
       Author : tovacinni
       Score  : 53 points
       Date   : 2024-08-21 16:55 UTC (6 hours ago)
        
       | dbmikus wrote:
       | This is very cool! Most of the work I've seen on reducing
       | inference costs has been via things like LoRAX, which lets
       | multiple fine-tunes share the same underlying base model.
       | 
       | Do you imagine Outerport being a better fit for OSS model hosts
       | like Replicate, Anyscale, etc. or for companies that are trying
       | to host multiple models themselves?
       | 
       | The use case you mentioned speaks more to the latter, but it
       | seems like the value at scale is with the
       | model-hosting-as-a-service companies.
        
         | tovacinni wrote:
         | Thanks!
         | 
         | I think both are a fit - we've gotten interest from both
         | types of companies, and our first customer is an 'OSS model
         | host'.
         | 
         | Our 40% savings result is also specifically for the
         | five-model-service case, so there can be non-trivial cost
         | reduction even with a reasonably small number of models.
        
           | samstave wrote:
           | Could you craft model weights as a preamble to a prompt?
           | That is, you'd submit prompts through a layer that
           | pre-warms the model weights for you based on the prompt,
           | take the output into the next step of your workflow, and
           | apply a new weight preamble depending on what the next
           | phase is.
           | 
           | Like, for a particular portion of the workflow - assume
           | some crawler of weird insurance-claims data at scale - you
           | want particular weights for the aspects of certain logic
           | that you're running to search for fraud.
        
             | tovacinni wrote:
             | That's a super neat idea - we should in fact be able to
             | use this same system to support the orchestration of a
             | 'system prompt caching' sort of thing (across
             | deployments). I'll put this on my 'things to hack on'
             | list :)
        
       | harrisonjackson wrote:
       | > Outerport is a caching system for model weights, allowing read-
       | only models to be cached in pinned RAM for fast loading into GPU.
       | Outerport is also hierarchical, maintaining a cache across S3 to
       | local SSD to RAM to GPU memory, optimizing for reduced data
       | transfer costs and load balancing.
       | 
       | This is really cool. Are the costs to run this mainly storage,
       | or how much compute is actually tied up in it?
       | 
       | The time/cost to download models on a GPU cloud instance
       | really adds up when you are paying per second.
        
         | tovacinni wrote:
         | Thanks! If you mean the costs for users of Outerport: it'll
         | be a subscription model for our hosted registry (with a
         | limit on storage / S3 egress) and a license model for
         | self-hosting the registry. So mainly storage, since the idea
         | is also to minimize the egress costs associated with the
         | compute tied up in it!
        
       | bravura wrote:
       | Do all variations of the model need to have the same
       | architecture?
       | 
       | Or can they be different types of models with different
       | numbers of layers, etc.?
        
         | tovacinni wrote:
         | Variants do not have to be the same architecture - the demo
         | (https://hotswap.outerport.com/) runs on a couple of
         | different open-source architectures.
         | 
         | That said, there is some smart caching/hashing on layers,
         | such that if you do have models that are similar (e.g. a
         | fine-tuned model where only some layers were fine-tuned),
         | it'll minimize storage and transfer by reusing those
         | weights.
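         | 
         | In spirit, the dedup looks something like this (a simplified
         | sketch with made-up names, not the actual implementation):
         | 
         |     # Layer-level dedup via content hashing: identical layers
         |     # in a base model and its fine-tune hash to the same key,
         |     # so each unique layer is stored and moved only once.
         |     import hashlib
         |     import torch
         | 
         |     store = {}  # content hash -> tensor (the dedup'd blob store)
         | 
         |     def layer_key(t: torch.Tensor) -> str:
         |         # Hash the raw bytes of the tensor.
         |         data = t.detach().cpu().contiguous().numpy().tobytes()
         |         return hashlib.sha256(data).hexdigest()
         | 
         |     def register(state_dict: dict) -> dict:
         |         # Return a manifest mapping layer names to content hashes.
         |         manifest = {}
         |         for name, tensor in state_dict.items():
         |             key = layer_key(tensor)
         |             store.setdefault(key, tensor)  # keep each unique layer once
         |             manifest[name] = key
         |         return manifest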
        
       ___________________________________________________________________
       (page generated 2024-08-21 23:00 UTC)