[HN Gopher] Launch HN: Cerebrium (YC W22) - Serverless Infrastru...
       ___________________________________________________________________
        
       Launch HN: Cerebrium (YC W22) - Serverless Infrastructure Platform
       for ML/AI
        
        Hi HN,

        We are Michael & Jono, and we are building Cerebrium
        (https://www.cerebrium.ai), a serverless infrastructure platform
        for ML/AI applications - we make it easy for engineers to build,
        deploy and scale AI applications.

        Initially, we've been hyper-focused on the inference side of
        applications, but we're working on expanding our functionality
        to support training and data processing use cases--eventually
        covering the full AI development lifecycle.

        You can watch a quick Loom video of us deploying:
        https://www.loom.com/share/06947794b3bf4bb1bb21c87066dfcc66?...
        How we got here:

        Jono and I led the technical team at our previous e-commerce
        startup, which grew rapidly over a few years. As we scaled, we
        were tasked with building out ML applications to make the
        business more efficient. It was tough--every day felt like a
        defeat. We found ourselves stitching together AWS Lambda,
        SageMaker, and Prefect jobs (this stack alone was enough to make
        me want to give up). By the time we reached production, the
        costs were too high to maintain. Getting these applications live
        required a significant upfront investment of both time and
        money, making it inaccessible for most startups and scale-ups to
        attempt. We wanted to create something that would help us (and
        others like us) implement ML/AI applications easily and
        cost-effectively.
        The problem:

        There are a ton of challenges to tackle to realize our vision,
        but we've initially focused on a few key ones:

        1. GPUs are expensive - An A100 is 326 times the cost of a CPU,
        and companies are using LLMs like they're simple APIs.
        Serverless instances solve this to an extent, but minimizing
        cold starts is difficult.

        2. Local development - Engineers need local development
        environments to iterate quickly, but production-grade GPUs
        aren't available on consumer hardware. How can we make cloud
        deployments feel as fast as just saving a file locally and
        retrying?

        3. Cost to experiment - To run experiments we had to spin up EC2
        instances each day, recreate our environment, and run scripts.
        It was difficult to monitor logs and instance usage metrics, as
        well as to run large processing jobs or scale endpoints, without
        a significant infrastructure investment. Additionally, we often
        forgot to switch off instances, which cost us money!
        Our Approach

        We are focused on three core areas that we believe are the most
        important for any infrastructure platform:

        1. Performance:

        We have worked hard to get our added network latency below 50ms
        and the cold start of our average workloads to 2-4 seconds. Here
        are a few things we did to get our cold starts so low:

        - Container Runtime: We built our own container runtime that
        splits container images into two parts--metadata and data blobs.
        Metadata provides the file structure, while the actual data
        blobs are fetched on demand. This allows containers to start
        before the full image is downloaded. In the background, we
        prefetch the remaining blobs. (A rough sketch of this idea
        appears after this list.)

        - Caching: Once an image is on a machine, it's cached for future
        use. This makes subsequent container startups much faster. We
        also intelligently route requests to machines where the image is
        already cached.

        - Efficient Inference: We route requests to the optimal
        machines, prioritizing low-latency and high-throughput
        performance. If no containers are immediately available, we
        efficiently queue the requests through our task scheduling
        system.

        - Distributed Storage Cache: One of the most resource-intensive
        parts of AI workloads is loading models into VRAM. We use NVMe
        drives (which are much faster than network volumes) as close as
        possible to the machines, and we orchestrate workloads to nodes
        that already contain the necessary model weights where possible.
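        Here is that rough sketch of the lazy-loading idea. It is our
        illustration only, not Cerebrium's actual runtime: a manifest
        maps file paths to blob digests, a read pulls the needed blob on
        demand, and a background thread prefetches the rest. The
        REMOTE_BLOBS dict is a hypothetical stand-in for a registry.

          # Illustrative sketch of "metadata now, blobs on demand" --
          # not Cerebrium's runtime. REMOTE_BLOBS fakes a blob registry.
          import threading

          REMOTE_BLOBS = {
              "sha256:aaa": b"#!/usr/bin/env python3\nprint('hello')\n",
              "sha256:bbb": b"... model weights ...",
          }

          class LazyImage:
              def __init__(self, manifest):
                  # The manifest (metadata) is tiny: path -> blob digest.
                  # Having it is enough to "start" the container.
                  self.manifest = manifest
                  self.cache = {}   # digest -> bytes already on local disk
                  self.lock = threading.Lock()

              def read(self, path):
                  """Read a file, fetching its blob on first use."""
                  digest = self.manifest[path]
                  with self.lock:
                      if digest not in self.cache:
                          # Stand-in for a registry fetch over the network.
                          self.cache[digest] = REMOTE_BLOBS[digest]
                      return self.cache[digest]

              def prefetch_remaining(self):
                  """Background task: pull any blobs not cached yet."""
                  for digest in set(self.manifest.values()):
                      with self.lock:
                          self.cache.setdefault(digest, REMOTE_BLOBS[digest])

          image = LazyImage({"/app/main.py": "sha256:aaa",
                             "/models/weights.bin": "sha256:bbb"})
          threading.Thread(target=image.prefetch_remaining, daemon=True).start()
          print(image.read("/app/main.py"))  # usable before the full image is local

        Keeping the fetched blobs on the machine is also what makes the
        warm path fast: a later container using the same image finds its
        blobs already on local disk.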
        2. Developer Experience

        We built Cerebrium to help developers iterate as quickly as
        possible by streamlining the entire build and deployment
        process.

        To keep build times low, we use high-performance build machines
        and cache layers wherever possible. We've reduced first-time
        build times to an average of 2 minutes and 24 seconds, with
        subsequent builds completing in just 19 seconds.

        We also offer a wide range of GPU types--over 8 different
        options--so you can easily test performance and cost efficiency
        by adjusting a single line in your configuration file.

        To reduce friction, we've kept things simple. There are no
        custom Python decorators and no Cerebrium-specific syntax to
        learn. You just add a .toml file to define your hardware
        requirements and environment settings (an illustrative sketch
        follows below). This makes migrating onto our platform just as
        easy as migrating off it. We aim to impress you enough that you
        will want to stay.
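        For a rough idea of what that config might look like, here is an
        illustrative sketch. The section and field names below
        (deployment, hardware, scaling, compute, and so on) are our
        guesses at the shape of such a file, not the authoritative
        schema - see the Cerebrium docs for the real one. Swapping GPU
        type would then be a one-line change.

          # Illustrative cerebrium.toml sketch -- field names are
          # assumptions, not the official schema.
          [cerebrium.deployment]
          name = "my-llm-app"
          python_version = "3.11"

          [cerebrium.hardware]
          compute = "AMPERE_A100"   # swap GPU type by editing this line
          gpu_count = 1
          cpu = 2
          memory = 12.0             # GB

          [cerebrium.scaling]
          min_replicas = 0          # scale to zero when idle
          max_replicas = 5

          [cerebrium.dependencies.pip]
          transformers = "latest"
          torch = "latest"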
        3. Stability

        This is arguably more important than the first two areas - no
        one wants to get an email at 11pm or on a Saturday saying their
        application is down or degraded. Since April, we've maintained
        99.999% uptime. We have redundancies in place, monitoring,
        alerts, and a team that covers all time zones to resolve any
        issues quickly.

        Why Is This Hard?

        Building Cerebrium has been challenging because it involves
        solving multiple interconnected problems. It requires
        optimization at every step--from efficiently splitting images to
        fetching data on demand without introducing latency, handling
        distributed caching, optimizing our network stack, and ensuring
        redundancies, all while holding true to the three areas
        mentioned above.
        Pricing:

        We charge you exactly for the resources you need and only charge
        you when your code is running, i.e. usage-based. For example, if
        you specify that you need 1 A100 GPU with 2 CPUs and 12 GB of
        RAM, we charge you exactly for that and not for a full A100 (12
        CPUs and 148GB of memory). You can see more about our pricing
        here: http://www.cerebrium.ai/pricing
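        As a back-of-the-envelope illustration of that billing model
        (the per-second rates below are placeholders we made up, not
        Cerebrium's actual prices - see the pricing page for those):

          # Hypothetical per-second rates -- placeholders, not real prices.
          A100_GPU_PER_S = 0.0010    # $/s for the GPU
          CPU_CORE_PER_S = 0.00002   # $/s per vCPU
          RAM_GB_PER_S   = 0.000004  # $/s per GB of RAM

          def request_cost(runtime_s, gpus=1, cpus=2, ram_gb=12):
              """Bill only the resources specified, only while the code runs."""
              per_second = (gpus * A100_GPU_PER_S
                            + cpus * CPU_CORE_PER_S
                            + ram_gb * RAM_GB_PER_S)
              return runtime_s * per_second

          # A 3-second inference call on 1 A100, 2 vCPUs, 12 GB of RAM:
          print(f"${request_cost(3):.5f}")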
        What's Next?

        We're builders too, and we know how crucial support can be when
        you're working on something new. Here's what we've put together
        to support teams like yours:

        - $30 in free credit to start exploring. If you're onto something
        interesting but need more runway, just give us a shout - we'd be
        happy to extend that for compelling use cases.

        - We have worked hard on our docs to make onboarding easy, and
        we have an extensive GitHub repo covering AI voice agents, LLM
        optimizations, and much more.

        Docs:
        https://docs.cerebrium.ai/cerebrium/getting-started/introduc...
        GitHub examples:
        https://github.com/CerebriumAI/examples/tree/master

        If you have a question or hit a snag, you can directly reach out
        to the engineers who built the platform--we're here to help!
        We've also set up Slack and Discord communities where you can
        connect with other creators, share experiences, ask for advice,
        or just chat with folks building cool things.

        We're looking forward to seeing what you all build. Please give
        us feedback on what you would like us to improve or add!
        
       Author : za_mike157
       Score  : 35 points
       Date   : 2024-09-18 13:54 UTC (9 hours ago)
        
       | yuppiepuppie wrote:
       | Very nice demo!
       | 
       | When you ran it the first time, it took a while to load up. Do
       | subsequent runs go faster?
       | 
       | And what cloud provider are you all using under the hood? We work
       | in a specific sector that excludes us from using certain cloud
        | providers (i.e. AWS) at my company.
        
         | za_mike157 wrote:
         | You are correct! After the first request, an image will be on a
         | machine and it's cached for future use. This makes subsequent
         | container startups much faster. We also route requests to
          | machines where the image is already cached, and we dedupe
          | content between images to make startups faster.
         | 
          | We are running on top of AWS, but we can run on top of any
          | cloud provider, and we're also working on letting you use
          | your own cloud. Happy to hear more about your use case and
          | see if we can help you at all - email me at
          | michael@cerebrium.ai.
         | 
          | PS: I will note that vLLM has shockingly slow load times into
          | VRAM, which we are working on resolving.
        
       | ekojs wrote:
       | Congrats on the launch!
       | 
       | We're definitely looking for something like this as we're looking
       | to transition from Azure's (expensive) GPUs. I'm curious how you
       | stack against something like Runpod's serverless offering (which
       | seems quite a bit cheaper). Do you offer faster cold starts? How
        | long would a ~30GB model load take?
        
         | za_mike157 wrote:
          | Yes, RunPod does have cheaper pricing than us. However, they
          | don't allow you to specify your exact resources but rather
          | charge you for the full resource (see the A100 example
          | above), so depending on your resource requirements our
          | pricing could be competitive, since we charge you only for
          | the resources you use.
         | 
          | In terms of cold starts, they mention 250ms, but I am not
          | sure what workload that is on, or whether we measure cold
          | starts the same way. We have had quite a few customers tell
          | us we are quite a bit faster (2-4 seconds vs ~10 seconds),
          | although we haven't confirmed this ourselves.
         | 
          | For a 30GB model, we have a few ways to speed this up, such
          | as using the Tensorizer framework from CoreWeave and caching
          | model files in our distributed caching layer, but I would
          | need to test. We see reads of up to 1GB/s. If you tell me the
          | model you are running (if open-source) I can get results to
          | you - you can message me on our Slack/Discord community or
          | email me at michael@cerebrium.ai.
        
           | spmurrayzzz wrote:
            | > Yes, RunPod does have cheaper pricing than us. However,
            | they don't allow you to specify your exact resources but
            | rather charge you for the full resource (see the A100
            | example above), so depending on your resource requirements
            | our pricing could be competitive, since we charge you only
            | for the resources you use.
           | 
           | I may be misunderstanding your explanation a bit here, but
           | Runpod's serverless "flex" tier looks like the same model (it
           | only charges you for non-idle resources). And at that tier
            | they are still 2x cheaper for an A100; at your price point
            | with them you could rent an H100.
        
             | za_mike157 wrote:
              | Ah, I see they recently cut their pricing by 40%, so you
              | are correct - sorry about that. It seems we are more
              | expensive compared to their new pricing.
        
               | spmurrayzzz wrote:
                | FWIW, their most expensive flex price I've ever seen
                | for an 80GB A100 was $0.00130 back in January of this
                | year, which is still cheaper, albeit by a smaller
                | margin, if that's helpful at all for your own
                | competitive market analysis.
               | 
               | (Congrats on the launch as well, by the way).
        
           | risyachka wrote:
            | Yeah, Runpod's cold start is definitely not 250ms, not even
            | close. Maybe for some models, idk, but a Hugging Face model
            | with 8B params takes like 30 seconds to cold start in their
            | serverless "flash" configuration.
        
             | za_mike157 wrote:
              | Thanks for confirming! Our cold start, excluding model
              | load, is typically 2-4 seconds for HF models.
             | 
              | The only time it gets much longer is when companies have
              | done a lot with very specific CUDA implementations.
        
       | mceachen wrote:
       | Good luck on your launch! Your loom drops audio after 4m25s.
        
         | za_mike157 wrote:
         | Thanks for pointing that out!
        
       | ribhu97 wrote:
       | How does this compare to modal (modal.com)? Faster cold-start?
       | Easier config? Asking because I've used modal quite a bit for
       | everything from fine-tuning LLMs to running etl pipelines and it
       | works well for me, and I haven't found any real competitors for
       | them to even think of switching.
        
         | za_mike157 wrote:
         | Modal is a great platform!
         | 
         | In terms of cold starts, we seem to be very comparable from
         | what users have mentioned and tests we have run.
         | 
          | Easier config/setup is feedback we have gotten from users,
          | since we don't have any special syntax or a "Cerebrium way"
          | of doing things. This makes migration pretty easy and doesn't
          | lock you in, which some engineers appreciate. We just run
          | your Python code as is, with an extra .toml setup file.
         | 
          | Additionally, we offer AWS Inferentia/Trainium nodes, which
          | offer a great price/performance trade-off for many
          | open-source LLMs - even compared to using TensorRT/vLLM on
          | NVIDIA GPUs - and get rid of the scarcity problem. We plan to
          | support TPUs and others in the future.
         | 
          | We are listed on AWS Marketplace, as well as others, which
          | means you can subtract your Cerebrium cost from your
          | committed cloud spend.
         | 
          | Two things we are working on that will hopefully make us a
          | bit different are: (1) GPU checkpointing, and (2) running
          | compute in your own cluster, to use credits or address
          | privacy concerns.
         | 
          | Where Modal really shines is training/data-processing use
          | cases, which we currently don't support too well. However, we
          | do have this on our roadmap for the near future.
        
       | eh9 wrote:
       | Congratulations on the launch!
       | 
       | I just shared this on Slack and it looks like the site
       | description has a typo: "A serverless AI infrastructure platform
       | [...] customers experience a 40%+ cost savings as opposed to AWS
       | of GCP"
        
         | za_mike157 wrote:
         | Thank you - updated! My team makes fun of my spelling all the
         | time!
        
       | benjamaan wrote:
       | Congrats and thank you! We've been a happy customer since early
        | on. Although we don't have much usage - our products are mostly
        | R&D - having Cerebrium made it super easy to launch
        | cost-effectively on tight budgets and run our own models within
        | our apps.
       | 
        | The support is next level - the team is ready to dive into any
        | problem, response is super fast, and they have helped us solve
        | a bunch of dev problems that a normal platform probably
        | wouldn't.
       | 
       | Really excited to see this one grow!!
        
         | za_mike157 wrote:
         | Thank you - appreciate the kind words! Happy to continue
         | supporting you and the team.
        
       | tmshapland wrote:
       | We use Cerebrium for our Mixpanel for Voice AI product
       | (https://voice.canonical.chat). Great product. So much easier to
       | set up and more robust than other model hosting providers we've
        | tried (especially AWS!). Really nice people on the team, too.
        
         | za_mike157 wrote:
          | Thanks Tom! Excited to support you and the team as you grow.
        
       | abraxas wrote:
        | Would this be a direct competitor of Paperspace? If yes, what
        | do you feel are your strengths vis-a-vis Paperspace?
        
         | jono_irwin wrote:
          | There are definitely some parallels between Cerebrium and
          | Paperspace, but I don't think they are a direct competitor.
          | The biggest difference is that Paperspace doesn't have a
          | serverless offering, afaik.
         | 
         | Cerebrium abstracts some functionality - like streaming and
         | batching endpoints. I think you would need to build that
          | yourself on Paperspace.
        
       ___________________________________________________________________
       (page generated 2024-09-18 23:00 UTC)