[HN Gopher] Static IPs for Serverless Containers
       ___________________________________________________________________
        
       Static IPs for Serverless Containers
        
       Author : ekzhang
       Score  : 64 points
       Date   : 2024-12-02 20:04 UTC (2 hours ago)
        
 (HTM) web link (modal.com)
 (TXT) w3m dump (modal.com)
        
       | ekzhang wrote:
       | Hi! This is a blog post sharing some low-level Linux networking
       | we're doing at Modal with WireGuard.
       | 
        | As a serverless platform we hit a tricky tradeoff: we run
        | multi-tenant user workloads on machines around the world, and
        | each serverless function is an autoscaling container pool. How
        | do you give users static IPs for their functions without tying
        | those IPs to specific compute resources?
       | 
       | We needed a high-availability VPN proxy for containers and didn't
       | find one, so we built our own on top of WireGuard and open-
       | sourced it at https://github.com/modal-labs/vprox
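        | 
        | To give a flavor of the data plane, here's a minimal sketch (in
        | Go, using the wgctrl library; it's not our actual code, and the
        | key and subnet are placeholders) of how a proxy registers one
        | container as a WireGuard peer:
        | 
        |     package main
        | 
        |     import (
        |         "log"
        |         "net"
        | 
        |         "golang.zx2c4.com/wireguard/wgctrl"
        |         "golang.zx2c4.com/wireguard/wgctrl/wgtypes"
        |     )
        | 
        |     func main() {
        |         // Talk to the kernel's WireGuard devices (assumes a
        |         // "wg0" interface already exists on the proxy).
        |         client, err := wgctrl.New()
        |         if err != nil {
        |             log.Fatal(err)
        |         }
        |         defer client.Close()
        | 
        |         // Public key of the container-side peer (placeholder).
        |         peerKey, err := wgtypes.ParseKey("CLIENT_PUBLIC_KEY=")
        |         if err != nil {
        |             log.Fatal(err)
        |         }
        | 
        |         // Only packets sourced from this /32 are accepted
        |         // from the peer; per-peer AllowedIPs are what let many
        |         // containers share one proxy safely.
        |         _, peerNet, _ := net.ParseCIDR("10.0.0.2/32")
        | 
        |         // Append the peer; its traffic then exits through the
        |         // proxy's static public IP (SNAT rules not shown).
        |         err = client.ConfigureDevice("wg0", wgtypes.Config{
        |             Peers: []wgtypes.PeerConfig{{
        |                 PublicKey:  peerKey,
        |                 AllowedIPs: []net.IPNet{*peerNet},
        |             }},
        |         })
        |         if err != nil {
        |             log.Fatal(err)
        |         }
        |     }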
       | 
       | Let us know if you have thoughts! I'm relatively new to low-level
       | container networking, and we (me + my coworkers Luis and Jeffrey
       | + others) have enjoyed working on this.
        
         | xxpor wrote:
         | You're using containers as a multi-tenancy boundary for
         | arbitrary code?
        
           | ekzhang wrote:
           | We use gVisor! It's an open-source application security
           | sandbox spun off from Google. We work with the gVisor team to
            | get the features we need (notably GPUs / CUDA support) and
            | also help test gVisor upstream: https://gvisor.dev/users/
           | 
           | It's also used by Google Kubernetes Engine, OpenAI, and
           | Cloudflare among others to run untrusted code.
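            | 
            | For context on how it slots in: runsc is a drop-in OCI
            | runtime, so launching a sandboxed container looks much like
            | launching any other. A rough sketch (the paths are
            | placeholders, and exact flags vary by gVisor version):
            | 
            |     package main
            | 
            |     import (
            |         "log"
            |         "os/exec"
            |     )
            | 
            |     func main() {
            |         // Start an OCI bundle under gVisor's runsc instead
            |         // of runc; --nvproxy exposes NVIDIA GPUs (CUDA) to
            |         // the sandbox.
            |         cmd := exec.Command("runsc",
            |             "--nvproxy=true",
            |             "run",
            |             "--bundle", "/srv/bundle", // config.json+rootfs
            |             "demo-container")
            |         if out, err := cmd.CombinedOutput(); err != nil {
            |             log.Fatalf("runsc failed: %v\n%s", err, out)
            |         }
            |     }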
        
             | doctorpangloss wrote:
             | Are these the facts?
             | 
             | - You are using a container orchestrator like Kubernetes
             | 
             | - You are using gVisor as a container runtime
             | 
             | - Two applications from different users, containerized, are
             | scheduled on the same node.
             | 
             | Then, which of the following are true?
             | 
             | (1) Both have shared access to an NVIDIA GPU
             | 
             | (2) Both share access to the NVIDIA GPU via CUDA MPS
             | 
              | (3) If there were 2 or more MIG instances on a node with
              | a MIG-supporting GPU, the NVIDIA container toolkit shim
              | assigned a distinct MIG instance to each application
        
               | ekzhang wrote:
               | We don't use Kubernetes to run user workloads, we do use
               | gVisor. We don't use MIG (multi-instance GPU) or MPS. If
               | you run a container on Modal using N GPUs, you get the
               | entire N GPUs.
               | 
               | If you'd like to learn more, you can check out our docs
               | here: https://modal.com/docs/guide/gpu
               | 
                | Re: not using Kubernetes, we have our own custom
                | container runtime, written in Rust, with optimizations
                | like lazy loading of content-addressed file systems.
               | https://www.youtube.com/watch?v=SlkEW4C2kd4
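                | 
                | As a rough illustration of the content-addressing idea
                | (a simplified sketch in Go, not our Rust implementation,
                | and the names are made up): blobs are keyed by the hash
                | of their contents, so identical files are stored once
                | and pulled only the first time a container reads them.
                | 
                |     package blobcache
                | 
                |     import (
                |         "crypto/sha256"
                |         "encoding/hex"
                |         "fmt"
                |         "os"
                |         "path/filepath"
                |     )
                | 
                |     // Hypothetical local cache directory.
                |     const cacheDir = "/var/cache/blobs"
                | 
                |     // FetchFunc pulls a blob from remote storage.
                |     type FetchFunc func(digest string) ([]byte, error)
                | 
                |     // ReadBlob returns a file's contents by digest,
                |     // fetching from the remote store only on first
                |     // access (the "lazy" part).
                |     func ReadBlob(
                |         digest string, fetch FetchFunc,
                |     ) ([]byte, error) {
                |         path := filepath.Join(cacheDir, digest)
                |         if b, err := os.ReadFile(path); err == nil {
                |             return b, nil // cache hit, no network
                |         }
                |         b, err := fetch(digest) // pull on demand
                |         if err != nil {
                |             return nil, err
                |         }
                |         // Verify content matches its address.
                |         sum := sha256.Sum256(b)
                |         if hex.EncodeToString(sum[:]) != digest {
                |             return nil, fmt.Errorf("digest mismatch")
                |         }
                |         return b, os.WriteFile(path, b, 0o644)
                |     }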
        
               | doctorpangloss wrote:
               | Suppose I ask for two H100s. Will I have GPU P2P
               | capabilities?
        
               | thundergolfer wrote:
                | Yes, you will.
               | 
               | (I work at Modal.)
        
               | ekzhang wrote:
                | Yep! This is something we have internal tests for, haha.
                | You have good instincts that it can be tricky. Here's an
                | example of using that for multi-GPU training:
                | https://modal.com/docs/examples/llm-finetuning
        
               | ec109685 wrote:
                | If the NVIDIA driver has a bug, can one workload access
                | the data of another workload running on the same
                | physical machine?
               | 
               | E.g. it came up in this thread:
               | https://news.ycombinator.com/item?id=41672168
        
         | crishoj wrote:
         | Neat. I am curious what notable differences there are between
         | Modal and Tailscale.
        
           | ekzhang wrote:
            | Thanks. We did check out Tailscale, but they didn't quite
            | have what we were looking for: a high-availability component
            | that plugs into a low-level container runtime. (Which makes
            | sense; it's pretty different from their intended use case.)
           | 
           | Modal is actually a happy customer of Tailscale (but for
           | other purposes). :D
        
       | jimmyl02 wrote:
        | This is a really neat writeup! The design choice to have each
        | "exit node" manage its local WireGuard connections instead of
        | relying on a global control plane is clever.
        | 
        | An unfinished project I worked on
        | (https://github.com/redpwn/rvpn) was a bit more ambitious, with
        | a global control plane, and I quickly learned that supporting
        | multiple clients, especially for anything networking-related,
        | is a tarpit. The specific focus here on Linux / AWS, and the
        | results that focus makes achievable, are nice to see.
        | 
        | Networking is challenging, and this was a nice deep dive into
        | some of its internals. Thanks for sharing the details :)
        
         | ekzhang wrote:
          | Thanks for sharing. I'm interested in seeing what a global
          | control plane might look like; it seems like authentication
          | would be tricky to get right!
          | 
          | Controlling our worker environment (like the
          | `net.ipv4.conf.all.rp_filter` sysctl) is a big help for us,
          | since it means we don't have to deal with the full range of
          | possible network configurations.
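          | 
          | For the curious, that sysctl is just a file under /proc.
          | A sketch of what setting it looks like; the value here is
          | illustrative (loose mode, 2, is a common choice when replies
          | legitimately leave via a different interface than the one a
          | packet arrived on):
          | 
          |     package main
          | 
          |     import (
          |         "log"
          |         "os"
          |     )
          | 
          |     func main() {
          |         // Relax reverse-path filtering so asymmetrically
          |         // routed packets aren't silently dropped.
          |         err := os.WriteFile(
          |             "/proc/sys/net/ipv4/conf/all/rp_filter",
          |             []byte("2\n"), 0o644)
          |         if err != nil {
          |             log.Fatal(err)
          |         }
          |     }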
        
       | cactacea wrote:
          | Static IPs for allowlists need to die already. It's 2024, come
          | on, surely we can do better than this.
        
         | ekzhang wrote:
         | What would you suggest as an alternative?
        
           | thatfunkymunki wrote:
            | A more modern, zero-trust solution like mTLS authentication.
        
             | ekzhang wrote:
              | That makes sense; mTLS is great. Some services, like Google
              | Cloud SQL, have really good support for it.
              | https://cloud.google.com/sql/docs/mysql/configure-ssl-
              | instan...
              | 
              | It's not quite a zero-trust solution, though, due to the CA
              | chain of trust.
              | 
              | mTLS is security at a different layer than IP source
              | whitelisting, though. A lot of the companies we spoke to
              | would want both, as a defense-in-depth measure. Even with
              | mTLS, network whitelisting is relevant: if your certificate
              | were exposed, for instance, an attacker would still need to
              | be able to forge a source IP address to start a connection.
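              | 
              | To make the comparison concrete, here's a minimal Go
              | sketch of the server side of mTLS (the file names are
              | placeholders): the TLS handshake itself fails unless the
              | client presents a cert signed by your CA.
              | 
              |     package main
              | 
              |     import (
              |         "crypto/tls"
              |         "crypto/x509"
              |         "log"
              |         "net/http"
              |         "os"
              |     )
              | 
              |     func main() {
              |         // Trust only client certs issued by our own CA.
              |         caPEM, err := os.ReadFile("ca.pem")
              |         if err != nil {
              |             log.Fatal(err)
              |         }
              |         pool := x509.NewCertPool()
              |         pool.AppendCertsFromPEM(caPEM)
              | 
              |         http.HandleFunc("/",
              |             func(w http.ResponseWriter, r *http.Request) {
              |                 w.Write([]byte("hello, verified client\n"))
              |             })
              | 
              |         srv := &http.Server{
              |             Addr: ":8443",
              |             TLSConfig: &tls.Config{
              |                 ClientCAs: pool,
              |                 // Handshake fails without a valid cert.
              |                 ClientAuth: tls.RequireAndVerifyClientCert,
              |             },
              |         }
              |         // The server's own cert and key (placeholders).
              |         log.Fatal(srv.ListenAndServeTLS(
              |             "server.pem", "server-key.pem"))
              |     }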
        
               | thatfunkymunki wrote:
               | I'd put it in the zero-trust category if the server (or
               | owner of the server, etc) is the issuer of the client
               | certificate and the client uses that certificate to
               | authenticate itself, but I'll admit this is a pedantic
                | point that adds nothing of substance. The idea is that
                | you trust your own issuance of the certificate and the
                | various things that can be asserted based on how it was
                | issued (stored in a TPM, etc.), rather than any parameter
                | that could be controlled by the remote party.
        
       | eqvinox wrote:
       | I guess my first question is, why is this built on IPv4 rather
       | than IPv6...
        
         | ekzhang wrote:
          | Yeah, great question. This came up early in the design process.
         | A lot of our customers specifically needed IPv4 whitelisting.
         | For example, MongoDB Atlas (a very popular database vendor)
         | only supports IPv4.
         | https://www.mongodb.com/community/forums/t/does-mongodb-atla...
         | 
          | The architecture of vprox is pretty generic, though, and could
          | support IPv6 as well.
        
       | ATechGuy wrote:
       | > Modal has an isolated container runtime that lets us share each
       | host's CPU and memory between workloads.
       | 
        | Looks like Modal hosts workloads in containers, not VMs. How do
        | you enforce secure isolation with this design? A single kernel
        | vulnerability could lead to remote code execution on the host,
        | impacting all workloads. Am I missing anything?
        
         | ekzhang wrote:
          | I mentioned this in another comment thread, but we use gVisor
          | to enforce isolation: https://gvisor.dev/users/
         | 
         | It's also used by Google Kubernetes Engine, OpenAI, and
         | Cloudflare among others to run untrusted code.
        
           | yegle wrote:
           | And Google's own serverless offerings (App Engine, Cloud Run,
           | Cloud Functions) :-)
           | 
           | Disclaimer: I'm an SRE on the GCP Serverless products.
        
             | ekzhang wrote:
             | Neat, thanks for sharing! Glad to know we're in good
             | company here.
        
       ___________________________________________________________________
       (page generated 2024-12-02 23:00 UTC)