[HN Gopher] Static IPs for Serverless Containers
___________________________________________________________________
Static IPs for Serverless Containers
Author : ekzhang
Score : 64 points
Date : 2024-12-02 20:04 UTC (2 hours ago)
(HTM) web link (modal.com)
(TXT) w3m dump (modal.com)
| ekzhang wrote:
| Hi! This is a blog post sharing some low-level Linux networking
| we're doing at Modal with WireGuard.
|
| As a serverless platform we hit a bit of a tricky tradeoff: we
| run multi-tenant user workloads on machines around the world, and
| each serverless function is an autoscaling container pool. How do
| you let users give their functions static IPs without tying them
| to specific machines or giving up compute flexibility?
|
| We needed a high-availability VPN proxy for containers and didn't
| find one, so we built our own on top of WireGuard and open-
| sourced it at https://github.com/modal-labs/vprox
|
| Let us know if you have thoughts! I'm relatively new to low-level
| container networking, and we (me + my coworkers Luis and Jeffrey
| + others) have enjoyed working on this.
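|
| To give a rough idea of the per-host WireGuard plumbing involved,
| here's a minimal Go sketch (illustrative only, not vprox's actual
| code) that configures a WireGuard device and registers one peer
| using the wgctrl library. The "wg0" device name, tunnel IP, port,
| and keepalive are made-up example values, and the interface is
| assumed to already exist (e.g. created via netlink or `ip link
| add wg0 type wireguard`):
|
|     // Illustrative sketch only -- not vprox's actual code.
|     package main
|
|     import (
|         "log"
|         "net"
|         "time"
|
|         "golang.zx2c4.com/wireguard/wgctrl"
|         "golang.zx2c4.com/wireguard/wgctrl/wgtypes"
|     )
|
|     func main() {
|         client, err := wgctrl.New()
|         if err != nil {
|             log.Fatal(err)
|         }
|         defer client.Close()
|
|         // Each proxy host keeps its own key; no shared state.
|         priv, err := wgtypes.GeneratePrivateKey()
|         if err != nil {
|             log.Fatal(err)
|         }
|
|         // Stand-in peer: a container that should egress through
|         // this proxy. In reality its public key would come from
|         // the client side; here we just generate one.
|         peerPriv, _ := wgtypes.GeneratePrivateKey()
|         _, tunnelIP, _ := net.ParseCIDR("10.0.0.2/32")
|         keepalive := 25 * time.Second
|         port := 51820
|
|         cfg := wgtypes.Config{
|             PrivateKey: &priv,
|             ListenPort: &port,
|             Peers: []wgtypes.PeerConfig{{
|                 PublicKey:                   peerPriv.PublicKey(),
|                 AllowedIPs:                  []net.IPNet{*tunnelIP},
|                 PersistentKeepaliveInterval: &keepalive,
|             }},
|         }
|         if err := client.ConfigureDevice("wg0", cfg); err != nil {
|             log.Fatal(err)
|         }
|     }
|
| wgctrl speaks WireGuard's configuration protocol directly (netlink
| on Linux), so each proxy host can manage its own peers without a
| central coordinator.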
| xxpor wrote:
| You're using containers as a multi-tenancy boundary for
| arbitrary code?
| ekzhang wrote:
| We use gVisor! It's an open-source application security
| sandbox spun off from Google. We work with the gVisor team to
| get the features we need (notably GPUs / CUDA support) and
| also help test gVisor upstream: https://gvisor.dev/users/
|
| It's also used by Google Kubernetes Engine, OpenAI, and
| Cloudflare among others to run untrusted code.
| doctorpangloss wrote:
| Are these the facts?
|
| - You are using a container orchestrator like Kubernetes
|
| - You are using gVisor as a container runtime
|
| - Two applications from different users, containerized, are
| scheduled on the same node.
|
| Then, which of the following are true?
|
| (1) Both have shared access to an NVIDIA GPU
|
| (2) Both share access to the NVIDIA GPU via CUDA MPS
|
| (3) If there were 2 or more MIG instances on the node with a
| MIG-supporting GPU, the NVIDIA container toolkit shim would
| assign a distinct MIG instance to each application
| ekzhang wrote:
| We don't use Kubernetes to run user workloads, we do use
| gVisor. We don't use MIG (multi-instance GPU) or MPS. If
| you run a container on Modal using N GPUs, you get all N GPUs
| to yourself.
|
| If you'd like to learn more, you can check out our docs
| here: https://modal.com/docs/guide/gpu
|
| Re not using Kubernetes, we have our own custom container
| runtime in Rust with optimizations like lazy loading of
| content-addressed file systems.
| https://www.youtube.com/watch?v=SlkEW4C2kd4
| doctorpangloss wrote:
| Suppose I ask for two H100s. Will I have GPU P2P
| capabilities?
| thundergolfer wrote:
| Yes, you will.
|
| (I work at Modal.)
| ekzhang wrote:
| Yep! This is something we have internal tests for, haha. You
| have good instincts that it can be tricky. Here's an
| example of using that for multi-GPU training
| https://modal.com/docs/examples/llm-finetuning
| ec109685 wrote:
| If the NVIDIA driver has a bug, can one workload access the
| data of another workload running on the same physical machine?
|
| E.g. it came up in this thread:
| https://news.ycombinator.com/item?id=41672168
| crishoj wrote:
| Neat. I am curious what notable differences there are between
| Modal and Tailscale.
| ekzhang wrote:
| Thanks. We did check out Tailscale, but they didn't quite
| have what we were looking for: a high-availability component
| that plugs into our low-level container runtime.
| (Which makes sense; it's pretty different from their intended
| use case.)
|
| Modal is actually a happy customer of Tailscale (but for
| other purposes). :D
| jimmyl02 wrote:
| this is a really neat writeup! the design choice to have each
| "exit node" control its local wireguard connections instead of a
| global control plane is pretty clever.
|
| an unfinished project I worked on
| (https://github.com/redpwn/rvpn) was a bit more ambitious with a
| global control plane, and I quickly learned that supporting
| multiple clients, especially for anything networking-related, is
| a tarpit. the focus on linux / aws specifically here, and the
| results achievable from it, are nice to see.
|
| networking is challenging and this was a nice deep dive into some
| networking internals, thanks for sharing the details :)
| ekzhang wrote:
| Thanks for sharing. I'm interested in seeing what a global
| control plane might look like; it seems like authentication would
| be tricky to get right!
|
| Controlling our worker environment (like the
| `net.ipv4.conf.all.rp_filter` sysctl) is a big help for us,
| since it means we don't have to deal with the full range of
| possible network configurations.
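|
| For context, strict reverse-path filtering (rp_filter=1) drops
| packets whose return route doesn't go back out the interface they
| arrived on, which tends to conflict with policy-routed tunnel
| traffic like this. Here's a tiny illustrative Go equivalent of
| `sysctl -w net.ipv4.conf.all.rp_filter=2` (loose mode; the value
| is just an example, not necessarily what our workers set):
|
|     // Illustrative sketch; needs root. Same effect as
|     // `sysctl -w net.ipv4.conf.all.rp_filter=2`.
|     package main
|
|     import (
|         "log"
|         "os"
|     )
|
|     func main() {
|         const p = "/proc/sys/net/ipv4/conf/all/rp_filter"
|         if err := os.WriteFile(p, []byte("2\n"), 0o644); err != nil {
|             log.Fatal(err)
|         }
|     }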
| cactacea wrote:
| Static IPs for allowlists need to die already. It's 2024, come on,
| surely we can do better than this.
| ekzhang wrote:
| What would you suggest as an alternative?
| thatfunkymunki wrote:
| a more modern, zero-trust solution like mTLS authentication
| ekzhang wrote:
| That makes sense; mTLS is great. Some services, like Google
| Cloud SQL, are really good about supporting it.
| https://cloud.google.com/sql/docs/mysql/configure-ssl-
| instan...
|
| It's not quite a zero-trust solution though due to the CA
| chain of trust.
|
| mTLS is security at a different layer than IP source
| whitelisting, though. A lot of the companies we spoke to
| want both as a defense-in-depth measure. Even with mTLS,
| network whitelisting is still relevant: if your certificate
| were exposed, for instance, an attacker would still need to
| forge an allowed source IP address to start a connection.
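|
| To make the mTLS side concrete, here's a minimal Go sketch of a
| server that only accepts clients presenting a certificate signed
| by your own CA. The file names are placeholders, and this is just
| an illustration of the pattern, not anything we run:
|
|     // Minimal mTLS server sketch; file paths are placeholders.
|     package main
|
|     import (
|         "crypto/tls"
|         "crypto/x509"
|         "log"
|         "net/http"
|         "os"
|     )
|
|     func main() {
|         caPEM, err := os.ReadFile("client-ca.pem")
|         if err != nil {
|             log.Fatal(err)
|         }
|         pool := x509.NewCertPool()
|         if !pool.AppendCertsFromPEM(caPEM) {
|             log.Fatal("no CA certs found in client-ca.pem")
|         }
|
|         mux := http.NewServeMux()
|         mux.HandleFunc("/", func(w http.ResponseWriter,
|             r *http.Request) {
|             // With RequireAndVerifyClientCert, PeerCertificates
|             // always holds the authenticated client certificate.
|             cn := r.TLS.PeerCertificates[0].Subject.CommonName
|             w.Write([]byte("hello, " + cn))
|         })
|
|         srv := &http.Server{
|             Addr:    ":8443",
|             Handler: mux,
|             TLSConfig: &tls.Config{
|                 ClientCAs: pool,
|                 // Reject clients without a cert from our CA.
|                 ClientAuth: tls.RequireAndVerifyClientCert,
|             },
|         }
|         // server.pem / server-key.pem: the server's own cert+key.
|         log.Fatal(srv.ListenAndServeTLS("server.pem",
|             "server-key.pem"))
|     }
|
| The client connects with its own cert and key, so you get a "who
| you are" check from mTLS and a "where you're coming from" check
| from the IP allowlist.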
| thatfunkymunki wrote:
| I'd put it in the zero-trust category if the server (or
| owner of the server, etc) is the issuer of the client
| certificate and the client uses that certificate to
| authenticate itself, but I'll admit this is a pedantic
| point that adds nothing of substance. The idea is that
| you trust your issuance of the certificate and the
| various things that can be asserted based on how it was
| issued (stored in TPM, etc), rather than any parameter
| that could be controlled by the remote party.
| eqvinox wrote:
| I guess my first question is, why is this built on IPv4 rather
| than IPv6...
| ekzhang wrote:
| Yeah, great question. This came up early in the design process.
| A lot of our customers specifically needed IPv4 whitelisting.
| For example, MongoDB Atlas (a very popular database vendor)
| only supports IPv4.
| https://www.mongodb.com/community/forums/t/does-mongodb-atla...
|
| The architecture of vprox is pretty generic though and could
| support IPv6 as well.
| ATechGuy wrote:
| > Modal has an isolated container runtime that lets us share each
| host's CPU and memory between workloads.
|
| Looks like Modal hosts workloads in containers, not VMs. How do
| you enforce secure isolation with this design? A single kernel
| vulnerability could lead to remote execution on the host,
| impacting all workloads. Am I missing anything?
| ekzhang wrote:
| I mentioned this in another comment thread, but we use gVisor
| to enforce isolation. https://gvisor.dev/users/
|
| It's also used by Google Kubernetes Engine, OpenAI, and
| Cloudflare among others to run untrusted code.
| yegle wrote:
| And Google's own serverless offerings (App Engine, Cloud Run,
| Cloud Functions) :-)
|
| Disclaimer: I'm an SRE on the GCP Serverless products.
| ekzhang wrote:
| Neat, thanks for sharing! Glad to know we're in good
| company here.
___________________________________________________________________
(page generated 2024-12-02 23:00 UTC)