[HN Gopher] Infrastructure SaaS - a control plane first architec...
       ___________________________________________________________________
        
       Infrastructure SaaS - a control plane first architecture
        
       Author : infra_dev
       Score  : 63 points
       Date   : 2022-06-22 17:42 UTC (5 hours ago)
        
 (HTM) web link (docs.thenile.dev)
 (TXT) w3m dump (docs.thenile.dev)
        
       | ed wrote:
       | "Data plane" and "control plane" aren't terms I've seen before,
       | and I'm having trouble understanding what they are, even after
       | reading the post.
       | 
       | Can you explain them in a more concrete, conversational way?
       | 
       | Does this service let me e.g. take any docker image and turn it
       | into a SaaS, handling user accounts and billing etc?
        
         | infra_dev wrote:
         | Hey ed, this is Ram, the author of the post. In the context of
         | Infrastructure SaaS, the data plane is the infrastructure that
         | you provide as a service. For example, let us say you are
         | building a company that provides Postgres as a service. In this
         | case, Postgres is your data plane. Typically, your users will
         | want Postgres deployed in a specific region or cloud provider,
         | and they will run queries against the Postgres cluster.
         | 
         | The control plane is the central lifecycle management system
         | that provides the SaaS experience for your Infra SaaS
         | application, manages the metadata for your application, and
         | pushes this information to all the data planes. Examples of
         | lifecycle management operations include creating a user or a
         | new organization, provisioning your data plane in a specific
         | region, deleting a cluster, etc.
         | 
         | The data plane is the product that you want to sell to your
         | customers, and the control plane is the central system that
         | makes your product work in a self-serve way for your
         | customers.
         | 
         | This example can also be mapped to internal use cases. Many
         | companies manage their own infrastructure internally and end up
         | having to build a central control plane to manage all the
         | different infrastructure that they provide as a service to
         | their developers. Hope this helps.
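
A minimal sketch of the split described above, assuming a toy in-memory model. The class and method names (`ControlPlane`, `DataPlane`, `provision`, `_push_metadata`) are hypothetical and are not taken from the linked post or any real product. The control plane owns users and cluster records, and pushes metadata out to every data plane it manages.

```python
from dataclasses import dataclass, field


@dataclass
class DataPlane:
    """One physical deployment of the product (e.g. a Postgres cluster)."""
    cluster_id: str
    cloud: str
    region: str
    metadata: dict = field(default_factory=dict)  # last metadata pushed to this plane


@dataclass
class ControlPlane:
    """Central system that owns metadata and drives lifecycle operations."""
    users: dict = field(default_factory=dict)        # user name -> organization
    data_planes: dict = field(default_factory=dict)  # cluster_id -> DataPlane

    def create_user(self, user: str, org: str) -> None:
        self.users[user] = org
        self._push_metadata()

    def provision(self, cluster_id: str, cloud: str, region: str) -> DataPlane:
        plane = DataPlane(cluster_id, cloud, region)
        self.data_planes[cluster_id] = plane
        self._push_metadata()
        return plane

    def delete_cluster(self, cluster_id: str) -> None:
        del self.data_planes[cluster_id]

    def _push_metadata(self) -> None:
        # Copy the current user/org metadata to every managed data plane.
        for plane in self.data_planes.values():
            plane.metadata = {"users": dict(self.users)}


cp = ControlPlane()
pg = cp.provision("pg-1", cloud="aws", region="us-east-1")
cp.create_user("ed", org="acme")
print(pg.metadata)
```

In a real system the push would be a network call to a remote region rather than an attribute assignment; the sketch only shows which side owns which responsibility.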
        
         | fragmede wrote:
         | Data plane is the data, what you traditionally think of as a
         | service's inputs and outputs. E.g., a database server gets
         | queries and returns rows/results. That's data plane. But when a
         | system becomes large enough, the management stuff adds up to be
         | notable in and of itself.
         | 
         | Pretend we're a SaaS company offering a database as a service.
         | Adding and removing users and setting passwords is control
         | plane stuff. In a sufficiently web-scale system, adding and
         | removing users becomes not just its own microservice but a
         | collection of microservices, with its own separate database,
         | that authenticate users and send updates to the main product
         | database.
        
         | arccy wrote:
         | Think old-school FTP ports: one for data (your actual files
         | being transferred), one for control (out-of-band messaging).
        
         | beberlei wrote:
         | As I understand it, the data plane is, for example, the
         | MySQL/pgsql/redis servers of all the customers of your DB as a
         | service.
        
         | icedchai wrote:
         | In the old days, we'd call this the "app" and the "management"
         | (or "admin", or "provisioning") interface. The app is the
         | application(s) providing the actual service your SaaS offers,
         | the management interface manages/configures the app, handles
         | migrations, updates, etc. Example: Maybe your app needs a
         | separate DB per tenant. Your management thing handles spinning
         | that up, etc.
        
         | squarecog wrote:
         | These terms usually show up in the context of networking
         | protocols. Cloudflare has a very quick explainer:
         | https://www.cloudflare.com/learning/network-layer/what-is-
         | th.... To make it even shorter: a control plane is where all
         | the coordination that controls activity (data) happens. The
         | data plane is where the data actually moves around.
         | 
         | Explainers seem to not cover _why_ you would want to separate
         | these "planes". There are several reasons, and I'm no
         | authority, but for starters:
         | 
         | * Control messages will have different expectations around
         |   them: their amount and frequency, delivery guarantees, urgency
         |   with which they are processed. Treating this traffic
         |   separately means you can engineer appropriately for data and
         |   control traffic.
         | * Last thing you want is the control message "stop processing
         |   traffic from IP x.x.x.x port y" to be stuck behind traffic
         |   from said IP/port...
         | 
         | In this context, the meaning is somewhat different. They are
         | referring to administrative traffic vs "actual work" traffic.
         | Auth, billing/accounting, configuration updates, that sort of
         | thing. If you are running a SaaS, and your customer is very
         | security conscious and wants none of their precious data to
         | ever leave their VPC, you have 2 options: deploy your software
         | into their VPC completely, making it hard to do a variety of
         | things like upgrades, and increasing complexity; or you can
         | separate control actions from your "worker nodes" and storage,
         | and only deploy the latter into the VPC. You can then work on
         | your control panels, monitor usage, continuously evolve various
         | admin panels and config options, etc, using normal SaaS
         | approaches while the security conscious customer knows that
         | their core data is not leaving their virtual walls and only
         | "bob ran a thing and stored results" goes to the vendor.
         | 
         | This post is about abstracting out the common bits of how one
         | implements that, and making it easier for SaaS offerings to
         | provide that sort of separation.
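
The "control message stuck behind data traffic" point can be sketched with two separate queues, assuming a toy single-threaded worker. The function name `run_worker` and the message shapes are hypothetical. Before handling each data packet, the worker drains the control queue, so a "block" instruction takes effect even when a backlog from the offending source is already waiting.

```python
from collections import deque


def run_worker(control, data):
    """Process packets, draining all control messages before each data packet."""
    blocked = set()
    delivered = []
    while control or data:
        # Control traffic has its own queue, so it never waits behind data.
        while control:
            verb, source = control.popleft()
            if verb == "block":
                blocked.add(source)
        if data:
            source, payload = data.popleft()
            if source not in blocked:
                delivered.append(payload)
    return delivered


# A backlog from a noisy source is already queued when the block arrives.
data = deque(("10.0.0.9", f"pkt-{i}") for i in range(3))
data.append(("10.0.0.7", "ok"))
control = deque([("block", "10.0.0.9")])
print(run_worker(control, data))
```

With a single shared queue, the block message would sit behind the three queued packets from `10.0.0.9` and all of them would be delivered first.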
        
           | ed wrote:
           | Awesome explanation, thanks! (Particularly "last thing you
           | want is the control message "stop processing traffic from IP
           | x.x.x.x port y" to be stuck behind traffic from said
           | IP/port..")
        
             | squarecog wrote:
             | I should've added, there's an obvious example for the "SaaS
             | control plane" separation, which is equivalent: "stop
             | processing job X that is destabilizing the cluster" should
             | be processed without needing to fight for resources with
             | job X. Same for ACL changes, user deactivations, etc etc.
             | It's generally a good idea to have your control stuff not
             | be subject to whatever instabilities you might be
             | controlling against.
        
       | andrewmutz wrote:
       | Why is this specific to _infrastructure_ SaaS? Most of what you
       | write applies to any B2B SaaS, doesn't it?
        
         | infra_dev wrote:
         | Great question! A typical B2B SaaS usually has a multitenant
         | deployment in one region. The control plane APIs and data plane
         | APIs run in the same region. Each new customer gets a logical
         | tenant in your DB, but no physical data planes are created.
         | 
         | Infrastructure SaaS is a bit different. Different customers
         | typically provision your infrastructure in different clouds or
         | regions, depending on where their own infrastructure runs. This
         | leads to many physical infrastructure deployments across many
         | regions and cloud providers. At the same time, you need to give
         | the user a single-pane-of-glass experience where they can
         | manage all their infrastructure from one dashboard. This
         | requires a central control plane that is responsible for all
         | the lifecycle management operations and that communicates
         | metadata back and forth with all the data planes. Things like
         | upgrades, observability, and user and tenant management all
         | need coordination with the data planes. This makes the
         | Infrastructure SaaS use case a bit different from standard B2B
         | SaaS. Hope that helps.
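
A rough illustration of the "single pane of glass" idea, assuming the control plane keeps one record per physical deployment. The `Deployment` fields, the function names, and the sample fleet are all hypothetical. One query gives a tenant's view across clouds and regions, and another finds the clusters that an upgrade still has to reach.

```python
from typing import NamedTuple


class Deployment(NamedTuple):
    tenant: str
    cloud: str
    region: str
    cluster_id: str
    version: str


def tenant_view(deployments, tenant):
    """Everything one tenant runs, across all clouds and regions."""
    return sorted(d for d in deployments if d.tenant == tenant)


def pending_upgrades(deployments, target_version):
    """Clusters not yet on the target version, grouped by (cloud, region)."""
    pending = {}
    for d in deployments:
        if d.version != target_version:
            pending.setdefault((d.cloud, d.region), []).append(d.cluster_id)
    return pending


fleet = [
    Deployment("acme", "aws", "us-east-1", "pg-1", "14.2"),
    Deployment("acme", "gcp", "europe-west1", "pg-2", "14.3"),
    Deployment("globex", "aws", "us-east-1", "pg-3", "14.2"),
]
print([d.cluster_id for d in tenant_view(fleet, "acme")])
print(pending_upgrades(fleet, "14.3"))
```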
        
       | epelesis wrote:
       | I work at a somewhat well-known unicorn in the data space that
       | has been using this architecture for a while. In fact I'd wager
       | that any (non-cloud-provider) company that provides any
       | significant amount of compute or storage in their product
       | offering will converge upon a layout that closely resembles this.
       | 
       | Overall I'd imagine there are a lot of parallels to other
       | SaaS-ish architectures. One big divergence is that I'd consider
       | the data plane to be a special kind of client (a client in the
       | same way that a user's phone or browser is). The big difference
       | is that we (the company) ALSO manage the lifecycle of this
       | "client" (i.e. shutdown, startup, repair, update). Having an
       | untrusted client whose lifecycle you also manage can lead to
       | some interesting design spaces.
        
         | gwen-shapira wrote:
         | Managing untrusted clients seems extremely challenging. I think
         | these days you'd use something more "sandboxed" like WASM as
         | the basis for the client?
        
       | smashah wrote:
       | Maintainers should be given the tools to monetise their projects
       | using a cloud offering early on so they don't burn out on the way
       | to learning what control plane/data plane nomenclature actually
       | is!
       | 
       | Unfortunately there are limited tools/resources out there to
       | answer the question "how can I build a cloud deployment option
       | for my open source project" without 25 layers of abstractions.
       | This is why open-source projects end up raising millions of
       | dollars (e.g. Strapi, Appsmith, just off the top of my head). All
       | this money just for all these companies to essentially build the
       | same thing.
       | 
       | Ideally, there should be a service/tool (maybe thenile will be
       | it) where I answer a few questions.
       | 
       | Containerised? Stateful? Long lived? Keep alive? Allow end users
       | to deploy many instances? % premium per instance over cloud
       | costs? License per instance? Pricing (tiered, volume, stairstep?)
       | On end customer's own infra? Min resources per instance.
       | Airgapped per user/airgapped per instance/all instances running
       | on same cloud? Instance management API? Update strategy? Big
       | Green button to update all instances at once?
       | 
       | And it should spit out a ready to deploy setup where I can start
       | monetising my open-source project while concentrating on
       | maintaining the project.
       | 
       | If the above exists then congrats, you disrupted the main reason
       | for open source projects raising millions of dollars.
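
As a sketch of what the answers to such a questionnaire could look like as data, assuming a hypothetical `HostingSpec` record. This is not the API of any existing tool, and the field names are invented for illustration. It captures a few of the questions above and derives a per-instance price from the "% premium over cloud costs" answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostingSpec:
    """Hypothetical answers a maintainer might give for a hosted offering."""
    containerised: bool
    stateful: bool
    max_instances_per_user: int
    premium_pct: float               # % premium per instance over cloud costs
    min_cpu: float                   # minimum vCPUs per instance
    min_memory_mb: int               # minimum memory per instance
    on_customer_infra: bool = False  # deploy on the end customer's own infra?

    def instance_price(self, cloud_cost: float) -> float:
        """Price charged per instance, given its underlying cloud cost."""
        return round(cloud_cost * (1 + self.premium_pct / 100), 2)


spec = HostingSpec(
    containerised=True,
    stateful=True,
    max_instances_per_user=5,
    premium_pct=20.0,
    min_cpu=0.5,
    min_memory_mb=512,
)
print(spec.instance_price(10.0))
```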
        
         | ibeckermayer wrote:
         | Sounds insanely difficult to get right
        
           | smashah wrote:
           | YES! Imagine trying to do all that while maintaining and
           | growing the source project. This is why I've never been able
           | to get there with my project. Instead I have to make
           | deployment buttons where DO makes all the revenue and their
           | broken referral system doesn't even give me anything back.
           | 
           | One shortcut with all this is just to allow maintainers to
           | surcharge a premium with deploy buttons.
        
         | infra_dev wrote:
         | Hey smashah, this is Ram, the author of the post. Everything
         | that you said is pretty spot on. It is a really hard problem,
         | and one that gets solved repeatedly across many companies,
         | including open source projects. We hope to build a platform
         | that can help with this and over time cover all the use cases
         | you mentioned.
        
           | gkapur wrote:
           | How do you think about partner versus build? Usage-based
           | billing, for example, is a pretty complex topic on its own,
           | and multiple startups have tried to get it right. It also
           | gets pretty data intensive as usage frequency increases. It
           | seems like you may be trying to do too much here. Just one
           | opinion, but something to think through (build versus
           | partner).
        
       | bluelightning2k wrote:
       | Hey - I don't want to be critical, but I think you should rethink
       | your messaging.
       | 
       | I'll be honest with you: I thought this was a parody. It's SO
       | abstract.
       | 
       | It's like you came from doing this abstract thing inside a big
       | company & decided to do the same abstract thing as a startup. And
       | describe it using the specific terminology used inside that
       | specific team of SuperBigCo.
        
         | itisit wrote:
         | Who is this for? What does it do? Why should I care? I ask all
         | of these non-facetiously as someone who works in enterprise
         | cloud architecture.
        
         | infra_dev wrote:
         | This is Ram, the author of the post. Thanks for the feedback.
         | The architecture that is mentioned in the post is something we
         | built at a startup for providing our infrastructure as a
         | service. It is true that this architecture gets complex in
         | larger companies. I would love to understand what parts are
         | abstract and how we can improve them. We plan to publish a
         | series of posts to provide more clarity on the different parts
         | mentioned in the blog. Your feedback would be really helpful.
        
       | agentultra wrote:
       | For recent projects I've been using PostgREST + postgresql-
       | replicant + Kafka (or some durable message queue as appropriate).
       | 
       | I don't have to work too hard to implement the control plane as
       | PostgREST takes care of generating the API and Postgres already
       | has authorization controls built-in. Authentication is the part
       | that requires a bit of toil to figure out. The rest is managing
       | the schemas of the control plane entities. Basically, design the
       | data and most of the rest is generated for me.
       | 
       | The data plane is trickier but I'm experimenting with using
       | Postgres' streaming logical replication protocol to convert
       | logical queries on the control plane data into business domain
       | events that are forwarded onto the message queue. This is the
       | part that uses the postgresql-replicant library I wrote but other
       | libraries written in Python exist and can do the same thing.
       | 
       | This then enables me to implement business logic/data-plane
       | actions asynchronously down-stream as isolated, stateful services
       | that follow the event streams and react accordingly to various
       | policies. They can then update the control plane models as they
       | progress, which then could add more domain events to the stream,
       | and so on. It's a bit like a functional-reactive architecture.
       | 
       | I don't know if it's a _production ready_ style architecture.
       | Monitoring replication stream performance can be tricky and
       | integration testing is challenging. And managing changes to the
       | business domain is not a fully-solved problem: still a lot of
       | exploration, tooling, and do-it-yourself duct-taping to do.
       | 
       | But it's simple enough that I can do a lot of work with very
       | little code so far.
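
A minimal sketch of the "row changes become business domain events" step described above. It assumes each change from the replication stream has already been decoded into a plain dict with `table`, `op`, `old`, and `new` keys. That shape, the table name, and the event names are assumptions made for illustration and do not reflect the actual output of postgresql-replicant or any other library. A deque stands in for the durable message queue.

```python
import json
from collections import deque


def to_domain_event(change):
    """Map one row-level change to a business event, or None to ignore it."""
    table, op = change["table"], change["op"]
    new = change.get("new") or {}
    old = change.get("old") or {}
    if table != "clusters":
        return None  # not a table this sketch cares about
    if op == "insert":
        return {"type": "ClusterRequested", "cluster_id": new["id"], "region": new["region"]}
    if op == "update" and old.get("status") != new.get("status"):
        return {"type": "ClusterStatusChanged", "cluster_id": new["id"], "status": new["status"]}
    if op == "delete":
        return {"type": "ClusterDeleted", "cluster_id": old["id"]}
    return None


def forward(changes, publish):
    """Translate changes and hand each resulting event to the publisher."""
    for change in changes:
        event = to_domain_event(change)
        if event is not None:
            publish(json.dumps(event))


outbox = deque()  # in-memory stand-in for Kafka or another durable queue
forward(
    [
        {"table": "clusters", "op": "insert",
         "new": {"id": "pg-1", "region": "us-east-1", "status": "pending"}},
        {"table": "audit_log", "op": "insert", "new": {"id": 7}},
        {"table": "clusters", "op": "update",
         "old": {"id": "pg-1", "status": "pending"},
         "new": {"id": "pg-1", "region": "us-east-1", "status": "ready"}},
    ],
    outbox.append,
)
for message in outbox:
    print(message)
```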
        
         | infra_dev wrote:
         | Hey, this is Ram, the author of the post. Would love to know
         | how this architecture works in production for you. For
         | Infrastructure SaaS, the control plane needs to manage hundreds
         | of data planes. It also needs to own user and tenant
         | management, security policies, usage-based billing, and usage
         | and operational insights for your users. The challenge is in
         | building an infrastructure that can centrally manage the
         | lifecycle of all the metadata (SaaS, app and infra) and
         | orchestrate with all the data planes. Postgres is definitely a
         | good building block to build on top of but there is so much to
         | build around it to make this work well in our experience.
        
           | agentultra wrote:
           | Managing the data plane is the more complex part of it as you
           | would expect.
           | 
           | I do tend to model the business processes at a high level
           | using domain-driven design and map the aggregates to services
           | that ingest event streams. The services then react to the
           | event streams in several ways: emitting events, updating
           | control plane models, issuing new commands to other services,
           | etc.
           | 
           | Each service keeps its own state internally and if I need to,
           | I can blow away their state and replay all of the business
           | domain events thanks to the durable message stream. That part
           | is key... and is also the most duct-tape-and-toil area of
           | this architecture.
           | 
           | I've been toying around with ideas to generalize this into a
           | consolidated application framework but it's still pretty
           | experimental stuff.
           | 
           | The high-level architecture isn't terribly novel or new, but
           | having standardized tools for common operations (managing
           | migrations from event schemas, managing checkpoints, etc.) is
           | still a work in progress.
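
The "blow away state and replay" property can be sketched as follows, assuming the durable stream is an append-only list and the checkpoint is simply the number of events applied. The class `ClusterTracker` and the event shapes are hypothetical. Because the service's state is derived only from the log, a fresh instance that replays the log ends up in the same state.

```python
class ClusterTracker:
    """Toy downstream service: tracks live clusters from the event stream."""

    def __init__(self):
        self.offset = 0  # checkpoint: how many events have been applied
        self.live = {}   # cluster_id -> region

    def apply(self, event):
        if event["type"] == "ClusterRequested":
            self.live[event["cluster_id"]] = event["region"]
        elif event["type"] == "ClusterDeleted":
            self.live.pop(event["cluster_id"], None)
        self.offset += 1

    def catch_up(self, log):
        """Apply every event after the checkpoint."""
        for event in log[self.offset:]:
            self.apply(event)


log = [
    {"type": "ClusterRequested", "cluster_id": "pg-1", "region": "us-east-1"},
    {"type": "ClusterRequested", "cluster_id": "pg-2", "region": "eu-west-1"},
    {"type": "ClusterDeleted", "cluster_id": "pg-1"},
]

service = ClusterTracker()
service.catch_up(log)

# Throw the state away and rebuild it from the durable log.
rebuilt = ClusterTracker()
rebuilt.catch_up(log)
print(rebuilt.live, rebuilt.offset)
```

Calling `catch_up` again after new events are appended applies only the events past the checkpoint, which is the part that checkpoint-management tooling would need to make durable.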
        
       ___________________________________________________________________
       (page generated 2022-06-22 23:01 UTC)