[HN Gopher] Infrastructure SaaS - a control plane first architec...
___________________________________________________________________
Infrastructure SaaS - a control plane first architecture
Author : infra_dev
Score : 63 points
Date : 2022-06-22 17:42 UTC (5 hours ago)
(HTM) web link (docs.thenile.dev)
(TXT) w3m dump (docs.thenile.dev)
| ed wrote:
| "Data plane" and "control plane" aren't terms I've seen before,
| and I'm having trouble understanding what they are, even after
| reading the post.
|
| Can you explain them in a more concrete, conversational way?
|
| Does this service let me e.g. take any docker image and turn it
| into a SaaS, handling user accounts and billing etc?
| infra_dev wrote:
| Hey ed, this is Ram, the author of the post. In the context of
| Infrastructure SaaS, the data plane is the infrastructure that
| you provide as a service. For example, let us say you are
| building a company that provides Postgres as a service. In this
| case, Postgres is your data plane. Typically, your users will
| want Postgres to be deployed in a specific region or cloud
| provider. They would run queries against the Postgres cluster.
|
| Control plane is the central lifecycle management system that
| helps provide all the SaaS experience for your Infra SaaS
| application, manages the metadata for your application and also
| pushes this information to all the data planes. Examples of
| lifecycle management operations include creating a user or a
| new organization, provisioning your data plane in a specific
| region, deleting a cluster, etc.
|
| The data plane is your product that you want to sell to your
| customers and control plane is the central system that helps
| you to make your product work in a self serve way with your
| customers.
|
| This example can also be mapped to internal use cases. Many
| companies manage their own infrastructure internally and end up
| having to build a central control plane to manage all the
| different infrastructure that they provide as a service to
| their developers. Hope this helps.
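|
| A minimal sketch of the split described above, in Python. Every
| name here (ControlPlane, DataPlane, create_cluster, the region
| strings) is hypothetical and only for illustration: the control
| plane owns the metadata and lifecycle operations, and each data
| plane only runs the workload it is told to run.
|
```python
from dataclasses import dataclass, field


@dataclass
class DataPlane:
    """One regional deployment that runs customer clusters (hypothetical)."""
    region: str
    clusters: dict = field(default_factory=dict)

    def apply(self, cluster_id, spec):
        # A real data plane would start a Postgres instance here.
        self.clusters[cluster_id] = spec

    def remove(self, cluster_id):
        self.clusters.pop(cluster_id, None)


class ControlPlane:
    """Central owner of metadata; pushes desired state to the data planes."""

    def __init__(self, data_planes):
        self.data_planes = {dp.region: dp for dp in data_planes}
        self.metadata = {}  # cluster_id -> {"org": ..., "region": ...}

    def create_cluster(self, org, cluster_id, region):
        if region not in self.data_planes:
            raise ValueError(f"no data plane in region {region!r}")
        spec = {"org": org, "region": region}
        self.metadata[cluster_id] = spec
        self.data_planes[region].apply(cluster_id, spec)

    def delete_cluster(self, cluster_id):
        spec = self.metadata.pop(cluster_id)
        self.data_planes[spec["region"]].remove(cluster_id)

    def list_clusters(self, org):
        # One place to see every cluster, whatever region it runs in.
        return sorted(c for c, s in self.metadata.items() if s["org"] == org)
```
|
| Queries from end users would go straight to the cluster in its
| region; only create/delete/list style operations pass through
| the control plane object.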
| fragmede wrote:
| Data plane is the data, what you traditionally think of as a
| service's inputs and outputs. E.g., a database server gets
| queries and returns rows/results. That's data plane. But when a
| system becomes large enough, the management stuff adds up to be
| notable in and of itself.
|
| Pretend we're a SaaS company offering a database as a service.
| Adding and removing users, and the setting of passwords is
| control plane stuff. In a sufficiently web scale system, adding
| and removing users becomes, not just its own microservice, but
| a collection of microservices to authenticate and send updates
| to the main product database, and have its own separate
| database.
| arccy wrote:
| think old school ftp ports, 1 for data (your actual files being
| transferred) 1 for control (out of band messaging).
| beberlei wrote:
| I understood a data plane is for example the MySQL/pgsql/redis
| servers of all customers of your db as service
| icedchai wrote:
| In the old days, we'd call this the "app" and the "management"
| (or "admin", or "provisioning") interface. The app is the
| application(s) providing the actual service your SaaS offers,
| the management interface manages/configures the app, handles
| migrations, updates, etc. Example: Maybe your app needs a
| separate DB per tenant. Your management thing handles spinning
| that up, etc.
| squarecog wrote:
| These terms usually show up in the context of networking
| protocols. Cloudflare has a very quick explainer:
| https://www.cloudflare.com/learning/network-layer/what-is-
| th.... To make it even shorter: a control plane is where all
| the coordination that controls activity (data) happens. The
| data plane is where the data actually moves around.
|
| Explainers seem to not cover _why_ you would want to separate
| these "planes". There are several reasons, and I'm no
| authority, but for starters: * control messages will have
| different expectations around them: their amount and frequency,
| delivery guarantees, urgency with which they are processed.
| Treating this traffic separately means you can engineer
| appropriately for data and control traffic. * last thing you
| want is the control message "stop processing traffic from IP
| x.x.x.x port y" to be stuck behind traffic from said IP/port...
|
| In this context, the meaning is somewhat different. They are
| referring to administrative traffic vs "actual work" traffic.
| Auth, billing/accounting, configuration updates, that sort of
| thing. If you are running a SaaS, and your customer is very
| security conscious and wants none of their precious data to
| ever leave their VPC, you have 2 options: deploy your software
| into their VPC completely, making it hard to do a variety of
| things like upgrades, and increasing complexity; or you can
| separate control actions from your "worker nodes" and storage,
| and only deploy the latter into the VPC. You can then work on
| your control panels, monitor usage, continuously evolve various
| admin panels and config options, etc, using normal SaaS
| approaches while the security conscious customer knows that
| their core data is not leaving their virtual walls and only
| "bob ran a thing and stored results" goes to the vendor.
|
| This post is about abstracting out common bits of how one
| implements that, and allowing SaaS offerings to provide that
| sort of separation easier.
| ed wrote:
| Awesome explanation, thanks! (Particularly "last thing you
| want is the control message "stop processing traffic from IP
| x.x.x.x port y" to be stuck behind traffic from said
| IP/port..")
| squarecog wrote:
| I should've added, there's an obvious example for the "SaaS
| control plane" separation, which is equivalent: "stop
| processing job X that is destabilizing the cluster" should
| be processed without needing to fight for resources with
| job X. Same for ACL changes, user deactivations, etc etc.
| It's generally a good idea to have your control stuff not
| be subject to whatever instabilities you might be
| controlling against.
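|
| A rough sketch of that idea in Python, assuming a single worker
| with two in-memory queues. The Worker class and the "block"
| action are made up for illustration; the point is only that
| control messages are drained before any data item, so a "stop"
| never waits behind the traffic it is meant to stop.
|
```python
from collections import deque


class Worker:
    """Toy worker with separate control and data queues (hypothetical)."""

    def __init__(self):
        self.control = deque()   # (action, source) tuples
        self.data = deque()      # (source, payload) tuples
        self.blocked = set()
        self.processed = []

    def step(self):
        """Handle all pending control messages, then at most one data item."""
        # Control always goes first, regardless of how deep the data queue is.
        while self.control:
            action, source = self.control.popleft()
            if action == "block":
                self.blocked.add(source)
            elif action == "unblock":
                self.blocked.discard(source)

        if not self.data:
            return False
        source, payload = self.data.popleft()
        if source not in self.blocked:
            self.processed.append(payload)
        return True  # an item was consumed (processed or dropped)
```
|
| With a single shared queue, the block message would sit behind
| every item already enqueued by the noisy source.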
| andrewmutz wrote:
| Why is this specific to _infrastructure_ SaaS? Most of what you
| write applies to any B2B SaaS, doesn't it?
| infra_dev wrote:
| Great question! For a typical B2B SaaS, you typically will have
| a multitenant deployment in one region. The control plane APIs
| and data plane APIs will run in the same region. Each new
| customer will create a logical tenant in your DB but there are
| no physical data planes created.
|
| For Infrastructure SaaS, it is a bit different. You typically
| will have different customers provision your infrastructure in
| different clouds or regions depending on where they have their
| infrastructure. This leads to having many physical
| infrastructure deployments in many regions and cloud providers. At
| the same time, for the user, you need to provide a single pane
| of glass experience where they can manage all their
| infrastructure from a single dashboard. This requires a central
| control plane that is responsible for all the life cycle
| management operations and it helps to communicate all the
| metadata back and forth to all the data planes. Things like
| upgrades, observability, user and tenant management all need
| coordination with the data planes. This makes the
| Infrastructure SaaS use case a bit different from standard B2B
| SaaS. Hope that helps.
| [deleted]
| epelesis wrote:
| I work at a somewhat well-known unicorn in the data space that
| has been using this architecture for a while. In fact I'd wager
| that any (non-cloud-provider) company that provides any
| significant amount of compute or storage in their product
| offering will converge upon a layout that closely resembles this.
|
| Overall I'd imagine there are a lot of parallels to other SaaS-
| ish architectures. One big divergence is that I'd consider the
| data-plane to be a special kind of client, (client in the same
| way that a user's phone or browser is). The big difference is
| that we (the company) ALSO manage the lifecycle of this "client"
| (i.e. shutdown, startup, repair, update). Having an untrusted
| client that you also manage the lifecycle of can lead to some
| interesting design spaces.
| gwen-shapira wrote:
| Managing untrusted clients seems extremely challenging. I think
| these days you'd use something more "sandboxed" like WASM as
| the basis for the client?
| smashah wrote:
| Maintainers should be given the tools to monetise their projects
| using a cloud offering early on so they don't burn out on the way
| to learning what control plane/data plane nomenclature actually
| is!
|
| Unfortunately there are limited tools/resources out there to
| answer the question "how can i build a cloud deployment option
| for my open source project" without 25 layers of abstractions.
| This is why open-source projects end up raising millions of
| dollars (e.g. strapi, appsmith just off the top of my head). All
| this money just for all these companies to essentially build the
| same thing.
|
| Ideally, there should be a service/tool (maybe thenile will be
| it) where I answer a few questions.
|
| Containerised? Stateful? Long lived? Keep alive? Allow end users
| to deploy many instances? % premium per instance over cloud
| costs? License per instance? Pricing (tiered, volume, stairstep?)
| On end customer's own infra? Min resources per instance.
| Airgapped per user/airgapped per instance/all instances running
| on same cloud? Instance management API? Update strategy? Big
| Green button to update all instances at once?
|
| And it should spit out a ready to deploy setup where I can start
| monetising my open-source project while concentrating on
| maintaining the project.
|
| If the above exists then congrats, you disrupted the main reason
| for open source projects raising millions of dollars.
| ibeckermayer wrote:
| Sounds insanely difficult to get right
| smashah wrote:
| YES! Imagine trying to do all that while maintaining and
| growing the source project. This is why I've never been able
| to get there with my project. Instead I have to make
| deployment buttons where DO makes all the revenue and their
| broken referral system doesn't even give me anything back.
|
| One shortcut with all this is just to allow maintainers to
| surcharge a premium with deploy buttons.
| infra_dev wrote:
| Hey smashah, this is Ram, the author of the post. Everything
| that you said is pretty spot on. It is a really hard problem
| and pretty repetitive in many of the companies including open
| source projects. We hope to build a platform that can help with
| this and over time cover all the use cases you mentioned.
| gkapur wrote:
| How do you think about partner versus build? Usage based
| billing on its own is a pretty complex topic as an example
| with multiple startups that have tried to do it right. It is
| pretty data intensive as usage frequency increases, as well.
| It seems like you may be trying to do too much here. Just one
| opinion but something to think through (build versus
| partner.)
| [deleted]
| bluelightning2k wrote:
| Hey - I don't want to be critical, but I think you should rethink
| your messaging.
|
| I'll be honest with you: I thought this was a parody. It's SO
| abstract.
|
| It's like you came from doing this abstract thing inside a big
| company & decided to do the same abstract thing as a startup. And
| describe it using the specific terminology used inside that
| specific team of SuperBigCo.
| itisit wrote:
| Who is this for? What does it do? Why should I care? I ask all
| of these non-facetiously as someone who works in enterprise
| cloud architecture.
| infra_dev wrote:
| This is Ram, the author of the post. Thanks for the feedback.
| The architecture that is mentioned in the post is something we
| built at a startup for providing our infrastructure as a
| service. It is true that this architecture gets complex in
| larger companies. I would love to understand what parts are
| abstract and how we can improve them. We plan to publish a
| series of posts to provide more clarity on the different parts
| mentioned in the blog. Your feedback would be really helpful.
| agentultra wrote:
| For recent projects I've been using PostgREST + postgresql-
| replicant + Kafka (or some durable message queue as appropriate).
|
| I don't have to work too hard to implement the control plane as
| PostgREST takes care of generating the API and Postgres already
| has authorization controls built-in. Authentication is the part
| that requires a bit of toil to figure out. The rest is managing
| the schemas of the control plane entities. Basically, design the
| data and most of the rest is generated for me.
|
| The data plane is trickier but I'm experimenting with using
| Postgres' streaming logical replication protocol to convert
| logical queries on the control plane data into business domain
| events that are forwarded onto the message queue. This is the
| part that uses the postgresql-replicant library I wrote but other
| libraries written in Python exist and can do the same thing.
|
| This then enables me to implement business logic/data-plane
| actions asynchronously down-stream as isolated, stateful services
| that follow the event streams and react accordingly to various
| policies. They can then update the control plane models as they
| progress, which then could add more domain events to the stream,
| and so on. It's a bit like a functional-reactive architecture.
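|
| To make the "row changes become domain events" step concrete,
| here is a small Python sketch. It does not use postgresql-
| replicant or a real replication stream: the change records are
| plain dicts in a made-up shape ("table", "op", "old", "new"),
| and the event names and the publish callback are hypothetical.
|
```python
def to_domain_event(change):
    """Map one row-level change to a domain event, or None if irrelevant.

    `change` is an assumed shape, not a real replication message format.
    """
    table, op = change["table"], change["op"]
    new = change.get("new") or {}
    old = change.get("old") or {}

    if table != "clusters":
        return None
    if op == "insert":
        return {"type": "ClusterRequested",
                "cluster_id": new["id"], "region": new["region"]}
    if op == "update" and old.get("status") != new.get("status"):
        return {"type": "ClusterStatusChanged",
                "cluster_id": new["id"], "status": new["status"]}
    if op == "delete":
        return {"type": "ClusterDeleted", "cluster_id": old["id"]}
    return None


def forward(changes, publish):
    """Publish an event for each relevant change; return how many were sent."""
    sent = 0
    for change in changes:
        event = to_domain_event(change)
        if event is not None:
            publish(event)  # e.g. a producer for a durable message queue
            sent += 1
    return sent
```
|
| Downstream services then only see events like ClusterRequested
| and never need to know the table layout of the control plane.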
|
| I don't know if it's a _production ready_ style architecture.
| Monitoring replication stream performance can be tricky and
| integration testing is challenging. And managing changes to the
| business domain is not a fully-solved problem: still a lot of
| exploration, tooling, and do-it-yourself duct-taping to do.
|
| But it's simple enough that I can do a lot of work with very
| little code so far.
| [deleted]
| infra_dev wrote:
| Hey, this is Ram, the author of the post. Would love to know
| how this architecture works in production for you. For
| infrastructure SaaS, the control plane needs to manage 100s of
| data planes. It also needs to own user and tenant management,
| security policies, billing based on usage, usage and
| operational insights for your users. The challenge is in
| building an infrastructure that can centrally manage the
| lifecycle of all the metadata (SaaS, app and infra) and
| orchestrate with all the data planes. Postgres is definitely a
| good building block to build on top of but there is so much to
| build around it to make this work well in our experience.
| agentultra wrote:
| Managing the data plane is the more complex part of it as you
| would expect.
|
| I do tend to model the business processes at a high level
| using domain-driven design and map the aggregates to services
| that ingest event streams. The services then react to the
| event streams in several ways: emitting events, updating
| control plane models, issuing new commands to other services,
| etc.
|
| Each service keeps its own state internally and if I need to,
| I can blow away their state and replay all of the business
| domain events thanks to the durable message stream. That part
| is key... and is also the most duct-tape-and-toil area of
| this architecture.
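|
| A toy Python version of the "blow away state and replay" idea,
| assuming the durable stream can be read as an ordered list. The
| service, its offset checkpoint, and the event types are all
| invented for this sketch.
|
```python
class ClusterCountService:
    """Derives per-region cluster counts purely from events (hypothetical)."""

    def __init__(self):
        self.offset = 0   # checkpoint: number of events already applied
        self.counts = {}  # region -> number of live clusters

    def handle(self, event):
        region = event["region"]
        if event["type"] == "ClusterCreated":
            self.counts[region] = self.counts.get(region, 0) + 1
        elif event["type"] == "ClusterDeleted":
            self.counts[region] = self.counts.get(region, 0) - 1
        self.offset += 1

    def catch_up(self, log):
        """Apply every event after the current checkpoint."""
        for event in log[self.offset:]:
            self.handle(event)


def rebuild(log):
    """Discard all state and replay the full log from the beginning."""
    service = ClusterCountService()
    service.catch_up(log)
    return service
```
|
| Because state is a pure function of the log, a rebuilt service
| ends up identical to one that followed the stream live.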
|
| I've been toying around with ideas to generalize this into a
| consolidated application framework but it's still pretty
| experimental stuff.
|
| The high-level architecture isn't terribly novel or new but
| having standardized tools for common operations, managing
| migrations from event schemas, managing checkpoints, etc. is
| still a work in progress.
___________________________________________________________________
(page generated 2022-06-22 23:01 UTC)