[HN Gopher] Shipping Multi-Tenant SaaS Using Postgres Row-Level ...
___________________________________________________________________
Shipping Multi-Tenant SaaS Using Postgres Row-Level Security
Author : capiki
Score : 124 points
Date : 2022-07-26 18:16 UTC (4 hours ago)
(HTM) web link (www.thenile.dev)
(TXT) w3m dump (www.thenile.dev)
| kache_ wrote:
| Context aware data access is really cool. And hard :)
| mkurz wrote:
| Be aware when using RLS with views: by default, the RLS policy
| will be executed with the permissions of the owner of the view
| instead of with the permissions of the user executing the
| current query. This way it can easily happen that the RLS
| policy is bypassed because the owner of the view is an admin
| account or the same account that owns the underlying table (see
| the gotchas section of the original post).
|
| However, upcoming PostgreSQL 15 adds support for security invoker
| views:
| https://github.com/postgres/postgres/commit/7faa5fc84bf46ea6...
| That means you can then define the security_invoker attribute
| when creating a view and this "... causes the underlying base
| relations to be checked against the privileges of the user of the
| view rather than the view owner" (see
| https://www.postgresql.org/docs/15/sql-createview.html).
|
| PG15 beta 1 release notes:
| https://www.postgresql.org/about/news/postgresql-15-beta-1-r...
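| For illustration, a minimal PG15 sketch (view and table names
| are made up):
|
|     CREATE VIEW visible_entries
|       WITH (security_invoker = true)
|       AS SELECT * FROM entries;
|
| Without the attribute, a SELECT through the view is checked
| against the view owner's privileges and policies, which is the
| RLS bypass described above.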
| simonw wrote:
| I don't fully understand the performance implications here.
|
| Say I was using this for a blog engine, and I wanted to run this
| SQL query: select * from entries;
|
| But I actually only want to get back entries that my current user
| is allowed to view - where author_id = 57 for example.
|
| Would PostgreSQL automatically turn the above query into the
| equivalent of this: select * from entries where
| author_id = 57;
|
| And hence run quickly (assuming there's an index on that
| author_id column)?
|
| Or would it need to run an additional SQL query check for every
| single row returned by my query to check row permissions, adding
| up to a lot of extra overhead?
| ossopite wrote:
| yes, postgres will add such a condition to the query and in
| simple cases like this is able to use a corresponding index
|
| unfortunately this can break down in more complex cases.
| roughly postgres trusts a limited set of functions and
| operators not to leak information about rows (e.g. via error
| messages) that the RLS policy says a query should not be able
| to see. that set includes basic comparisons but not more
| esoteric operations like JSON lookups. at some point postgres
| will insist on checking the RLS policy result for a row before
| doing any further work, which can preclude the use of indexes
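| A sketch of what that looks like for the example above (the
| policy shape and setting name are assumptions, not from the
| article):
|
|     ALTER TABLE entries ENABLE ROW LEVEL SECURITY;
|     CREATE POLICY author_only ON entries
|       USING (author_id = current_setting('app.author_id')::int);
|
|     SET app.author_id = '57';
|     EXPLAIN SELECT * FROM entries;
|
| In the simple case the plan should show an index condition on
| author_id, just as if WHERE author_id = 57 had been written by
| hand.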
| a-dub wrote:
| this is cool. next up, row level encryption with private
| information retrieval methods for enabling queries and searches
| homomorphically (on data encrypted by the client that the service
| provider never has a key for).
| spacemanmatt wrote:
| If I were leaning into RLS today I would do it through PostgREST
| pikdum wrote:
| PostGraphile is also a really neat tool, using GraphQL instead
| of REST.
| ei8ths wrote:
| I needed this two years ago; I was looking at this but
| couldn't figure out how to do it with an existing db
| connection pool to reuse connections. I might be migrating to
| this soon so that things will be more isolated between the
| tenants.
| mgl wrote:
| Row-level security is always a tricky and hard-to-enforce
| assumption, as this is not how we really use relational
| databases.
|
| Much bigger fan of the approach described here:
|
| Scalability, Allocation, and Processing of Data for Multitenancy
|
| https://stratoflow.com/data-scalability-allocation-processin...
| nbevans wrote:
| Using RLS to implement multi-tenancy is a terrible idea. Just
| deploy a database per tenant. It's not hard. Why overcomplicate
| it?
| bicijay wrote:
| Deploying a database per tenant is not that easy. You have a
| lot of new overhead, migrations become a pain in the ass
| (already are) and a lot of other little problems...
|
| I would say a database per tenant is overcomplicating it.
| Gigachad wrote:
| You end up building just a bit of automation around it and
| it's fine. The migration isn't any harder, you just run an
| ansible job or something to roll it out to all databases.
| mbreese wrote:
| A database per tenant makes the rest of the workflow
| significantly easier though. No need to add clauses to SQL
| WHERE statements for users/groups. Queries are faster (less
| data). And data can be moved much easier between servers.
|
| Yes, it does add extra overhead at account creation, during
| DB migrations, and for backups.
|
| But if you don't need cross-account or public data access, it
| can make life much easier.
| zie wrote:
| > No need to add clauses to SQL WHERE statements for
| users/groups.
|
| This is basically what RLS does for you. You specify the
| access and you specify the current user (via a connection,
| SET ROLE, etc). Then it does all that complicated query
| filtering stuff for you, to ensure you don't screw it up.
|
| > Queries are faster (less data). And data can be moved
| much easier between servers.
|
| Not really, the overhead is just different (and likely more
| of it) in your solution. It's not wrong, nor is using RLS
| right.
| pg_bot wrote:
| Deploying multiple databases is typically costly in the
| infrastructure as a service space. Plus you have more
| operational overhead in ensuring backups work and keeping
| things secure. It's much easier to use Postgres' schemas to
| segment the data within one single database. Frankly schemas
| are much easier to reason about, maintain, scale, and keep
| compliant than row level security.
| [deleted]
| andrewstuart wrote:
| I once implemented RLS/Postgres for Django.
|
| It worked pretty well.
|
| The basic mechanism was to intercept all outbound SQL queries
| and wrap them in Postgres session settings that set up the RLS
| context.
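| Roughly, the wrapping amounted to something like this sketch
| (the setting name here is hypothetical, not the actual Django
| code):
|
|     SET LOCAL app.current_user = 'usr_123';
|     SELECT * FROM entries;
|
| with the RLS policies reading the value back via
| current_setting('app.current_user') to decide row visibility.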
| jzelinskie wrote:
| As the developer of an external authorization system (full
| disclosure)[0], I feel obligated to chime in on the critiques
| of external authorization systems in this article. I don't
| think they're far off base, as we do recommend RLS for use
| cases like what the article covers, but anyway, here's my two
| cents:
|
| 1+2: Cost + Unnecessary complexity: this argument can be used
| against anything that doesn't fit the given use case. There's no
| silver bullet for any choice of solution. You should only adopt
| the solution that makes the most sense for you and vendors should
| be candid about when they wouldn't recommend adopting their
| solution -- it'd be bad for both the users and reputation of the
| solution.
|
| 3: External dependencies: That depends on the toolchain.
| Integration testing against SpiceDB is easier than Postgres, IMO
| [1]. SpiceDB integration tests can run fully parallelized and can
| also model check your schema so that you're certain there are no
| flaws in your design. In practice, I haven't seen folks write
| tests to assert that their assumptions about RLS are maintained
| over time. The last place you want invariants to drift is
| authorization code.
|
| 4: Multi-tenancy is core to our product: I'm not sure I'm steel-
| manning this point, but I'll do my best. Most companies do not
| employ authorization experts and solutions worth their salt
| should support modeling multi-tenant use cases in a safe way.
| SpiceDB has a schema language with idioms and recommendations to
| implement functionality like multi-tenancy, but still leaves it
| in the hands of developers to construct the abstraction that
| matches their domain[2].
|
| [0]: https://github.com/authzed/spicedb
|
| [1]: https://github.com/authzed/examples/tree/main/integration-
| te...
|
| [2]: https://docs.authzed.com/guides/schema
| gwen-shapira wrote:
| The blog explicitly said that if the requirements involve
| actual authorization models (beyond simple tenancy) then RLS is
| not the best fit (see: https://thenile.dev/blog/multi-tenant-
| rls#if-you-have-sophis...).
|
| I think this covers both the complexity aspect and the
| difference between what you get from RLS and what external
| authz brings to the table (schema, for example).
|
| I do think that RLS is a great way for a company without authz
| experts to build a multi-tenant MVP safely. I've yet to see a
| single pre-PMF company that worries about authorization beyond
| that; this is a series-B concern in my experience.
| semitones wrote:
| Btw that's a localhost link
| gwen-shapira wrote:
| oops :) Too many tabs. Fixed and thank you.
| paxys wrote:
| Is having to write "SELECT [...] WHERE user_id=<123>" really
| considered a security hole? Isn't that how like every service in
| existence operates? Coming up with complicated auth systems and
| patterns just because you are scared you will accidentally skip
| that WHERE clause seems bizarre to me.
| bvirb wrote:
| It's pretty nice using RLS that the entire query will follow
| the rules applied in the database. So for complex queries with
| say joins and/or subqueries they will all automatically follow
| the RLS policies as well. In our case we also have some global
| lookup tables that don't have RLS policies which can also be
| joined.
|
| We've found it pretty nice to cut out a whole class of possible
| bugs by being able to defer it to the database level. At the
| application level we end up with a wrapper that sets (and
| guarantees unsetting) multi-tenant access to the correct
| tenant, and then we never have to add "tenant_id = ..."
| anywhere, regardless of the query. Regardless of whether we
| forget in some query (which we almost surely would), it cuts
| out quite a bit of extra code.
|
| You can also do some cool stuff like add RLS policies for read-
| only multi-tenant access. Then you can query data across
| multiple tenants while enforcing that nothing accidentally gets
| written.
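| A sketch of that read-only idea (role and table names are made
| up): RLS is default-deny, so a SELECT-only policy for a
| reporting role permits cross-tenant reads while any write by
| that role is still rejected for lack of a matching policy:
|
|     CREATE POLICY reporting_read ON invoices
|       FOR SELECT TO reporting_role
|       USING (true);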
| nordsieck wrote:
| > Is having to write "SELECT [...] WHERE user_id=<123>" really
| considered a security hole? Isn't that how like every service
| in existence operates? Coming up with complicated auth systems
| and patterns just because you are scared you will accidentally
| skip that WHERE clause seems bizarre to me.
|
| Is having to avoid use after free really considered a security
| hole? Isn't that how like every program in existence operates?
| Coming up with complicated languages and frameworks just
| because you're scared you will accidentally use a variable
| after it's been freed seems bizarre to me.
|
| As it turns out, humans are bad at being consistent, whereas
| computers are much better. Maybe this particular solution
| isn't "the right thing", but it's at least an attempt at
| modifying the environment such that mistakes no longer happen.
| And at a meta level, that is precisely the right thing to do.
| thrownaway561 wrote:
| I think this is mainly an issue when you're using raw SQL
| statements. If you're using an ORM, there are many ways to add
| a WHERE clause to the statements automatically without having
| to update your code everywhere.
| galaxyLogic wrote:
| When you say 'user_id' do you mean each end-user of the system
| or each customer?
|
| I assume you have a few customers and then very many users
| belonging to each customer.
| mbreese wrote:
| From my perspective, it isn't the security aspects that are
| limiting, but the usability.
|
| If you want to have any access controls that isn't a simple
| user_id==123, SQL WHERE clauses can get complicated.
|
| Users, groups, or any kind of fine grained access control can
| make simple queries non-trivial. It's even worse if a user can
| be authorized to view data across different accounts.
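| For example (hypothetical schema), a groups-aware rule ends up
| embedding a membership check:
|
|     CREATE POLICY doc_access ON documents
|       USING (
|         owner_id = current_setting('app.user_id')::int
|         OR EXISTS (
|           SELECT 1
|           FROM group_members gm
|           JOIN document_groups dg ON dg.group_id = gm.group_id
|           WHERE dg.document_id = documents.id
|             AND gm.user_id = current_setting('app.user_id')::int
|         )
|       );
|
| Whether it lives in a policy or in every hand-written WHERE
| clause, that logic has to go somewhere; RLS at least
| centralizes it.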
| w-j-w wrote:
| lemax wrote:
| In my experience we've looked toward this kind of solution in a
| large legacy single-tenant application that wants to go multi
| tenant with more safety guarantees.
| fswd wrote:
| I use this for a startup in a re-write of their solution. It
| simplifies my queries, mutations, and security concerns. It
| also dramatically reduces the complexity of my code. There are
| also ROLES (Guest/Public user, Authenticated, Admin), and
| combining the roles with Row Level Security works well.
|
| I like it so much I don't want to go back!
| uhoh-itsmaciek wrote:
| >Another issue we caught during testing was that some requests
| were being authorized with a previous request's user id.
|
| This is the terrifying part about RLS to me: having to rely on
| managing the user id as part of the database connection session
| seems like an easy way to shoot yourself in the foot (especially
| when combined with connection pooling). Adding WHERE clauses
| everywhere isn't great, but at least it's explicit.
|
| That said, I've never used RLS, and I am pretty curious: it does
| seem like a great solution other than that one gotcha.
| shaicoleman wrote:
| We're currently using the schema-per-tenant, and it's working
| very well for us:
|
| * No extra operational overhead, it's just one database
|
| * Allows deleting a single schema, useful for GDPR compliance
|
| * Allows easily backing up/restoring a single schema
|
| * Easier to view and reason about the data from an admin point
| of view
|
| * An issue in a single tenant doesn't affect other tenants
|
| * Downtime for maintenance is shorter (e.g. database migration,
| non-concurrent REINDEX, VACUUM FULL, etc.)
|
| * Less chance of deadlocks, locking for updates, etc.
|
| * Allows easier testing and development by subsetting tenants'
| data
|
| * Smaller indexes, more efficient joins, faster table scans,
| more optimal query plans, etc. With row-level security, every
| index needs to be a compound index
|
| * Easy path to sharding per tenant if needed: just move some
| schemas to a different DB
|
| * Allows having shared data and per-tenant data in the same
| database. That doesn't work with the tenant-per-database
| approach
|
| There are a few cons, but they are pretty minor compared to the
| alternative approaches:
|
| * A bit more code to deal with tenancy, migrations, etc. We
| opted to write our own code rather than use an existing
| solution
|
| * A bit more hassle when dealing with PostgreSQL extensions.
| It's best to install extensions into a separate extensions
| schema
|
| * Possible caching bugs, so you need to namespace the cache
| and clear the query cache when switching tenants
|
| * The security guarantees of a per-tenant solution aren't
| perfect, so you need to ensure you have no SQL injection
| vulnerabilities
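| To make the compound-index point concrete (names are made up):
| with RLS, a hot index typically has to lead with the tenant
| key,
|
|     CREATE INDEX ON orders (tenant_id, created_at);
|
| whereas with schema-per-tenant a plain index on created_at is
| enough, because each schema's table holds only one tenant's
| rows.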
| bvirb wrote:
| We ran a multi-tenant SaaS product for years w/ a schema-per-
| tenant approach. For the most part it all worked pretty great.
|
| We ran into issues here and there but always found a way to
| work around them:
|
| * Incremental backups were a pain because of needing to lock so
| many objects (# of schemas X # of tables per schema).
|
| * The extra code to deal w/ migrations was kinda messy (as you
| mentioned).
|
| * Globally unique IDs become the combination of the row ID +
| the tenant ID, etc...
|
| For us though the real deal-breaker turned out to be that we
| wanted to have real foreign keys pointing to individual rows in
| tenant schemas from outside of the tenant schema and we
| couldn't. No way to fix that one since with multi-schema the
| "tenant" relies on DB metadata (the schema name).
|
| We ended up migrating the whole app to RLS (which itself was a
| pretty interesting journey). We were afraid of performance
| issues since the multi-schema approach kinda gives you
| partitioning for free, but with the index usage on the RLS
| constraints we've had great performance (at least for our use
| case!).
|
| After quite a bit of time working with both multi-schema & RLS
| I probably wouldn't go back to multi-schema unless I had a real
| compelling reason to do so due to the added complexity. I
| really liked the multi-schema approach, and I think most of the
| critiques of it I found were relatively easy to work around,
| but RLS has been a lot simpler for us.
| uhoh-itsmaciek wrote:
| There are some other cons:
|
| Memory usage and I/O can be less efficient. Postgres handles
| table data in 8kb pages, so even if you're just reading a
| single row, that reads 8kb from disk and puts 8kb in the
| Postgres buffer cache, with that row and whatever happens to be
| next to it in the physical layout of the underlying table.
| Postgres does this because of locality of reference: it's
| cheaper to bulk-load data from disk, and, statistically
| speaking, you may need the adjacent data soon. If each user is
| touching separate tables, you're loading a page per row for
| each user, and you're missing out on some of the locality
| benefits.
|
| Another problem is monitoring (disclosure: I work for
| pganalyze, which offers a Postgres monitoring service). The
| pg_stat_statements extension can track execution stats of all
| normalized queries in your database, and that's a very useful
| tool to find and address performance problems. But whereas
| queries like "SELECT * FROM posts WHERE user_id = 123" and
| "SELECT * FROM posts WHERE user_id = 345" normalize to the same
| thing, schema-qualified queries like "SELECT * FROM
| user_123.posts" and "SELECT * FROM user_345.posts" normalize to
| _different_ things, so you cannot easily consider their
| performance in aggregate (not to mention bloating
| pg_stat_statements by tracking so many distinct query stats).
| This is the case even when you're using search_path so that
| your schema is not explicitly in your query text.
|
| Also, performance of tools like pg_dump is not great with a ton
| of database objects (tables and schemas) and, e.g., you can run
| into max_locks_per_transaction [1] limits, and changing that
| requires a server restart.
|
| I wouldn't say you should never do schema-based multi-tenancy
| (you point out some good advantages above), but I'd be
| extremely skeptical of using it in situations where you expect
| to have a lot of users.
|
| [1]: https://www.postgresql.org/docs/current/runtime-config-
| locks...
| bearjaws wrote:
| This is such a killer feature in PG, my new job uses it and it
| makes audits of our tenancy model dead simple.
|
| Coming from a SaaS company that used MySQL, we would get asked
| by some customers how we guarantee we segmented their data,
| and it always ended at the app layer. One customer (a Fortune
| 10 company) asked if we could switch to SQL Server to get this
| feature...
|
| Our largest customers ask how we do database multi-tenancy and
| we point to our SDLC + PG docs and they go 'K'.
| eastbound wrote:
| I honestly don't understand how Oracle is still alive. Postgres
| has so many of these killer features.
|
| Also, I wonder how others do tenant separation, what other
| solutions there are.
| Scarbutt wrote:
| oracle has flashback
| throwaway787544 wrote:
| They're like a tick. Very good at burrowing in and hard to
| remove. They have a lot of clients for whom a dozen million
| dollars is a drop in the bucket, and moving away is a decade-
| long millions-of-dollars project.
| mritchie712 wrote:
| Legacy.
|
| If you have thousands of lines of code relying on Oracle the
| cost to migrate would be enormous.
| toomuchtodo wrote:
| Amazon has a great post on this topic.
|
| https://aws.amazon.com/blogs/aws/migration-complete-
| amazons-...
|
| I thought it was cool they retrained their Oracle DBAs into
| other roles as part of the project.
| spacemanmatt wrote:
| I work with a few former-Oracle DBAs in a PostgreSQL-
| flavored consultancy now and they are aces. All the root-
| cause analysis and organization skills transfer handily.
| paulmd wrote:
| Postgres is functionally and conceptually extremely
| similar to Oracle. There are a few oddities (in
| particular, oracle's "nulls are never in indexes" is
| kinda weird) but the redo log is similar to the WAL, etc.
| In most cases, similar approaches will perform similarly
| and experience pretty much transfers over with a few
| months of experience.
| gav wrote:
| Ignoring the cost, there's the risk/reward alignment you
| see in large enterprises.
|
| Imagine you're a new CIO. You know you're probably looking
| at a 3-5 year tenure at this new company and you want to
| lead with some big wins to set the tone and show your
| value.
|
| You're reviewing proposals from your senior leadership. One
| of the options is an Oracle migration. It could cost a
| million dollars to migrate, but you'd save a million
| dollars a year going forward. Oracle runs your mission-
| critical internal systems; any issues with the migration or
| the system you migrate to are going to cause significant
| financial and reputational damage. You'll have to defend this
| decision if anything goes wrong, i.e. you've absorbed a lot of
| risk with much less personal upside.
|
| What do you do? You put the proposal to the side and look
| for something that has a lot better upside.
| beckingz wrote:
| Exactly. The risk/cost profile for migrations is bad: If
| it goes well, decent return. If it goes poorly,
| catastrophic.
| abraae wrote:
| Oracle has had the ability to do this for decades, so
| whatever is keeping them alive, it's nothing to do with this
| particular nifty Postgres feature.
| revskill wrote:
| Sharding is the only scalable way in my experience. The point
| about scalability here is that I can control the load as the
| data gets bigger.
| e1g wrote:
| Every B2B client who asked us how we handle multi-tenancy also
| asked how we ensure their data is erased at the end of the
| contract. Using a shared database with RLS means you have to go
| through all DB backups, delete individual rows for that tenant,
| then re-generate the backup. That's a non-starter, so we opted
| for having one DB per tenant which also makes sharding,
| scaling, balancing, and handling data-residency challenges
| easier.
| axlee wrote:
| It makes BI work an absolute hellscape as well. Tradeoffs.
| brightball wrote:
| I filled out a ton of enterprise questionnaires on this stuff
| before and we just told people that it would be deleted when
| the backups expired after X days because we didn't have the
| capability to delete specific rows from our backups. Nobody
| ever argued.
|
| There's not a single customer I've ever run across who's
| going to halt a contract because you can't purge their data
| from your backups fast enough. They're signing up because of
| what you offer, not the termination clause.
| coenhyde wrote:
| i did the same with the same results
| mikeodds wrote:
| thirded
| wahnfrieden wrote:
| same. but! it's a liability nonetheless, go talk with
| legal etc
| e1g wrote:
| A recent example from the other side: a client contacts me
| and says they will have to exit from our _existing_
| contract unless we can update our (AWS) infrastructure to
| use their (AWS) encryption keys for servers and databases
| handling their tenancy. In Enterprise, some tenants are
| very opinionated about what cloud you use and how their
| data lives /flows within it. I run all our infosec,
| including SOC2 & ISO27001 programs, and I know that using
| their encryption keys is nothing but security theater. But
| with $500k p.a. on the line, I also know when it's
| showtime.
| voberoi wrote:
| This is the way -- also never had an issue across US
| healthcare and enterprise SaaS.
| gervwyk wrote:
| Also second this, we even split our AWS org into an AWS
| account per tenant. Although, this will maybe be a problem
| if we have 100s of clients. But it makes onboarding and off-
| boarding simple.
| throw03172019 wrote:
| Each client is running on their own instances / load
| balancers?
| HatchedLake721 wrote:
| It depends on an annual contract value (ACV), doesn't it?
| You can't give an AWS account to every $99 p/m plan, but
| you can for enterprise $50-100k+ deals.
| gervwyk wrote:
| Yeah, that sums it up. I guess it means it can't be
| labelled as "multi-tenant" then..
| SahAssar wrote:
| I think the way to handle this (based on how many companies
| handle GDPR compliance) is to not keep backups older than X
| months (usually 3 months) and have a clause that all data
| past that time is deleted.
| bearjaws wrote:
| We usually write a "reasonable best effort" clause into our
| deletion, that it will 100% be deleted from production within
| 30 days and automatically fall out of backups 60 days from
| there. This also helps since we can't control our downstream
| vendors such as Twilio, AWS SES, etc, who all have their own
| legal obligations and time frames.
|
| Even for large health systems they have been okay with it.
| jerryjerryjerry wrote:
| I think the TTL features provided by some DB vendors are
| actually orthogonal to multi-tenancy: the former deals with
| life-cycle policies for data, while the original problem of
| deleting a certain user's data is more related to data privacy
| policy, though overlap may exist.
| hobs wrote:
| Agree - performance on row level security (at least on SQL
| Server) is terrible, sharing by database is fairly easy.
| jquery_dev wrote:
| How do you manage your backend in this case? Do you have an
| instance of the backend for each customer, or do you allow the
| backend to make connections to all the DBs?
|
| I'm interested in doing something similar and wondering about
| the best way to handle the routing between the databases from
| a single backend.
| mason55 wrote:
| It really depends on your requirements, both functional and
| cost. Having a full stack per customer can be great for a
| lot of reasons. It's probably the safest because you never
| have to worry about something getting messed up in the code
| and creating cross-customer access. Once you're sure the
| environment is set up correctly you can sleep well at
| night. You also don't have to worry about one customer
| doing something to impact the performance of other
| environments (unless you're using shared infra, like
| putting all your DBs on a single cluster). And it can make
| maintenance easier, for example data migrations can start
| with small and/or less important customers for
| practice. It also can give you more flexibility if you need
| to make special snowflakes (i.e. some big customer wants
| weird IP whitelisting rules that can't work with your
| normal env setup).
|
| Downsides are that it's probably more expensive and more
| work. Even if your infra spin up is totally automated, you
| still need to keep track of all the environments, you still
| need to keep your Infrastructure-as-Code (e.g. your
| terraform scripts) up to date, more can go wrong when you
| make changes, there's more chance for environments to
| drift.
|
| So, in short, separate stacks usually means more safety &
| simpler application architecture in exchange for more cost
| and more effort to manage the fleet.
| sgarman wrote:
| Am I right in my understanding that EVERY request that comes
| in to their API creates a new connection to the database? What
| about reusing connections with connection pools, or one level
| up using pgbouncer or the like? Can you actually use RLS while
| reusing connections?
| cstejerean wrote:
| You can reuse the connection with a connection pool and use SET
| ROLE when you check it out.
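| A sketch of that pattern (role name is hypothetical):
|
|     -- on checkout from the pool
|     SET ROLE tenant_42;
|     -- ... run the request's queries as that role ...
|     -- on return to the pool
|     RESET ROLE;
|
| One caveat: SET ROLE is reversible from within the session, so
| this only isolates tenants if the application never executes
| attacker-controlled SQL that could issue RESET ROLE itself.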
| rst wrote:
| Nope. Quoting the article itself:
|
| "In the traditional use case of direct db access, RLS works by
| defining policies on tables that filter rows based on the
| current db user. For a SaaS application, however, defining a
| new db user for each app user is clunky. For an application
| use case you can dynamically set and retrieve users using
| Postgres' current_setting() function (i.e.:
|
|     SET app.current_app_user = 'usr_123'
|
| and
|
|     SELECT current_setting('app.current_app_user')
|
| )."
|
| The policies that they define reference these settings, so they
| can do a "set" at the start of processing every web request, on
| a pre-existing db connection.
| lemax wrote:
| It's possible to implement this without creating new
| connections to the database for each request by using SET
| LOCAL and wrapping every query in a transaction. Instead of
| applying RLS based on the current user, you apply RLS based on
| the parameter value you set at the beginning of the
| transaction. You can set this parameter value based on the
| user session in your application.
|
| Your RLS policy looks as follows:
|
|     CREATE POLICY tenant_${tableName}_isolation_policy
|       ON "${tableName}"
|       USING ("tenant_id" = current_setting('app.current_tenant'));
|
| Your queries look something like this:
|
|     BEGIN TRANSACTION;
|     SET LOCAL app.current_tenant = '${tenant}';
|     SELECT * FROM some_table;
|     END TRANSACTION;
|
| You can even initialize your writes with a `tenant_id` column
| defaulted to your `current_setting('app.current_tenant')`
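| A sketch of that column default (same setting name assumed):
|
|     ALTER TABLE some_table
|       ALTER COLUMN tenant_id
|       SET DEFAULT current_setting('app.current_tenant');
|
| so INSERTs inside the transaction pick up the correct tenant
| without the application passing tenant_id explicitly.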
___________________________________________________________________
(page generated 2022-07-26 23:00 UTC)