[HN Gopher] How we give every user SQL access to a shared ClickH...
___________________________________________________________________
How we give every user SQL access to a shared ClickHouse cluster
Author : eallam
Score : 60 points
Date : 2026-03-17 15:50 UTC (4 days ago)
(HTM) web link (trigger.dev)
(TXT) w3m dump (trigger.dev)
| senorrib wrote:
| Reasons 1-3 could very well be done with ClickHouse policies
| (RLS) and good data warehouse design. In fact, that's more secure
| than a compiler adding a where to a query ran by an all mighty
| user.
|
| Reason 4 is probably an improvement, but could probably be done
| with CH functions.
|
| The problem with custom DSLs like this is that tradeoff a massive
| ecosystem for very little benefit.
| efromvt wrote:
| As long as you don't deviate too much from ANSI, I think the
| 'light sql DSL' approach has a lot of pros when you control the
| UX. (so UIs, in particular, are fantastic for this approach -
| what they seem to be targeting with queryies and dashboards).
| It's more of a product experience; tables are a terrible
| product surface to manage.
|
| Agreed with the ecosystem cons getting much heavier as you move
| outside the product surface area.
| skeeter2020 wrote:
| Personally I think that's worse. SQL - which is almost
| ubiqutous - already suffers from a fragmentation problem
| because of the complex and dated standardization setup. When
| I learn a new DBMS the two questions I ask at the very start
| are: 1. what common but non-standard features are supported?
| 2. what new anchor-features (often cool but also often
| intended to lock me to the vendor) am I going to pick up?
|
| First I need to learn a new (even easy & familiar) language,
| second I need to be aware of what's proprietary & locks me to
| the vendor platform. I'd suspect they see the second as a
| benefit they get IF they can convince people to accept the
| first.
| mattaitken wrote:
| You're right RLS can go a long way here. With complex RBAC
| rules it can get tricky though.
|
| The main advantages of a DSL are you can expose a nicer
| interface to users (table names, columns, virtual columns,
| automatic joins, query optimization).
|
| We very intentionally kept the syntax as close to regular
| ClickHouse as possible but added some functions.
| theowaway213456 wrote:
| > table names, columns, virtual columns
|
| This sounds solvable with clickhouse views?
|
| > automatic joins
|
| Is this also not solvable with views? Also, clickhouse
| heavily discourages joins so I wonder how often this winds up
| being beneficial? For us, we only ever join against tenant
| metadata (i.e. resolving ID to name)
|
| > query optimization
|
| This sounds potentially interesting - clickhouse's query
| optimizer is not great IME, but it's definitely getting
| better
| hrmtst93837 wrote:
| A DSL for access control hides the risky part, nobody audits it
| until someone pokes a hole and prod becomes the test env.
| zie wrote:
| We do the same thing, every employee can access our main
| financial/back office SQL database, but we just use PostgreSQL
| with row level security[0]. We never bothered to complicate it
| like the post does.
|
| 0: https://www.postgresql.org/docs/18/ddl-rowsecurity.html
| orf wrote:
| Back office, employee access is a completely different problem
| to what is described in the post.
|
| How do you enforce tenant isolation with that method, or
| prevent unbounded table reads?
| tossandthrow wrote:
| They likely don't need tenant isolation and unbound table
| reads can be mitigated using timeouts.
|
| We do something similar for our backoffice - just with the
| difference that it is Claude that has full freedom to write
| queries.
| weird-eye-issue wrote:
| RLS...
| staticassertion wrote:
| I'd be so uncomfortable with this. It sounds like you're
| placing the full burden of access on a single boundary. I mean,
| maybe there's more to it that you haven't spoken about here,
| but "everything rests on this one postgres feature" is an
| unacceptably unsafe state to me.
| weird-eye-issue wrote:
| It's not like RLS is just some random feature they are
| misusing. It's specifically for security and is absolutely
| reliable. Maybe you should do a bit more research before
| making comments like this.
| staticassertion wrote:
| Of course it's designed for security... that has nothing to
| do with my statement. No single boundary is "absolutely
| reliable", that's my entire point.
| weird-eye-issue wrote:
| This is the real world not everybody can get a perfectly
| isolated database instance. Also you do realize that is
| not necessarily any more secure than RLS right? Something
| still has to control what database the code connects to.
| That could have a flaw just as much as you could have a
| flaw when setting up RLS.
| staticassertion wrote:
| I don't think you have any idea what you're talking
| about, certainly you don't know what I'm talking about.
| weird-eye-issue wrote:
| RLS is one of the simplest things to set up properly, if
| you can't figure it out I don't think I'm the one who
| doesn't know what they're talking about.
| skeeter2020 wrote:
| You can (and in some cases should) combine this with
| other boundaries, like access control or specific query
| logic. RLS moves the core checks closer to the data being
| controlled (i.e. the database) specifically to prevent
| the errors like forgetting to add the "where user_id =
| xxx" clause. It is super-valuable in compliance scenarios
| where someone like your DB Admin has permission to
| control access but not see any data, and consumers (both
| devs and end users) can write queries to see different
| levels of access but are not allowed to control data.
|
| Obviously it's not a silver bullet and the isolation can
| be confusing when debugging, but generally a single point
| for your applying RBAC is a feature not a shortcoming.
| The next level of security might be how you define your
| roles.
|
| I actually believe the simplest, most secure client
| scenario is physical isolation, where you give the
| user/consumer only the data they are allowed to use and
| then don't try to control it (someone mentioned this
| above, using parquet & duckdb). There's downsides here
| too: doesn't work for write scenarios, can be resource
| intensive or time delayed, doesn't handle chain of
| custody well, etc. You typically have two strategies:
|
| 1. pick the best approach for the specific situation.
|
| 2. pick your one tool as your hammer and be a d!ck about
| it.
| staticassertion wrote:
| Just to be clear, I am extremely pro-RLS.
| zie wrote:
| Correct, but you have yet to mention any other solution
| where there is more than one boundary at the data access
| layer. Because there really isn't any.
|
| You can code it yourself in your bespoke app, have your
| vendor maintain it with their bespoke access control, or
| let RLS do it. There aren't really any other options that
| I'm aware of.
|
| Personally, having done the "code it yourself in your
| bespoke app" it's a PITA and it's generally not nearly as
| good as RLS. That's what we did before RLS and it sucked.
|
| On top of that, you can do things like SSO, data
| encryption, etc, but those are not data access layers,
| those are different layers. We do these things too(tho
| very little of the data encryption part, since it's such
| a PITA to make work very reliably even with vault/boa
| holding the encryption keys for us).
| Philip-J-Fry wrote:
| Conceptually that's no different to any security measures
| that prevent you from accessing data you're not supposed to?
| At the end of the day with all data that is colocated you're
| trusting that some permission feature somewhere is preventing
| you from accessing data you're not supposed to.
|
| We trust that Amazon or Google or Microsoft are successful in
| protecting customer data for example. We trust that when you
| log into your bank account the money you see is yours, and
| when you deposit it we trust that the money goes into your
| account. But it's all just mostly logical separation.
| staticassertion wrote:
| > At the end of the day with all data that is colocated
| you're trusting that some permission feature somewhere is
| preventing you from accessing data you're not supposed to.
|
| Right but ideally more than one.
|
| > But it's all just mostly logical separation.
|
| Yes, ideally multiple layers of this. You don't all share
| one RDS instance and then get row level security.
| Philip-J-Fry wrote:
| Can you give an example of more than one layer of logical
| separation at the data layer?
|
| We all know that authentication should have multiple
| factors. But that's a different problem. Fundamentally at
| the point you're reading or writing data you're asking
| the question "does X has permission to read/write Y".
|
| I don't see what you're getting at.
| staticassertion wrote:
| I don't know their use case enough to understand what
| would or would not be an appropriate mitigation. For
| example, with regards to financial data, you could have
| client side encryption on values where those keys are
| brokered separately. I can't exactly design their system
| for them, but they're describing a system in which every
| employee has direct database access and the database
| holds financial information.
| Philip-J-Fry wrote:
| Right, encryption would protect the data. But still, at
| the end of the day you're trusting the permission model
| of the database. Encryption won't prevent you updating a
| row or deleting a row if the database permission model
| failed.
| staticassertion wrote:
| Well, I think we basically agree? My suggestion is merely
| that a database holding financial data should have more
| than a single layer of security. Granting direct access
| to a database is a pretty scary thing. A simple example
| would be that any vulnerability in the database is
| directly accessible, even just by placing a broker in
| between users and the database I'd likely start to feel a
| lot better, and now I'd have a primitive for layering on
| additional security measures.
|
| Encryption is an extremely powerful measure for this use
| case. If the data does not need to be indexed, you could
| literally take over the database process entirely and
| still not have access, it definitely doesn't rely on the
| permission model of the db because the keys would be
| brokered elsewhere.
| zie wrote:
| > My suggestion is merely that a database holding
| financial data should have more than a single layer of
| security.
|
| We require SSO(Azure via vault) to authenticate to the
| DB. We also don't expose PostgreSQL to the public
| internet. We aren't complete monsters :)
|
| > Granting direct access to a database is a pretty scary
| thing.
|
| For you maybe, because you were taught it's _scary_ or it
| just seems different? I dunno. I 'm very surprised with
| all the pushback about it being a single layer. Every
| other data access architecture will be a single layer
| too, it just can be made to look like it isn't. Or people
| think their bespoke access control system will be better
| because they have more control. Our experience taught us
| that's just bad thinking.
|
| We've been doing direct access to PostgreSQL since 1993
| without many issues. Though RLS is "recent" in terms of
| deployment(it came about in PG 10 I think). Before that
| we had a bespoke solution(written with lots of views and
| some C/pgsql code, it was slow and kind of sucked). RLS
| was a little buggy when it first was released, but within
| a year or so it was reliable and we moved everything over
| as quick as we could and haven't looked back.
|
| > Encryption is an extremely powerful measure for this
| use case.
|
| We do this with some data in some tables, but it's a PITA
| to do it right, so it's use is quite limited. We use
| Hashicorp Vault(now openbao) to hold the
| encryption/decryption keys.
| skeeter2020 wrote:
| row level security is not a feature specific to Postgres, but
| more a pretty standard and acceptable way to control access
| in a multitenant or multicontext environment that pretty much
| every data provider supports/implements. When it comes to
| answering a single specific question (like the one RLS
| targets) I believe you DO want a single, simple answer, vs.
| something like "it uses these n independent things working in
| conjunction..."
| staticassertion wrote:
| Right, RLS is great. What they are saying is this:
|
| > every employee can access our main financial/back office
| SQL database
|
| This means that there is no access gate other than RLS,
| which includes financial data. That is a _lot_ of pressure
| on _one_ control.
| zie wrote:
| Your SSO system is a lot of pressure on one control too.
| Nobody seems to have problems with Azure or Okta or
| whatever SSO system you use having every key to the
| kingdom.
|
| RLS has been around a long time and is very stable and
| doesn't change much. SSO providers keep adding stuff
| _ALL_ the time, and they regularly have issues. PG RLS is
| very boring in comparison.
|
| I don't remember the last CVE or outage we had with PG
| that broke stuff. I can't remember a single instance of
| RLS causing us access control problems on a wide scale.
| Since we tied their job(s) to their access control many
| years ago, it's very rare that we even have the random
| fat-fingered access control issue for a single user
| anymore either. I think the last one was a year ago?
| staticassertion wrote:
| > Your SSO system is a lot of pressure on one control
| too. Nobody seems to have problems with Azure or Okta or
| whatever SSO system you use having every key to the
| kingdom.
|
| Some do, which is why they want MFA on the target side as
| well as on their SSO. But yes, SSO is very scary and
| there's a ton of security pressure on it. I don't think
| that's a very good argument for why we should think that
| every system should only require one layer of defense.
|
| I'm going to sort of skip over any comparison to SSO
| since I'm not going to defend the position of "SSO is
| fine as a single barrier", especially as SSO is rarely
| implemented with one policy - there's device attestation,
| 2FA, etc.
|
| > RLS has been around a long time and is very stable and
| doesn't change much.
|
| RLS is great, I'm a fan.
|
| > I don't remember the last CVE or outage we had with PG
| that broke stuff.
|
| It doesn't really matter. The fact is that you're one CVE
| away from _every employee_ having access to arbitrary
| data, including financial data. I feel a bit like a
| broken record saying this.
| zie wrote:
| > It doesn't really matter. The fact is that you're one
| CVE away from every employee having access to arbitrary
| data, including financial data. I feel a bit like a
| broken record saying this.
|
| Sure, but it's the same with pretty much any other app
| architecture.
|
| Either your app has all the data access and you put your
| access control there, or you do the access control in the
| database. There really aren't other options here. There
| isn't access control defense in depth here. The best you
| can really do is do some data encryption of the data in
| the tables. We do _some_ of that, but it 's such a PITA
| that we relegate it to special stuff only.
|
| > especially as SSO is rarely implemented with one policy
| - there's device attestation, 2FA, etc.
|
| Sure but _ALL_ of that relies on the SSO system behaving
| as advertised, so you think of it as separate policies,
| but it really isn 't. It's once SSO CVE away from giving
| away the store. We use SSO with PG, that's how they
| authenticate to PG, we are fans of SSO too.
| zie wrote:
| Well, I mean it's not only RLS, but yes it's only PostgreSQL
| doing the access control as far as if they can see a
| particular table or row.
|
| Every user gets their own role in PG, so the rest of the PG
| access control system is also used.
|
| We have your normal SSO system(Azure) and if Tootie employee
| doesn't need access to Asset Control, they don't get any
| access to the asset schema for instance.
|
| What would be your method?
|
| You would have some app that your dev team runs that handles
| access control, so your app gets unrestricted access to the
| DB. Now your app is the single boundary, and it forces
| everyone to go through your app. How is that better? It also
| complicates your queries, with a ton of extra where
| conditions.
|
| A bunch of bespoke access control code you hope is reliable
| or a feature of the database that's well tested and been
| around for a long time. pgtap[0] is amazing for ensuring our
| access control (and the rest of the DB) works.
|
| If some random utility wants to access data, you either have
| to do something special access wise, or have them also go
| through your app(let's hope you have an API and it allows for
| whatever the special is). For us, that random utility gets
| SQL access just like everyone else. They get RLS applied,
| etc. They can be naive and assume they have total control,
| because when they do select * from employees; they get access
| to only the employee column and rows we want that utility to
| have.
|
| We have a bunch of tools over the decades that need access to
| various bits of our data for reason(s). Rather than make them
| all do wacky stuff with specialized API's, they just get bog
| standard PG SQL. We don't have to train vendor Tito how to
| deal with our stuff, we just hand them their auth info to PG
| and they can go to town. When people want Excel spreadsheets,
| they just launch excel, do a data query and their data just
| shows up magically. All from within Excel, using the standard
| excel data query tools, no SQL needed.
|
| 0: https://pgtap.org/
| staticassertion wrote:
| > What would be your method?
|
| I don't know because I don't know your use case. At
| minimum, direct db access means that every postgres CVE
| something I'd have to consider deeply. Even just gating
| access behind an API where the API is the one that gets the
| role or accepts some sort of token etc would make me feel
| more comfortable.
|
| > Now your app is the single boundary,
|
| No, the app would still use RLS.
|
| I'm not saying what you're doing is bad, but as described
| I'd be pretty uncomfortable with that deployment model.
| zie wrote:
| > No, the app would still use RLS.
|
| I don't think you thought this through? The problem with
| the app being constrained to RLS is you have User A and
| User B accessing your API, how do you get them access to
| the different data they need? It means the RLS is very
| wide open, since it needs to be able to see what User A
| and B can see. This forces your app to be the single
| boundary in pretty much all cases. Sure maybe you can
| give it a role where it has limited DDL rights(i.e not
| create table access or whatever).
|
| > At minimum, direct db access means that every postgres
| CVE something I'd have to consider deeply.
|
| I mean, not really, in practice? Most are just denial of
| service type bugs, not instant exploits. . Most of the
| DoS issues are not that big of a deal for us. They could
| affect us, but 99.9% of the time, they don't in reality,
| before we upgrade. RLS has been in PG for a good many
| years, it's quite stable. Sure, we upgrade PostgreSQL
| regularly, but you should do that _anyway_ , regardless
| of RLS usage or not.
| staticassertion wrote:
| > I don't think you thought this through?
|
| Well I'm not designing some arbitrary system. Don't
| expect a full spec.
|
| > The problem with the app being constrained to RLS is
| you have User A and User B accessing your API, how do you
| get them access to the different data they need?
|
| You can still have users provide the access to the
| service (ie: the password to get to the role), or
| otherwise map between User A and the role, etc. The
| service just brokers and constrains access.
|
| > Sure maybe you can give it a role where it has limited
| DDL rights(i.e not create table access or whatever).
|
| Yes, of course. Just as you would with users.
|
| > I mean, not really, in practice?
|
| I don't think it's contentious to say that if RLS is your
| only security boundary then your pressure is entirely on
| that one boundary. How could it be any other way? If you
| want to say "It's an extremely good boundary", okay.
| There have been relevant vulnerabilities though and I
| really don't know that we should say that we should
| expect 0 vulnerabilities in RLS in the future such that
| every employee having access to a db containing financial
| data is fine. The point of layering is to avoid having to
| put all pressure on this one thing.
|
| I don't even understand how this is contentious or
| confusing. If you have one boundary, you have one
| boundary. I'm suggesting that I'm uncomfortable with
| systems having one boundary.
| lyjackal wrote:
| I want to build a shared postgres db with hundreds of small
| apps (OLTP) accessing shared tables using a RLS model model
| against well defined tables.
|
| What are other limitations and mitigations folks have used or
| encountered to support stability and security? Things like
| - Query timeouts to prevent noisy neighbors - connection
| pooling (e.g. pgbouncer) also for noisy neighbors -
| client schema compatibility (e.g. some applications running
| older versions, have certain assumptions about the schema that
| may change over time)
| zie wrote:
| If you have people running crappy SQL SELECT, it can be a
| problem. statement-timeout[0] is your friend here. You still
| have to be on watch, and teach your users not to write crappy
| SQL.
|
| You can also limit it by creating read-only replica's and
| making SELECT's happen on the replica. We don't usually
| bother, since 99% of our users are employees of ours, we can
| teach them to not be stupid. Since their usage doesn't change
| much over time, we can usually just hand them a SQL query and
| say: here run this instead.
|
| Most of our employees don't even know they have SQL access,
| it's not like we force people to learn SQL to get their job
| done. Because of RLS and views, the ones that do SQL don't
| have to know much SQL, even if they do happen to use it.
| SELECT * from employees; gets them access to basically all
| the employee info they could want, but only to the employees
| they have access to. If you are a manager with 10 people,
| your select returns only your 10 people.
|
| The payroll staff runs the same query and gets all of the
| employees they handle payroll for. Since our payroll is done
| inside of PostgreSQL(thanks plpython[1]), we can do some
| crazy access control stuff that most systems would never even
| dream about. Whenever new auditors come in and see that our
| payroll staff is limited to seeing only the info they need to
| do payroll, and only for their subset of employees they
| actually pay, they are awestruck.
|
| The random vendors that can't be taught, we usually hand them
| a nightly SQLite dump instead. I.e let _them_ pay the CPU
| cost of their crappy SQL.
|
| Around client schema compatibility. This happens with other
| models too(API, etc). It's not unique to PG or SQL Databases.
| You have to plan for it. Since most all of our users interact
| with views and not with the actual underlying tables, it's
| not usually that big of a deal. In the extreme cases, where
| we can't just keep around a view for them, we have to help
| them along(sometimes kicking and screaming) into a new
| version.
|
| 0: https://www.postgresql.org/docs/current/runtime-config-
| clien...
|
| 1: https://www.postgresql.org/docs/current/plpython.html
| jelder wrote:
| We did this with MotherDuck, and without introducing a new
| language. Every tenant has their own isolated storage and
| compute, so it's trivial to grant internal users access to
| specific tenants as needed. DuckDB's SQL dialect is mostly just
| Postgres' with some nice ergonomic additions and a host of extra
| functionality.
| raw_anon_1111 wrote:
| This is explicitly not the problem they are trying to solve. In
| a single tenant database you don't have to by definition worry
| about multi tenant databases
| DangitBobby wrote:
| I guess the question then becomes, what problem does a multi-
| tenancy setup solve that an isolated database setup doesn't?
| Are they really not solving the same problem for a user
| perspective, or is it only from their own engineering
| perspective? And how do those decisions ultimately impact the
| product they can surface to users?
| raw_anon_1111 wrote:
| Off the top of my head, managing 100 different database
| instances takes a lot more work from the business
| standpoint than managing 1 database with 100 users.
|
| The article also mentioned that they isolate by project_id.
| That implies one customer (assume a business) can isolate
| permissions more granulary.
| mattaitken wrote:
| Yes it's exactly this. There's not a neat permission
| boundary when you have users, orgs, projects,
| environments. Let alone when you add RBAC too.
| steveBK123 wrote:
| With multi-tenant vs multi-database decision one driver
| would be the level of legal/compliance/risk/cost/resource
| drivers around how segregated users really are.
|
| Multi-database is more expensive generally but is a more
| brain dead guaranteed way to ensure the users are properly
| segregated, resilient across cloud/database/etc software
| releases that may regress something in a multi-tenant
| setup.
|
| Multi-tenant you always run the risk of a software update,
| misconfiguration or operational error exposing existence of
| other users / their metadata / their data / their usage /
| etc. You also have a lot more of a challenge engineering
| for resource contention.
| zie wrote:
| Agreed, in Multi-tenant, where the user/customer owns the
| data, I always reach for SQLite first. Each user/customer
| gets their own SQLite DB. Then you have a common
| PG/SQLite DB for any common metadata, billing, etc.
|
| That way when a customer leaves or they want a backup
| copy of their data, it's a rm <customer>.sqlite3 or
| .backup away.
|
| Sometimes you can't do that for various (almost always
| non-technical) reason(s), but it's always my 1st choice.
| mattaitken wrote:
| In a system with organizations, projects and advanced user
| access permissions having separate databases doesn't full
| solve the problem. You still need access control inside each
| tenanted database. It also makes cross-cutting queries
| impossible which means users can't query across all their
| orgs for example.
|
| The DSL approach has other advantages too: like rewriting
| queries to not expose underlying tables, doing automatic
| performance optimizations...
| elnatro wrote:
| New to ClickHouse here. Would you thing this kind of database has
| a niche when compared to usual RDBMS like MySQL and PostgreSQL?
| mattaitken wrote:
| ClickHouse is a high performance OLAP database. It's good for
| analytics and search.
|
| We use it (I'm the author or the article) so users can search
| every run they do and graph all sorts of metrics.
| baalimago wrote:
| The evolution of this is to use agents, and have users "chat with
| the data"
| mattaitken wrote:
| Yes, you can actually do this already because we expose a REST
| API and TypeScript SDK functions to execute the queries.
| cjonas wrote:
| We just create mini data "ponds" on the fly by copying tenant
| isolated gold tier data to parquet in s3. The users/agent queries
| are executed with duckdb. We run this process when the user start
| a session and generate an STS token scoped to their tenant bucket
| path. Its extremely simple and works well (at least with our data
| volumes).
| Waterluvian wrote:
| Is that why it's called DuckDb? Because data ponds?
| cjonas wrote:
| Idk but I named everything in the related code "duckpond" :)
| mritchie712 wrote:
| Hannes (one of the creators) had a pet duck
| QuantumNomad_ wrote:
| The DuckDB website has the following to say about the name:
|
| > Why call it DuckDB?
|
| > Ducks are amazing animals. They can fly, walk and swim.
| They can also live off pretty much everything. They are quite
| resilient to environmental challenges. A duck's song will
| bring people back from the dead and inspires database
| research. They are thus the perfect mascot for a versatile
| and resilient data management system.
|
| https://duckdb.org/faq#why-call-it-duckdb
| boundlessdreamz wrote:
| How do you copy all the relevant data? Doesn't this create
| unnecessary load on your source DB?
| cjonas wrote:
| We have various data sources (which is another benefit of
| this approach). Data from the application DB is currently
| pulled using the FE apis which handle tenant isolation and
| allow the application database to deal with the load. I think
| pg_duckdb could be a good solution here as well, but haven't
| gotten around to testing it. Other data come from analytics
| DB. Most of this is landed on an interval via pipeline
| scripts.
| mattaitken wrote:
| This is cool. I think for our use case this wouldn't work.
| We're dealing with billions of rows for some tenants.
|
| We're about to introduce alerts where users can write their own
| TRQL queries and then define alerts from them. Which requires
| evaluating them regularly so effectively the data needs to be
| continuously up to date.
| SOLAR_FIELDS wrote:
| Billions still seems crunchable for DDB. It's however much
| you can stuff into your RAM no? Billions is still consumer
| grade machine RAM depending on the data. Trillions I would
| start to worry. But you can have a super fat spot instance
| where the crunching happens and expose a light client on top
| of that then no?
|
| Quadrillions, yeah go find yourself a trino spark pipeline
| otterley wrote:
| How large are these data volumes? How long does it take to
| prepare the data when a customer request comes in?
| cjonas wrote:
| Small. We're dealing with financial accounts, holdings and
| transactions. So a user might have 10 accounts, thousands of
| holdings, 10s of thousands of transactions. Plus a handful of
| supplemental data tables. Then there is market data that is
| shared across tenants and updated on interval. This data is
| maybe 10-20M rows.
|
| Just to clarify, the data is prepared when the user (agent)
| analytics session starts. Right now it takes 5-10s, which
| means it's typically ready well before the agent has actually
| determined it needs to run any queries. I think for larger
| volumes, pg_duckdb would allow this to scale to 10s of
| millions rows pretty efficiently.
| bob1029 wrote:
| > How do you let users write arbitrary SQL against a shared
| multi-tenant analytical database without exposing other tenants'
| data or letting a rogue query take down the cluster?
|
| For query operations I would try to find a way to solve this with
| tools like S3 and SQLite. There are a few VFS implementations for
| S3 and other CDNs.
| nlittlepoole wrote:
| Open Table Formats (Iceberg, Delta Lake, Hudi, etc) are the
| approach we've taken. That let's us offer a query engine but
| also let's the tenant bring their preferred engine (Snowflake,
| Spark, DuckDB, etc). It also addresses dirty reads and some
| other state problems that come from trying to use the file
| system. It scales as much as the bucket does, so we haven't
| found a use case we couldn't scale to yet.
|
| We (https://prequel.co) recently started offering this as a
| white labeled capability so anyone can offer it without
| building it yourself. Its a newer capability to our export
| product where instead of sending the data to the tenant's data
| warehouse, we enable you to provision an S3/GCS/ABS/etc bucket
| with the data formatted. Credential management, analytics, etc
| is all batteries included so you don't have to do that either.
| The initial interest from our customers was around BI
| integrations but agent use is starting to pick up which is
| kinda interesting to see.
| r1290 wrote:
| How does it handle large tables like. 2b rows? And how does
| it stay updated?
| datatrashfire wrote:
| as a clickhouse architect, i can't help but feel they would have
| been better served by understanding their db's native
| capabilities better before jumping into implementing this.
|
| row level access control, resource quotas, scheduling policies,
| session settings, etc. all could have been used in concert to
| achieve a very similar outcome with a dozen or so ddl/dcl
| statements.
| theowaway213456 wrote:
| This is my impression as well. ClickHouse has tons of useful
| features built in that seem like they'd work well here. Though
| the documentation about those features has been very scattered
| and hard to find in my experience
| rishabhjajoriya wrote:
| Curious howw you handle a user who accidentally writes a query
| that does a full table scan across billions of rows does the
| quota kick in fast enough or have you had cases where it still
| causes noticeable impact on other tenants?
___________________________________________________________________
(page generated 2026-03-21 23:01 UTC)