[HN Gopher] How to Distribute Postgres Geographically
       ___________________________________________________________________
        
       How to Distribute Postgres Geographically
        
       Author : mebcitto
       Score  : 106 points
       Date   : 2024-06-01 04:42 UTC (1 days ago)
        
 (HTM) web link (xata.io)
 (TXT) w3m dump (xata.io)
        
       | hazaskull wrote:
        | While it is fun to see such issues solved creatively, it does
        | raise the question of manageability. When sharding data into
        | loosely coupled (FDW) silos, it becomes tricky to make
        | consistent backups, ensure locking mechanisms work when
        | sharded data might sometimes be directly related, handle
        | zone/region failures gracefully, prevent hot spots, perform
        | multi-region schema changes reliably, etc. I suppose this
        | pattern principally only works when the data is in fact not
        | strongly related and the silos are quite independent. I
        | wouldn't call that a distributed system at all, really. That
        | may be a matter of opinion, of course.
       | 
        | It does give a "When all you have is a hammer..." vibe to me
        | and begs the question: why not use a system that's designed
        | for use cases like this and do it reliably and securely? i.e.:
       | https://www.cockroachlabs.com/docs/stable/multiregion-overvi...
        | (yes, I know full data domiciling requires something even
        | stricter, but I currently don't know of any system that can
        | transparently span the globe and stay performant while not
        | sharing any metadata or caching between regions)
        
         | tudorg wrote:
          | > It does give a "When all you have is a hammer..." vibe to
          | me and begs the question: why not use a system that's
          | designed for use cases like this and do it reliably and
          | securely?
         | 
         | (disclaimer: blog post author)
         | 
          | One reason is that you want to stick to pure Postgres, for
          | example because you want to use Postgres extensions, or
          | because you prefer the liberal Postgres license.
         | 
          | It can also be a matter of performance: distributed
          | transactions are necessarily slower, so if you can almost
          | always avoid them by connecting to a single node that has
          | all the data the transaction needs, you're going to get
          | better performance.
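          | 
          | To make that cost concrete, here is a minimal sketch (made-
          | up hostnames and schema, using psycopg): a single-node
          | transaction pays one commit round trip, while a cross-node
          | transaction needs two-phase commit, doubling the commit
          | round trips and adding coordination state.
          | 
          |     # Sketch: single-node commit vs. manual two-phase commit.
          |     # Hostnames and the acct table are hypothetical; 2PC needs
          |     # max_prepared_transactions > 0 on both nodes.
          |     import psycopg
          | 
          |     # Single node: one BEGIN ... COMMIT, one commit round trip.
          |     with psycopg.connect("host=shard-eu dbname=app") as conn:
          |         conn.execute("UPDATE acct SET bal = bal - 10 WHERE id = 1")
          |         conn.execute("UPDATE acct SET bal = bal + 10 WHERE id = 2")
          |         # the context manager commits here
          | 
          |     # Two nodes: two-phase commit, two commit round trips per node.
          |     eu = psycopg.connect("host=shard-eu dbname=app", autocommit=True)
          |     us = psycopg.connect("host=shard-us dbname=app", autocommit=True)
          |     gid = "tx-42"  # must be unique among in-flight transactions
          |     for conn in (eu, us):
          |         conn.execute("BEGIN")
          |     eu.execute("UPDATE acct SET bal = bal - 10 WHERE id = 1")
          |     us.execute("UPDATE acct SET bal = bal + 10 WHERE id = 9")
          |     for conn in (eu, us):
          |         conn.execute(f"PREPARE TRANSACTION '{gid}'")  # phase 1
          |     for conn in (eu, us):
          |         conn.execute(f"COMMIT PREPARED '{gid}'")      # phase 2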
        
           | hazaskull wrote:
            | Hi there! Thank you for the post and your work on pgzx!
            | Those are of course valid reasons, though it depends on
            | the system (CockroachDB can place leaders on specific
            | nodes to speed up local queries, it has global tables,
            | and otherwise there are always follower reads).
            | Admittedly, if you want to keep data "pinned", you're
            | into manual placement rather than horizontal scaling,
            | but that's nitpicking and there are pros and cons. I do
            | enjoy the freedom of Postgres and am hopeful that its
            | virtually prehistoric storage design becomes a non-issue
            | thanks to tech such as Neon and OrioleDB. The option to
            | decouple storage would provide wonderful flexibility for
            | solutions like yours too, I think.
        
             | tudorg wrote:
              | Thanks for noticing pgzx! We are working on a project
              | built on top of it, going in the direction hinted at
              | by this blog post.
             | 
              | I agree that the separation of storage and compute
              | complements this nicely. In fact, we take advantage of
              | it in the Xata platform, which uses Aurora.
        
       | jitl wrote:
       | (I work at Notion, one of the larger Notion clones)
       | 
        | We experimented with using partitioned tables and doing the
        | fancy postgres_fdw thing when we sharded our tenant-clustered
        | tables in 2020. Even with all the Postgres instances in the
        | same region, we found the approach unwieldy. Routing queries
        | in the application gives you a lot more control than teaching
        | Postgres the exact logic to route a query, plus doing a
        | (dangerous? one-way? global?) schema migration whenever you
        | want to change the routing logic.
       | 
       | We touched on this briefly in our sharding blog post here:
       | https://www.notion.so/blog/sharding-postgres-at-notion
       | 
        | Another reason to avoid complex routing in Postgres is risk.
        | If something goes wrong at 3am in a load-bearing cross-
        | Postgres-instance query, how easy will it be to mitigate that
        | if the routing is happening in Postgres (with whatever
        | advanced clustering system: Citus, pgzx, postgres_fdw) versus
        | if the routing is happening in the application? For example,
        | what if there's a network partition between the "global"
        | Postgres instance and the us-east-2 cluster? Maybe you're
        | Postgres wizards and know how to handle this with a quick
        | schema change or something in the psql CLI, but I'd bet more
        | on my team's ability to write fault tolerance in the
        | application than in Postgres internal logic.
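        | 
        | For concreteness, app-side routing is roughly this shape (a
        | hypothetical sketch, not our actual code; the shard map, the
        | blocks table, and the DSNs are all made up). Changing the
        | routing logic is an ordinary deploy, and fault handling is
        | ordinary application code:
        | 
        |     # Hypothetical app-level shard routing sketch.
        |     import hashlib
        |     import psycopg
        | 
        |     SHARD_DSNS = [  # shard map owned by the app; change = deploy
        |         "host=pg-shard-0 dbname=app",
        |         "host=pg-shard-1 dbname=app",
        |         "host=pg-shard-2 dbname=app",
        |     ]
        | 
        |     def shard_for(tenant_id: str) -> str:
        |         # Stable hash: a tenant always maps to the same shard.
        |         h = int(hashlib.sha256(tenant_id.encode()).hexdigest(), 16)
        |         return SHARD_DSNS[h % len(SHARD_DSNS)]
        | 
        |     def load_blocks(tenant_id: str):
        |         try:
        |             dsn = shard_for(tenant_id)
        |             with psycopg.connect(dsn, connect_timeout=2) as conn:
        |                 cur = conn.execute(
        |                     "SELECT id, data FROM blocks WHERE tenant_id = %s",
        |                     (tenant_id,))
        |                 return cur.fetchall()
        |         except psycopg.OperationalError:
        |             # Mitigation lives here, in app code: retry, fail
        |             # over to a replica, or degrade gracefully.
        |             raise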
        
         | tudorg wrote:
          | Thanks a lot for commenting and pointing me to the blog
          | post. I do think I've seen it before but forgot about it.
          | I've re-read it now and it's a great read!
          | 
          | From what I understand, you decided to do sharding in the
          | application code, and given the current state of things I
          | think that makes total sense; I'd probably have done the
          | same.
         | 
          | Part of my point with the blog post is that there is a
          | built-in horizontal sharding solution in vanilla Postgres
          | (partitioning + FDW), but it's currently badly lacking when
          | it comes to cluster management, schema changes, distributed
          | transactions, and more. If we put work into it, perhaps we
          | can tip the balance, and the next successful Notion clone
          | could choose to do it at the DB level.
        
         | Ozzie_osman wrote:
          | Thanks for this commentary! I'm at a startup where we are
          | preparing to shard Postgres. I'd be curious whether you're
          | familiar with AWS Aurora Limitless, and how you would have
          | approached deciding between it and the approach in the blog
          | post, had it existed back in 2021.
        
           | sgarland wrote:
           | My biggest concern with Limitless - other than inherent
           | performance issues with Aurora - is that according to their
           | docs, it's built on Aurora Serverless. IME, Serverless
           | anything tends to get absurdly expensive very quickly.
        
           | jitl wrote:
            | I'm a solid "no" on any primary datastore database thingy
            | that has under 5 years of industry-wide production
            | workload experience, and I would seriously weigh it
            | against something with 10+ years of industry use.
           | 
            | In 2019, when I was interviewing at companies for my next
            | position, I heard from a few places that the original
            | Aurora for Postgres had lost their data. It seems like
            | the sentiment on mainline Aurora has improved a bit, but
            | I would never bet my company's future on an AWS preview
            | technology. Better the devil you know (and everyone else
            | knows).
        
       | nurple wrote:
        | Ever since the war of stored procedures, I've been very
        | reluctant to put any more logic than absolutely necessary
        | into the database. The simple truth of the matter is that DB
        | server codebases are awful, complicated places to develop
        | this kind of functionality. I strongly believe these features
        | should be handled at higher layers.
       | 
        | At my last job, in fintech, we used application-level shard
        | routing, with each aggregate root served by a particular RPC
        | service (users, accounts, transactions, etc.). When one of
        | these aggregate services was asked to pull data, it would
        | query a central routing service for the server and shard the
        | requested data resided on.
       | 
       | Between them and the DB servers we had a cluster of pgbouncer
       | instances to handle connection oversubscription and the routing
       | of queries around DB instances in the face of failures or
       | maintenance.
       | 
        | While this was pretty effective, the work to manage sharded
        | Postgres, DDL updates, caching, locating, and balancing data
        | was still very complicated and error-prone, and it was the
        | root of many production outages.
       | 
        | I didn't design this system, but after a couple of years
        | leading a platform migration, which necessitated getting into
        | all of this at a deep level, I would do it differently.
        | Instead of query routing and DB sharding, I would shard the
        | aggregate root services themselves. Each "sharded" set of
        | these instances would have a simple PG instance (with a
        | replica chain) behind it that knows nothing about any other
        | set.
       | 
        | At that point, instead of the aggregate root services
        | locating and querying data from DB shards on huge vertically
        | scaled servers, each set only pulls data from its dedicated
        | DB. A routing facade placed in front of the set of sets sends
        | requests from consumers to the one holding the desired data.
       | 
        | With this architecture, the complexity of sharding and
        | vertically scaling at the DB layer, and of handling
        | connection oversubscription with a query router like
        | PgBouncer, just falls away.
       | 
        | I would keep these sets, and their DBs, rather small, to also
        | reap the benefits of small datasets. One of the biggest
        | issues we had with huge DBs was the time it took to do
        | anything; as an example, restarting replication in a failed
        | chain could take many hours because of the huge amount of
        | data, and any hiccup in connectivity would quickly overrun
        | our ability to catch replication up, so we'd have to fall
        | back to zfs-send to resnap.
       | 
        | A larger number of smaller DBs would not improve the total
        | time needed for something like backups or DDL mutations, but
        | it would significantly reduce the time for any particular
        | instance, which reduces the overall risk and the blast radius
        | of a DB failure.
       | 
        | Another thing I think small DBs can help with is data
        | locality, DB automation, and possibly making the instances
        | effectively ephemeral. When your dataset is small, bringing
        | an instance up from a hot backup can take on the order of a
        | few seconds, which could allow you to schedule the root
        | services and their DB on the same host.
       | 
       | For geographical distribution, the routing facade can also send
       | writes to the set serving that shard in some other region.
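        | 
        | Sketched out, the facade reduces to a keyspace-to-set map
        | (all names here are hypothetical):
        | 
        |     # Hypothetical "set of sets" facade: requests route to a
        |     # service instance, each backed by its own small PG.
        |     import hashlib
        | 
        |     SETS = {  # region -> service endpoints, one per set
        |         "us-east": ["http://accounts-us-0:8080",
        |                     "http://accounts-us-1:8080"],
        |         "eu-west": ["http://accounts-eu-0:8080"],
        |     }
        | 
        |     def route(shard_key: str, home_region: str) -> str:
        |         # Stable hash picks one set in the key's home region;
        |         # the facade never touches a database directly.
        |         sets = SETS[home_region]
        |         h = int(hashlib.sha256(shard_key.encode()).hexdigest(), 16)
        |         return sets[h % len(sets)]
        | 
        |     # Writes for an EU-homed account go to the EU set,
        |     # wherever the caller is.
        |     endpoint = route("account-123", home_region="eu-west")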
        
         | sgarland wrote:
          | > Ever since the war of stored procedures, I've been very
          | reluctant to put any more logic than absolutely necessary
          | into the database.
         | 
         | Counterpoint: the fewer round trips to the DB you have to do,
         | the faster your app is.
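          | 
          | A made-up example of what that buys you: a chatty pattern
          | pays network latency per statement, while one server-side
          | call pays it once (place_order is a hypothetical stored
          | procedure):
          | 
          |     # Round trips: N statements vs. one procedure call.
          |     import psycopg
          | 
          |     items = [(1, 2), (2, 1), (3, 5)]  # (item_id, qty)
          | 
          |     with psycopg.connect("host=db dbname=app") as conn:
          |         # Chatty: one network round trip per row.
          |         for item_id, qty in items:
          |             conn.execute(
          |                 "UPDATE inventory SET stock = stock - %s"
          |                 " WHERE id = %s", (qty, item_id))
          | 
          |         # One trip: push the loop into the database.
          |         # place_order() is hypothetical; it would validate
          |         # and update everything in a single transaction.
          |         conn.execute(
          |             "CALL place_order(%s, %s::jsonb)",
          |             (42, '[{"id": 1, "qty": 2}, {"id": 2, "qty": 1},'
          |                  ' {"id": 3, "qty": 5}]'))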
         | 
         | My belief is that as DBAs more or less went away thanks to
         | cloud providers and the rise of Full Stack Engineering, the
         | quiet part that no one wanted to admit was that RDBMS are
         | hideously complicated, and you really need SMEs to use them
         | correctly. Thus, the use of stored procedures, triggers,
         | advanced functions etc. went away in favor of just treating the
         | DB as a dumb store, and doing everything in app logic.
         | 
          | As more and more companies are discovering that, actually,
          | you do need DB SMEs (I'm a DBRE; demand for this role has
          | skyrocketed), my hope is that we can push some of the logic
          | back into the DB. I am all for VCS, automated migrations,
          | canary instances, and other careful patterns. I'd just like
          | to see less treating of the RDBMS as a document store, and
          | more embracing of the advantages of the tech already in
          | place.
        
           | nurple wrote:
            | I kind of agree with your assessment; it would even speak
            | to the popularity of document DBs coming up around the
            | same time. However, I would argue that the decline in
            | deep RDBMS integrations is exactly because of their
            | excessive complexity. Modern software development methods
            | left them behind because their change velocity is awful.
            | This is doubly apparent when you look at features like
            | the ones TFA is talking about, exactly because they're
            | extremely complex to stand up, manage, and make changes
            | to.
           | 
            | I personally left in the era of ORMs, exactly because
            | they let me build systems in high-productivity languages
            | and frameworks, but also because the cost-benefit of
            | RDBMS integration just wasn't there. The DB is absolutely
            | the highest-risk place to put logic in any stack: if
            | something goes wrong, _everything_ breaks, and it's the
            | most difficult place to fix things.
           | 
           | I don't know what the future looks like for DBs, but I think
           | we'll continue to see them become even more transparent. Devs
           | just want to persist entities, and integrating at the RDBMS
           | level is a slog.
        
             | sgarland wrote:
              | I'm obviously biased because of my career speciality,
              | but I love everything about RDBMS. I love mastering the
              | million knobs they have, their weird quirks, and how,
              | if you use them as designed with a properly normalized
              | schema, your data will ALWAYS be exactly what you told
              | it to be, with no surprises.
             | 
              | To this end, "devs just want to persist entities" makes
              | me both sad and frustrated. Everyone seems to default
              | to "just use Postgres," but then doesn't want to
              | actually use an RDBMS. If you want a KV store, then use
              | one. If you want a document store, then use one. Don't
              | subject a relational database to storing giant JSON
              | blobs with UUIDv4 keys and then complain that it's slow
              | because you've filled the memory with bloated indices
              | and saturated the disk bandwidth with massive write
              | amplification.
        
       ___________________________________________________________________
       (page generated 2024-06-02 23:02 UTC)