[HN Gopher] Difference between running Postgres for yourself and...
       ___________________________________________________________________
        
       Difference between running Postgres for yourself and for others
        
       Author : pwmtr
       Score  : 172 points
       Date   : 2024-07-22 14:56 UTC (1 day ago)
        
 (HTM) web link (www.ubicloud.com)
 (TXT) w3m dump (www.ubicloud.com)
        
       | ramonverse wrote:
       | > Security: Did you know that with one simple trick you can drop
       | to the OS from PostgreSQL and managed service providers hate
       | that? The trick is COPY table_name FROM PROGRAM 'command'.
       | 
       | I certainly did not know that.
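       | 
       | For reference, the documented spelling uses the PROGRAM
       | keyword. A minimal sketch (table name and command are just
       | illustrative):
       | 
       |   -- Runs the shell command as the OS user the server runs as.
       |   -- Requires superuser or the pg_execute_server_program role.
       |   CREATE TABLE shell_out (line text);
       |   COPY shell_out FROM PROGRAM 'id';
       |   SELECT * FROM shell_out;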
        
         | amluto wrote:
         | If anyone actually needs the extra performance from avoiding
         | streaming over the Postgres protocol, this _could_ have been
         | done with some dignity using pipes and splice or using
         | SCM_RIGHTS. The latter technology has been around for a long
         | time.
        
           | kragen wrote:
           | only on localhost
        
             | theamk wrote:
             | That's kinda the point - there is an argument that one
             | should not be baking arbitrary shell command execution
             | into the database server at all. Such execution will lack
             | critical security features - setting the right user,
             | process tracking, cleanup, etc.
             | 
             | If you need to execute commands on the database server
             | for admin work, use something designed for that (such as
             | ssh) - this keeps rights management and logging simple,
             | with only one source of shell command execution.
             | 
             | If you need to execute commands periodically, use some
             | sort of task scheduler, running as a dedicated user. To
             | avoid a second connection, you may use Postgres-
             | controllable job queues. Either way, limit execution to
             | an allowlist of commands, so that even if Postgres
             | credentials are leaked no arbitrary commands can be
             | executed.
             | 
             | In both approaches, this would have allowed high-speed,
             | localhost-specific transports instead of a local shell...
             | if PostgreSQL had supported them.
        
               | kragen wrote:
               | actually i think what i said was wrong, because i guess
               | the shell command generating the data to import runs on
               | the database server, not the client, so there's no reason
               | the database server _couldn't_ be using pipes or file
               | descriptor passing for this already. in fact i'm not
               | clear why amluto thinks it isn't
        
               | theamk wrote:
               | huh?
               | 
               | Postgres already has a "COPY FROM STDIN" command,
               | which makes the database server use the existing
               | Postgres connection for the raw data. However, since
               | it uses the existing connection, the data has to be
               | wrapped in the Postgres protocol, which adds overhead
               | to the stream.
               | 
               | On the other hand, "COPY FROM PROGRAM" has no wrapping
               | overhead: it opens a direct pipe to/from the command,
               | so no Postgres protocol parsing is involved - only a
               | short command string is sent over the Postgres
               | connection, while the bulk data goes over a dedicated
               | channel. This makes it faster, although I am not sure
               | how much this actually saves.
               | 
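               | A sketch of the two shapes (table name and file path
               | are illustrative):
               | 
               |   -- Data travels inside the existing client
               |   -- connection, wrapped in COPY protocol messages:
               |   COPY big_table FROM STDIN WITH (FORMAT csv);
               | 
               |   -- The server opens its own pipe to the command;
               |   -- only the command string crosses the connection:
               |   COPY big_table FROM PROGRAM 'zcat /data/dump.csv.gz'
               |       WITH (FORMAT csv);
               | 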
               | amluto's point was that one could achieve the
               | no-wrapping performance of "COPY FROM PROGRAM" if one
               | could pass a dedicated data descriptor to "COPY FROM
               | STDIN". This could be done using SCM_RIGHTS (an
               | AF_UNIX protocol feature that passes descriptors
               | between processes) or pipes+splice (not 100% sure how
               | those would help). With SCM_RIGHTS, you'd have your
               | psql client create a pipe, exec the process, and then
               | pass the descriptor to the SQL server. This would have
               | exactly the same speed as "COPY FROM PROGRAM" (no
               | overhead, dedicated pipe) but would not mix security
               | contexts or execute any code under the server's
               | username - overall a better solution.
               | 
               | Your point was "only on localhost", which I
               | interpreted as: this approach would only work if psql
               | runs on localhost (because SCM_RIGHTS only works on
               | localhost), while "COPY FROM PROGRAM" could even be
               | executed remotely.
               | 
               | This is 100% correct, but as I said, I think the
               | ability to execute commands remotely as a server user
               | is a bad idea and should never have existed.
        
         | rs_rs_rs_rs_rs wrote:
         | You need to be a superuser or a member of the
         | pg_execute_server_program role for this to work.
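         | 
         | For example (the grantee role name is illustrative):
         | 
         |   GRANT pg_execute_server_program TO etl_admin;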
        
       | pwmtr wrote:
       | Hey, the author of the blog post here. If you have any
       | questions or comments, please let me know!
       | 
       | It's also worth calling out that the first diagram shows
       | dependencies between features for Ubicloud's managed
       | Postgres. AWS, Azure, and GCP's managed Postgres services
       | would have a different diagram. That's because we at Ubicloud
       | treat the write-ahead log (WAL) as a first-class citizen.
        
         | candiddevmike wrote:
         | Your blog doesn't really mention any of the turnkey PostgreSQL
         | deployment options out there. These days, especially on
         | Kubernetes, it has never been easier to run a SaaS-equivalent
         | PostgreSQL stack.
         | 
         | I think you may benefit from researching the ecosystem some
         | more.
        
           | metadat wrote:
           | Being critical without posting actual better solutions isn't
           | so helpful.
           | 
           | If you have concrete knowledge, please share it and don't be
           | cryptic!
        
           | sakjur wrote:
           | The author seems to have spent the better part of a decade
           | working professionally with Postgres. I think editorial
           | choice, rather than ignorance, might be why they're not
           | mentioning more prior art.
        
             | pwmtr wrote:
             | Yes. :) We quite like k8s-based managed Postgres solutions.
             | In fact, we convinced Patroni's author to come work with us
             | in a previous gig at Microsoft. We find that a good number
             | of companies successfully use k8s to manage Postgres for
             | themselves.
             | 
             | In this blog post, we wanted to focus on running Postgres
             | for others. AWS, Azure, and GCP's managed Postgres
             | services, and those offered by startups like Crunchy, don't
             | use k8s-based solutions. For us, one reason was operational
             | simplicity. You can write a control plane for managed
             | Postgres in 20K lines of code, including unit tests. This
             | way, if anything breaks at scale, you can quickly figure
             | out the issue without having to dive into dependencies.
             | 
             | We also understand different tradeoffs apply when you're
             | running Postgres for yourself or for others. In this blog
             | post, we wanted to focus on the latter scenario.
        
         | rembicilious wrote:
         | From the article: "If the user isn't actively working on the
         | database, PostgreSQL won't generate a WAL file each minute. WAL
         | files are by default 16 MB and PostgreSQL waits for the 16 MB
         | to fill up. So, your restore granularity could be much longer,
         | and you may not be able to restore to a particular minute in
         | time.
         | 
         | You can overcome this problem by setting the archive_timeout
         | and forcing Postgres to generate a new WAL file every minute.
         | With this configuration, Postgres would create a new WAL file
         | when it hits the 1-minute or 16 MB threshold, whichever comes
         | first.
         | 
         | The second issue with backup/restore is "no activity". In this
         | case, PostgreSQL wouldn't create a new file even if the
         | archive_timeout is set. As a solution, you can generate
         | artificial write activity by calling pg_current_xact_id()."
         | 
         | Can you explain why you would create a WAL file even though
         | there is no activity?
        
           | infogulch wrote:
           | The point is to create a WAL file if there is a _little_
           | activity, but not enough to fill 16MB.
        
           | solatic wrote:
           | "Point in time restore" is a bad way to call the feature if
           | you don't let your customers pick a moment in time to restore
           | to, so those tricks ensure that there's enough WAL entries to
           | allow people to pick with per-second granularity.
        
             | pintxo wrote:
             | One can let users pick a time and then just replay the
             | closest backup? Why create empty backups?
        
             | Doxin wrote:
             | But if there has been no activity you surely can just pick
             | the most recent log that's older than the time the user
             | picked?
        
               | Tostino wrote:
               | How do you know the WAL was not lost in transmission,
               | or that the server didn't crash before transferring
               | the WAL, even though there were writes after the last
               | WAL file you have in your backup?
               | 
               | That's why.
        
             | bjornsing wrote:
             | > those tricks ensure that there's enough WAL entries to
             | allow people to pick with per-second granularity
             | 
             | How can they ensure that? Sounds more like they ensure
             | that people can pick a point in time with _minute_
             | granularity.
             | 
             | I'm no expert, but doesn't postgres write a timestamp with
             | each record in the WAL? I know that when you start postgres
             | in recovery mode it lets you specify a timestamp you want
             | to restore to, and I don't think that timestamp needs to be
             | on a WAL file boundary. (So the tricks seem weird /
             | unnecessary to me.)
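             | 
             | (For reference, a minimal recovery-target sketch; the
             | archive path is illustrative. Since PostgreSQL 12 the
             | recovery settings are ordinary GUCs plus a
             | recovery.signal file:)
             | 
             |   ALTER SYSTEM SET restore_command =
             |       'cp /archive/%f %p';
             |   ALTER SYSTEM SET recovery_target_time =
             |       '2024-07-22 14:00:00+00';
             |   -- then create recovery.signal in the data directory
             |   -- and restart; recovery stops at that timestamp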
        
           | pwmtr wrote:
           | We want to ensure that the users are able to restore to any
           | minute in the past.
           | 
           | Then, as @pintxo, @bjornsing, and @Doxin asked below; Why not
           | let users pick a time and then replay the closest backup? Why
           | create empty backups? This is a valid question.
           | 
            | The answer is that without an empty backup, it's not
            | possible to confirm there was no write activity. Maybe
            | some write activity occurred but you failed to archive
            | the WAL file. You need at least one WAL file (empty or
            | not) with a timestamp equal to or more recent than the
            | target restore time to ensure there was no write
            | activity.
            | 
            | I didn't explain this in detail in the blog post for the
            | sake of brevity, but you don't need to create an empty
            | WAL file for each minute with no write activity. For
            | instance, if there was no write activity for 10 minutes
            | and then a WAL file is generated, PostgreSQL can deduce
            | there was no write activity during those 10 minutes and
            | is able to restore to any minute within that period.
            | However, if the user wants to restore to 5 minutes ago
            | and there has been no write activity for the past 10
            | minutes, PostgreSQL doesn't have a WAL file to confirm
            | the lack of write activity. In this case, you need to
            | call pg_current_xact_id() to generate a WAL file. So the
            | trick is, when you get a restore request, to call
            | pg_current_xact_id() once, just in case there was no
            | write activity.
            | 
            | This is a good example of the difference between running
            | PostgreSQL for yourself vs. for others. When running
            | PostgreSQL for yourself, such edge cases are not a
            | concern. However, when managing it for others, an edge
            | case like this could prevent a restore and create an
            | operational burden. Automating this process is
            | essential, especially if you are managing lots of
            | PostgreSQL databases.
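            | 
            | A minimal sketch of that restore-time trick
            | (pg_switch_wal() forces the current segment to be closed
            | and archived right away, rather than waiting for
            | archive_timeout):
            | 
            |   SELECT pg_current_xact_id();  -- writes a WAL record
            |   SELECT pg_switch_wal();       -- archive the segment now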
        
             | rembicilious wrote:
             | Thank you, that makes a lot of sense. I should have
             | intuited that, but I couldn't quite make the connection. I
             | really appreciate the write up, it gave me a lot to think
             | about as someone who has only ever managed my own DB.
        
         | wg0 wrote:
          | Great write-up. Do you plan to do something similar for
          | MySQL as well?
        
           | pwmtr wrote:
           | Do you mean writing something similar for MySQL or building a
           | MySQL service at Ubicloud?
           | 
           | For the first one, the answer would be no, because I don't
           | have any expertise on MySQL. I only used it years ago on my
           | hobby projects. I'm not the best person to write something
           | like this for MySQL.
           | 
            | For the second one, the answer would be maybe, but not
            | any time soon. We have a list of services that we want
            | to build at Ubicloud, and currently services like K8s
            | have higher priority.
        
             | wg0 wrote:
             | I meant the second one. Thank you for your answer.
             | 
              | A second, unrelated question that anyone might want
              | to add to: why is MySQL falling out of favour in
              | recent years in comparison to Postgres?
        
               | mxuribe wrote:
               | > ...why is MySQL falling out of favour in recent years
               | in comparison to Postgres
               | 
               | I'm curious about this myself! Anyone know, or care to
               | share?
        
               | fcatalan wrote:
               | Until I switched to Postgres 8 years ago I knew a lot
                | about MySQL. I barely know anything about Postgres.
        
               | pixl97 wrote:
               | Anecdotal evidence?
               | 
               | I used to work with/host lots of small php apps that tied
               | in with MySQL. PHP has dropped in popularity over the
               | years.
               | 
                | Add to this that hosting for Postgres has become
                | common, so you're no longer tied to cheap MySQL-only
                | hosting.
                | 
                | At the least, I think the Oracle name being tied to
                | MySQL made it icky for a lot of people who think the
                | big O is a devil straight from hell. This is what
                | got me looking at Postgres way back when.
                | 
                | There's probably a myriad of 'smaller' reasons just
                | like this that reinforce the trend.
        
               | pixl97 wrote:
               | https://news.ycombinator.com/item?id=35906604
               | 
               | From just over a year ago. And honestly the recent
               | reasons look like all the reasons why I moved away from
               | it in the past.
        
               | tracker1 wrote:
                | Every time I've used MySQL for anything, I come
                | across _something_ that annoys me. You can use
                | ANSI-quotes mode for tables, columns, etc., but when
                | making a foreign key reference it has to be
                | back-ticks.
                | 
                | UTF8 isn't actually UTF-8 (it's capped at three
                | bytes), which is a holdover from during UTF-8's
                | design, but even then, it should have been retired
                | and re-aliased in the next major release (a breaking
                | change that wouldn't break much).
                | 
                | Magic (stylized) single-quotes work as apostrophes
                | in queries, meaning you can't use simple escape
                | logic if you're writing a custom client, and you get
                | a whole new level of risk of security bugs. Yeah,
                | you should use parameterized queries, but still.
                | 
                | Default indexing is case-insensitive, so if you put
                | an index on a binary field, it's going to be case-
                | insensitive (which can/will cause annoying issues).
                | 
                | There are so many more one-off annoyances like this
                | that people hit when working with MySQL. PostgreSQL,
                | by contrast, has been far more consistent in its
                | usage, development, and application of query
                | support. Custom enhancements and so many other
                | features just work, and work better.
        
         | matharmin wrote:
          | 1. Do you use logical or physical replication for high
          | availability? It seems like logical replication requires a
          | lot of manual intervention, so I'd assume physical
          | replication?
          | 
          | 2. If that's the case, does that mean customers can't use
          | logical replication for other use cases?
         | 
         | I'm asking because logical replication seems to become more and
         | more common as a solution to automatically replicate data to
         | external systems (another Postgres instance, Kafka, data
         | warehousing, or offline sync systems in my case), but many
         | cloud providers appear to not support it. Many others do
         | (including AWS, GCP), so I'd also be interested in how they
         | handle high availability.
        
           | pwmtr wrote:
           | Yes, we use physical replication for HA.
           | 
            | There are many reasons cloud providers don't want to
            | support logical replication:
            | 
            | - It requires giving superuser access to the user. Many
            | cloud providers don't want to give that level of
            | privilege. Some cloud providers fork PostgreSQL or write
            | custom extensions to allow managing replication slots
            | without requiring superuser access. However, doing this
            | securely is very difficult. You suddenly open up a new
            | attack vector for various privilege escalation
            | vulnerabilities.
            | 
            | - If a user creates a replication slot but does not
            | consume the changes, it can quickly fill up the disk.
            | I've dealt with many different failure modes of
            | PostgreSQL, and I can confidently say that disk-full
            | cases are among the most problematic/annoying ones to
            | recover from.
            | 
            | - It requires careful management of replication slots in
            | case of failover. There are extensions and 3rd-party
            | tools helping with this, though.
            | 
            | So, some cloud providers don't support logical
            | replication and some support it weakly (i.e. they don't
            | cover all edge cases).
            | 
            | Thankfully, some improvements are being made in
            | PostgreSQL core that simplify failover of logical
            | replication slots (check this out for more information:
            | https://www.postgresql.org/docs/17/logical-replication-
            | failo...), but it is still too early.
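            | 
            | To make the disk-full point concrete, a monitoring
            | sketch (max_slot_wal_keep_size is PostgreSQL 13+; the
            | cap value is illustrative):
            | 
            |   -- how much WAL each slot is pinning on disk
            |   SELECT slot_name, active,
            |          pg_size_pretty(pg_wal_lsn_diff(
            |              pg_current_wal_lsn(), restart_lsn))
            |              AS retained_wal
            |   FROM pg_replication_slots;
            | 
            |   -- cap it, so a forgotten slot gets invalidated
            |   -- instead of filling the disk
            |   ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';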
        
             | matharmin wrote:
             | Yeah, I've dealt with some of those edge cases on AWS and
             | GCP.
             | 
             | Some examples:
             | 
             | 1. I've seen a delay of hours without any messages being
             | sent on the replication protocol, likely due to a large
             | transaction in the WAL not committed when running out of
             | disk space.
             | 
             | 2. `PgError.58P02: could not create file
             | \"pg_replslot/<slot_name>/state.tmp\": File exists`
             | 
             | 3. `replication slot "..." is active for PID ...`, with
             | some system process holding on to the replication slot.
             | 
             | 4. `can no longer get changes from replication slot
             | "<slot_name>". ... This slot has never previously reserved
             | WAL, or it has been invalidated`.
             | 
             | All of these require manual intervention on the server to
             | resolve.
             | 
             | And that's not even taking into account HA/failover, these
             | are just issues with logical replication on a single node.
             | It's still a great feature though, and often worth having
             | to deal with these issues now and then.
        
               | pwmtr wrote:
                | Definitely agreed. It is a great feature and a
                | building block for many complex features such as
                | multiple primaries, zero-downtime migrations, etc.
                | I'm also quite happy to see that with each PG
                | version it becomes more stable and easier to use.
        
             | RegnisGnaw wrote:
              | There are also cases where logical replication is not
              | 100% complete. One of our applications uses LOBs for
              | some reason and can't do logical replication.
        
         | markus_zhang wrote:
         | Thanks for the article. Reading it makes me realize being a
         | good SRE/DevOps is crazily difficult. Did you start as a DB
         | admin?
        
           | pwmtr wrote:
            | I started as a software developer and I'm still a software
           | developer. Though I always worked on either core database
           | development or building managed database services, by
           | coincidence at the beginning and by choice later on. I was
           | also fortunate to have the opportunity to work alongside some
           | of the leading experts in this domain and learn from them.
        
         | aae42 wrote:
          | Surprised the article doesn't mention pgbouncer - do you
          | not use a proxy in front of your pg instances?
        
         | simmschi wrote:
         | I enjoyed reading the article very much. Thanks for the write
         | up!
        
         | iamcreasy wrote:
         | Thanks for the great write up.
         | 
         | What is MMW on the flowchart?
        
           | pwmtr wrote:
            | It is "managed maintenance window", which basically means
            | letting the user pick a time window, such as Saturday
            | 8PM-9PM, during which you as the service provider ensure
            | that non-critical maintenance happens.
        
         | crabbone wrote:
         | There are "strange" properties your users have... who are those
         | people?
         | 
          | * You describe them as wanting to deploy the database in a
          | container: why would anyone do that (unless for throwaway
          | testing or such)?
          | 
          | * The certificate issue also seems very specific to the
          | case where something needs to go over the public Internet
          | to some anonymous users... Most database deployments I've
          | seen in my life fall into one of these categories: a
          | database server for a Web server, where the two talk over
          | a private local network, or a database backing some kind
          | of management application, where, again, the communication
          | between the management application and the database happens
          | without looping in the public Internet.
          | 
          | * Unwilling to wait 10 minutes to deploy a database. Just
          | how often do they need new databases? Btw, I'm not sure
          | any of the public clouds have ETAs on VM creation, but in
          | my experience, Azure can easily take more than 10 minutes
          | to bring up a VM. With some flavors it can take a very
          | long time.
         | 
         | The only profile I can think of is the kind of user who doesn't
         | care about what they store in the database, in the sense that
         | they are going to do some quick work and throw the data away
          | (e.g. testing, or maybe BI on a slice of data). But then why
         | would these people need HA or backups, let alone version
         | updates?
        
       | zhengiszen wrote:
        | In the Red Hat ecosystem there is an Ansible role for that:
        | https://github.com/linux-system-roles/postgresql I don't know
        | if it will help everyone, but it could be a good way to
        | standardize and maintain an instance configuration.
        
         | wg0 wrote:
         | Is there something similar for MySQL that covers backup and
         | restore too?
        
       | superice wrote:
        | I'm a little confused about the point-in-time restore
        | functionality; I'm pretty sure there must be a way to avoid
        | forcing those one-minute WAL boundaries. DigitalOcean's
        | managed PostgreSQL, for instance, just allows you to specify
        | a timestamp and restore, and when looking into the PostgreSQL
        | docs I remember seeing an option to specify a timestamp as
        | well.
        
         | pwmtr wrote:
          | You can still restore to a given minute even without
          | one-minute WAL boundaries (most of the time). Consider the
          | case where you have very low write activity and it takes
          | one hour to fill up a WAL file (16MB). That WAL file won't
          | be archived until it is full, and if you lose your database
          | for some reason, you won't have the last hour's data in
          | your backups. That means you cannot restore to any minute
          | in that one-hour window. Shorter WAL boundaries reduce your
          | exposure. If you set archive_timeout to 1 minute, then you
          | can restore to any minute in the past with the exception of
          | the last minute (in practice, it is possible to lose the
          | last few minutes because their WAL files might not be
          | archived yet, but the exposure would still be much
          | smaller).
          | 
          | DigitalOcean uses 5 minutes as the archive_timeout, which
          | is also a reasonable value. In our experience, most of our
          | customers prefer less exposure, so we settled on 1 minute
          | as the archive_timeout value.
          | 
          | Roughly speaking, archive_timeout defines your RPO
          | (recovery point objective).
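          | 
          | (For reference, a minimal sketch of that setting;
          | archive_timeout is reloadable, so no restart is needed:)
          | 
          |   ALTER SYSTEM SET archive_timeout = '1min';
          |   SELECT pg_reload_conf();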
        
       | tehlike wrote:
        | I feel like cloudnative-pg takes away the majority of the
        | pain by using the well-known Kubernetes operator pattern.
        
         | pwmtr wrote:
          | In another thread on this page I wrote more about this, but
          | in summary: we also like k8s-based managed Postgres
          | solutions.
         | They are quite useful if you are running Postgres for yourself.
         | In managed Postgres services offered by hyperscalers or
         | companies like Crunchy though, it is not used very commonly.
        
           | JohnMakin wrote:
           | > it is not used very commonly.
           | 
           | Is this a problem of multi-tenancy in k8s specifically or
           | something else?
        
       | thyrsus wrote:
       | Thanks for the article; it's an important checklist.
       | 
       | I would think you'd want at most one certificate authority per
       | customer rather than per database. Why am I wrong?
        
         | pwmtr wrote:
          | You are not wrong. There are benefits to sharing a
          | certificate authority per customer. We actually considered
          | using one authority per organization. However, in our case
          | there is no organization entity in our data model. There
          | are projects, which are similar but not exactly the same.
          | So we were not entirely sure where we should put the
          | boundary and decided to go with the more isolated approach.
          | 
          | It is likely that we will add an organization-like entity
          | to our data model in the future, and at that point sharing
          | a certificate authority would make more sense.
        
       | wolfhumble wrote:
       | From the article:
       | 
       | "The problem with the server certificate is that someone needs to
       | sign it. Usually you would want that certificate to be signed by
       | someone who is trusted globally like DigiCert. But that means you
       | need to send an external request; and the provider at some point
       | (usually in minutes but sometimes it can be hours) signs your
       | certificate and returns it back to you. This time lag is
       | unfortunately not acceptable for users. So most of the time you
       | would sign the certificate yourself. This means you need to
       | create a certificate authority (CA) and share it with the users
       | so they can validate the certificate chain. Ideally, you would
       | also create a different CA for each database."
       | 
       | Couldn't you automate this with Let's Encrypt instead?
       | 
       | Thanks!
        
       ___________________________________________________________________
       (page generated 2024-07-23 23:11 UTC)