[HN Gopher] Difference between running Postgres for yourself and...
___________________________________________________________________
Difference between running Postgres for yourself and for others
Author : pwmtr
Score : 172 points
Date : 2024-07-22 14:56 UTC (1 day ago)
(HTM) web link (www.ubicloud.com)
(TXT) w3m dump (www.ubicloud.com)
| ramonverse wrote:
| > Security: Did you know that with one simple trick you can drop
| to the OS from PostgreSQL and managed service providers hate
| that? The trick is COPY table_name FROM PROGRAM 'command'
|
| I certainly did not know that.
| amluto wrote:
| If anyone actually needs the extra performance from avoiding
| streaming over the Postgres protocol, this _could_ have been
| done with some dignity using pipes and splice or using
| SCM_RIGHTS. The latter technology has been around for a long
| time.
| kragen wrote:
| only on localhost
| theamk wrote:
| That's kinda the point - there is an argument that one
| should not be baking arbitrary shell command execution into
| the database server at all. Such execution will lack critical
| security features - setting the right user, process tracking,
| cleanup, etc.
|
| If you need to execute commands on the database server for
| admin work, use something designed for this (such as ssh) -
| that keeps rights management and logging simple, with only
| one source of shell command execution.
|
| If you need to execute commands periodically, use some sort
| of task scheduler running as a dedicated user. To avoid a
| second connection, you may use postgres-controllable job
| queues. Either way, limit execution to allowed commands only,
| so that even if postgres credentials are leaked, no arbitrary
| commands can be executed.
|
| In both approaches, this would have allowed high-speed,
| localhost-specific transports instead of a local shell... if
| PostgreSQL had supported them.
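A minimal sketch of that allow-list idea (the command names here are hypothetical; a real deployment would pin specific admin scripts by absolute path): the runner refuses anything outside a fixed set, so leaked database credentials cannot be parlayed into arbitrary shell access.

```python
import shlex
import subprocess

# Hypothetical allow-list; in practice this would name specific admin
# tools (vacuum jobs, backup wrappers) with pinned absolute paths.
ALLOWED_COMMANDS = {"echo", "true"}

def run_allowed(command_line: str) -> str:
    """Run a command only if its executable is on the allow-list."""
    argv = shlex.split(command_line)
    if not argv or argv[0] not in ALLOWED_COMMANDS:
        raise PermissionError(f"command not allowed: {command_line!r}")
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout
```

Here `run_allowed("echo nightly-vacuum")` runs, while `run_allowed("rm -rf /")` raises `PermissionError` before anything executes.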
| kragen wrote:
| actually i think what i said was wrong, because i guess
| the shell command generating the data to import runs on
| the database server, not the client, so there's no reason
| the database server _couldn't_ be using pipes or file
| descriptor passing for this already. in fact i'm not
| clear why amluto thinks it isn't
| theamk wrote:
| huh?
|
| Postgres already has a "COPY FROM STDIN" command, which
| makes the database server use the postgres connection for
| the raw data. However, since it uses the existing connection,
| it needs to be compatible with the Postgres protocol, which
| means there is extra overhead in wrapping the streaming
| data.
|
| On the other hand, "COPY FROM PROGRAM" has no wrapping
| overhead, as it opens a direct pipe to/from the command, so
| no Postgres protocol parsing is involved - only a short
| command string is sent via the postgres connection, while
| the bulk data goes over a dedicated channel. This makes it
| faster, although I am not sure how much this actually
| saves.
|
| amluto's point was that one could achieve the no-wrapping
| performance of "COPY FROM PROGRAM" if one could pass a
| dedicated data descriptor to "COPY FROM STDIN". This
| could be done using SCM_RIGHTS (an AF_UNIX protocol
| feature that passes descriptors between processes) or
| pipes+splice (not 100% sure how those would help). With
| SCM_RIGHTS, you'd have your psql client create a pipe,
| exec the process, and then pass the descriptor to the sql
| server. This would have exactly the same speed as "COPY
| FROM PROGRAM" (no overhead, dedicated pipe) but would not
| mix security contexts and would execute any code under the
| client's own username rather than the server's - overall a
| better solution.
|
| Your point was "only on localhost", which I interpreted
| as: this approach would only work if psql runs on
| localhost (because SCM_RIGHTS only works on localhost),
| while "COPY FROM PROGRAM" could even be executed
| remotely.
|
| That is 100% correct, but as I said, I think the ability
| to execute commands remotely as the server user is a bad
| idea and should never have existed.
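As a rough, self-contained illustration of the SCM_RIGHTS idea (this is a toy, not how psql works): a process can hand the read end of a pipe to another process over an AF_UNIX socket, and the receiver can then read the bulk data directly, with no protocol framing. Python's `socket.send_fds`/`recv_fds` (3.9+, POSIX only) use SCM_RIGHTS under the hood.

```python
import os
import socket

# A stand-in for the client/server pair: two connected AF_UNIX sockets.
client, server = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# "Client" side: create a pipe whose read end will be handed to the server.
read_fd, write_fd = os.pipe()
socket.send_fds(client, [b"COPY"], [read_fd])  # SCM_RIGHTS under the hood
os.close(read_fd)  # the kernel holds a reference until the server receives it

# "Server" side: receive the descriptor.
msg, fds, flags, addr = socket.recv_fds(server, 1024, maxfds=1)

# "Client" streams the bulk data over the dedicated pipe, unframed.
os.write(write_fd, b"1,alice\n2,bob\n")
os.close(write_fd)

# "Server" reads the data straight off the transferred descriptor.
with os.fdopen(fds[0], "rb") as bulk:
    data = bulk.read()
print(data)  # b'1,alice\n2,bob\n'
```

The spawned producer would run under the client's own user, which is the security-context point made above.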
| rs_rs_rs_rs_rs wrote:
| You need to be a superuser or a member of the
| pg_execute_server_program role for this to work.
| pwmtr wrote:
| Hey, the author of the blog post here. If you have any
| questions or comments, please let me know!
|
| It's also worth calling out that the first diagram shows
| dependencies between features for Ubicloud's managed Postgres.
| AWS, Azure, and GCP's managed Postgres services would have a
| different diagram. That's because we at Ubicloud treat
| write-ahead logs (WAL) as a first-class citizen.
| candiddevmike wrote:
| Your blog doesn't really mention any of the turn key PostgreSQL
| deployment options out there. These days, especially on
| Kubernetes, it has never been easier to run a SaaS-equivalent
| PostgreSQL stack.
|
| I think you may benefit from researching the ecosystem some
| more.
| metadat wrote:
| Being critical without posting actual better solutions isn't
| so helpful.
|
| If you have concrete knowledge, please share it and don't be
| cryptic!
| sakjur wrote:
| The author seems to have spent the better part of a decade
| working professionally with Postgres. I think editorial
| choice, rather than ignorance, might be why they're not
| mentioning more prior art.
| pwmtr wrote:
| Yes. :) We quite like k8s-based managed Postgres solutions.
| In fact, we convinced Patroni's author to come work with us
| in a previous gig at Microsoft. We find that a good number
| of companies successfully use k8s to manage Postgres for
| themselves.
|
| In this blog post, we wanted to focus on running Postgres
| for others. AWS, Azure, and GCP's managed Postgres
| services, and those offered by startups like Crunchy, don't
| use k8s-based solutions. For us, one reason was operational
| simplicity. You can write a control plane for managed
| Postgres in 20K lines of code, including unit tests. This
| way, if anything breaks at scale, you can quickly figure
| out the issue without having to dive into dependencies.
|
| We also understand different tradeoffs apply when you're
| running Postgres for yourself or for others. In this blog
| post, we wanted to focus on the latter scenario.
| rembicilious wrote:
| From the article: "If the user isn't actively working on the
| database, PostgreSQL won't generate a WAL file each minute. WAL
| files are by default 16 MB and PostgreSQL waits for the 16 MB
| to fill up. So, your restore granularity could be much longer,
| and you may not be able to restore to a particular minute in
| time.
|
| You can overcome this problem by setting the archive_timeout
| and forcing Postgres to generate a new WAL file every minute.
| With this configuration, Postgres would create a new WAL file
| when it hits the 1-minute or 16 MB threshold, whichever comes
| first.
|
| The second issue with backup/restore is "no activity". In this
| case, PostgreSQL wouldn't create a new file even if the
| archive_timeout is set. As a solution, you can generate
| artificial write activity by calling pg_current_xact_id()."
|
| Can you explain why you'd create a WAL file even though there
| is no activity?
| infogulch wrote:
| The point is to create a WAL file if there is a _little_
| activity, but not enough to fill 16MB.
| solatic wrote:
| "Point in time restore" is a bad way to call the feature if
| you don't let your customers pick a moment in time to restore
| to, so those tricks ensure that there's enough WAL entries to
| allow people to pick with per-second granularity.
| pintxo wrote:
| One can let users pick a time and then just replay the
| closest backup? Why create empty backups?
| Doxin wrote:
| But if there has been no activity you surely can just pick
| the most recent log that's older than the time the user
| picked?
| Tostino wrote:
| How do you know the WAL was not lost during transmission,
| or the server crashed before transferring the WAL even
| though there were writes after the last WAL you have in
| your backup?
|
| That's why.
| bjornsing wrote:
| > those tricks ensure that there's enough WAL entries to
| allow people to pick with per-second granularity
|
| How can they ensure that? Sounds more like they ensure that
| people can pick point in time with _minute_ granularity.
|
| I'm no expert, but doesn't postgres write a timestamp with
| each record in the WAL? I know that when you start postgres
| in recovery mode it lets you specify a timestamp you want
| to restore to, and I don't think that timestamp needs to be
| on a WAL file boundary. (So the tricks seem weird /
| unnecessary to me.)
| pwmtr wrote:
| We want to ensure that users are able to restore to any
| minute in the past.
|
| Then, as @pintxo, @bjornsing, and @Doxin asked below: why not
| let users pick a time and then replay the closest backup? Why
| create empty backups? This is a valid question.
|
| The answer is that without an empty backup, it's not possible
| to confirm there was no write activity. Maybe some write
| activity occurred but you failed to archive the WAL file. You
| need at least one WAL file (empty or not) with a timestamp
| equal to or more recent than the target restore time to
| ensure there was no write activity.
|
| I didn't explain this in detail in the blog post for the sake
| of brevity, but you don't need to create an empty WAL file
| for each minute with no write activity. For instance, if
| there was no write activity for 10 minutes and then a WAL
| file is generated, PostgreSQL can deduce there was no write
| activity during those 10 minutes and is able to restore to
| any minute within that period. However, if the user wants to
| restore to 5 minutes ago and there has been no write activity
| for the past 10 minutes, PostgreSQL doesn't have a WAL file
| to confirm the lack of write activity. In this case, you need
| to call pg_current_xact_id() to generate a WAL file. So the
| trick is, when you get a restore request, to call
| pg_current_xact_id() once, just in case there was no write
| activity.
|
| This is a good example of the difference between running
| PostgreSQL for yourself vs. for others. When running
| PostgreSQL for yourself, such edge cases are not a concern.
| However, when managing it for others, an edge case like this
| could prevent a restore and create an operational burden.
| Automating this process is required, especially if you are
| managing lots of PostgreSQL databases.
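The rule explained above can be stated mechanically. In this toy model (WAL files reduced to their archive timestamps, in minutes), a restore target is only provably quiet if some archived WAL file is stamped at or after it; the forced WAL switch via `pg_current_xact_id()` is what supplies that witness during a quiet period.

```python
def can_restore_to(archived_wal_timestamps, target_time):
    """True if the archive can prove no writes are missing at target_time.

    A WAL file stamped at or after the target shows that every write
    before the target reached the archive; without one, writes after the
    last archived WAL could have been lost.
    """
    return any(ts >= target_time for ts in archived_wal_timestamps)

# Last WAL archived at minute 100; no activity since.
archive = [40, 70, 100]
assert can_restore_to(archive, 95)       # covered by the WAL at minute 100
assert not can_restore_to(archive, 105)  # quiet gap: no proof for minute 105
archive.append(110)                      # forced WAL switch on restore request
assert can_restore_to(archive, 105)      # now provably quiet
```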
| rembicilious wrote:
| Thank you, that makes a lot of sense. I should have
| intuited that, but I couldn't quite make the connection. I
| really appreciate the write up, it gave me a lot to think
| about as someone who has only ever managed my own DB.
| wg0 wrote:
| Great write-up. Do you plan to have a similar one for MySQL as well?
| pwmtr wrote:
| Do you mean writing something similar for MySQL or building a
| MySQL service at Ubicloud?
|
| For the first one, the answer would be no, because I don't
| have any expertise on MySQL. I only used it years ago on my
| hobby projects. I'm not the best person to write something
| like this for MySQL.
|
| For the second one, the answer would be maybe, but not any
| time soon. We have a list of services that we want to build
| at Ubicloud, and currently services like K8s have higher
| priority.
| wg0 wrote:
| I meant the second one. Thank you for your answer.
|
| Second unrelated question that anyone might want to add to,
| why is MySQL falling out of favour in recent years in
| comparison to Postgres?
| mxuribe wrote:
| > ...why is MySQL falling out of favour in recent years
| in comparison to Postgres
|
| I'm curious about this myself! Anyone know, or care to
| share?
| fcatalan wrote:
| Until I switched to Postgres 8 years ago I knew a lot
| about MySQL. I barely know anything about Postgres
| pixl97 wrote:
| Anecdotal evidence?
|
| I used to work with/host lots of small php apps that tied
| in with MySQL. PHP has dropped in popularity over the
| years.
|
| Add to this that hosting for Postgres has become common, so
| you're no longer tied to cheap MySQL-only hosting.
|
| Also, I think the Oracle name being tied to MySQL made it
| icky for a lot of people who think the big O is a devil
| straight from hell. That's what got me looking at Postgres
| way back when.
|
| There's probably a myriad of 'smaller' reasons just like
| this that reinforce the trend.
| pixl97 wrote:
| https://news.ycombinator.com/item?id=35906604
|
| From just over a year ago. And honestly the recent
| reasons look like all the reasons why I moved away from
| it in the past.
| tracker1 wrote:
| Every time I've used MySQL for anything, I come across
| _something_ that annoys me. You can use ANSI_QUOTES mode for
| tables, columns, etc., but when making a foreign key
| reference it has to be back-ticks.
|
| utf8 isn't really UTF-8 (it's capped at three bytes per
| character), which is a holdover from early in UTF-8's design,
| but even then, it should have been retired and re-aliased in
| the next major release (as a breaking change that wouldn't
| break much).
|
| Magic (stylized) single-quotes work as apostrophes in
| queries, meaning you can't use simple escape logic if
| you're writing a custom client - a whole new level of risk
| for security bugs. Yeah, you should use parameterized
| queries, but still.
|
| If your default collation is case-insensitive and you create
| an index on a binary field, it's going to be case-insensitive
| (which can/will cause annoying issues).
|
| There are so many more one-off annoyances people hit when
| working with MySQL. PostgreSQL, by contrast, has been far
| more consistent in the usage, development, and application of
| its query support. Custom enhancements and so many other
| features just work, and work better.
| matharmin wrote:
| 1. Do you use logical or physical replication for high
| availability? It seems like logical replication requires a
| lot of manual intervention, so I'd assume physical
| replication?
|
| 2. If that's the case, does that mean customers can't use
| logical replication for other use cases?
|
| I'm asking because logical replication seems to become more and
| more common as a solution to automatically replicate data to
| external systems (another Postgres instance, Kafka, data
| warehousing, or offline sync systems in my case), but many
| cloud providers appear to not support it. Many others do
| (including AWS, GCP), so I'd also be interested in how they
| handle high availability.
| pwmtr wrote:
| Yes, we use physical replication for HA.
|
| There are many reasons that cloud providers don't want to
| support logical replication:
|
| - It requires giving superuser access to the user. Many cloud
| providers don't want to give that level of privilege. Some
| fork PostgreSQL or write custom extensions to allow managing
| replication slots without requiring superuser access.
| However, doing that securely is very difficult: you suddenly
| open up a new attack vector for various privilege escalation
| vulnerabilities.
|
| - If a user creates a replication slot but does not consume
| the changes, it can quickly fill up the disk. I've dealt with
| many different failure modes of PostgreSQL, and I can
| confidently say that disk-full cases are among the most
| problematic/annoying ones to recover from.
|
| - It requires careful management of replication slots in the
| case of failover. There are extensions and 3rd-party tools
| that help with this, though.
|
| So, some cloud providers don't support logical replication
| and some support it weakly (i.e. don't cover all edge cases).
|
| Thankfully, some improvements are being made in PostgreSQL
| core that simplify failover of logical replication slots
| (check out this for more information
| https://www.postgresql.org/docs/17/logical-replication-
| failo...), but it is still too early.
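One concrete way the "slot fills the disk" failure shows up is in retained WAL, which you can compute from the `pg_lsn` values Postgres reports: an LSN string like `16/B374D848` is two hex numbers, the high and low 32 bits of a 64-bit byte position. The arithmetic below is standard; the sample values are made up.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a pg_lsn string like '16/B374D848' to a byte offset.

    The format is two hex fields: the high 32 bits and the low 32 bits.
    """
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def retained_bytes(current_lsn: str, restart_lsn: str) -> int:
    """WAL bytes the server must keep for a slot stuck at restart_lsn."""
    return parse_lsn(current_lsn) - parse_lsn(restart_lsn)

# Hypothetical values, as pg_current_wal_lsn() and
# pg_replication_slots.restart_lsn might report them:
assert parse_lsn("0/0") == 0
assert retained_bytes("1/0", "0/FFFFFFF8") == 8
```

A managed service would alert when this number grows past a threshold, since an unconsumed slot pins WAL indefinitely.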
| matharmin wrote:
| Yeah, I've dealt with some of those edge cases on AWS and
| GCP.
|
| Some examples:
|
| 1. I've seen a delay of hours without any messages being
| sent on the replication protocol, likely due to a large
| transaction in the WAL not committed when running out of
| disk space.
|
| 2. `PgError.58P02: could not create file
| \"pg_replslot/<slot_name>/state.tmp\": File exists`
|
| 3. `replication slot "..." is active for PID ...`, with
| some system process holding on to the replication slot.
|
| 4. `can no longer get changes from replication slot
| "<slot_name>". ... This slot has never previously reserved
| WAL, or it has been invalidated`.
|
| All of these require manual intervention on the server to
| resolve.
|
| And that's not even taking into account HA/failover, these
| are just issues with logical replication on a single node.
| It's still a great feature though, and often worth having
| to deal with these issues now and then.
| pwmtr wrote:
| Definitely agreed. It is a great feature and a building
| block for many complex features such as multiple primaries,
| zero-downtime migrations, etc. I'm also quite happy to see
| that with each PG version it becomes more stable and easier
| to use.
| RegnisGnaw wrote:
| There are also cases where logical replication is not 100%
| complete. One of our applications uses LOBs for some reason
| and can't do logical replication.
| markus_zhang wrote:
| Thanks for the article. Reading it makes me realize being a
| good SRE/DevOps is crazily difficult. Did you start as a DB
| admin?
| pwmtr wrote:
| I started as a software developer and I'm still a software
| developer, though I've always worked on either core database
| development or building managed database services - by
| coincidence at the beginning and by choice later on. I was
| also fortunate to have the opportunity to work alongside some
| of the leading experts in this domain and learn from them.
| aae42 wrote:
| surprised the article doesn't mention use of pgbouncer, do you
| not use a proxy in front of your pg instances?
| simmschi wrote:
| I enjoyed reading the article very much. Thanks for the write
| up!
| iamcreasy wrote:
| Thanks for the great write up.
|
| What is MMW on the flowchart?
| pwmtr wrote:
| It is "managed maintenance window", which basically means
| letting the user pick a time window, such as Saturday
| 8PM-9PM, and as a service provider you ensure that
| non-critical maintenance happens in that time window.
| crabbone wrote:
| There are "strange" properties your users have... who are those
| people?
|
| * You describe them as wanting to deploy the database in
| container: why would anyone do that (unless for throwaway
| testing or such)?
|
| * The certificate issue seems also very specific to the case
| when something needs to go over public Internet to some
| anonymous users... Most database deployments I've seen in my
| life fall into one of these categories: a database server for
| a Web server, where the two talk over a private local
| network, or a database backing some kind of management
| application, where, again, the communication between the
| application and the database happens without looping in the
| public Internet.
|
| * Unwilling to wait 10 minutes to deploy a database. Just how
| often do they need new databases? Btw. I'm not sure any of the
| public clouds have any ETAs on VM creation, but from my
| practice, Azure can easily take more than 10 minutes to bring
| up a VM. With some flavors it can take a very long time.
|
| The only profile I can think of is the kind of user who doesn't
| care about what they store in the database, in the sense that
| they are going to do some quick work and throw the data away
| (eg. testing, or maybe BI on a slice of data). But then why
| would these people need HA or backups, let alone version
| updates?
| zhengiszen wrote:
| In the Red Hat ecosystem there is an Ansible role for this:
| https://github.com/linux-system-roles/postgresql I don't know
| if it will help everyone, but it could be a good way to
| standardize and maintain an instance configuration.
| wg0 wrote:
| Is there something similar for MySQL that covers backup and
| restore too?
| superice wrote:
| I'm a little confused about the point-in-time restore
| functionality. I'm pretty sure there must be a way to avoid
| forcing those one-minute WAL boundaries. DigitalOcean's
| managed PostgreSQL, for instance, just lets you specify a
| timestamp and restore, and when looking into the PostgreSQL
| docs I remember seeing an option to specify a timestamp as
| well.
| pwmtr wrote:
| You can still restore to a given minute even without
| one-minute WAL boundaries (most of the time). Consider the
| case where you have very low write activity and it takes one
| hour to fill up one WAL file (16MB). That WAL file won't be
| archived until it is full, and if you lose your database for
| some reason, you won't have the last hour's data in your
| backups. That means you cannot restore to any minute in that
| one-hour window. Shorter WAL boundaries reduce your exposure.
| If you set archive_timeout to 1 minute, then you can restore
| to any minute in the past with the exception of the last
| minute (in practice, it is possible to lose the last few
| minutes because their WAL files might not be archived yet,
| but still the exposure would be much less).
|
| DigitalOcean uses 5 minutes as archive_timeout, which is also a
| reasonable value. In our experience, we saw that most of our
| customers prefer less exposure and we settled on 1 minute as
| archive_timeout value.
|
| Roughly speaking, archive_timeout defines your RPO (recovery
| point objective).
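For reference, the relevant settings look something like this (illustrative values; the `archive_command` is just a placeholder for a real archiver such as WAL-G or pgBackRest):

```
# postgresql.conf (excerpt)
archive_mode = on
archive_timeout = 60                           # force a WAL switch at least every 60s
archive_command = 'cp %p /mnt/wal_archive/%f'  # %p = path to WAL file, %f = file name
```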
| tehlike wrote:
| I feel like cloudnative-pg takes away the majority of the
| pain, using the well-known Kubernetes operator pattern.
| pwmtr wrote:
| In another thread on this page I wrote more about this, but
| in summary: we also like k8s-based managed Postgres
| solutions. They are quite useful if you are running Postgres
| for yourself. In managed Postgres services offered by
| hyperscalers or companies like Crunchy, though, it is not
| used very commonly.
| JohnMakin wrote:
| > it is not used very commonly.
|
| Is this a problem of multi-tenancy in k8s specifically or
| something else?
| thyrsus wrote:
| Thanks for the article; it's an important checklist.
|
| I would think you'd want at most one certificate authority per
| customer rather than per database. Why am I wrong?
| pwmtr wrote:
| You are not wrong. There are benefits to sharing a
| certificate authority per customer. We actually considered
| using one authority per organization. However, in our case,
| there is no organization entity in our data model. There are
| projects, which are similar but not exactly the same. So we
| were not entirely sure where we should put the boundary and
| decided to go with the more isolated approach.
|
| It is likely that we will add an organization-like entity to
| our data model in the future, and at that point sharing a
| certificate authority would make more sense.
| wolfhumble wrote:
| From the article:
|
| "The problem with the server certificate is that someone needs to
| sign it. Usually you would want that certificate to be signed by
| someone who is trusted globally like DigiCert. But that means you
| need to send an external request; and the provider at some point
| (usually in minutes but sometimes it can be hours) signs your
| certificate and returns it back to you. This time lag is
| unfortunately not acceptable for users. So most of the time you
| would sign the certificate yourself. This means you need to
| create a certificate authority (CA) and share it with the users
| so they can validate the certificate chain. Ideally, you would
| also create a different CA for each database."
|
| Couldn't you automate this with Let's Encrypt instead?
|
| Thanks!
___________________________________________________________________
(page generated 2024-07-23 23:11 UTC)