[HN Gopher] PostgreSQL, Memory and the Cloud
___________________________________________________________________
PostgreSQL, Memory and the Cloud
Author : bilalhusain
Score : 127 points
Date : 2021-07-10 15:55 UTC (7 hours ago)
(HTM) web link (sosna.de)
(TXT) w3m dump (sosna.de)
| Okkef wrote:
| Good article, but please check the spelling. 'Program' is with a
| single m at the end. Also, it's 'hundreds' instead of 'hunderts'.
| dkersten wrote:
| A metacomment about the page (rather than the content): the text
| in the white boxes is almost unreadable for me, the contrast is
| crazy low.
| meepmorp wrote:
| I had this problem, too; there's a button to toggle the night
| mode theme, which fixed it for me.
| aeyes wrote:
| Wow, the title of this post is very calm compared to what is
| actually happening.
|
| CloudSQL Postgres is running with a misconfigured OS OOM killer,
| crashes Postmaster randomly even if memory use is below instance
| spec. GCP closes this bug report as "Won't fix".
|
| This is a priority 1 issue. Seeing a wontfix for this has
| completely destroyed my trust in their judgement. The bug report
| states that they have been in contact with support since
| February.
|
| Unbelievable attitude towards fixing production-critical
| problems on their platform that affect all customers.
| breakingcups wrote:
| It exposes a very problematic communication pattern. The
| engineering team doesn't respond to the support team
| (accidentally or deliberately). The support team then just
| decides to close the issue instead of prodding the engineering
| team for an actual response (even if it's just "Yeah, we're not
| fixing it").
|
| Now the issue is just in limbo and the only one who feels the
| pain is the customer.
| perlgeek wrote:
| Another "fun" interaction pattern: User reports a bug (or a
| feature request), several others subscribe to and/or vote for
| this to be solved, and then a service rep closes the issue
| because there wasn't any recent activity.
|
| I've observed this with Atlassian, where I wanted to report
| a Jira bug but found that it had already been opened some
| years before; more than a hundred people had subscribed, yet
| the bug was still closed as "no activity, must not be
| relevant". I just found the exact same bug reported for Jira
| Cloud (I had observed it in the on-prem version):
| https://jira.atlassian.com/browse/JSWCLOUD-8865 and it was
| closed there for the very same reason.
|
| I didn't leave a comment because the original report
| described the issue perfectly, and adding a "me too" comment
| is just noise in the bug tracker. Guess I'll be noise in
| future :-(
| jaredsohn wrote:
| I don't think a 'me too' would be noise right now. Last
| activity was in 2019; having someone state the problem is
| still real in 2021 could impact if it gets fixed.
| slimsag wrote:
| This doesn't really surprise me. We use CloudSQL at my work
| (Sourcegraph) and have run into all sorts of weird issues
| actually putting it into production, e.g. segmentation faults
| when turning on Query Insights (which, lol, is supposed to give
| insight into why your DB might be behaving poorly.)
|
| For the most part it works okay and is fine, but there have
| definitely been a fair number of quirks..
|
| https://issuetracker.google.com/u/2/savedsearches/559773?pli...
| yjftsjthsd-h wrote:
| So are there problems with disabling overcommit? Or is it really
| that simple (at least for dedicated db hosts)?
| jpeter wrote:
| And if there are no problems, why is it not disabled by
| default?
| AzN1337c0d3r wrote:
| Because many programs "unnecessarily" allocate a lot of
| virtual memory and then never actually page it in, disabling
| overcommit will cause allocations to fail due to lack of
| memory even though most of that memory is never actually used.
| derefr wrote:
| If you mean "why isn't it disabled by default on Linux
| installs": most programs don't expect malloc(2) to ever
| return NULL. Those programs will just _assume_ the return
| value from malloc(2) is valid memory. In the best case,
| they'll immediately write to it and protection-fault. In the
| worst case, they'll hold onto this NULL pointer for a while,
| passing it around, until eventually something else somewhere
| distant in the program blows up some unknown amount of time
| later.
|
| Even those programs that are "malloc(2) error aware" often
| do something stupid and counterproductive in response, like
| attempting to allocate more memory for an exception object /
| stack trace / error string.
|
| Programs that do something useful in response to a NULL
| malloc(2) return result -- useful for the stability of the
| system as a whole, better than what the OOM killer gets you
| -- are rare, even on servers. Usually it's only stateful,
| long-running, DBMS-like daemons that 1. bother, and 2. have
| the engineering effort put into them to do the _right_ thing.
| mjw1007 wrote:
| I think your first paragraph is too pessimistic.
|
| You can get a decent idea of the behaviour of the programs
| you use when they run out of memory by running under
| 'ulimit -v' with a low limit.
|
| In my experience most (though far from all) of the programs
| I use managed a clean abort with an error message (as from
| a traditional xmalloc()).
| derefr wrote:
| Most of these were likely managed-language programs.
|
| Programs written for managed language runtimes will have a
| language-runtime-level abort on malloc(2) fail, which
| usually _is_ well written, in the sense that it will
| clean up _language-runtime-level_ resources, and emit a
| _language-runtime-level_ error message.
|
| But this language-runtime-level abort usually isn't
| exposed to the application in any hookable way, so from
| the developer's perspective, it's basically the same as
| being OOM killed. There's no option to clean up e.g. an
| individual transaction's resources in order to keep
| going. There's no hooks for libraries to use to e.g.
| properly send close messages on sockets (if the language
| runtime doesn't do that itself as part of managing socket
| lifetimes.) Etc.
|
| These managed runtimes (e.g. the JVM) may expose a
| catchable exception for OOM errors, but these are for
| _internal, language-runtime level_ OOM errors, triggered
| by the runtime itself under certain conditions, rather
| than in response to a syscall failure. When malloc(2)
| fails, it's basically "too late" from these runtimes'
| perspectives -- they no longer have the resources
| required to allow the user to run any more code.
| mjw1007 wrote:
| << Most of these were likely managed-language programs.
| >>
|
| Please don't guess. They weren't.
|
| It is true that a program that aborts as soon as malloc
| returns failure isn't doing any special cleanup or
| attempting to keep going.
|
| But that's not at all the same as << Those programs will
| just assume the return value from malloc(2) is valid
| memory. In the best case, they'll immediately write to it
| and protection-fault. >>, which is what I'm informing you
| is too pessimistic.
| perlgeek wrote:
| There are problems with disabling overcommit.
|
| Consider this scenario: a process runs a fork(), and shortly
| after it runs an exec(). Normally, the extra fork only uses a
| tiny amount of extra memory, because the memory is shared
| between the parent and the child, until one of them writes to
| it (copy-on-write).
|
| With overcommit disabled, the kernel must reserve enough space
| to copy the whole writable RAM of a process when it forks.
|
| So you have a 16GB machine, and an 8.1GB process cannot spawn
| any other program through the usual fork + exec routine
| (workarounds exist, like forking before allocating lots of
| memory and using IPC to instruct the low-memory fork to fork
| again and launch, but that's way more complicated than a simple
| fork + exec).
|
| So if you have a dedicated DB host and you know that your DB
| engine is very carefully engineered to work with disabled
| overcommit, you can disable it. On a general-purpose machine a
| disabled overcommit will waste lots of RAM that's sitting
| unused.
| Diggsey wrote:
| Some programs allocate a lot of virtual memory and then don't
| use it.
|
| Also, linux's forking model can result in a lot of virtual
| memory being allocated if a heavy-weight program tries to
| fork+exec a lot of smaller programs, since fork+exec is not
| atomic and briefly doubles the virtual memory usage of the
| original program.
|
| I think there are better ways to spawn programs that don't
| suffer from this problem now...
|
| If you have programs that are written to allocate virtual
| memory sparingly (like postgres) then that should be fine.
|
| However, there is a second way you can be caught out: even if
| you disable overcommit, your program can still be OOM killed
| for violating cgroup limits, since cgroup limits always behave
| as though overcommit is enabled (i.e. they let you allocate
| more than the limit and then OOM-kill you when you try to use
| the allocated memory). This means you'd have to be really
| careful running e.g. postgres inside a kubernetes pod.
|
| This behaviour really sucks IMO. I would like it if you could
| set overcommit on a per-program basis, so that eg. postgres can
| say "I know what I'm doing - when I allocate virtual memory I
| want you to _really_ allocate it (and tell me now if you
| can't...)". I think you can somewhat achieve this with
| memory locking, but that prevents it from being paged out at
| all...
| ithkuil wrote:
| Wasn't vfork designed to solve this issue? Does it not work
| in practice?
| Diggsey wrote:
| It is certainly one way to solve that specific issue,
| assuming the program was written to take advantage of it.
| As mentioned, there are several other reasons a program may
| use a lot of virtual memory though.
| mnahkies wrote:
| I recently managed to crash a GCP cloudsql postgres 12 host
| running an interactive query that was rather heavy (based on
| error logs OOM).
|
| It surprised me because I had never executed a query and caused
| the whole host to crash up until that point - now I'm wondering
| if this misconfiguration is the cause
| thyrsus wrote:
| Are there recommendations for learning about Linux kernel memory
| management? Two anecdata:
|
| * I had some compute servers that were up for 200 days. The
| customers noticed that they were half as fast as identical
| hardware just booted. Dropping the file system cache ("echo 3 |
| sudo dd of=/proc/sys/vm/drop_caches") brought the speed back up to
| the newly deployed servers. WTF? File system caches are supposed
| to be zero cost discards as soon as processes ask for RAM - but
| something else is going on. I suspect the kernel is behaving
| badly with overpopulated RAM management data (TLB entries?), but
| I don't know how to measure that.
|
| * If that is actually the problem, then a solution might be to
| shrink the page-table data by reserving a non-zero hugepage
| pool ("/proc/sys/vm/nr_hugepages"). I'd love to see
| recommendations on when to use that.
| SteveNuts wrote:
| Are you using any swap? If so, check the swappiness setting
| thyrsus wrote:
| No swap. These are large RAM (400G to 1000G) Kubernetes
| nodes.
| sargun wrote:
| This is likely due to a kernel bug that was caused by the
| way cgroup slab management is handled. Upgrade to 5.10 or
| later, and it should be fixed. I'd be interested to see if
| the problem continues.
| jeffbee wrote:
| Explicit hugepages on x86 are difficult to manage. Most people
| using off-the-shelf software can only take advantage of it by
| configuring, for example, innodb buffer pools to use them.
| However if your compute server really is a database, then
| you'll find the performance benefit is well worth the
| configuration.
|
| For other processes you'll need a hugepage-aware allocator such
| as tcmalloc (the new one, not the old one) and transparent
| hugepages enabled. Again, the benefits of this may be enormous,
| if page table management is expensive on your services.
|
| You will find a great many blogs on the web recommending
| disabling transparent hugepages. These people are all misled.
| Hugepages are a major benefit.
| citrin_ru wrote:
| I don't remember the details now, but I've seen a situation
| where a Java app was running slower on a box with more RAM
| (and probably a bigger heap size), compared to a box with the
| same CPU but half as much RAM. I suspected that the TLB cache
| was the reason, but didn't have time to test this.
| Tostino wrote:
| Could have also been compressed OOPs
| shdh wrote:
| GCP CloudSQL has a lot of issues. There was one with query
| insights being enabled causing segfaults on `LEFT JOIN`
| operations. It's since been patched, but really shitty.
___________________________________________________________________
(page generated 2021-07-10 23:00 UTC)