[HN Gopher] PostgreSQL, Memory and the Cloud
       ___________________________________________________________________
        
       PostgreSQL, Memory and the Cloud
        
       Author : bilalhusain
       Score  : 127 points
       Date   : 2021-07-10 15:55 UTC (7 hours ago)
        
 (HTM) web link (sosna.de)
 (TXT) w3m dump (sosna.de)
        
       | Okkef wrote:
       | Good article, but please check the spelling. 'Program' is with a
       | single m at the end. Also, it's 'hundreds' instead of 'hunderts'.
        
       | dkersten wrote:
       | A metacomment about the page (rather than the content): the text
       | in the white boxes is almost unreadable for me, the contrast is
       | crazy low.
        
         | meepmorp wrote:
         | I had this problem, too; there's a button to toggle the night
         | mode theme, which fixed it for me.
        
       | aeyes wrote:
       | Wow, the title of this post is very calm compared to what is
       | actually happening.
       | 
       | CloudSQL Postgres is running with a misconfigured OS OOM killer,
       | crashes Postmaster randomly even if memory use is below instance
       | spec. GCP closes this bug report as "Won't fix".
       | 
       | This is a priority 1 issue. Seeing a wontfix for this has
        | completely destroyed my trust in their judgement. The bug report
       | states that they have been in contact with support since
       | February.
       | 
        | Unbelievable attitude towards fixing production-critical
        | problems on their platform that affect all customers.
        
         | breakingcups wrote:
         | It exposes a very problematic communication pattern. The
         | engineering team doesn't respond to the support team
         | (accidentally or deliberately). The support team then just
         | decides to close the issue instead of prodding the engineering
         | team for an actual response (even if it's just "Yeah, we're not
         | fixing it").
         | 
         | Now the issue is just in limbo and the only one who feels the
         | pain is the customer.
        
           | perlgeek wrote:
           | Another "fun" interaction pattern: User reports a bug (or a
           | feature request), several others subscribe to and/or vote for
           | this to be solved, and then a service rep closes the issue
           | because there wasn't any recent activity.
           | 
            | I've observed this with Atlassian: I wanted to report a
            | Jira bug, but found that it had already been opened some
            | years before, more than a hundred people had subscribed,
            | and the bug was still closed as "no activity, must not be
            | relevant". I
           | just found the exact same bug reported for Jira Cloud (I had
           | observed it in the on-prem version):
           | https://jira.atlassian.com/browse/JSWCLOUD-8865 and it was
           | closed there for the very same reason.
           | 
           | I didn't leave a comment because the original report
           | described the issue perfectly, and adding a "me too" comment
           | is just noise in the bug tracker. Guess I'll be noise in
           | future :-(
        
             | jaredsohn wrote:
             | I don't think a 'me too' would be noise right now. Last
             | activity was in 2019; having someone state the problem is
             | still real in 2021 could impact if it gets fixed.
        
         | slimsag wrote:
         | This doesn't really surprise me. We use CloudSQL at my work
         | (Sourcegraph) and have run into all sorts of weird issues
         | actually putting it into production, e.g. segmentation faults
         | when turning on Query Insights (which, lol, is supposed to give
         | insight into why your DB might be behaving poorly.)
         | 
         | For the most part it works okay and is fine, but there have
         | definitely been a fair number of quirks..
         | 
         | https://issuetracker.google.com/u/2/savedsearches/559773?pli...
        
       | yjftsjthsd-h wrote:
       | So are there problems with disabling overcommit? Or is it really
       | that simple (at least for dedicated db hosts)?
        
         | jpeter wrote:
         | And if there are no problems, why is it not disabled by
         | default?
        
           | AzN1337c0d3r wrote:
            | Because many programs "unnecessarily" allocate a lot of
            | virtual memory and then never actually page it in. With
            | overcommit disabled, allocations would start failing for
            | lack of memory even though most of that memory isn't
            | actually being used.
        
           | derefr wrote:
           | If you mean "why isn't it disabled by default on Linux
           | installs": most programs don't expect malloc(2) to ever
           | return NULL. Those programs will just _assume_ the return
           | value from malloc(2) is valid memory. In the best case,
           | they'll immediately write to it and protection-fault. In the
           | worst case, they'll hold onto this NULL pointer for a while,
           | passing it around, until eventually something else somewhere
           | distant in the program blows up some unknown amount of time
           | later.
           | 
           | Even those programs that are "malloc(2) error aware", often
           | do something stupid and counterproductive in response, like
           | attempting to allocate more memory for an exception object /
           | stack trace / error string.
           | 
           | Programs that do something useful in response to a NULL
           | malloc(2) return result -- useful for the stability of the
           | system as a whole, better than what the OOM killer gets you
           | -- are rare, even on servers. Usually it's only stateful,
           | long-running, DBMS-like daemons that 1. bother, and 2. have
           | the engineering effort put into them to do the _right_ thing.
        
             | mjw1007 wrote:
             | I think your first paragraph is too pessimistic.
             | 
             | You can get a decent idea of the behaviour of the programs
             | you use when they run out of memory by running under
             | 'ulimit -v' with a low limit.
             | 
             | In my experience most (though far from all) of the programs
             | I use managed a clean abort with an error message (as from
             | a traditional xmalloc()).
        
               | derefr wrote:
               | Most of these were likely managed-language programs.
               | 
                | Programs written for managed language runtimes will have a
               | language-runtime-level abort on malloc(2) fail, which
               | usually _is_ well written, in the sense that it will
               | clean up _language-runtime-level_ resources, and emit a
               | _language-runtime-level_ error message.
               | 
               | But this language-runtime-level abort usually isn't
               | exposed to the application in any hookable way, so from
               | the developer's perspective, it's basically the same as
               | being OOM killed. There's no option to clean up e.g. an
               | individual transaction's resources in order to keep
               | going. There's no hooks for libraries to use to e.g.
               | properly send close messages on sockets (if the language
               | runtime doesn't do that itself as part of managing socket
               | lifetimes.) Etc.
               | 
               | These managed runtimes (e.g. the JVM) may expose a
               | catchable exception for OOM errors, but these are for
               | _internal, language-runtime level_ OOM errors, triggered
               | by the runtime itself under certain conditions, rather
               | than in response to a syscall failure. When malloc(2)
               | fails, it's basically "too late" from these runtimes'
               | perspectives -- they no longer have the resources
               | required to allow the user to run any more code.
        
               | mjw1007 wrote:
               | << Most of these were likely managed-language programs.
               | >>
               | 
               | Please don't guess. They weren't.
               | 
               | It is true that a program that aborts as soon as malloc
               | returns failure isn't doing any special cleanup or
               | attempting to keep going.
               | 
               | But that's not at all the same as << Those programs will
               | just assume the return value from malloc(2) is valid
               | memory. In the best case, they'll immediately write to it
               | and protection-fault. >>, which is what I'm informing you
               | is too pessimistic.
        
         | perlgeek wrote:
         | There are problems with disabling overcommit.
         | 
         | Consider this scenario: a process runs a fork(), and shortly
         | after it runs an exec(). Normally, the extra fork only uses a
         | tiny amount of extra memory, because the memory is shared
         | between the parent and the child, until one of them writes to
         | it (copy-on-write).
         | 
         | With overcommit disabled, the kernel must reserve enough space
         | to copy the whole writable RAM of a process when it forks.
         | 
         | So you have a 16GB machine, and an 8.1GB process cannot spawn
         | any other program through the usual fork + exec routine
         | (workarounds exist, like forking before allocating lots of
         | memory and using IPC to instruct the low-memory fork to fork
         | again and launch, but that's way more complicated than a simple
         | fork + exec).
         | 
         | So if you have a dedicated DB host and you know that your DB
         | engine is very carefully engineered to work with disabled
         | overcommit, you can disable it. On a general-purpose machine a
         | disabled overcommit will waste lots of RAM that's sitting
         | unused.
        
         | Diggsey wrote:
         | Some programs allocate a lot of virtual memory and then don't
         | use it.
         | 
         | Also, linux's forking model can result in a lot of virtual
         | memory being allocated if a heavy-weight program tries to
          | fork+exec a lot of smaller programs, since fork+exec is not
         | atomic and briefly doubles the virtual memory usage of the
         | original program.
         | 
         | I think there are better ways to spawn programs that don't
         | suffer from this problem now...
         | 
         | If you have programs that are written to allocate virtual
         | memory sparingly (like postgres) then that should be fine.
         | 
         | However, there is a second way you can be caught out: even if
         | you disable overcommit, your program can still be OOM killed
         | for violating cgroup limits, since cgroup limits always behave
         | as though over-commit is enabled (ie. they allow you to
         | allocate more than you are allowed, and then you get OOM killed
         | when you try to use the allocated memory). This means you'd
         | have to be really careful running eg. postgres inside a
         | kubernetes pod.
         | 
         | This behaviour really sucks IMO. I would like it if you could
         | set overcommit on a per-program basis, so that eg. postgres can
         | say "I know what I'm doing - when I allocate virtual memory I
          | want you to _really_ allocate it (and tell me now if you
          | can't...)". I think you can somewhat achieve this with memory
         | locking, but that prevents it from being paged out at all...
        
           | ithkuil wrote:
           | Wasn't vfork designed to solve this issue? Does it not work
           | in practice?
        
             | Diggsey wrote:
             | It is certainly one way to solve that specific issue,
             | assuming the program was written to take advantage of it.
             | As mentioned, there are several other reasons a program may
             | use a lot of virtual memory though.
        
       | mnahkies wrote:
        | I recently managed to crash a GCP CloudSQL Postgres 12 host by
        | running a rather heavy interactive query (an OOM, judging from
        | the error logs).
        | 
        | It surprised me because up until that point I had never crashed
        | a whole host just by executing a query - now I'm wondering if
        | this misconfiguration is the cause.
        
       | thyrsus wrote:
       | Are there recommendations for learning about Linux kernel memory
       | management? Two anecdata:
       | 
       | * I had some compute servers that were up for 200 days. The
       | customers noticed that they were half as fast as identical
       | hardware just booted. Dropping the file system cache ("echo 3 |
        | sudo dd of=/proc/sys/vm/drop_caches") brought the speed back up to
       | the newly deployed servers. WTF? File system caches are supposed
       | to be zero cost discards as soon as processes ask for RAM - but
       | something else is going on. I suspect the kernel is behaving
       | badly with overpopulated RAM management data (TLB entries?), but
       | I don't know how to measure that.
       | 
        | * If that is actually the problem, then a solution might be to
        | shrink the page-table overhead by reserving non-zero hugepages
        | ("cat /proc/sys/vm/nr_hugepages"). I'd love to see
        | recommendations on
       | when to use that.
        
         | SteveNuts wrote:
         | Are you using any swap? If so, check the swappiness setting
        
           | thyrsus wrote:
           | No swap. These are large RAM (400G to 1000G) Kubernetes
           | nodes.
        
             | sargun wrote:
             | This is likely due to a kernel bug that was caused by the
             | way cgroup slab management is handled. Upgrade to 5.10 or
             | later, and it should be fixed. I'd be interested to see if
             | the problem continues.
        
         | jeffbee wrote:
         | Explicit hugepages on x86 are difficult to manage. Most people
         | using off-the-shelf software can only take advantage of it by
         | configuring, for example, innodb buffer pools to use them.
         | However if your compute server really is a database, then
         | you'll find the performance benefit is well worth the
         | configuration.
         | 
         | For other processes you'll need a hugepage-aware allocator such
         | as tcmalloc (the new one, not the old one) and transparent
         | hugepages enabled. Again, the benefits of this may be enormous,
         | if page table management is expensive on your services.
         | 
          | You will find a great many blog posts on the web recommending
          | disabling transparent hugepages. These people are all misled.
         | Hugepages are a major benefit.
        
         | citrin_ru wrote:
          | I don't remember the details now, but I've seen a situation
          | where a Java app ran slower on a box with more RAM (and
          | probably a bigger heap size), compared to a box with the same
          | CPU but half the RAM. I suspected the TLB cache was the
          | reason,
         | but didn't have time to test this.
        
           | Tostino wrote:
            | Could also have been compressed oops: the JVM disables
            | them for heaps above ~32 GB, so every object pointer
            | doubles in size.
        
       | shdh wrote:
       | GCP CloudSQL has a lot of issues. There was one with query
       | insights being enabled causing segfaults on `LEFT JOIN`
        | operations. It's since been patched, but really shitty.
        
       ___________________________________________________________________
       (page generated 2021-07-10 23:00 UTC)