[HN Gopher] Playing with BOLT and Postgres
       ___________________________________________________________________
        
       Playing with BOLT and Postgres
        
       Author : aquastorm
       Score  : 161 points
       Date   : 2024-10-04 17:17 UTC (2 days ago)
        
 (HTM) web link (vondra.me)
 (TXT) w3m dump (vondra.me)
        
       | miohtama wrote:
       | 10% - 20% performance improvement for PostgreSQL "for free" is
       | amazing. It almost sounds too good to be true.
        
         | albntomat0 wrote:
         | There's a section of the article at the end about how Postgres
         | doesn't have LTO enabled by default. I'm assuming they're not
         | doing PGO/FDO either?
         | 
         | From the Bolt paper: "For the GCC and Clang compilers, our
         | evaluation shows that BOLT speeds up their binaries by up to
         | 20.4% on top of FDO and LTO, and up to 52.1% if the binaries
         | are built without FDO and LTO."
        
           | touisteur wrote:
            | I've always wondered how people actually get the profiles for
            | Profile-Guided-Optimization. Unit tests probably won't
            | exercise the high-performance paths. You'd need a set of
            | performance-stress tests. Is there a write-up on how everyone
            | does it in the wild?
        
             | mhh__ wrote:
             | You might be surprised how much speedup you can get from
             | (say) just running a test suite as PGO samples. If I had to
             | guess this is probably because compilers spend a lot of
             | time optimising cold paths which they otherwise would have
             | no information about
        
               | pgaddict wrote:
               | Yeah, getting the profile is obviously a very important
               | step. Because if it wasn't, why collect the profile at
               | all? We could just do "regular" LTO.
               | 
               | I'm not sure there's one correct way to collect the
               | profile, though. ISTM we could either (a) collect one
               | very "general" profile, to optimize for arbitrary
               | workload, or (b) profile a single isolated workload, and
               | optimize for it. In the blog I tried to do (b) first, and
               | then merged the various profiles to do (a). But it's far
               | from perfect, I think.
               | 
                | But even with the very "rough" profile from "make
                | installcheck" (which is the basic set of regression
                | tests), it still helps a lot. Which is nice. I agree it's
                | probably because even that basic profile is sufficient
                | for identifying the hot/cold paths.
        
               | foota wrote:
               | I think you have to be a bit careful here, since if the
               | profiles are too different from what you'll actually see
               | in production, you can end up regressing performance
               | instead of improving it. E.g., imagine you use one kind
               | of compression in test and another in production, and the
               | FDO decides that your production compression code doesn't
               | need optimization at all.
               | 
               | If you set up continuous profiling though (which you can
               | use to get flamegraphs for production) you can use that
               | same dataset for FDO.
        
               | pgaddict wrote:
               | Yeah, I was worried using the "wrong" profile might
               | result in regressions. But I haven't really seen that in
               | my tests, even when using profiles from quite different
               | workloads (like OLTP vs. analytics, different TPC-H
               | queries, etc.). So I guess most optimizations are fairly
               | generic, etc.
        
               | mhh__ wrote:
                | There are some projects (not sure if available to use in
                | anger) to generate PGO data using AI.
        
               | still_grokking wrote:
               | AI can predict how some code behaves when run?
               | 
               | So AI can predict whether some program halts?
               | 
               | Seriously?
        
               | MonkeyClub wrote:
               | Well spotted! :)
        
               | CyberDildonics wrote:
               | That's not how it works. BOLT is mainly about figuring
               | out the most likely instructions that will run after
               | branches and putting them close together in the binary.
               | Unlikely instructions like error and exception paths can
               | be put at the end of the binary. Putting the most used
               | instructions close together leverages prefetching and
               | cache so that unused instructions aren't what is being
               | prefetched and cached.
               | 
               | In short it is better memory access patterns for
               | instructions.
        
               | foota wrote:
               | I suspect you know this based on the detail in your
               | comment and just missed it, but parent is talking about
               | FDO, not BOLT.
        
               | mhh__ wrote:
               | Yes, but I'm not talking about BOLT
        
             | tdullien wrote:
             | Google and Meta do in-production profiling. I think that
             | tech is coming to everyone else slowly.
        
             | its_bbq wrote:
             | If I remember correctly, at Google we would run a sampling
             | profiler on some processes in prod to create these
             | profiles, with some mechanism for additional manual
             | overrides
        
           | pgaddict wrote:
            | With LTO, I think it's more complicated - it depends on
            | the packagers / distributions, and e.g. on Ubuntu we have
            | apparently been getting -flto for years.
        
       | fabian2k wrote:
       | My first instinct is that the effect is too large to be real. But
       | that should be something other people could reproduce and verify.
       | The second thought is that it might overfit the benchmark code
       | here, but they address it in the post. But any kind of double-
       | digit improvement to Postgres performance would be very
       | interesting.
        
         | pgaddict wrote:
         | (author here)
         | 
          | I agree the +40% effect feels a bit too good, but it only
          | applies to the simple OLTP queries on in-memory data, so the
          | inefficiencies may have an unexpectedly large impact. I agree
          | 30-40% would be a massive speedup, and I expected it to
          | disappear with a more diverse profile, but it did not ...
         | 
         | The TPC-H speedups (~5-10%) seem much more plausible,
         | considering the binary layout effects we sometimes observe
         | during benchmarking.
         | 
         | Anyway, I'd welcome other people trying to reproduce these
         | tests.
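
For anyone attempting a reproduction, the usual BOLT round trip looks roughly like the sketch below. The binary path, the pgbench workload, and the profile file names are placeholders, and exact flag spellings differ between LLVM versions, so check llvm-bolt's own help output:

```shell
# 0. Postgres must be linked with relocations preserved so BOLT can
#    rewrite the layout, e.g. LDFLAGS="-Wl,--emit-relocs"

# 1. Sample branch data (LBR) system-wide while a workload runs
perf record -e cycles:u -j any,u -a -o perf.data -- \
    pgbench -S -T 60 bench            # placeholder workload

# 2. Convert the perf samples into BOLT's profile format
perf2bolt -p perf.data -o pgbench.fdata ./postgres

# 3. Optionally merge profiles from several workloads (the (a)/(b)
#    approaches discussed upthread)
merge-fdata pgbench.fdata tpch.fdata > merged.fdata

# 4. Emit a binary with the optimized layout
llvm-bolt ./postgres -o postgres.bolt -data=merged.fdata \
    -reorder-blocks=ext-tsp -reorder-functions=hfsort \
    -split-functions -split-all-cold
```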
        
           | fabian2k wrote:
           | I looked and there is no mention of BOLT yet in the pgsql-
           | hackers mailing list, that might be the more appropriate
           | place to get more attention on this. Though there are
           | certainly a few PostgreSQL developers reading here as well.
        
             | pgaddict wrote:
             | True. At the moment I don't have anything very "actionable"
             | beyond "it's magically faster", so I wanted to investigate
              | this a bit more before posting to -hackers. For example,
              | after reading the paper I realized BOLT has a
              | "-report-bad-layout" option to report cases of bad layout,
              | so I wonder if we could use it to identify places where the
              | code should be reorganized.
             | 
             | OTOH my blog is syndicated to
             | https://planet.postgresql.org, so it's not particularly
             | hidden from the other devs.
        
       | Avamander wrote:
       | How easy would it be to have an entire distro (re)built with
       | BOLT? Say for example Gentoo?
        
         | fishgoesblub wrote:
          | It would be difficult, as every package/program would need a
          | step that generates the profile data by running the program the
          | way a user would.
        
           | metadat wrote:
           | Is it theoretically possible to perform the profile
           | generation+apply steps dynamically at runtime?
        
             | cryptonector wrote:
             | It would be hard to trust the result.
        
             | tjalfi wrote:
             | I wouldn't want to support it, but similar things have been
             | done before.
             | 
             | Alexia Massalin's Synthesis[0] (pdf) operating system did
             | JIT-like optimizations for system calls. Here's a LWN
             | article[1] with a summary. Anyone who's interested in
             | operating systems should read this thesis.
             | 
             | HP's Dynamo[2] runtime optimizer did JIT-like optimizations
             | on PA-RISC binaries; it was released in 2000. DynamoRIO[3]
             | is an open source descendant. Also, DEC had a similar tool
             | for the Alpha, but I've forgotten the name.
             | 
             | [0] https://citeseerx.ist.psu.edu/document?repid=rep1&type=
             | pdf&d...
             | 
             | [1] https://lwn.net/Articles/270081/
             | 
             | [2] https://dl.acm.org/doi/pdf/10.1145/349299.349303
             | 
             | [3] https://dynamorio.org/
        
               | hikarikuen wrote:
               | This is getting way outside the traditional compiler
               | model, but I believe the .NET JIT has been adding more
               | support for this in the last couple versions. One aspect
               | of it is covered at
               | https://devblogs.microsoft.com/dotnet/performance-
               | improvemen...
        
               | EvanAnderson wrote:
               | Nat Freidman developed "GNU Rope"[1] from 1998 which, if
               | memory serves, was inspired by a tool that did the same
               | thing in IRIX (cord, I believe).
               | 
               | [1] http://lwn.net/1998/1029/als/rope.html
        
             | pgaddict wrote:
             | I believe some JIT systems already do PGO / might be
             | extended to do what BOLT does.
        
         | genewitch wrote:
         | based on what "fishgoesblub" commented, building - read:
         | `emerge -e @world` - a gentoo system with profiling forced, and
         | then using it in that "degraded" state for a while ought to be
         | able to inform PGO, right? if there's a really good speedup
         | from putting hot code together, the hottest code after moderate
         | use should suffice to speed up things, and this could
         | continually be improved.
         | 
         | I'm also certain that if there were a way to anonymously share
         | profiling data upstream (or to the maintainers), that would
         | decrease the "degradation" from the first step, above. I am
         | 100% spitballing here. I'm a dedicated gentoo sysadmin, but i
         | know only a small bit about optimization of the sort being
          | discussed here. So it is possible that every user would have to
          | do the "unprofiled profiler" build first, which, if one cares,
          | is probably a net negative for the planet - unless the idea
          | pans out, in which case it's a huge positive: man hours,
          | electricity, wear/endurance on parts, etc.
        
       | albntomat0 wrote:
       | I posted this in a comment already, but the results here line up
       | with the original BOLT paper.
       | 
       | "For the GCC and Clang compilers, our evaluation shows that BOLT
       | speeds up their binaries by up to 20.4% on top of FDO and LTO,
       | and up to 52.1% if the binaries are built without FDO and LTO."
       | 
       | "Up to" though is always hard to evaluate.
        
         | paulddraper wrote:
         | Up to 10000% I think
         | 
         | https://xkcd.com/870/
        
         | genewitch wrote:
         | "Up to" is one of those "technically correct", it's probably
         | more genuine and ethical to give a range in the same
         | circumstances. If 95% of binaries get at least 18%. but the
         | remaining 5% get much less than that, and that's important,
         | then say that, maybe.
         | 
         | When i see stuff like this, i usually infer that 95% gets a
         | median of 0% speedup, and a couple of cases get 20.4% or
         | whatever. But giving a chart of speedups for each sort of thing
         | that it speeds up (or doesn't) doesn't make for good copy, i
         | think.
        
       | krick wrote:
       | Does it work with rustc binaries?
        
         | glandium wrote:
         | Already done. https://github.com/rust-lang/rust/pull/116352
        
       | jeffbee wrote:
       | On the subject of completely free speedups to databases, someone
       | sent a patch to MySQL many years ago that loads the text into
       | hugepages, to reduce iTLB misses. It has large speedups and no
       | negative consequences so of course it was ignored. The number of
       | well-known techniques that FOSS projects refuse to adopt is
       | large.
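
As a hedged aside, whether iTLB misses actually matter for a given workload can be measured before attempting such a change; the event aliases vary by CPU, and the process selection below is a placeholder:

```shell
# Count instruction-TLB events for a running postgres backend for 10s.
perf stat -e iTLB-loads,iTLB-load-misses \
    -p "$(pgrep -o postgres)" -- sleep 10
```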
        
         | paulryanrogers wrote:
         | MySQL has adopted a lot of performance work from FB, Google,
         | and others. Though I suspect they want their implementation for
         | license reasons.
        
         | pgaddict wrote:
          | I guess it's this MySQL bug:
          | https://bugs.mysql.com/bug.php?id=101369 which seems to have
          | stalled after a request to sign an OCA.
          | 
          | Anyway, I have no idea what it would take to do something like
          | that in Postgres, I'm not familiar with this stuff. But if
          | someone submits a patch with some measurements, I'm sure we'll
          | take a look.
        
       | mhio wrote:
        | Would the profiles and resulting binaries be highly CPU-specific?
        | I couldn't find any cross-hardware notes in the original paper.
        | 
        | The examples I'm thinking of are CPUs with vastly different
        | L1/L2/L3 cache profiles. Epyc vs Xeon. Maybe Zen 3 v Zen 5.
       | 
       | Just wondering if it looks great on a benchmark machine (and a
       | hyperscaler with a common hardware fleet) but might not look as
       | great when distributing common binaries to the world. Doing
       | profiling/optimising after release seems dicey.
        
         | pgaddict wrote:
         | Interesting question. I think most optimizations described in
          | the BOLT paper are fairly hardware agnostic - branch prediction
          | does not depend on the architecture, etc. But I'm not an expert
          | on microarchitectures.
        
           | jeffbee wrote:
           | A lot of the benefits of BOLT come from fixing the block
           | layout so that taken branches go backward and untaken
           | branches go forward. This is CPU neutral.
        
       | vivzkestrel wrote:
       | completely out of the loop here so asking, what is BOLT, how does
       | it actually improve postgres? what do the optimizations do under
       | the hood? and how do we know they haven't disabled something
       | mission critical?
        
         | paulddraper wrote:
         | Literally the second sentence
        
       | CalChris wrote:
       | For distros, you're probably talking about small programs with
       | shared libraries. I talked to the Bolt guy at an LLVM meeting and
       | Bolt is set up for big statically linked programs like what you'd
       | see at Facebook or Google (which has Propeller). It may have
       | changed but even though they were upstreaming Bolt to LLVM, they
       | didn't really have support for small programs with shared
       | libraries.
        
       ___________________________________________________________________
       (page generated 2024-10-06 23:02 UTC)