[HN Gopher] Playing with BOLT and Postgres
___________________________________________________________________
Playing with BOLT and Postgres
Author : aquastorm
Score : 161 points
Date : 2024-10-04 17:17 UTC (2 days ago)
(HTM) web link (vondra.me)
(TXT) w3m dump (vondra.me)
| miohtama wrote:
| 10% - 20% performance improvement for PostgreSQL "for free" is
| amazing. It almost sounds too good to be true.
| albntomat0 wrote:
| There's a section of the article at the end about how Postgres
| doesn't have LTO enabled by default. I'm assuming they're not
| doing PGO/FDO either?
|
| From the Bolt paper: "For the GCC and Clang compilers, our
| evaluation shows that BOLT speeds up their binaries by up to
| 20.4% on top of FDO and LTO, and up to 52.1% if the binaries
| are built without FDO and LTO."
| touisteur wrote:
| I've always wondered how people actually get the profiles for
| Profile-Guided Optimization. Unit tests probably won't
| exercise high-performance paths. You'd need a set of
| performance-stress tests. Is there a write-up on how everyone
| does it in the wild?
| mhh__ wrote:
| You might be surprised how much speedup you can get from
| (say) just running a test suite as PGO samples. If I had to
| guess this is probably because compilers spend a lot of
| time optimising cold paths which they otherwise would have
| no information about.
| pgaddict wrote:
| Yeah, getting the profile is obviously a very important
| step. Because if it wasn't, why collect the profile at
| all? We could just do "regular" LTO.
|
| I'm not sure there's one correct way to collect the
| profile, though. ISTM we could either (a) collect one
| very "general" profile, to optimize for arbitrary
| workload, or (b) profile a single isolated workload, and
| optimize for it. In the blog I tried to do (b) first, and
| then merged the various profiles to do (a). But it's far
| from perfect, I think.
|
| But even the very "rough" profile from "make
| installcheck" (which is the basic set of regression
| tests) still helps a lot. Which is nice. I agree it's
| probably because even that basic profile is sufficient
| for identifying the hot/cold paths.
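With Clang's instrumentation-based PGO, the merge from (b) per-workload profiles into an (a) "general" profile is a single llvm-profdata step. The filenames below are hypothetical, and the source file is a placeholder:

```shell
# Per-workload raw profiles, each collected by running the instrumented
# binary with LLVM_PROFILE_FILE pointed at a different output file.
llvm-profdata merge -output=general.profdata \
    oltp.profraw analytics.profraw installcheck.profraw

# Rebuild using the merged "general" profile.
clang -O2 -fprofile-use=general.profdata -c backend.c
```

Whether the merged profile is better than any single workload's profile is the open question discussed above.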
| foota wrote:
| I think you have to be a bit careful here, since if the
| profiles are too different from what you'll actually see
| in production, you can end up regressing performance
| instead of improving it. E.g., imagine you use one kind
| of compression in test and another in production, and the
| FDO decides that your production compression code doesn't
| need optimization at all.
|
| If you set up continuous profiling though (which you can
| use to get flamegraphs for production) you can use that
| same dataset for FDO.
| pgaddict wrote:
| Yeah, I was worried using the "wrong" profile might
| result in regressions. But I haven't really seen that in
| my tests, even when using profiles from quite different
| workloads (like OLTP vs. analytics, different TPC-H
| queries, etc.). So I guess most optimizations are fairly
| generic, etc.
| mhh__ wrote:
| There are some projects (not sure if available to use in
| anger) to generate PGO data using AI.
| still_grokking wrote:
| AI can predict how some code behaves when run?
|
| So AI can predict whether some program halts?
|
| Seriously?
| MonkeyClub wrote:
| Well spotted! :)
| CyberDildonics wrote:
| That's not how it works. BOLT is mainly about figuring
| out the most likely instructions that will run after
| branches and putting them close together in the binary.
| Unlikely instructions like error and exception paths can
| be put at the end of the binary. Putting the most used
| instructions close together leverages prefetching and
| cache so that unused instructions aren't what is being
| prefetched and cached.
|
| In short it is better memory access patterns for
| instructions.
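For reference, the workflow that produces that layout looks roughly like this (paths and the workload are placeholders; the binary must be built with relocations preserved, e.g. linked with -Wl,--emit-relocs):

```shell
# 1. Sample the running server under a representative workload
#    (LBR sampling via -j gives BOLT branch-level data).
perf record -e cycles:u -j any,u -o perf.data \
    -p "$(pgrep -o postgres)" -- sleep 60

# 2. Convert the perf samples into a BOLT profile.
perf2bolt -p perf.data -o postgres.fdata /usr/local/pgsql/bin/postgres

# 3. Rewrite the binary: reorder basic blocks so hot paths fall
#    through, and split cold code away from hot code.
llvm-bolt /usr/local/pgsql/bin/postgres -o postgres.bolt \
    -data=postgres.fdata -reorder-blocks=ext-tsp \
    -reorder-functions=hfsort -split-functions
```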
| foota wrote:
| I suspect you know this based on the detail in your
| comment and just missed it, but parent is talking about
| FDO, not BOLT.
| mhh__ wrote:
| Yes, but I'm not talking about BOLT
| tdullien wrote:
| Google and Meta do in-production profiling. I think that
| tech is coming to everyone else slowly.
| its_bbq wrote:
| If I remember correctly, at Google we would run a sampling
| profiler on some processes in prod to create these
| profiles, with some mechanism for additional manual
| overrides
| pgaddict wrote:
| With LTO, I think it's more complicated - it depends on
| the packagers / distributions, and e.g. on Ubuntu we've
| apparently had -flto enabled for years.
| fabian2k wrote:
| My first instinct is that the effect is too large to be real. But
| that should be something other people could reproduce and verify.
| The second thought is that it might overfit the benchmark code
| here, but they address it in the post. But any kind of double-
| digit improvement to Postgres performance would be very
| interesting.
| pgaddict wrote:
| (author here)
|
| I agree the +40% effect feels a bit too good, but it only
| applies to the simple OLTP queries on in-memory data, so the
| inefficiencies may have unexpectedly large impact. I agree
| 30-40% would be a massive speedup, and I expected it to
| disappear with a more diverse profile, but it did not ...
|
| The TPC-H speedups (~5-10%) seem much more plausible,
| considering the binary layout effects we sometimes observe
| during benchmarking.
|
| Anyway, I'd welcome other people trying to reproduce these
| tests.
| fabian2k wrote:
| I looked and there is no mention of BOLT yet on the pgsql-
| hackers mailing list; that might be the more appropriate
| place to get more attention on this. Though there are
| certainly a few PostgreSQL developers reading here as well.
| pgaddict wrote:
| True. At the moment I don't have anything very "actionable"
| beyond "it's magically faster", so I wanted to investigate
| this a bit more before posting to -hackers. For example,
| after reading the paper I realized BOLT has "-report-bad-
| layout" option to report cases of bad layout, so I wonder
| if we could identify places where to reorganize the code.
|
| OTOH my blog is syndicated to
| https://planet.postgresql.org, so it's not particularly
| hidden from the other devs.
| Avamander wrote:
| How easy would it be to have an entire distro (re)built with
| BOLT? Say for example Gentoo?
| fishgoesblub wrote:
| It would be difficult, as every package/program would need a
| step to generate the profile data by running the program the
| way a user would.
| metadat wrote:
| Is it theoretically possible to perform the profile
| generation+apply steps dynamically at runtime?
| cryptonector wrote:
| It would be hard to trust the result.
| tjalfi wrote:
| I wouldn't want to support it, but similar things have been
| done before.
|
| Alexia Massalin's Synthesis[0] (pdf) operating system did
| JIT-like optimizations for system calls. Here's a LWN
| article[1] with a summary. Anyone who's interested in
| operating systems should read this thesis.
|
| HP's Dynamo[2] runtime optimizer did JIT-like optimizations
| on PA-RISC binaries; it was released in 2000. DynamoRIO[3]
| is an open source descendant. Also, DEC had a similar tool
| for the Alpha, but I've forgotten the name.
|
| [0] https://citeseerx.ist.psu.edu/document?repid=rep1&type=
| pdf&d...
|
| [1] https://lwn.net/Articles/270081/
|
| [2] https://dl.acm.org/doi/pdf/10.1145/349299.349303
|
| [3] https://dynamorio.org/
| hikarikuen wrote:
| This is getting way outside the traditional compiler
| model, but I believe the .NET JIT has been adding more
| support for this in the last couple versions. One aspect
| of it is covered at
| https://devblogs.microsoft.com/dotnet/performance-
| improvemen...
| EvanAnderson wrote:
| Nat Friedman developed "GNU Rope"[1] in 1998 which, if
| memory serves, was inspired by a tool that did the same
| thing on IRIX (cord, I believe).
|
| [1] http://lwn.net/1998/1029/als/rope.html
| pgaddict wrote:
| I believe some JIT systems already do PGO / might be
| extended to do what BOLT does.
| genewitch wrote:
| Based on what "fishgoesblub" commented, building - read:
| `emerge -e @world` - a gentoo system with profiling forced, and
| then using it in that "degraded" state for a while ought to be
| able to inform PGO, right? If there's a really good speedup
| from putting hot code together, the hottest code after moderate
| use should suffice to speed things up, and this could
| continually be improved.
|
| I'm also certain that if there were a way to anonymously share
| profiling data upstream (or to the maintainers), that would
| decrease the "degradation" from the first step, above. I am
| 100% spitballing here. I'm a dedicated gentoo sysadmin, but I
| know only a small bit about optimization of the sort being
| discussed here. So it is possible that every user would have to
| do the "unprofiled profiler" build first which, if one cares,
| is probably a net negative for the planet - unless the idea
| pans out, in which case it's a huge positive: man hours,
| electricity, wear/endurance on parts, etc.
| albntomat0 wrote:
| I posted this in a comment already, but the results here line up
| with the original BOLT paper.
|
| "For the GCC and Clang compilers, our evaluation shows that BOLT
| speeds up their binaries by up to 20.4% on top of FDO and LTO,
| and up to 52.1% if the binaries are built without FDO and LTO."
|
| "Up to" though is always hard to evaluate.
| paulddraper wrote:
| Up to 10000% I think
|
| https://xkcd.com/870/
| genewitch wrote:
| "Up to" is one of those "technically correct" phrasings; it's
| probably more genuine and ethical to give a range in the same
| circumstances. If 95% of binaries get at least 18%, but the
| remaining 5% get much less than that, and that's important,
| then say that, maybe.
|
| When i see stuff like this, i usually infer that 95% gets a
| median of 0% speedup, and a couple of cases get 20.4% or
| whatever. But giving a chart of speedups for each sort of thing
| that it speeds up (or doesn't) doesn't make for good copy, i
| think.
| krick wrote:
| Does it work with rustc binaries?
| glandium wrote:
| Already done. https://github.com/rust-lang/rust/pull/116352
| jeffbee wrote:
| On the subject of completely free speedups to databases, someone
| sent a patch to MySQL many years ago that loads the .text
| segment into hugepages, to reduce iTLB misses. It gave large
| speedups and no negative consequences, so of course it was
| ignored. The number of
| well-known techniques that FOSS projects refuse to adopt is
| large.
| paulryanrogers wrote:
| MySQL has adopted a lot of performance work from FB, Google,
| and others. Though I suspect they want their implementation for
| license reasons.
| pgaddict wrote:
| I guess it's this MySQL bug:
| https://bugs.mysql.com/bug.php?id=101369 which seems to have
| stalled after a request to sign an OCA.
|
| Anyway, I have no idea what it would take to do something like
| that in Postgres, I'm not familiar with this stuff. But if
| someone submits a patch with some measurements, I'm sure we'll
| take a look.
| mhio wrote:
| Would the profiles and resulting binaries be highly CPU specific?
| I couldn't find any cross hardware notes in the original paper.
|
| The examples I'm thinking of are CPUs with vastly different
| L1/L2/L3 cache profiles. Epyc vs Xeon. Maybe Zen 3 vs Zen 5.
|
| Just wondering if it looks great on a benchmark machine (and a
| hyperscaler with a common hardware fleet) but might not look as
| great when distributing common binaries to the world. Doing
| profiling/optimising after release seems dicey.
| pgaddict wrote:
| Interesting question. I think most optimizations described in
| the BOLT paper are fairly hardware agnostic - branch prediction
| does not depend on the architecture, etc. But I'm not an expert on
| microarchitectures.
| jeffbee wrote:
| A lot of the benefits of BOLT come from fixing the block
| layout so that taken branches go backward and untaken
| branches go forward. This is CPU neutral.
| vivzkestrel wrote:
| Completely out of the loop here so asking: what is BOLT, and how
| does it actually improve Postgres? What do the optimizations do
| under the hood? And how do we know they haven't disabled
| something mission critical?
| paulddraper wrote:
| Literally the second sentence
| CalChris wrote:
| For distros, you're probably talking about small programs with
| shared libraries. I talked to the Bolt guy at an LLVM meeting and
| Bolt is set up for big statically linked programs like what you'd
| see at Facebook or Google (which has Propeller). It may have
| changed but even though they were upstreaming Bolt to LLVM, they
| didn't really have support for small programs with shared
| libraries.
___________________________________________________________________
(page generated 2024-10-06 23:02 UTC)