[HN Gopher] Ten years of improvements in PostgreSQL's optimizer
___________________________________________________________________
Ten years of improvements in PostgreSQL's optimizer
Author : samaysharma
Score : 254 points
Date : 2024-04-17 03:27 UTC (19 hours ago)
(HTM) web link (rmarcus.info)
(TXT) w3m dump (rmarcus.info)
| nsilvestri wrote:
| Proebsting's Law comes to mind:
| https://proebsting.cs.arizona.edu/law.html
| cpeterso wrote:
| In this case, the researcher built each PostgreSQL version
| using the same GCC version (13.2) and tested on the same OS.
| waterproof wrote:
| That seems like a very weak "law". Is it meant as a joke? The
| supporting evidence is just numbers pulled apparently from
| nowhere ("let's assume...") and the conclusion is wildly off
| base. They seem to imply that if optimization improves the
| performance of much of the world's software by 4% each year,
| then it's a waste of time.
|
| Moore's law is the only comparison given. I wonder what the
| comparative expense is to develop faster and faster hardware,
| vs. to keep improving compilers. Depending on how the ROIs
| compare (in dollars-per-percent gain, let's say), maybe this
| "law" is worth some weight.
|
| OTOH, the Postgres article in question seems to show
| diminishing returns from optimization, which belies the whole
| premise of the "law", which assumes the gain is consistent from
| year to year. And it might prove Proebsting's implied point
| about optimization being a bad investment over the long term!
| devjab wrote:
| Why? That law seems to talk about how software performance
| improvements aren't relevant, while this article talks about
| how the improvements to Postgres have been significant.
|
| Is it because you view the 15% as a low number? Because in
| this context it really, really isn't. It's certainly smaller
| than the 60% from your linked law, especially if you do the
| whole 15/10 thing, but you really shouldn't compare Postgres's
| performance to hardware increases because they aren't the same.
| You would need absolutely massive amounts of hardware
| improvement to match even a 1% gain on what is being
| measured here.
|
| I don't think the law is as silly as other posters here do,
| but it's talking about programming language compilers. I
| wouldn't compare something as unimportant as that to what is
| arguably one of the most important things in computer
| science... considering how much data storage and consumption
| means to our world.
| fear91 wrote:
| The nice thing about compiler optimizations is that you can
| improve the performance of existing CPUs without physically
| touching them. Year by year. You squeeze more out of the
| machine someone designed. It adds up.
|
| Imagine what environmental impact you would have if you
| optimized Python's performance by 1%? How much CO2 are you
| removing from the atmosphere? It's likely to overshadow the
| environmental footprint of you, your family and all your
| friends combined. Hell, maybe it's the entire city you live in.
| All because someone spent time implementing a few bitwise
| tricks.
| hnben wrote:
| > Imagine what environmental impact you would have if you
| optimized Python's performance by 1%?
|
| I imagine this would also increase the complexity of the
| python interpreter by more than 1%, which in turn would
| increase the number of bugs in the interpreter by more than
| 1%, which would burn both manpower and CPU cycles on a global
| scale.
|
| (I assume that optimizations that reduce complexity are
| already exhausted, e.g. stuff like "removing redundant
| function calls".)
| fear91 wrote:
| This is an assumption that a reasonable person naively
| comes up with.
|
| Then if you actually go ahead and check, it turns out it's
| not true! It's quite a shocking revelation.
|
| When you dig into the popular compilers/runtimes (with the
| exception of things like LLVM), many of them still have
| low-hanging fruit of the form:
|
| a = b + c - b
|
| Yes, the above is still not fully optimized in the official
| implementations of some popular programming languages.
|
| Also an optimization of "removing redundant function calls"
| isn't a binary on/off switch. You can do it better or
| worse. Sometimes you can remove them, sometimes not. If you
| improve your analysis, you can do more of that and improve
| performance. Same for DSE, CSE, etc...
| Sesse__ wrote:
| In many languages, you can't just optimize a + b - b
| willy-nilly, as there could be side effects, and non-obvious
| interactions abound. For instance, in JavaScript, where
| everything is a 64-bit double, a + b - b is definitely
| not the same as a, given large enough or small enough a
| or b. In LLVM for floats as well, certainly.
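|
| The same effect is easy to see with PostgreSQL's own float8
| (a minimal illustration; any IEEE 754 double behaves this way):
|
|     -- b = 1.0 swallows a = 1e-16, so a + b - b is 0, not a
|     SELECT (1e-16::float8 + 1e0::float8) - 1e0::float8;
|     -- returns 0 rather than 1e-16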
| fear91 wrote:
| It's an example of how trivial some of this low-hanging
| fruit is; the above is a concrete case that I have
| personally implemented (arithmetic of in-register
| 64/32-bit integers). You can get into semantics and
| restrictions, but I think the point I'm raising is clear.
| uhoh-itsmaciek wrote:
| Neat, but Postgres version numbering changed with v10. 9.6, 9.5,
| 9.4, 9.3, 9.2, 9.1, 9.0, 8.4, 8.3, 8.2, 8.1, and 8.0 are
| effectively all distinct major versions. It'd be interesting to
| see how perf changed with those.
| guiriduro wrote:
| But at least they maintained the courtesy of major version
| filesystem compatibility allowing faster in-place upgrades
| during an extended period (from v9.0 - 9.6), by simply changing
| the binaries. Possibly that held them back, but yearly updates
| that require more downtime/reindexing aren't so much fun, and
| that's possibly the reason why many sites skip upgrades until
| previous versions drop out of support (esp. AWS RDS users).
| While the alternative of a logical replication upgrade (from
| v10 onward) has availability benefits, it's a major project
| with unavoidable costs and significant risks unless schemas are
| relatively simple.
| ioltas wrote:
| The bar to merge a patch that would break on-disk format is
| so high that it's really impossible to reach it ;)
|
| More seriously, we care a lot about the format of the 8k
| pages to ease upgrade paths as much as possible.
| RMarcus wrote:
| I totally agree -- I picked the latest version for each major
| version using the semver interpretation of the version numbers,
| which is not how PostgreSQL has traditionally done major
| version numbers (i.e., PG 8.2 and 8.1 are different major
| versions, but I interpreted the version numbers as if they were
| minor versions).
|
| The main reason I did this was to reduce the number of versions
| I had to test, but I agree a more complete analysis would test
| each (true) major version.
| uhoh-itsmaciek wrote:
| >Of course, not all of these improvements are attributable to the
| query optimizer.
|
| It would be interesting to see plan changes, if any, across
| versions.
| throwaway984393 wrote:
| It would arguably be more accurate to compare the actual systems
| running Postgres and their precompiled binaries. Nearly
| everything else in a given system running a given version of an
| application may have an impact on its performance in subtle ways.
| Its development occurred in conjunction with these other system
| components, thus it was optimized for them. In addition,
| distributions tend to ship their binaries with patches not found
| upstream which can also affect performance. Everything from the
| kernel and scheduler to glibc and more may affect performance.
| Newer isn't always faster. Even the containerization you now
| use might add a subtle penalty that an older non-containerized
| setup may not experience.
|
| Basically, it's theoretically possible that some old ass system
| running version 8 runs it faster than your current system runs
| version 8 (excluding hardware).
|
| Add to that all the other usual caveats about benchmarks, etc.
| (like the fact that IO wasn't tested here but is pretty dang
| relevant to real-world performance).
| edweis wrote:
| What does query optimisation look like? Does it optimize at the
| SQL or algorithm level?
| BenoitP wrote:
| It describes all the ways the SQL could be executed, then
| chooses the cheapest plan. For example: if you're looking for
| the user row with user_id xx, do you read the full table and
| then filter it (you have to look at all the rows)? Or do you
| use the dedicated data structure to do so (an index lets you
| do it in time logarithmic in the number of rows)?
|
| A lot more can be done: choosing the join order, choosing the
| join strategies, pushing the filter predicates down to the
| source, etc. That's the vast topic of SQL optimization.
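|
| A rough sketch of how that choice shows up in EXPLAIN (table,
| index and cost numbers are made up for illustration):
|
|     EXPLAIN SELECT * FROM users WHERE user_id = 42;
|     -- without a usable index, roughly:
|     --   Seq Scan on users  (cost=0.00..18584.00 rows=1 ..)
|     --     Filter: (user_id = 42)
|     -- with a primary-key index the planner can instead pick:
|     --   Index Scan using users_pkey on users (cost=0.29..8.30 ..)
|     --     Index Cond: (user_id = 42)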
| abhishekjha wrote:
| Is there a more general reading for software engineers?
|
| Seems like jumping right into the code can be a bit
| overwhelming if you have no background on the topic.
| orlp wrote:
| Not exactly reading but I would recommend the database
| engineering courses by Andy Pavlo that are freely available
| on YouTube.
| jmgimeno wrote:
| PostgreSQL Query Optimization: The Ultimate Guide to
| Building Efficient Queries, by Henrietta Dombrovskaya,
| Boris Novikov, and Anna Bailliekova, published by Apress
| in 2021.
| rotifer wrote:
| You may discover this if you go to the Apress site, but
| there is a second edition [1] out.
|
| [1] https://hdombrovskaya.wordpress.com/2024/01/11/the-
| optimizat...
| sbuttgereit wrote:
| Always worth a mention:
|
| https://use-the-index-luke.com/
|
| Markus Winand (the website's author) has also written a
| book on the subject targeted at software developers which
| is decent for non-DBA level knowledge.
| whiterknight wrote:
| https://www.sqlite.org/queryplanner.html
| magicalhippo wrote:
| For the databases I've used, which excludes PostgreSQL, most
| optimization happens on algorithmic level, that is, selecting
| the best algorithms to use for a given query and in what order
| to execute them.
|
| From what I gather this is mostly because various different SQL
| queries might get transformed into the same "instructions", or
| execution plan, but also because SQL semantics doesn't leave
| much room for language-level optimizations.
|
| As a sibling comment noted, an important decision is whether you
| can replace a full table scan with an index lookup or index scan
| instead.
|
| For example, if you need to do a full table scan, and do
| significant computation per row to determine if a row should
| be included in the result set, the optimizer might replace
| the full table scan with a parallel table scan and then merge
| the results from each parallel task.
|
| When writing performant code for a compiler, you'll want to
| know how your compiler's optimizer transforms the source code
| into machine instructions, so you prefer writing code the
| optimizer handles well and avoid writing code for which it
| emits slower machine instructions. After all, the optimizer
| is programmed to detect certain patterns and transform them.
|
| Same thing with a query optimizer and its execution plan.
| You'll have to learn which patterns the query optimizer in the
| database you're using can handle and generate efficient
| execution plans for.
| Sesse__ wrote:
| Postgres is a fairly standard System R implementation. You
| convert your SQL into a join tree with some operations at the
| end, and then try various access paths and their
| combinations.
| mdavidn wrote:
| At a very high level, the query planner's goal is to minimize
| the cost of reading data from disk. It gathers pre-computed
| column statistics, like counts of rows and distinct values, to
| estimate the number of rows a query is likely to match. It uses
| this information to order joins and choose indexes, among other
| things. Joins can be accomplished with several different
| algorithms, like hashing or looping or merging. The cheapest
| option depends on factors like whether one side fits in working
| memory or whether both sides are already sorted, e.g. thanks to
| index scans.
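|
| Those statistics are easy to inspect, and you can see how they
| feed the row estimates (illustrative table and column names):
|
|     -- per-column statistics gathered by ANALYZE
|     SELECT null_frac, n_distinct, most_common_vals
|       FROM pg_stats
|      WHERE tablename = 'orders' AND attname = 'status';
|
|     -- the row estimate the planner derives from them
|     EXPLAIN SELECT * FROM orders WHERE status = 'shipped';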
| paulddraper wrote:
| The query optimizer chooses which algorithms provide the
| results requested by the SQL.
| morepork wrote:
| Site seems to be down, you can try this instead:
| https://web.archive.org/web/20240417050840/https://rmarcus.i...
| BitPirate wrote:
| The author mentions PostgreSQL's JIT compiler. To this day,
| I've only seen it degrade the performance of queries. Disabling
| it is on my install checklist.
| LunaSea wrote:
| The JIT compiler is great for analytical queries.
|
| You can also configure thresholds for JIT activation in
| PostgreSQL if you want to raise the bar at which the JIT
| kicks in.
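|
| The knobs involved, with their current defaults (the values are
| planner cost units, not milliseconds):
|
|     SET jit = on;                          -- off disables it
|     SET jit_above_cost = 100000;           -- when JIT kicks in
|     SET jit_inline_above_cost = 500000;    -- inline functions
|     SET jit_optimize_above_cost = 500000;  -- extra LLVM passes
|     -- the same settings can go in postgresql.conf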
| refset wrote:
| For analytical queries note that you really have to learn how
| to express the queries efficiently with Postgres -
| unfortunately the optimizer is still missing lots of tricks
| found in more sophisticated engines [0][1][2]. JIT
| compilation alone can't get close to making up for those
| gaps.
|
| [0] https://pganalyze.com/blog/5mins-postgres-optimize-
| subquerie...
|
| [1] https://duckdb.org/2023/05/26/correlated-subqueries-in-
| sql.h...
|
| [2] "Unnesting Arbitrary Queries"
| https://cs.emis.de/LNI/Proceedings/Proceedings241/383.pdf
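|
| A concrete (made-up) example of the kind of rewrite involved: a
| correlated subquery that Postgres may run once per outer row,
| versus an equivalent aggregate-then-join form it can plan as an
| ordinary join:
|
|     -- correlated scalar subquery: often a per-row SubPlan
|     SELECT c.id,
|            (SELECT max(o.total) FROM orders o
|              WHERE o.customer_id = c.id)
|       FROM customers c;
|
|     -- equivalent aggregate-then-join, usually far cheaper
|     SELECT c.id, m.max_total
|       FROM customers c
|       LEFT JOIN (SELECT customer_id, max(total) AS max_total
|                    FROM orders GROUP BY customer_id) m
|         ON m.customer_id = c.id;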
| Sesse__ wrote:
| Postgres isn't really hot on implementing the latest stuff
| from academia (most of what they do seems to be fairly
| home-grown, not least because some of it was never really
| written about before they started doing it). Arbitrary
| unnesting is certainly one of the optimizations they don't
| do; fancy sketches and other estimation techniques
| generally likewise. But what they're really good at is
| polish. A thousand small tweaks and optimizations to iron
| out issues for 0.1% of queries. It tends to have fairly few
| sharp edges, fewer than any other RDBMS I've used.
| dur-randir wrote:
| But PG is much more commonly used for OLTP, not OLAP. I'm
| baffled to this day that they enabled it by default.
| LunaSea wrote:
| OLAP is a very common use case for PostgreSQL as well.
| baba92 wrote:
| A customer had the worst performance issues after switching to
| postgres. Weirdly it only happened in the docker and test
| server setup and not on the developer's machine (he ran
| postgres via Homebrew).
|
| Turns out Homebrew installed postgres without JIT support; one
| query on the developer's machine ran in 200ms, while with JIT
| the same query took 4-5 seconds. Took some time to figure out,
| since I'm not a heavy postgres user, but since then I've always
| disabled JIT and never looked back.
| masklinn wrote:
| Yeah pg's JIT is a pretty good demonstration that LLVM is not
| great at JITs, worsened by postgres not having a good
| persistent shared query cache.
|
| It would probably be less detrimental if it could perform
| asynchronous compilation for future queries (which really is
| closer to how normal JITs actually behave, at least for
| optimising backends).
| MaxBarraclough wrote:
| > LLVM is not great at JITs
|
| That seems like a generalisation. It's true LLVM will never
| be a lightweight toolkit, but if you want to generate highly
| optimised code it seems like a solid choice, assuming you're
| doing relatively 'vanilla' code-gen work.
| SigmundA wrote:
| PG just needs to do like the other databases and cache query
| plans, also add hints please.
| masklinn wrote:
| > PG just needs to do like the other databases and cache
| query plans
|
| The inline blocking compilation step will still fuck your
| query.
|
| > also add hints please
|
| I'd much rather they allowed bypassing the planner entirely
| and had an interface to feed plans directly to the
| executor.
| SigmundA wrote:
| >The inline blocking compilation step will still fuck
| your query
|
| Only on the first execution; the long plan/JIT time is
| usually only an issue when running a fast query over and
| over again.
|
| However, if plans were cached then you could also plan
| without JITing, then JIT in the background and update the
| cached plan for next time, which would be even smarter.
|
| >I'd much rather they allowed bypassing the planner
| entirely and had an interface to feed plans directly to
| the executor
|
| Other databases have plan locking where you can load an
| existing plan from somewhere else in serialized form
| (XML) and force it to be used. Most of the time I prefer
| hints; just a few hints in the query source solve almost
| all issues with the planner.
| RMarcus wrote:
| PostgreSQL has coarse-grain query hints (like
| `enable_hashjoin`), and the excellent `pg_hint_plan`
| extension allows you to specify a complete or partial
| execution plan: https://github.com/ossc-db/pg_hint_plan
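|
| For illustration (toy table names; the hint comment goes at the
| head of the query, per pg_hint_plan's conventions):
|
|     -- coarse-grained: discourage hash joins for the session
|     SET enable_hashjoin = off;
|
|     -- fine-grained: pin the join order and methods with a hint
|     /*+ Leading((a b)) NestLoop(a b) IndexScan(b) */
|     EXPLAIN SELECT * FROM a JOIN b ON b.a_id = a.id
|      WHERE a.x = 1;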
| nextaccountic wrote:
| Can't postgres JIT a query once and run the compiled query
| multiple times?
| DoubleFree wrote:
| The postgres query optimizer will try to minimize the number of
| pages read from disk (and the number of intermediate pages
| written to disk). Benchmarking the query optimizer by making the
| shared buffers large enough to hold all the data therefore seems
| wrong, as you're then measuring the speed of the query optimizer
| and the join processor, instead of the quality of the generated
| query plans. It would not surprise me if the generated plans for
| these versions are actually all the same and this is only
| measuring execution speed.
| Sesse__ wrote:
| No, it optimizes cost, which includes pages read from disk
| _and_ things like CPU use. Cost is an arbitrary unit meant to
| correlate with time spent, not with disk loads, so it's
| completely reasonable to compare plans with everything loaded
| in RAM. (Cost is by convention scaled such that one page read
| from disk is 1.0, but that's a different thing from "the
| optimizer will try to minimize the number of pages read from
| disk". It could just as well have been scaled so that 1.0 was
| 1 ms on some arbitrary machine.)
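|
| The convention is visible in the default planner cost
| parameters (recent versions; everything is relative to one
| sequential page read):
|
|     seq_page_cost = 1.0           -- sequential page fetch
|     random_page_cost = 4.0        -- non-sequential page fetch
|     cpu_tuple_cost = 0.01         -- processing one row
|     cpu_index_tuple_cost = 0.005  -- one index entry
|     cpu_operator_cost = 0.0025    -- one operator/function call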
| RMarcus wrote:
| It is certainly possible that the plans are similar, and that
| improvements to the execution engine are being measured. The
| join order benchmark was designed to test optimizer quality. It
| is worth noting that _in addition_ to trying to measure the
| number of pages read from disk, the PG optimizer _also_ tries
| to reduce the number of tuples examined by the CPU, the number
| of predicate evaluations, etc. All these numbers are rolled up
| into a "cost," which is the function that the optimizer
| minimizes.
|
| It is also true that measuring cold cache and warm cache
| performance can produce different results, and this experiment
| is certainly in the warm cache scenario. But, the cold cache
| scenario suffers from the problem you mention as well: an
| improvement to PG's B-tree that saves a few IOs will dominate
| any kind of CPU-based improvement (at least at the data size of
| the join order benchmark).
|
| FWIW, the plan for the query with the P90 latency changes from
| a plan that uses loop and merge join in PG8.4 to a plan that
| uses hash join in PG16 (where it is no longer the P90 query),
| which is at least some evidence of optimizer improvements.
| petergeoghegan wrote:
| > It is certainly possible that the plans are similar, and
| that improvements to the execution engine are being measured.
| The join order benchmark was designed to test optimizer
| quality.
|
| I don't think that it's possible to test optimizer quality in
| isolation -- not really, not if it's to be in any way
| practical. Many (most?) individual improvements to the
| optimizer are hopelessly intertwined with executor
| enhancements. This is usually fairly obvious, because the
| same commit changes both the optimizer and the executor
| together. Sometimes it's much less obvious, though, because
| it's more a case of the optimizer and executor coevolving.
|
| It's probably still the case that a lot of the improvements
| seen here are pure executor enhancements, but I can't say
| that I'm very confident about that. (As the main person
| behind those B-Tree IO savings you might expect me to have
| more confidence than that, but I find it horribly difficult
| to tease these issues apart while keeping the discussion high
| level/relevant.)
| mehulashah wrote:
| I'm a bit confused by the analysis here. How do you confirm a
| downward trend in the data that is not visible on the graph? It
| looks like the median drops a little in the first few versions,
| but then rises in the last few. The R2 is really low, so I'm not
| sold on the correlation here. Basically, it looks like tail
| latency has improved and YMMV for everything else.
| RMarcus wrote:
| I'm the author of the blog post.
|
| "Tail latency has improved by YMMV for everything else" => yes,
| I think that's a valid (but conservative) read. Of course, in
| many (most?) applications, tail latency is very important. Tail
| latency also tends to be the thing that optimizer engineers
| target (i.e., reduce the runtime of the longest running query).
| darksaints wrote:
| As someone who has worked with Postgres for 15 years and also has
| spent most of my career modeling and solving mathematical
| optimization problems, I have so much to say on this topic, but
| I'll leave it at these three points:
|
| * Every optimization problem needs data on costs, and the more
| and better the data is, the better. Postgres has made some
| improvements here, especially with cross-column statistics
| (see the sketch after these three points), but there are
| still massive improvements left on the table. The most
| glaring omission is data on syscall latencies. Reading a page
| from disk has dramatically different latencies from system to
| system, and postgres still relies on configuration values for
| these costs, when it could very easily be measuring them. Another
| omission is foreign key statistics. Joins along foreign keys
| should never have a bad plan, but they still occasionally do.
|
| * Deferred or alternative scenario planning should be adopted,
| especially for large and expensive queries. As it is today, your
| plan is finalized before it is executed, even though earlier
| stages of execution could provide information (like rowcounts or
| cardinality estimates) that could dramatically improve later
| stage plans.
|
| * Machine learning most definitely could be an area where
| improvements could be made, but I've been unimpressed with the
| efforts that I've seen. Don't use machine learning for planning,
| use machine learning for cost discovery and estimation. Build
| better cost models, and then let the optimization engine work
| with that data.
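|
| (For reference, the cross-column statistics mentioned in the
| first point look like this today. It's a sketch with made-up
| names; note they only work within a single table, so join
| estimates across a foreign key still rely on per-table
| statistics.)
|
|     CREATE STATISTICS orders_stats (dependencies, ndistinct)
|       ON customer_id, region FROM orders;
|     ANALYZE orders;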
| RMarcus wrote:
| I'm curious to hear a bit more of your opinion. For example,
| I'm surprised that syscall latency is something near the top of
| your list. I think the usual wisdom in the DB community is that
| the cost models are mostly fine, but the cardinality estimation
| is really bad.
|
| In terms of the deferred / alternative planning, do you think
| adaptive query execution is a reasonable way to achieve this?
| It certainly allows for information early in the query
| execution to impact later plans. My worry with these approaches
| is that if you get the first couple of joins wrong (which is
| not uncommon), unless you have something like Yannakakis/SIPs,
| you still can't recover.
|
| I am obviously biased on the whole "ML for query optimization"
| thing. One thing I would note is that every "ML for planning"
| approach I've seen does, under the hood, use ML for cost
| discovery/estimation. These approaches are just trying to
| balance the data they collect (exploration) with the quality of
| the plans they produce (exploitation). Interestingly, if you
| use ML in a way that is completely removed from planning, you
| actually get _worse_ query plans despite more accurate
| estimates:
| https://people.csail.mit.edu/tatbul/publications/flowloss_vl...
| (again, I've got a horse in this race, so my opinion should
| come with a side of salt :D)
| aeyes wrote:
| As a DBA I have a very hard time properly tuning parameters
| like random_page_cost, and the default of 4.0 is no longer
| applicable for most database servers.
|
| I don't want to be tuning this; it takes a lot of time to do
| it properly, and I haven't retested this in a long time. I
| just set something that has worked in the past, which is
| probably bad.
|
| I completely agree that Postgres should be able to figure
| this out on its own. This is just an example; there are more
| such parameters which should be adjusted to the hardware, but
| most people will leave the defaults.
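|
| For reference, the usual adjustment for SSD/NVMe storage looks
| like this (1.1 is a common rule-of-thumb value, not something
| Postgres measures for you):
|
|     ALTER SYSTEM SET random_page_cost = 1.1;  -- default is 4.0
|     SELECT pg_reload_conf();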
| srcreigh wrote:
| Awesome paper! Might be my cue to delve into the world of
| cost estimates. My naive brain thinks, how can it be so hard
| :-)
| adgjlsfhk1 wrote:
| My guess is that the reason for putting syscall latency high
| is that it should be easy to fix. Cardinality tracking is a
| hard problem, but running a loop on install that measures the
| cost of a couple dozen syscalls really could be done
| automatically.
| nextaccountic wrote:
| Do you think that MSSQL is better on this front?
| da_chicken wrote:
| > Another omission is foreign key statistics. Joins along
| foreign keys should never have a bad plan, but they still
| occasionally do.
|
| I'm curious what you mean. Postgres, like many RDBMSs, does not
| automatically create an index on a FK. But you have to know
| that already.
|
| Is it join sequencing?
| Sesse__ wrote:
| Having a foreign key should probably be a hint to the planner
| that it should create a cross-table correlation estimate on
| the source and destination columns. This is about join
| cardinality estimation, not having an index.
| zeroimpl wrote:
| > Deferred or alternative scenario planning should be adopted,
| especially for large and expensive queries. As it is today,
| your plan is finalized before it is executed, even though
| earlier stages of execution could provide information (like
| rowcounts or cardinality estimates) that could dramatically
| improve later stage plans.
|
| Alternative planning sounds awesome. I saw a query plan the
| other day where it expected about 1000 rows from some subquery
| and so did a nested loop to an index scan. In reality there
| were about a billion rows. I'm not sure yet why the estimate
| was so bad, but being able to pivot from a nested loop to a
| hash join if the row count crossed some threshold would be
| great at avoiding some catastrophic plans.
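|
| That kind of misestimate shows up directly in EXPLAIN ANALYZE
| (made-up query and numbers, abbreviated output):
|
|     EXPLAIN (ANALYZE, BUFFERS)
|     SELECT * FROM big b JOIN small s ON s.id = b.small_id
|      WHERE s.kind = 'rare';
|     --  Nested Loop  (cost=.. rows=1000 ..)
|     --               (actual time=.. rows=1000000000 loops=1)
|     -- estimated 1000 rows vs ~1e9 actual rows: at that
|     -- cardinality the nested-loop-plus-index-scan plan is
|     -- catastrophic.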
___________________________________________________________________
(page generated 2024-04-17 23:01 UTC)