[HN Gopher] Ten years of improvements in PostgreSQL's optimizer
       ___________________________________________________________________
        
       Ten years of improvements in PostgreSQL's optimizer
        
       Author : samaysharma
       Score  : 254 points
       Date   : 2024-04-17 03:27 UTC (19 hours ago)
        
 (HTM) web link (rmarcus.info)
 (TXT) w3m dump (rmarcus.info)
        
       | nsilvestri wrote:
       | Proebsting's Law comes to mind:
       | https://proebsting.cs.arizona.edu/law.html
        
         | cpeterso wrote:
         | In this case, the researcher built each PostgreSQL version
         | using the same GCC version (13.2) and tested on the same OS.
        
         | waterproof wrote:
         | That seems like a very weak "law". Is it meant as a joke? The
         | supporting evidence is just numbers pulled apparently from
         | nowhere ("let's assume...") and the conclusion is wildly off
          | base. They seem to imply that even if optimization improves
          | the performance of much of the world's software by 4% each
          | year, it's a waste of time.
         | 
         | Murphy's law is the only comparison given. I wonder what the
         | comparative expense is to develop faster and faster hardware,
          | vs. to keep improving compilers. Depending on how the ROIs
          | compare (in dollars-per-percent gain, let's say), maybe this
          | "law" is worth some weight.
         | 
         | OTOH, the Postgres article in question seems to show
         | diminishing returns from optimization, which belies the whole
         | premise of the "law" which assumes the gain is consistent from
          | year to year. And might prove Proebsting's implied point about
         | optimization being a bad investment over the long term!
        
         | devjab wrote:
          | Why? That law seems to talk about how software performance
          | improvements aren't relevant, while this article talks about
          | how the improvements to Postgres have been significant.
         | 
          | Is it because you view the 15% as a low number? Because in
          | context it really, really isn't. It's certainly smaller than
          | the 60% from your linked law, especially if you do the
         | whole 15/10 thing, but you really shouldn't compare Postgres's
         | performance to hardware increases because they aren't the same.
         | You would need absolutely massive amounts of hardware
         | improvements to compare to even 1% performance on what is being
         | measured here.
         | 
          | I don't think the law is as silly as other posters here do,
          | but it's talking about programming language compilers. I
          | wouldn't compare something as unimportant as that to what is
         | arguably one of the most important things in computer
         | science... considering how much data storage and consumption
         | means to our world.
        
         | fear91 wrote:
          | The nice thing about compiler optimizations is that you can
          | improve the performance of existing CPUs without physically
          | touching them. Year by year. You squeeze more out of the
          | machine someone designed. It adds up.
         | 
         | Imagine what environmental impact you would have if you
         | optimized Python's performance by 1%? How much CO2 are you
         | removing from the atmosphere? It's likely to overshadow the
         | environmental footprint of you, your family and all your
         | friends combined. Hell, maybe it's the entire city you live in.
         | All because someone spent time implementing a few bitwise
         | tricks.
        
           | hnben wrote:
           | > Imagine what environmental impact you would have if you
           | optimized Python's performance by 1%?
           | 
            | I imagine this would also increase the complexity of the
            | Python interpreter by more than 1%, which in turn would
            | increase the number of bugs in the interpreter by more than
            | 1%. Which would burn both manpower and CPU cycles on a
            | global scale.
            | 
            | (I assume that optimizations that reduce complexity are
            | already exhausted, e.g. stuff like "removing redundant
            | function calls".)
        
             | fear91 wrote:
             | This is an assumption that a reasonable person naively
             | comes up with.
             | 
             | Then if you actually go ahead and check, it turns out it's
             | not true! It's quite a shocking revelation.
             | 
              | When you dig into the popular compilers/runtimes (with
              | the exception of things like LLVM), many of them still
              | have low-hanging fruit of the form:
             | 
             | a = b + c - b
             | 
             | Yes, the above is still not fully optimized in the official
             | implementations of some popular programming languages.
             | 
             | Also an optimization of "removing redundant function calls"
             | isn't a binary on/off switch. You can do it better or
             | worse. Sometimes you can remove them, sometimes not. If you
             | improve your analysis, you can do more of that and improve
             | performance. Same for DSE, CSE, etc...
        
               | Sesse__ wrote:
                | In many languages, you can't just optimize a + b - b
                | willy-nilly, as there could be side effects, and non-
                | obvious interactions abound. For instance, in
                | JavaScript, where every number is a 64-bit double,
                | a + b - b is definitely not the same as a, given a
                | large enough or small enough a or b. The same goes
                | for floats in LLVM, certainly.
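                | 
                | A quick illustration of the float caveat, in Postgres
                | itself (float8 arithmetic, i.e. 64-bit doubles):
                | 
                |     SELECT 0.1::float8 + 1e17 - 1e17;
                |     -- 0, not 0.1: the small addend is absorbed
                | 
                |     SELECT (0.1::float8 + 0.2) - 0.2;
                |     -- 0.10000000000000003, not 0.1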
        
               | fear91 wrote:
                | It's an example of how trivial some of this low-
                | hanging fruit is; the above is a concrete case that I
                | have personally implemented (arithmetic on in-register
                | 64/32-bit integers). You can get into semantics and
                | restrictions, but I think the point I'm raising is
                | clear.
        
       | uhoh-itsmaciek wrote:
       | Neat, but Postgres version numbering changed with v10. 9.6, 9.5,
       | 9.4, 9.3, 9.2, 9.1, 9.0, 8.4, 8.3, 8.2, 8.1, and 8.0 are
       | effectively all distinct major versions. It'd be interesting to
       | see how perf changed with those.
        
         | guiriduro wrote:
         | But at least they maintained the courtesy of major version
         | filesystem compatibility allowing faster in-place upgrades
         | during an extended period (from v9.0 - 9.6), by simply changing
          | the binaries. Possibly that held them back, but yearly updates
          | that require more downtime/reindexing aren't so much fun, and
         | possibly the reason why many sites skip upgrades until previous
         | versions drop out of support (esp AWS RDS users). While the
         | alternative of a logical replication upgrade (from v10 onward)
          | has availability benefits, it's a major project with unavoidable
         | costs and significant risks unless schemas are relatively
         | simple.
        
           | ioltas wrote:
           | The bar to merge a patch that would break on-disk format is
           | so high that it's really impossible to reach it ;)
           | 
           | More seriously, we care a lot about the format of the 8k
           | pages to ease upgrade paths as much as possible.
        
         | RMarcus wrote:
         | I totally agree -- I picked the latest version for each major
         | version using the semver interpretation of the version numbers,
         | which is not how PostgreSQL has traditionally done major
         | version numbers (i.e., PG 8.2 and 8.1 are different major
         | versions, but I interpreted the version numbers as if they were
         | minor versions).
         | 
         | The main reason I did this was to reduce the number of versions
         | I had to test, but I agree a more complete analysis would test
         | each (true) major version.
        
       | uhoh-itsmaciek wrote:
       | >Of course, not all of these improvements are attributable to the
       | query optimizer.
       | 
       | It would be interesting to see plan changes, if any, across
       | versions.
        
       | throwaway984393 wrote:
       | It would arguably be more accurate to compare the actual systems
       | running Postgres and their precompiled binaries. Nearly
       | everything else in a given system running a given version of an
       | application may have an impact on its performance in subtle ways.
       | Its development occurred in conjunction with these other system
       | components, thus it was optimized for them. In addition,
       | distributions tend to ship their binaries with patches not found
       | upstream which can also affect performance. Everything from the
       | kernel and scheduler to glibc and more may affect performance.
        | Newer isn't always faster. Even the containerization you now use
        | might add a subtle penalty that an older non-containerized setup
        | may not experience.
       | 
       | Basically, it's theoretically possible that some old ass system
       | running version 8 runs it faster than your current system runs
       | version 8 (excluding hardware).
       | 
        | Add to that all the other usual caveats about benchmarks, etc.
        | (like the fact that IO wasn't tested here but is pretty dang
        | relevant to real-world performance).
        
       | edweis wrote:
        | What does query optimisation look like? Does it optimize at the
        | SQL level or the algorithm level?
        
         | BenoitP wrote:
          | It enumerates all the ways the SQL could be executed, then
          | chooses the fastest plan. For example: if you're looking for
          | the user row with user_id xx, do you read the full table and
          | then filter it (you have to look at all the rows)? Or do you
          | use a dedicated data structure to do so (an index lets you do
          | it in time logarithmic in the number of rows)?
         | 
         | A lot more can be done: choosing the join order, choosing the
         | join strategies, pushing the filter predicates at the source,
         | etc. That's the vast topic of SQL optimization.
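          | 
          | A minimal sketch of the two options (hypothetical users
          | table; plans abbreviated and costs illustrative):
          | 
          |     EXPLAIN SELECT * FROM users WHERE user_id = 42;
          | 
          |     -- Without an index: look at every row.
          |     --   Seq Scan on users  (cost=0.00..1693.00 rows=1)
          |     --     Filter: (user_id = 42)
          | 
          |     -- With an index on user_id: logarithmic lookup.
          |     --   Index Scan using users_pkey on users
          |     --       (cost=0.42..8.44 rows=1)
          |     --     Index Cond: (user_id = 42)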
        
           | abhishekjha wrote:
           | Is there a more general reading for software engineers?
           | 
           | Seems like jumping right into the code can be a bit
           | overwhelming if you have no background on the topic.
        
             | orlp wrote:
             | Not exactly reading but I would recommend the database
             | engineering courses by Andy Pavlo that are freely available
             | on YouTube.
        
             | jmgimeno wrote:
              | PostgreSQL Query Optimization: The Ultimate Guide to
              | Building Efficient Queries, by Henrietta Dombrovskaya,
              | Boris Novikov, and Anna Bailliekova, published by Apress
              | in 2021.
        
               | rotifer wrote:
               | You may discover this if you go to the Apress site, but
               | there is a second edition [1] out.
               | 
               | [1] https://hdombrovskaya.wordpress.com/2024/01/11/the-
               | optimizat...
        
             | sbuttgereit wrote:
             | Always worth a mention:
             | 
             | https://use-the-index-luke.com/
             | 
             | Markus Winand (the website's author) has also written a
             | book on the subject targeted at software developers which
             | is decent for non-DBA level knowledge.
        
             | whiterknight wrote:
             | https://www.sqlite.org/queryplanner.html
        
         | magicalhippo wrote:
          | For the databases I've used, which excludes PostgreSQL, most
          | optimization happens at the algorithmic level, that is,
          | selecting the best algorithms to use for a given query and in
          | what order to execute them.
         | 
         | From what I gather this is mostly because various different SQL
         | queries might get transformed into the same "instructions", or
         | execution plan, but also because SQL semantics doesn't leave
         | much room for language-level optimizations.
         | 
         | As a sibling comment noted, an important decision is if you can
         | replace a full table scan with an index lookup or index scan
         | instead.
         | 
         | For example, if you need to do a full table scan, and do
         | significant computation per row to determine if a row should be
          | included in the result set, the optimizer might replace the
          | full table scan with a parallel table scan and then merge the
          | results from each parallel task.
         | 
          | When writing performant code for a compiler, you'll want to
          | know how your compiler's optimizer transforms source code into
          | machine instructions, so you prefer writing code the optimizer
          | handles well and avoid writing code for which it emits slower
          | machine instructions. After all, the optimizer is programmed
          | to detect certain patterns and transform them.
         | 
         | Same thing with a query optimizer and its execution plan.
         | You'll have to learn which patterns the query optimizer in the
         | database you're using can handle and generate efficient
         | execution plans for.
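          | 
          | In PostgreSQL's case, a sketch of what such a plan looks
          | like in EXPLAIN (hypothetical events table and predicate;
          | output abbreviated):
          | 
          |     EXPLAIN SELECT count(*) FROM events
          |      WHERE expensive_predicate(payload);
          | 
          |     -- Finalize Aggregate
          |     --   ->  Gather
          |     --         Workers Planned: 4
          |     --         ->  Partial Aggregate
          |     --               ->  Parallel Seq Scan on events
          |     --                     Filter: expensive_predicate(payload)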
        
           | Sesse__ wrote:
           | Postgres is a fairly standard System R implementation. You
           | convert your SQL into a join tree with some operations at the
           | end, and then try various access paths and their
           | combinations.
        
         | mdavidn wrote:
         | At a very high level, the query planner's goal is to minimize
         | the cost of reading data from disk. It gathers pre-computed
         | column statistics, like counts of rows and distinct values, to
         | estimate the number of rows a query is likely to match. It uses
         | this information to order joins and choose indexes, among other
         | things. Joins can be accomplished with several different
         | algorithms, like hashing or looping or merging. The cheapest
         | option depends on factors like whether one side fits in working
         | memory or whether both sides are already sorted, e.g. thanks to
         | index scans.
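          | 
          | Those pre-computed statistics are visible directly, e.g.
          | (hypothetical users table):
          | 
          |     ANALYZE users;  -- refresh the statistics
          | 
          |     -- planner's estimated row count
          |     SELECT reltuples FROM pg_class WHERE relname = 'users';
          | 
          |     -- per-column stats the planner uses
          |     SELECT n_distinct, null_frac FROM pg_stats
          |      WHERE tablename = 'users' AND attname = 'country';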
        
         | paulddraper wrote:
          | The query optimizer chooses which algorithms produce the
          | results requested by the SQL.
        
       | morepork wrote:
       | Site seems to be down, you can try this instead:
       | https://web.archive.org/web/20240417050840/https://rmarcus.i...
        
       | BitPirate wrote:
        | The author mentions PostgreSQL's JIT compiler. To this day, I've
        | only seen it degrade the performance of queries. Disabling it is
        | on my install checklist.
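        | 
        | For reference, that checklist item (cluster-wide; it can also
        | be set per session, role, or database):
        | 
        |     ALTER SYSTEM SET jit = off;
        |     SELECT pg_reload_conf();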
        
         | LunaSea wrote:
         | The JIT compiler is great for analytical queries.
         | 
          | You can configure thresholds for JIT activation in PostgreSQL
          | as well, if you want to raise the bar above which the JIT
          | kicks in.
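          | 
          | The relevant settings, with their documented defaults:
          | 
          |     SET jit = on;
          |     SET jit_above_cost = 100000;
          |     SET jit_inline_above_cost = 500000;
          |     SET jit_optimize_above_cost = 500000;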
        
           | refset wrote:
           | For analytical queries note that you really have to learn how
           | to express the queries efficiently with Postgres -
           | unfortunately the optimizer is still missing lots of tricks
           | found in more sophisticated engines [0][1][2]. JIT
           | compilation alone can't get close to making up for those
           | gaps.
           | 
           | [0] https://pganalyze.com/blog/5mins-postgres-optimize-
           | subquerie...
           | 
           | [1] https://duckdb.org/2023/05/26/correlated-subqueries-in-
           | sql.h...
           | 
           | [2] "Unnesting Arbitrary Queries"
           | https://cs.emis.de/LNI/Proceedings/Proceedings241/383.pdf
        
             | Sesse__ wrote:
              | Postgres isn't really hot on implementing the latest stuff
              | from academia (most of what they do seems to be fairly
              | home-grown, not least because some of it was never really
              | written about before they started doing it). Arbitrary
              | unnesting is certainly one of the optimizations they don't
              | do; fancy sketches and other estimation techniques
              | generally likewise. But what they're really good at is
              | polish. A thousand small tweaks and optimizations to iron
              | out issues for 0.1% of queries. It tends to have fewer
              | sharp edges than any other RDBMS I've used.
        
           | dur-randir wrote:
           | But PG is much more commonly used for OLTP, not OLAP. I'm
           | baffled to this day that they enabled it by default.
        
             | LunaSea wrote:
             | OLAP is a very common use case for PostgreSQL as well.
        
         | baba92 wrote:
          | A customer had the worst performance issues after switching to
          | Postgres. Weirdly, it only happened in the Docker and test
          | server setups and not on the developer's machine (he ran
          | Postgres via Homebrew).
          | 
          | Turns out Homebrew installed Postgres without JIT support: one
          | query on the developer's machine ran in 200ms, while with JIT
          | the query took 4-5 seconds. Took some time to figure out,
          | since I'm not a heavy Postgres user, but since then I've
          | always disabled JIT and never looked back.
        
         | masklinn wrote:
         | Yeah pg's JIT is a pretty good demonstration that LLVM is not
         | great at JITs, worsened by postgres not having a good
         | persistent shared query cache.
         | 
         | It would probably be less detrimental if it could perform
         | asynchronous compilation for future queries (which really is
         | closer to how normal JITs actually behave, at least for
         | optimising backends)
        
           | MaxBarraclough wrote:
           | > LLVM is not great at JITs
           | 
           | That seems like a generalisation. It's true LLVM will never
           | be a lightweight toolkit, but if you want to generate highly
           | optimised code it seems like a solid choice, assuming you're
           | doing relatively 'vanilla' code-gen work.
        
           | SigmundA wrote:
           | PG just needs to do like the other databases and cache query
           | plans, also add hints please.
        
             | masklinn wrote:
             | > PG just needs to do like the other databases and cache
             | query plans
             | 
             | The inline blocking compilation step will still fuck your
             | query.
             | 
             | > also add hints please
             | 
             | I'd much rather they allowed bypassing the planner entirely
             | and had an interface to feed plans directly to the
             | executor.
        
               | SigmundA wrote:
               | >The inline blocking compilation step will still fuck
               | your query
               | 
                | Only on the first execution; the long plan/JIT time
                | is usually only an issue when running a fast query
                | over and over again.
                | 
                | However, if plans were cached, then you could also
                | plan without JITing, JIT in the background, and
                | update the cached plan for next time, which would be
                | even smarter.
               | 
               | >I'd much rather they allowed bypassing the planner
               | entirely and had an interface to feed plans directly to
               | the executor
               | 
                | Other databases have plan locking, where you can
                | load an existing plan from somewhere else in
                | serialized form (XML) and force its use. Most of the
                | time, a few hints in the query source solve almost
                | all issues with the planner.
        
               | RMarcus wrote:
               | PostgreSQL has coarse-grain query hints (like
               | `enable_hashjoin`), and the excellent `pg_hint_plan`
               | extension allows you to specify a complete or partial
               | execution plan: https://github.com/ossc-db/pg_hint_plan
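                | 
                | For example (pg_hint_plan's comment syntax;
                | tables hypothetical):
                | 
                |     -- built-in, coarse: discourage hash
                |     -- joins for the whole session
                |     SET enable_hashjoin = off;
                | 
                |     -- pg_hint_plan: pin the join order and
                |     -- methods for a single query
                |     /*+ Leading((t1 t2)) NestLoop(t1 t2)
                |         IndexScan(t2) */
                |     SELECT * FROM t1 JOIN t2 USING (id);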
        
         | nextaccountic wrote:
         | Can't postgres JIT a query once and run the compiled query
         | multiple times?
        
       | DoubleFree wrote:
       | The postgres query optimizer will try to minimize the number of
       | pages read from disk (and the number of intermediate pages
       | written to disk). Benchmarking the query optimizer by making the
       | shared buffers large enough to hold all the data therefore seems
       | wrong, as you're then measuring the speed of the query optimizer
       | and the join processor, instead of the quality of the generated
       | query plans. It would not surprise me if the generated plans for
       | these versions are actually all the same and this is only
       | measuring execution speed.
        
         | Sesse__ wrote:
         | No, it optimizes cost, which includes pages read from disk
         | _and_ things like CPU use. Cost is an arbitrary unit meant to
         | correlate with time spent, not by disk loads, so it's
         | completely reasonable to compare plans with everything loaded
         | in RAM. (Cost is by convention scaled such that one page read
         | from disk is 1.0, but that's a different things from "the
         | optimizer will try to minimize the number of pages read from
         | disk". It could just as well have been scaled so that 1.0 was 1
         | ms on some arbitrary machine.)
        
         | RMarcus wrote:
         | It is certainly possible that the plans are similar, and that
         | improvements to the execution engine are being measured. The
         | join order benchmark was designed to test optimizer quality. It
         | is worth noting that _in addition_ to trying to measure the
         | number of pages read from disk, the PG optimizer _also_ tries
         | to reduce the number of tuples examined by the CPU, the number
         | of predicate evaluations, etc. All these numbers are rolled up
         | into a  "cost," which is the function that the optimizer
         | minimizes.
         | 
         | It is also true that measuring cold cache and warm cache
         | performance can produce different results, and this experiment
         | is certainly in the warm cache scenario. But, the cold cache
         | scenario suffers from the problem you mention as well: an
         | improvement to PG's B-tree that saves a few IOs will dominate
         | any kind of CPU-based improvement (at least at the data size of
         | the join order benchmark).
         | 
         | FWIW, the plan for the query with the P90 latency changes from
         | a plan that uses loop and merge join in PG8.4 to a plan that
         | uses hash join in PG16 (where it is no longer the P90 query),
         | which is at least some evidence of optimizer improvements.
        
           | petergeoghegan wrote:
           | > It is certainly possible that the plans are similar, and
           | that improvements to the execution engine are being measured.
           | The join order benchmark was designed to test optimizer
           | quality.
           | 
           | I don't think that it's possible to test optimizer quality in
           | isolation -- not really, not if it's to be in any way
           | practical. Many (most?) individual improvements to the
           | optimizer are hopelessly intertwined with executor
           | enhancements. This is usually fairly obvious, because the
           | same commit changes both the optimizer and the executor
           | together. Sometimes it's much less obvious, though, because
            | it's more a case of the optimizer and executor coevolving.
           | 
           | It's probably still the case that a lot of the improvements
           | seen here are pure executor enhancements, but I can't say
           | that I'm very confident about that. (As the main person
           | behind those B-Tree IO savings you might expect me to have
           | more confidence than that, but I find it horribly difficult
           | to tease these issues apart while keeping the discussion high
           | level/relevant.)
        
       | mehulashah wrote:
        | I'm a bit confused by the analysis here. How do you confirm a
        | downward trend in the data that is not visible on the graph? It
        | looks like the median drops a little in the first few versions,
        | but then rises in the last few. The R^2 is really low, so I'm
        | not sold on the correlation here. Basically, it looks like tail
        | latency has improved and YMMV for everything else.
        
         | RMarcus wrote:
         | I'm the author of the blog post.
         | 
         | "Tail latency has improved by YMMV for everything else" => yes,
         | I think that's a valid (but conservative) read. Of course, in
         | many (most?) applications, tail latency is very important. Tail
         | latency also tends to be the thing that optimizer engineers
          | target (i.e., reducing the runtime of the longest-running query).
        
       | darksaints wrote:
       | As someone who has worked with Postgres for 15 years and also has
       | spent most of my career modeling and solving mathematical
       | optimization problems, I have so much to say on this topic, but
       | I'll leave it at these three points:
       | 
        | * Every optimization problem needs cost data, and the more and
        | better that data is, the better the plans. Postgres has made some
       | improvements here, especially with cross column statistics, but
       | there are still massive improvements left on the table. The most
       | glaring omission is data on syscall latencies. Reading a page
       | from disk has dramatically different latencies from system to
       | system, and postgres still relies on configuration values for
       | these costs, when it could very easily be measuring them. Another
       | omission is foreign key statistics. Joins along foreign keys
       | should never have a bad plan, but they still occasionally do.
       | 
       | * Deferred or alternative scenario planning should be adopted,
       | especially for large and expensive queries. As it is today, your
       | plan is finalized before it is executed, even though earlier
       | stages of execution could provide information (like rowcounts or
       | cardinality estimates) that could dramatically improve later
       | stage plans.
       | 
       | * Machine learning most definitely could be an area where
       | improvements could be made, but I've been unimpressed with the
        | efforts that I've seen. Don't use machine learning for planning;
       | use machine learning for cost discovery and estimation. Build
       | better cost models, and then let the optimization engine work
       | with that data.
        
         | RMarcus wrote:
         | I'm curious to hear a bit more of your opinion. For example,
         | I'm surprised that syscall latency is something near the top of
         | your list. I think the usual wisdom in the DB community is that
         | the cost models are mostly fine, but the cardinality estimation
         | is really bad.
         | 
         | In terms of the deferred / alternative planning, do you think
         | adaptive query execution is a reasonable way to achieve this?
         | It certainly allows for information early in the query
         | execution to impact later plans. My worry with these approaches
         | is that if you get the first couple of joins wrong (which is
         | not uncommon), unless you have something like Yannakakis/SIPs,
         | you still can't recover.
         | 
         | I am obviously biased on the whole "ML for query optimization"
         | thing. One thing I would note is that every "ML for planning"
         | approach I've seen does, under the hood, use ML for cost
         | discovery/estimation. These approaches are just trying to
         | balance the data they collect (exploration) with the quality of
         | the plans they produce (exploitation). Interestingly, if you
         | use ML in a way that is completely removed from planning, you
         | actually get _worse_ query plans despite more accurate
         | estimates:
         | https://people.csail.mit.edu/tatbul/publications/flowloss_vl...
         | (again, I've got a horse in this race, so my opinion should
         | come with a side of salt :D)
        
           | aeyes wrote:
           | As a DBA I have a very hard time properly tuning parameters
            | like random_page_cost, and the default of 4.0 is no longer
           | applicable for most database servers.
           | 
            | I don't want to be tuning this; it takes a lot of time to do
            | it properly and I haven't retested it in a long time. I
            | just set something that has worked in the past, which is
            | probably bad.
           | 
           | I completely agree that Postgres should be able to figure
            | this out on its own. This is just one example; there are
            | more such parameters which should be adjusted to the
            | hardware, but most people will leave the defaults.
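            | 
            | For reference, the kind of tuning in question (1.1 is a
            | commonly recommended value for SSD/NVMe storage):
            | 
            |     SHOW random_page_cost;  -- default 4.0
            |     ALTER SYSTEM SET random_page_cost = 1.1;
            |     SELECT pg_reload_conf();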
        
           | srcreigh wrote:
           | Awesome paper! Might be my cue to delve into the world of
           | cost estimates. My naive brain thinks, how can it be so hard
           | :-)
        
           | adgjlsfhk1 wrote:
           | My guess is that the reason for putting syscall latency high
           | is that it should be easy to fix. Cardinality tracking is a
           | hard problem, but running a loop on install that measures the
           | cost of a couple dozen syscalls really could be done
           | automatically.
        
         | nextaccountic wrote:
         | Do you think that MSSQL is better on this front?
        
         | da_chicken wrote:
         | > Another omission is foreign key statistics. Joins along
         | foreign keys should never have a bad plan, but they still
         | occasionally do.
         | 
          | I'm curious what you mean. Postgres, like many RDBMSs, does not
         | automatically create an index on a FK. But you have to know
         | that already.
         | 
         | Is it join sequencing?
        
           | Sesse__ wrote:
           | Having a foreign key should probably be a hint to the planner
           | that it should create a cross-table correlation estimate on
           | the source and destination columns. This is about join
           | cardinality estimation, not having an index.
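            | 
            | Today the closest thing is extended statistics, which are
            | single-table only and have to be requested by hand
            | (hypothetical orders table):
            | 
            |     CREATE STATISTICS orders_corr (dependencies, ndistinct)
            |       ON customer_id, status FROM orders;
            |     ANALYZE orders;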
        
         | zeroimpl wrote:
         | > Deferred or alternative scenario planning should be adopted,
         | especially for large and expensive queries. As it is today,
         | your plan is finalized before it is executed, even though
         | earlier stages of execution could provide information (like
         | rowcounts or cardinality estimates) that could dramatically
         | improve later stage plans.
         | 
         | Alternative planning sounds awesome. I saw a query plan the
         | other day where it expected about 1000 rows from some subquery
         | so did a nested loop to an index scan. In reality there's about
         | a billion rows. I'm not sure yet why the estimate was so bad,
         | but being able to pivot from a nested loop to a hash join if
         | the row count crossed some threshold would be great at avoiding
         | some catastrophic plans.
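          | 
          | EXPLAIN ANALYZE is how you spot those, since it prints
          | estimated and actual row counts side by side (output
          | abbreviated and illustrative):
          | 
          |     EXPLAIN ANALYZE
          |     SELECT * FROM orders o
          |       JOIN line_items li USING (order_id);
          | 
          |     -- Nested Loop (cost=0.43..845.12 rows=1000 ...)
          |     --   (actual time=... rows=1000000000 loops=1)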
        
       ___________________________________________________________________
       (page generated 2024-04-17 23:01 UTC)