[HN Gopher] Speeding up sort performance in Postgres 15
       ___________________________________________________________________
        
       Speeding up sort performance in Postgres 15
        
       Author : todsacerdoti
       Score  : 292 points
       Date   : 2022-05-20 07:58 UTC (15 hours ago)
        
 (HTM) web link (www.citusdata.com)
 (TXT) w3m dump (www.citusdata.com)
        
       | MR4D wrote:
        | It is this type of analysis and discipline that I love about the
       | PG team.
       | 
       | It takes a lot of dedication and patience to do, but I wish more
       | software teams did this. Perhaps schools should put more emphasis
       | on this type of analysis.
       | 
       | Maybe then Call of Duty would load in a reasonable time. (sorry,
       | pet peeve of mine)
        
         | maccard wrote:
         | I work in video games and have done my fair share of perf work.
         | Pretty much every game does this kind of work, and all the live
         | service games have had a full team dedicated to ongoing
         | performance work.
         | 
          | I could easily sit and throw shade that Twitter takes longer
          | to show my timeline than most PS5 games take to boot up and
          | load into a game, but that doesn't help anyone.
        
           | MR4D wrote:
           | You make a good point, and I am aware of that. Frankly, the
           | engineering that goes into any FPS (even the bad ones) is
           | beyond amazing, especially when you add networking latencies
           | to the complexity.
           | 
           | I should have chosen websites for my rant, but CoD has been
           | on my mind this week due to the weird way you have to keep
           | getting into a game every time and go through what seems like
           | multiple home screens in a way that is either a bug, or just
           | poorly thought out. Either way, it is a weird waste of time,
           | and annoying when you're trying to play two back-to-back
           | games.
        
           | hinkley wrote:
            | It's typical in non-gaming teams to have any instinct to
            | optimize beaten out of you rather than guided.
           | 
            | Knuth's comments about premature optimization have been so
            | misquoted at this point that we've gone way beyond people
            | using them as permission not to think; they also use them
            | as a mandate to stop other people from thinking as well.
           | 
            | Academics are particularly bad at falling back on
            | aphorisms, so changing the curriculum to fix this would
            | also involve changing the teachers, which is an automatic
            | conversation killer.
           | 
            | In my case I was keeping my own counsel before anyone
            | tried to beat it out of me, and I learned to just keep
            | things to myself. On the plus side I know dozens of ways
            | to refactor for legibility and 'accidentally' improve
            | performance at the same time.
        
             | HideousKojima wrote:
             | >On the plus side I know dozens of ways to refactor for
             | legibility and 'accidentally' improve performance at the
             | same time.
             | 
             | My favorite is "That nested loop is kind of hard to read,
             | how about making a dictionary/hashmap instead?"
        
           | zerocrates wrote:
           | There's also the factor of what kinds of performance you care
           | about and are thinking about, and what you're not.
           | 
            | There was the memorable story shared around here a while
            | ago of a player patching GTA V's online live service to
            | eliminate multiple-minute load times [1]. I'm sure
            | Rockstar had/has many people working on performance (the
            | core game is a well-known good performer and has now been
            | on three different console generations since launch), but
            | just not this.
           | 
           | (Though to totally miss such truly extensive load times on
           | your main revenue-generating component, that one's a
           | headscratcher. Feels like something you'd end up addressing
           | just for your own internal sanity. Maybe everyone just had
           | monster workstations and didn't notice.)
           | 
           | [1] https://news.ycombinator.com/item?id=26296339
        
       | darksaints wrote:
        | One of the things I have been consistently excited about is
        | the new pluggable storage API, and I had followed a handful of
       | projects that were working on plugins, like zedstore, zheap, and
       | cstore. But now it seems like all of those efforts have
       | stagnated. Any idea what happened?
        
         | ccleve wrote:
         | The difficulty is that the new table access method is very
         | complex and hard to get right. I'm building something using a
         | regular index access method now and it's hard enough. Once I
         | get all that infrastructure working properly I may take another
         | crack at a table access method.
         | 
         | For some purposes you can get away with a foreign data wrapper.
         | FDWs are a lot simpler.
        
         | jfbaro wrote:
          | OrioleDB is another one
        
       | kabes wrote:
        | A couple of months back there was a post from someone who made
        | a presentation about a lot of things that are 'bad' in
        | postgres (mostly performance-wise) and a github repo that
        | tried to fix them (initially with patches, with the idea of
        | later getting them either upstreamed or shipped as an addon).
       | 
       | I can't find anything of this effort anymore. Does someone know
       | what I'm talking about and what the current status of that is?
       | 
       | EDIT: found it: https://github.com/orioledb/orioledb
       | 
       | I'd love to hear what PG devs here think of his criticism and
       | proposed solutions.
        
         | fuy wrote:
          | Agree with you - would love to see some of the core PG devs
          | pay attention to and discuss this work.
        
           | ConnorLeet wrote:
           | I think that this repo was created by a core PG dev.
           | 
           | https://github.com/orioledb/orioledb/commits?author=akorotko.
           | ..
           | 
           | https://github.com/akorotkov
        
         | canadiantim wrote:
        
       | gampleman wrote:
        | Could the specialisation of sort functions be further improved
        | by using something like radix sort for integers?
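        | 
        | To sketch what I mean (toy code for unsigned 32-bit keys
        | only; signed ints would need the sign bit flipped, and this
        | is not what the patch does):
        | 
        |     #include <stdint.h>
        |     #include <stdlib.h>
        |     #include <string.h>
        | 
        |     /* LSD radix sort: one stable counting-sort pass per
        |      * byte, least significant byte first. Error handling
        |      * omitted. */
        |     static void radix_sort_u32(uint32_t *a, size_t n)
        |     {
        |         uint32_t *tmp = malloc(n * sizeof(uint32_t));
        | 
        |         for (int shift = 0; shift < 32; shift += 8)
        |         {
        |             size_t count[257] = {0};
        | 
        |             for (size_t i = 0; i < n; i++)
        |                 count[((a[i] >> shift) & 0xFF) + 1]++;
        |             for (int b = 0; b < 256; b++)   /* prefix sums */
        |                 count[b + 1] += count[b];
        |             for (size_t i = 0; i < n; i++)  /* stable scatter */
        |                 tmp[count[(a[i] >> shift) & 0xFF]++] = a[i];
        |             memcpy(a, tmp, n * sizeof(uint32_t));
        |         }
        |         free(tmp);
        |     }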
        
       | JacKTrocinskI wrote:
       | They added the MERGE statement!
       | 
       | ...and better defaults for the public schema.
       | 
       | Really looking forward to Postgres 15.
        
       | avar wrote:
       | A related discussion about inlining these sort functions,
       | templating etc.: https://www.postgresql.org/message-
       | id/flat/CA%2BhUKGJ2-eaDqA...
        
       | hinkley wrote:
       | I'm surprised they use quicksort. I would think in a database
       | you'd value predictability over best case behavior.
       | 
       | Also is it a stable quicksort?
        
       | Jamie9912 wrote:
        | Great stuff. What's the best way to upgrade to a new major
        | version of Postgres when it comes out?
        
         | mike-cardwell wrote:
         | https://www.postgresql.org/docs/current/upgrading.html
        
         | dewey wrote:
         | This is just a data point but in my experience over the past
         | few years Postgres updates are usually nothing to worry about.
          | There are other systems where updates come with a lot more
          | baggage attached (like Elasticsearch), where you have to
          | refactor schemas and stay up to date with the documentation
          | way
         | more than with PG.
        
           | jon_richards wrote:
           | Homebrew on mac has a long-standing upgrade bug with Postgres
           | extensions like postgis. It uninstalls the old postgis before
            | copying the data over and then gets stuck in a
            | hard-to-recover state. I always assume I'm going to lose
            | all my data whenever I update major versions.
        
         | speedgoose wrote:
         | It's difficult to say without knowing your situation.
         | Personally I would wait a few months at least to be sure that
         | the biggest bugs are killed. Then if it's a new project or
         | something not critical, I may upgrade without any precaution
         | and hope for the best. For something more critical I would rely
          | on the integration tests and manual testing in a pre-production
         | environment for a few weeks. Before the final upgrade, I would
          | do a few practice upgrade runs and be ready to roll back
          | if needed.
        
           | altendo wrote:
           | > Personally I would wait a few months at least to be sure
           | that the biggest bugs are killed.
           | 
           | This is my default behavior with any database product. Solid
           | advice.
        
       | zeckalpha wrote:
       | One graph has sorting a 10GB table and refers to it as a "Large
       | sort"... is 10GB large any more?
        
         | dagw wrote:
         | "Large" in this context is stated to mean significantly larger
         | than work_mem, which defaults to 4MB out of the box.
          | Interestingly, performance (relative to PG14) seems to get
          | worse as work_mem increases and the ratio of table size to
          | work_mem decreases. If this holds, then for a fixed
          | work_mem, PG15's relative performance should improve as
          | table size increases.
        
           | pigcat wrote:
           | > performance (relative to PG14) seems to get worse as
           | work_mem increases and the ratio of table size to work_mem
           | decreases
           | 
           | Do you have any source for this? This is the first time I've
           | ever heard this and it seems contrary to everything I've
           | read.
        
         | d_watt wrote:
          | Not for OLAP, but this seems to be about OLTP. If you're
          | trying to sort 10GB of data for a traditional OLTP query, I
          | would consider that "too large" and in need of some data
          | structure optimization.
         | 
         | Also it's specifically about queries where the sorted data is
         | too big for memory:
         | 
         | > The final change is aimed at larger scale sorts that exceed
         | the size of work_mem significantly. The algorithm which merges
         | individual tapes was changed to use a k-way merge. When the
         | number of tapes is large, less I/O is required than the
         | original polyphase merge algorithm.
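          | 
          | For flavor, the heart of a k-way merge is just a small
          | binary heap over the current head of each sorted run. A toy
          | sketch (assumes non-empty int runs; nothing like the actual
          | tape code):
          | 
          |     #include <stddef.h>
          | 
          |     typedef struct { const int *p, *end; } Head;
          | 
          |     static void sift_down(Head *h, size_t i, size_t k)
          |     {
          |         for (;;)
          |         {
          |             size_t l = 2 * i + 1, r = l + 1, m = i;
          |             if (l < k && *h[l].p < *h[m].p) m = l;
          |             if (r < k && *h[r].p < *h[m].p) m = r;
          |             if (m == i) return;
          |             Head t = h[i]; h[i] = h[m]; h[m] = t;
          |             i = m;
          |         }
          |     }
          | 
          |     /* Merge k sorted runs into out: O(N log k) compares,
          |      * one pass over the data. */
          |     static void kway_merge(Head *h, size_t k, int *out)
          |     {
          |         for (size_t i = k; i-- > 0;)    /* heapify */
          |             sift_down(h, i, k);
          |         while (k > 0)
          |         {
          |             *out++ = *h[0].p++;         /* pop smallest */
          |             if (h[0].p == h[0].end)     /* run used up */
          |                 h[0] = h[--k];
          |             sift_down(h, 0, k);
          |         }
          |     }
          | 
          | The point of k-way vs polyphase is exactly what the quote
          | says: fewer passes over the runs, so less I/O.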
        
           | hinkley wrote:
            | It is a common complaint in discussions about sort
            | algorithms that sorting plain numbers is a degenerate
            | case, and 'real' cases involve at least pointer chasing,
            | which can make the cost of compares stop being constant,
            | or give them a cost greater than log(n) (or worse, n).
           | 
           | Sometimes locality of reference may increase the total number
           | of comparisons but decrease the average cost considerably.
           | For practical applications like databases, it's the runtime
           | that matters, not the theory.
        
       | nhoughto wrote:
        | Interesting to see there is still some more to squeeze out of
        | the old girl yet. Most seem like things that could have been
        | attempted before now but I guess weren't the top pain points
        | to fix. I wonder how many more 4-6% savings there are in
        | there; good to see people going deep and keeping the
        | improvements coming.
        
         | mattashii wrote:
          | I'm currently working (on and off) on improving the
          | performance of specific "kinds"/"shapes" of btree indexes by
          | several percent (for some shapes, tens of percent). I hope
          | and expect that it'll be ready for Pg16, but that of course
          | assumes that the concept behind the implementation is
          | acceptable.
         | 
         | See https://www.postgresql.org/message-
         | id/flat/CAEze2Wg52tsSWA9F... for specifics
        
           | mchusma wrote:
           | Thanks so much for your contributions! The community for PG
           | is so great.
        
         | jeffbee wrote:
         | Anyone who is using a distro package of pg has the largest
         | opportunity because they are all built wrong. A peak-optimized
         | binary will get you much more than 4% speed up over generic
         | packages.
        
           | brightball wrote:
           | Using PG for years and this is the first time somebody made
           | me aware of this, so thank you.
        
           | Bootvis wrote:
            | Can you please elaborate: what's wrong, and what do I need
            | to do when building from source?
        
             | adgjlsfhk1 wrote:
              | Compiling with -march=native can often make a
              | significant difference.
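              | 
              | When building from source, something like this (flags
              | illustrative, untested):
              | 
              |     ./configure CFLAGS='-O2 -march=native'
              |     make -j"$(nproc)" && make install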
        
               | mayama wrote:
                | Will the native march option work in the cloud, where
                | you're not in control of the CPU?
        
               | jeffbee wrote:
               | I haven't used a cloud where you don't dictate the exact
               | CPU you want. EC2, GCE, and Azure all provide specific
               | guarantees. Even if you wanted a generic any-cloud binary
               | you would probably compile with -march=skylake or, if
               | truly paranoid, sandybridge.
               | 
               | Many distro packages are, by contrast, compiled for
               | k8-generic. Freaking Opteron!
               | 
               | BTW "native" is not really recommendable. One really
               | ought to target a particular platform.
        
               | brianwawok wrote:
                | Do Google's and AWS's hosted Postgres do this? I
                | assume so, but I am not sure how to check.
        
               | jeffbee wrote:
               | I don't know but it would be out-of-character for Google
               | to have a C program that wasn't optimized to the hilt.
        
               | jeffbee wrote:
                | Compiling for your target is good; not using the PLT
                | for mem* functions helps a ton; feedback-directed
                | optimization, LTO, and/or post-link optimization help
                | the most. For CPU-bound workloads, of course.
        
         | hinkley wrote:
          | In the long tail it's more a matter of choosing a module or
          | concern and squeezing a percent here and a percent there
          | until it adds up to 5%, rather than looking for tall tent
          | poles. It also means counting calls for a fixed workload,
          | and checking whether a call stack really should call
          | functionB nine times for three results, or whether that
          | should be six or three. The function is still exactly the
          | same cost per call, but is now 1% of the overall cost
          | instead of 3%.
        
       | mkurz wrote:
       | Great work!
        
       | nitinreddy88 wrote:
       | "The first test I ran while working on reducing the memory
       | consumption of sort improved performance by 371%"
       | 
       | This is impressive in terms of performance for SQL Server
       | operations. Imagine the end performance benefits for
       | analytical/warehouse model of queries.
        
         | mattashii wrote:
         | > SQL Server operations
         | 
          | I might be misunderstanding you, but this is a post about
         | PostgreSQL, not Microsoft's SQL Server.
         | 
         | > performance benefits for analytical/warehouse model
         | 
         | That performance result was for a fully in-memory sort
          | operation, which you are less likely to see in data
         | warehouses.
        
           | dspillett wrote:
           | _> but this is a post about PostgreSQL, not Microsoft 's SQL
           | Server._
           | 
           | I assume that what was meant is what I'd say as "SQL engine"
           | or "SQL query runner". It is unfortunate that MS has a habit
           | of using common words in its product names. PG _is_ a SQL
           | server (no capital) in the sense that it is a server that
           | processes SQL statements for retrieving and manipulating
           | data.
        
       | dewey wrote:
        | There are a lot of interesting improvements in PG15; some of my
       | favorites from
       | https://www.postgresql.org/about/news/postgresql-15-beta-1-r...
       | are:
       | 
       | >PostgreSQL 15 introduces the jsonlog format for logging. This
       | allows PostgreSQL logs to be consumed by many programs that
       | perform structured log aggregation and analysis.
       | 
       | >pg_basebackup, a utility used to take full backups of a
       | PostgreSQL cluster, now supports server-side compression using
       | Gzip, LZ4, or Zstandard compression.
       | 
       | >This includes the introduction of parallelization for SELECT
       | DISTINCT statements and improvements in performance to window
       | functions that use row_number(), rank(), and count().
       | 
       | >PostgreSQL 15 builds on its existing support for the SQL/JSON
       | path language by including more standard SQL/JSON functions.
       | These include SQL/JSON constructors, query / introspection
       | functions, and the ability to convert JSON data into a table.
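        | 
        | The jsonlog one is easy to try on a 15 beta; a hypothetical
        | session (jsonlog also requires logging_collector = on, which
        | needs a restart):
        | 
        |     ALTER SYSTEM SET log_destination = 'jsonlog';
        |     SELECT pg_reload_conf();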
        
         | baggiponte wrote:
         | That's fascinating! Totally orthogonal: how does PG store data?
         | Do they use columnar specifications such as Arrow & Parquet?
        
           | chrsig wrote:
            | The main table is stored in a 'heap', which is more
            | similar to the dynamic allocation region of a program than
            | to the data structure of the same name.
           | 
           | If it needs to write to the table, it has a free space map
           | listing all of the available fixed size pages. Each page can
           | hold some number of fixed size tuples.
           | 
           | I'm surprised they've gotten so far with this. It seems ripe
           | for fragmentation issues. I'm surprised they don't at least
           | have a b+tree implementation to provide clustering based on
           | the primary key. They have a b+tree implementation for
           | indices, but not for tuple storage. IIRC, there's a feature
            | to allow fixed-size data rows to be stored directly in the
            | index, to help alleviate the need to look in the heap table.
            | Perhaps that's what's kept the heap viable?
           | 
           | Variable length strings are kept in a separate file (TOAST
           | tables), referenced by the tuple. This keeps records fixed
           | size, and allows for circumventing disk reads for string data
           | for queries that don't require them. I'm not quite sure how
           | the allocation scheme or data structure works there. My (very
           | speculative) guess is that they're stored as a linked list,
           | and on write, the allocator tries to give contiguous pages.
           | 
           | fwiw, I don't use postgresql on a regular basis, I've just
           | spent some good amount of time squinting at the file format
           | and related code, soley due to my own curiosity. It's been a
           | while, so it's very possible that I'm misrepresenting or
           | misremembering something. Someone has already posted their
           | documentation on the file format -- it's a good read if
           | you're interested.
        
             | tomnipotent wrote:
             | > but not for tuple storage
             | 
             | Both tables and indexes use the same b+-tree implementation
             | based on Lehman and Yao (prev/next links on internal
             | nodes).
             | 
             | > Each page can hold some number of fixed size tuples
             | 
             | Variable size.
             | 
             | > Variable length strings are kept in a separate file
             | 
             | Only if the row exceeds the page size, otherwise everything
             | is stuffed into the data leaf node.
             | 
             | > I'm not quite sure how the allocation scheme or data
             | structure works there
             | 
             | Values are broken up into DataSize/PageSize "virtual rows"
             | and stitched back together, but everything is still in the
             | same slotted page structure.
        
               | SigmundA wrote:
               | >Both tables and indexes use the same b+-tree
               | implementation based on Lehman and Yao (prev/next links
               | on internal nodes).
               | 
                | Tables in Postgres are not btrees; they are unordered
               | heaps. The row id in the b-tree index points to a page
               | number + item index in the heap.
               | 
               | A true clustered index would simply expose the index
               | itself as a table without secondary heap storage,
               | reducing I/O and storage space. Row visibility seems to
               | complicate this in PG as index only scans may require
               | checking the heap for visibility.
        
             | uhoh-itsmaciek wrote:
             | >Variable length strings are kept in a separate file (TOAST
             | tables), referenced by the tuple. This keeps records fixed
             | size...
             | 
             | That's not entirely true: variable-width row values under a
             | size threshold are stored inline (possibly compressed).
             | 
              | The beautiful part about TOAST tables is that they're
              | mostly just regular tables (any table that needs its
              | data TOASTed gets
             | its own TOAST table). You can look them up in the system
             | catalogs and query them directly.
             | 
             | The schema for these is (chunk_id, chunk_seq, chunk_data).
             | Chunk "pieces" are limited in size so they can be stored
             | inline (obviously there's not much benefit to a TOAST table
             | itself having a TOAST table), so any row values that need
             | to be TOASTed may also be broken up into multiple pieces.
             | Row values stored inline in the "owning" table reference a
             | chunk_id where the TOASTed data is stored, and queries that
             | need to return that data look up all the chunks for a chunk
             | id, ordered by chunk_seq.
             | 
             | You're right that this scheme avoids reading TOASTed data
             | when you're not actually using it (a good reason to avoid
             | "SELECT *" if you've got big TOASTed data).
        
           | uhoh-itsmaciek wrote:
           | There's a pretty good high-level overview of the
           | implementation details in the official docs:
           | https://www.postgresql.org/docs/current/storage.html
        
           | paulryanrogers wrote:
            | It's row-based, with old row versions kept alongside
            | current data until they are vacuumed away, unlike a redo
            | log based approach.
        
         | thejosh wrote:
          | I'm also super excited for the points you raised; it looks
          | like some great features are being released. I really like
          | that JSON is becoming a first-class citizen.
         | 
         | I'm also looking forward to the \copy improvements, especially
         | around loading data into PG.
        
         | mchusma wrote:
         | I use both SELECT DISTINCT and row_number a lot, so excited for
         | that as well.
        
           | 1500100900 wrote:
           | SELECT DISTINCT is rarely the best idea. In most cases it
           | means someone is using joins to filter out rows that should
           | have been removed with EXISTS (a semi-join).
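            | 
            | E.g. (made-up schema), instead of:
            | 
            |     SELECT DISTINCT u.id
            |     FROM users u
            |     JOIN orders o ON o.user_id = u.id;
            | 
            | a semi-join never duplicates the outer rows:
            | 
            |     SELECT u.id
            |     FROM users u
            |     WHERE EXISTS (SELECT 1 FROM orders o
            |                   WHERE o.user_id = u.id);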
        
             | sa46 wrote:
             | Interesting, I hadn't thought of it that way. Looking
             | through our codebase:
             | 
              | - We get a lot of mileage out of DISTINCT ON to get the
              | most recent version of a row for some subset of columns.
              | I think this is a different use case than what you refer
              | to.
             | 
             | - We typically use DISTINCT to get PKs for a table when
             | starting from another table. Like, find all user IDs that
              | commented on a given story by looking at the join table:
              | 
              |     SELECT DISTINCT user_id FROM story_comment
              |     WHERE story_id = 4
        
         | BitPirate wrote:
         | I think they missed one of the biggest changes: ICU collations
          | can finally be set as the default for clusters and databases.
        
       | aftbit wrote:
        | Ah yes, good old:
        | 
        |     # set client_min_messages TO 'debug1';
        
       | maccard wrote:
       | > The change made here adds a set of new quicksort functions
       | which suit some common datatypes. These quicksort functions have
       | the comparison function compiled inline to remove the function
       | call overhead.
       | 
       | Stuff like this frustrates me. This is what generics are for in
       | most languages. Looking at the commit[0], this is a _perfect_
       | example of where a sprinkling of C++ features (a very basic
       | template) would outperform the original C version. It's also
        | super disappointing that in 2022 we're still manually marking
        | 10-line functions used in one place and called from one site
        | as inline.
       | 
       | [0]
       | https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit...
        
         | [deleted]
        
         | bfgoodrich wrote:
         | > This is what generics are for in most languages.
         | 
         | That goes exactly _opposite_ of this optimization, and is just
         | entirely orthogonal.
         | 
         | > It's also super disappointing that in 2022 we're still
          | manually marking 10-line functions used in one place and
         | called from one site as inline.
         | 
         | You misunderstand the change.
         | 
          | In a nutshell, the prior version went through a polymorphic
          | function pointer that varied by the sort type:
          | 
          |     (*sort_ptr)(data);
          | 
          | No compiler can inline this, no matter how aggressively you
          | turn up the optimizations.
          | 
          | Their change adds specialized cases:
          | 
          |     if (dataType == PG_INT32) {
          |         sort_int32(data);
          |     } else if (dataType == PG_INT64) {
          |         sort_int64(data);
          |     } ... {
          |     } else {
          |         (*sort_ptr)(data);
          |     }
          | 
          | This enables the inlining, and while there's no harm in
          | adding the inline specifier if you want to be very overt in
          | your intentions, most compilers would inline it regardless
          | in most scenarios.
        
           | samhw wrote:
            | >> This is what generics are for in most languages.
            | 
            | > That goes exactly opposite of this optimization, and is
            | > just entirely orthogonal.
            | 
            | What does that mean? Also, how is it "exactly opposite"
            | and "orthogonal"? Maybe let's get rid of the mixed
            | metaphors and speak in concrete terms.
            | 
            | In your comment below ("why would you want the compiler to
            | do that? you can write out all the monomorphizations
            | yourself!") I'm not totally certain that you understand
            | the point of generics. _You can always write out each case
            | yourself._ Incidentally that also applies to your precious
            | inlining. But it's a tool to make programmers more
            | productive.
        
             | [deleted]
        
           | vlovich123 wrote:
            | In C++, the sort algorithm is both polymorphic and inlined
            | because the compiler automatically generates those sort
            | variants for you.
            | 
            | OP is not talking about "the code as is" but "written in
            | C++ style".
           | 
            | I don't think they misunderstood anything (and their beef
            | with the seemingly unprofiled use of the always_inline
            | attribute is probably correct too).
        
             | bfgoodrich wrote:
             | > I don't think they misunderstood anything
             | 
              | The code could have made the type-specific sorts
              | macros. They could have literally written the code
              | blocks in the actual sort function.
              | 
              | Instead they decided to make them separate functions and
              | mark them inline because _that is their overt intention_
              | and they thought it was cleaner.
             | 
             | What is the problem? Honestly, aside from noisy, useless
             | griping, what is possibly the problem with that?
             | 
             | Their implication seems to be that inline shouldn't be
              | necessary. And the truth is that it almost certainly
              | _isn't_ necessary, and any optimizing compiler would inline
             | regardless. But clarifying their intentions is precisely
             | what developers should do: "We're moving it here for
             | organizational/cleanliness purpose, but it is intentionally
             | written this way to be inlined because that is the entire
             | point of this optimization".
             | 
             | As to the "unprofiled" use, their specific optimization was
             | because they know the time being wasted in function calls,
             | and the optimization was to avoid that.
        
               | macdice wrote:
               | (Author of the code here) Yeah :-)
        
         | JacobiX wrote:
          | Just to clarify: this is what C++ templates do, they can
          | create specializations for you because templates are
          | processed at compile time. But that's not how generics work
          | in most languages (runtime dispatch, type erasure, etc.).
        
         | macdice wrote:
         | Speaking as the author of that change, and also as a long time
         | user of C++ (in other projects), I hear ya! But I don't have
         | the power to make PostgreSQL switch to C++. Years ago,
         | std::sort() vs qsort() was almost a standard example of how C++
         | can be faster than C.
         | 
          | src/include/lib/sort_template.h is an example of a pseudo
          | template function, but we also have at least one pseudo
          | template data structure in src/include/lib/simplehash.h.
         | 
         | And if you want to see some "policy based design" (C++
         | community term) in PostgreSQL C code, check out
          | src/backend/executor/nodeHashjoin.c. It uses forced inlining to
         | "instantiate" two different versions of the hash join executor
         | code, one with parallelism and one without, so that we don't
         | have all the "am I in parallel mode?" branching/code size in
         | the generated code.
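          | 
          | For anyone curious, the pseudo-template trick boils down
          | to something like this self-contained toy (the real
          | parameter names in sort_template.h differ; insertion sort
          | for brevity):
          | 
          |     #include <stdint.h>
          |     #include <stddef.h>
          | 
          |     /* "Template parameters" are macros; each expansion
          |      * emits a fully specialized sort with the comparison
          |      * inlined. */
          |     #define DEFINE_SORT(NAME, TYPE, CMP)              \
          |         static void NAME(TYPE *a, size_t n)           \
          |         {                                             \
          |             for (size_t i = 1; i < n; i++)            \
          |             {                                         \
          |                 TYPE v = a[i];                        \
          |                 size_t j = i;                         \
          |                 while (j > 0 && CMP(v, a[j-1]) < 0)   \
          |                 {                                     \
          |                     a[j] = a[j-1];                    \
          |                     j--;                              \
          |                 }                                     \
          |                 a[j] = v;                             \
          |             }                                         \
          |         }
          | 
          |     #define CMP_INT32(x, y) (((x) > (y)) - ((x) < (y)))
          |     DEFINE_SORT(sort_int32, int32_t, CMP_INT32)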
        
         | mort96 wrote:
         | > It's also super disappointing that in 2022 we're still
          | manually marking 10-line functions used in one place and
         | called from one site as inline.
         | 
         | The `inline` keyword usually isn't necessary anymore to get the
         | compiler to inline stuff, the compiler has some pretty good
         | heuristics for determining whether to inline a function call or
         | not. The `inline` keyword is an extra hint to the compiler, and
         | affects linkage (i.e multiple object files defining the same
         | `inline` function does not result in a linker error).
         | 
         | I don't understand what you're disappointed at, this all seems
         | very reasonable.
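          | 
          | A toy illustration of the linkage point, using the common C
          | idiom (each includer gets its own, usually inlined-away,
          | copy, so there's no multiple-definition link error):
          | 
          |     /* util.h -- fine to include from many .c files */
          |     static inline int clamp(int v, int lo, int hi)
          |     {
          |         return v < lo ? lo : (v > hi ? hi : v);
          |     }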
        
           | maccard wrote:
            | I never said they used the inline keyword, you inferred
            | that. If you look at the diff, they mark the functions as
            | `pg_attribute_always_inline`, which expands to
            | __forceinline or __attribute__((always_inline)) inline
            | depending on your compiler. These specifiers override the
            | compiler's heuristics [0].
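            | 
            | Roughly this, paraphrasing from memory of Postgres's c.h:
            | 
            |     #if defined(__GNUC__)
            |     #define pg_attribute_always_inline \
            |         __attribute__((always_inline)) inline
            |     #elif defined(_MSC_VER)
            |     #define pg_attribute_always_inline __forceinline
            |     #else
            |     #define pg_attribute_always_inline inline
            |     #endif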
           | 
           | > I don't understand what you're disappointed at, this all
           | seems very reasonable.
           | 
           | I'm disappointed by the fact that we're still manually
           | overriding compiler heuristics and manually writing faux
           | copy-pasted generics for performance, when this is a solved
           | problem in so many other languages.
           | 
           | EDIT: I would _love_ to see some benchmarks without the
           | pg_attribute_always_inline to see whether forcibly inlining
           | the code is even necessary. In my (extended) experience with
           | optimising C++ code, leaning on __forceinline is unnecessary
           | in the vast vast majority of cases and should be used very
           | sparingly.
           | 
           | [0] https://docs.microsoft.com/en-us/cpp/cpp/inline-
           | functions-cp...
        
             | gpderetta wrote:
             | If the profiler tells you to use force_inline, you use
             | force_inline.
        
               | maccard wrote:
               | Given this is a blog post explaining their optimisations,
               | you'd expect them to mention profiling with/without this
               | attribute if they did it. Looking at the email thread,
               | they share profiling results of before/after but there's
               | nothing justifying the use of __forceinline in the thread
               | or the blog post as far as I can see.
        
               | macdice wrote:
               | FWIW I just removed pg_attribute_always_inline from those
               | three comparator functions in tuplesort.c and it made no
               | difference to the generated code under GCC 10 at -O2. If
               | we can show that's true for Clang and MSVC too, maybe we
               | should just take them out. Edit: But I don't really mind
               | them myself, they do document the intention of the
               | programmer in instantiating all these specialisations...
        
               | tialaramex wrote:
                | At least they _measured_, so that's immediately at least
               | a B+. It is of course often hard to know whether anybody
               | actually measured before doing their work or only
                | afterwards to justify it, but we should give the benefit of
               | the doubt.
               | 
                | Not everything you try is going to be worth explaining,
               | and I get that their whole plan here is "forcibly inline
               | this" so in C, where the compiler won't necessarily get
               | your memo, it seems reasonable to explicitly say so.
        
           | Sesse__ wrote:
           | > The `inline` keyword usually isn't necessary anymore to get
           | the compiler to inline stuff, the compiler has some pretty
           | good heuristics for determining whether to inline a function
           | call or not.
           | 
           | Not really. The heuristics are pretty much based on a rough
           | idea of function size (e.g. if it believes the call overhead
           | is larger than just inlining the function, it will generally
           | always be done if optimization is on), and/or functions it
           | can prove is used only once (e.g. because they are static).
           | That's it. There's no good idea of e.g. "this looks like a
           | tight loop, and if I inline, I can maybe special-case away
           | this cruft here". The programmer's hints are very much
           | relevant even in 2022.
        
             | jeffbee wrote:
              | They're not. You are either using PGO and getting
              | effective inlining, or you are just guessing.
        
       | ollysb wrote:
       | It seems a little surprising that GIN indexes have never been
       | improved to allow for efficient sorting. Is this just not in
       | enough demand to receive any attention or is there a technical
       | aspect that makes it particularly difficult?
        
         | mattashii wrote:
          | I'm not familiar with the full details, but GIN indeed has
          | some issues that make presorting difficult and expensive:
         | 
         | It has a guarantee of only one item in the tree for each key.
         | All table rows that share that key are stored under that one
         | item, as one large value. This value can thus become really
         | large (and is stored as a subtree in such cases).
         | 
         | To sort these tuples, you can either implement that as 'merge
         | key/value pairs when keys are equal during the sort phase' or
         | 'merge key/value pairs when keys are equal during the index
         | load phase'.
         | 
         | The first requires handling of arbitrarily large KV pairs
         | (potentially larger than the normal page size, which indexes
         | generally don't handle well) and losing sort elements during
         | the sort (due to merges of values, I'm not sure sort algorithms
         | can work with that).
         | 
         | The second requires a significantly larger temporary space to
         | store the temporary data (potentially several times the size of
         | the original data).
         | 
          | Additionally, someone must first find this enough of an
          | issue to try to fix, and apparently nobody has yet thought
          | it worth taking a stab at.
        
       ___________________________________________________________________
       (page generated 2022-05-20 23:01 UTC)