hngopher.com

       [HN Gopher] Index Merges vs. Composite Indexes in Postgres and M...
       ___________________________________________________________________
        
       Index Merges vs. Composite Indexes in Postgres and MySQL
        
       Author : Sirupsen
       Score  : 74 points
       Date   : 2022-11-27 19:02 UTC (3 hours ago)
        
 (HTM) web link (sirupsen.com)
 (TXT) w3m dump (sirupsen.com)
        
       | magicalhippo wrote:
       | A composite index can also be used for a partial index scan[1],
       | so if you're frequently looking for just int100 as well as the
       | int100,int1000 combination (as per the article's example), then a
       | composite index int100,int1000 can be used for queries just
       | filtering on int100.
       | 
       | The order of the columns in the composite index might matter
       | then. We got some nice savings by reordering columns in indexes
       | and changing the join order in queries (our DB wasn't _that_
       | smart) or adding a  "useless" filters to the where clause,
       | allowing us to consolidate multiple indexes into one composite
       | one.
       | 
       | [1]: might be using the wrong terminology here, I'm not talking
       | about a partial index[2], but using only the first N dimensions
       | of a M-dimensional index.
       | 
       | [2]: https://www.postgresql.org/docs/current/indexes-partial.html
        
         | winrid wrote:
         | There are some DBs where the order of fields doesn't matter in
         | the compound index if you're just searching by one field (but
         | of course is less efficient). I think Oracle supports this.
        
           | magicalhippo wrote:
           | Good point, I reworded it to be more general. Can't recall
           | seeing our DB being that clever.
        
           | thom wrote:
           | Postgres is capable of doing this with some index types, but
           | generally more slowly than a B-tree index being asked about
           | its leftmost columns:
           | 
           | https://www.postgresql.org/docs/current/indexes-
           | multicolumn....
           | 
           | Also worth noting that for very complex workloads that need
           | to support arbitrary subsets of equality matches over
           | columns, a Bloom filter might work best:
           | 
           | https://www.postgresql.org/docs/current/bloom.html
        
           | mattashii wrote:
           | This dependz on the index type, but PostgreSQL also supports
           | queries based on arbitrary attributes of multi-attribute
           | indexes: both GIN and BRIN can be used for queries on any of
           | the indexed expressions.
           | 
           | Maybe eventually PG will also get index skip scans for
           | btrees, which could allow for arbitrary columns searches in
           | the btree; but we're not there yet by a large margin.
        
       | bawolff wrote:
       | Honestly im kind of surprised that they are even that close. I
       | wonder if this changes at scale when the intersection is larger.
        
         | Sirupsen wrote:
         | Ideally, I would add three graphs to the post:
         | 
         | (1) Table size on the x-axis, and time on the y-axis for index
         | merge vs composite index
         | 
         | (2) Number of columns on the x-axis, and time on the y-axis for
         | both
         | 
         | (3) Number of final matches on the x-axis, and time on the
         | y-axis for both
         | 
         | But ran out of time and decided to test with the table size of
         | 10M rows, and a 100-ish result set. That's in my experience a
         | decent representation for what you might be doing with a
         | relational database.
        
       | Lukas1994 wrote:
       | Good stuff! What's the size difference between the composite
       | index vs the two separate indices?
        
         | Sirupsen wrote:
         | I've added this to the article, thanks!
         | 
         | Composite index (int64, int64): ~70 MiB in Postgres, ~350 MiB
         | in MySQL
         | 
         | Single index (int64): ~70 MiB in Postgres, ~240 MiB in MySQL
         | 
         | If you assume the majority of an index are index entries of
         | (int64, int64, int64) where the third number is some sort of
         | identifier for the record on the heap, you'd expect this to be
         | ~230 MiB. So Postgres does some compression very well here, and
         | MySQL has a bit more overhead for its indexes it seems.
        
           | libraryofbabel wrote:
           | As you mention in the article, this is small by modern
           | hardware standards and means the indexes can be entirely held
           | in memory. Curious how things look for much larger tables
           | where the index has to be read from disk?
        
           | fabian2k wrote:
           | Could be the deduplication newer Postgres versions have for
           | B-Tree indexes:
           | 
           | https://www.postgresql.org/docs/current/btree-
           | implementation...
           | 
           | > 67.4.3. Deduplication
           | 
           | > A duplicate is a leaf page tuple (a tuple that points to a
           | table row) where all indexed key columns have values that
           | match corresponding column values from at least one other
           | leaf page tuple in the same index. Duplicate tuples are quite
           | common in practice. B-Tree indexes can use a special, space-
           | efficient representation for duplicates when an optional
           | technique is enabled: deduplication.
           | 
           | > Deduplication works by periodically merging groups of
           | duplicate tuples together, forming a single posting list
           | tuple for each group. The column key value(s) only appear
           | once in this representation. This is followed by a sorted
           | array of TIDs that point to rows in the table. This
           | significantly reduces the storage size of indexes where each
           | value (or each distinct combination of column values) appears
           | several times on average. The latency of queries can be
           | reduced significantly. Overall query throughput may increase
           | significantly. The overhead of routine index vacuuming may
           | also be reduced significantly.
        
             | Sirupsen wrote:
             | Excellent, thank you! I'll add that to the article.
        
               | fabian2k wrote:
               | This is a guess on my part, though I think a plausible
               | one. To verify you'd probably have to compare an index
               | with unique values to one with many identical ones.
        
           | masklinn wrote:
           | An other thing which might be of interest: what if you use
           | convering indexes for the two single indexes? Is postgres
           | able to do an index-only scan then, thanks to the coverage?
           | 
           | Or does it decide to use only one index and filter based on
           | the covered values?
        
       | pdhborges wrote:
       | For these 64-bit index entries we'd expect to have to scan
       | roughly:       index_row_size[?]rows=2[?]64bit[?]10^5=1.5MiB
       | 
       | Where do the 10^5 rows come from? With a composite index and a
       | point query doesn't the database scan just the 100 returned rows?
        
         | Sirupsen wrote:
         | You're absolutely right!
         | 
         | I forgot to move this around when I updated the article's
         | structure. This is only relevant when doing the index merge.
         | The article has been updated
        
       ___________________________________________________________________
       (page generated 2022-11-27 23:00 UTC)