[HN Gopher] How database indexing works internally
       ___________________________________________________________________
        
       How database indexing works internally
        
       Author : the2ndfloorguy
       Score  : 56 points
       Date   : 2021-08-19 05:28 UTC (1 days ago)
        
 (HTM) web link (blog.pankajtanwar.in)
 (TXT) w3m dump (blog.pankajtanwar.in)
        
       | thinkharderdev wrote:
        | I don't understand why
        | 
        |       SELECT total(amount) FROM orders
        |       WHERE createdAt BETWEEN '2020-01-01 00:00:00'
        |                           AND '2020-12-31 23:59:59';
       | 
       | would do a full table scan. Wouldn't the engine be able to use
       | the index to find just the correct rows and then only do the
       | total on those?
        
         | masklinn wrote:
         | > Wouldn't the engine be able to use the index to find just the
         | correct rows and then only do the total on those?
         | 
         | Yeah, the explanation doesn't really make sense in general,
         | though computing anything might be sufficient to throw off
         | mysql's query optimiser.
        
         | da_chicken wrote:
         | It depends on a lot of factors. It's almost always down to
         | cardinality and organization.
         | 
         | How many records are in the table, and what proportion of
         | records are estimated to be necessary to read? If it's more
          | than a given threshold, then the query engine may decide to
          | just read everything rather than waste time sorting out which
          | pages it needs and which it doesn't. I/O is slow, but CPU is
          | not free.
         | 
         | Is the table clustered on the `createdAt` field? If not, the
         | system will not assume that the table records are going to be
         | stored in any particular order that benefits the query
         | execution. After all, the field may not represent when the
         | record was created in this particular system. It will then
         | estimate how many pages it thinks it will need to read. Again,
         | at a certain threshold it will decide to just read every page
         | in the table. Sometimes it's more expensive to figure out what
         | to read and still end up reading 50% of the table instead of
         | just reading everything and scanning through it quickly.
         | 
         | Remember, databases almost always read from disk in _pages_
          | that contain multiple records. They can't read just one
         | record. They read the whole page and then read the records out
         | of that. That's the smallest unit of I/O. If records are evenly
         | distributed, then the system will need to read almost every
         | page anyways. That's why clustering indexes help so much, but
         | you can only cluster on one set of fields.
         | 
          | As an aside, for datetime thresholds it's a better idea to
         | specify `WHERE createdAt >= '2020-01-01 00:00:00' AND createdAt
         | < '2021-01-01 00:00:00'`. Different systems have different time
         | precision. You don't want to miss records because of fractional
         | seconds. Yes, it probably won't matter, but you can also just
         | write it this way and not have a hole in your logic. BETWEEN is
         | just syntactic sugar, too.
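          | 
          | As a quick illustration (sketched against the orders table from
          | the query above; fractional-second support is an assumption
          | about the column type), a row stamped inside the final second
          | of the year slips past the BETWEEN version, while the half-open
          | range still catches it:
          | 
          |     -- suppose a row has createdAt = '2020-12-31 23:59:59.750'
          |     -- BETWEEN ... AND '2020-12-31 23:59:59' misses that row
          |     SELECT SUM(amount) FROM orders
          |     WHERE createdAt >= '2020-01-01 00:00:00'
          |       AND createdAt <  '2021-01-01 00:00:00';  -- includes it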
        
         | tomnipotent wrote:
         | Statistics.
         | 
         | The query planner has decided that it's going to have to visit
         | a lot of pages regardless, so rather than reading the index AND
         | most of the table it can get the job done with less by just
         | scanning the table.
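          | 
          | You can see which choice it made with EXPLAIN (a sketch,
          | assuming MySQL and an index on createdAt; "type" is the column
          | to look at in the output):
          | 
          |     EXPLAIN SELECT SUM(amount) FROM orders
          |     WHERE createdAt BETWEEN '2020-01-01 00:00:00'
          |                         AND '2020-12-31 23:59:59';
          |     -- type: ALL    -> planner chose a full table scan
          |     -- type: range  -> planner is using the createdAt index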
        
           | WJW wrote:
           | To elaborate on this with a concrete example: if ALL rows
           | were created between 2020-01-01 00:00:00 and 2020-12-31
           | 23:59:59, going to the index is pure overhead.
        
             | da_chicken wrote:
             | How do you know that's when the rows were created? Maybe it
             | means when the orders were created, but the records are
             | from a foreign system and they were inserted in a
             | completely different order.
             | 
             | More relevantly, how does _the query engine_ know that the
             | `createdAt` field has anything to do with the order of
             | records stored on disk?
        
               | tomnipotent wrote:
               | > how does the query engine know
               | 
               | It doesn't, really.
               | 
               | When a query planner is deciding the best path for
               | answering a query, it will generate many different plans
               | with different costs. Estimations are sometimes wrong, so
               | planners may try several different plans for the same
               | query over time and eventually settle on the one that
               | actually did the best.
               | 
               | Databases are also constantly collecting statistics on
               | the values in each column, such as min/max, distribution,
               | cardinality, rows per page/block, values at different
               | percentiles. It's possible for these statistics to change
               | over time, either suddenly or slowly, and this can mean
               | the cached plan is sub-optimal. Or maybe you're
               | adding/removing indexes. Plans become stale, and new
               | plans will need to be tested.
               | 
               | So let's say you create an index on `createdAt`. When you
               | ask for rows `WHERE createdAt >= '2020-01-01' AND
               | createdAt < '2020-04-01'`, it's going to generate plans
               | considering that index and compare it against a plan
               | without that index.
               | 
               | Because the index is sorted and we have a range query
               | (min/max), the planner knows it can very quickly find the
               | inner b-tree nodes that contain all rows that match the
               | query and how many pages may need to be read from disk
               | and how many rows that will include. It will then know
               | exactly what data pages we need to scan to find the rows
               | we're interested in.
               | 
               | Without that index, it has no idea about the distribution
               | of values of `createdAt`. It's very possible that 99% of
               | all rows are sorted, but for whatever reason 1 row is
               | completely out of place (should be row #100, but is row
               | #10,000). Every row will need to be scanned to be sure.
               | 
               | Even with the index, the database statistics include
               | min/max values. Let's say we change our query to `WHERE
               | createdAt >= '1900-01-01' AND createdAt < '2100-01-01'`,
               | and it includes EVERY row in the table. The query planner
               | will be able to figure this out, and will generate a less
               | costly plan that just does a full table scan instead and
               | skips the index.
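                | 
                | If you want to peek at those statistics yourself,
                | Postgres exposes them after an ANALYZE (a sketch; the
                | table and column names are only illustrative):
                | 
                |     ANALYZE temp;
                |     SELECT n_distinct, correlation, histogram_bounds
                |     FROM pg_stats
                |     WHERE tablename = 'temp' AND attname = 'createdAt';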
        
             | tomnipotent wrote:
             | Or even if it's most of the rows. The optimizer will
             | determine a cost which includes data distribution, and the
             | cumulative cost of an index scan + table seeks could still
             | exceed just the cost of a table scan.
        
         | CapriciousCptl wrote:
         | I was curious so I tried it in postgres 13. Postgres, at least,
         | uses the index to form a bitmap and scans that when aggregating
         | in the first case (10% rows in the bitmap) and not in a second
         | case WHERE "createdAt" BETWEEN '1990-01-01 00:00:00' AND
         | '2020-12-31 23:59:59'; (100% rows in the bitmap, obviating the
         | need for the intermediate step). I also tried ~20% rows
          | (2019-2020) and the planner skipped the index.
          | 
          |     CREATE TABLE temp (id SERIAL PRIMARY KEY, amount MONEY,
          |         "createdAt" TIMESTAMPTZ);
          |     CREATE INDEX ON temp ("createdAt");
          | 
          |     INSERT INTO temp(id, "createdAt", amount)
          |     SELECT generate_series(1,1000000) AS id,
          |         NOW() + (random() * (interval '10 years'))
          |             - interval '10 years' AS createdAt,
          |         random() * 100::money AS amount;
          | 
          |     EXPLAIN SELECT sum(amount) FROM temp
          |     WHERE "createdAt" BETWEEN '2020-01-01 00:00:00'
          |                           AND '2020-12-31 23:59:59';
          | 
          |     Aggregate  (cost=10286.06..10286.07 rows=1 width=8)
          |       ->  Bitmap Heap Scan on temp
          |             (cost=2148.00..10033.48 rows=101032 width=8)
          |             Recheck Cond: (("createdAt" >= '2020-01-01
          |               00:00:00-05'::timestamp with time zone) AND
          |               ("createdAt" <= '2020-12-31 23:59:59-05'::
          |               timestamp with time zone))
          |             ->  Bitmap Index Scan on "temp_createdAt_idx"
          |                   (cost=0.00..2122.75 rows=101032 width=0)
          |                   Index Cond: (("createdAt" >= '2020-01-01
          |                     00:00:00-05'::timestamp with time zone) AND
          |                     ("createdAt" <= '2020-12-31 23:59:59-05'::
          |                     timestamp with time zone))
          | 
          | And when running a longer query:
          | 
          |     Finalize Aggregate  (cost=14596.71..14596.72 rows=1 width=8)
          |       ->  Gather  (cost=14596.49..14596.70 rows=2 width=8)
          |             Workers Planned: 2
          |             ->  Partial Aggregate
          |                   (cost=13596.49..13596.50 rows=1 width=8)
          |                   ->  Parallel Seq Scan on temp
          |                         (cost=0.00..12620.00 rows=390597 width=8)
          |                         Filter: (("createdAt" >= '1990-01-01
          |                           00:00:00-05'::timestamp with time zone)
          |                           AND ("createdAt" <= '2020-12-31
          |                           23:59:59-05'::timestamp with time zone))
        
           | tomnipotent wrote:
           | Buffer pool plays a big part. Very possible all the data is
           | already in-memory, and for certain data sizes it'll be faster
           | to just follow leaf nodes start-to-finish than it is to
           | determine what pages you can skip.
           | 
           | Postgres buffer pool is a ring, and relies on "clock sweep"
           | to decide what pages it can evict on each iteration. It has a
           | shared buffer, and per-query buffers to eliminate shared
           | buffer evictions (for costly queries). When doing index
           | scans, worst-case the same page is being accessed in random
           | order multiple times and it's evicted between those accesses
           | so we end up with redundant disk I/O.
           | 
           | Bitmap scans ensure each page is only scanned once and in-
           | order, so it's a great solution when you need more than an
           | index scan but less than a full table scan (worth of data),
           | not to mention multiple indexes can be combined into one
           | bitmap scan.
           | 
           | If every page is already in memory, the query planner may
           | pick plans that look sub-optimal if you factor in disk I/O
           | but are otherwise very efficient in-memory.
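            | 
            | You can see how much of a given plan was served from memory
            | with the BUFFERS option (a sketch against the temp table from
            | the parent comment):
            | 
            |     EXPLAIN (ANALYZE, BUFFERS)
            |     SELECT sum(amount) FROM temp
            |     WHERE "createdAt" >= '2020-01-01'
            |       AND "createdAt" <  '2021-01-01';
            |     -- "Buffers: shared hit=..." pages found in the pool,
            |     -- "read=..." pages that had to come from disk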
        
           | masklinn wrote:
            | FWIW you can indent with 4 spaces for a preformatted block,
            | e.g.
            | 
            |     CREATE TABLE temp (id SERIAL PRIMARY KEY, amount MONEY,
            |         "createdAt" TIMESTAMPTZ);
            |     CREATE INDEX ON temp ("createdAt");
            | 
            |     INSERT INTO temp(id, "createdAt", amount)
            |     SELECT generate_series(1,1000000) AS id,
            |         NOW() + (random() * (interval '10 years'))
            |             - interval '10 years' AS createdAt,
            |         random() * 100::money AS amount;
        
         | dboreham wrote:
          | Indexing isn't magic. You can define an ordered index on the
          | createdAt field, and your query may use that index if the
          | engine, by its own notion of the best thing to do, decides it's
          | worthwhile. You can do things to persuade it what to do; that
          | part is up to you.
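          | 
          | In MySQL, an index hint is one such way to persuade it (a
          | sketch; the index name idx_created_at is hypothetical):
          | 
          |     SELECT SUM(amount)
          |     FROM orders FORCE INDEX (idx_created_at)
          |     WHERE createdAt >= '2020-01-01'
          |       AND createdAt <  '2021-01-01';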
        
           | masklinn wrote:
           | The problem is that the article's explanation makes
           | absolutely no sense: it claims that a full scan is performed
           | because of the computation (function call) in the `select`
           | clause. And worse, that indexing the column _used_ in the
           | function call fixes it:
           | 
           | > But, It will still do a full table scan because we are
           | using "amount" column to calculate the total. So, we need to
           | put an index on "amount" column too.
           | 
           | Seems to me like TFA tried it, saw that the index was not
           | used, and invented a justification for it. And after adding
           | the second index the query planner had collected the relevant
           | information and started using the index, or something.
        
       | eatonphil wrote:
       | Awesome guide! If you want to see a very minimal walkthrough of
       | actually implementing this in code, the third post of my "write a
       | SQL database in Go" series does just this [0].
       | 
       | Once you have the index set up and are able to use it the next
       | challenging part is pattern matching (plus algebraic rewrites) to
       | figure out if the index is actually applicable.
       | 
       | For example, in my linked blog post, an index is only used if the
       | WHERE expression has only AND-ed operations and one of those
       | expressions is a filter using <, >, =, <> (and a few other
       | operators) and one of the columns is indexed. This is simpler
       | than what a real database might attempt.
       | 
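        | To make that rule concrete, here are two hypothetical queries
        | (table and column names are made up, not from the post):
        | 
        |     -- fits the rule: AND-ed comparisons, and "age" is indexed
        |     SELECT * FROM users WHERE age > 30 AND name = 'Alice';
        | 
        |     -- doesn't fit the simple matcher: OR between conditions
        |     SELECT * FROM users WHERE age > 30 OR name = 'Alice';
        | 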
       | Another interesting part of seeing it in code (and messing with
        | the code) is observing cases where, even though you're using an
        | index, it doesn't speed anything up if your filter still matches
        | every item in the index.
       | 
       | Also if you just want to learn more about indexes as a user, I
       | highly recommend the site "Use the index, Luke" that has many
       | useful guides/explanations.
       | 
       | [0] https://notes.eatonphil.com/database-basics-indexes.html
       | 
       | [1] https://use-the-index-luke.com/
        
       | masklinn wrote:
       | The article claims to talk about database indexing but as soon as
       | it goes into any sort of details the information only applies to
       | mysql.
       | 
       | > So here, indexing comes into picture. What indexing does is,
       | sets up the column your search conditions are on, in a sorted
       | order to assist in optimising query performance.
       | 
       | That's mysql specific, and not just that but it's _innodb_
       | specific: innodb uses the primary key as a clustering index, or
        | the first UNIQUE index if there's no PK.
       | 
       | Other engines don't usually do that, some (e.g. postgres) don't
       | even have clustered indexes (aka indexes which affect the table's
       | own layout), in postgres clustering is an explicit discrete
       | table-rewriting operation.
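        | 
        | That explicit operation looks roughly like this (a sketch; table
        | and index names are just examples), and it has to be re-run
        | periodically because Postgres doesn't maintain the ordering on
        | later writes:
        | 
        |     CLUSTER orders USING orders_createdat_idx;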
       | 
       | And of course a table can have multiple indices, so it's not
       | possible for an index to always determine the table order. In
       | databases which support clustered indices, only one index per
       | table can be clustered.
       | 
       | > The major advantage of B-tree is that the data in it is
       | sortable
       | 
       | That's... not really true. A b-tree is sorted, period. You can't
       | have an unordered btree.
       | 
       | > What are other types of indexes?
       | 
       | There are tons of index types not mentioned by the article e.g.
       | BRIN, GIN, GiST
       | 
       | > For example, suppose you want to find out all of the companies
       | which has company_id less than 40. How could you do that with a
       | hash table index? Well, it's not possible because a hash table is
       | only good for looking up key value pairs - which means queries
       | that check for equality (like "WHERE company_id = 18").
       | 
       | And because btrees naturally handle "range" queries, they also
        | handle _prefix_ queries well. After all, "foo*" is just looking
       | for every value between "foo" included and "fop" excluded.
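        | 
        | In SQL terms that prefix search is just a range (a sketch;
        | whether LIKE 'foo%' itself can use a btree index depends on the
        | engine and collation, but the explicit range form is the same
        | idea):
        | 
        |     SELECT * FROM companies
        |     WHERE name >= 'foo' AND name < 'fop';   -- prefix "foo"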
        
         | da_chicken wrote:
         | > > So here, indexing comes into picture. What indexing does
         | is, sets up the column your search conditions are on, in a
         | sorted order to assist in optimising query performance.
         | 
          | > _That's mysql specific, and not just that but it's innodb
         | specific: innodb uses the primary key as a clustering index, or
         | the first UNIQUE index if there's no PK._
         | 
          | > _Other engines don't usually do that, some (e.g. postgres)
         | don't even have clustered indexes (aka indexes which affect the
         | table's own layout), in postgres clustering is an explicit
         | discrete table-rewriting operation._
         | 
         | It kind of depends on exactly what they mean. _The physical
         | index itself_ is essentially always going to be stored in a
         | "sorted" order, clustered or non-clustered. It's [by default
         | almost always] a B-tree of some flavor, but every index is
         | going to be stored in a way that makes searching it faster, and
         | in that sense it can always be described as "sorted" even if
         | the table itself has no clustering index. It's in a well-
         | defined order that aids in identifying the records in the table
         | itself. Even if the query engine identifies that the 100
         | records that satisfy the WHERE clause are in 100 different,
         | discontinuous pages, an index can help with that even if the
         | I/O won't be optimal.
         | 
         | > > The major advantage of B-tree is that the data in it is
         | sortable
         | 
         | > That's... not really true. A b-tree is sorted, period. You
         | can't have an unordered btree.
         | 
         | Oh, I've just noticed you know this already. What's your
         | confusion then?
        
           | masklinn wrote:
           | > It kind of depends on exactly what they mean. The physical
           | index itself is essentially always going to be stored in a
           | "sorted" order, clustered or non-clustered.
           | 
           | They're not "meaning" anything, they're literally saying that
           | adding an index sorts the table by the indexed column, with a
           | little picture showing this change _to the table itself_.
           | 
           | > Oh, I've just noticed you know this already. What's your
           | confusion then?
           | 
           | I'm not confused, I'm pointing out that the article is
           | factually incorrect. Not just "this only applies to one
           | database" but literally saying something which is not true,
           | or at best hopelessly confused.
        
             | da_chicken wrote:
             | > They're not "meaning" anything, they're literally saying
             | that adding an index sorts the table by the indexed column,
             | with a little picture showing this change to the table
             | itself.
             | 
             | No, I don't think so. That's _one_ interpretation of what
              | they said. Since that's clearly not correct, why should we
             | assume that's what it means? That's dishonest reading.
             | 
             | If I'm explaining to a beginner how an index works -- which
             | is literally what this article is doing -- then I don't see
             | why I wouldn't refer to an index as a specially sorted
              | version of the records in the table. That's not a huge
             | difference from what the author actually wrote. Yeah, you
             | might prefer that the author had said "organized" rather
             | than "sorted", but those two words are synonyms in plain
             | English.
             | 
             | Look, here's the statement in context:
             | 
             | > What indexing does is, sets up the column your search
             | conditions are on, in a sorted order to assist in
             | optimising query performance.
             | 
             | > With index on company_id, table would look like this
             | <image>
             | 
             | > Now, due to sorted order, the database knows which
             | records to examine and when to stop.
             | 
             | > The main point of having a index is to cut down the
             | number of records/rows in a table which needs to be
             | examined by the database, to speed up the searching
             | queries.
             | 
             | If you look at an index physically, it _does_ literally
              | store the values being indexed. That's why indexes take up
              | disk space and require I/O. They're actually duplicating
              | information. It _does_ actually take the column values from
              | the table and "sort" or "organize" the columns covered by
             | the index. They're organized into a B-tree [usually], but
             | the values from the table for the indexed columns are there
             | and they are organized in a way you can describe as
             | "sorted." The big difference is that the pages of the
             | index, rather than having leaf records like a table,
             | instead reference the records directly (often using an
             | identifier tied to the clustered unique index).
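              | 
              | You can even see that duplication on disk. In Postgres,
              | for instance (a sketch, reusing the temp table and index
              | from earlier in the thread):
              | 
              |     SELECT pg_size_pretty(pg_relation_size('temp')),
              |            pg_size_pretty(
              |                pg_relation_size('"temp_createdAt_idx"'));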
        
         | cogman10 wrote:
         | > innodb uses the primary key as a clustering index, or the
         | first UNIQUE index if there's no PK.
         | 
          | In MSSQL, the primary key or even an explicit "clustered index"
          | ultimately determines the ordering of the table. If neither is
          | provided, then MSSQL uses an invisible row id column. If the
          | clustered index is not made with the "UNIQUE" constraint, MSSQL
          | will also create the row id column.
         | 
         | That is to say, yes, this article is very MySQL oriented.
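          | 
          | For reference, the T-SQL spelling of that is roughly (a sketch
          | with hypothetical table and index names):
          | 
          |     CREATE CLUSTERED INDEX IX_orders_createdAt
          |         ON dbo.orders (createdAt);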
        
           | tomnipotent wrote:
           | > is ultimately the ordering of a table
           | 
           | Clustered index leaf nodes contain the actual row data of the
           | table itself, which is why there can only be one.
        
       | srcreigh wrote:
       | An important detail which this guide leaves out: The "actual
       | location of the row" is usually the leaf node of another B-Tree.
       | That is the primary B-Tree and it's indexed by the primary key.
       | 
       | The major consequence is every non-primary index query involves
       | dereferencing N pointers, which could mean loading N pages from
       | disk/SSD to get leaf nodes spread out across the primary tree.
       | Whereas if you query a range in the primary B-Tree directly, the
       | rows are consecutive so you would only load N/M consecutive pages
       | based on the ratio of rowsize to pagesize.
       | 
       | That's why some people use composite keys for primary key, to get
       | better data locality in the primary B-Tree index.
       | 
       | See "How We've Scaled Dropbox"[1] to hear the same thing.
       | 
       | At 48:36 they explain why they changed from PRIMARY KEY (id) to
       | PRIMARY KEY (ns_id, latest, id). ns_id is kinda like user_id. So
        | that change groups journal entries for users together on disk.
        | 
        | Specifically, PRIMARY KEY (id) orders things by creation date
       | whereas PRIMARY KEY (ns_id, latest, id) orders things by ns_id
       | primarily.
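        | 
        | In InnoDB terms the change is roughly this (a sketch; the real
        | schema obviously has more columns, and the table name is made
        | up):
        | 
        |     -- before: rows clustered in insertion (id) order
        |     CREATE TABLE journal (id BIGINT, ns_id BIGINT,
        |         latest BOOLEAN, data BLOB, PRIMARY KEY (id));
        | 
        |     -- after: rows for the same ns_id stored together on disk
        |     CREATE TABLE journal (id BIGINT, ns_id BIGINT,
        |         latest BOOLEAN, data BLOB,
        |         PRIMARY KEY (ns_id, latest, id));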
       | 
       | [1]: https://youtu.be/PE4gwstWhmc?t=2914
        
         | iaabtpbtpnn wrote:
         | This is true of MySQL (which the guide uses), but not
         | necessarily of other databases such as Postgres.
        
           | srcreigh wrote:
           | You're right. Postgres doesn't give any control over primary
            | data locality. That might make querying a single row a bit
            | faster in Postgres (no log(N) traversal of a primary B-tree
            | index), but picking out N related rows could be a lot slower.
           | 
           | https://www.postgresql.org/docs/13/indexes-index-only-
           | scans....
        
       | walski wrote:
       | Great video from the SQLite founder that covers a lot of ground
       | in terms of how indices work under the hood:
       | https://m.youtube.com/watch?v=gpxnbly9bz4
        
       ___________________________________________________________________
       (page generated 2021-08-20 23:02 UTC)