[HN Gopher] How database indexing works internally
___________________________________________________________________
How database indexing works internally
Author : the2ndfloorguy
Score : 56 points
Date : 2021-08-19 05:28 UTC (1 day ago)
(HTM) web link (blog.pankajtanwar.in)
(TXT) w3m dump (blog.pankajtanwar.in)
| thinkharderdev wrote:
| I don't understand why
|
|     SELECT total(amount) FROM orders
|     WHERE createdAt BETWEEN '2020-01-01 00:00:00'
|       AND '2020-12-31 23:59:59';
|
| would do a full table scan. Wouldn't the engine be able to use
| the index to find just the correct rows and then only do the
| total on those?
| masklinn wrote:
| > Wouldn't the engine be able to use the index to find just the
| correct rows and then only do the total on those?
|
| Yeah, the explanation doesn't really make sense in general,
| though computing anything might be sufficient to throw off
| mysql's query optimiser.
| da_chicken wrote:
| It depends on a lot of factors. It's almost always down to
| cardinality and organization.
|
| How many records are in the table, and what proportion of
| records are estimated to be necessary to read? If it's more
| than a given threshold, then the query engine may decide to
| just read everything than waste time sorting out which pages it
| needs and which it doesn't. I/O is slow, but CPU is not free.
|
| Is the table clustered on the `createdAt` field? If not, the
| system will not assume that the table records are going to be
| stored in any particular order that benefits the query
| execution. After all, the field may not represent when the
| record was created in this particular system. It will then
| estimate how many pages it thinks it will need to read. Again,
| at a certain threshold it will decide to just read every page
| in the table. Sometimes it's more expensive to figure out what
| to read and still end up reading 50% of the table instead of
| just reading everything and scanning through it quickly.
|
| Remember, databases almost always read from disk in _pages_
| that contain multiple records. They can't read just one
| record. They read the whole page and then read the records out
| of that. That's the smallest unit of I/O. If records are evenly
| distributed, then the system will need to read almost every
| page anyways. That's why clustering indexes help so much, but
| you can only cluster on one set of fields.
|
| As an aside, for datetime thresholds it's a better idea to
| specify `WHERE createdAt >= '2020-01-01 00:00:00' AND createdAt
| < '2021-01-01 00:00:00'`. Different systems have different time
| precision. You don't want to miss records because of fractional
| seconds. Yes, it probably won't matter, but you can also just
| write it this way and not have a hole in your logic. BETWEEN is
| just syntactic sugar, too.
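The fractional-seconds hole is easy to reproduce. Below is a small sketch using Python's sqlite3 as a stand-in for any engine (the table and values are made up; SQLite compares these ISO-8601 timestamps as text, which makes the off-by-a-fraction visible):

```python
import sqlite3

# Toy orders table; SQLite stores these timestamps as TEXT, so
# comparisons are lexicographic, which works for ISO-8601 strings.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (amount REAL, createdAt TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        (10.0, "2020-06-15 12:00:00"),
        (20.0, "2020-12-31 23:59:59.500"),  # fractional second past the BETWEEN bound
        (30.0, "2021-01-01 00:00:00"),
    ],
)

# BETWEEN silently drops the 23:59:59.500 row
between = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE createdAt "
    "BETWEEN '2020-01-01 00:00:00' AND '2020-12-31 23:59:59'"
).fetchone()[0]

# The half-open range catches every 2020 row and nothing else
half_open = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE "
    "createdAt >= '2020-01-01 00:00:00' AND createdAt < '2021-01-01 00:00:00'"
).fetchone()[0]

print(between, half_open)  # 10.0 30.0
```

The half-open form also composes cleanly: the next period's lower bound is this period's upper bound, with no gap in between.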
| tomnipotent wrote:
| Statistics.
|
| The query planner has decided that it's going to have to visit
| a lot of pages regardless, so rather than reading the index AND
| most of the table it can get the job done with less by just
| scanning the table.
| WJW wrote:
| To elaborate on this with a concrete example: if ALL rows
| were created between 2020-01-01 00:00:00 and 2020-12-31
| 23:59:59, going to the index is pure overhead.
| da_chicken wrote:
| How do you know that's when the rows were created? Maybe it
| means when the orders were created, but the records are
| from a foreign system and they were inserted in a
| completely different order.
|
| More relevantly, how does _the query engine_ know that the
| `createdAt` field has anything to do with the order of
| records stored on disk?
| tomnipotent wrote:
| > how does the query engine know
|
| It doesn't, really.
|
| When a query planner is deciding the best path for
| answering a query, it will generate many different plans
| with different costs. Estimations are sometimes wrong, so
| planners may try several different plans for the same
| query over time and eventually settle on the one that
| actually did the best.
|
| Databases are also constantly collecting statistics on
| the values in each column, such as min/max, distribution,
| cardinality, rows per page/block, values at different
| percentiles. It's possible for these statistics to change
| over time, either suddenly or slowly, and this can mean
| the cached plan is sub-optimal. Or maybe you're
| adding/removing indexes. Plans become stale, and new
| plans will need to be tested.
|
| So let's say you create an index on `createdAt`. When you
| ask for rows `WHERE createdAt >= '2020-01-01' AND
| createdAt < '2020-04-01'`, it's going to generate plans
| considering that index and compare it against a plan
| without that index.
|
| Because the index is sorted and we have a range query
| (min/max), the planner knows it can very quickly find the
| inner b-tree nodes that contain all rows that match the
| query and how many pages may need to be read from disk
| and how many rows that will include. It will then know
| exactly what data pages we need to scan to find the rows
| we're interested in.
|
| Without that index, it has no idea about the distribution
| of values of `createdAt`. It's very possible that 99% of
| all rows are sorted, but for whatever reason 1 row is
| completely out of place (should be row #100, but is row
| #10,000). Every row will need to be scanned to be sure.
|
| Even with the index, the database statistics include
| min/max values. Let's say we change our query to `WHERE
| createdAt >= '1900-01-01' AND createdAt < '2100-01-01'`,
| and it includes EVERY row in the table. The query planner
| will be able to figure this out, and will generate a less
| costly plan that just does a full table scan instead and
| skips the index.
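The plan choice described above can be observed in miniature with SQLite's EXPLAIN QUERY PLAN (the schema and index name here are invented for the sketch; real planners weigh many more factors):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, createdAt TEXT)")
# 100 rows per month of 2020, so a three-month range is ~25% of the table
conn.executemany(
    "INSERT INTO orders (createdAt) VALUES (?)",
    [(f"2020-{m:02d}-01 00:00:00",) for m in range(1, 13) for _ in range(100)],
)
conn.execute("CREATE INDEX idx_created ON orders (createdAt)")
conn.execute("ANALYZE")  # gather the statistics the planner costs plans with

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT count(*) FROM orders "
    "WHERE createdAt >= '2020-01-01' AND createdAt < '2020-04-01'"
).fetchall()
for row in plan:
    print(row[3])  # detail column, e.g. SEARCH orders USING COVERING INDEX idx_created (...)
```

Because the range predicate matches the sorted index, the plan reports a SEARCH over `idx_created` rather than a full SCAN of the table.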
| tomnipotent wrote:
| Or even if it's most of the rows. The optimizer will
| determine a cost which includes data distribution, and the
| cumulative cost of an index scan + table seeks could still
| exceed just the cost of a table scan.
| CapriciousCptl wrote:
| I was curious so I tried it in postgres 13. Postgres, at least,
| uses the index to form a bitmap and scans that when aggregating
| in the first case (10% rows in the bitmap) and not in a second
| case WHERE "createdAt" BETWEEN '1990-01-01 00:00:00' AND
| '2020-12-31 23:59:59'; (100% rows in the bitmap, obviating the
| need for the intermediate step). I also tried ~20% rows
| (2019-2020) and the planner skipped the index.
|
| CREATE TABLE temp (id SERIAL PRIMARY KEY, amount MONEY,
|     "createdAt" TIMESTAMPTZ);
| CREATE INDEX ON temp ("createdAt");
|
| INSERT INTO temp(id, "createdAt", amount)
| SELECT generate_series(1,1000000) AS id,
|     NOW() + (random() * (interval '10 years')) - interval '10 years' AS "createdAt",
|     random() * 100::money AS amount;
|
| EXPLAIN SELECT sum(amount) FROM temp WHERE "createdAt" BETWEEN
| '2020-01-01 00:00:00' AND '2020-12-31 23:59:59';
|
| Aggregate  (cost=10286.06..10286.07 rows=1 width=8)
|   ->  Bitmap Heap Scan on temp  (cost=2148.00..10033.48 rows=101032 width=8)
|         Recheck Cond: (("createdAt" >= '2020-01-01 00:00:00-05'::timestamp with time zone)
|                   AND ("createdAt" <= '2020-12-31 23:59:59-05'::timestamp with time zone))
|         ->  Bitmap Index Scan on "temp_createdAt_idx"  (cost=0.00..2122.75 rows=101032 width=0)
|               Index Cond: (("createdAt" >= '2020-01-01 00:00:00-05'::timestamp with time zone)
|                       AND ("createdAt" <= '2020-12-31 23:59:59-05'::timestamp with time zone))
|
| And when running a longer query:
|
| Finalize Aggregate  (cost=14596.71..14596.72 rows=1 width=8)
|   ->  Gather  (cost=14596.49..14596.70 rows=2 width=8)
|         Workers Planned: 2
|         ->  Partial Aggregate  (cost=13596.49..13596.50 rows=1 width=8)
|               ->  Parallel Seq Scan on temp  (cost=0.00..12620.00 rows=390597 width=8)
|                     Filter: (("createdAt" >= '1990-01-01 00:00:00-05'::timestamp with time zone)
|                         AND ("createdAt" <= '2020-12-31 23:59:59-05'::timestamp with time zone))
| tomnipotent wrote:
| Buffer pool plays a big part. Very possible all the data is
| already in-memory, and for certain data sizes it'll be faster
| to just follow leaf nodes start-to-finish than it is to
| determine what pages you can skip.
|
| Postgres buffer pool is a ring, and relies on "clock sweep"
| to decide what pages it can evict on each iteration. It has a
| shared buffer, and per-query buffers to eliminate shared
| buffer evictions (for costly queries). When doing index
| scans, worst-case the same page is being accessed in random
| order multiple times and it's evicted between those accesses
| so we end up with redundant disk I/O.
|
| Bitmap scans ensure each page is only scanned once and in-
| order, so it's a great solution when you need more than an
| index scan but less than a full table scan (worth of data),
| not to mention multiple indexes can be combined into one
| bitmap scan.
|
| If every page is already in memory, the query planner may
| pick plans that look sub-optimal if you factor in disk I/O
| but are otherwise very efficient in-memory.
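The two-phase bitmap scan described above can be sketched in a few lines. Everything here (the index tuples, the `read_page` callback, the tiny fake heap) is an illustrative toy, not any engine's real API:

```python
# Toy sketch of a bitmap scan: phase 1 walks the index and marks which
# heap pages contain matches; phase 2 reads each marked page exactly
# once, in physical order, rechecking the condition per page.

def bitmap_scan(index_entries, lo, hi, read_page):
    # Phase 1: "bitmap index scan" -- note matching pages, not rows.
    pages = set()
    for value, page, slot in index_entries:
        if lo <= value < hi:
            pages.add(page)
    # Phase 2: "bitmap heap scan" -- one in-order pass over marked pages,
    # rechecking the condition against the rows actually on each page.
    rows = []
    for page in sorted(pages):
        rows.extend(v for v in read_page(page) if lo <= v < hi)
    return rows

# Fake heap: 4 pages of 3 rows each, values deliberately unordered on disk.
heap = {0: [5, 40, 12], 1: [33, 7, 90], 2: [21, 2, 64], 3: [15, 88, 30]}
index = sorted((v, p, s) for p, vals in heap.items() for s, v in enumerate(vals))

print(bitmap_scan(index, 10, 35, heap.__getitem__))  # [12, 33, 21, 15, 30]
```

Note the result comes back in page order, not value order: that's the trade the bitmap scan makes to guarantee each page is fetched once.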
| masklinn wrote:
| FWIW you can indent with 4 spaces for a preformatted block e.g.
|
|     CREATE TABLE temp (id SERIAL PRIMARY KEY, amount MONEY,
|         "createdAt" TIMESTAMPTZ);
|     CREATE INDEX ON temp ("createdAt");
|     INSERT INTO temp(id, "createdAt", amount)
|     SELECT generate_series(1,1000000) AS id,
|         NOW() + (random() * (interval '10 years')) - interval '10 years' AS createdAt,
|         random() * 100::money AS amount;
| dboreham wrote:
| Indexing isn't magic. You can define an ordered index on the
| createdAt field, and your query will use that index only if the
| planner's own cost model decides it's worthwhile. There are
| things you can do to persuade it, but the decision is the
| planner's.
| masklinn wrote:
| The problem is that the article's explanation makes
| absolutely no sense: it claims that a full scan is performed
| because of the computation (function call) in the `select`
| clause. And worse, that indexing the column _used_ in the
| function call fixes it:
|
| > But, It will still do a full table scan because we are
| using "amount" column to calculate the total. So, we need to
| put an index on "amount" column too.
|
| Seems to me like TFA tried it, saw that the index was not
| used, and invented a justification for it. And after adding
| the second index the query planner had collected the relevant
| information and started using the index, or something.
| eatonphil wrote:
| Awesome guide! If you want to see a very minimal walkthrough of
| actually implementing this in code, the third post of my "write a
| SQL database in Go" series does just this [0].
|
| Once you have the index set up and are able to use it the next
| challenging part is pattern matching (plus algebraic rewrites) to
| figure out if the index is actually applicable.
|
| For example, in my linked blog post, an index is only used if the
| WHERE expression has only AND-ed operations and one of those
| expressions is a filter using <, >, =, <> (and a few other
| operators) and one of the columns is indexed. This is simpler
| than what a real database might attempt.
|
| Another interesting part of seeing it in code (and messing with
| the code) is observing cases where even though you're using an
| index it doesn't speed anything up, e.g. if your filter still
| matches every item in the index.
|
| Also if you just want to learn more about indexes as a user, I
| highly recommend the site "Use the index, Luke" that has many
| useful guides/explanations.
|
| [0] https://notes.eatonphil.com/database-basics-indexes.html
|
| [1] https://use-the-index-luke.com/
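The AND-only applicability check described in the comment above can be sketched like this; the tuple-based expression node shapes ("and", "cmp") are invented for the sketch and are far simpler than a real database's plan matcher:

```python
# Rough sketch of index applicability: walk an AND-only WHERE tree
# looking for one comparison on an indexed column. Anything else
# (OR, NOT, function calls) falls back to a full scan.

INDEXABLE_OPS = {"<", ">", "=", "<>"}

def usable_index(expr, indexed_columns):
    """Return (column, op, value) if some AND-ed comparison can use an index."""
    if expr[0] == "and":
        for sub in expr[1:]:
            hit = usable_index(sub, indexed_columns)
            if hit:
                return hit
        return None
    if expr[0] == "cmp":
        _, op, column, value = expr
        if op in INDEXABLE_OPS and column in indexed_columns:
            return (column, op, value)
    return None  # unsupported shape: fall back to a full scan

where = ("and", ("cmp", ">", "age", 30), ("cmp", "=", "name", "kate"))
print(usable_index(where, {"age"}))   # ('age', '>', 30)
print(usable_index(where, {"city"}))  # None
```

The AND-only restriction matters: any branch of an AND can narrow the row set on its own, whereas an OR branch can only widen it, so a single index no longer bounds the result.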
| masklinn wrote:
| The article claims to talk about database indexing but as soon as
| it goes into any sort of details the information only applies to
| mysql.
|
| > So here, indexing comes into picture. What indexing does is,
| sets up the column your search conditions are on, in a sorted
| order to assist in optimising query performance.
|
| That's mysql specific, and not just that but it's _innodb_
| specific: innodb uses the primary key as a clustering index, or
| the first UNIQUE index if there's no PK.
|
| Other engines don't usually do that, some (e.g. postgres) don't
| even have clustered indexes (aka indexes which affect the table's
| own layout), in postgres clustering is an explicit discrete
| table-rewriting operation.
|
| And of course a table can have multiple indices, so it's not
| possible for an index to always determine the table order. In
| databases which support clustered indices, only one index per
| table can be clustered.
|
| > The major advantage of B-tree is that the data in it is
| sortable
|
| That's... not really true. A b-tree is sorted, period. You can't
| have an unordered btree.
|
| > What are other types of indexes?
|
| There are tons of index types not mentioned by the article e.g.
| BRIN, GIN, GiST
|
| > For example, suppose you want to find out all of the companies
| which has company_id less than 40. How could you do that with a
| hash table index? Well, it's not possible because a hash table is
| only good for looking up key value pairs - which means queries
| that check for equality (like "WHERE company_id = 18").
|
| And because btrees naturally handle "range" queries, they also
| handle _prefix_ queries well. After all, "foo*" is just looking
| for every value between "foo" included and "fop" excluded.
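The prefix-to-range trick can be shown concretely. The naive last-character increment below is only a sketch; real implementations must handle edge cases such as a trailing maximal byte:

```python
# Turn a prefix match "foo*" into the closed-open range
# 'foo' <= value < 'fop' by incrementing the prefix's last character.

def prefix_bounds(prefix):
    lo = prefix
    hi = prefix[:-1] + chr(ord(prefix[-1]) + 1)  # naive: ignores max-codepoint edge cases
    return lo, hi

lo, hi = prefix_bounds("foo")
words = ["food", "fool", "foo", "fop", "for", "bar"]
print([w for w in words if lo <= w < hi])  # ['food', 'fool', 'foo']
```

Because a btree answers the range `lo <= v < hi` with one descent plus a sequential leaf walk, prefix queries get index support for free; this is also why `LIKE '%foo'` (a suffix match) cannot use an ordinary btree.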
| da_chicken wrote:
| > > So here, indexing comes into picture. What indexing does
| is, sets up the column your search conditions are on, in a
| sorted order to assist in optimising query performance.
|
| > _That's mysql specific, and not just that but it's innodb
| specific: innodb uses the primary key as a clustering index, or
| the first UNIQUE index if there's no PK._
|
| > _Other engines don't usually do that, some (e.g. postgres)
| don't even have clustered indexes (aka indexes which affect the
| table's own layout), in postgres clustering is an explicit
| discrete table-rewriting operation._
|
| It kind of depends on exactly what they mean. _The physical
| index itself_ is essentially always going to be stored in a
| "sorted" order, clustered or non-clustered. It's [by default
| almost always] a B-tree of some flavor, but every index is
| going to be stored in a way that makes searching it faster, and
| in that sense it can always be described as "sorted" even if
| the table itself has no clustering index. It's in a well-
| defined order that aids in identifying the records in the table
| itself. Even if the query engine identifies that the 100
| records that satisfy the WHERE clause are in 100 different,
| discontinuous pages, an index can help with that even if the
| I/O won't be optimal.
|
| > > The major advantage of B-tree is that the data in it is
| sortable
|
| > That's... not really true. A b-tree is sorted, period. You
| can't have an unordered btree.
|
| Oh, I've just noticed you know this already. What's your
| confusion then?
| masklinn wrote:
| > It kind of depends on exactly what they mean. The physical
| index itself is essentially always going to be stored in a
| "sorted" order, clustered or non-clustered.
|
| They're not "meaning" anything, they're literally saying that
| adding an index sorts the table by the indexed column, with a
| little picture showing this change _to the table itself_.
|
| > Oh, I've just noticed you know this already. What's your
| confusion then?
|
| I'm not confused, I'm pointing out that the article is
| factually incorrect. Not just "this only applies to one
| database" but literally saying something which is not true,
| or at best hopelessly confused.
| da_chicken wrote:
| > They're not "meaning" anything, they're literally saying
| that adding an index sorts the table by the indexed column,
| with a little picture showing this change to the table
| itself.
|
| No, I don't think so. That's _one_ interpretation of what
| they said. Since that's clearly not correct, why should we
| assume that's what it means? That's dishonest reading.
|
| If I'm explaining to a beginner how an index works -- which
| is literally what this article is doing -- then I don't see
| why I wouldn't refer to an index as a specially sorted
| version of the records in the table. That's not a huge
| difference from what the author actually wrote. Yeah, you
| might prefer that the author had said "organized" rather
| than "sorted", but those two words are synonyms in plain
| English.
|
| Look, here's the statement in context:
|
| > What indexing does is, sets up the column your search
| conditions are on, in a sorted order to assist in
| optimising query performance.
|
| > With index on company_id, table would look like this
| <image>
|
| > Now, due to sorted order, the database knows which
| records to examine and when to stop.
|
| > The main point of having a index is to cut down the
| number of records/rows in a table which needs to be
| examined by the database, to speed up the searching
| queries.
|
| If you look at an index physically, it _does_ literally
| store the values being indexed. That's why indexes take up
| disk space and require I/O. They're actually duplicating
| information. It _does_ actually take the column values from
| the table and "sort" or "organize" the columns covered by
| the index. They're organized into a B-tree [usually], but
| the values from the table for the indexed columns are there
| and they are organized in a way you can describe as
| "sorted." The big difference is that the pages of the
| index, rather than having leaf records like a table,
| instead reference the records directly (often using an
| identifier tied to the clustered unique index).
| cogman10 wrote:
| > innodb uses the primary key as a clustering index, or the
| first UNIQUE index if there's no PK.
|
| In MSSQL, the primary key or even a "clustered index" is
| ultimately the ordering of a table. If neither is provided,
| MSSQL uses an invisible row id column. If the clustered index is
| not made with the "UNIQUE" constraint, MSSQL will also create
| the row id column.
|
| That is to say, yes, this article is very MySQL oriented.
| tomnipotent wrote:
| > is ultimately the ordering of a table
|
| Clustered index leaf nodes contain the actual row data of the
| table itself, which is why there can only be one.
| srcreigh wrote:
| An important detail which this guide leaves out: The "actual
| location of the row" is usually the leaf node of another B-Tree.
| That is the primary B-Tree and it's indexed by the primary key.
|
| The major consequence is every non-primary index query involves
| dereferencing N pointers, which could mean loading N pages from
| disk/SSD to get leaf nodes spread out across the primary tree.
| Whereas if you query a range in the primary B-Tree directly, the
| rows are consecutive so you would only load N/M consecutive pages
| based on the ratio of rowsize to pagesize.
|
| That's why some people use composite keys for primary key, to get
| better data locality in the primary B-Tree index.
|
| See "How We've Scaled Dropbox"[1] to hear the same thing.
|
| At 48:36 they explain why they changed from PRIMARY KEY (id) to
| PRIMARY KEY (ns_id, latest, id). ns_id is kinda like user_id. So
| that change groups journal entries for users together on disk
|
| Specifically. PRIMARY KEY (id) orders things by creation date
| whereas PRIMARY KEY (ns_id, latest, id) orders things by ns_id
| primarily.
|
| [1]: https://youtu.be/PE4gwstWhmc?t=2914
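The locality effect is easy to sketch in SQLite, where WITHOUT ROWID tables are clustered on the primary key (roughly analogous to InnoDB's layout). The column names below borrow from the talk; the rest of the schema and data are illustrative:

```python
import sqlite3

# WITHOUT ROWID makes the table itself a B-tree keyed on the PK,
# so rows sharing a ns_id prefix are physically adjacent.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE journal ("
    " ns_id INTEGER, latest INTEGER, id INTEGER, payload TEXT,"
    " PRIMARY KEY (ns_id, latest, id)"
    ") WITHOUT ROWID"
)
conn.executemany(
    "INSERT INTO journal VALUES (?, ?, ?, ?)",
    [(ns, 1, i, "x") for i in range(100) for ns in (1, 2, 3)],
)

# All rows for one ns_id form one contiguous run of the primary
# B-tree, so this query is a single range walk, not N random seeks.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM journal WHERE ns_id = 2"
).fetchall()
print(plan[0][3])  # e.g. SEARCH journal USING PRIMARY KEY (ns_id=?)
```

With `PRIMARY KEY (id)` instead, the same 100 rows would be scattered through the tree in insertion order, and fetching them would touch many more pages.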
| iaabtpbtpnn wrote:
| This is true of MySQL (which the guide uses), but not
| necessarily of other databases such as Postgres.
| srcreigh wrote:
| You're right. Postgres doesn't give any control over primary
| data locality. That might make querying 1 row a bit faster in
| Postgres (no log(N) traversal of the primary B-tree index), but
| picking out N rows could be a lot slower.
|
| https://www.postgresql.org/docs/13/indexes-index-only-
| scans....
| walski wrote:
| Great video from the SQLite founder that covers a lot of ground
| in terms of how indices work under the hood:
| https://m.youtube.com/watch?v=gpxnbly9bz4
___________________________________________________________________
(page generated 2021-08-20 23:02 UTC)