[HN Gopher] How does database indexing work? (2008)
       ___________________________________________________________________
        
       How does database indexing work? (2008)
        
       Author : luu
       Score  : 188 points
       Date   : 2022-03-07 22:30 UTC (1 days ago)
        
 (HTM) web link (stackoverflow.com)
 (TXT) w3m dump (stackoverflow.com)
        
       | programmarchy wrote:
       | I had assumed btree in Postgres was referring to a binary tree
       | but that is definitely not the case. I also learned when to use a
       | hash index, rather than a btree. (If I understand correctly if
       | you're only using the = operator rather than something that needs
       | sorting with >, <, <=, etc. then a hash will perform better.)
        
         | efficax wrote:
         | theoretically a hash lookup is O(1) so should perform better if
         | the key is unique, but you give up a lot of other features
         | (like partial matches, searching, and range queries).
         | 
         | Binary trees are the ideal in a world of a pure von Neumann
         | model. In practice data structures perform better when data is
         | grouped together and this is what the b-tree gets you. It's
         | both more disk and cache efficient in real world workflows
        
         | dikei wrote:
         | Unless your own benchmarks prove hash indexes to have major
         | benefits for your queries, stick with the default B-tree index:
         | you never know what queries you'll need in the future.
        
           | latch wrote:
           | That's a bit too YAGNI for me. You have a `users` table with
           | an `tenant_id uuid not null` column. You aren't going to need
           | uniqueness on it and you aren't going to need range
           | operations on it it.
           | 
           | And..even if you did, you could re-index it.
        
         | settrans wrote:
         | Indeed, they perform much better, but caveat emptor: hash index
         | operations are not currently WAL-logged. That means that: 1) if
         | you crash during a hash index update, all bets are off and
         | you'll have to rebuild, and 2) they will just be wrong in a
         | replicated configuration.
         | 
         | EDIT: This is no longer true. Thanks latch!
        
           | latch wrote:
           | This was fixed in PG 10 (Oct 2017). See
           | https://rhaas.blogspot.com/2017/09/postgresqls-hash-
           | indexes-...
        
             | settrans wrote:
             | Oh, that's great news! (I last used evaluated them in
             | August 2017)
        
         | latch wrote:
         | Less important now, but they're unsafe to use prior to PG 10.
         | 
         | As of 14.1, they still don't support unique constraints:
         | select amname, pg_indexam_has_property(oid, 'can_unique') from
         | pg_am             amname | pg_indexam_has_property
         | --------+-------------------------          heap   | $?
         | btree  | t          hash   | f          gist   | f          gin
         | | f          spgist | f          brin   | f
        
         | dhosek wrote:
         | A btree is a kind of binary tree, one which is self-balancing
         | to keep O(log n) lookups from turning into O(n) lookups which
         | is possible without the self-balancing. (Think about a naive
         | binary tree where you insert elements in sorted order--then
         | everything will be in the right (or left if you start with the
         | maximum value) branch and you need to traverse the whole
         | structure to find the node you're looking for.)
         | 
         | For sorting or adjacency queries, you want a btree index. A
         | hash index can give you O(1) lookup with the drawback that you
         | lose ordering completely (whereas getting a sorted selection of
         | records with a tree is a O(1) operation). The trick here is
         | having a hashing function which is both fast and avoids
         | collisions (the worst case scenario is hash(x)=1 where
         | everything hashes to the same value and you end up again with
         | O(n) lookups).
        
           | masklinn wrote:
           | > A btree is a kind of binary tree
           | 
           | By definition a _binary_ tree means there are two children
           | per node.
           | 
           | Binary is the one thing a b-tree is _not_.
        
             | programmarchy wrote:
             | Right, I think you could say a binary tree is a kind of
             | B-tree, though.
        
               | jlokier wrote:
               | It's not, because in a standard B-tree all the leaf nodes
               | are at exactly the same depth, whereas in a binary tree
               | they are not in general, unless the binary tree is
               | balanced with exactly 2N leaves.
               | 
               | This difference occurs because a B-tree can have interior
               | nodes that aren't completely full, whereas a binary
               | tree's interior nodes have exactly 2 children.
        
             | getcrunk wrote:
             | Uhh. A b tree is a self balancing binary (search) tree.
        
               | coding123 wrote:
               | Not binary, it can and generally has more than 2 children
               | per node. That's the main difference.
        
               | broken8ball wrote:
               | Binary implies two. Both B-trees and BSTs exist in the
               | universe of m-ary trees. A BST and a B-tree would be
               | equivalent only if the branching factor of the B-tree was
               | set to 2, but practically this is rarely the case with
               | indexes, given that a higher branching factor is
               | generally more favorable to lookup times ("block
               | accesses") since the hight of the B-tree is reduced as m
               | increases.
        
               | masklinn wrote:
               | > practically this is rarely the case with indexes
               | 
               | Seems like it would be extremely odd to have a 2-2 btree,
               | you'd just get a significantly more complicated BST no?
               | 
               | I'd figure you'd want to fill a cacheline with something
               | typical-ish, so probably at least 4 (this way if you have
               | 4 children and 3 keys, the keys are 8 bytes and the child
               | links are straight pointers your node is 56 bytes and you
               | can add some metadata e.g. a bitmap).
               | 
               | Apparently Rust's BTreeMap is a 6-11 btree but I don't
               | know how they picked the branching factor.
        
       | winrid wrote:
       | Don't forget sorting when doing index design! With the right
       | index the DB doesn't even need to sort - it can just fetch from
       | the index in index order!
        
       | blueberrychpstx wrote:
       | Read here [0]
       | 
       | Why would I do this? [1]
       | 
       | [0]
       | [https://otter.ai/u/SDndzmSLow_a2rNNzm35ohw4y9w](https://otte...
       | 
       | [1]
       | 
       | [https://www.notion.so/enoemos/https-stackoverflow-com-questi...
        
       | leto_ii wrote:
       | Completely besides the important point, but is it the same guy
       | who asked and answered the question? What am I missing?
        
         | yardshop wrote:
         | I've seen it happen a bunch. Since StackOverflow is a Q&A
         | format, if someone knows that many people will ask a question
         | they know the answer to, they will write both. Preemptive
         | questioning perhaps!
         | 
         | He also wrote the other question that he links to, How to index
         | a database column, with the request to get answers for each
         | major type of database. So he's asking not so much to learn the
         | answer but to provide a place for others to provide a catalog
         | of answers.
         | 
         | This is different of course from the case where someone asks a
         | genuine question then comes back and writes an answer to
         | themselves when they have learned it. This happens a lot too.
        
       | westurner wrote:
       | https://en.wikipedia.org/wiki/Database_index
       | 
       | https://en.wikipedia-on-ipfs.org/wiki/Database_index
       | 
       | From "Hosting SQLite Databases on GitHub Pages"
       | https://news.ycombinator.com/item?id=28021766 re: edgesearch,
       | HTTP/3 QUIC UDP, :
       | 
       | > _Serverless full-text search with Cloudflare Workers,
       | WebAssembly, and Roaring
       | Bitmapshttps://github.com/wilsonzlin/edgesearch _
       | 
       | >> _How it works: Edgesearch builds a reverse index by mapping
       | terms to a compressed bit set (using Roaring Bitmaps) of IDs of
       | documents containing the term, and creates a custom worker script
       | and data to upload to Cloudflare Workers_
        
       | manish_gill wrote:
       | Relevant Book: https://www.databass.dev/
        
       | [deleted]
        
       | dhx wrote:
       | Also remember to create tables using efficient structure packing
       | and reordering[1], taking into consideration how the storage
       | serializer operates (see [2] for PostgreSQL). An example provided
       | by Gitlab[3] demonstrates structure packing and reordering
       | principles in PostgreSQL. It may also be beneficial depending on
       | types of workload to limit the number of columns in a table to
       | optimise for locality of reference[4] once the data is
       | deserialized into memory.
       | 
       | [1] http://www.catb.org/esr/structure-packing/
       | 
       | [2] https://www.postgresql.org/docs/14/storage-page-layout.html
       | 
       | [3]
       | https://docs.gitlab.com/ee/development/ordering_table_column...
       | 
       | [4] https://en.wikipedia.org/wiki/Locality_of_reference
        
       | snake_plissken wrote:
       | Can someone expand on this statement (from the top answer, last
       | paragraph):
       | 
       | "Since indices are only used to speed up the searching for a
       | matching field within the records, it stands to reason that
       | indexing fields used only for output would be simply a waste of
       | disk space and processing time when doing an insert or delete
       | operation, and thus should be avoided."
       | 
       | Specifically "indexing fields used only for output", what does
       | this mean? I interpreted it as "indexes used only for `select`
       | statements", but that would see completely counter to the main
       | reason you'd want an index e.g. to speed up record retrieval.
        
         | esnard wrote:
         | "fields used only for output" means "fields that aren't used in
         | any WHERE, ORDER BY, GROUP BY clauses".
         | 
         | They are just SELECTed, i.e. sent in the output (the query
         | results).
        
         | fabian2k wrote:
         | That part is wrong if your database can perform index-only
         | queries. In that case if all columns you query are part of the
         | index the query can be answered from the index alone, so you
         | don't have to read any parts of the actual table.
        
         | geophile wrote:
         | I think that advice is not necessarily correct.
         | 
         | Suppose you have this query:                   select a, b
         | from T         where x = 1
         | 
         | And suppose you have an index on x. You can locate the x=1
         | INDEX records using the index quickly (probably no more than
         | 1-3 disk accesses). But then the qualifying TABLE records have
         | to be retrieved. That index lookup could turn up thousands of
         | qualifying records, and now you have to retrieve each record to
         | get the a, b values.
         | 
         | Now suppose that you replace your index on (x) with an index
         | keyed by (x, a, b). You can still search for x=1 using this
         | index, but now, the (a, b) values that you are SELECTing are
         | present in the index itself. You don't have get the table
         | records to get those values. That can save a lot of random
         | accesses, and there are secondary effects, since the pages
         | containing those records aren't brought into the disk cache.
         | 
         | Yes, this wider index has space and update costs, but the
         | benefit is often worth it.
        
           | uhoh-itsmaciek wrote:
           | Right, Postgres supports this use case explicitly with
           | INCLUDE [1], which can be more efficient than just indexing
           | all the columns you need. Pretty sure SQL Server has
           | something similar.
           | 
           | [1]: https://www.postgresql.org/docs/current/sql-
           | createindex.html
        
           | [deleted]
        
       | didip wrote:
       | You know what I would like to see (for learning)?
       | 
       | A simplified project how indices, for example in b-tree, are
       | stored in segment files.
        
         | xtracto wrote:
         | This. I was hopeful that the linked article would explain how
         | b-trees are practically used for indexes in DB engines. I
         | already know what a db index is (databases 101 software eng)
         | and why are they used, but would love a clear explanation of
         | how are they implemented
        
           | btilly wrote:
           | https://perl.plover.com/BTree/article.html is an introductory
           | article about how a b-tree is implemented. It has example
           | code for it in Perl.
           | 
           | It doesn't actually match low level implementations for a
           | number of reasons. Such as the need to fix the page size the
           | block fits in rather than the count of things in the block,
           | and locking for concurrent access. But the data structure
           | itself still looks the same.
        
       | kyberias wrote:
       | Does anyone know what an index card was?
        
         | Xorakios wrote:
         | In school and public libraries it was a physical piece of thick
         | paper, similar to a Rolodex card, which had multiple copies,
         | always one per author and one per title, but sometimes backups
         | by subject manager. And when I was a kid, there were often
         | multiple, redundant copies that one could take to the librarian
         | to check out the physical book if it were particularly
         | important.
         | 
         | Yep, that old :(
        
           | dhosek wrote:
           | Wait, you had card catalogs where you took the cards out of
           | the drawer? I grew up with card catalogs where there was a
           | metal rod at the bottom of each drawer that ran through holes
           | in the bottom of the cards. There were cards for author,
           | title and subject with multiple subject card possible for
           | each book. The CIP blocks on the copyright pages of books
           | list all the possible subject headings.
           | 
           | And I've been thinking that I don't use the subject indexes
           | on the online catalog anywhere near enough. They can surface
           | things that the standard keyword search might miss or bury
           | among irrelevant results.
        
       | tomhallett wrote:
       | I highly recommend the book "SQL Performance Explained" by Markus
       | Winand: https://www.amazon.com/Performance-Explained-Everything-
       | Deve...
        
       | eatonphil wrote:
       | If you want to see a dumb example of how you can implement
       | indexing on top of a SQL database without it, I wrote a tutorial
       | on implementing basic indexes [0] as part of a series on making a
       | SQL database from scratch.
       | 
       | And of course, Use the Index Luke is a great reference for the
       | real world.
       | 
       | [0] https://notes.eatonphil.com/database-basics-indexes.html
       | 
       | [1] https://use-the-index-luke.com/
        
         | unemphysbro wrote:
         | Phil, These are great! Thank you!
        
         | bogomipz wrote:
         | I loved your series on writing a SQL DB from scratch. I also
         | enjoyed your recent SMTP post as well. I hope you continue to
         | do these, they're always good reading.
        
           | eatonphil wrote:
           | Thank you! There's a very long list of topics to write about
           | so I'm not too worried about continuing. :) Always open to
           | ideas as well.
        
         | frosted-flakes wrote:
         | [1] above is a fantastic resource. I highly recommend Markus
         | Winand's book "SQL Performance Explained" if you're in need of
         | better performing SQL queries. It's quite short and small, but
         | it covers all the bases, and uses introductory language.
        
       | jihadjihad wrote:
       | My team has really enjoyed the concise and informative book [0]
       | from Markus Winand. It comes up sometimes here on HN, and IMHO
       | it's a great resource on the topic for anyone who touches a
       | relational database in their line of work.
       | 
       | His website "Use the Index, Luke!" [1] is also great.
       | 
       | 0: https://sql-performance-explained.com
       | 
       | 1: https://use-the-index-luke.com
        
       | quelltext wrote:
       | So how does a binary search work when the disk blocks are
       | essentially a linked list as the top answer initially says? They
       | just assume that it's possible to jump through a contiguous array
       | of blocks for their later analysis but initially state blocks
       | aren't necessary contiguous.
        
         | jameshart wrote:
         | Their point is that in order to get log(n) access speed you
         | need a sorted random access list, not that if you sort the
         | items in a linked block list you can use binary search on it.
         | They go on to explain how a database index constructs such a
         | thing over the linked block list.
        
         | jlokier wrote:
         | Honest answer: If the blocks aren't* contiguous, there's an
         | index for that!
         | 
         | (* It depends on the database type. In an LSM-tree layer, the
         | blocks are typically contiguous in a file, so direct binary
         | search is possible, though it might not be the most efficient
         | method. The filesystem handles locating blocks in this case.)
         | 
         | The database table is disk blocks containing data where the
         | blocks are logically in order of keys, as if in a contiguous
         | array. Really they are laid out differently on disk.
         | 
         | So there's a block index, which is just a smaller version of
         | the same table data structure: a table mapping keys to values,
         | implemented as blocks in logical order of keys, as if in a
         | contiguous array. Except in this index, the map is from key-
         | ranges of the first table to block locations on disk. This
         | index lets you go from keys to block locations on disk, and it
         | also lets you do a course-grained version of the binary search
         | to narrow down to a single block of the larger table.
         | 
         | Ok, but then how do you look things up in the block index if
         | it's using the same kind of data structure, made of multiple
         | blocks? Same again: Each block index has its own smaller block
         | index.
         | 
         | This neat recursive definition gives you a tower of
         | progressively smaller block indexes until the size is just one
         | block, which doesn't need an index.
         | 
         | Guess what you get when each table of logically sorted,
         | contiguous blocks has a smaller index to say where the blocks
         | are really located?
         | 
         | The data structure is called a B-tree (technically a B+tree),
         | and the smallest index is the root block.
         | 
         | The tree structure arises from the recursive description, where
         | each table has another table (until it stops).
         | 
         | This is a decidedly non-standard way of describing the B-tree
         | data structure. There's no explicit tree. But it's a valid and
         | useful alternative view. (Note, these block indexes are not
         | what is generally meant by database indexes, and they are not
         | visible at the SQL level. They are an implementation detail.)
         | 
         | One of the useful things to emerge from this view is that each
         | level is just logically a flat table, made of blocks that are
         | logically in key order but have arbitrary physical location.
         | It's not necessary for every index to use the same data
         | structure, or be on the same storage. Depending on how you
         | think about algorithms, this description might be simpler to
         | work with.
        
         | dvirsky wrote:
         | That's why disk indexes usually use B-trees.
        
           | anamax wrote:
           | disk block # = f(index) can be implemented with a b-tree but
           | it can also be implemented (in some cases) with a deep-
           | learning based approach.
        
       | tonymet wrote:
       | Take a large unordered text file and then create an ordered index
       | in a separate file of word ==> line
       | 
       | write a brief binary search algo to search the index.
       | 
       | Compare searching the words with a "table scan" on the first file
       | using grep, vs the binary search on the index.
       | 
       | You will find the table scan is O(n) and your binary search is
       | roughly O(log n)
       | 
       | In 60 minutes you'll understand more about indexing than reading
       | stack overflow.
        
         | tonymet wrote:
         | bonus points your interview skills will improve on both coding
         | & system design
        
       | maxmcd wrote:
       | Obligatory recommendation for the CMU Databases Systems lecture
       | series by Andy Pavlo.
       | https://www.youtube.com/watch?v=oeYBdghaIjc&list=PLSE8ODhjZX...
       | 
       | Here's the lecture on tree indexes:
       | https://www.youtube.com/watch?v=JHZFc4hMGhk
        
       | nijave wrote:
       | ...the same way as non-database indexing? e.g. a lookup table for
       | values in a different order than they're stored
        
         | crdrost wrote:
         | There's kind of three things you want to show people to get
         | them to understand indexes. Let's say you sell an internet
         | police service and have a table of URLs you are crawling for a
         | table of clients, each client gets some client_id, each URL
         | references the client_id of the client who wants it policed.
         | Here are the things you want to say:
         | 
         | 1. Binary search. It's be really handy to have the URLs stored
         | in a different order, maybe grouped by client_id first and then
         | all the URLs for the same client are alphabetical. The index
         | conceptually just contains the same rows in a different order.
         | 
         | 2. The read vs write trade-off. Your data fits in memory if
         | there's less than 10 TB of it (possibly up to 100 TB) so the
         | big thing is latency, what an index does is, it requires you to
         | write the same information twice in order to read it
         | exponentially faster, so depending on how often do you write
         | versus how often you read, indexes make less or more sense.
         | 
         | 3. High fan-out trees. Once you have explained that the index
         | is just an the same data in a different order, you have
         | inserted a sort of ticking time bomb of misunderstanding. The
         | problem is, a beginner programmer's notion of a freely
         | indexable list, such that binary search works on it, is an
         | array. How do we insert into an array? With all the work of
         | array copies! So you get people who think that random ID
         | columns should never be indexed, because every write into the
         | middle of the array has to shove half the data one cell to the
         | right--unacceptable! So it really helps to say, "now we can
         | store this as a binary tree, make one comparison per level of
         | the tree." Pause for understanding that the "list," can be
         | stored with such a tree. "but, modern computers have a nice
         | property that right after accessing some memory the memory near
         | that location is loaded into a cache... this makes arrays real
         | fast. Suppose we don't just have the binary tree thing of 2
         | child nodes each representing 50% of our data, but an array
         | with 100 pointers each representing 1% of the data, you get
         | something like a 6x memory speedup from the cache, because your
         | tree is one sixth the depth of the binary tree. And that's
         | basically what a B-tree is, you use a higher fan-out of your
         | sort tree to exploit the cache."
        
         | mikestew wrote:
         | It is, but with all respect, it's not a _useful_ answer (or
         | even accurate; see below). If one takes, for example and
         | contrast, the top-voted answer, well, I could probably
         | implement a database index scheme of some sort based on that
         | explanation. Might not be a good one, but it would work. Parent
         | comment's answer? I wouldn't know where to start.
         | 
         | Additionally, parent's answer doesn't actually answer the
         | question, by saying what an index _is_ , but the question was
         | "how does it work?" The top SO answer does a great job with
         | this one, I think. Another one a few answer down also goes into
         | some of the downsides (it slows down writes, for one). The
         | whole page is worth a skim if you're not a DBA but have to
         | fiddle with DBs on occasion.
        
         | zomnoys wrote:
         | This feels a bit reductionary. Database indexes definitely have
         | more to consider than just simple lookup tables. I'd say that
         | mental model breaks down in practice.
        
       ___________________________________________________________________
       (page generated 2022-03-08 23:01 UTC)