[HN Gopher] Cases where full scans are better than indexes
       ___________________________________________________________________
        
       Cases where full scans are better than indexes
        
       Author : xnx
       Score  : 165 points
       Date   : 2023-05-25 14:47 UTC (8 hours ago)
        
 (HTM) web link (www.jefftk.com)
 (TXT) w3m dump (www.jefftk.com)
        
       | electroly wrote:
        | If the data is small, then the index will also be small, so it
        | doesn't really matter either way. The reason to _avoid_ indexes
        | is if you can't accept the time or space cost. Only the last of
        | OP's examples is a situation where choosing an index causes a
        | problem; the rest are examples where an index _isn't necessary_
        | but does not cause problems. Several examples are so trivial that
        | you'd be crazy to use an external database at all. I'm not sure
        | I'm really sold here.
        
         | deepsun wrote:
         | Unless the small data changes often -- then DBMS needs to
         | change both data and each index.
        
           | electroly wrote:
           | By definition, the changes will also be small. It's just not
           | possible for it to blow up in a way that would cause a
           | problem, because we're only talking about tiny tables.
           | There's nothing you can do in a single[1] index for a table
           | with 350 rows that will cause any heartache vs. an unindexed
           | table with 350 rows. 4 of the 5 examples in the article are
           | just "this table is too small to care."
           | 
           | [1] Edited per discussion; a large number of indices on a
           | small table can still get you in trouble.
        
             | tomnipotent wrote:
             | > By definition, the changes will also be small
             | 
             | Even if all you're updating is an int32 column from 1 to 2,
             | the RDBMS is still writing much more data than 4 bytes. At
             | a minimum it's making a copy of the full page the data is
             | in prior to the update (so 8k for pg, 16k for mysql) and
             | usually multiples of that (mysql writes both before/after).
        
             | deepsun wrote:
             | > There's nothing you can do in an index for a table with
             | 350 rows
             | 
             | I meant if you have 100s of indices -- it might be slow
             | even on small tables.
             | 
              | I actually had a production issue from using a lot of
              | indices, but it's not apples-to-apples with the current
              | discussion: the table sizes, DBMS and update rates were
              | much larger. It was fixed by removing indices and splitting
              | the table into multiple tables.
        
                | electroly wrote:
                | Wow! You're right; I wasn't considering that anyone would
                | try to have that many indices. 100 is a lot. I should
                | qualify my statement and say that you won't get in
                | trouble with _one_ index on a small table. I've edited
                | my post accordingly.
        
                | sumtechguy wrote:
                | In those sorts of cases your schema is probably not the
                | correct pattern and needs to be reviewed.
                | 
                | I had one project where they had 40 indexes on one table.
                | Inserts were terrible for that table. No one wanted to
                | touch it, as it happened to be the table where everything
                | was at. They started pulling out 'nosql' sharding and
                | other sorts of tools to fix it. Anything to not change
                | that table with 200+ columns and dozens of indexes and
                | the flotilla of stored procs to go with it. They even
                | ported it from MSSQL to Oracle, thinking somehow Oracle
                | was magically going to make the poor schema run faster.
        
         | jefftk wrote:
         | In cases where your data is small enough that you don't need an
         | index, an index is adding technical complexity. In some cases
         | that will be small (you are already using a database and it has
         | built-in support for the kind of index you need) and then,
         | sure, go ahead. But in other cases it will not be so small (you
         | can't get indexes without switching to a more complicated tool,
         | you'd need a somewhat unusual kind of index which requires a
         | plugin, etc) and it's worth thinking about whether your system
         | will be more reliable and low-maintenance long-term if you go
         | without.
        
           | electroly wrote:
           | I think we're in agreement here; this is the part where I
           | said "Several examples are so trivial that you'd be crazy to
           | use an external database at all". For your shell history, you
           | not only don't need indexes, you don't need the database
           | either (as you conclude in the post). But if you're already
           | in the database, it falls into that first "the index feature
           | is right there" category. This HN thread is full of people
           | skipping the index on their SQL tables in situations where I
           | would write the index. Mostly I think people are just missing
           | that you need to write indices in response to queries rather
           | than what you're imagining when you first create the table.
        
          | fnordpiglet wrote:
          | In the case I mention above of log data, where storage
          | dominates the access pattern and retrieval is rare, computing
          | indices up front is absurdly expensive and challenging at
          | extreme scales. In that case, minimal (and perhaps
          | probabilistic) indices computed on ingestion plus extremely
          | parallel brute-force search at query time work out better,
          | both in terms of unbounded scale and cost -- by like one to
          | two orders of magnitude.
        
            | electroly wrote:
            | Indeed, we also use an unindexed store for our logs (having
            | migrated to that from a system that _was_ indexed). This is
            | clearly _not_ the "data is so trivially small that nothing
            | matters" situation that the OP is describing in 4 of their 5
            | examples. I hope you can see this isn't at all incompatible
            | with my post.
        
             | fnordpiglet wrote:
             | I can. I'm just pointing out there are definitely cases
             | where indexing sucks. I agree the original post doesn't
             | touch those cases.
        
       | jeffreyrogers wrote:
       | He didn't show that full scans are better than indexes, just that
       | they aren't obviously worse when the data is small.
        
        | hamilyon2 wrote:
        | 1) The Postgres optimizer at some point prefers full scans. It
        | depends on your data, your query and configuration. So on a
        | read-heavy load you basically lose nothing by adding all
        | appropriate indexes from the get-go. The overhead of having an
        | index on small tables is small.
        | 
        | 2) If your load is write-heavy, you should consider the trade-
        | offs. Yes, full scans are sometimes your friends. An interesting
        | thing to notice here is that write amplification is basically
        | the same even if the table is small.
        
       | [deleted]
        
       | OliverJones wrote:
       | I take the author's point for single-user write-frequently
       | search-rarely use cases. Plenty of small greenfield projects look
       | like that.
       | 
        | Bigger projects, or projects that are enjoying fast growth, may
        | well need indexing. Why? To keep search performance up under a
        | multi-user workload and/or a mixed write-query workload. Complex
        | multi-table searches often also benefit from judicious indexing.
       | 
       | Here's something to keep in mind: Successful projects often find
       | that an unpredicted query pattern has become very popular and/or
       | a bottleneck. Often the bottleneck is remediated in production by
       | adding appropriate indexes. SQL's designed to add indexing
       | without changes to the data model, for precisely that reason.
       | It's normal to add indexes as projects grow, and no sign of bad
       | design or expensive technical debt. (Unsuccessful projects, or
       | projects that don't need to grow? They don't have the problem.)
        
        | fnordpiglet wrote:
        | CloudWatch Logs Insights, Snowflake, and Grafana Loki all do
        | minimally indexed queries with highly parallel brute-force scans
        | of tons and tons of smallish S3 objects. For a lot of data,
        | especially data like logs, the ratio of retrieved to ingested
        | data can be more skewed than 1:100. That makes it very much
        | worthwhile to not index up front but to pay the retrieval cost
        | for the rare bit of data that is actually retrieved. The minimal
        | indexing mentioned is typically time and sometimes bloom-
        | filteresque term-existence indices at the object level.
        
         | londons_explore wrote:
         | > CloudWatch Logs Insights, Snowflake, Grafana Loki
         | 
         | And the common thing in this list of tools? They're all laggy
         | and not a nice experience to use.
         | 
         | Even if queries are rare, there is often an actual human
         | sitting waiting for the results of a query. Whereas while
         | ingesting logs, there is no human waiting. I would prefer to
         | burn more CPU time to save some human time.
        
           | fnordpiglet wrote:
           | I use all three and for large scales they're considerably
           | faster than the alternative; which at some scales
           | (peta/exabyte) is "none."
        
        | taeric wrote:
        | I feel this somewhat generalizes. "Cases where a simple
        | list/array is better than a hashed collection." Which, oddly,
        | has the perverse problem of "when should you give up on a single
        | collection of your data, and embrace storing it in different
        | ways?"
        | 
        | That is, few folks seem to internalize or consider that every
        | index is essentially a copy of the data. A complete copy, if you
        | are doing a fully projected index. So, when is it better to just
        | do a full scan over something else? Well, it's probably never
        | better in isolation of every other concern. But when is it
        | better to just do a scan of the data rather than duplicating it
        | in a place that would let you hash in? That depends on how much
        | extra work you are doing because of it.
        | 
        | Having typed that, really I think the trick is to ask yourself
        | roughly (not exactly, amusingly) what it means to add an index.
        | Or to use a hashtable. Or to use a list and a hashtable. And
        | realize that the question of using an index is the same as the
        | question of keeping another copy of your data somewhere.
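        | 
        | (A toy illustration of "an index is a copy of the data", in
        | Python for concreteness; the names are made up.)

```python
# A hash "index" over a list is literally a second copy of (references
# to) the data, traded for O(1) lookups instead of O(n) scans.
records = [
    {"id": 3, "name": "ada"},
    {"id": 7, "name": "grace"},
    {"id": 9, "name": "alan"},
]


# Full scan: no extra storage, O(n) per lookup.
def find_scan(rid):
    return next((r for r in records if r["id"] == rid), None)


# "Index": a duplicate mapping keyed by id. It must be maintained on
# every insert/update/delete, exactly like a database index.
by_id = {r["id"]: r for r in records}

assert find_scan(7) is by_id[7]
```

        | For three records the scan is obviously fine; the index only
        | pays for itself once the data, or the lookup rate, grows.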
        
       | [deleted]
        
       | vivegi wrote:
       | Once upon a time (25+ years ago) I used to maintain an Oracle
       | database that had around 50M records in its main table and 2M
       | records in a related table. There were a dozen or so dimension
       | (lookup) tables.
       | 
        | Each day, we would receive on the order of 100K records that had
        | to be normalized and loaded into the database. I am simplifying
        | this a lot, but that was the gist of the process.
       | 
       | Our first design was a PL/SQL program that looped through the
       | 100K records that would be loaded into a staging table from a
       | flat file, performing the normalization and inserts into the
       | destination tables. Given the size of our large tables, indices
       | and the overhead of the inserts, we never could finish the batch
       | process in the night that was reserved for the records intake.
       | 
       | We finally rewrote the entire thing using SQL, parallel table
       | scans, staging tables and disabling and rebuilding the indexes
       | and got the entire end-to-end intake process to finish in 45
       | minutes.
       | 
       | Set theory is your friend and SQL is great for that.
        
         | zvmaz wrote:
         | > Set theory is your friend and SQL is great for that.
         | 
         | Could you tell us how set theory was useful in your case?
        
           | vivegi wrote:
            | Simple example.
            | 
            | INTAKE_TABLE contains 100k records and each record has up to
            | 8 names and addresses. The following SQL statements perform
            | the name de-duplication, new id assignment and
            | normalization:
            | 
            |   -- initialize
            |   truncate table STAGING_NAMES_TABLE;
            | 
            |   -- get unique names
            |   insert /*+ APPEND PARALLEL(STAGING_NAMES_TABLE, 4) */
            |     into STAGING_NAMES_TABLE (name)
            |   select /*+ PARALLEL(INTAKE_TABLE, 4) */ name1 from INTAKE_TABLE
            |   union
            |   select /*+ PARALLEL(INTAKE_TABLE, 4) */ name2 from INTAKE_TABLE
            |   union
            |   ...
            |   union
            |   select /*+ PARALLEL(INTAKE_TABLE, 4) */ name8 from INTAKE_TABLE;
            | 
            |   commit;
            | 
            |   -- assign new name ids using the name_seq sequence number
            |   insert /*+ APPEND */ into NAMES_TABLE (name, name_id)
            |   select name, name_seq.NEXTVAL from
            |     (select name from STAGING_NAMES_TABLE
            |      minus
            |      select name from NAMES_TABLE);
            | 
            |   commit;
            | 
            |   -- normalize the names
            |   insert /*+ APPEND PARALLEL(NORMALIZED_TABLE, 4) */
            |     into NORMALIZED_TABLE (
            |       rec_id,
            |       name_id1,
            |       name_id2,
            |       ...
            |       name_id8 )
            |   select /*+ PARALLEL(SNT, 4) */
            |       int.rec_id   rec_id,
            |       nt1.name_id  name_id1,
            |       nt2.name_id  name_id2,
            |       ...,
            |       nt8.name_id  name_id8
            |     from
            |       INTAKE_TABLE int,
            |       NAMES_TABLE  nt1,
            |       NAMES_TABLE  nt2,
            |       ...
            |       NAMES_TABLE  nt8
            |     where
            |       int.name1 = nt1.name and
            |       int.name2 = nt2.name and
            |       ...
            |       int.name8 = nt8.name;
            | 
            |   commit;
           | 
            | The database can do the sorting and de-duplication (i.e.,
            | the query UNION operation) much, much faster than any
            | application code. Even though the INTAKE_TABLE (100k
            | records) is table-scanned 8 times, the union query runs
            | quite fast.
           | 
           | The new id generation does a set exclusion (i.e., the MINUS
           | operation) and then generates new sequence numbers for each
           | new unique name and adds the new records to the table.
           | 
           | The normalization (i.e., the name lookup) step joins the
           | NAME_TABLE that now contains the new names and performs the
           | name to id conversion with the join query.
           | 
           | Realizing that the UNION, MINUS and even nasty 8-way joins
           | can be done by the database engine way faster than
           | application code was eye-opening. I never feared table scans
           | after that. Discovering that the reads (i.e., SELECTs) and
           | writes (i.e., INSERTs) can be done in parallel with
           | optimization hints such as the APPEND hint was a superpower.
           | 
            | Using patterns such as TRUNCATE TABLE (for staging tables)
            | at the top of the SQL script made the scripts idempotent,
            | i.e., we could trivially run the script again. Subsequent
            | runs will not generate new sequence numbers for existing
            | names. With some careful organization of the statements,
            | this became rigorous.
           | 
           | Although I haven't shown here, we used to do the entire
           | normalization process in staging tables and finally do a copy
           | over to the main table using a final INSERT / SELECT
           | statement.
           | 
           | My Oracle SQL-fu is a bit rusty. Apologies for any syntax
           | errors.
        
           | wswope wrote:
           | The implication is that the PL/SQL job was operating rowwise.
           | 
           | Batch processing, backed by set theory in this case, is far
           | more efficient than rowwise operations because of the
           | potential for greater parallelism (SIMD) and fewer cache
           | misses.
           | 
           | E.g., if inserting into a table with a foreign key pointing
           | to the user table, batch processing means you can load the
           | user id index once, and validate referential integrity in one
           | fell swoop. If you do that same operation rowwise, the user
           | id index is liable to get discarded and pulled from disk
           | multiple times, depending on how busy the DB is.
        
             | lazide wrote:
             | Also, each insert will have to acquire locks, verify
             | referential integrity, and write to any indexes
             | individually. This is time consuming.
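              | 
              | (A small sketch of the batch-vs-rowwise point, using
              | SQLite as a stand-in for the Oracle/Postgres systems
              | above; table names are made up. One transaction pays the
              | lock/journal overhead once, and the FK check runs with the
              | parent index hot in cache.)

```python
# Batch insert under one transaction vs. rowwise autocommit inserts.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    user_id INTEGER REFERENCES users(id))""")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, f"user{i}") for i in range(100)])

rows = [(None, i % 100) for i in range(10_000)]

# Rowwise (slow): each statement is its own transaction, so locks are
# taken and the journal is flushed per row:
#   for r in rows:
#       conn.execute("INSERT INTO orders VALUES (?, ?)", r)
#       conn.commit()

# Batch (fast): one transaction; referential integrity is still checked
# for every row, but against an index that stays cached throughout.
with conn:
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
```

              | The same shape applies to the grandparent's Oracle job:
              | set-based statements are the batch path, the PL/SQL loop
              | was the rowwise path.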
        
         | gopher_space wrote:
         | > loaded into a staging table from a flat file, performing the
         | normalization and inserts into the destination tables.
         | 
         | This is how I've always done things, but I'm starting to think
         | decoupling normalization from the entire process might be a
         | good idea. This would have simplified design on a few projects
         | (especially the ones that grew "organically") and seems like it
         | would scale nicely.
         | 
         | I'm starting to look at normalization as a completely separate
         | concern and this has led to some fun ideas. E.g. if a system
         | operated on flat files that were already normalized it could
         | save time writing tools; you're at the command line and
         | everyone involved can get their grubby mitts on the data.
        
        | cirrus3 wrote:
        | It's never really explained why adding an index upfront is such
        | a burden.
        | 
        | I don't really understand how the downside of adding an index
        | almost ever outweighs the potential downsides of not adding an
        | index. I'm sure there are some cases, and that's what I was
        | expecting to see in this article, but it just seems to boil down
        | to being so lazy that he wants to avoid typing a few more
        | characters.
        
       | paxys wrote:
       | I don't get it...they are just bragging about writing terrible
       | code? There isn't a single reason given in the article about why
       | a full scan is better than an index other than "I don't need
       | these fancy index things, I'm perfectly happy with my 200ms query
       | time".
       | 
       | Well add the index anyway and add an extra 195ms sleep() call if
       | you really want to. It's still better for your hardware and
       | electricity bill.
        
          | mrits wrote:
          | While I'm not sure why you moved there recently, I can see why
          | you wouldn't have noticed how nice it used to be. As I said,
          | its decline already started a decade ago.
        
       | grumple wrote:
       | The HN title says something different than the blog post title /
       | intent. The HN title is incorrect and misleading imo.
       | 
       | The blog post can be summarized as "you don't need indexes on
       | tiny tables".
        
       | mastax wrote:
       | > I recently came across someone maintaining a 0.5GB full text
       | index to support searching their shell history, 100k commands. I
       | use grep on a flat file, and testing now it takes 200ms for a
       | query across my 180k history entries.
       | 
        | I just started using a tool I found on HN that stores my bash
        | history and replaces C-r; it's written in Rust and uses SQLite,
        | of course. But it'll randomly cause commands to pause for a few
        | seconds before executing (I think the hook that writes the
        | command to history gets blocked). Probably some simple bug, but
        | I could just have my command history in a flat file, use rg or
        | fzf in a 5-line shell function, and have no lag.
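        | 
        | (A sketch of that flat-file approach, here as a Python function
        | rather than a shell one; the path and history format are
        | assumptions. It's a pure full scan, no index anywhere.)

```python
# Full-scan search over a plain-text shell history file, grep-style.
import re


def search_history(path, pattern, limit=20):
    """Return the most recent history lines matching a regex."""
    rx = re.compile(pattern)
    with open(path, errors="replace") as f:
        hits = [line.rstrip("\n") for line in f if rx.search(line)]
    return hits[-limit:]  # newest last, like tail | grep
```

        | At the 100-200k-line scale the article mentions, a scan like
        | this finishes in well under a second on current hardware.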
        
         | mastax wrote:
         | Okay getting real off-topic now (welcome to HN) but this is not
         | what I expected:
         | 
         | https://github.com/ellie/atuin/issues/952#issuecomment-15378...
         | https://github.com/openzfs/zfs/issues/14290
         | 
         | Apparently when SQLite calls `ftruncate` on the `-shm` file,
         | ZFS will sometimes block for several seconds. Maybe it's
         | waiting for txg timeout? Strange.
        
         | scarmig wrote:
         | Can you share the tool? I've been thinking of making something
         | similar for fun backed by a suffix array, interested in what
         | else is out there.
        
           | wswope wrote:
           | https://github.com/ellie/atuin
        
       | chubot wrote:
       | Anyone remember gsearch by Jeff Dean circa 2006 at Google? It was
       | a full scan regex search, using 40-80 machines in parallel, not
       | an index
       | 
       | Also sourcerer was a full scan of VCS metadata, on a single
       | machine. i.e. if you listed all the commits by a certain person,
       | it would just do a for loop in C++ going through about 1-2 M
       | commits. They were just C++ structs in memory, and it was
       | probably submillisecond times, even with hardware at the time
       | 
       | I remember reading those 2 programs and being impressed that they
       | were both very fast, and full scans! (gsearch got slow later as
       | the codebase grew, but the first version was a <1 second
       | distributed grep of the whole Google codebase.)
        
       | SoftTalker wrote:
       | Indexing doesn't add much for very small tables. It also doesn't
       | really help much if the cardinality of the data is low.
       | 
       | I usually index any columns that are used in joins, or frequently
       | used in predicates.
        
       | nubinetwork wrote:
       | Indexes got me from 15 seconds down to under a second. I'll
       | always use them even if they wind up being as much space as the
       | actual data. MySQL is just weird that way.
       | 
       | Edit: believe me, I tried it all... refining my queries, multiple
       | tables, bigger hardware, faster disks, changing all the
       | tuneables... nothing was as good as adding a whack-tonne of
       | indexes.
        
       | ydnaclementine wrote:
       | What is the rule that you use on which fields to index? I've been
       | following the general when adding fields: if the field is
       | mentioned in a sql WHERE, ORDER BY, or GROUP BY
       | 
       | EDIT: JOINs
        
         | jdmichal wrote:
         | The answer will heavily depend on what you're doing.
         | 
         | CRUD patterns where you're pulling out a small group of rows
         | from several tables, you probably want indexes on `JOIN`
         | clauses so that you can quickly find those links using nested
         | loop index scans. And maybe a few indexes on common ways to
         | find records.
         | 
          | Analytics patterns where you're aggregating a large number of
          | rows, indexing `JOIN`s won't help you, because you really want
          | hash or merge joins anyway. Instead, tune your indexes for
          | `WHERE` clauses, because you want filtered index scans
          | instead of table scans on the reads.
         | 
         | You also want to learn about cardinality and selectivity.
         | Ideally your indexes are organized by selectivity.
        
         | AnotherGoodName wrote:
         | Don't overthink that part. Just rely on EXPLAIN. It will tell
         | you if the query had an index to use or if it scanned. You'd be
         | surprised when a DB does/doesn't use an index if you try to
         | figure it out yourself.
         | 
          | A run book for any org writing queries should be:
          | 
          | 1) Write the query.
          | 
          | 2) Before shipping, run the query through EXPLAIN. Did it scan
          | or did it use an index?
          | 
          | 3) If it scanned, can you rewrite it to use an existing index?
          | (The usual answer, honestly.)
          | 
          | 4) If not, create the index.
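          | 
          | (The runbook above, sketched with SQLite's EXPLAIN QUERY PLAN
          | since it's easy to run anywhere; Postgres/MySQL EXPLAIN has
          | different output but the same idea. Table and index names are
          | made up.)

```python
# Check whether a query scans or uses an index before shipping it.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INT, ts INT)")


def plan(sql):
    # EXPLAIN QUERY PLAN rows carry the human-readable detail in the
    # last column ("SCAN ..." or "SEARCH ... USING INDEX ...").
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(r[-1] for r in rows)


query = "SELECT * FROM events WHERE user_id = 42"

# Step 2: no index yet, so the planner falls back to a full scan.
assert "SCAN" in plan(query)

# Step 4: create the index and confirm the plan changed.
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
assert "USING INDEX idx_events_user" in plan(query)
```

          | The two asserts are exactly the runbook's step 2 and step 4
          | checks, just automated.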
        
          | ksherlock wrote:
          | You should also consider how unique that field is. If half
          | your widgets are blue and half your widgets are red, indexing
          | the color is probably not helpful (i.e., the query planner
          | will ignore the widget_color index and do a full table scan).
          | A rule-of-thumb number I vaguely recall is 10%.
        
            | wiredfool wrote:
            | My recollection, in Postgres anyway, is that low-cardinality
            | indexes aren't useful, because it doesn't take into account
            | which side of the 1%/99% split you're on when deciding to
            | use the index.
            | 
            | What is useful is a partial index on the values making up
            | the small percentage, because then it will use that index
            | when it matches the query, but not in the majority case.
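            | 
            | (A sketch of the partial-index idea, using SQLite for a
            | runnable demo; Postgres uses the same CREATE INDEX ... WHERE
            | syntax. Table and column names are invented.)

```python
# Partial index over only the rare value: the planner uses it for the
# ~1% case and falls back to a plain scan for the majority case.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, done INT)")
# ~1% of rows are pending (done = 0), the rest are done.
conn.executemany("INSERT INTO jobs VALUES (?, ?)",
                 [(i, 1 if i % 100 else 0) for i in range(1, 5001)])

# Index only the rare value.
conn.execute("CREATE INDEX idx_jobs_pending ON jobs (done) WHERE done = 0")
conn.execute("ANALYZE")


def plan(sql):
    return " ".join(r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))


# Query matching the index predicate: the tiny partial index is used.
assert "idx_jobs_pending" in plan("SELECT * FROM jobs WHERE done = 0")

# Majority case: the index doesn't apply, so it's a full scan.
assert "SCAN" in plan("SELECT * FROM jobs WHERE done = 1")
```

            | The index also stays tiny (only the pending rows), which is
            | the size win the follow-up comment alludes to.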
        
             | anarazel wrote:
             | > My recollection in postgres anyway is that low
             | cardinality indexes aren't useful, because it doesn't take
             | into account which side of the 1/99% you're on when
             | determining to use the index.
             | 
              | It does take that into account. Demo:
              | 
              |   =# CREATE TABLE low_cardinality AS
              |        SELECT generate_series(-10, 1000000) < 0 AS col;
              |   =# CREATE INDEX ON low_cardinality (col);
              |   =# ANALYZE low_cardinality;
              | 
              |   =# EXPLAIN SELECT * FROM low_cardinality WHERE col = false;
              |                                  QUERY PLAN
              |   -------------------------------------------------------------------------
              |    Seq Scan on low_cardinality  (cost=0.00..14425.11 rows=1000011 width=1)
              |      Filter: (NOT col)
              |   (2 rows)
              | 
              |   =# EXPLAIN SELECT * FROM low_cardinality WHERE col = true;
              |                                  QUERY PLAN
              |   ---------------------------------------------------------------------------------------------------
              |    Index Only Scan using low_cardinality_col_idx on low_cardinality  (cost=0.42..4.44 rows=1 width=1)
              |      Index Cond: (col = true)
              |   (2 rows)
        
               | anarazel wrote:
               | Forgot to say: Obviously a partial index is going to be
               | considerably better, performance and size wise.
        
       | raldi wrote:
       | When I wrote Reddit Gold (now called Reddit Premium) I
       | intentionally left the paying-user table index-free, looking
       | forward to the day it would become necessary.
       | 
       | It hadn't happened by the time I left the company, even though
       | sales were exceeding expectations.
        
         | hardware2win wrote:
         | What was the reasoning?
        
           | raldi wrote:
           | I had read a Joel Spolsky post about how they used to get an
           | email every time they made a sale ("Ding!") and the day they
           | finally turned it off was full of pride. (I did that too,
           | originally not even filtered out of my inbox.)
        
         | bcrosby95 wrote:
         | So no primary key or uniqueness constraints?
        
           | Macha wrote:
           | Unlikely, given Reddit's past schema design. One table of
           | "things", and then another table of attributes of those
           | things in a entity,key,value format.
           | 
           | https://kevin.burke.dev/kevin/reddits-database-has-two-
           | table...
        
           | ComodoHacker wrote:
           | Modern DB engines can enforce Unique or PK constraints
           | without indexes. Yes, they perform scans.
        
             | tomnipotent wrote:
             | Every major RDBMS requires an index for both constraints.
        
               | hinkley wrote:
               | FK constraints however are a pretty common gotcha. You
               | have a table that allows deletes, and every table that
               | has a column pointing to that table gets scanned on each
               | delete operation.
               | 
                | So you have to add an index on that column, or start
                | talking about tombstoning data instead of deleting it,
                | in which case you may still need the FK index to search
                | out references to an old row to replace with a new one.
        
               | tomnipotent wrote:
               | An FK also adds the burden of keeping a copy of each
               | related row during the lifespan of a transaction. This
               | means small-but-frequently-updated tables that are joined
               | to a lot can be a source of unexpected headaches.
        
         | BeefWellington wrote:
         | It's not really index-free if a separate table has constraints
         | of any kind.
        
          | cactusplant7374 wrote:
          | Indexes are such an easy thing to add. I don't get it. Seems
          | like under-optimization when you consider the tradeoffs.
        
           | VWWHFSfQ wrote:
           | > Indexes are such an easy thing to add.
           | 
           | Indexes can have catastrophic consequences depending on the
           | access patterns of your table. I use indexes very sparingly
           | and only when I know for sure the benefit will outweigh the
           | cost.
        
           | jefftk wrote:
           | Consider the example from the post of searching your shell
           | history. If I don't need indexes I can just have it all in
           | one flat file and use grep. Switching to a tool that supports
           | indexing adds a lot of potential complexity and ways things
           | could go wrong.
           | 
           | Or consider the example of a developer querying frontend
           | logs: queries are many orders of magnitude less common than
           | writes, and (at the scale I used to work at) an index would
           | be incredibly expensive.
        
             | wswope wrote:
             | I take the point of your examples as written in the post,
             | but I think both of those are a bad comparison to the
             | Reddit Premium subscriber table being discussed, because:
             | 
             | - We're already using a database, so there's very minimal
             | added complexity
             | 
             | - This is an extremely hot table getting read millions of
             | times a day
             | 
             | - The scale of the data isn't well-bounded
             | 
             | - Usage patterns of the table are liable to change over
             | time
        
               | raldi wrote:
               | It wasn't an extremely hot table, and the scale of the
               | data was well-bounded insofar as the only way for it to
               | become non-negligible would be for us to have had revenue
               | beyond our then-wildest dreams.
        
           | SoftTalker wrote:
            | They add overhead to updates and inserts. If you don't need
            | them, why take that hit? Know your data.
        
             | EGreg wrote:
              | Wait a second: the whole argument is that you don't mind
              | O(N) overhead in searching on every request, but you mind
              | O(log N) overhead for updates and inserts?
        
               | titaniczero wrote:
               | Well, my guess is that updates and inserts are much more
               | frequent than searches in their use case. You're assuming
               | a balanced frequency for these operations and it hardly
               | ever happens.
        
               | wswope wrote:
               | This is a read-heavy workload per the OP:
               | https://news.ycombinator.com/item?id=36071799
        
             | toast0 wrote:
             | If the table is small enough that indexes aren't critical
             | for reads, they probably don't impact the rate of writes
             | either.
             | 
              | Of course, if the objective is large-scale bulk inserts,
              | with only very occasional selects, then yes.
        
             | tomnipotent wrote:
              | Considering how much data is written to support
              | redo/undo/WAL logs, a single index on user id is very
              | little overhead.
        
             | holoduke wrote:
              | My experience is that it's usually better to add indices
              | where they're expected to be needed beforehand. Adding
              | indices on large production tables with millions of rows
              | can bring down databases for hours or even days in the
              | worst case. It's tricky to manage.
        
           | raldi wrote:
           | If you have nothing to optimize yet, how will you know if
           | you're optimizing it right?
        
             | wswope wrote:
             | Typically, you create a large number of test records and
             | see how your expected access patterns perform.
        
         | eterm wrote:
         | That goes against my intuition, because the performance would
         | be more impacted (beneficial to have an index) where you have
         | very few bits set on a boolean field.
         | 
         | If you have 10m users and 100 have "IsPremium = 1" then an
         | index massively speeds up anything with WHERE IsPremium = 1,
         | compared to if the data is fairly evenly spread between
         | IsPremium = 1 and IsPremium = 0 where an index won't be much
         | benefit.
         | 
         | So low sales would increase the need for an index.
         | 
         | That said, I'm assuming searching for "Users WHERE IsPremium =
         | 1" is a fairly rare thing, so an index might not make much
         | difference either way.
        
           | abtinf wrote:
           | > paying-user _table_ index-free
           | 
           | Presumably, that means they created a new table intended to
           | be straight-joined back to the user table. No need to search
           | a column.
        
             | Vvector wrote:
             | Joining a table on a column without an index will require a
             | linear scan of the column. You almost always want an index
             | on JOIN columns
        
               | abtinf wrote:
                | Two relations with a one-to-one relationship imply
                | identical primary keys. Most implementations default to
                | creating an index on the primary key.
               | 
               | In this case, the premium user table only needs the user
               | id primary key/surrogate key because it only contains
               | premium users. A query starting with this table is
               | naturally constrained to only premium user rows. You can
               | think of this sort of like a partial index.
               | 
               | One consequence of this approach is that queries
               | filtering only for _non_ -premium users will be more
               | expensive.
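A minimal sketch of that layout (all names invented, SQLite as a stand-in): a narrow premium table whose primary key is the user id, so the join back to the user table is index-backed on both sides with no extra index.

```python
import sqlite3

# Sketch of the separate-premium-table layout described above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("CREATE TABLE premium_users (user_id INTEGER PRIMARY KEY)")
con.executemany("INSERT INTO users VALUES (?, ?)",
                [(i, f"user{i}") for i in range(1000)])
con.executemany("INSERT INTO premium_users VALUES (?)",
                [(i,) for i in (3, 500)])

# Starting from premium_users naturally restricts the query to premium
# rows; the join to users goes through users' primary-key index.
rows = con.execute("""
    SELECT u.name
    FROM premium_users p
    JOIN users u ON u.id = p.user_id
""").fetchall()
print(sorted(rows))  # [('user3',), ('user500',)]
```

Queries for *non*-premium users would need an anti-join against this table, which is the extra cost the comment notes.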
        
               | lazide wrote:
               | The primary key (userid?) will almost always be indexed
               | implicitly. You'd usually have to go out of your way to
               | avoid it.
               | 
               | So it was probably being joined by an indexed column, but
               | without an explicitly defined index.
        
           | raldi wrote:
           | We had lots of indexes on the user table; there was a
           | separate table with details of paying users.
        
           | holoduke wrote:
           | Having a separate table containing all the premium users is
           | different than an extra column in the normal user table. In
           | the extra table example you don't really need an index (in
           | premium user table) if you have only 100 premium users
        
           | pc86 wrote:
            | You wouldn't need any of these WHERE clauses if you have all
            | the premium users in one table. Even managing the users in
            | both tables is pretty trivial when you just add someone as
            | they pay and remove them as the premium account expires.
        
           | [deleted]
        
         | dmak wrote:
         | Wow really? This seems really surprising to me.
        
           | twoodfin wrote:
           | Speculating, but presumably this table only needed
           | consultation during creation of a new user session. That's
           | probably a pretty heavyweight operation to begin with, so
           | adding a scan of a few KB (user IDs being modestly sized
           | integers) for every thousand currently paying users is NBD.
        
             | raldi wrote:
             | Nope; it was used every time we needed to look up whether
             | someone was a subscriber, which was cached for some things
             | but not all of them.
        
       | mrguyorama wrote:
       | It is infinitely easier to drop an index on a table that is
       | causing poor performance than it is to add a good index to a
       | table that suddenly needs one.
       | 
       | Index everything, drop those that make things worse.
        
       | winrid wrote:
       | Don't add indexes. Add them later when things are slow and
       | everything is on fire!
       | 
       | Source: guy that gets hired to fix database stuff on fire.
        
         | [deleted]
        
       | [deleted]
        
       | syngrog66 wrote:
       | rule of thumb: O(n) latency is almost always fine when (n==1)
        
       | jameshart wrote:
       | "We'll add an index when it gets slow" is saying "screw the guy
       | who's on call the night this system starts timing out".
       | 
       | Invisible cliffs in your code that it will fall off at some
       | unknown point in the future are the antithesis of engineering.
       | 
       | If you deliberately aren't implementing indexing, know at what
       | point indexing would begin to matter. Put in place guard rails so
       | the system doesn't just blow up unexpectedly.
       | 
       | There are 100% cases where _you shouldn't use indexes_. That's
       | not the same thing as _being able to get away without an index_.
        
         | lmm wrote:
         | Indices are far more likely to cause a cliff than to remove
         | one, IME. Full table scans are linear and grow linearly.
         | Indices are crazy fast until one day they're suddenly not.
        
         | themerone wrote:
         | Usually things will slow down gradually or fail right away
         | under production loads.
        
           | magicalhippo wrote:
           | Unless production loads are non-uniform in time, think Black
           | Friday or similar.
        
             | jameshart wrote:
             | Yup. Quarter end reporting has been going swimmingly for
             | years but this time the orders query is hitting a
             | connection timeout and the numbers are due tomorrow.
             | 
             | That threshold doesn't gradually creep up on you, it hits
             | you all at once at the worst possible time.
        
           | pixl97 wrote:
           | W.U.C.T
           | 
           | Works until critical threshold.
           | 
            | In enterprise software I've noticed things tend to work in a
            | WUCT'ed up manner. Things may slow down over time, but no one
            | complains about it because "the software is just slow"; then
            | suddenly one day you hit the timeout of some higher layer and
            | the software is completely and utterly broken.
        
             | marcosdumay wrote:
             | Well, this isn't a valid reason. It's not the software that
             | is causing the WUCT.
             | 
              | If you let WUCT run free in any complex process, you are
              | just asking for everything to break all the time and for
              | people to do nothing but fire-fighting. So you are much
              | better off dealing with the WUCT causes instead of
              | insisting that software or any other component scale
              | forever. (Because they never will, and you will just make
              | something else break.)
             | 
             | WUCT is one of the main reasons why we monitor system
             | operations.
        
             | bawolff wrote:
             | Everything has a critical threshold. Even your perfect db
             | schema will still eventually run out of disk. There are no
             | magic solutions that work on all scale levels.
        
           | david422 wrote:
            | But when you have 100 places all adding little slowdowns it
            | can be difficult to realize.
        
         | jefftk wrote:
         | _> screw the guy who's on call the night this system starts
         | timing out_
         | 
         | This was a very small billing practice, and that person was
         | going to be me. I thought then, and still think now, that I
         | made a reasonable trade off between what would be work at the
         | time and potential future urgent work.
         | 
         | Additionally, this wasn't the sort of thing that would fail
         | suddenly when you hit a critical point. Instead, running full
         | text searches (a very small fraction of what the database was
         | handling) would just slow down in a way that would have been
         | very clear to the users.
        
           | jameshart wrote:
           | Stealing time from future you also doesn't pay off. Future
           | you wants to take vacations and change jobs and have family
           | events and sleep and stuff.
           | 
           | It doesn't take much work to do the back of envelope math:
           | 
           | How long should these queries take? Less than two seconds?
           | 
           | How much slower do they get as record counts increase? 100ms
           | every 250,000 records? Okay, so this will become intolerably
           | slow when the record count hits about 5 million. When's that
           | going to happen? Never? Then we're done here. Within a few
           | months? Well let's plan for that now. Within a few years?
           | Maybe not within the lifetime of this system? Judgement call.
           | Let's put something in place that forces us to at least look
           | back at this before it gets bad. Even if that's just a google
           | calendar reminder.
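The back-of-envelope math above, written out with the numbers from the comment:

```python
# Back-of-envelope from the comment: the budget is ~2 seconds, and the
# query slows by ~100 ms for every 250,000 additional records.
budget_ms = 2_000
slope_ms_per_row = 100 / 250_000

threshold_rows = budget_ms / slope_ms_per_row
print(f"{threshold_rows:,.0f} records")  # 5,000,000 records
```

If your growth projections say you'll never reach that count, you're done; if you'll reach it in months, plan now.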
        
             | jefftk wrote:
              | When building systems we are always making trade-offs
              | between the present and the future. It's only "stealing
              | time from future you" when you make a bad trade-off;
              | otherwise it's prioritization.
             | 
             | In this case, leaving out an index for full text search
             | meant I didn't need to maintain the software for that index
             | either, which would have been stand-alone at the time I was
             | building this. Possibly this even was a choice that was, in
             | expectation, time-saving for future me.
        
         | plugin-baby wrote:
         | > Invisible cliffs in your code that it will fall off at some
         | unknown point in the future are the antithesis of engineering.
         | 
         | On the other hand, they are the essence of _software_
         | engineering.
        
         | [deleted]
        
         | hinkley wrote:
         | As the number of cliffs increases, so does the likelihood that
         | you will hit a second cliff in the middle of fixing the first
         | one. At some point you hit a Kessler Syndrome situation where
         | people start quitting and the people who should have fixed the
         | problem in the first place aren't around anymore.
         | 
         | Nobody wants to stick around for a lecture they can see coming.
        
         | bob1029 wrote:
         | > Invisible cliffs in your code that it will fall off at some
         | unknown point in the future are the antithesis of engineering.
         | 
          | I think this describes the general shape of a problem seen in
          | any engineering domain. After a certain point in time (i.e.
         | unknown), we can no longer guarantee certain properties of the
         | system. This is why we add all kinds of qualifiers and
         | constraints in our discussions with the customer.
         | 
         | Certainly, you can make some predictions about how tables will
         | grow over time, but there are also some potential headaches
          | that can be incurred if you arbitrarily add indexing. "We
          | _might_ need to expire records to manage growth," so you go
          | ahead and index that CreatedUtc column. Turns out, the usage
         | patterns of your app changed such that that table is now
         | experiencing 20x more insert volume than you could have ever
         | anticipated and your prediction is now a gigantic liability.
         | With 20x insert volume you would have _never_ indexed that
         | column. You would have used some alternative path involving
          | temporary tables, etc. Now, you are stuck with essentially the
          | same problem - a judgement call at 2am regarding whether or not
          | you should drop an index while the rest of your team is in bed.
         | 
         | Since I am unable to predict the future I feel like waiting
         | until the _actual_ problem arises and then dealing with it at
         | that moment is the best option. Strategically, _not_ having an
         | index will likely fail more gracefully than if an index is
         | bottlenecking requests during peak load (i.e. instantaneous vs
         | cumulative data volume).
        
           | kevan wrote:
           | The key to sleeping at night is to add metrics and alarms
           | near those cliffs or 'edges of the map' so they aren't
           | invisible anymore. It's hard to anticipate every angle to
           | this but in any case where you're making this kind of
           | assumption or tradeoff it's a good idea.
           | 
           | A simple example is setting an alarm on your max load test
           | traffic. When you get that alarm you know your system is now
           | operating in unknown territory and it's probably time to do
           | another load test or at least take a close look at scaling.
        
           | jorblumesea wrote:
            | Largely agree with this take; it often becomes "add an index
            | because the query _might_ be slow" without much discussion of
            | trade-offs around query patterns. There's a lot of "what if"
           | over engineering that happens where you end up in
           | hypothetical scenarios.
           | 
           | Look at your real world use cases, query patterns and think
           | hard about it.
        
           | jameshart wrote:
           | Not advocating for building indexes for queries you might
           | later want to run.
           | 
           | But for the queries you are building today? Don't build them
           | such that they will gradually slow down over time.
        
           | [deleted]
        
       | captainmuon wrote:
       | The inverse is also true, often you have linear or even quadratic
       | lookup times when something like an index can give you the result
       | "immediately".
       | 
       | I wish languages other than SQL had a concept of adding an index.
       | Imagine a variant of Python's list or C++'s vector where you can
        | add indices on arbitrary fields of the contents, like a more
       | flexible dict. Something like (pseudocode)
       | books = List<Book, indices=[Book.title, Book.author]>
       | 
        | and then you could look up by title or author without traversing
        | the whole list. Sure you can just use a separate dict, but then
       | have to keep both in sync, it is harder to bind it to UI
       | controls, and so on.
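A minimal sketch of that idea in Python (all names invented): a list wrapper that maintains one hash index per chosen attribute, kept in sync on append.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Book:
    title: str
    author: str

class IndexedList:
    """Sketch of the idea above (names invented): a list that maintains
    a hash index per chosen attribute, kept in sync on every append."""

    def __init__(self, *index_fields):
        self._items = []
        self._indices = {f: defaultdict(list) for f in index_fields}

    def append(self, item):
        self._items.append(item)
        for f, idx in self._indices.items():
            idx[getattr(item, f)].append(item)

    def lookup(self, field_name, value):
        # O(1) average instead of traversing the whole list
        return self._indices[field_name].get(value, [])

books = IndexedList("title", "author")
books.append(Book("Dune", "Herbert"))
books.append(Book("Hyperion", "Simmons"))
print([b.title for b in books.lookup("author", "Herbert")])  # ['Dune']
```

A real version would also need to handle removal and in-place mutation of items, which is exactly the keep-both-in-sync burden the comment mentions.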
        
         | magicalhippo wrote:
         | For C++ there's Boost's multi_index[1].
         | 
         | [1]:
         | https://www.boost.org/doc/libs/1_82_0/libs/multi_index/doc/i...
        
           | Mailtemi wrote:
            | I used multi_index a lot in the past. However, since I also
            | frequently use SQLite, I've decided to use in-memory SQLite
            | exclusively instead of multi_index. If there's a problem,
            | temporarily pointing SQLite at a file makes it a lot easier
            | to track down the issue. And when multi_index-style patterns
            | become too complex, it's natural to wrap them in a unit test
            | built on SQLite instead.
        
         | jmcomets wrote:
          | In my experience these types of abstractions end up bloated
          | with too many concerns, which adds maintenance overhead. It's a
          | lot simpler to add a `HashMap<Book.title, Book>` and cover it
          | with tests than to use an abstraction.
         | 
         | I want languages to do less magic, not more, since I read code
         | more than I write it.
        
       | vertica3525 wrote:
       | [flagged]
        
       | PaulHoule wrote:
       | One of the problems of the RDF and SPARQL world is that the
       | standard index structure considers the 6 possible permutations of
       | ?subject ?predicate ?object.
       | 
       | You can answer many queries with good efficiency with those
       | indexes but building them gets brutal when you are handling
       | billions of triples.
       | 
        | I went to a talk at Dreamforce years ago where the CTO explained
        | their database architecture, and it is interesting that the core
        | table in Salesforce is literally a collection of triples. They
        | have a smart optimizer that builds views, materialized views,
        | indexes, etc. in Oracle based on the queries you run.
       | 
        | I built a bot that would download a lot of records from an
        | organization, build a "machine learning" model (designed by a
        | geospatial analyst; we didn't call it that in those days) that
        | automatically assigns opportunities to salespeople, and then
       | pushes the assignments into the system. The first time I ran the
       | big query in the test system it timed out, then I went to the
       | bathroom and when I came back the second time it worked
       | perfectly. When it went to production exactly the same thing
       | happened.
        
         | cdcarter wrote:
         | > it is interesting that the core table in Saleforce is
         | literally a collection of triples
         | 
          | This is... not entirely true. Most of the data for each
          | "Object" in Salesforce is stored in a series of columns on a
          | single table. These are called "flex columns" in the official
          | whitepaper describing the architecture. [0]
         | 
         | However, _foreign key relationships_ are indeed stored in a
         | table that acts like a collection of triples. This is used in
         | place of native foreign keys, and is indeed deeply integrated
         | with the optimizer.
         | 
         | [0]:
         | https://www.developerforce.com/media/ForcedotcomBookLibrary/...
        
       | CathalMullan wrote:
       | Lately, I've been investigating 'serverless' databases
       | (PlanetScale, Turso) that charge based on the number of rows
       | read/scanned. As a result, instead of just relying on the
       | perceived performance of a query, I've started monitoring the
       | number of rows scanned, which has emphasized the importance of
       | indexes to me. Granted, the cost per rows read is minuscule
       | (PlanetScale charges $1 per billion!), but it's still something I
       | keep in mind.
        
       | magicalhippo wrote:
       | On the flip side, a missing index can bring down production.
       | 
       | We experienced that just a couple of weeks ago, where a missing
       | index and an unexpected 1000x increase in volume at a customer
       | brought the DB to its knees. Sure it was still serving queries
       | but at such a low rate it was effectively useless.
       | 
        | For queries that will be run more than once, I try to make sure
        | there's an index they can use on something fairly unique.
        
         | remus wrote:
         | > ...an unexpected 1000x increase in volume at a customer
         | brought the DB to its knees.
         | 
         | From an outside perspective it seems like the huge increase in
         | volume was more the issue! It sounds like an index helped a
         | lot, but it would also have added cost for all those customers
         | who didn't see that jump in volume.
        
           | magicalhippo wrote:
           | Well, of course the volume had something to do with it, but
           | adding the missing index meant the system could easily handle
           | that volume.
           | 
            | The other customers pay for that index as well, of course,
            | but either the volume is low enough that the cost is trivial
            | or it's large enough that they too saw an increase in speed.
        
         | fnordpiglet wrote:
          | Was this because of a missing index, or because the optimizer
          | vomited on the queries without an index to provide it
          | statistics? My gripe with RDBs is that the query optimizer is
          | generally non-deterministic with respect to the query, i.e. it
          | can make a great decision with certain statistics and flip to a
          | terrible one with slightly different statistics, even though
          | the original query would have performed basically the same.
          | 
          | I'd rather have a database whose query performance degrades
          | gracefully and predictably with scale than some black-magic
          | mumbo jumbo that wakes me up in the middle of the night because
          | it lost its statistical little puny pea brain simply because it
          | couldn't compute a bounded expectation any more.
        
           | magicalhippo wrote:
            | In this case it was a missing index, on a query run on every
            | order line when saving, a.k.a. a lot.
           | 
           | It had gone under the radar because the volume with our other
           | customers had been low enough that the table scan of the
           | queried table wasn't noticed.
           | 
           | As mentioned a customer suddenly got 1000x the volume and it
           | quickly became an issue.
           | 
            | But yeah, we have a job running on the weekends to
            | recalculate statistics on key tables, as we've had issues
            | with that grinding production to a halt before.
           | 
           | And recently I sped up a query by 100x by removing a filter
           | from the where clause that for some weird reason caused the
           | DB to run a really poor plan. It was just a simple check
           | intended to filter out some duplicates in a few edge cases,
           | but couldn't find a way to make the DB run it as a post-
           | predicate. Moved it to my code for the 100x performance
           | win...
        
           | londons_explore wrote:
           | I really want the optimizer to make an estimate of the CPU/IO
           | to complete a query. Then, during the query, if that much
           | effort has been expended and we haven't yet completed the
           | query, then update the estimate. If the updated estimate now
           | shows that the query plan is no longer quickest, then abort
           | the query and restart with a different plan.
           | 
           | Years ago I forked postgres and tried to implement the above.
           | Initial results were promising - there were a good chunk of
           | queries that ended up taking a different plan, and sometimes
           | returning 100x quicker.
           | 
            | Alas, the postgres codebase is hideously complex, and
           | implementing the above to be production grade would be many
           | months work - and, due to the way postgres streams results to
           | the client, might have actually required a change to the wire
           | format.
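A toy sketch of that abort-and-replan loop (everything here is invented; the hard part the comment describes is estimating mid-query progress inside a real engine, which this toy glosses over):

```python
def run_with_replanning(plans, estimate, execute):
    """Toy version of the idea above: try plans cheapest-estimate first,
    give each a budget proportional to its estimate, and fall back to
    the next plan if the budget runs out before the query completes.

    estimate(plan) -> estimated cost; execute(plan, budget) -> result,
    or None if the budget was exhausted mid-query."""
    for plan in sorted(plans, key=estimate):
        result = execute(plan, budget=2 * estimate(plan))  # some slack
        if result is not None:
            return result
    raise RuntimeError("every plan blew through its budget")

# Fake plans: the optimizer's favourite turns out to be far more
# expensive than estimated, so we abort it and fall back.
actual_cost = {"index_nested_loop": 500, "hash_join": 40}
estimates = {"index_nested_loop": 10, "hash_join": 30}

def execute(plan, budget):
    return f"rows via {plan}" if actual_cost[plan] <= budget else None

print(run_with_replanning(estimates, estimates.get, execute))
# rows via hash_join
```

The wasted work from the aborted first attempt is the price of recovering from a bad estimate, which is why this only pays off when misestimates are large.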
        
         | mumblemumble wrote:
         | I used to be one of the "keep the database running" people at a
         | high frequency trading firm. So, super high pressure; any
         | outage meant we were potentially hemorrhaging money at a brain-
         | melting pace.
         | 
         | My lessons from that experience were twofold. First, you can't
         | plan for sudden performance regressions ahead of time. The
          | number of stars that have to align for that anticipatory index
          | to actually help you is just too great. Even if you correctly
         | guessed that a given table might be affected by one in the
         | future, you'll never guess what fields it needs to include, and
         | in which order, to support whatever new query or query plan
         | change or whatever caused the problem. Second, Murphy's Law
         | dictates that any attempt at preventing future read performance
         | regressions by speculatively creating an index will end up
         | causing a write performance problem instead.
         | 
         | Better instead to just get in the habit of periodically
         | reviewing query plans for every query in the database. If you
         | know the scaling characteristics of the various operations and
          | the kinds of statistics changes that are likely to cause the
          | optimizer to choose different ones (and you should), then it's
         | easy enough to select for better scaling characteristics. This
         | is, incidentally, an excellent excuse for making the engineers
         | use sprocs or a clean data access layer or something instead of
         | a heavyweight ORM, so that you have some sensible way of
         | getting a warning before they change queries on you. You can't
         | effectively aim for a target that's bouncing around
         | erratically. Even better still - and only realistically
         | feasible if you somehow managed to win that bigger war I just
         | proposed - set up a tool to monitor query plans and send you an
         | alert when they change unexpectedly, so you can hear about the
         | problem when it's still small.
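The plan-monitoring idea can be sketched by fingerprinting EXPLAIN output (SQLite here as a stand-in; a Postgres version would hash the output of `EXPLAIN` instead, and function and index names below are invented):

```python
import hashlib
import sqlite3

def plan_fingerprint(con, sql):
    """Hash the shape of the query plan so a monitoring job can alert
    when a query's plan changes unexpectedly. Sketch using SQLite's
    EXPLAIN QUERY PLAN; a real deployment would hook the production DB."""
    rows = con.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    detail = "|".join(str(r[-1]) for r in rows)
    return hashlib.sha256(detail.encode()).hexdigest()

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)"
)
query = "SELECT id FROM orders WHERE customer_id = 7"

before = plan_fingerprint(con, query)          # full scan
con.execute("CREATE INDEX orders_customer ON orders (customer_id)")
after = plan_fingerprint(con, query)           # index search

print(before != after)  # True: the plan changed, so an alert would fire
```

Storing the fingerprint per query and diffing on a schedule is what turns a surprise plan flip into an alert rather than a 2am incident.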
        
           | winrid wrote:
           | Are you just trying to create employment for yourself? This
           | is amazingly bad advice. I literally get hired to fix these
           | setups by people that take this approach.
           | 
           | Yes, it's good to periodically review query plans.
           | 
           | But designing and planning for data retrieval is part of a
           | design process. And it's easy. Even Django since 2003 allowed
            | you to define indexes inside your models. Got a 100m-record
            | table and you're adding a column you want to query by
            | interactively? Roll out an index. This is normal day-to-day
            | software development...
           | 
            | The write overhead is wild speculation. 99% of businesses
            | are read-heavy.
        
             | lazide wrote:
             | It sounds like he's thinking of it from the DBA
             | perspective, where they have to react to sudden changes in
             | behavior from devs with little or no warning - since the
             | devs don't talk to them.
             | 
             | DBAs doing proactive index creation when they don't know
             | what the devs are doing is indeed futile.
             | 
             | The devs, however, should definitely be doing proactive
             | database design/planning whenever they do anything, since
             | they can (and will!) cause emergencies and downtime by not
             | considering the impact of how they are interacting
             | with/using the database.
             | 
             | If the devs are directly writing SQL, this is also
             | relatively easy to get them into doing. If they're using a
             | heavyweight ORM, it's nearly impossible to figure out what
             | SQL it's going to run sometimes (and difficult to even
             | trigger in a test), and so the devs often won't even try.
        
               | mumblemumble wrote:
               | Exactly this.
               | 
               | The company had actually banned new uses of ORM and was
               | in the process of eliminating existing usage of it while
               | I was there. We had discovered that teams that used ORM
               | had much higher production incident rates than teams that
               | didn't, and it was fairly directly attributable to lack
               | of understanding and predictability of what was happening
               | in the database.
               | 
               | Maybe not a huge deal if you're in a low-load situation.
               | But HFT requires the A game at all times because you
               | never know when some exciting news that causes trading
               | volume to increase 50-fold almost instantaneously might
               | happen.
               | 
               | For the record, I was a dev and not a DBA. But I did work
               | closely with the DBAs. And I was pretty irritated when I
               | found out ORM was banned, because it definitely made the
               | "writing new code" part of the job more laborious. But,
               | hey, learning experience - it turns out that it was the
               | right move. In the long run we were able to move faster
               | once we stopped having to deal with the blowback from
               | breaking things quite so often.
               | 
               | It's a little bit like when I play Mario Kart with my
               | kids. Why do I go so much faster than them? Mostly
               | because they are constantly pushing hard on the
               | accelerator button, while I ease off the gas and even hit
               | the brakes sometimes. They think speed management is
               | annoying. I think that I spend a lot less time bouncing
               | off of walls and giving precious coins to Lakitu.
        
       | pkulak wrote:
       | Just because the table scan is under some threshold doesn't
       | automatically make it better. If a table scan takes 250ms vs 0.01
        | for an indexed lookup, you're still gonna have to justify to me
       | why making silicon work that damn hard is worth even the
       | electrical use. Are you inserting and deleting so many rows that
       | maintaining the index is prohibitive? Do you have space concerns,
       | and are not able to keep the index in memory, or maybe even on
       | disk? The Bash history thing makes sense, because indexing is a
       | lot of work. But otherwise, just use an index and move on to the
       | next problem.
       | 
       | EDIT: Has anyone else forgotten to add an index, then noticed the
       | performance regression a year later? That's always fun, because
       | now adding the index could mean downtime, and even knowing if or
       | how much is difficult because putting that much data somewhere
       | else to test isn't trivial. No thank you.
        
         | eklitzke wrote:
         | For small enough tables doing a full scan will be faster than
         | an index. This is also true for regular non-database
         | applications like checking if an item is present in a small
         | collection: it's faster to do a linear scan over a small vector
         | than it is to do a lookup in a hash table or b-tree. With a
         | linear scan (whether it's for data on disk, or a data structure
         | in memory) you don't have to do hashing or branching (except
         | possibly to terminate the loop). With a binary search you
          | basically get the worst possible case for the branch predictor,
          | because if you're searching random values, whether you go right/left on
         | each recursion level is basically random. This is true for in-
         | memory data structures, and it's even more true on disk, since
         | disks (even SSDs) are especially optimized for linear scans
         | compared to random accesses.
         | 
         | It's been a while since I looked at this, but I think MySQL and
         | Postgres both take this into account and will let you add
         | indexes to small tables but will actually ignore them and do a
         | full table scan for all queries if the table is small enough.
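A minimal sketch of the in-memory version of this claim (Python; the container size and the timing loop are illustrative, not a benchmark of any particular database):

```python
import bisect
import timeit

def linear_find(xs, target):
    # Straight scan: no hashing, and the only branch is an easily
    # predicted "keep going" test plus the final equality check.
    for i, x in enumerate(xs):
        if x >= target:
            return i if x == target else -1
    return -1

def binary_find(xs, target):
    # Binary search: for random targets, each left/right decision is
    # essentially a coin flip for the branch predictor.
    i = bisect.bisect_left(xs, target)
    return i if i < len(xs) and xs[i] == target else -1

xs = list(range(0, 64, 2))  # a tiny sorted "table" of 32 rows
for t in (0, 30, 62, 63):
    assert linear_find(xs, t) == binary_find(xs, t)

lin = timeit.timeit(lambda: linear_find(xs, 40), number=100_000)
bin_ = timeit.timeit(lambda: binary_find(xs, 40), number=100_000)
print(f"linear: {lin:.3f}s  binary: {bin_:.3f}s")
```

At this size the two are close, and which one wins depends on the machine; the point is only that asymptotic complexity stops predicting the winner once tables are this small.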
        
           | pkulak wrote:
           | Yes, that's correct. If you do an explain for a tiny table,
           | as long as your stats are up to date, the index will be
           | ignored. In that case, it's there for insurance if the table
           | grows in the future.
        
           | josephg wrote:
           | What do you think the threshold number is where scanning a
           | list will outperform a hash table? 10 items? 100? 1000? For
           | what it's worth, for small data sets I agree with you. But I
           | think it's very hard to be calibrated correctly on what small
           | means here.
           | 
           | Re: binary search, the biggest cost to scanning data in a
           | database is loading that data from persistent storage. A few
           | mispredicted branches aren't going to matter much if it means
           | you can do fewer reads.
        
           | ignoramous wrote:
            | > _With a binary search you basically get the worst possible
            | case for the branch predictor, because if you're searching
            | random values, whether you go right/left on each recursion
            | level is basically random._
           | 
           | I think folks writing databases know a thing or two about
           | writing faster search and sort algorithms.
        
             | birdyrooster wrote:
             | Why do you think that? Seems like a hard problem.
        
             | anyfoo wrote:
             | I'm not sure what you are trying to say, given the context
             | of what you quoted was. Binary search has optimal worst
             | case runtime complexity, but for small tables it is, on
             | real computers, overall still slower than a linear scan,
             | which effectively has the _worst_ worst case runtime
             | complexity (unless you start to get pathological). This is
             | because the constant in front of the runtime complexity,
             | the one that O-notation explicitly ignores, starts to
             | matter: Branch prediction and cache locality come into
             | play.
             | 
             | What exactly do you mean with "writing faster search and
             | sort algorithms"?
        
           | mcqueenjordan wrote:
           | All of this looks accurate, but it's worth contextualizing:
           | this is an optimization that bears the most fruit in "local
           | frames of reference" -- the timescales where linear scans
           | beat index lookups are likely to be strictly dominated by the
           | network latency to transceive from the database. The
           | conclusion is then that the optimization ~only optimizes for
           | cases that effectively don't matter.
        
         | lazide wrote:
         | I agree - for 90% of cases.
         | 
         | There are situations where the indexes end up larger than, or
         | the same size as, the actual data and the query doesn't
         | meaningfully benefit from having the data indexed because, for
         | instance, all the data is going to be analyzed anyway, or the
         | search type doesn't index usefully with the normal index types
         | (like geo searches, or clustering type queries).
         | 
          | Adding indexes that never get used has real ongoing costs for
          | inserts/updates and schema changes, as it adds potentially
          | significant overhead to every operation and can make certain
          | schema operations impossible without downtime.
         | 
         | Foreign key columns, 'soft delete' columns (like deleted/not),
         | basic auditing type stuff (created on, updated on, etc),
         | 'unique' or external reference values like a order id or
         | whatever (even if not a primary/unique key in the schema),
         | basic numeric/analysis columns are almost always worth indexing
         | though, to your point.
         | 
          | Stuff that is not always a clear win without some real
          | thinking: freeform text fields, structured binary data (like a
          | PDF or image), geolocation data without a clear existing use
          | (it tends to require specialized index types, which are also
          | expensive to load/use), etc.
         | 
         | Many times some preprocessing is necessary anyway to convert
         | what you have to what you actually want, and putting that in a
         | secondary column to be indexed is far more valuable.
        
         | MaulingMonkey wrote:
         | Justify the dev time to save micro-pennies worth of electricity
         | to me instead.
         | 
         | A typical naive index won't help with my regular expression
         | based queries, which aren't easily accelerated by an index. Or
         | given an in-memory index, you've just increased memory use from
         | O(1) to O(N), and I'll OOM on large files. Perhaps you'll throw
         | a database at the problem, complicating I/O (especially when
         | the data is generated/accessed by third party tools that aren't
         | database based), tying me to yet another library's update
         | cycle, and perhaps forcing me to tackle the additional problem
         | of cache invalidation. Perhaps I need reverse lookups, in which
          | case whatever forward indexing I might've pessimistically added
         | ahead of time will be of no help at all.
         | 
          | If it's a 5-second "this is probably the right choice" kneejerk
         | reaction, maybe it's fine. Or if you're google indexing the
         | internet, I suppose. But I am frequently plagued by shit,
         | buggy, useless indexes that merely distract from the proper
          | Alexandrian solution to the Gordian knot - brute force -
         | wasting more time than they ever saved, for both people and
         | CPUs.
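For illustration, a toy sketch (Python, invented word list) of why an ordered index serves anchored prefix queries with two seeks, while an arbitrary regular expression gives it nothing to seek on and forces a full scan:

```python
import bisect
import re

words = sorted(["apple", "apricot", "banana", "grape", "grapefruit", "plum"])

def prefix_range(sorted_words, prefix):
    # A B-tree-style ordered index can answer a prefix query by seeking
    # to the start of the range and reading until the prefix ends.
    lo = bisect.bisect_left(sorted_words, prefix)
    hi = bisect.bisect_left(sorted_words, prefix + "\uffff")
    return sorted_words[lo:hi]

def regex_scan(sorted_words, pattern):
    # An unanchored pattern can match anywhere, so sort order is useless:
    # every entry must be tested.
    rx = re.compile(pattern)
    return [w for w in sorted_words if rx.search(w)]

assert prefix_range(words, "grape") == ["grape", "grapefruit"]
assert regex_scan(words, "fruit$") == ["grapefruit"]
```

Real databases work around this with specialized structures (e.g. trigram indexes), which is exactly the kind of extra machinery the comment above is weighing against brute force.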
        
           | pkulak wrote:
           | > Justify the dev time to save micro-pennies worth of
           | electricity to me instead.
           | 
           | KEY (user_id)
           | 
           | I mean, it's a dozen characters. Do you need to know how fast
           | I type before you run the calculation?
        
             | MaulingMonkey wrote:
             | > Do you need to know how fast I type before you run the
             | calculation?
             | 
             | I'll assume 100WPM, call that two words, billed at
             | $200/hour and call that $0.06, falling under "too cheap to
             | be worth arguing against", which falls under the
             | aforementioned:
             | 
              | >> If it's a 5-second "this is probably the right choice"
             | kneejerk reaction, maybe it's fine.
             | 
             | That said, there's a decent chance those 6 cents won't pay
             | for themselves if this is a login on a single user system,
             | with any theoretical benefits of O(...) scaling being
             | drowned out by extra compile times, extra code to load -
             | and I'd be plenty willing to NAK code reviews that merely
             | attempt to replace /etc/passwd and /etc/shadow with this,
             | as the extra time code reviewing the replacement still has
             | negative expected ROI, and it'll be a lot more involved
             | than a mere dozen characters.
             | 
             | Now, maybe we're attempting to centralize login management
             | with Kerberos or something, perhaps with good reasons,
             | perhaps which does something similar under the hood, and we
             | could talk about _that_ and _possible_ positive ROI,
             | despite some actual downtime and teething concerns?
        
               | TylerE wrote:
               | Now document the two words, run the test suite to verify
                | no breakage, commit to source control, and push to
               | production.
               | 
               | Suddenly those two words cost a lot more than $0.06, and
               | that's IF everything goes smoothly.
        
             | jefftk wrote:
             | (author)
             | 
             | If you're already using a relational database you should
             | almost certainly set up indexes on your table ids and
             | foreign keys. But that's pretty different from the examples
             | in the post!
             | 
             | I'm not anti-index, I'm anti-"if you ever have a full table
             | scan in production you're doing it wrong".
        
             | shpx wrote:
             | First tell me the minimum amount of time typing this would
             | have to take for you to agree it's not worth it and I will
             | try to keep adding things like the time it takes for
             | someone to ask you to do this, for you to start VS Code,
             | find the file, press ctrl-s, deploy the changes, and
             | possibly some estimation of how long it takes a new
             | developer to read and understand this system (please tell
             | me how fast you speak and an agreeable value for how fast
             | the average developer reads as well for this part) vs a
             | simpler one until it goes over that limit.
        
           | scott_w wrote:
           | > Justify the dev time to save micro-pennies worth of
           | electricity to me instead.
           | 
           | The time spent justifying it is longer than the dev time
           | itself. Any semi experienced engineer will throw basic
           | indexes into their data model without even thinking and cover
           | the most common use cases.
           | 
           | If they never use them... who cares? It took no time to add.
        
             | yuliyp wrote:
             | An RDBMS is not the best data store for all data. Sometimes
              | flat files or similar are the simplest tool for the job, as
             | shown in the article. Adding a database to any of those
             | tools would definitely not be worth the trade-off.
        
           | muhehe wrote:
           | > Justify the dev time
           | 
           | And this is exactly the sentiment that got us where we are.
        
             | lazide wrote:
             | The mistake in your statement I think, is assuming there is
             | a course of action that wouldn't result in a problem
             | somewhere.
             | 
             | It may look a little different, but there is nothing
             | without tradeoffs.
             | 
             | Including indexing _everything_.
        
           | teen wrote:
           | not sure if this is trolling?
        
         | jasonwatkinspdx wrote:
         | So, forgetting an index and then growth pushing you over the
         | threshold is a valid concern. I think every dev has run into
         | that at some early point in their career.
         | 
         | But what your comment is skipping past is there's potentially a
         | 100x or higher difference in throughput for sequential scans vs
         | indexes. If you know the data will have a bounded size this
         | means indexes aren't necessarily a good choice. SSDs have
         | narrowed this gap a great deal but it still exists because of
          | latency and unpredictable prefetching. It even exists with pure
         | in memory applications. Another key aspect is how much of the
         | data the query ends up touching. If you're hitting 25% of the
         | data anyhow a linear scan is likely faster.
         | 
         | There's also more niche ideas like arranging to convoy multiple
         | queries along one linear scan, something impossible with
         | indexed scans.
         | 
         | It's useful to know about this asymmetry and that sometimes a
         | brute force linear scan is in fact the best tool.
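The 25% claim falls out of a back-of-envelope cost model. A sketch (Python; the 100x random-read penalty is an assumed illustrative constant, not a measurement):

```python
SEQ_COST = 1       # cost units per row read in a linear scan
RANDOM_COST = 100  # assumed cost per index-driven random row fetch

def scan_cost(total_rows):
    # A full scan reads everything, but every read is sequential.
    return total_rows * SEQ_COST

def index_cost(total_rows, selectivity):
    # An index reads only matching rows, but pays the random-access
    # penalty on each one.
    return int(total_rows * selectivity) * RANDOM_COST

rows = 10_000_000

# At 25% selectivity the index touches a quarter of the rows yet still
# loses badly under these constants:
assert index_cost(rows, 0.25) > scan_cost(rows)

# Break-even: the index only wins below SEQ_COST / RANDOM_COST selectivity.
break_even = SEQ_COST / RANDOM_COST
print(f"index wins below ~{break_even:.0%} selectivity")
```

Real planners use far more refined versions of this model (caching, clustering, page sizes), but the asymmetry it captures is the same one described above.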
        
         | leetrout wrote:
         | Agree but you can always add an index without downtime. It just
         | becomes a money and complexity issue ;)
        
           | zht wrote:
           | can you do this (on a large table) without adding significant
           | IO load/clogging up replication for an extended period of
           | time?
        
             | lazide wrote:
             | It's worth noting that if your DB instance is so heavily
             | loaded that this is a real concern, you already have a
             | _huge_ problem that needs fixing.
        
               | mschuster91 wrote:
               | AWS is particularly bad with their performance credit
               | system on RDS... and there's to my knowledge no way to
               | tell MySQL to limit index creation IOPS, which means in
               | the worst case you're stuck with a system swamped under
               | load for days and constantly running into IO starvation,
               | if you forget to scale up your cluster beforehand.
               | 
               | Even if the cluster is scaled to easily take on your
               | normal workload, indexing may prove to be too much for IO
                | burst credits.
        
               | nightpool wrote:
               | I have never had any problems with CONCURRENT index
               | creations under significant load using Postgres, fwiw
        
               | lazide wrote:
               | That does seem like a real problem! Adding indexes
               | periodically is a pretty regular thing for any production
               | system where I come from.
        
               | pornel wrote:
               | It may be so loaded from all the full table scans it's
               | doing.
        
         | anyfoo wrote:
          | > If a table scan takes 250ms vs 0.01 for an indexed lookup
         | 
         | I'm not sure whether the units of that latter number are still
         | ms or now s or whatever, but either way isn't that where you
         | are wrong? On real computers, there are lots of situations
         | where linear access is trivially faster than a "better" data
         | structure with theoretically better runtime complexity.
         | 
         | 1000 accesses to a single page for a linear scan are going to
         | be many, many orders of magnitude faster than 5 accesses
         | chasing pointers across 5 pages that have no TLB entry, or
         | excruciatingly worse, that need to be paged in from disk.
         | 
         | This is an extreme (but absolutely not unrealistic!) example
         | for illustration. Slightly more subtle scenarios involving e.g.
         | branch prediction have already been named here.
         | 
         | This problem exists across multiple layers, not just close to
         | the CPU like above. For example, for data on spinning disks,
         | linear reads are much faster than head seeking. SSDs have
         | changed that dramatically, but they still have artifacts (for
         | example, having to access another block, especially an
         | unpredictable one, still has overhead from different causes).
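The effect can be glimpsed even from Python, though interpreter overhead mutes it badly; in compiled code the gap is far larger. A small sketch (array size arbitrary), doing the same reads in sequential vs shuffled order:

```python
import array
import random
import time

N = 1_000_000
data = array.array("q", range(N))  # contiguous 64-bit ints

seq_idx = list(range(N))
rand_idx = list(range(N))
random.shuffle(rand_idx)  # same accesses, unpredictable order

def summed(indices):
    total = 0
    for i in indices:
        total += data[i]
    return total

t0 = time.perf_counter()
s_seq = summed(seq_idx)
t1 = time.perf_counter()
s_rand = summed(rand_idx)
t2 = time.perf_counter()

# Same work, same result; only the access pattern differs.
assert s_seq == s_rand == N * (N - 1) // 2
print(f"sequential: {t1 - t0:.3f}s  random: {t2 - t1:.3f}s")
```

The randomly ordered pass typically comes out slower because each access lands on a cold cache line; the disk-level analogue (seeks vs linear reads) is the same phenomenon at a larger scale.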
        
       | slaymaker1907 wrote:
        | Tiddlywiki only uses indices for tags/fields, yet it still
       | works pretty well so long as the wiki isn't huge. There's
       | something to be said for the flexibility an approach like that
       | provides.
        
       | jacobsenscott wrote:
       | Most databases will ignore indexes if the table is small enough,
        | so they sort of do this for you. So the index can just be a waste
       | of space, until it isn't.
        
       | pyrophane wrote:
       | I think the risk of creating indexes prematurely is really that
       | you will make assumptions about how your table is going to be
       | queried that don't turn out to be correct, so you create indexes
       | that you will never need, even if the table gets large.
       | 
       | That probably isn't the end of the world, but it can make it
       | harder for others to understand what is going on in your database
       | (indexes can also be a form of documentation) and could make your
       | ORM code somewhat more complex.
       | 
        | The worst case scenario, I suppose, would be you create an index
        | on a table that grows large, and you keep paying its write and
        | storage costs even though nothing ever uses it.
        
         | giraffe_lady wrote:
         | This has matched my experience too. Most experienced developers
         | can probably identify an inappropriately hot table and add an
         | index to it. Where a DB with dozens of individually modest
         | indexes that add up to be huge is much trickier and more labor
         | intensive to confidently improve.
        
         | jmcomets wrote:
         | FWIW, most DBMS have built-in index usage stats, and it's not
         | too difficult to query. In my previous job we had a Postgres
         | health dashboard that notably showed `log10(index size) / index
         | hits`, making it pretty clear when some recently added index
         | was useless.
         | 
         | My opinion is that indexes should be either logical (ex: used
         | for exclusion constraints) or purely for performance (tradeoff
         | space for better QPS). Query patterns change, specs change, so
         | monitoring your DB's performance is the way to go.
        
       | Aachen wrote:
       | Moral of the post: don't do premature optimization. Common adage
       | but it's a good reminder and example of it.
       | 
       | Aside from one case where OP argued that the queries were so rare
       | compared to the data updates that maintaining the index is more
       | expensive. Which is also pretty classic when you're taught about
       | indexes.
       | 
       | What I recently learned the hard way about indexes is that
       | they're slow when you have to read "large" chunks of a table to
       | find an answer, in my case computing a median for an area. I'm
       | indexing geospatial data and Europe+NA are overrepresented. When
       | you view an area the size of ~Germany, the query would take 20
       | minutes and compute the median over ~3% of all records whereas a
       | full table scan (whole world) took something like 40 seconds
       | (with the median being computed over the same 3%, it would just
       | evaluate the WHERE clause against every row instead of finding
       | rows to include via the index). That's the power of a sequential
       | read as compared to reading an Rtree. I haven't had such bad
       | experiences with Btree indexes, not sure if that would behave
       | just as badly. On the other hand, if you ask for an area the size
        | of Amsterdam, the index is normally fast, and for <50 rows it's
       | basically instant whereas the full table scan would still take
       | the same 40 seconds.
        
         | iLoveOncall wrote:
         | Adding an index is an incredibly expensive task on a dataset so
         | big that it needs one to be added.
         | 
         | It's something that is likely to require massive engineering
         | effort on a live system.
         | 
         | Removing an index, on the other hand, is never an issue.
         | 
         | It's not at all premature optimization, it's simply basic
         | software design.
        
           | Aachen wrote:
           | Read OP's post please. They're talking of systems where it
           | was acceptable to always do full table scans, in one case
           | having 350 rows.
           | 
           | Not "an incredibly expensive task on a dataset so big that it
           | needs one". Thus I read OP's post as recommending to not do
           | premature optimization (without using those words literally).
        
       | zokier wrote:
       | > I use grep on a flat file, and testing now it takes 200ms for a
       | query across my 180k history entries
       | 
       | 200ms is order of magnitude more than I'd prefer for interactive
       | use
        
         | fnordpiglet wrote:
         | Yeah that was my thought too, and I thought about fishing out
         | the unbranded $1 1Gb microsd in my drawer I use for 3d printing
         | to help him out with his 0.5Gb storage challenge
        
         | PhilipRoman wrote:
         | 200ms for 180k entries sounds way too slow. Just tested on my
         | 150k line log file and it takes about 10ms on average for
         | various kinds of patterns with various hit frequencies.
         | 
         | On an unrelated note, from a quick check with "strace", gnu
         | grep seems to detect when output is redirected to /dev/null as
         | no write syscalls are being made in that case.
        
           | watermelon0 wrote:
           | Unrelated note intrigued me, so I went looking in the source
           | code; detection is here: https://git.savannah.gnu.org/cgit/gr
           | ep.git/tree/src/grep.c#n...
           | 
           | Interesting that they would optimize for this use case.
        
             | remram wrote:
             | Especially since grep -q exists. It would never come to my
             | mind to direct grep's output to /dev/null, I wonder how
             | this optimization came to be?
        
         | jefftk wrote:
         | I was going to say that it was "instant", because that's how it
         | feels when using it, but got 200ms from running "time" on
         | grepping it to include a number in the post.
        
           | doodlesdev wrote:
           | Have you tried out using ripgrep [0] or fzf [1]? 200ms is
           | quite a lot for history search, I'd expect search to be 20ms
           | or less. In fact, it should be instantaneous as it should all
           | fit in memory really. I've been using fzf and the searches
           | are all faster than my monitor can display frames (so less
           | than 16ms), although I only have 10k entries on my history as
           | I clear it up once in a while.
           | 
           | [0]: https://github.com/BurntSushi/ripgrep
           | 
           | [1]: https://github.com/junegunn/fzf
        
             | jefftk wrote:
             | Trying now, "cat > /dev/null" takes ~150ms, so I suspect
             | it's related to keeping my history in Google Drive (with
             | local mirroring).
             | 
              | If it gets annoyingly slow, I'll stop doing that and use a
             | different syncing system.
        
       | perlgeek wrote:
       | Slightly off topic, but just the other day we had a case where we
       | had an index, and deleting ~100k entries from a table with a few,
       | small columns caused a huge delay.
       | 
       | Context: Mariadb 10.2. In this table we stored traffic
       | statistics, aggregated by IP address and day. Probably less than
       | 200 bytes per row. They were also written per day, and then
       | cleaned up by a cron job (every month an entire month worth of
       | data was discarded), so "DELETE FROM traffic_stats WHERE day < ?"
       | 
       | This worked fine on the primary, but our monitoring told us that
       | the replicas started to lag, and showed that they spent
       | inordinate amounts of time on the delete statement above. Even
       | though we had an index on the "day" column.
       | 
       | Even weirder, the problem didn't show up in our QA database
       | cluster, which had far less resources (vmware vm in QA vs. pretty
       | solid bare metal in prod).
       | 
       | Even more puzzling, skipping the replication step and running the
       | DELETE statement manually was quite fast (a second or so).
       | 
       | After some frantic search, it turned out that, for reasons lost
       | in time, the QA db cluster used "mixed" replication, while the
       | prod cluster used row-based replication. In row-based
       | replication, the leader communicates to the replicas which _rows_
        | were deleted. And since we didn't have a primary key on that
       | table, the replicas did a full-table scan to find which record to
       | delete, and repeated that for each of the 100k records to be
       | deleted.
       | 
       | Adding a primary key to the table fixed the issue immediately,
       | and we switched to mixed replication shortly after.
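A toy model of that failure mode (Python; the row counts and value-matching strategy are simplified assumptions, not MariaDB internals):

```python
# Without a primary key, applying each row-delete event means scanning the
# table for a matching row; with a key, it's a single lookup per event.

def apply_deletes_no_pk(table, deleted_rows):
    comparisons = 0
    for row in deleted_rows:       # one replication event per deleted row
        for r in table:            # full-table scan to find the match
            comparisons += 1
            if r == row:
                break
    return comparisons

def apply_deletes_with_pk(pk_index, deleted_keys):
    lookups = 0
    for k in deleted_keys:         # direct primary-key lookup
        assert k in pk_index
        lookups += 1
    return lookups

# (ip, day) rows, unique by construction
table = [("10.0.0.%d" % (i % 250), i // 250) for i in range(10_000)]
doomed = table[:1_000]             # a month's worth of rows to purge

no_pk = apply_deletes_no_pk(table, doomed)
with_pk = apply_deletes_with_pk({row: None for row in table}, doomed)
print(no_pk, with_pk)  # 500500 vs 1000 row touches
```

Even in this tiny model the keyless replica does ~500x the work; at 100k deletes against a real table the blow-up is what made the replicas lag.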
        
         | eitally wrote:
          | I'm very familiar with this sort of problem. I used to work in
          | high-volume electronics manufacturing, and coercing high- and
          | variable-transaction OLTP databases into being 1) backupable
          | and 2) replicable for analytics workloads often
         | feels like a dark art. In a lot of cases, trial and error is a
         | perfectly reasonable approach, for better or for worse.
        
       | XCSme wrote:
       | Why not just add the index? It future-proofs the performance and,
       | as mentioned in the article, for small datasets, performance is
       | not an issue anyway, so adding an index won't affect the small-
       | dataset performance.
        
         | jasfi wrote:
         | I agree, you don't want performance to suddenly degrade and
         | then you have to find out why. This can be especially bad when
         | this happens with many tables that should have indexes, and
         | performance is sluggish, but not for any specific action.
         | 
          | Adding indexes appropriately is a best practice.
        
       | [deleted]
        
       ___________________________________________________________________
       (page generated 2023-05-25 23:00 UTC)