[HN Gopher] Cases where full scans are better than indexes
___________________________________________________________________
Cases where full scans are better than indexes
Author : xnx
Score : 165 points
Date : 2023-05-25 14:47 UTC (8 hours ago)
(HTM) web link (www.jefftk.com)
(TXT) w3m dump (www.jefftk.com)
| electroly wrote:
| If the data is small, then the index will also be small, so it
| doesn't really matter either way. The reason to _avoid_ indexes
| is if you can't accept the time or space cost. Only the last of
| OP's examples is a situation where choosing an index causes a
| problem; the rest are examples where an index _isn't necessary_
| but does not cause problems. Several examples are so trivial that
| you'd be crazy to use an external database at all. I'm not sure
| I'm really sold here.
| deepsun wrote:
| Unless the small data changes often -- then the DBMS needs to
| update both the data and each index.
| electroly wrote:
| By definition, the changes will also be small. It's just not
| possible for it to blow up in a way that would cause a
| problem, because we're only talking about tiny tables.
| There's nothing you can do in a single[1] index for a table
| with 350 rows that will cause any heartache vs. an unindexed
| table with 350 rows. 4 of the 5 examples in the article are
| just "this table is too small to care."
|
| [1] Edited per discussion; a large number of indices on a
| small table can still get you in trouble.
| tomnipotent wrote:
| > By definition, the changes will also be small
|
| Even if all you're updating is an int32 column from 1 to 2,
| the RDBMS is still writing much more data than 4 bytes. At
| a minimum it's making a copy of the full page the data is
| in prior to the update (so 8k for pg, 16k for mysql) and
| usually multiples of that (mysql writes both before/after).
| deepsun wrote:
| > There's nothing you can do in an index for a table with
| 350 rows
|
| I meant if you have 100s of indices -- it might be slow
| even on small tables.
|
| I actually had a production issues from using a lot of
| indices, but it's not apples-to-apples with current
| discussion, because the table sizes, DBMS and update rates
| were much larger, fixed by removing indices and splitting
| table into multiple.
| electroly wrote:
| Wow! You're right; I wasn't considering that anyone would
| try to have that many indices. 100 is a lot. I should
| qualify my statement and say that you won't get in
| trouble with _one_ index on a small table. I've edited
| my post accordingly.
| sumtechguy wrote:
| In those sorts of cases your schema is probably not the
| correct pattern and needs to be reviewed.
|
| I had one proj where they had 40 indexes on one table.
| Inserts were terrible for that table. No one wanted to
| touch it as it happened to be the table where everything
| was at. They started pulling out 'nosql' sharding and
| other sorts of tools to fix it. Anything to not change
| that table with 200+ columns and 30 indexes and the
| flotilla of stored procs to go with it. They even ported
| it from MSSQL to Oracle, thinking somehow Oracle was
| magically going to make the poor schema run faster.
| jefftk wrote:
| In cases where your data is small enough that you don't need an
| index, an index is adding technical complexity. In some cases
| that will be small (you are already using a database and it has
| built-in support for the kind of index you need) and then,
| sure, go ahead. But in other cases it will not be so small (you
| can't get indexes without switching to a more complicated tool,
| you'd need a somewhat unusual kind of index which requires a
| plugin, etc) and it's worth thinking about whether your system
| will be more reliable and low-maintenance long-term if you go
| without.
| electroly wrote:
| I think we're in agreement here; this is the part where I
| said "Several examples are so trivial that you'd be crazy to
| use an external database at all". For your shell history, you
| not only don't need indexes, you don't need the database
| either (as you conclude in the post). But if you're already
| in the database, it falls into that first "the index feature
| is right there" category. This HN thread is full of people
| skipping the index on their SQL tables in situations where I
| would write the index. Mostly I think people are just missing
| that you need to write indices in response to queries rather
| than what you're imagining when you first create the table.
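| To make that concrete, a minimal sketch of that workflow (all
| table, column, and index names here are hypothetical):
|
|       -- The query that actually showed up as slow in production:
|       EXPLAIN ANALYZE
|       SELECT * FROM orders
|        WHERE customer_id = 42
|        ORDER BY created_at DESC;
|
|       -- Plan shows "Seq Scan on orders", so write the index in
|       -- response to this query rather than to a guess made at
|       -- table-creation time:
|       CREATE INDEX orders_customer_created_idx
|           ON orders (customer_id, created_at DESC);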
| fnordpiglet wrote:
| In the case I mention above of log data, where ingestion
| dominates and most of what is stored is never retrieved,
| computing indices up front is absurdly expensive and
| challenging at extreme scales. In that case minimal indices
| (time-based, and perhaps probabilistic) computed on ingestion,
| plus extremely parallel brute-force search at query time,
| works out both in terms of unbounded scale and cost - by like
| one to two orders of magnitude.
| electroly wrote:
| Indeed, we also use an unindexed store for our logs (having
| migrated to that from a system that _was_ indexed). This is
| clearly _not_ the "data is so trivially small that nothing
| matters" situation that the OP is describing in 4 of the 5 of
| their examples. I hope you can see this isn't at all
| incompatible with my post.
| fnordpiglet wrote:
| I can. I'm just pointing out there are definitely cases
| where indexing sucks. I agree the original post doesn't
| touch those cases.
| jeffreyrogers wrote:
| He didn't show that full scans are better than indexes, just that
| they aren't obviously worse when the data is small.
| hamilyon2 wrote:
| 1) The Postgres optimizer at some point prefers full scans. It
| depends on your data, your query and configuration. So on a
| read-heavy load you basically lose nothing by adding all
| appropriate indexes from the get-go. The overhead of having an
| index on small tables is small.
|
| 2) If your load is write-heavy, you should consider trade-offs.
| Yes, full scans are sometimes your friends. An interesting
| thing to notice here is that write amplification is basically
| the same even if the table is small.
| [deleted]
| OliverJones wrote:
| I take the author's point for single-user write-frequently
| search-rarely use cases. Plenty of small greenfield projects look
| like that.
|
| Bigger projects, or projects that are enjoying fast growth, may
| well need indexing. Why? To keep search performance up under a
| multi-user workload and/or a mixed write-query workload. Complex
| multi-table searches often also benefit from judicious indexing.
|
| Here's something to keep in mind: Successful projects often find
| that an unpredicted query pattern has become very popular and/or
| a bottleneck. Often the bottleneck is remediated in production by
| adding appropriate indexes. SQL's designed to add indexing
| without changes to the data model, for precisely that reason.
| It's normal to add indexes as projects grow, and no sign of bad
| design or expensive technical debt. (Unsuccessful projects, or
| projects that don't need to grow? They don't have the problem.)
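| Concretely, in Postgres this kind of production remediation can
| be done without touching the data model or the application (a
| sketch; the table and column are made up):
|
|       -- Built without locking out writes, so it can run against
|       -- a live table:
|       CREATE INDEX CONCURRENTLY posts_author_id_idx
|           ON posts (author_id);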
| fnordpiglet wrote:
| CloudWatch Logs Insights, Snowflake, Grafana Loki all do
| minimally indexed queries with highly parallel brute force scans
| of tons and tons of smallish s3 objects. For a lot of data,
| especially data like logs, the ratio of "retrieved:ingested" can
| be in excess of 1:100. That makes it very much worthwhile to not
| index up front but to pay the retrieval cost for the rare amount
| of data that is actually retrieved. The minimal indexing
| mentioned is typically time and sometimes bloom-filteresque term
| existence indices at the object level.
| londons_explore wrote:
| > CloudWatch Logs Insights, Snowflake, Grafana Loki
|
| And the common thing in this list of tools? They're all laggy
| and not a nice experience to use.
|
| Even if queries are rare, there is often an actual human
| sitting waiting for the results of a query. Whereas while
| ingesting logs, there is no human waiting. I would prefer to
| burn more CPU time to save some human time.
| fnordpiglet wrote:
| I use all three and at large scales they're considerably
| faster than the alternative, which at some scales
| (peta/exabyte) is "none."
| taeric wrote:
| I feel this somewhat generalizes. "Cases where a simple
| list/array is better than a hashed collection." Which, oddly,
| has the perverse problem of "when you should give up on a
| single collection of your data, and embrace storing it in
| different ways."
|
| That is, few folks seem to internalize or consider that every
| index is essentially a copy of the data. A complete copy, if you
| are doing a fully projected index. So, when is it better to just
| do a full scan over something else? Well, it's probably never
| better in isolation of every other concern. But, when is it
| better to just do a scan of the data over duplicating it in a
| place that would let you hash into it? It depends on how much
| extra work you are doing because of it.
|
| Having typed that, really I think the trick is to ask yourself
| roughly (not exactly, amusingly) what it means to add an index.
| Or to use a hashtable. Or to use a list and a hashtable. And
| realize that the question of using an index is the same as the
| question of keeping another copy of your data somewhere.
| [deleted]
| vivegi wrote:
| Once upon a time (25+ years ago) I used to maintain an Oracle
| database that had around 50M records in its main table and 2M
| records in a related table. There were a dozen or so dimension
| (lookup) tables.
|
| Each day, we would receive on the order of 100K records that had
| to be normalized and loaded into the database. I am simplifying this
| a lot, but that was the gist of the process.
|
| Our first design was a PL/SQL program that looped through the
| 100K records that would be loaded into a staging table from a
| flat file, performing the normalization and inserts into the
| destination tables. Given the size of our large tables, the
| indices and the overhead of the inserts, we could never finish
| the batch process within the nightly window reserved for the
| records intake.
|
| We finally rewrote the entire thing using SQL, parallel table
| scans, staging tables and disabling and rebuilding the indexes
| and got the entire end-to-end intake process to finish in 45
| minutes.
|
| Set theory is your friend and SQL is great for that.
| zvmaz wrote:
| > Set theory is your friend and SQL is great for that.
|
| Could you tell us how set theory was useful in your case?
| vivegi wrote:
| Simple example.
|
| INTAKE_TABLE contains 100k records and each record has up to
| 8 names and addresses. The following SQL statements would
| perform the name de-duplication, new id assignment and
| normalization.
|
|       -- initialize
|       truncate table STAGING_NAMES_TABLE;
|
|       -- get unique names
|       insert /*+ APPEND PARALLEL(STAGING_NAMES_TABLE, 4) */
|         into STAGING_NAMES_TABLE (name)
|       select /*+ PARALLEL(INTAKE_TABLE, 4) */ name1 from INTAKE_TABLE
|       union
|       select /*+ PARALLEL(INTAKE_TABLE, 4) */ name2 from INTAKE_TABLE
|       union ... union
|       select /*+ PARALLEL(INTAKE_TABLE, 4) */ name8 from INTAKE_TABLE;
|       commit;
|
|       -- assign new name ids using the name_seq sequence number
|       insert /*+ APPEND */ into NAMES_TABLE (name, name_id)
|       select name, name_seq.NEXTVAL
|         from (select name from STAGING_NAMES_TABLE
|               minus
|               select name from NAMES_TABLE);
|       commit;
|
|       -- normalize the names
|       insert /*+ APPEND PARALLEL(NORMALIZED_TABLE, 4) */
|         into NORMALIZED_TABLE
|         (rec_id, name_id1, name_id2, ... name_id8)
|       select /*+ PARALLEL(SNT, 4) */
|              int.rec_id  rec_id,
|              nt1.name_id name_id1,
|              nt2.name_id name_id2,
|              ...,
|              nt8.name_id name_id8
|         from INTAKE_TABLE int,
|              NAMES_TABLE nt1, NAMES_TABLE nt2, ... NAMES_TABLE nt8
|        where int.name1 = nt1.name
|          and int.name2 = nt2.name
|          and ...
|          and int.name8 = nt8.name;
|       commit;
|
| The database can do sorting and de-duplication (i.e., the
| UNION operation) much, much faster than any application
| code. Even though the INTAKE_TABLE (100k records) is table-
| scanned 8 times, the union query runs quite fast.
|
| The new id generation does a set exclusion (i.e., the MINUS
| operation) and then generates new sequence numbers for each
| new unique name and adds the new records to the table.
|
| The normalization (i.e., the name lookup) step joins the
| NAME_TABLE that now contains the new names and performs the
| name to id conversion with the join query.
|
| Realizing that the UNION, MINUS and even nasty 8-way joins
| can be done by the database engine way faster than
| application code was eye-opening. I never feared table scans
| after that. Discovering that the reads (i.e., SELECTs) and
| writes (i.e., INSERTs) can be done in parallel with
| optimization hints such as the APPEND hint was a superpower.
|
| Using patterns such as TRUNCATE TABLE (for staging tables) at
| the top of the SQL script made the scripts idempotent, i.e.,
| we could trivially run the script again. The subsequent runs
| will not generate new sequence numbers for the names.
| With some careful organization of the statements, this became
| rigorous.
|
| Although I haven't shown it here, we used to do the entire
| normalization process in staging tables and finally do a copy-
| over to the main table using a final INSERT / SELECT
| statement.
|
| My Oracle SQL-fu is a bit rusty. Apologies for any syntax
| errors.
| wswope wrote:
| The implication is that the PL/SQL job was operating rowwise.
|
| Batch processing, backed by set theory in this case, is far
| more efficient than rowwise operations because of the
| potential for greater parallelism (SIMD) and fewer cache
| misses.
|
| E.g., if inserting into a table with a foreign key pointing
| to the user table, batch processing means you can load the
| user id index once, and validate referential integrity in one
| fell swoop. If you do that same operation rowwise, the user
| id index is liable to get discarded and pulled from disk
| multiple times, depending on how busy the DB is.
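| A rough sketch of the contrast (hypothetical tables; Oracle-
| style PL/SQL for the rowwise version):
|
|       -- Rowwise: one statement per record; locks, FK checks and
|       -- index pages are revisited for every row.
|       FOR r IN (SELECT user_id, amount FROM staging) LOOP
|           INSERT INTO orders (user_id, amount)
|           VALUES (r.user_id, r.amount);
|       END LOOP;
|
|       -- Set-based: one statement; the user id index is loaded
|       -- once and the work is amortized across the whole batch.
|       INSERT INTO orders (user_id, amount)
|       SELECT user_id, amount FROM staging;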
| lazide wrote:
| Also, each insert will have to acquire locks, verify
| referential integrity, and write to any indexes
| individually. This is time consuming.
| gopher_space wrote:
| > loaded into a staging table from a flat file, performing the
| normalization and inserts into the destination tables.
|
| This is how I've always done things, but I'm starting to think
| decoupling normalization from the entire process might be a
| good idea. This would have simplified design on a few projects
| (especially the ones that grew "organically") and seems like it
| would scale nicely.
|
| I'm starting to look at normalization as a completely separate
| concern and this has led to some fun ideas. E.g. if a system
| operated on flat files that were already normalized it could
| save time writing tools; you're at the command line and
| everyone involved can get their grubby mitts on the data.
| cirrus3 wrote:
| It is never really mentioned why adding an index upfront is such
| a burden.
|
| I don't really understand how the downside of adding an index
| almost ever outweighs the potential downsides of not adding an
| index. I'm sure there are some cases, and that was what I was
| expecting to see in this article, but it just seems to boil down
| to being so lazy that he wants to avoid typing a few more
| characters.
| paxys wrote:
| I don't get it...they are just bragging about writing terrible
| code? There isn't a single reason given in the article about why
| a full scan is better than an index other than "I don't need
| these fancy index things, I'm perfectly happy with my 200ms query
| time".
|
| Well add the index anyway and add an extra 195ms sleep() call if
| you really want to. It's still better for your hardware and
| electricity bill.
| mrits wrote:
| While I'm not sure why you moved there recently I can see why
| you wouldn't have noticed how nice it used to be. As I said, it
| already started the decline a decade ago.
| grumple wrote:
| The HN title says something different than the blog post title /
| intent. The HN title is incorrect and misleading imo.
|
| The blog post can be summarized as "you don't need indexes on
| tiny tables".
| mastax wrote:
| > I recently came across someone maintaining a 0.5GB full text
| index to support searching their shell history, 100k commands. I
| use grep on a flat file, and testing now it takes 200ms for a
| query across my 180k history entries.
|
| I just started using a tool I found on HN that stores my bash
| history and replaces C-r; it's written in Rust and uses
| SQLite of course. But it'll randomly cause commands to pause for
| a few seconds before executing (I think the hook to write the
| command to history gets blocked). Probably some simple bug, but I
| could just have my command history in a flat file and use rg or
| fzf in a 5 line shell function and have no lag.
| mastax wrote:
| Okay getting real off-topic now (welcome to HN) but this is not
| what I expected:
|
| https://github.com/ellie/atuin/issues/952#issuecomment-15378...
| https://github.com/openzfs/zfs/issues/14290
|
| Apparently when SQLite calls `ftruncate` on the `-shm` file,
| ZFS will sometimes block for several seconds. Maybe it's
| waiting for txg timeout? Strange.
| scarmig wrote:
| Can you share the tool? I've been thinking of making something
| similar for fun backed by a suffix array, interested in what
| else is out there.
| wswope wrote:
| https://github.com/ellie/atuin
| chubot wrote:
| Anyone remember gsearch by Jeff Dean circa 2006 at Google? It was
| a full scan regex search, using 40-80 machines in parallel, not
| an index
|
| Also sourcerer was a full scan of VCS metadata, on a single
| machine. i.e. if you listed all the commits by a certain person,
| it would just do a for loop in C++ going through about 1-2 M
| commits. They were just C++ structs in memory, and it was
| probably submillisecond times, even with hardware at the time
|
| I remember reading those 2 programs and being impressed that they
| were both very fast, and full scans! (gsearch got slow later as
| the codebase grew, but the first version was a <1 second
| distributed grep of the whole Google codebase.)
| SoftTalker wrote:
| Indexing doesn't add much for very small tables. It also doesn't
| really help much if the cardinality of the data is low.
|
| I usually index any columns that are used in joins, or frequently
| used in predicates.
| nubinetwork wrote:
| Indexes got me from 15 seconds down to under a second. I'll
| always use them even if they wind up being as much space as the
| actual data. MySQL is just weird that way.
|
| Edit: believe me, I tried it all... refining my queries, multiple
| tables, bigger hardware, faster disks, changing all the
| tuneables... nothing was as good as adding a whack-tonne of
| indexes.
| ydnaclementine wrote:
| What is the rule that you use on which fields to index? I've been
| following this general rule when adding fields: index the field
| if it is mentioned in a SQL WHERE, ORDER BY, or GROUP BY
|
| EDIT: JOINs
| jdmichal wrote:
| The answer will heavily depend on what you're doing.
|
| CRUD patterns where you're pulling out a small group of rows
| from several tables, you probably want indexes on `JOIN`
| clauses so that you can quickly find those links using nested
| loop index scans. And maybe a few indexes on common ways to
| find records.
|
| Analytics patterns where you're aggregating a large number of
| rows, indexing `JOIN`s won't help you, because you really want
| hash or merge joins anyway. Instead, tune your indexes for
| `WHERE` clauses, because you want filtered index scans
| instead of table scans on the reads.
|
| You also want to learn about cardinality and selectivity.
| Ideally your indexes are organized by selectivity.
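| For instance (a hedged sketch; the table and columns are made
| up): if `tenant_id` is highly selective and `status` is not,
| lead the index with the selective column:
|
|       CREATE INDEX events_tenant_status_idx
|           ON events (tenant_id, status);
|       -- Leading with status would group millions of rows under
|       -- a handful of values before the index could narrow
|       -- anything down.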
| AnotherGoodName wrote:
| Don't overthink that part. Just rely on EXPLAIN. It will tell
| you if the query had an index to use or if it scanned. You'd be
| surprised when a DB does/doesn't use an index if you try to
| figure it out yourself.
|
| A run book for any org when writing queries should be
|
| 1) Write the query
|
| 2) Before shipping the query run the query through EXPLAIN. Did
| it scan or did it use an index?
|
| 3) If it scanned, can you re-write it to use an existing index?
| (That's the usual answer, honestly.)
|
| 4) If not, create the index.
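| A minimal illustration of step 2, in Postgres syntax with a
| hypothetical query:
|
|       EXPLAIN SELECT * FROM users WHERE email = 'a@example.com';
|       -- "Seq Scan on users" means it scanned; "Index Scan using
|       -- users_email_idx" means it used an index.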
| ksherlock wrote:
| You should also consider how unique that field is. If half your
| widgets are blue and half your widgets are red, indexing the
| color is probably not helpful (i.e., the query planner will
| ignore the widget_color index and do a full table scan). A rule
| of thumb number I vaguely recall is 10%.
| wiredfool wrote:
| My recollection in postgres anyway is that low cardinality
| indexes aren't useful, because it doesn't take into account
| which side of the 1/99% you're on when determining to use the
| index.
|
| What is useful is a partial index on the small-percentage
| value, because then it will use that index when it matches the
| query, but not in the majority case.
| anarazel wrote:
| > My recollection in postgres anyway is that low
| cardinality indexes aren't useful, because it doesn't take
| into account which side of the 1/99% you're on when
| determining to use the index.
|
| It does take that into account. Demo:
|
|       =# CREATE TABLE low_cardinality AS
|            SELECT generate_series(-10, 1000000) < 0 AS col;
|       =# CREATE INDEX ON low_cardinality (col);
|       =# ANALYZE low_cardinality;
|       =# EXPLAIN SELECT * FROM low_cardinality WHERE col = false;
|                                    QUERY PLAN
|       ------------------------------------------------------------------------
|        Seq Scan on low_cardinality  (cost=0.00..14425.11 rows=1000011 width=1)
|          Filter: (NOT col)
|       (2 rows)
|
|       =# EXPLAIN SELECT * FROM low_cardinality WHERE col = true;
|                                    QUERY PLAN
|       ------------------------------------------------------------------------
|        Index Only Scan using low_cardinality_col_idx on low_cardinality  (cost=0.42..4.44 rows=1 width=1)
|          Index Cond: (col = true)
|       (2 rows)
| anarazel wrote:
| Forgot to say: Obviously a partial index is going to be
| considerably better, performance- and size-wise.
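| For the demo table above, that would look something like this
| (an untested sketch):
|
|       CREATE INDEX low_cardinality_true_idx
|           ON low_cardinality (col) WHERE col;
|       -- Only the ~10 rows where col is true are stored, so the
|       -- index stays tiny and is only considered for queries
|       -- that match its predicate.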
| raldi wrote:
| When I wrote Reddit Gold (now called Reddit Premium) I
| intentionally left the paying-user table index-free, looking
| forward to the day it would become necessary.
|
| It hadn't happened by the time I left the company, even though
| sales were exceeding expectations.
| hardware2win wrote:
| What was the reasoning?
| raldi wrote:
| I had read a Joel Spolsky post about how they used to get an
| email every time they made a sale ("Ding!") and the day they
| finally turned it off was full of pride. (I did that too,
| originally not even filtered out of my inbox.)
| bcrosby95 wrote:
| So no primary key or uniqueness constraints?
| Macha wrote:
| Unlikely, given Reddit's past schema design. One table of
| "things", and then another table of attributes of those
| things in a entity,key,value format.
|
| https://kevin.burke.dev/kevin/reddits-database-has-two-
| table...
| ComodoHacker wrote:
| Modern DB engines can enforce Unique or PK constraints
| without indexes. Yes, they perform scans.
| tomnipotent wrote:
| Every major RDBMS requires an index for both constraints.
| hinkley wrote:
| FK constraints, however, are a pretty common gotcha. You
| have a table that allows deletes, and every table that
| has a column pointing to that table gets scanned on each
| delete operation.
|
| So you have to add an index on that column, or start
| talking about tombstoning data instead of deleting it, in
| which case you may still need the FK index to
| search out references to an old row to replace with a new
| one.
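| A sketch of the gotcha (hypothetical tables, Postgres
| syntax, assuming a users table keyed by id):
|
|       CREATE TABLE orders (
|           id      bigint PRIMARY KEY,
|           user_id bigint REFERENCES users (id)
|       );
|
|       -- DELETE FROM users WHERE id = 1; has to scan orders
|       -- unless the referencing column is indexed:
|       CREATE INDEX orders_user_id_idx ON orders (user_id);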
| tomnipotent wrote:
| An FK also adds the burden of keeping a copy of each
| related row during the lifespan of a transaction. This
| means small-but-frequently-updated tables that are joined
| to a lot can be a source of unexpected headaches.
| BeefWellington wrote:
| It's not really index-free if a separate table has constraints
| of any kind.
| cactusplant7374 wrote:
| Indexes are such an easy thing to add. I don't get it. Seems
| like under-optimization when you consider the tradeoffs.
| VWWHFSfQ wrote:
| > Indexes are such an easy thing to add.
|
| Indexes can have catastrophic consequences depending on the
| access patterns of your table. I use indexes very sparingly
| and only when I know for sure the benefit will outweigh the
| cost.
| jefftk wrote:
| Consider the example from the post of searching your shell
| history. If I don't need indexes I can just have it all in
| one flat file and use grep. Switching to a tool that supports
| indexing adds a lot of potential complexity and ways things
| could go wrong.
|
| Or consider the example of a developer querying frontend
| logs: queries are many orders of magnitude less common than
| writes, and (at the scale I used to work at) an index would
| be incredibly expensive.
| wswope wrote:
| I take the point of your examples as written in the post,
| but I think both of those are a bad comparison to the
| Reddit Premium subscriber table being discussed, because:
|
| - We're already using a database, so there's very minimal
| added complexity
|
| - This is an extremely hot table getting read millions of
| times a day
|
| - The scale of the data isn't well-bounded
|
| - Usage patterns of the table are liable to change over
| time
| raldi wrote:
| It wasn't an extremely hot table, and the scale of the
| data was well-bounded insofar as the only way for it to
| become non-negligible would be for us to have had revenue
| beyond our then-wildest dreams.
| SoftTalker wrote:
| They add overhead to updates and inserts. If you don't need
| them, why take that hit? Know your data.
| EGreg wrote:
| Wait a second, the whole thing is you don't mind O(N)
| overhead in searching on every request, but you mind O(log
| N) overhead for updates and inserts?
| titaniczero wrote:
| Well, my guess is that updates and inserts are much more
| frequent than searches in their use case. You're assuming
| a balanced frequency for these operations and it hardly
| ever happens.
| wswope wrote:
| This is a read-heavy workload per the OP:
| https://news.ycombinator.com/item?id=36071799
| toast0 wrote:
| If the table is small enough that indexes aren't critical
| for reads, they probably don't impact the rate of writes
| either.
|
| Of course, if the objective is large-scale bulk inserts, with
| only very occasional selects, then yes.
| tomnipotent wrote:
| Considering how much data is written to support
| redo/undo/wal logs, a single index for user id is very
| little overhead.
| holoduke wrote:
| My experience is that it's usually better to add indices
| where they're expected to be needed beforehand. Adding indices
| on large production tables with millions of rows can bring
| down databases for hours or even days worst case. It's
| tricky to manage.
| raldi wrote:
| If you have nothing to optimize yet, how will you know if
| you're optimizing it right?
| wswope wrote:
| Typically, you create a large number of test records and
| see how your expected access patterns perform.
| eterm wrote:
| That goes against my intuition, because an index is most
| beneficial where you have very few bits set on a boolean
| field.
|
| If you have 10m users and 100 have "IsPremium = 1" then an
| index massively speeds up anything with WHERE IsPremium = 1,
| compared to if the data is fairly evenly spread between
| IsPremium = 1 and IsPremium = 0 where an index won't be much
| benefit.
|
| So low sales would increase the need for an index.
|
| That said, I'm assuming searching for "Users WHERE IsPremium =
| 1" is a fairly rare thing, so an index might not make much
| difference either way.
| abtinf wrote:
| > paying-user _table_ index-free
|
| Presumably, that means they created a new table intended to
| be straight-joined back to the user table. No need to search
| a column.
| Vvector wrote:
| Joining a table on a column without an index will require a
| linear scan of the column. You almost always want an index
| on JOIN columns
| abtinf wrote:
| Two relations with a one-to-one relationship implies
| identical primary keys. Most implementations default to
| creating an index on the primary key.
|
| In this case, the premium user table only needs the user
| id primary key/surrogate key because it only contains
| premium users. A query starting with this table is
| naturally constrained to only premium user rows. You can
| think of this sort of like a partial index.
|
| One consequence of this approach is that queries
| filtering only for _non_-premium users will be more
| expensive.
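| In SQL terms, something like this hypothetical sketch
| (assuming the main users table is keyed by id):
|
|       CREATE TABLE premium_users (
|           user_id bigint PRIMARY KEY REFERENCES users (id)
|       );
|
|       -- Membership checks and joins ride on the implicit
|       -- primary key index; nothing else needs indexing while
|       -- the table only holds premium users.
|       SELECT 1 FROM premium_users WHERE user_id = 42;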
| lazide wrote:
| The primary key (userid?) will almost always be indexed
| implicitly. You'd usually have to go out of your way to
| avoid it.
|
| So it was probably being joined by an indexed column, but
| without an explicitly defined index.
| raldi wrote:
| We had lots of indexes on the user table; there was a
| separate table with details of paying users.
| holoduke wrote:
| Having a separate table containing all the premium users is
| different from an extra column in the normal user table. In
| the extra-table example you don't really need an index (in the
| premium user table) if you have only 100 premium users.
| pc86 wrote:
| You wouldn't need any of these WHERE clauses if you have all
| the premium users in one table. Even managing the users in
| both tables is pretty trivial when you just add someone as
| they pay and remove them as the premium account expires.
| [deleted]
| dmak wrote:
| Wow really? This seems really surprising to me.
| twoodfin wrote:
| Speculating, but presumably this table only needed
| consultation during creation of a new user session. That's
| probably a pretty heavyweight operation to begin with, so
| adding a scan of a few KB (user IDs being modestly sized
| integers) for every thousand currently paying users is NBD.
| raldi wrote:
| Nope; it was used every time we needed to look up whether
| someone was a subscriber, which was cached for some things
| but not all of them.
| mrguyorama wrote:
| It is infinitely easier to drop an index on a table that is
| causing poor performance than it is to add a good index to a
| table that suddenly needs one.
|
| Index everything, drop those that make things worse.
| winrid wrote:
| Don't add indexes. Add them later when things are slow and
| everything is on fire!
|
| Source: guy that gets hired to fix database stuff on fire.
| [deleted]
| [deleted]
| syngrog66 wrote:
| rule of thumb: O(n) latency is almost always fine when (n==1)
| jameshart wrote:
| "We'll add an index when it gets slow" is saying "screw the guy
| who's on call the night this system starts timing out".
|
| Invisible cliffs in your code that it will fall off at some
| unknown point in the future are the antithesis of engineering.
|
| If you deliberately aren't implementing indexing, know at what
| point indexing would begin to matter. Put in place guard rails so
| the system doesn't just blow up unexpectedly.
|
| There are 100% cases where _you shouldn't use indexes_. That's
| not the same thing as _being able to get away without an index_.
| lmm wrote:
| Indices are far more likely to cause a cliff than to remove
| one, IME. Full table scans are linear and grow linearly.
| Indices are crazy fast until one day they're suddenly not.
| themerone wrote:
| Usually things will slow down gradually or fail right away
| under production loads.
| magicalhippo wrote:
| Unless production loads are non-uniform in time, think Black
| Friday or similar.
| jameshart wrote:
| Yup. Quarter end reporting has been going swimmingly for
| years but this time the orders query is hitting a
| connection timeout and the numbers are due tomorrow.
|
| That threshold doesn't gradually creep up on you, it hits
| you all at once at the worst possible time.
| pixl97 wrote:
| W.U.C.T
|
| Works until critical threshold.
|
| In enterprise software I noticed software tends to work in a
| WUCT'ed up manner. Things may slow down over time, but no one
| complains about it because "the software is just slow", then
| suddenly one day you hit the timeout of some higher
| layer and the software is completely and utterly broken.
| marcosdumay wrote:
| Well, this isn't a valid reason. It's not the software that
| is causing the WUCT.
|
| If you let WUCT run free in any complex process, you are
| just asking for everything to break all the time and people
| to do nothing but fire-fighting. So you are much better off
| dealing with the WUCT causes instead of insisting that
| software or any other component scale forever. (Because
| they never will, and you will just make something else
| break.)
|
| WUCT is one of the main reasons why we monitor system
| operations.
| bawolff wrote:
| Everything has a critical threshold. Even your perfect db
| schema will still eventually run out of disk. There are no
| magic solutions that work on all scale levels.
| david422 wrote:
| But when you have 100 places all adding little slowdowns it
| can be difficult to realize.
| jefftk wrote:
| _> screw the guy who's on call the night this system starts
| timing out_
|
| This was a very small billing practice, and that person was
| going to be me. I thought then, and still think now, that I
| made a reasonable trade off between what would be work at the
| time and potential future urgent work.
|
| Additionally, this wasn't the sort of thing that would fail
| suddenly when you hit a critical point. Instead, running full
| text searches (a very small fraction of what the database was
| handling) would just slow down in a way that would have been
| very clear to the users.
| jameshart wrote:
| Stealing time from future you also doesn't pay off. Future
| you wants to take vacations and change jobs and have family
| events and sleep and stuff.
|
| It doesn't take much work to do the back of envelope math:
|
| How long should these queries take? Less than two seconds?
|
| How much slower do they get as record counts increase? 100ms
| every 250,000 records? Okay, so this will become intolerably
| slow when the record count hits about 5 million. When's that
| going to happen? Never? Then we're done here. Within a few
| months? Well let's plan for that now. Within a few years?
| Maybe not within the lifetime of this system? Judgement call.
| Let's put something in place that forces us to at least look
| back at this before it gets bad. Even if that's just a google
| calendar reminder.
| jefftk wrote:
| When building systems we are always making trade-offs
| between the present and the future. It's only "stealing
| time from future you" when you make a bad trade-off;
| otherwise it's prioritization.
|
| In this case, leaving out an index for full text search
| meant I didn't need to maintain the software for that index
| either, which would have been stand-alone at the time I was
| building this. Possibly this even was a choice that was, in
| expectation, time-saving for future me.
| plugin-baby wrote:
| > Invisible cliffs in your code that it will fall off at some
| unknown point in the future are the antithesis of engineering.
|
| On the other hand, they are the essence of _software_
| engineering.
| [deleted]
| hinkley wrote:
| As the number of cliffs increases, so does the likelihood that
| you will hit a second cliff in the middle of fixing the first
| one. At some point you hit a Kessler Syndrome situation where
| people start quitting and the people who should have fixed the
| problem in the first place aren't around anymore.
|
| Nobody wants to stick around for a lecture they can see coming.
| bob1029 wrote:
| > Invisible cliffs in your code that it will fall off at some
| unknown point in the future are the antithesis of engineering.
|
| I think this describes the general shape of problem seen in any
| engineering domain. After a certain point of time (i.e.
| unknown), we can no longer guarantee certain properties of the
| system. This is why we add all kinds of qualifiers and
| constraints in our discussions with the customer.
|
| Certainly, you can make some predictions about how tables will
| grow over time, but there are also some potential headaches
| that can be incurred if you arbitrarily add indexing. "We
| _might_ need to expire records to manage growth," so you go
| ahead and index that CreatedUtc column. Turns out, the usage
| patterns of your app changed such that that table is now
| experiencing 20x more insert volume than you could have ever
| anticipated and your prediction is now a gigantic liability.
| With 20x insert volume you would have _never_ indexed that
| column. You would have used some alternative path involving
| temporary tables, etc. Now, you are stuck with essentially the
| same problem - a judgement call at 2am regarding whether or not
| you should drop an index while the rest of your team is in bed.
|
| Since I am unable to predict the future I feel like waiting
| until the _actual_ problem arises and then dealing with it at
| that moment is the best option. Strategically, _not_ having an
| index will likely fail more gracefully than if an index is
| bottlenecking requests during peak load (i.e. instantaneous vs
| cumulative data volume).
| kevan wrote:
| The key to sleeping at night is to add metrics and alarms
| near those cliffs or 'edges of the map' so they aren't
| invisible anymore. It's hard to anticipate every angle to
| this but in any case where you're making this kind of
| assumption or tradeoff it's a good idea.
|
| A simple example is setting an alarm on your max load test
| traffic. When you get that alarm you know your system is now
| operating in unknown territory and it's probably time to do
| another load test or at least take a close look at scaling.
| jorblumesea wrote:
| Largely agree with this take; it often becomes "add an index
| because the query _might_ be slow" without much discussion of
| the trade-offs around query patterns. There's a lot of "what
| if" over-engineering that happens where you end up in
| hypothetical scenarios.
|
| Look at your real world use cases, query patterns and think
| hard about it.
| jameshart wrote:
| Not advocating for building indexes for queries you might
| later want to run.
|
| But for the queries you are building today? Don't build them
| such that they will gradually slow down over time.
| [deleted]
| captainmuon wrote:
| The inverse is also true, often you have linear or even quadratic
| lookup times when something like an index can give you the result
| "immediately".
|
| I wish languages other than SQL had a concept of adding an index.
| Imagine a variant of Python's list or C++'s vector where you can
| add indices on arbitrary fields of the contents, like a more
| flexible dict. Something like (pseudocode)
|
|       books = List<Book, indices=[Book.title, Book.author]>
|
| and then you could look up by title or author without traversing
| the whole list. Sure you can just use a separate dict, but then
| you have to keep both in sync, it is harder to bind it to UI
| controls, and so on.
| magicalhippo wrote:
| For C++ there's Boost's multi_index[1].
|
| [1]:
| https://www.boost.org/doc/libs/1_82_0/libs/multi_index/doc/i...
| Mailtemi wrote:
| I used multi_index a lot in the past. However, since I also
| frequently use SQLite, I have decided to exclusively use in-
| memory SQLite where I would have used multi_index. If there is
| a problem, temporarily using SQLite as a file makes it a lot
| easier to track the issue. When multi_index patterns become too
| complex, it's natural to wrap them in a unit test specifically
| designed around SQLite.
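| A minimal sketch of that idea (SQLite syntax; the schema is
| invented):
|
|       -- sqlite3 ':memory:' stands in for a multi-indexed
|       -- container:
|       CREATE TABLE books (title TEXT, author TEXT);
|       CREATE INDEX books_title_idx  ON books (title);
|       CREATE INDEX books_author_idx ON books (author);
|       -- For debugging, open a file instead of ':memory:' and
|       -- the same data becomes inspectable on disk.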
| jmcomets wrote:
| In my experience these types of abstractions end up bloated with
| too many concerns, which adds maintenance overhead. It's a lot
| simpler to add a `HashMap<Book.title, Book>` and cover it
| with tests than to use an abstraction.
|
| I want languages to do less magic, not more, since I read code
| more than I write it.
| vertica3525 wrote:
| [flagged]
| PaulHoule wrote:
| One of the problems of the RDF and SPARQL world is that the
| standard index structure considers the 6 possible permutations of
| ?subject ?predicate ?object.
|
| You can answer many queries with good efficiency with those
| indexes but building them gets brutal when you are handling
| billions of triples.
|
| I went to a talk at Dreamforce years ago where the CTO explained
| their database architecture, and it is interesting that the core
| table in Salesforce is literally a collection of triples. They
| have a smart optimizer that builds views, materialized views,
| indexes, etc. in Oracle based on the queries you run.
|
| I built a bot that would download a lot of records from an
| organization, build a "machine learning" model (built by a
| geospatial analyst; we didn't call it that in those days) that
| automatically assigns opportunities to salespeople, and then
| pushes the assignments into the system. The first time I ran the
| big query in the test system it timed out, then I went to the
| bathroom and when I came back the second time it worked
| perfectly. When it went to production exactly the same thing
| happened.
| cdcarter wrote:
| > it is interesting that the core table in Salesforce is
| literally a collection of triples
|
| This is... not entirely true. Most of the data for each
| "Object" in salesforce are stored in a series of columns on a
| single table. These are called "flex columns" in the official
| whitepaper describing the architecture. [0]
|
| However, _foreign key relationships_ are indeed stored in a
| table that acts like a collection of triples. This is used in
| place of native foreign keys, and is indeed deeply integrated
| with the optimizer.
|
| [0]:
| https://www.developerforce.com/media/ForcedotcomBookLibrary/...
| CathalMullan wrote:
| Lately, I've been investigating 'serverless' databases
| (PlanetScale, Turso) that charge based on the number of rows
| read/scanned. As a result, instead of just relying on the
| perceived performance of a query, I've started monitoring the
| number of rows scanned, which has emphasized the importance of
| indexes to me. Granted, the cost per rows read is minuscule
| (PlanetScale charges $1 per billion!), but it's still something I
| keep in mind.
| magicalhippo wrote:
| On the flip side, a missing index can bring down production.
|
| We experienced that just a couple of weeks ago, where a missing
| index and an unexpected 1000x increase in volume at a customer
| brought the DB to its knees. Sure it was still serving queries
| but at such a low rate it was effectively useless.
|
| For queries that will be run more than once, I try to make sure
| there's an index they can use for something fairly unique.
| remus wrote:
| > ...an unexpected 1000x increase in volume at a customer
| brought the DB to its knees.
|
| From an outside perspective it seems like the huge increase in
| volume was more the issue! It sounds like an index helped a
| lot, but it would also have added cost for all those customers
| who didn't see that jump in volume.
| magicalhippo wrote:
| Well, of course the volume had something to do with it, but
| adding the missing index meant the system could easily handle
| that volume.
|
| The other customers pay for that index as well of course, but
| either the volume is low enough that it's trivial or it's
| large enough that they too saw an increase in speed.
| fnordpiglet wrote:
| Was this because of a missing index, or because the optimizer
| vomited on the queries without an index to provide it
| statistics? My gripe with RDBs is generally that the query
| optimizer is non-deterministic with respect to the query, i.e.,
| it can make a great decision with certain statistics and flip to
| a terrible one with slightly different statistics even though
| the original query would have performed basically the same.
|
| I'd rather have a database that query performance degrade
| gracefully and predictably with scale than some sort of black
| magic mumbo jumbo wake me up in the middle of the night because
| it lost its statistical little puny pea brain simply because it
| couldn't compute a bounded expectation any more.
| magicalhippo wrote:
| In this case it was a missing index, on a query run on every
| order line when saving - aka a lot.
|
| It had gone under the radar because the volume with our other
| customers had been low enough that the table scan of the
| queried table wasn't noticed.
|
| As mentioned a customer suddenly got 1000x the volume and it
| quickly became an issue.
|
| But yea, we have a job running in the weekends to recalculate
| statistics on key tables, as we've had issues with that
| grinding production to a halt before.
|
| And recently I sped up a query by 100x by removing a filter
| from the where clause that for some weird reason caused the
| DB to run a really poor plan. It was just a simple check
| intended to filter out some duplicates in a few edge cases,
| but I couldn't find a way to make the DB run it as a post-
| predicate. Moved it to my code for the 100x performance
| win...
| londons_explore wrote:
| I really want the optimizer to make an estimate of the CPU/IO
| to complete a query. Then, during the query, if that much
| effort has been expended and we haven't yet completed the
| query, then update the estimate. If the updated estimate now
| shows that the query plan is no longer quickest, then abort
| the query and restart with a different plan.
|
| Years ago I forked postgres and tried to implement the above.
| Initial results were promising - there were a good chunk of
| queries that ended up taking a different plan, and sometimes
| returning 100x quicker.
|
| Alas, the postgres codebase is hideously complex, and
| implementing the above to be production grade would be many
| months of work - and, due to the way postgres streams results to
| the client, might have actually required a change to the wire
| format.
| mumblemumble wrote:
| I used to be one of the "keep the database running" people at a
| high frequency trading firm. So, super high pressure; any
| outage meant we were potentially hemorrhaging money at a brain-
| melting pace.
|
| My lessons from that experience were twofold. First, you can't
| plan for sudden performance regressions ahead of time. The
| number of stars that have to align for that anticipatory index
| to actually help you is just too great. Even if you correctly
| guessed that a given table might be affected by one in the
| future, you'll never guess what fields it needs to include, and
| in which order, to support whatever new query or query plan
| change or whatever caused the problem. Second, Murphy's Law
| dictates that any attempt at preventing future read performance
| regressions by speculatively creating an index will end up
| causing a write performance problem instead.
|
| Better instead to just get in the habit of periodically
| reviewing query plans for every query in the database. If you
| know the scaling characteristics of the various operations and
| the kinds of statistics changes are likely to cause the
| optimizer to choose different ones (and you should), then it's
| easy enough to select for better scaling characteristics. This
| is, incidentally, an excellent excuse for making the engineers
| use sprocs or a clean data access layer or something instead of
| a heavyweight ORM, so that you have some sensible way of
| getting a warning before they change queries on you. You can't
| effectively aim for a target that's bouncing around
| erratically. Even better still - and only realistically
| feasible if you somehow managed to win that bigger war I just
| proposed - set up a tool to monitor query plans and send you an
| alert when they change unexpectedly, so you can hear about the
| problem when it's still small.
| winrid wrote:
| Are you just trying to create employment for yourself? This
| is amazingly bad advice. I literally get hired to fix these
| setups by people that take this approach.
|
| Yes, it's good to periodically review query plans.
|
| But designing and planning for data retrieval is part of a
| design process. And it's easy. Even Django since 2003 allowed
| you to define indexes inside your models. Got a 100m record
| table and you're adding a column you want to query by
| interactively? Roll out an index. This is normal day-to-day
| software development...
|
| The write overhead is wild speculation. 99% of businesses are
| read-heavy.
| lazide wrote:
| It sounds like he's thinking of it from the DBA
| perspective, where they have to react to sudden changes in
| behavior from devs with little or no warning - since the
| devs don't talk to them.
|
| DBAs doing proactive index creation when they don't know
| what the devs are doing is indeed futile.
|
| The devs, however, should definitely be doing proactive
| database design/planning whenever they do anything, since
| they can (and will!) cause emergencies and downtime by not
| considering the impact of how they are interacting
| with/using the database.
|
| If the devs are directly writing SQL, this is also
| relatively easy to get them into doing. If they're using a
| heavyweight ORM, it's nearly impossible to figure out what
| SQL it's going to run sometimes (and difficult to even
| trigger in a test), and so the devs often won't even try.
| mumblemumble wrote:
| Exactly this.
|
| The company had actually banned new uses of ORM and was
| in the process of eliminating existing usage of it while
| I was there. We had discovered that teams that used ORM
| had much higher production incident rates than teams that
| didn't, and it was fairly directly attributable to lack
| of understanding and predictability of what was happening
| in the database.
|
| Maybe not a huge deal if you're in a low-load situation.
| But HFT requires the A game at all times because you
| never know when some exciting news that causes trading
| volume to increase 50-fold almost instantaneously might
| happen.
|
| For the record, I was a dev and not a DBA. But I did work
| closely with the DBAs. And I was pretty irritated when I
| found out ORM was banned, because it definitely made the
| "writing new code" part of the job more laborious. But,
| hey, learning experience - it turns out that it was the
| right move. In the long run we were able to move faster
| once we stopped having to deal with the blowback from
| breaking things quite so often.
|
| It's a little bit like when I play Mario Kart with my
| kids. Why do I go so much faster than them? Mostly
| because they are constantly pushing hard on the
| accelerator button, while I ease off the gas and even hit
| the brakes sometimes. They think speed management is
| annoying. I think that I spend a lot less time bouncing
| off of walls and giving precious coins to Lakitu.
| pkulak wrote:
| Just because the table scan is under some threshold doesn't
| automatically make it better. If a table scan takes 250ms vs 0.01
| for an indexed lookup, you're still gonna have to justify to me
| why making silicon work that damn hard is worth even the
| electrical use. Are you inserting and deleting so many rows that
| maintaining the index is prohibitive? Do you have space concerns,
| and are not able to keep the index in memory, or maybe even on
| disk? The Bash history thing makes sense, because indexing is a
| lot of work. But otherwise, just use an index and move on to the
| next problem.
|
| EDIT: Has anyone else forgotten to add an index, then noticed the
| performance regression a year later? That's always fun, because
| now adding the index could mean downtime, and even knowing if or
| how much is difficult because putting that much data somewhere
| else to test isn't trivial. No thank you.
| eklitzke wrote:
| For small enough tables doing a full scan will be faster than
| an index. This is also true for regular non-database
| applications like checking if an item is present in a small
| collection: it's faster to do a linear scan over a small vector
| than it is to do a lookup in a hash table or b-tree. With a
| linear scan (whether it's for data on disk, or a data structure
| in memory) you don't have to do hashing or branching (except
| possibly to terminate the loop). With a binary search you
| basically get worst possible case for branch predictor, because
| if you're searching random values whether you go right/left on
| each recursion level is basically random. This is true for in-
| memory data structures, and it's even more true on disk, since
| disks (even SSDs) are especially optimized for linear scans
| compared to random accesses.
|
| It's been a while since I looked at this, but I think MySQL and
| Postgres both take this into account and will let you add
| indexes to small tables but will actually ignore them and do a
| full table scan for all queries if the table is small enough.
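| That behavior is easy to check; a Postgres-flavored sketch
| (untested here):
|
|       CREATE TABLE tiny AS
|         SELECT g AS id FROM generate_series(1, 50) g;
|       CREATE INDEX ON tiny (id);
|       ANALYZE tiny;
|       EXPLAIN SELECT * FROM tiny WHERE id = 7;
|       -- Typically plans a Seq Scan: reading the single page the
|       -- table lives on beats descending an index at this size.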
| pkulak wrote:
| Yes, that's correct. If you do an explain for a tiny table,
| as long as your stats are up to date, the index will be
| ignored. In that case, it's there for insurance if the table
| grows in the future.
| josephg wrote:
| What do you think the threshold number is where scanning a
| list will outperform a hash table? 10 items? 100? 1000? For
| what it's worth, for small data sets I agree with you. But I
| think it's very hard to be calibrated correctly on what small
| means here.
|
| Re: binary search, the biggest cost to scanning data in a
| database is loading that data from persistent storage. A few
| mispredicted branches aren't going to matter much if it means
| you can do fewer reads.
| ignoramous wrote:
| > _With a binary search you basically get worst possible case
| for branch predictor, because if you 're searching random
| values whether you go right/left on each recursion level is
| basically random._
|
| I think folks writing databases know a thing or two about
| writing faster search and sort algorithms.
| birdyrooster wrote:
| Why do you think that? Seems like a hard problem.
| anyfoo wrote:
| I'm not sure what you are trying to say, given the context
| of what you quoted was. Binary search has optimal worst
| case runtime complexity, but for small tables it is, on
| real computers, overall still slower than a linear scan,
| which effectively has the _worst_ worst case runtime
| complexity (unless you start to get pathological). This is
| because the constant in front of the runtime complexity,
| the one that O-notation explicitly ignores, starts to
| matter: Branch prediction and cache locality come into
| play.
|
| What exactly do you mean with "writing faster search and
| sort algorithms"?
| mcqueenjordan wrote:
| All of this looks accurate, but it's worth contextualizing:
| this is an optimization that bears the most fruit in "local
| frames of reference" -- the timescales where linear scans
| beat index lookups are likely to be strictly dominated by the
| network latency to transceive from the database. The
| conclusion is then that the optimization ~only optimizes for
| cases that effectively don't matter.
| lazide wrote:
| I agree - for 90% of cases.
|
| There are situations where the indexes end up larger than, or
| the same size as, the actual data and the query doesn't
| meaningfully benefit from having the data indexed because, for
| instance, all the data is going to be analyzed anyway, or the
| search type doesn't index usefully with the normal index types
| (like geo searches, or clustering type queries).
|
| Adding indexes that never get used has real costs on an
| ongoing basis with inserts/updates and schema updates too, as it
| adds potentially significant overhead on every operation and
| can make certain schema operations impossible without downtime
| too.
|
| Foreign key columns, 'soft delete' columns (like deleted/not),
| basic auditing type stuff (created on, updated on, etc),
| 'unique' or external reference values like a order id or
| whatever (even if not a primary/unique key in the schema),
| basic numeric/analysis columns are almost always worth indexing
| though, to your point.
|
| Stuff that is not always a clear win without some real thinking:
| freeform text fields, structured binary data (like a PDF or
| image), geo location data without a clear existing use (tends
| to require specialized index types which are also expensive to
| load/use), etc.
|
| Many times some preprocessing is necessary anyway to convert
| what you have to what you actually want, and putting that in a
| secondary column to be indexed is far more valuable.
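|
| As a sketch of that last point (Postgres syntax, invented names):
| rather than indexing a freeform email column directly, keep the
| derived value you actually search on next to it, and index that:
|
|     ALTER TABLE contacts ADD COLUMN email_lc text
|       GENERATED ALWAYS AS (lower(email)) STORED;
|     CREATE INDEX contacts_email_lc ON contacts (email_lc);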
| MaulingMonkey wrote:
| Justify the dev time to save micro-pennies worth of electricity
| to me instead.
|
| A typical naive index won't help with my regular expression
| based queries, which aren't easily accelerated by an index. Or
| given an in-memory index, you've just increased memory use from
| O(1) to O(N), and I'll OOM on large files. Perhaps you'll throw
| a database at the problem, complicating I/O (especially when
| the data is generated/accessed by third party tools that aren't
| database based), tying me to yet another library's update
| cycle, and perhaps forcing me to tackle the additional problem
| of cache invalidation. Perhaps I need reverse lookups, in which
| case whatever forward indexing I might've pessimistically added
| ahead of time will be of no help at all.
|
| If it's a 5-second "this is probably the right choice" kneejerk
| reaction, maybe it's fine. Or if you're Google, indexing the
| internet, I suppose. But I am frequently plagued by shit,
| buggy, useless indexes that merely distract from the proper
| Alexandrian solution to the Gordian knot - brute force -
| wasting more time than they ever saved, for both people and
| CPUs.
| pkulak wrote:
| > Justify the dev time to save micro-pennies worth of
| electricity to me instead.
|
| KEY (user_id)
|
| I mean, it's a dozen characters. Do you need to know how fast
| I type before you run the calculation?
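|
| For context, that's the MySQL shorthand inside a CREATE TABLE;
| spelled out against an invented table it's something like:
|
|     CREATE TABLE sessions (
|       id      BIGINT AUTO_INCREMENT PRIMARY KEY,
|       user_id BIGINT NOT NULL,
|       KEY (user_id)  -- the dozen characters in question
|     );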
| MaulingMonkey wrote:
| > Do you need to know how fast I type before you run the
| calculation?
|
| I'll assume 100WPM, call that two words, billed at
| $200/hour and call that $0.06, falling under "too cheap to
| be worth arguing against", which falls under the
| aforementioned:
|
| >> If it's a 5-second "this is probably the right choice"
| kneejerk reaction, maybe it's fine.
|
| That said, there's a decent chance those 6 cents won't pay
| for themselves if this is a login on a single user system,
| with any theoretical benefits of O(...) scaling being
| drowned out by extra compile times, extra code to load -
| and I'd be plenty willing to NAK code reviews that merely
| attempt to replace /etc/passwd and /etc/shadow with this,
| as the extra time code reviewing the replacement still has
| negative expected ROI, and it'll be a lot more involved
| than a mere dozen characters.
|
| Now, maybe we're attempting to centralize login management
| with Kerberos or something, perhaps with good reasons,
| perhaps which does something similar under the hood, and we
| could talk about _that_ and _possible_ positive ROI,
| despite some actual downtime and teething concerns?
| TylerE wrote:
| Now document the two words, run the test suite to verify no
| breakage, commit to source control, and push to
| production.
|
| Suddenly those two words cost a lot more than $0.06, and
| that's IF everything goes smoothly.
| jefftk wrote:
| (author)
|
| If you're already using a relational database you should
| almost certainly set up indexes on your table ids and
| foreign keys. But that's pretty different from the examples
| in the post!
|
| I'm not anti-index, I'm anti-"if you ever have a full table
| scan in production you're doing it wrong".
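|
| A minimal sketch of what that means (Postgres syntax, invented
| names; note that the primary key gets an index automatically,
| but a foreign key does not):
|
|     CREATE TABLE orders (
|       id      bigserial PRIMARY KEY,  -- indexed automatically
|       user_id bigint NOT NULL REFERENCES users (id)
|     );
|     CREATE INDEX orders_user_id ON orders (user_id);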
| shpx wrote:
| First tell me the minimum amount of time typing this would
| have to take for you to agree it's not worth it. Then I will
| keep adding costs until we go over that limit: the time it
| takes for someone to ask you to do this, for you to start VS
| Code, find the file, press ctrl-s, and deploy the changes,
| plus some estimate of how long it takes a new developer to
| read and understand this system versus a simpler one (for
| that part, please tell me how fast you speak and an agreeable
| value for how fast the average developer reads).
| scott_w wrote:
| > Justify the dev time to save micro-pennies worth of
| electricity to me instead.
|
| The time spent justifying it is longer than the dev time
| itself. Any semi-experienced engineer will throw basic
| indexes into their data model without even thinking and cover
| the most common use cases.
|
| If they never use them... who cares? It took no time to add.
| yuliyp wrote:
| An RDBMS is not the best data store for all data. Sometimes
| flat files or other plain stores are the simplest tool for the job, as
| shown in the article. Adding a database to any of those
| tools would definitely not be worth the trade-off.
| muhehe wrote:
| > Justify the dev time
|
| And this is exactly the sentiment that got us where we are.
| lazide wrote:
| The mistake in your statement, I think, is assuming there is
| a course of action that wouldn't result in a problem
| somewhere.
|
| It may look a little different, but there is nothing
| without tradeoffs.
|
| Including indexing _everything_.
| teen wrote:
| not sure if this is trolling?
| jasonwatkinspdx wrote:
| So, forgetting an index and then growth pushing you over the
| threshold is a valid concern. I think every dev has run into
| that at some early point in their career.
|
| But what your comment is skipping past is there's potentially a
| 100x or higher difference in throughput for sequential scans vs
| indexes. If you know the data will have a bounded size this
| means indexes aren't necessarily a good choice. SSDs have
| narrowed this gap a great deal but it still exists because of
| latency and unpredictable prefetching. It even exists with pure
| in memory applications. Another key aspect is how much of the
| data the query ends up touching. If you're hitting 25% of the
| data anyhow a linear scan is likely faster.
|
| There are also more niche ideas, like arranging to convoy
| multiple queries along one linear scan, something impossible
| with indexed scans.
|
| It's useful to know about this asymmetry and that sometimes a
| brute force linear scan is in fact the best tool.
| leetrout wrote:
| Agree but you can always add an index without downtime. It just
| becomes a money and complexity issue ;)
| zht wrote:
| can you do this (on a large table) without adding significant
| IO load/clogging up replication for an extended period of
| time?
| lazide wrote:
| It's worth noting that if your DB instance is so heavily
| loaded that this is a real concern, you already have a
| _huge_ problem that needs fixing.
| mschuster91 wrote:
| AWS is particularly bad with their performance credit
| system on RDS... and there's to my knowledge no way to
| tell MySQL to limit index creation IOPS, which means in
| the worst case you're stuck with a system swamped under
| load for days and constantly running into IO starvation,
| if you forget to scale up your cluster beforehand.
|
| Even if the cluster is scaled to easily take on your
| normal workload, indexing may prove to be too much for IO
| burst credits.
| nightpool wrote:
| I have never had any problems with CONCURRENT index
| creations under significant load using Postgres, fwiw
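|
| For reference, that's (with an invented index and table):
|
|     CREATE INDEX CONCURRENTLY orders_user_id ON orders (user_id);
|
| It builds without blocking writes to the table, at the cost of
| a slower build, and it can't run inside a transaction block.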
| lazide wrote:
| That does seem like a real problem! Adding indexes
| periodically is a pretty regular thing for any production
| system where I come from.
| pornel wrote:
| It may be so loaded from all the full table scans it's
| doing.
| anyfoo wrote:
| > If a table scan takes 250ms vs 0.01 for a indexed lookup
|
| I'm not sure whether the units of that latter number are still
| ms or now s or whatever, but either way isn't that where you
| are wrong? On real computers, there are lots of situations
| where linear access is trivially faster than a "better" data
| structure with theoretically better runtime complexity.
|
| 1000 accesses to a single page for a linear scan are going to
| be many, many orders of magnitude faster than 5 accesses
| chasing pointers across 5 pages that have no TLB entry, or
| excruciatingly worse, that need to be paged in from disk.
|
| This is an extreme (but absolutely not unrealistic!) example
| for illustration. Slightly more subtle scenarios involving e.g.
| branch prediction have already been named here.
|
| This problem exists across multiple layers, not just close to
| the CPU like above. For example, for data on spinning disks,
| linear reads are much faster than head seeking. SSDs have
| changed that dramatically, but they still have artifacts (for
| example, having to access another block, especially an
| unpredictable one, still has overhead from different causes).
| slaymaker1907 wrote:
| TiddlyWiki only uses indices for tags/fields, yet it still
| works pretty well so long as the wiki isn't huge. There's
| something to be said for the flexibility an approach like that
| provides.
| jacobsenscott wrote:
| Most databases will ignore indexes if the table is small enough,
| so the database sort of does this for you. The index can just be
| a waste of space, until it isn't.
| pyrophane wrote:
| I think the risk of creating indexes prematurely is really that
| you will make assumptions about how your table is going to be
| queried that don't turn out to be correct, so you create indexes
| that you will never need, even if the table gets large.
|
| That probably isn't the end of the world, but it can make it
| harder for others to understand what is going on in your database
| (indexes can also be a form of documentation) and could make your
| ORM code somewhat more complex.
|
| The worst case scenario, I suppose, would be you create an index
| on a table that grows large, and keep paying the write and
| storage overhead for an index nothing ever uses.
| giraffe_lady wrote:
| This has matched my experience too. Most experienced developers
| can probably identify an inappropriately hot table and add an
| index to it. Whereas a DB with dozens of individually modest
| indexes that add up to be huge is much trickier and more
| labor-intensive to confidently improve.
| jmcomets wrote:
| FWIW, most DBMS have built-in index usage stats, and it's not
| too difficult to query. In my previous job we had a Postgres
| health dashboard that notably showed `log10(index size) / index
| hits`, making it pretty clear when some recently added index
| was useless.
|
| My opinion is that indexes should be either logical (e.g. used
| for exclusion constraints) or purely for performance (trading
| space for better QPS). Query patterns change, specs change, so
| monitoring your DB's performance is the way to go.
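|
| In Postgres that's roughly something like this (a sketch of the
| idea, not our exact dashboard query):
|
|     SELECT indexrelname,
|            idx_scan,  -- how many times the index has been used
|            pg_relation_size(indexrelid) AS size_bytes
|     FROM pg_stat_user_indexes
|     ORDER BY idx_scan ASC, size_bytes DESC;  -- big, unused indexes first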
| Aachen wrote:
| Moral of the post: don't do premature optimization. It's a common
| adage, but this is a good reminder and example of it.
|
| The exception is the one case where OP argued that the queries
| were so rare compared to the data updates that maintaining the
| index is more expensive. Which is also pretty classic when you're
| taught about indexes.
|
| What I recently learned the hard way about indexes is that
| they're slow when you have to read "large" chunks of a table to
| find an answer, in my case computing a median for an area. I'm
| indexing geospatial data and Europe+NA are overrepresented. When
| you view an area the size of ~Germany, the query would take 20
| minutes and compute the median over ~3% of all records whereas a
| full table scan (whole world) took something like 40 seconds
| (with the median being computed over the same 3%, it would just
| evaluate the WHERE clause against every row instead of finding
| rows to include via the index). That's the power of a sequential
| read compared to walking an R-tree. I haven't had such bad
| experiences with B-tree indexes; not sure if they would behave
| just as badly. On the other hand, if you ask for an area the size
| of Amsterdam, the index is normally fast, and for <50 rows it's
| basically instant, whereas the full table scan would still take
| the same 40 seconds.
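|
| If you want to reproduce that kind of comparison on Postgres, you
| can steer the planner per session (the settings below are real
| Postgres knobs; the table and query are invented):
|
|     EXPLAIN ANALYZE SELECT count(*) FROM points
|       WHERE lat BETWEEN 47 AND 55 AND lon BETWEEN 5 AND 15;
|     -- force the sequential plan and compare timings
|     SET enable_indexscan = off;
|     SET enable_bitmapscan = off;
|     EXPLAIN ANALYZE SELECT count(*) FROM points
|       WHERE lat BETWEEN 47 AND 55 AND lon BETWEEN 5 AND 15;
|     RESET enable_indexscan;
|     RESET enable_bitmapscan;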
| iLoveOncall wrote:
| Adding an index is an incredibly expensive task on a dataset
| big enough to actually need one.
|
| It's something that is likely to require massive engineering
| effort on a live system.
|
| Removing an index, on the other hand, is never an issue.
|
| It's not at all premature optimization, it's simply basic
| software design.
| Aachen wrote:
| Read OP's post, please. They're talking about systems where it
| was acceptable to always do full table scans, in one case
| having 350 rows.
|
| Not "an incredibly expensive task on a dataset so big that it
| needs one". Thus I read OP's post as recommending to not do
| premature optimization (without using those words literally).
| zokier wrote:
| > I use grep on a flat file, and testing now it takes 200ms for a
| query across my 180k history entries
|
| 200ms is an order of magnitude more than I'd prefer for
| interactive use.
| fnordpiglet wrote:
| Yeah, that was my thought too, and I thought about fishing out
| the unbranded $1 1GB microSD in my drawer that I use for 3D
| printing to help him out with his 0.5GB storage challenge.
| PhilipRoman wrote:
| 200ms for 180k entries sounds way too slow. Just tested on my
| 150k line log file and it takes about 10ms on average for
| various kinds of patterns with various hit frequencies.
|
| On an unrelated note, from a quick check with "strace", GNU
| grep seems to detect when output is redirected to /dev/null as
| no write syscalls are being made in that case.
| watermelon0 wrote:
| Unrelated note intrigued me, so I went looking in the source
| code; detection is here: https://git.savannah.gnu.org/cgit/gr
| ep.git/tree/src/grep.c#n...
|
| Interesting that they would optimize for this use case.
| remram wrote:
| Especially since grep -q exists. It would never come to my
| mind to redirect grep's output to /dev/null; I wonder how
| this optimization came to be?
| jefftk wrote:
| I was going to say that it was "instant", because that's how it
| feels when using it, but got 200ms from running "time" on
| grepping it to include a number in the post.
| doodlesdev wrote:
| Have you tried out using ripgrep [0] or fzf [1]? 200ms is
| quite a lot for history search, I'd expect search to be 20ms
| or less. In fact, it should be instantaneous as it should all
| fit in memory really. I've been using fzf and the searches
| are all faster than my monitor can display frames (so less
| than 16ms), although I only have 10k entries on my history as
| I clear it up once in a while.
|
| [0]: https://github.com/BurntSushi/ripgrep
|
| [1]: https://github.com/junegunn/fzf
| jefftk wrote:
| Trying now, "cat > /dev/null" takes ~150ms, so I suspect
| it's related to keeping my history in Google Drive (with
| local mirroring).
|
| If it gets annoyingly slow I'll stop doing that and use a
| different syncing system.
| perlgeek wrote:
| Slightly off topic, but just the other day we had a case where we
| had an index, and deleting ~100k entries from a table with a few
| small columns caused a huge delay.
|
| Context: MariaDB 10.2. In this table we stored traffic
| statistics, aggregated by IP address and day. Probably less than
| 200 bytes per row. Rows were written once per day and then
| cleaned up by a cron job (every month an entire month's worth of
| data was discarded): "DELETE FROM traffic_stats WHERE day < ?"
|
| This worked fine on the primary, but our monitoring told us that
| the replicas started to lag, and showed that they spent
| inordinate amounts of time on the delete statement above. Even
| though we had an index on the "day" column.
|
| Even weirder, the problem didn't show up in our QA database
| cluster, which had far fewer resources (a VMware VM in QA vs.
| pretty solid bare metal in prod).
|
| Even more puzzling, skipping the replication step and running the
| DELETE statement manually was quite fast (a second or so).
|
| After some frantic searching, it turned out that, for reasons
| lost in time, the QA db cluster used "mixed" replication, while
| the prod cluster used row-based replication. In row-based
| replication, the leader communicates to the replicas which _rows_
| were deleted. And since we didn't have a primary key on that
| table, the replicas did a full-table scan to find each record to
| delete, and repeated that for each of the 100k records being
| deleted.
|
| Adding a primary key to the table fixed the issue immediately,
| and we switched to mixed replication shortly after.
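|
| For reference, the fix was a single statement of roughly this
| shape (column names are illustrative):
|
|     ALTER TABLE traffic_stats ADD PRIMARY KEY (ip_address, day);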
| eitally wrote:
| I'm very familiar with this sort of problem. I used to work in
| high-volume electronics manufacturing, and coercing high-volume,
| variable-transaction OLTP databases into being 1) backupable and
| 2) replicable for analytics workloads often feels like a dark
| art. In a lot of cases, trial and error is a perfectly reasonable
| approach, for better or for worse.
| XCSme wrote:
| Why not just add the index? It future-proofs performance, and,
| as mentioned in the article, for small datasets performance is
| not an issue anyway, so the index won't hurt while the dataset
| stays small.
| jasfi wrote:
| I agree: you don't want performance to suddenly degrade and
| then have to find out why. This can be especially bad when it
| happens across many tables that should have indexes, and
| performance is sluggish, but not for any specific action.
|
| Adding indexes appropriately is a best practice.
| [deleted]
___________________________________________________________________
(page generated 2023-05-25 23:00 UTC)