[HN Gopher] A SQL Heuristic: Ors Are Expensive
___________________________________________________________________
A SQL Heuristic: Ors Are Expensive
Author : ethanseal
Score : 42 points
Date : 2025-09-29 13:29 UTC (9 hours ago)
(HTM) web link (ethanseal.com)
(TXT) w3m dump (ethanseal.com)
| ntonozzi wrote:
| I really like the extension pattern. I wish more of the tables at
| my company used it.
|
| Another huge benefit that we're realizing as we move some of our
| heaviest tables to this pattern is that it makes it really easy
| to index a core set of fields in Elasticsearch, along with a tag
| to associate it with a particular domain model. This has
| drastically cut down on our search-specific denormlization and
| let us avoid expensive index schema updates.
| prein wrote:
| This sort of thing is why looking at generated SQL while
| developing instead of just trusting the ORM to write good queries
| is so important.
|
| I find query planning (and databases in general) to be very
| difficult to reason about, basically magic. Does anyone have some
| recommended reading or advice?
| parshimers wrote:
| https://pages.cs.wisc.edu/~dbbook/ is a great overview,
| specifically chapters 13 and 14 on query optimization. it is
| difficult to reason about though, and every compiler is
| different. it takes time and enough examples to look at a query
| and have an intuition for what the plan should look like, and
| if there's something the compiler is not handling well.
| jalk wrote:
| It's a big help if you know how to retrieve and interpret
| execution plans for the database you use.
| hobs wrote:
| Yes, I was going to say, seeing the generated SQL can be
| almost useless depending on the execution plan.
|
| When you have a solid view of the schema and data sizes you
| can start to be more predictive about what your code will
| actually do, THEN you can layer on the complexity of the ORM
| hell code.
| ethanseal wrote:
| Highly recommend https://use-the-index-luke.com/
|
| It's very readable - I always ask new hires and interns to read
| it.
| jalk wrote:
| Can we agree that this is only applies to queries where all the
| filter conditions use cols with indexes? If no indexes can be
| used, a single full table scan with OR surely is faster than
| multiple full table scans.
| ethanseal wrote:
| Absolutely. Though I don't recall seeing multiple sequential
| scans without a self-join or subquery. A basic filter within a
| sequential scan/loop is the most naive/simplest way of
| performing queries like these, so postgres falls back to that.
| Also, fwiw, BitmapOr is only used with indexes:
| https://pganalyze.com/docs/explain/other-nodes/bitmap-or.
| jalk wrote:
| That was the extreme case - the multi-scan would be gotten if
| a casual reader tried your (neat by all means) AND-only query
| on a non-indexed table (or partially indexed for that
| matter).
| ethanseal wrote:
| Gotcha, I misunderstood your comment. The multiple counts
| is a definitely very contrived example to demonstrate the
| overhead of BitmapOr and general risk of sequential scans.
| cyanydeez wrote:
| Yeah, just watch community. Every OR splits the universe, duh.
| galkk wrote:
| I strongly dislike the way the polroblem is presented and the
| "solution" is promoted. Author mentions merge join with count of
| top, and if th database supports index merges, it can be
| extremely efficient in described scenario. There are a lot of
| real optimizations that can be baked in such merges that author
| chooses to ignore.
|
| The generalized guidance without even mentioning database server
| as a baseline, without showing plans and/or checking if the
| database can be guided is just a bad teaching.
|
| It looks like author discovered a trick and tries to hammer
| everything into it by overgeneralizing.
| ethanseal wrote:
| For sure, there's definitely a lot of cool techniques (and I'm
| not aware of all of them)! And the first example is very much
| contrived to show a small example.
|
| I'm not super familiar with the term index merge - this seems
| to be the term for a BitmapOr/BitmapAnd?
|
| Is there another optimization I'm missing?
|
| The article links to my code for my timings here:
| https://github.com/ethan-seal/ors_expensive
|
| There is an optimization that went in the new release of
| PostgreSQL I'm excited about that may affect this - I'm not
| sure. See
| https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit...
| Sesse__ wrote:
| > I'm not super familiar with the term index merge - this
| seems to be the term for a BitmapOr/BitmapAnd?
|
| Different databases will use similar terms for different
| operations, but I would guess that the comment refers to
| something similar to MySQL's index merge (which is
| essentially reading the row IDs of all the relevant ranges,
| then deduplicating them, then doing the final scan; it's
| similar to but less flexible than Postgres' BitmapOr).
| ethanseal wrote:
| Cool. I'll have to read up on that.
| to11mtm wrote:
| If ORs are expensive you've probably got a poor table design or
| the ORs are for some form of nasty query that is more
| UI/Dashboard based.
|
| A good table has the right indexes, and a good API to deal with
| the table is using the RIGHT indexes in it's criteria to get a
| good result.
| Animats wrote:
| The query optimizer knows how many items are in each index, but
| has no advance idea how many items will be in the result of a
| JOIN. An "a OR b" query on a table with millions of rows might
| have three hits on A, or millions of hits. The optimal query
| strategy for the two cases is very different.
|
| Has anyone put machine learning in an SQL query optimizer yet?
| throwaway81523 wrote:
| Some traditional planners try some different plans on random
| subsets of the data to see which plan works best. Don't need
| machine learning, just Bayes' rule.
| Sesse__ wrote:
| > The query optimizer knows how many items are in each index,
| but has no advance idea how many items will be in the result of
| a JOIN.
|
| Query optimizers definitely try to estimate cardinalities of
| joins. It's a really, really hard problem, but the typical
| estimate is _much_ better than "eh, no idea".
| ethanseal wrote:
| I think having a way to build statistics on the join itself
| would be helpful for this. Similar to how extended statistics^1
| can help when column distributions aren't independent of each
| other.
|
| But this may require some basic materialized views, which
| postgres doesn't really have.
|
| [1]: https://www.postgresql.org/docs/current/planner-
| stats.html#P...
| singron wrote:
| There are many papers on ML for query planners. You can search
| for "Learned Query Optimization". Some use ML just for the
| cardinality estimation.
| Glyptodon wrote:
| If optimization is really as simple as applying De Morgan's laws,
| surely it could be done within the query planner if that really
| is the main optimization switch? Or am I misreading this somehow?
|
| Edit: I guess the main difference is that it's just calculating
| separate sets and then merging them, which isn't really
| DeMorgan's, but a calculation approach.
| Sesse__ wrote:
| In most cases, the end result is a lot messier than it looks in
| this minimal example. Consider a case where the query is a
| 12-way join with a large IN list somewhere (which is
| essentially an OR); would it still look as attractive to
| duplicate that 12-way join a thousand times?
|
| There _are_ optimizers that try to represent this kind of
| AND/OR expression as sets of distinct ranges in order to at
| least be able to consider whether a thousand different index
| scans are worth it or not, with rather limited success (it
| doesn't integrate all that well when you get more than one
| table, and getting cardinalities for that many index scans can
| be pretty expensive).
| ape4 wrote:
| The OR query certainly states user intent better then the
| rewrite. So its better at communicating with a future programmer.
| sgarland wrote:
| So add a comment. "X is easier to understand" shouldn't mean
| taking a 100x performance hit.
| sgarland wrote:
| As an aside, MySQL will optimize WHERE IN better than OR,
| assuming that the predicates are all constants, and not JSON.
| Specifically, it sorts the IN array and uses binary search.
|
| That said, I'm not sure if it would have any impact on this
| specific query; I'd need to test.
| renhanxue wrote:
| The article is specifically discussing cases where you have
| predicates on different columns OR'ed together, like col_a =
| 'foo' OR col_b = 'foo'.
___________________________________________________________________
(page generated 2025-09-29 23:00 UTC)