hngopher.com

       [HN Gopher] A SQL Heuristic: Ors Are Expensive
       ___________________________________________________________________
        
       A SQL Heuristic: Ors Are Expensive
        
       Author : ethanseal
       Score  : 42 points
       Date   : 2025-09-29 13:29 UTC (9 hours ago)
        
 (HTM) web link (ethanseal.com)
 (TXT) w3m dump (ethanseal.com)
        
       | ntonozzi wrote:
       | I really like the extension pattern. I wish more of the tables at
       | my company used it.
       | 
       | Another huge benefit that we're realizing as we move some of our
       | heaviest tables to this pattern is that it makes it really easy
       | to index a core set of fields in Elasticsearch, along with a tag
       | to associate it with a particular domain model. This has
       | drastically cut down on our search-specific denormlization and
       | let us avoid expensive index schema updates.
        
       | prein wrote:
       | This sort of thing is why looking at generated SQL while
       | developing instead of just trusting the ORM to write good queries
       | is so important.
       | 
       | I find query planning (and databases in general) to be very
       | difficult to reason about, basically magic. Does anyone have some
       | recommended reading or advice?
        
         | parshimers wrote:
         | https://pages.cs.wisc.edu/~dbbook/ is a great overview,
         | specifically chapters 13 and 14 on query optimization. it is
         | difficult to reason about though, and every compiler is
         | different. it takes time and enough examples to look at a query
         | and have an intuition for what the plan should look like, and
         | if there's something the compiler is not handling well.
        
         | jalk wrote:
         | It's a big help if you know how to retrieve and interpret
         | execution plans for the database you use.
        
           | hobs wrote:
           | Yes, I was going to say, seeing the generated SQL can be
           | almost useless depending on the execution plan.
           | 
           | When you have a solid view of the schema and data sizes you
           | can start to be more predictive about what your code will
           | actually do, THEN you can layer on the complexity of the ORM
           | hell code.
        
         | ethanseal wrote:
         | Highly recommend https://use-the-index-luke.com/
         | 
         | It's very readable - I always ask new hires and interns to read
         | it.
        
       | jalk wrote:
       | Can we agree that this is only applies to queries where all the
       | filter conditions use cols with indexes? If no indexes can be
       | used, a single full table scan with OR surely is faster than
       | multiple full table scans.
        
         | ethanseal wrote:
         | Absolutely. Though I don't recall seeing multiple sequential
         | scans without a self-join or subquery. A basic filter within a
         | sequential scan/loop is the most naive/simplest way of
         | performing queries like these, so postgres falls back to that.
         | Also, fwiw, BitmapOr is only used with indexes:
         | https://pganalyze.com/docs/explain/other-nodes/bitmap-or.
        
           | jalk wrote:
           | That was the extreme case - the multi-scan would be gotten if
           | a casual reader tried your (neat by all means) AND-only query
           | on a non-indexed table (or partially indexed for that
           | matter).
        
             | ethanseal wrote:
             | Gotcha, I misunderstood your comment. The multiple counts
             | is a definitely very contrived example to demonstrate the
             | overhead of BitmapOr and general risk of sequential scans.
        
       | cyanydeez wrote:
       | Yeah, just watch community. Every OR splits the universe, duh.
        
       | galkk wrote:
       | I strongly dislike the way the polroblem is presented and the
       | "solution" is promoted. Author mentions merge join with count of
       | top, and if th database supports index merges, it can be
       | extremely efficient in described scenario. There are a lot of
       | real optimizations that can be baked in such merges that author
       | chooses to ignore.
       | 
       | The generalized guidance without even mentioning database server
       | as a baseline, without showing plans and/or checking if the
       | database can be guided is just a bad teaching.
       | 
       | It looks like author discovered a trick and tries to hammer
       | everything into it by overgeneralizing.
        
         | ethanseal wrote:
         | For sure, there's definitely a lot of cool techniques (and I'm
         | not aware of all of them)! And the first example is very much
         | contrived to show a small example.
         | 
         | I'm not super familiar with the term index merge - this seems
         | to be the term for a BitmapOr/BitmapAnd?
         | 
         | Is there another optimization I'm missing?
         | 
         | The article links to my code for my timings here:
         | https://github.com/ethan-seal/ors_expensive
         | 
         | There is an optimization that went in the new release of
         | PostgreSQL I'm excited about that may affect this - I'm not
         | sure. See
         | https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit...
        
           | Sesse__ wrote:
           | > I'm not super familiar with the term index merge - this
           | seems to be the term for a BitmapOr/BitmapAnd?
           | 
           | Different databases will use similar terms for different
           | operations, but I would guess that the comment refers to
           | something similar to MySQL's index merge (which is
           | essentially reading the row IDs of all the relevant ranges,
           | then deduplicating them, then doing the final scan; it's
           | similar to but less flexible than Postgres' BitmapOr).
        
             | ethanseal wrote:
             | Cool. I'll have to read up on that.
        
       | to11mtm wrote:
       | If ORs are expensive you've probably got a poor table design or
       | the ORs are for some form of nasty query that is more
       | UI/Dashboard based.
       | 
       | A good table has the right indexes, and a good API to deal with
       | the table is using the RIGHT indexes in it's criteria to get a
       | good result.
        
       | Animats wrote:
       | The query optimizer knows how many items are in each index, but
       | has no advance idea how many items will be in the result of a
       | JOIN. An "a OR b" query on a table with millions of rows might
       | have three hits on A, or millions of hits. The optimal query
       | strategy for the two cases is very different.
       | 
       | Has anyone put machine learning in an SQL query optimizer yet?
        
         | throwaway81523 wrote:
         | Some traditional planners try some different plans on random
         | subsets of the data to see which plan works best. Don't need
         | machine learning, just Bayes' rule.
        
         | Sesse__ wrote:
         | > The query optimizer knows how many items are in each index,
         | but has no advance idea how many items will be in the result of
         | a JOIN.
         | 
         | Query optimizers definitely try to estimate cardinalities of
         | joins. It's a really, really hard problem, but the typical
         | estimate is _much_ better than "eh, no idea".
        
         | ethanseal wrote:
         | I think having a way to build statistics on the join itself
         | would be helpful for this. Similar to how extended statistics^1
         | can help when column distributions aren't independent of each
         | other.
         | 
         | But this may require some basic materialized views, which
         | postgres doesn't really have.
         | 
         | [1]: https://www.postgresql.org/docs/current/planner-
         | stats.html#P...
        
         | singron wrote:
         | There are many papers on ML for query planners. You can search
         | for "Learned Query Optimization". Some use ML just for the
         | cardinality estimation.
        
       | Glyptodon wrote:
       | If optimization is really as simple as applying De Morgan's laws,
       | surely it could be done within the query planner if that really
       | is the main optimization switch? Or am I misreading this somehow?
       | 
       | Edit: I guess the main difference is that it's just calculating
       | separate sets and then merging them, which isn't really
       | DeMorgan's, but a calculation approach.
        
         | Sesse__ wrote:
         | In most cases, the end result is a lot messier than it looks in
         | this minimal example. Consider a case where the query is a
         | 12-way join with a large IN list somewhere (which is
         | essentially an OR); would it still look as attractive to
         | duplicate that 12-way join a thousand times?
         | 
         | There _are_ optimizers that try to represent this kind of
         | AND/OR expression as sets of distinct ranges in order to at
         | least be able to consider whether a thousand different index
         | scans are worth it or not, with rather limited success (it
         | doesn't integrate all that well when you get more than one
         | table, and getting cardinalities for that many index scans can
         | be pretty expensive).
        
       | ape4 wrote:
       | The OR query certainly states user intent better then the
       | rewrite. So its better at communicating with a future programmer.
        
         | sgarland wrote:
         | So add a comment. "X is easier to understand" shouldn't mean
         | taking a 100x performance hit.
        
       | sgarland wrote:
       | As an aside, MySQL will optimize WHERE IN better than OR,
       | assuming that the predicates are all constants, and not JSON.
       | Specifically, it sorts the IN array and uses binary search.
       | 
       | That said, I'm not sure if it would have any impact on this
       | specific query; I'd need to test.
        
         | renhanxue wrote:
         | The article is specifically discussing cases where you have
         | predicates on different columns OR'ed together, like col_a =
         | 'foo' OR col_b = 'foo'.
        
       ___________________________________________________________________
       (page generated 2025-09-29 23:00 UTC)