[HN Gopher] Speeding up SQL queries by orders of magnitude using...
       ___________________________________________________________________
        
       Speeding up SQL queries by orders of magnitude using UNION
        
       Author : tate
       Score  : 134 points
       Date   : 2021-03-20 17:58 UTC (5 hours ago)
        
 (HTM) web link (www.foxhound.systems)
 (TXT) w3m dump (www.foxhound.systems)
        
       | CraigJPerry wrote:
       | Is there any way to exploit parallelism in the customer vs
       | employee queries then merge the results?
       | 
       | My intuition would have been to tackle this in 3 steps with a
       | temp table (create temp table with employee markouts, append
       | customer meals, select ... order by). I can see the overheads of
       | my solution but I'm wondering if i could beat the Union all
       | approach somehow. My temp table wouldn't benefit from an index so
       | I'm left wondering about parallelism.
       | 
       | If i remember correctly though, there's no concurrency, and
       | therefore no opportunity for parallelism for transactions in the
       | same session.
        
         | EdwardDiego wrote:
         | I'm pretty sure later versions of PG can indeed execute both in
         | parallel.
        
           | Izkata wrote:
           | Not sure about each side of the UNION, but postgres can do
           | parallel index scans for even simple criteria:
           | https://www.postgresql.org/docs/10/parallel-
           | plans.html#PARAL...
           | 
           | Also the article says these tests were done on postgres 12.6
        
       | Justsignedup wrote:
       | Just to note, the easiest heuristic I've been using to figure out
       | when optimization might be useful is when you have an OR clause
       | on more than one table. Column A or B is fine, but Table A column
       | A or Table B column A is gonna cause interesting problems.
       | 
       | Also:
       | 
       | union = distinct values
       | 
       | union all = no distinct
        
         | croes wrote:
         | Important difference between union and union all, especially if
         | you don't query a unique id column
        
       | twoodfin wrote:
       | I know of at least one commercial DBMS that will perform this
       | transformation automatically in some circumstances, and I'd be
       | surprised if most of the other major systems couldn't do the
       | same.
        
         | tpetry wrote:
         | All database systems could implement many more optimizations.
         | Heck, all these tricks to get the queries to complete faster
         | are only needed because the database engine did not find this
         | fast execution plan.
         | 
         | The problem is the more optimizations you implement the more
         | possibilities the query planner has to evalue which may result
         | in (1) a too long query planning phase or (2) the possibility
         | of de-optimizing a good query because an optimization was
         | falsely predicted to be faster
        
       | fifilura wrote:
       | Is what the article is essentially saying is that [1] is faster
       | than [2]?
       | 
       | I have used [1] many times for that reason although [2] is
       | probably more intuitive for what you want to do.
       | 
       | [1]                 SELECT vegetable_id, SUM(price) as price,
       | SUM(weight) as weight       FROM       (         SELECT
       | vegetable_id, price, NULL as weight         FROM prices
       | UNION ALL         SELECT vegetable_id, NULL as price, weight
       | FROM weights       )       GROUP BY vegetable_id
       | 
       | [2]                 SELECT vegetable_id, price, weight       FROM
       | prices          JOIN weights            ON price.vegetable_id =
       | weights.vegetable_id
        
         | baix777 wrote:
         | The relative performance of these two queries will vary by data
         | volume. Swap in a sales table for the weights table, and make
         | that a massive sales table at that, joining it to much smaller
         | prices can be much faster than a group by. Stated differently,
         | a join can be faster than a group by. This is even more true
         | when the small table can fit into memory and a hash join can be
         | used, and the data in the group by can't fit into memory.
        
           | fifilura wrote:
           | Yes I'd say that I would intuitively do that only if the
           | tables were both sufficiently large.
           | 
           | I assume somewhere there is a similar assumption in TFA.
        
         | sandfly wrote:
         | Do you lose the indices on the columns after the UNION?
        
         | croes wrote:
         | This queries have different results. [2] retrieves only the
         | vegetable_ids which are in both tables, [1] gives all ids which
         | are in prices or weights. If vegetable_id null exists in either
         | table [1] result in an extra row with id null, this doesn't
         | happen for query [2]
        
           | fifilura wrote:
           | If I replace JOIN with FULL OUTER JOIN, you'll get what you
           | describe. It was just an quick example, but you are right.
           | 
           | There are also things to say about what happens if either
           | table has duplicate vegetable_id:s. At some point it is
           | assumed that you have processed it in such a way that the
           | table is well formed for the purpose.
        
       | rrrrrrrrrrrryan wrote:
       | Keep your matching logic simple.
       | 
       | With SQL, matching is (by far) the most computationally expensive
       | bit of most queries. Your goal should be to use only sargable[1]
       | operators, _and eliminate any functions_ , in all your WHERE
       | clauses and JOINs.
       | 
       | Complicated matching criteria will force the engine to make tough
       | trade-off decisions, which can be affected by out-of-date table
       | statistics, lousy indexes, and a million other things, and will
       | often force the engine to rule-out more optimized algorithms and
       | fall back to the cruder, primitive algorithms.
       | 
       | [1] For most queries, "OR" is not usually a sargable operator:
       | https://en.wikipedia.org/wiki/Sargable
        
       | darig wrote:
       | It would be even faster if you switched to a series of queries
       | without any JOINs at all, and then stitched them together with
       | code other than SQL after the queries have ran.
       | 
       | Getting below 15 ms would be trivial.
        
       | tech2 wrote:
       | Perhaps I'm missing something, but why is query 2's meal_items a
       | left join when the where clause then forces it to act as if it
       | were an inner join? Would changing that have any impact?
        
         | Albert_Camus wrote:
         | Author here, you are indeed correct that Query #2's final join
         | can be an INNER join. However, I just tested it against our
         | test data set and it makes no impact on the performance.
        
       | toast0 wrote:
       | UNION is also helpful for reducing round trips between the db
       | client and db server.
       | 
       | I've run into lots of cases where SQL join is clear to write, but
       | slow, but a simple query to get ids, plus a union of fetching
       | data for each id is fast. Again, a union of single id fetches is
       | often faster than a single query with in (x, y, z). We can wish
       | the sql engine could figure it out, but my experience from a few
       | years ago is that they usually don't.
        
         | onlyrealcuzzo wrote:
         | Does anyone with enough db experience know why the engine can't
         | figure it out? Is it easy to explain?
         | 
         | I'm not a db expert, but this seems like something that should
         | be solvable.
        
           | isoprophlex wrote:
           | Yeah agreed, at the very least the thing should be able to
           | conceive to switch to this... lets call it "WHERE unrolling"
           | internally when asked for " WHERE id IN (x, y, z)"
           | 
           | I'd be delighted to learn what's going on
        
             | toast0 wrote:
             | Where unrolling is an excellent term, thanks.
             | 
             | The best I can understand is the engines are (or were) just
             | not setup so they can do repeated single key lookups, which
             | is really the right strategy for a where in. As a result,
             | they do some kind of scan, which looks at a lot more data.
        
               | tpetry wrote:
               | The problem is these repeated single key lookups are
               | random io for the database engine. So the database engine
               | has to predict a threshold for when a scan or random io
               | is cheaper which is very hard to get right, and your io
               | layer changes this threshold drastically. A spinning disk
               | may be faster all the time with sequential io, and for
               | flash based systems theres a wide variety of performance
               | profiles.
               | 
               | To tackle this problem postgresql has a setting where you
               | can tune the cost factor for random io.
        
             | btilly wrote:
             | That one is annoying. IN tends to be implemented as a loop.
             | I've spend many queries up by putting an IN into a
             | temporary table, then joining it.
             | 
             | I have actually seen it be a win to combine it like this:
             | SELECT ...         FROM foo           JOIN (
             | select 'a' as value               union all
             | select 'b' as value               ...           ) bar
        
           | btilly wrote:
           | _I 'm not a db expert, but this seems like something that
           | should be solvable._
           | 
           | It is very highly database specific. MySQL tends to be
           | terrible. PostgreSQL tends to be quite good. The reasons why
           | it did not figure it out in particular cases tend to be
           | complicated.
           | 
           | Except for MySQL, in which case the problem is simple. There
           | is a tradeoff between time spent optimizing a query and time
           | spent running it. Other databases assume that you prepare a
           | query and then use it many times so can spend time on
           | optimizations. MySQL has an application base of people who
           | prepare and use queries just once. Which means that they use
           | simple heuristics and can't afford the overhead of a complex
           | search for a better plan.
        
           | bawolff wrote:
           | I imagine its bad query optimizer doing the wrong thing.
           | Maybe a different db engine would do better.
        
       | croes wrote:
       | No need to wrap the union queries in a new select to order the
       | result. An order by after the last select in the unions orders
       | the whole result set as long as the column names match in each
       | part of the union.
        
         | rrrrrrrrrrrryan wrote:
         | This depends on the database engine. The syntax will error out
         | in some (perhaps most).
        
       | balfirevic wrote:
       | I'm now curious (but too lazy to setup and test it) how the
       | following query would fare:                 SELECT meal_items.*,
       | employee_markouts.employee_id, customer_orders.customer_id
       | FROM meal_items       LEFT JOIN employee_markouts ON
       | employee_markouts.meal_item_id = meal_items.id       LEFT JOIN
       | employees ON employees.id = employee_markouts.employee_id
       | LEFT JOIN customer_order_items ON
       | customer_order_items.meal_item_id = meal_items.id       LEFT JOIN
       | customer_orders ON customer_orders.id =
       | customer_order_items.customer_order_id       LEFT JOIN customers
       | ON customers.id = customer_orders.customer_id       WHERE
       | (employees.store_id = 250 AND
       | employee_markouts.created >= '2021-02-03' AND
       | employee_markouts.created < '2021-02-04') OR
       | (customers.store_id = 250 AND
       | customer_orders.created >= '2021-02-03' AND
       | customer_orders.created < '2021-02-04')
        
       | clcaev wrote:
       | The author doesn't describe why they thought a cross product of
       | two large tables, using a group by to remove duplicates, would be
       | a good basis for this query.
        
       ___________________________________________________________________
       (page generated 2021-03-20 23:00 UTC)