[HN Gopher] Speeding up SQL queries by orders of magnitude using...
___________________________________________________________________
Speeding up SQL queries by orders of magnitude using UNION
Author : tate
Score : 134 points
Date : 2021-03-20 17:58 UTC (5 hours ago)
(HTM) web link (www.foxhound.systems)
(TXT) w3m dump (www.foxhound.systems)
| CraigJPerry wrote:
| Is there any way to exploit parallelism in the customer vs
| employee queries then merge the results?
|
| My intuition would have been to tackle this in 3 steps with a
| temp table (create temp table with employee markouts, append
| customer meals, select ... order by). I can see the overheads of
| my solution but I'm wondering if i could beat the Union all
| approach somehow. My temp table wouldn't benefit from an index so
| I'm left wondering about parallelism.
|
| If i remember correctly though, there's no concurrency, and
| therefore no opportunity for parallelism for transactions in the
| same session.
| EdwardDiego wrote:
| I'm pretty sure later versions of PG can indeed execute both in
| parallel.
| Izkata wrote:
| Not sure about each side of the UNION, but postgres can do
| parallel index scans for even simple criteria:
| https://www.postgresql.org/docs/10/parallel-
| plans.html#PARAL...
|
| Also the article says these tests were done on postgres 12.6
| Justsignedup wrote:
| Just to note, the easiest heuristic I've been using to figure out
| when optimization might be useful is when you have an OR clause
| on more than one table. Column A or B is fine, but Table A column
| A or Table B column A is gonna cause interesting problems.
|
| Also:
|
| union = distinct values
|
| union all = no distinct
| croes wrote:
| Important difference between union and union all, especially if
| you don't query a unique id column
| twoodfin wrote:
| I know of at least one commercial DBMS that will perform this
| transformation automatically in some circumstances, and I'd be
| surprised if most of the other major systems couldn't do the
| same.
| tpetry wrote:
| All database systems could implement many more optimizations.
| Heck, all these tricks to get the queries to complete faster
| are only needed because the database engine did not find this
| fast execution plan.
|
| The problem is the more optimizations you implement the more
| possibilities the query planner has to evalue which may result
| in (1) a too long query planning phase or (2) the possibility
| of de-optimizing a good query because an optimization was
| falsely predicted to be faster
| fifilura wrote:
| Is what the article is essentially saying is that [1] is faster
| than [2]?
|
| I have used [1] many times for that reason although [2] is
| probably more intuitive for what you want to do.
|
| [1] SELECT vegetable_id, SUM(price) as price,
| SUM(weight) as weight FROM ( SELECT
| vegetable_id, price, NULL as weight FROM prices
| UNION ALL SELECT vegetable_id, NULL as price, weight
| FROM weights ) GROUP BY vegetable_id
|
| [2] SELECT vegetable_id, price, weight FROM
| prices JOIN weights ON price.vegetable_id =
| weights.vegetable_id
| baix777 wrote:
| The relative performance of these two queries will vary by data
| volume. Swap in a sales table for the weights table, and make
| that a massive sales table at that, joining it to much smaller
| prices can be much faster than a group by. Stated differently,
| a join can be faster than a group by. This is even more true
| when the small table can fit into memory and a hash join can be
| used, and the data in the group by can't fit into memory.
| fifilura wrote:
| Yes I'd say that I would intuitively do that only if the
| tables were both sufficiently large.
|
| I assume somewhere there is a similar assumption in TFA.
| sandfly wrote:
| Do you lose the indices on the columns after the UNION?
| croes wrote:
| This queries have different results. [2] retrieves only the
| vegetable_ids which are in both tables, [1] gives all ids which
| are in prices or weights. If vegetable_id null exists in either
| table [1] result in an extra row with id null, this doesn't
| happen for query [2]
| fifilura wrote:
| If I replace JOIN with FULL OUTER JOIN, you'll get what you
| describe. It was just an quick example, but you are right.
|
| There are also things to say about what happens if either
| table has duplicate vegetable_id:s. At some point it is
| assumed that you have processed it in such a way that the
| table is well formed for the purpose.
| rrrrrrrrrrrryan wrote:
| Keep your matching logic simple.
|
| With SQL, matching is (by far) the most computationally expensive
| bit of most queries. Your goal should be to use only sargable[1]
| operators, _and eliminate any functions_ , in all your WHERE
| clauses and JOINs.
|
| Complicated matching criteria will force the engine to make tough
| trade-off decisions, which can be affected by out-of-date table
| statistics, lousy indexes, and a million other things, and will
| often force the engine to rule-out more optimized algorithms and
| fall back to the cruder, primitive algorithms.
|
| [1] For most queries, "OR" is not usually a sargable operator:
| https://en.wikipedia.org/wiki/Sargable
| darig wrote:
| It would be even faster if you switched to a series of queries
| without any JOINs at all, and then stitched them together with
| code other than SQL after the queries have ran.
|
| Getting below 15 ms would be trivial.
| tech2 wrote:
| Perhaps I'm missing something, but why is query 2's meal_items a
| left join when the where clause then forces it to act as if it
| were an inner join? Would changing that have any impact?
| Albert_Camus wrote:
| Author here, you are indeed correct that Query #2's final join
| can be an INNER join. However, I just tested it against our
| test data set and it makes no impact on the performance.
| toast0 wrote:
| UNION is also helpful for reducing round trips between the db
| client and db server.
|
| I've run into lots of cases where SQL join is clear to write, but
| slow, but a simple query to get ids, plus a union of fetching
| data for each id is fast. Again, a union of single id fetches is
| often faster than a single query with in (x, y, z). We can wish
| the sql engine could figure it out, but my experience from a few
| years ago is that they usually don't.
| onlyrealcuzzo wrote:
| Does anyone with enough db experience know why the engine can't
| figure it out? Is it easy to explain?
|
| I'm not a db expert, but this seems like something that should
| be solvable.
| isoprophlex wrote:
| Yeah agreed, at the very least the thing should be able to
| conceive to switch to this... lets call it "WHERE unrolling"
| internally when asked for " WHERE id IN (x, y, z)"
|
| I'd be delighted to learn what's going on
| toast0 wrote:
| Where unrolling is an excellent term, thanks.
|
| The best I can understand is the engines are (or were) just
| not setup so they can do repeated single key lookups, which
| is really the right strategy for a where in. As a result,
| they do some kind of scan, which looks at a lot more data.
| tpetry wrote:
| The problem is these repeated single key lookups are
| random io for the database engine. So the database engine
| has to predict a threshold for when a scan or random io
| is cheaper which is very hard to get right, and your io
| layer changes this threshold drastically. A spinning disk
| may be faster all the time with sequential io, and for
| flash based systems theres a wide variety of performance
| profiles.
|
| To tackle this problem postgresql has a setting where you
| can tune the cost factor for random io.
| btilly wrote:
| That one is annoying. IN tends to be implemented as a loop.
| I've spend many queries up by putting an IN into a
| temporary table, then joining it.
|
| I have actually seen it be a win to combine it like this:
| SELECT ... FROM foo JOIN (
| select 'a' as value union all
| select 'b' as value ... ) bar
| btilly wrote:
| _I 'm not a db expert, but this seems like something that
| should be solvable._
|
| It is very highly database specific. MySQL tends to be
| terrible. PostgreSQL tends to be quite good. The reasons why
| it did not figure it out in particular cases tend to be
| complicated.
|
| Except for MySQL, in which case the problem is simple. There
| is a tradeoff between time spent optimizing a query and time
| spent running it. Other databases assume that you prepare a
| query and then use it many times so can spend time on
| optimizations. MySQL has an application base of people who
| prepare and use queries just once. Which means that they use
| simple heuristics and can't afford the overhead of a complex
| search for a better plan.
| bawolff wrote:
| I imagine its bad query optimizer doing the wrong thing.
| Maybe a different db engine would do better.
| croes wrote:
| No need to wrap the union queries in a new select to order the
| result. An order by after the last select in the unions orders
| the whole result set as long as the column names match in each
| part of the union.
| rrrrrrrrrrrryan wrote:
| This depends on the database engine. The syntax will error out
| in some (perhaps most).
| balfirevic wrote:
| I'm now curious (but too lazy to setup and test it) how the
| following query would fare: SELECT meal_items.*,
| employee_markouts.employee_id, customer_orders.customer_id
| FROM meal_items LEFT JOIN employee_markouts ON
| employee_markouts.meal_item_id = meal_items.id LEFT JOIN
| employees ON employees.id = employee_markouts.employee_id
| LEFT JOIN customer_order_items ON
| customer_order_items.meal_item_id = meal_items.id LEFT JOIN
| customer_orders ON customer_orders.id =
| customer_order_items.customer_order_id LEFT JOIN customers
| ON customers.id = customer_orders.customer_id WHERE
| (employees.store_id = 250 AND
| employee_markouts.created >= '2021-02-03' AND
| employee_markouts.created < '2021-02-04') OR
| (customers.store_id = 250 AND
| customer_orders.created >= '2021-02-03' AND
| customer_orders.created < '2021-02-04')
| clcaev wrote:
| The author doesn't describe why they thought a cross product of
| two large tables, using a group by to remove duplicates, would be
| a good basis for this query.
___________________________________________________________________
(page generated 2021-03-20 23:00 UTC)