[HN Gopher] Speeding up SQL queries by orders of magnitude using...
___________________________________________________________________
Speeding up SQL queries by orders of magnitude using UNION
Author : tate
Score : 373 points
Date : 2021-03-20 17:58 UTC (1 day ago)
(HTM) web link (www.foxhound.systems)
(TXT) w3m dump (www.foxhound.systems)
| magicalhippo wrote:
  | At work I ended up with something similar. A reporting query
  | has about 20 sum()'ed columns, and I found the easiest and by
  | far fastest approach was just to have 20+ queries glued
  | together by UNION ALL, and aggregate the sums over that.
|
  | It's heavy in its raw form, but if you add some search terms to
  | narrow things down, the DB really cuts through it.
|
  | It's about 320 lines of SQL, so to make debugging a bit easier
  | I added a dummy "identifier" column to each of the union'ed
  | sub-queries: the first has value 100, the second 200, etc., so
  | I can quickly identify which of the sub-queries produced a
  | given row. Counting in 100s rather than 1s allows easy
  | insertion of additional sub-queries near the relevant
  | "identifiers".
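  | A minimal runnable sketch of the identifier-column trick, with
  | Python's sqlite3 standing in for the real database (table and
  | column names here are invented, not the commenter's):

```python
import sqlite3

# Toy stand-in for the commenter's 320-line report: two UNION ALL'ed
# sub-queries, each tagged with a dummy "identifier" column so any
# result row can be traced back to the sub-query that produced it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store_sales (amount REAL);
    CREATE TABLE online_sales (amount REAL);
    INSERT INTO store_sales VALUES (10.0), (20.0);
    INSERT INTO online_sales VALUES (5.0);
""")

# Counting identifiers in 100s (100, 200, ...) instead of 1s leaves
# room to slot new sub-queries in between without renumbering.
rows = conn.execute("""
    SELECT 100 AS identifier, amount FROM store_sales
    UNION ALL
    SELECT 200 AS identifier, amount FROM online_sales
""").fetchall()

print(sorted(rows))
```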
| CraigJPerry wrote:
| Is there any way to exploit parallelism in the customer vs
| employee queries then merge the results?
|
| My intuition would have been to tackle this in 3 steps with a
| temp table (create temp table with employee markouts, append
| customer meals, select ... order by). I can see the overheads of
| my solution but I'm wondering if i could beat the Union all
| approach somehow. My temp table wouldn't benefit from an index so
| I'm left wondering about parallelism.
|
| If i remember correctly though, there's no concurrency, and
| therefore no opportunity for parallelism for transactions in the
| same session.
| punnerud wrote:
| Temp table as in using WITH?
| Albert_Camus wrote:
      | Author of the original article here. Temporary tables are
      | different from using WITH (which creates common table
      | expressions, or CTEs). In many database engines, you can
      | make a temporary table that will persist for a single
      | session. The syntax is the same as regular table creation;
      | it just starts with CREATE TEMPORARY TABLE ....
|
| More info in the PostgreSQL docs [1]:
|
| > If specified, the table is created as a temporary table.
| Temporary tables are automatically dropped at the end of a
| session, or optionally at the end of the current transaction
| (see ON COMMIT below). Existing permanent tables with the
| same name are not visible to the current session while the
| temporary table exists, unless they are referenced with
| schema-qualified names. Any indexes created on a temporary
| table are automatically temporary as well.
|
| [1]: https://www.postgresql.org/docs/13/sql-createtable.html
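      | A minimal sketch of the idea, using Python's sqlite3 (the
      | table name is invented; the TEMPORARY keyword works the same
      | way in most engines):

```python
import sqlite3

# A temporary table is created with ordinary CREATE TABLE syntax plus
# the TEMPORARY keyword; it is visible only to this connection
# (session) and dropped automatically when the connection closes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TEMPORARY TABLE markouts (meal_item_id INTEGER)")
conn.execute("INSERT INTO markouts VALUES (1), (2)")

count = conn.execute("SELECT COUNT(*) FROM markouts").fetchone()[0]
print(count)  # 2
```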
| ewalk153 wrote:
| If this is a database for reporting, using a temporary
| table is probably fine and a union all wouldn't concern me.
|
        | On MySQL, a UNION ALL creates a temp table, which can
        | perform catastrophically under database load. I've seen a
        | UNION ALL query with zero rows in the second half render a
        | database server unresponsive when the database was under
        | high load, causing a service disruption. We ended up
        | rewriting the UNION ALL query as two database fetches and
        | have not seen a single problem in that area since.
|
| I was shocked by this union all behavior, but it is
| apparently a well known thing on MySQL.
|
| I can't speak to Postgres behavior for this kind of query.
| Albert_Camus wrote:
| In the past we worked on a system that used MySQL 8. We
| used UNION (not UNION ALL, but I assume it doesn't
| matter) in several places, applying it to improve
| performance as we described in the article. There were
| definitely cases in the system where one side of the
| UNION would return zero rows, but we never ran into any
| of the types of issues you're describing.
| jeffbee wrote:
| Yeah, you _really_ have to watch out for this. I once
| spent months chasing down a serious but rare performance
| problem in a large-scale mysql 5.6 deployment, which was
| eventually root-caused to kernel slab reclaim pressure,
| caused by very high XFS metadata mutation rate, caused by
| MySQL creating an on-disk temporary ISAM file for every
| query with a union, which was most of the traffic.
| AdrianB1 wrote:
| MS SQL will do most of the work in parallel, but you can
| control if that is what you want.
| EdwardDiego wrote:
| I'm pretty sure later versions of PG can indeed execute both in
| parallel.
| Izkata wrote:
    | Not sure about each side of the UNION, but postgres can do
    | parallel index scans for even simple criteria:
    | https://www.postgresql.org/docs/10/parallel-plans.html#PARAL...
|
| Also the article says these tests were done on postgres 12.6
| kroolik wrote:
| The author didn't mention the reason the query is slow.
|
  | I'd challenge the idea that the non-optimal query is something
  | a person with advanced SQL skills would write. A seasoned
  | engineer knows that by joining two tables you are looking at,
  | worst case, NxM comparisons.
|
| The problem is that you are joining the customers table with
| THREE other tables - the merged result of stores, employees and
| markouts. No matter how hard you try, you can't escape the fact
| that you are joining way more rows than just stores and
| customers.
|
  | I would call it a layman's issue that SQL beginners learn the
  | hard way. Advanced SQL users know the very basics of the db
  | they are working with.
| ncallaway wrote:
| > I would call it a layman issue SQL beginners learn the hard
| way. Advanced SQL users know the very basics of the db they are
| working with.
|
| So?
|
| Does every blog post have to be for expert users of the system?
| What's wrong with a blog post that explains mistakes that a
| beginner might make, and a method for mitigating it.
| kroolik wrote:
| Completely nothing is wrong with it! That's not what I'm
| arguing with. I'm challenging the statement in the Conclusion
| section:
|
| > Arriving at query #2 to get the combined results was the
| intuitive way of thinking through the problem, and something
| that someone with intermediate or advanced SQL skills could
| come up with.
|
      | The issue in question is simple. It's a very good lesson
      | for people early in their SQL journey. It would be a superb
      | read if the author had dug into the details of why it's
      | slow.
      |
      | But advanced engineers should know that LEFT JOINing 6+
      | tables will be huge. I disagree that it's a mistake an
      | experienced person would make.
| ncallaway wrote:
| Ah, fair play. I must've missed that part of the article,
| or just didn't have it top of mind when reading your
| comment.
| jiggawatts wrote:
| I don't even understand the thought process that would have led
| to the non-optimal query. It would simply never occur to me to
| use anything other than UNION ALL for this kind of scenario,
| because it is the only appropriate translation of the english-
| language business request into a SQL query.
| dwd wrote:
| After working 20+ years in the industry I'm always surprised
| by the aversion (or lack of thought) to draw up a Venn
| diagram on a white board to visualise the relation, let alone
| then write it out using set notation.
| avereveard wrote:
| Probably the result of some early attempt at either sorting
| or removing duplicates
| robertlagrant wrote:
| It could be that the query was gradually modified. It's
| pretty simple to spot the union when the problem's being
| focused on, laid out nicely in a blog post, and the answer
| already given!
| gigatexal wrote:
| Would be cool to post the sqldump of the data somewhere so we
| could play with the queries and the database with the same data.
| Justsignedup wrote:
| Just to note, the easiest heuristic I've been using to figure out
| when optimization might be useful is when you have an OR clause
| on more than one table. Column A or B is fine, but Table A column
| A or Table B column A is gonna cause interesting problems.
|
| Also:
|
| union = distinct values
|
| union all = no distinct
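  | A quick runnable illustration of that difference, using
  | Python's sqlite3 with invented toy tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (v INTEGER);
    CREATE TABLE b (v INTEGER);
    INSERT INTO a VALUES (1), (2);
    INSERT INTO b VALUES (2), (3);
""")

# UNION deduplicates the combined result (here 2 appears once);
# UNION ALL keeps every row and skips the dedup sort/hash entirely.
union_rows = conn.execute(
    "SELECT v FROM a UNION SELECT v FROM b").fetchall()
union_all_rows = conn.execute(
    "SELECT v FROM a UNION ALL SELECT v FROM b").fetchall()

print(len(union_rows), len(union_all_rows))  # 3 4
```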
| croes wrote:
| Important difference between union and union all, especially if
| you don't query a unique id column
| [deleted]
| 02020202 wrote:
| i use projections so simple query fuck the what.
| zeroimpl wrote:
| I suspect query #2 would have performed better if they used a
| COALESCE instead of an OR condition. One of the most important
| optimization lessons you can learn is not to use OR for any join
| conditions.
|
  | E.g. instead of this:
  |     ON (customer_order_items.meal_item_id = meal_items.id
  |         OR employee_markouts.meal_item_id = meal_items.id)
  |
  | do this:
  |     ON meal_items.id = COALESCE(employee_markouts.meal_item_id,
  |                                 customer_order_items.meal_item_id)
|
  | The latter can be optimized properly. It's probably still
  | slower than the UNION, but more in line with expected
  | performance. It might also be more useful in certain cases,
  | such as when building a view.
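  | A toy runnable sketch of the rewrite, using Python's sqlite3
  | and a single stand-in table in place of the article's
  | multi-join (the events table and its data are invented for
  | illustration):

```python
import sqlite3

# Each event row has exactly one of the two id columns set, mimicking
# the article's LEFT JOIN result where one side is always NULL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE meal_items (id INTEGER, name TEXT);
    CREATE TABLE events (customer_item_id INTEGER,
                         employee_item_id INTEGER);
    INSERT INTO meal_items VALUES (1, 'burger'), (2, 'salad');
    INSERT INTO events VALUES (1, NULL), (NULL, 2);
""")

# OR'd join condition, as in the slow form.
or_rows = conn.execute("""
    SELECT m.name FROM events e
    JOIN meal_items m
      ON e.customer_item_id = m.id OR e.employee_item_id = m.id
""").fetchall()

# COALESCE rewrite: a single equality the planner can work with.
coalesce_rows = conn.execute("""
    SELECT m.name FROM events e
    JOIN meal_items m
      ON m.id = COALESCE(e.employee_item_id, e.customer_item_id)
""").fetchall()

print(sorted(or_rows) == sorted(coalesce_rows))  # True
```

  | Both forms return the same rows here; the point is that the
  | single-equality version gives the planner an indexable key.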
| twoodfin wrote:
| I know of at least one commercial DBMS that will perform this
| transformation automatically in some circumstances, and I'd be
| surprised if most of the other major systems couldn't do the
| same.
| tpetry wrote:
| All database systems could implement many more optimizations.
| Heck, all these tricks to get the queries to complete faster
| are only needed because the database engine did not find this
| fast execution plan.
|
      | The problem is that the more optimizations you implement,
      | the more possibilities the query planner has to evaluate,
      | which may result in (1) a too-long query planning phase or
      | (2) the possibility of de-optimizing a good query because
      | an optimization was falsely predicted to be faster.
| posedge wrote:
    | Clever optimization engines are IMO no excuse for blindly
    | throwing queries at the database without knowing what's going
    | on. Every optimizer has corner cases in which it fails, and
    | I'd argue that as a good developer you should know when they
    | can happen.
|
| Additionally, in this example, the rewritten query is easier to
| read and maintain.
| danvk wrote:
| Why is it so important to express this as one SQL query? Why not
| just run the two simple queries? It feels as though two different
| types of things are being joined.
| Daynil wrote:
| Depends on your goals. If you're a software engineer, it won't
| make a difference, just send two queries to the DB and combine
| the results before sending them off. If you're an analyst,
| you'll often get repeated ad-hoc requests. If you have to run 2
| queries, then stitch them together, each time you are asked for
| a report, you're wasting tons of (really boring) time. Even if
| you ultimately create some code to run and stitch the results
| together for you, it's just faster and easier to write 1 query
| wherever possible. Just let the database do the work!
| Albert_Camus wrote:
| > If you're a software engineer, it won't make a difference,
| just send two queries to the DB and combine the results
| before sending them off.
|
      | This isn't true if you have certain use cases in your
      | application, pagination being one of them. There's no
      | simple way to implement pagination when you have two or
      | more queries that return an unknown number of results and
      | you're tasked with maintaining a consistent order across
      | page changes.
|
| If you make a single query do all the work, it's easy to
| implement paging in any number of ways, the most performant
| being to filter by the last id the client saw. If your users
| aren't likely to paginate too far into the data, LIMIT and
| OFFSET will work fine as well.
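      | A minimal sketch of the keyset ("filter by the last id the
      | client saw") approach, using Python's sqlite3 with an
      | invented items table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(i, f"item-{i}") for i in range(1, 8)])

def page(last_seen_id, page_size=3):
    # Seek directly past the last id the client saw, instead of
    # using OFFSET (which scans and discards the skipped rows).
    return conn.execute(
        "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (last_seen_id, page_size)).fetchall()

first = page(0)
second = page(first[-1][0])
print([r[0] for r in first], [r[0] for r in second])
```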
| drstewart wrote:
| You're falling into the trap of thinking like the developer
| looking at the implementation, not the end user. The goal is:
|
| >In order to audit inventory, the logistics team at the
| corporate headquarters requests a tool that can generate a
| report containing all meal_items that left a given store's
| inventory on a particular day
|
| That's one thing. The end user doesn't see these as two
| separate requests just because that's how it's modeled in the
| database.
| robertlagrant wrote:
| The end user here being the person writing the union SQL
| query?
| fifilura wrote:
  | Is the article essentially saying that [1] is faster than [2]?
|
| I have used [1] many times for that reason although [2] is
| probably more intuitive for what you want to do.
|
  | [1]
  |     SELECT vegetable_id, SUM(price) AS price, SUM(weight) AS weight
  |     FROM (
  |         SELECT vegetable_id, price, NULL AS weight FROM prices
  |         UNION ALL
  |         SELECT vegetable_id, NULL AS price, weight FROM weights
  |     )
  |     GROUP BY vegetable_id
  |
  | [2]
  |     SELECT vegetable_id, price, weight
  |     FROM prices
  |     JOIN weights ON prices.vegetable_id = weights.vegetable_id
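  | Query [1], made runnable with Python's sqlite3 and toy data
  | (invented values):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prices  (vegetable_id INTEGER, price REAL);
    CREATE TABLE weights (vegetable_id INTEGER, weight REAL);
    INSERT INTO prices  VALUES (1, 2.5), (2, 1.0);
    INSERT INTO weights VALUES (1, 10.0);
""")

# Each side contributes NULL for the column it doesn't have, and
# SUM() ignores NULLs, so the outer GROUP BY stitches prices and
# weights together without a join. Note vegetable 2 still appears,
# with a NULL weight, which a plain inner JOIN would drop.
rows = conn.execute("""
    SELECT vegetable_id, SUM(price) AS price, SUM(weight) AS weight
    FROM (
        SELECT vegetable_id, price, NULL AS weight FROM prices
        UNION ALL
        SELECT vegetable_id, NULL AS price, weight FROM weights
    )
    GROUP BY vegetable_id
""").fetchall()

print(sorted(rows))
```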
| baix777 wrote:
| The relative performance of these two queries will vary by data
| volume. Swap in a sales table for the weights table, and make
| that a massive sales table at that, joining it to much smaller
| prices can be much faster than a group by. Stated differently,
| a join can be faster than a group by. This is even more true
| when the small table can fit into memory and a hash join can be
| used, and the data in the group by can't fit into memory.
| fifilura wrote:
| Yes I'd say that I would intuitively do that only if the
| tables were both sufficiently large.
|
| I assume somewhere there is a similar assumption in TFA.
| sandfly wrote:
| Do you lose the indices on the columns after the UNION?
| ComputerGuru wrote:
| Definitely, but that is more than made up for by the much
| smaller size.
| croes wrote:
| This queries have different results. [2] retrieves only the
| vegetable_ids which are in both tables, [1] gives all ids which
| are in prices or weights. If vegetable_id null exists in either
| table [1] result in an extra row with id null, this doesn't
| happen for query [2]
| fifilura wrote:
      | If I replace JOIN with FULL OUTER JOIN, you get what you
      | describe. It was just a quick example, but you are right.
|
| There are also things to say about what happens if either
| table has duplicate vegetable_id:s. At some point it is
| assumed that you have processed it in such a way that the
| table is well formed for the purpose.
| blunte wrote:
| Not speaking negatively about the author of this post, but union
| should have been the obvious tool to use in a scenario like this.
|
| SQL is very powerful and really not so complicated. Schema,
| index, and constraint design can be quite challenging, depending
| on the needs, but that's true whether using SQL or an ORM.
|
  | The problem I see often is that people use ORMs and never
  | learn SQL. Also, one often has to really work to understand
  | how their ORM handles certain situations to avoid problems
  | like N+1 queries or cartesian products.
|
| In the end, it seems that rather than learning SQL, the developer
| ends up learning just as much or more about a specific ORM to get
| the same quality/performing results they would have gotten by
| using direct SQL.
| jjtheblunt wrote:
| I wonder if i'm the only one who sees the phrase "order(s) of
| magnitude" and expects a vague story.
|
| I don't mean to denigrate this article, just wonder if others
| have the same reaction.
| noisy_boy wrote:
| Everyone who the store is selling to is a customer; an employee
| is a customer too except the price might be zero. What if a field
  | is added to the customers table: employee_id (null by default,
  | as most customers would not be employees). Then the query can
  | be run without any union, bypassing
  | stores/employees/employee_markouts etc.:
  |
  |     SELECT meal_items.*, customer_id, customers.employee_id
  |     FROM customers
  |     INNER JOIN customer_orders
  |         ON customers.id = customer_orders.customer_id
  |     INNER JOIN customer_order_items
  |         ON customer_orders.id = customer_order_items.customer_order_id
  |     INNER JOIN meal_items
  |         ON customer_order_items.meal_item_id = meal_items.id
|
| Of course this means that customer_id will never be null but that
| kind of is the point.
|
| This can also be approached by adding a customer_id field to the
| employees table instead (in this example the no. of employees and
| customers are both 20000 but typically the no. of employees is
| less than the no. of customers so adding customer_id to employees
| table would be more typical - in which case the query would
| obviously change).
| kroolik wrote:
    | I think this is called denormalization, right? Instead of
    | modifying the domain for this specific use-case, you could
    | create a read model (if you don't want to pollute your
    | domain). Sometimes it's very easy to write data, but reading
    | is a pain. For such cases, you can prepare read models that
    | contain all the data necessary.
| [deleted]
| noisy_boy wrote:
| This is not specifically denormalization because I'm not
| combining/flattening two models - instead, I'm moving the
| special treatment for employees from a totally separate path
| to a property of the existing customer model without
      | removing/merging the employee model. Regarding read
      | optimizations, there are several ways, some of which are
      | vendor-specific; e.g. PostgreSQL/Oracle allow materialized
      | views, which can make accessing this data simple (though
      | they have their own issues).
| rrrrrrrrrrrryan wrote:
| Keep your matching logic simple.
|
| With SQL, matching is (by far) the most computationally expensive
| bit of most queries. Your goal should be to use only sargable[1]
| operators, _and eliminate any functions_ , in all your WHERE
| clauses and JOINs.
|
| Complicated matching criteria will force the engine to make tough
| trade-off decisions, which can be affected by out-of-date table
| statistics, lousy indexes, and a million other things, and will
| often force the engine to rule-out more optimized algorithms and
| fall back to the cruder, primitive algorithms.
|
| [1] For most queries, "OR" is not usually a sargable operator:
| https://en.wikipedia.org/wiki/Sargable
| setr wrote:
| Shouldn't OR operations in an otherwise sargable[0] query just
| constitute merging two sargable queries? I don't see why OR
| should affect the property
|
| [0] new term for me, and wow do I not like it
| kroolik wrote:
      | This depends on the RDBMS engine. Even if it might make
      | sense, the planner could go with nested loops because of
      | statistics and cost estimation.
|
      | In the article, the devil is in how the second query
      | explodes in size. We are suddenly cross-joining employees
      | with customers. The left join of stores, employees, and
      | markouts gives you more than 2 rows: it gives you all the
      | employees assigned to the store, even though only 2 have
      | markouts.
      |
      | Next, you join the full customers table with it, and keep
      | multiplying the size with further left joins. The result
      | size might quickly exceed the memory limits for efficient
      | joins, and the RDBMS might have to resort to poor nested
      | loops.
| dragonwriter wrote:
| > Your goal should be to use only sargable[1] operators, and
| eliminate any functions, in all your WHERE clauses and JOINs.
|
| Or just use indexes properly; it's fairly common for modern
| databases to support indexes on functions of column values;
| using an indexed (deterministic) function in WHERE/JOIN
| criteria isn't problematic. The Wikipedia article you link
| addresses valid concerns, but the specifics it suggests don't
| seem accurate for the features of modern RDBMSs.
| jbverschoor wrote:
| Also common for ancient RDBMSes
| haimez wrote:
      | Really seems like you don't know what sargable means (and
      | didn't bother to find out before replying).
| epr wrote:
        | No, it's pretty clear from reading his entire response
        | that he understands exactly what sargable means. "Just
        | use indexes properly" is pretty ambiguous, but the rest
        | of his comment provides context. It doesn't make much
        | sense to me to index on fn(x) for predicates unless an
        | alternative sargable predicate is too complex or not
        | possible, but that doesn't mean it's never the best
        | solution.
| fastball wrote:
| I think it might be the other way around.
| slt2021 wrote:
    | the problem with the author's slow query is joining 7 tables
    | (forcing the SQL engine to compute the cartesian product of
    | 7 tables); this is the primary source of the slowdown.
    |
    | the faster query with UNION joins 4 tables and unions that
    | with 5 tables, so the faster query is two orders of
    | magnitude faster, because the SQL engine has to compute the
    | cartesian product on 5 instead of 7 tables
| Albert_Camus wrote:
| Author here. Between this and your other response, where you
| expound on the same point, I think you're being far too hand
| wavy about what causes performance issues. The number of
| joins alone doesn't have much to do with the performance
| characteristics of any query.
|
| What's more important for performance of queries on larger
| data sets than the number of joins is that there are indexes
| and that the query is written in a way that can utilize
| indexes to avoid multiplicative slowdowns. The reason the
| UNION query is fast is because the query on either side
| effectively utilizes the indexes so that the database engine
| can limit the number of rows immediately, rather than filter
| them out after multiplying them all together. I can expand
| this schema to have a UNION query with two 10-table joins and
| it would still perform better than the 7 table query.
|
| I think someone new to SQL is likely to read your statement
| and think _" okay joins are slow so I guess I should avoid
| joins"_. This is not true and this belief that joins are slow
| leads people down the path that ends at _" SQL just doesn't
| scale"_ and _" let's build a complicated Redis-backed caching
| layer"_.
|
| SQL performance is a complex topic. The point of our post was
      | to illustrate that a UNION query can simplify how you join
      | your tables and allow you to write constituent queries that
| have better performance characteristics. Morphing this into
| "the number of joins is smaller so the performance is better"
| is just incorrect.
| slt2021 wrote:
        | I invite you to share the execution plans of queries #2
        | and #4 with the public, so that people can decide what
        | is actually slowing things down, whether it is really
        | the cartesian product or something else
| slt2021 wrote:
        | Sorry if I was hand-wavy. I was trying to give other
        | people a simple framework that I use myself (Big O
        | calculated as the number of rows in each table).
        | Although I don't know how many rows you have in each
        | table, the Big O framework still works, because the
        | cartesian product of tables is dominated by two huge
        | different datasets (customers' orders being joined to
        | employees' orders). The UNION simply calculates them
        | separately, rather than doing a cartesian product of two
        | completely different datasets that represent different
        | entities.
        |
        | Well-written predicates and indexes can help, just as
        | poorly written predicates make things worse. So there is
        | a balance. This is not a shortcut or a silver bullet; it
        | is a trade-off. More indexes -> faster selects and
        | slower updates/inserts. One bad index -> a failed insert
        | and possibly lost customer data (happened to me once).
        |
        | I agree with you that "SQL performance is a complex
        | topic," and one should definitely study the query
        | execution plan to understand the bottlenecks and make
        | optimization decisions
| aidos wrote:
          | SQL queries never really adhere to that simple Big O
          | analysis, though. Sure, if you had zero indexes so
          | every operation was a nested sequential scan, then
          | it's cartesian, but that's never the case. Most often
          | you get really fast index lookups, by design.
| kroolik wrote:
        | I think it would be good for the article to note that
        | the second query is slow because the RDBMS doesn't use
        | indexes (and which ones). Currently, the text hand-waves
        | the problem and moves on.
|
| If the article had, instead, listed indexes, shown they
| were used in simple cases, shown they weren't used in the
| second query, dug into why they weren't (maybe they were
| but it was still hella slow) - that would be a ton of
| value!
| darig wrote:
  | It would be even faster if you switched to a series of queries
  | without any JOINs at all, and then stitched them together with
  | code other than SQL after the queries have run.
  |
  | Getting below 15 ms would be trivial.
| tech2 wrote:
| Perhaps I'm missing something, but why is query 2's meal_items a
| left join when the where clause then forces it to act as if it
| were an inner join? Would changing that have any impact?
| Albert_Camus wrote:
| Author here, you are indeed correct that Query #2's final join
| can be an INNER join. However, I just tested it against our
| test data set and it makes no impact on the performance.
| tech2 wrote:
| Ah well, thanks for humouring me. Good thing those query
| optimisers do the right thing on occasion.
| toast0 wrote:
| UNION is also helpful for reducing round trips between the db
| client and db server.
|
  | I've run into lots of cases where a SQL join is clear to write
  | but slow, while a simple query to get ids plus a union of
  | fetches for each id is fast. Likewise, a union of single-id
  | fetches is often faster than a single query with IN (x, y, z).
  | We can wish the SQL engine could figure it out, but my
  | experience from a few years ago is that they usually don't.
| ahmad_muzakkir wrote:
| I'm very interested to learn what you are saying.
|
| Can you give an example ?
| onlyrealcuzzo wrote:
| Does anyone with enough db experience know why the engine can't
| figure it out? Is it easy to explain?
|
| I'm not a db expert, but this seems like something that should
| be solvable.
| isoprophlex wrote:
      | Yeah, agreed. At the very least the thing should be able
      | to switch to this... let's call it "WHERE unrolling"...
      | internally when asked for WHERE id IN (x, y, z).
      |
      | I'd be delighted to learn what's going on
| AdrianB1 wrote:
| In the tests I've done, WHERE id IN (x,y,z) was unrolled as
        | WHERE id = x OR id = y OR id = z. That is fine if the
        | list is short; I had to find solutions when a developer
        | on my team built a dynamic query where the IN list had
        | ~4000 elements. A JOIN solved the problem quickly and
        | easily in that particular case.
| paulmd wrote:
| if you need to do very large "IN" clauses, one option can
| be a global temporary table where you preload values, so
| instead of doing an IN you do a join against the GTT.
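          | A minimal sketch of that pattern, using Python's
          | sqlite3 in place of a real engine with global
          | temporary tables (names and data invented):

```python
import sqlite3

# Load the id list into a temp table once, then join against it,
# instead of generating a huge IN (...) list in the query text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, float(i)) for i in range(1, 1001)])

wanted_ids = list(range(10, 500, 7))  # stand-in for a long dynamic list

conn.execute("CREATE TEMPORARY TABLE wanted (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO wanted VALUES (?)",
                 [(i,) for i in wanted_ids])

rows = conn.execute("""
    SELECT o.id, o.total
    FROM orders o JOIN wanted w ON o.id = w.id
""").fetchall()
print(len(rows))  # one row per wanted id
```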
| toast0 wrote:
| Where unrolling is an excellent term, thanks.
|
      | The best I can understand is that the engines are (or
      | were) just not set up to do repeated single-key lookups,
      | which is really the right strategy for a WHERE IN. As a
      | result, they do some kind of scan, which looks at a lot
      | more data.
| tpetry wrote:
        | The problem is that these repeated single-key lookups
        | are random I/O for the database engine. So the engine
        | has to predict a threshold for when a scan or random
        | I/O is cheaper, which is very hard to get right, and
        | your I/O layer changes this threshold drastically. A
        | spinning disk may always be faster with sequential I/O,
        | and flash-based systems have a wide variety of
        | performance profiles.
        |
        | To tackle this problem, PostgreSQL has a setting where
        | you can tune the cost factor for random I/O.
| arp242 wrote:
| As an aside, this setting is almost certainly too high
| for you by default. The default of random_page_cost=4
| assumes that random pages are 4 times more expensive to
| read than sequential pages, but with SSDs this cost is
| much much lower and chances are you're using an SSD these
| days.
|
| I managed to speed up queries by about 10x by just
| adjusting this. One Weird Trick To Speed Up Your
| PostgreSQL And Put DBAs Out Of a Job!
| guenthert wrote:
| Thanks for the hint. I wonder where the value of 4 comes
| from. Is this an old value derived from a (then) fast
| RAID across many spinning rust disks? As you pointed out,
| today --- using SSDs -- the cost should be lower; I'd
            | think in some cases (e.g. here a DB used for the
            | catalog of a back-up system) the backing store is
            | exactly one spinning rust drive, where I'd expect a
            | _much_ higher cost for random access (about
            | 80 MB/s / (100/s * 8 KiB)).
|
| Ah, [1] has the answer: they acknowledge that random
| accesses might be much slower, but expect a good share of
| accesses to be satisfied by the cache (that'll be a
| combination of various caches).
|
            | [1] https://www.postgresql.org/docs/13/runtime-config-query.html
| btilly wrote:
        | That one is annoying. IN tends to be implemented as a
        | loop. I've sped many queries up by putting the IN list
        | into a temporary table, then joining against it.
        |
        | I have actually seen it be a win to combine it like this:
        |     SELECT ... FROM foo JOIN (
        |         SELECT 'a' AS value
        |         UNION ALL
        |         SELECT 'b' AS value
        |         ...
        |     ) bar
| btilly wrote:
      | _I'm not a db expert, but this seems like something that
      | should be solvable._
      |
      | It is very highly database-specific. MySQL tends to be
      | terrible. PostgreSQL tends to be quite good. The reasons
      | why it did not figure it out in particular cases tend to
      | be complicated.
|
| Except for MySQL, in which case the problem is simple. There
| is a tradeoff between time spent optimizing a query and time
| spent running it. Other databases assume that you prepare a
| query and then use it many times so can spend time on
| optimizations. MySQL has an application base of people who
| prepare and use queries just once. Which means that they use
| simple heuristics and can't afford the overhead of a complex
| search for a better plan.
| bawolff wrote:
        | I was under the impression that MySQL was smart enough
        | to skip parts of the range that aren't needed:
        | https://dev.mysql.com/doc/refman/8.0/en/range-optimization.h...
| matwood wrote:
| > It is very highly database specific.
|
| I was about to comment the same thing. Every engine has its
| quirks, and the programmer learns them over time. I went
| from years on MSSQL to MySQL and it was a bit rough to be
| generous. But now I know many of the MySQL quirks and it's
| fine.
| kristianp wrote:
| "MySQL has an application base of people who prepare and
| use queries just once". Is it really likely that mysql is
| unique in its usage patterns? Isn't it more likely that the
| devs have decided to keep things simple because that's the
| historical story for mysql. For example it originally
| didn't even support foreign key constraints if I recall
| correctly (that's a while ago now of course).
| btilly wrote:
| Yeah, I'm pretty sure that it is unique to MySQL and
| SQLite.
|
| Back in the 90s and early 2000s, MySQL told people to
| just run queries. PHP encouraged the same.
|
| Every other database was competing on ability to run
| complex queries. And so it became common knowledge that
| applications written for MySQL didn't port well to, say,
| Oracle. Exactly because of this issue.
|
| Those applications and more like them are out there. And
| mostly expect to run on MySQL. So that use case has to be
| supported.
| ComputerGuru wrote:
| I agree on MySQL but hard disagree on SQLite because the
| API makes it rather obvious very quickly to any developer
| that there is a benefit to preparing beforehand.
| dragonwriter wrote:
| > "MySQL has an application base of people who prepare
| and use queries just once". Is it really likely that
| mysql is unique in its usage patterns?
|
| It's quite plausible that it is distinct in the usage
| patterns it has chosen to optimize for, and that that
| design choice has been instrumental in determining what
| use cases it became popular for and which use cases led
| people to migrate off of it for a different engine, yes.
|
          | To the extent that is at play here, it is somewhat of
          | an oversimplification (even if the target audience was
          | a factor in the design decision) to describe it as
          | simple unidirectional causation from usage to design;
          | the other way around is possibly more accurate (though
          | still oversimplified).
| bawolff wrote:
| I imagine its bad query optimizer doing the wrong thing.
| Maybe a different db engine would do better.
| slt2021 wrote:
| I don't like it when people who can't write efficient
| SQL or design an efficient database schema start blaming a
| "bad" query optimizer.
|
| Pro SQL developers extremely rarely have a problem with
| the query optimizer, and when they do, it's extremely
| exotic.
|
| It's usually ORM users that have a problem with SQL
| performance and blame the optimizer.
| matwood wrote:
| I agree they shouldn't start there, but you don't
| necessarily need exotic queries to run into edge cases
| that work differently across say MySQL, MSSQL, and
| Oracle. The classic example being exists vs left join
| null.
| bawolff wrote:
| The comment I'm responding to is asking why two
| equivalent queries have different runtimes. The answer
| pretty obviously is that the optimizer is smart enough
| to choose the best query plan in one case and not the
| other. I think it's fairly obvious that in theory a better
| optimizer could figure out the right plan in both cases.
|
| That's not to say that it's impossible to work around, or
| that in a real application you would ever be "stuck"
| by this. At the end of the day you deal with the software
| you have, but the optimizer can still have weaknesses
| without it totally derailing your app.
| gfody wrote:
| the optimizer is constrained, it still has to give you
| what you asked for. i can't even translate into english
| what's being asked for in query #2 but it's only a
| coincidence that it has the same result as the more
| precise and correct thing being asked for in the final
| query.
| lixtra wrote:
| There are two possible cases:
|
| a) with everything known to the database (i.e. ids not
| null) the two queries produce the same result.
|
| b) for some contents of the schema the results differ,
| but in the presented case some additional condition (not
| known to the db) holds that makes them equivalent.
|
| Only in case b) can it be called a coincidence. In case a)
| it's fair to ask if the optimizer can't do better. After
| all, it tries to avoid cartesian products for simple joins
| even though that's the textbook definition.
|
| "Not worth" to improve the optimizer is still a valid
| answer.
| gfody wrote:
| the query in question is definitely case b - it's a group
| by query with an explicit granularity, there's just no
| actual aggregates being asked for but the very presence
| or absence of a row in the results carries some meaning.
| the explicit grain is something like employee markouts
| plus customers from stores with at least one employee
| without markouts.
| sixdimensional wrote:
| There could be a lot of reasons that are highly engine
| dependent, for this specific case.
|
| A general answer, perhaps... not specific to the case you've
| specified.
|
| Query optimization is a science with multiple dimensions [1].
| I'd wager every problem in computer science plays a role
| somehow in query optimization.
|
| Query performance is based on a combination of actions you
| take to optimize the design of your system to get the best
| performance (e.g. data modelling, index design, hardware,
| query style, and more), and the patterns the system can
| recognize based on your inputs and the data itself, with the
| resources it has available, to optimize your queries.
|
| There are known patterns for optimization that are discovered
| over the years, many hard learned from practical experience.
| This is why older "popular" engines sometimes are more mature
| and more performant - they have optimizations built for the
| common use cases over long periods of time. That is not to
| say older engines are always better, just that they have
| often had more exposure to the variety of problems that
| occur.
|
| The reason why the engine "can't figure it out" is that most
| engines, even the best ones, are quite complex - combinations
| of known rules as well as more fuzzy logic, where the engine
| uses a combination of information and heuristics to
| essentially explore a possible solution space, to try to find
| the optimal execution plan. Making the right decision
| can be difficult and, given the nature of these things,
| sometimes the optimizer makes the wrong decision (this is
| why "hints" exist: sometimes you can force the optimizer
| to do what you see is obvious - but having to do so is
| itself suboptimal).
|
| In some cases, finding an optimal execution plan can actually
| be quite computationally expensive, and/or quite time
| consuming, or the engine in question may simply have no logic
| coded to handle the case. Optimization is all about finding
| the balance between finding the most performant query plan,
| but in the least amount of time, with the least computational
| and I/O impact to the overall system, that returns the right
| result. Optimizers are also highly dependent on the
| capabilities of the engineering teams that build them.
|
| It is not an easy problem, and it is an area which one could
| liken to almost machine learning/artificial intelligence, in
| one way. There are so many possible options, the problem
| space so big, with so many different ways to approach a given
| scenario, that it can be difficult for the "engine" to
| decide.
|
| This is why known patterns were created, for example,
| dimensional data models for analytical queries vs. 3rd normal
| form. Dimensional data models enable certain optimizations,
| for example, star schemas [2]. If you take a combination of
| implementing known patterns, along with optimizers written by
| engineers that exploit those patterns, you can get to a world
| of better performance.
|
| However, in a world that is, let's say.. more "open ended" -
| for example, the world of data in a "data lake", where data
| models are not optimized, data comes in unpredictable
| multiple shapes/sizes, then it often comes down to
| combinations of elegant/complex engines that can interpret
| the shapes of data, cardinality, and other characteristics,
| make use of much larger distributed compute and system
| performance, and in some cases - often brute force to arrive
| at the best query plan or performance possible.
|
| There are so many levels of optimization.. for example, if
| you were to look at things like Trino [3], which started its
| genesis as PrestoDb in Facebook - you will see special CPU
| optimizations (e.g. SIMD instructions), vectorized/pipelined
| operations - there are storage engine optimizations, memory
| optimizations, etc. etc. It truly is a complex and
| fascinating problem.
|
| Source: I was a technical product manager for a federated
| query engine.
|
| [1] https://en.wikipedia.org/wiki/Query_optimization
|
| [2] https://en.wikipedia.org/wiki/Star_schema
|
| [3] https://trino.io/
| Daynil wrote:
| I often read things that say that if you just write the query
| properly, you can trust the query optimizer to select the right
| plan. Just trust the optimizer! Unfortunately, in my
| experience, the more complex your query gets, the more
| opportunities the optimizer has to get it terribly wrong. I've
| learned to accept that, if the "clear/clean" version of the SQL
| (that the human likes to read and write) does not produce
| acceptable results, you just have to drop more explicit hints
| to the optimizer about what you want it to do. Query optimizers
| are truly awesome pieces of software, you just have to learn to
| work around their limitations when you hit them.
| slt2021 wrote:
| This is the ABC of SQL: every SQL developer has to learn this
| lesson sooner or later. Every additional JOIN you make creates a
| cartesian product of all previously joined records and the
| number of rows grows multiplicatively => you join tables A to B
| to C to D -> the SQL engine has to create AxBxCxD rows (each
| letter representing the number of rows in that table).
|
| Of course specifying useful predicates that filter down dataset
| and having indexes on these columns helps avoid row explosion,
| but you can't have an index on each and every column, especially
| on OLTP database.
|
| in example #2 the author joined 7 tables (multiply the row count
| of the 7 tables and this gives you an idea of how many rows the
| SQL engine has to churn through) - the big O complexity of the
| query is O(AxBxCxDxExFxG), so of course it will be SLOW.
|
| in example #4 he joins 4 tables and unions that with a 5-table
| join, so the big O complexity of the query is O(AxBxCxD(1+E)).
|
| same as in programming, there is Big O complexity in your SQL
| queries, so it helps to know the O() complexity of the queries
| that you write
|
| TLDR: learn the SQL Big-O and stop complaining about the query
| planner, pls
| ineedasername wrote:
| I don't understand what you mean by saying each join creates a
| cartesian product. Join criteria avoid this. If I left outer
| join 7 rows on table A and 5 on table B and join on a unique
| key then I still only get 7 rows, not the cartesian product of
| 7x5 rows. I query across a dozen or more tables on a regular
| basis with aggregates, window functions, with statements,
| having clauses and so on. Performance is rarely a problem.
| jaredsohn wrote:
| I've found denormalization to be helpful for such performance
| issues in that it reduces the number of joins. It can also be
| useful for debugging if you want to keep track of values from
| other tables at the time that data was processed within the
| primary table.
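A minimal sketch of that kind of denormalization in SQLite (schema invented for illustration): the price is copied into the order row at write time, so reads skip the join and the historical value survives later price changes.

```python
import sqlite3

# Denormalization sketch: copy the price into the order row at write
# time, so reads need no join and the value at the time of sale survives
# later price changes. Schema is hypothetical.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, price REAL)")
cur.execute("""CREATE TABLE order_items (
    id INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES products(id),
    price_at_purchase REAL)""")
cur.execute("INSERT INTO products VALUES (1, 9.99)")
cur.execute("""INSERT INTO order_items (product_id, price_at_purchase)
               SELECT 1, price FROM products WHERE id = 1""")

# The catalog price later changes, but the order keeps what it was sold at.
cur.execute("UPDATE products SET price = 12.99 WHERE id = 1")
row = cur.execute(
    "SELECT price_at_purchase FROM order_items WHERE product_id = 1"
).fetchone()
print(row[0])  # 9.99 -- no join needed, historical value preserved
```

The cost, as noted above, is that writes must remember to maintain the copy.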
| epr wrote:
| denormalization should only be seen as a last resort
| slt2021 wrote:
| denormalization aka memoization aka "dynamic programming of
| SQL" :-) - all three provide speed-up because they compute
| value once and reuse it later
| mtone wrote:
| Except that your source of truth aka your raw data is now
| memoized, hardly a win. This is not a goal of memoization
| or dynamic programming at all.
|
| Perfect for a reporting database though.
| monoideism wrote:
| A certain amount of denormalization can be helpful, but often
| there's a trade-off in terms of maintainability.
| rovr138 wrote:
| Not who you replied to.
|
| But yes,
|
| As an example, we keep our data normalized and add extra
| tables to do the operations we need. It's expensive since
| we have to precompute, etc. But then on lookups it's
| quicker.
|
| Like everything, it depends on what you want to optimize
| for.
| jaredsohn wrote:
| Yep, in my case we wanted to make user-facing queries
| fast (i.e. a reports page). The extra complexity (not
| much; we just have to remember when to update certain
| values in other tables when they change) is definitely
| worth it since load time dropped from 30 seconds
| sometimes to almost instant.
|
| Denormalizing everything is definitely a pain; keeping
| data in sync in mongodb (which lacks traditional joins)
| on a previous project was not fun; now using postgresql.
| [deleted]
| [deleted]
| ericHosick wrote:
| > Every additional JOIN you make creates a cartesian product of
| all previously joined records and the number of rows grows
| exponentially => you join tables A to B to C to D -> SQL engine
| has to create AxBxCxD number rows, each letter representing
| number of rows in that table)
|
| I think a cartesian product would be:
|
|             SELECT * FROM A, B;
|
| Given sizes a and b, the resulting number of rows would be
| a x b, or O(N^2).
|
| With a join:
|
|             SELECT * FROM A INNER JOIN B ON A.id = B.id;
|
| the result rows would be at most MIN(a, b), or O(N).
|
| A JOIN is a predicate that filters down the dataset.
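Whether the engine actually materializes the cross product can be checked by asking for the plan. A small SQLite sketch (tables invented for illustration): the equality join is executed as a scan of one table plus indexed lookups ("SEARCH") into the other, not as an a x b scan that is filtered afterwards.

```python
import sqlite3

# Ask SQLite how it executes an equality join: it scans one table and
# probes the other's primary-key index per row ("SEARCH"), rather than
# materializing the A x B cartesian product and filtering it.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE a (id INTEGER PRIMARY KEY)")
cur.execute("CREATE TABLE b (id INTEGER PRIMARY KEY)")
cur.executemany("INSERT INTO a VALUES (?)", [(i,) for i in range(5)])
cur.executemany("INSERT INTO b VALUES (?)", [(i,) for i in range(5)])

plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM a JOIN b ON a.id = b.id"
).fetchall()
for row in plan:
    # e.g. "SCAN a" then "SEARCH b USING INTEGER PRIMARY KEY (rowid=?)"
    print(row[-1])
```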
| slt2021 wrote:
| all queries with JOINs can be rewritten as cartesians:
|
|             select * from A join B on col1 = col2
|
| will be rewritten by the query optimizer into:
|
|             select * from A, B /* this is the cartesian that slows down your query */
|             where A.col1 = B.col2
|
| in fact both queries produce the same execution plan if you
| check for yourself
| whimsicalism wrote:
| ? not how it works behind the scenes at all
| ineedasername wrote:
| All queries could technically be expressed as a cartesian
| product but that is not necessary and not what happens in
| practice. Both of the above might produce the same plan
| because they are both treated as joins. One is expressed as
| an explicit join, the other as an implicit join, but
| neither requires the query engine to produce a cartesian
| product on execution. If it did, queries I run in seconds
| would require more storage or RAM than probably exists in
| my entire workplace, and I'm not using anything that would
| usually be considered "big data".
| slt2021 wrote:
| >>queries I run in seconds would require more storage or
| RAM than probably exists in my city
|
| my explanation for this: CPUs are very fast. My CPU is
| 4GHz, so a single core can do a lot of computations in one
| second, and a SQL engine is smart enough to do the
| cartesian computation (as part of a query plan) and
| discard a row if it does not meet the predicate
| condition.
|
| in fact I agree that not the entire cartesian is being
| computed, if you specify enough predicates. But the query
| still multiplies rows. In the author's article he is
| joining employees when the customer_id column is NULL, so
| this is technically a cartesian, because a NULL predicate
| is not very selective (= there are a lot of rows with a
| value of NULL)
| whimsicalism wrote:
| Cartesian product means a very specific thing - testing
| the predicates on every possible combination. This is not
| what occurs in practice
| ineedasername wrote:
| You are not using the term cartesian product correctly as
| applied to how database engines execute queries. Multiple
| people have detailed the problems with your ideas on how
| such things work. A fast processor cannot overcome the
| the need for multiple terabytes of RAM that would be
| required to process common queries if databases worked as
| you describe. Databases are significantly more likely to
| get bottlenecked by RAM than CPU, and your incorrect
| understanding of how databases work would require as much
| RAM as a large "big data" cluster to run queries on
| moderately sized data sets. Then even if it had that RAM,
| the bandwidth of the system's bus channels would be
| incapable of transporting it fast enough to return
| results in 500ms. Certainly not on my system with a few
| dozen other people working at the same time, and
| especially not when I have to query the live production
| system with a few thousand end users instead of the data
| warehouse. Databases do not work this way.
| slt2021 wrote:
| multiple people can be wrong, so that is not a valid
| argument. I can write 10 nested loops and overwrite the
| value of a single CPU register so many times that the
| count will exceed any big data cluster's capacity. This
| is what I meant by CPUs are fast.
|
| if you study engine internals - especially MergeJoin,
| HashJoin, and NestedLoopJoin - they all compute the
| cartesian while simultaneously applying predicates. Some
| operations are faster because of sorting and hashing, but
| they still do the cartesian.
| zeroimpl wrote:
| Umm.. virtually no join on tables with more than a
| handful of rows is done as a cartesian product. A
| suitable set of columns is used to sort or hash each side
| of the join, such that the actual join logic is basically
| linear performance (each row in table A is probably only
| ever paired with approximately 1 row in table B when
| evaluating the predicates). A cartesian product would
| involve testing the predicates on every combination of A
| and B (each row in table A being paired with each row in
| table B)
|
| (Note when I say linear performance, I mean the join
| predicate is executed a linear amount of times, but the
| initial sort/hash operation was probably more like O(n
| log n))
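The build/probe structure described above can be sketched in a few lines of Python (a toy in-memory hash join, not any particular engine's implementation): the join predicate is only evaluated for rows that share a key, not for every pair.

```python
from collections import defaultdict

def hash_join(a_rows, b_rows, a_key, b_key):
    """Toy hash join: build a hash table on one side, probe with the other."""
    # Build phase: index the (ideally smaller) side by its join key.
    index = defaultdict(list)
    for b in b_rows:
        index[b[b_key]].append(b)
    # Probe phase: each row of a only ever meets rows sharing its key,
    # so the pairing work is ~len(a) + len(b), not len(a) * len(b).
    return [(a, b) for a in a_rows for b in index.get(a[a_key], [])]

a = [{"id": i, "val": i * 10} for i in range(1000)]
b = [{"a_id": i, "label": f"row{i}"} for i in range(0, 1000, 2)]

result = hash_join(a, b, "id", "a_id")
print(len(result))  # 500 matching pairs, without 1000 * 500 comparisons
```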
| [deleted]
| gfody wrote:
| I think the point is rather that a join with a condition
| (which is the norm) is almost never actually executed as a
| cartesian product. take for example from tfa
|
|             select * from customer_order_items
|
| the grain is one row per customer order item, and the next
| two joins don't change that
|
|             select * from customer_order_items
|             join customer_orders on order_id
|
| the grain is still one row per customer order item
|
|             select * from customer_order_items
|             join customer_orders on order_id
|             join customers on customer_id
|
| the grain is still one row per customer order item
|
| ..etc..
|
| of course later on they screw it up and take the cartesian
| product of customer_order_items by employee_markouts, but
| it's just 2 big factors, not 7 - their query did finish
| after a few seconds. usually mistakes involving cartesian
| products with 7 factors just run for hours and eventually
| throw out of memory.
| gfody wrote:
| > you join tables A to B to C to D -> SQL engine has to create
| AxBxCxD number rows
|
| this isn't really true. not every join is a cross join, and not
| every join even changes the grain of the result set. it is
| important to pay attention when your joins do change the grain
| of the result set, especially when they grow it by some factor
| - then it's more like a product as you say.
| slt2021 wrote:
| all queries with JOINs can be rewritten as cartesians:
|
|             select * from A join B on col1 = col2
|
| will be rewritten by the query optimizer into:
|
|             select * from A, B /* this is the cartesian that slows down your query */
|             where A.col1 = B.col2
|
| in fact both queries produce the same execution plan if you
| check for yourself
| gfody wrote:
| the query is not the only input into the plan generator,
| the other inputs are the schema (with constraints and
| indexes) and their statistics - so you can't really say
| "check the plan for yourself" without all the inputs. and
| the only time the plan for "select * from A, B" is going to
| be the same as "select * from A, B where A.col1=B.col2" is
| when col1 and col2 are both neither primary keys nor
| foreign keys and that would be a weird condition under
| which to run such a query.
| slt2021 wrote:
| pls reread my post: I wrote that the query "select * from A
| join B on col1=col2" has the same plan as "select * from
| A, B where col1=col2"
|
| this is true regardless of schema and statistics, it is
| simply how query planner works under the hood
|
| and you can check it yourself on any database
| iaabtpbtpnn wrote:
| Those are both inner joins, just a different syntax for
| writing them.
| ohlookabird wrote:
| They might be logically equivalent, but they are not
| identical if more than two tables are joined. For
| instance, the current Postgres docs say [1]:
|
| > This latter equivalence does not hold exactly when more
| than two tables appear, because JOIN binds more tightly
| than comma. For example FROM T1 CROSS JOIN T2 INNER JOIN
| T3 ON condition is not the same as FROM T1, T2 INNER JOIN
| T3 ON condition because the condition can reference T1 in
| the first case but not the second.
|
| [1] https://www.postgresql.org/docs/current/queries-
| table-expres...
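Concretely, the precedence difference the docs describe looks like this (Postgres semantics; tables and columns hypothetical):

```sql
-- Valid: JOIN binds tighter than comma, and the ON clause of an INNER
-- JOIN that follows a CROSS JOIN may still reference t1.
SELECT * FROM t1 CROSS JOIN t2 INNER JOIN t3 ON t3.x = t1.x;

-- Error in Postgres: with comma syntax, t1 is a separate FROM item, so
-- the ON condition of the t2/t3 join cannot reference it.
SELECT * FROM t1, t2 INNER JOIN t3 ON t3.x = t1.x;
```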
| tehlike wrote:
| Something that helps with the exploding-cartesian problem is
| using json_agg for the joined tables. It's available in
| postgres, and it works great.
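A sketch of the pattern (Postgres syntax; the order/item tables are hypothetical): the child rows fold into one JSON value per parent instead of multiplying the parent rows.

```sql
-- One result row per order; the items collapse into a JSON array
-- instead of repeating the order row once per item.
SELECT o.id,
       o.created,
       json_agg(json_build_object('sku', i.sku, 'qty', i.qty)) AS items
FROM   orders o
JOIN   order_items i ON i.order_id = o.id
GROUP  BY o.id, o.created;
```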
| meritt wrote:
| Your understanding is flawed, these aren't cross joins.
| boulos wrote:
| Huh. Like other commenters, I'm surprised to see a multi part
| JOIN be the answer for "I want the results of Query A as well as
| the results of Query B".
|
| If anything, I think many folks would say "whatever, make the two
| requests in parallel and zip them together in the application".
| If not, WITH expressions to make the two independent queries and
| then a final SELECT to merge them however you prefer would be
| reasonable (whether that's a join flavor or
| UNION/INTERSECT/etc.), but the super explosion JOIN just doesn't
| cross my mind.
|
| I think there must be a different example that they tried to
| anonymize for this post, otherwise they don't need the IDs (and
| probably want a sum of quantity). If the author is around, what
| was the more realistic example behind:
|
| "In order to audit inventory, the logistics team at the corporate
| headquarters requests a tool that can generate a report
| containing all meal_items that left a given store's inventory on
| a particular day. This requires a query that includes items that
| were both sold to customers as well as recorded as employee
| markouts for the specified store on the specified day."?
| IgorPartola wrote:
| Where this fails (and I don't see the article addressing this
| concern) is if you want to do pagination. Say you are
| retrieving page 17 of 8000. Will the UNION still be performant
| then? After all, it will require a stable sort order that may
| depend on one of the branches. You certainly can't paginate
| without doing a whole lot of logic in your application if you
| use two separate queries. My typical solution to this is to do
| one query of employee IDs, another of customer IDs, and then
| use an IN clause. Or you could use an intermediate ID of some
| kind that will collapse your diamond branch into a single IN.
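One way to keep pagination stable over a UNION is keyset ("seek") pagination on a composite sort key, with the cursor predicate pushed into each branch. A SQLite sketch (schema and data invented for illustration):

```python
import sqlite3

# Keyset pagination over a UNION ALL, using a stable composite sort key
# (created, src, id). The cursor predicate is repeated inside each
# branch so neither side has to materialize everything.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE emp_items (id INTEGER PRIMARY KEY, created TEXT)")
cur.execute("CREATE TABLE cust_items (id INTEGER PRIMARY KEY, created TEXT)")
cur.executemany("INSERT INTO emp_items VALUES (?, ?)",
                [(i, f"2021-02-{i % 28 + 1:02d}") for i in range(100)])
cur.executemany("INSERT INTO cust_items VALUES (?, ?)",
                [(i, f"2021-02-{i % 28 + 1:02d}") for i in range(100)])

PAGE_SQL = """
SELECT * FROM (
    SELECT created, 'e' AS src, id FROM emp_items
    WHERE (created, 'e', id) > (?, ?, ?)
    UNION ALL
    SELECT created, 'c' AS src, id FROM cust_items
    WHERE (created, 'c', id) > (?, ?, ?)
)
ORDER BY created, src, id
LIMIT 10
"""

cursor_key = ("", "", -1)          # sorts before every real row
page1 = cur.execute(PAGE_SQL, cursor_key * 2).fetchall()
cursor_key = page1[-1]             # last row of the page is the next cursor
page2 = cur.execute(PAGE_SQL, cursor_key * 2).fetchall()
print(page1[-1], page2[0])         # page 2 starts strictly after page 1 ends
```

Unlike OFFSET, the cost of fetching page N doesn't grow with N, at the price of only supporting next/previous navigation.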
| catmanjan wrote:
| In my experience this is the number one driver for infinite
| scroll, dropping paging simplifies a lot
| sahila wrote:
| Having infinite scroll doesn't avoid paging results from a
| query. Like the OP mentioned, if you have 8000 page *
| results/page records, you may only want to retrieve 50
| records at a time to pass back to the client. In that case,
| you will still need a way to query for the next set / page.
| Infinite scroll is a UX implementation.
| catmanjan wrote:
| Yes - very true, however when we used paging we found
| users were going "I know my results are on page 19, skip
| to there" but when they were faced with the task of
| scrolling to page 19, their behaviour changed and they
| used filtering
| ajmurmann wrote:
| Did that make the user experience worse or better?
| catmanjan wrote:
| Haha no comment
| OJFord wrote:
| > "I know my results are on page 19, skip to there"
|
| What on Earth is your application where that's the case?
| ..A forum maybe? (Revisiting a thread looking for a
| particular reply I mean.)
| vletal wrote:
| Have you ever used pornhub?
| [deleted]
| catmanjan wrote:
| Document registry, but I've heard users doing this in
| just about every application that allows it - just doing
| a broad search and memorising the page that the thing
| they actually want is on
| rini17 wrote:
| If users often search for known things then why not
| optimize the engine/UI to the use case instead of
| removing the use case?
|
| Like, remembering the association <search term - selected
| results> and prominently display these?
| catmanjan wrote:
| Search results were boosted based on usage, but the
| performance benefits of infinite scroll outweighed the
| use case
| Black101 wrote:
| Speeding up SQL queries isn't hard, but it takes time, and it's
| worth it.
| mongol wrote:
| What is the best book to learn performant SQL querying? I
| recently needed to query a large database for exploratory
| purposes. The database has historical data about many years of
| sales transactions of a huge company. It is using partitioning
| keys. I know pretty well how to write queries to get what I want
| but given the size of the database I need to write them
| performantly. How can I learn that best?
| sqldba wrote:
| SQL Performance Explained, hands-down:
| https://sql-performance-explained.com/
|
| Covers all the main DBMS, explained very simply from the bottom
| up, maintained constantly and has been read and recommended by
| people for years and years.
| croes wrote:
| No need to wrap the union queries in a new select to order the
| result. An order by after the last select in the unions orders
| the whole result set as long as the column names match in each
| part of the union.
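This is easy to verify; a minimal SQLite check of the claim:

```python
import sqlite3

# An ORDER BY written after the last SELECT of a UNION ALL sorts the
# combined result set, not just the final arm.
conn = sqlite3.connect(":memory:")
rows = conn.execute("""
    SELECT 3 AS n UNION ALL
    SELECT 1      UNION ALL
    SELECT 2
    ORDER BY n
""").fetchall()
print(rows)  # [(1,), (2,), (3,)]
```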
| AdrianB1 wrote:
| That works in MS SQL, not guaranteed in others.
| croes wrote:
| Works in mysql, postgres, oracle, MS, DB2
| rrrrrrrrrrrryan wrote:
| This depends on the database engine. The syntax will error out
| in some (perhaps most).
| croes wrote:
| Works in mysql, postgres, oracle, MS, DB2. Can you name an
| engine where it fails?
| balfirevic wrote:
| I'm now curious (but too lazy to setup and test it) how the
| following query would fare:
|
|         SELECT meal_items.*,
|                employee_markouts.employee_id,
|                customer_orders.customer_id
|         FROM meal_items
|         LEFT JOIN employee_markouts
|           ON employee_markouts.meal_item_id = meal_items.id
|         LEFT JOIN employees
|           ON employees.id = employee_markouts.employee_id
|         LEFT JOIN customer_order_items
|           ON customer_order_items.meal_item_id = meal_items.id
|         LEFT JOIN customer_orders
|           ON customer_orders.id = customer_order_items.customer_order_id
|         LEFT JOIN customers
|           ON customers.id = customer_orders.customer_id
|         WHERE (employees.store_id = 250
|                AND employee_markouts.created >= '2021-02-03'
|                AND employee_markouts.created < '2021-02-04')
|            OR (customers.store_id = 250
|                AND customer_orders.created >= '2021-02-03'
|                AND customer_orders.created < '2021-02-04')
| Albert_Camus wrote:
| Author here. This doesn't give the correct results. It produces
| meal_items that have both customer_id and employee_id. Here's
| an excerpt (the full result set is thousands of rows, as
| opposed to the expected 45):
|
|         id |label|price|employee_id|customer_id|
|         ---|-----|-----|-----------|-----------|
|         ...
|         344|Poke | 4.18|       3772|      13204|
|         344|Poke | 4.18|       3313|      13204|
|         344|Poke | 4.18|       2320|      13204|
|         344|Poke | 4.18|        632|      13204|
|         344|Poke | 4.18|       4264|      13204|
|         344|Poke | 4.18|        699|      13204|
|         344|Poke | 4.18|       1070|      13204|
|         344|Poke | 4.18|       3022|      13204|
|         344|Poke | 4.18|       1501|      13204|
|         344|Poke | 4.18|        808|      13204|
|         344|Poke | 4.18|       2793|      13204|
|         344|Poke | 4.18|       1660|      13204|
|         344|Poke | 4.18|        932|      13204|
|         ...
|
| To be clear, there are ways to write this query without UNION
| that have both good performance and give the correct results,
| but they're very fiddly and harder to reason about than just
| writing the two comparatively simple queries and then mashing
| the results together.
| balfirevic wrote:
| > Author here. This doesn't give the correct results.
|
| Right you are! :) Should have given it more thought before
| posting. Thanks for jumping in.
| gfody wrote:
| this is probably how I'd write it (assuming pg properly pushes
| the predicates down to the subquery in the lateral join)
|
|         select mi.*, x.employee_id, x.customer_id
|         from meal_items mi
|         join lateral (
|             select e.store_id, em.created, em.employee_id,
|                    null as customer_id
|             from employee_markouts em
|             join employees e on e.id = em.employee_id
|             where em.meal_item_id = mi.id
|             union all
|             select c.store_id, co.created, null as employee_id,
|                    co.customer_id
|             from customer_order_items coi
|             join customer_orders co on co.id = coi.customer_order_id
|             join customers c on c.id = co.customer_id
|             where coi.meal_item_id = mi.id
|         ) x on true
|         where x.store_id = 250
|           and x.created between '2021-02-03' and '2021-02-03'
|         order by mi.id
| knuthsat wrote:
| OR really messes things up. UNION instead makes things fast.
| balfirevic wrote:
| Oh, I wouldn't presume that it will be faster than the union
| but I wonder how it compares to the original query in the
| article (the one with OR).
| gfody wrote:
| > Retrieving both employee and customer meal items
|
| When you're getting two separate sets of things in one query you
| should expect to use union. This isn't so much a performance
| optimization as just expressing the query properly.
|
| Imagine starting with just this weird SQL that gets both employee
| and customer meal items using only joins, and then trying to
| infer the original intent.
| [deleted]
| clcaev wrote:
| The author doesn't describe why they thought a cross product of
| two large tables, using a group by to remove duplicates, would be
| a good basis for this query.
| gfody wrote:
| The author(s?) almost certainly have no idea what they're
| talking about. Wtf are "diamond schemas" and "multi-branch
| joins"? This blog is the sole google result for that made up
| nonsense that's supposedly there to imply that it's somehow
| normal to put an OR expression with columns from two different
| tables into a join clause. They're using select * in a group by
| query with no aggregates. And they're talking about keywords as
| operations as if their SQL is directly executed and doing stuff
| rather than merely being the input for a plan generator (this
| despite the first paragraph - maybe one of them knows better
| than the other?).
|
| They say that query #2 is "the intuitive way of thinking
| through the problem, and something that someone with
| intermediate or advanced SQL skills could come up with" and is
| slow because it "eliminated the optimizer's ability to create
| an efficient query plan" - and they're wrong on both counts.
| That query is a nightmare no advanced SQL expert would write,
| and the plan that was generated was probably perfectly optimal
| to give them what they asked for. They just don't realize they
| asked for ridiculous nonsense that accidentally matches their
| expected results.
| idoubtit wrote:
| > The author(s?) almost certainly have no idea what they're
| talking about.
|
| > They're using select * in a group by query with no
| aggregates.
|
| The author wrote something akin to
|
| SELECT A.*, B.x FROM A JOIN B GROUP BY A.id, b.x
|
| This is perfectly valid SQL, as per the SQL standard: "Each
| <column reference> in the <group by clause> shall
| unambiguously reference a column of T. A column referenced in
| a <group by clause> is a grouping column". From the SQL-92
| reference, section 7.7:
| http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt
|
| When using offensive language, you'd better be sure of the
| technical quality of what you write, if you don't want to
| show an ugly portrait of yourself.
| matwood wrote:
| A.* assumes every column in A is functionally dependent on
| the A.id. This may be the case, but is a huge source of
| bugs I have seen related to MySQL, and would set my spidey
| sense off if I saw it in code. The problem is that older
| versions of MySQL allowed non-functionally dependent columns
| in the select list and would randomly put one of the values
| in the result set (every other sane database at the time
| properly errored out). At this point, I would consider it
| bad practice to run MySQL in anything other than only full
| group by mode.
|
| See here: https://dev.mysql.com/doc/refman/5.7/en/group-by-
| handling.ht...
|
| And here: https://www.percona.com/blog/2019/05/13/solve-
| query-failures...
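SQLite has a similar permissiveness, which makes the hazard easy to demonstrate (toy table for illustration): a selected column that is neither grouped nor aggregated is accepted, and the value returned per group is unspecified.

```python
import sqlite3

# SQLite (like pre-5.7 MySQL defaults) accepts a "bare" column that is
# neither in the GROUP BY nor aggregated, and returns an arbitrary value
# from each group rather than raising an error.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (grp TEXT, val INTEGER)")
cur.executemany("INSERT INTO t VALUES (?, ?)",
                [("a", 1), ("a", 2), ("b", 3)])

# val is bare here -- no error, but whether 1 or 2 comes back for
# group 'a' is unspecified.
rows = cur.execute(
    "SELECT grp, val FROM t GROUP BY grp ORDER BY grp"
).fetchall()
print(rows)  # e.g. [('a', 2), ('b', 3)] -- the 'a' value is arbitrary
```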
| gfody wrote:
| > This is perfectly valid SQL
|
| no, it's not. this query only runs because postgres is not
| compliant with SQL-99 T301. the query is unambiguously
| invalid under SQL-92.
|
| > When using offensive language
|
| fwiw I regret my tone but this article deserves to be
| criticized. I love and respect rdbms technology and it's
| exciting to see anything "SQL" in a headline on HN - but
| then it bums me out to see bogus terms tossed around and
| insane queries presented as normal. if this were some
| beginner blog writing about lessons learned that's one
| thing, but this is a professional consultancy firm writing
| with the air of "you can trust us, we're experts".
___________________________________________________________________
(page generated 2021-03-21 23:03 UTC)