[HN Gopher] Speeding up SQL queries by orders of magnitude using...
       ___________________________________________________________________
        
       Speeding up SQL queries by orders of magnitude using UNION
        
       Author : tate
       Score  : 373 points
       Date   : 2021-03-20 17:58 UTC (1 day ago)
        
 (HTM) web link (www.foxhound.systems)
 (TXT) w3m dump (www.foxhound.systems)
        
       | magicalhippo wrote:
        | At work I ended up with something similar. A reporting query has
        | about 20 SUM()'ed columns, and I found the easiest and by far
        | fastest approach was just to have 20+ queries glued together by
        | UNION ALL, and aggregate the sums over that.
       | 
        | It's heavy in its raw form, but if you add some search terms to
        | narrow things down, the DB really cuts through.
       | 
        | It's about 320 lines of SQL, so to make debugging a bit easier I
        | added a dummy "identifier" column to each of the union'ed sub-
        | queries: the first has the value 100, the second 200, and so on,
        | so I can quickly identify which of the sub-queries produced a
        | given row. Not counting by 1s also leaves room to insert
        | additional sub-queries near the relevant "identifiers".
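A minimal sketch of this identifier-column trick, using SQLite via Python (the table names and amounts are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales_a (amount REAL);
    CREATE TABLE sales_b (amount REAL);
    INSERT INTO sales_a VALUES (10), (20);
    INSERT INTO sales_b VALUES (5);
""")

# Each UNION ALL branch carries a constant "identifier" column (100, 200,
# ...) so a surprising row can be traced back to the branch that made it.
rows = con.execute("""
    SELECT src, SUM(amount) AS total
    FROM (
        SELECT 100 AS src, amount FROM sales_a
        UNION ALL
        SELECT 200 AS src, amount FROM sales_b
    )
    GROUP BY src
    ORDER BY src
""").fetchall()
print(rows)  # [(100, 30.0), (200, 5.0)]
```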
        
       | CraigJPerry wrote:
       | Is there any way to exploit parallelism in the customer vs
       | employee queries then merge the results?
       | 
       | My intuition would have been to tackle this in 3 steps with a
       | temp table (create temp table with employee markouts, append
       | customer meals, select ... order by). I can see the overheads of
       | my solution but I'm wondering if i could beat the Union all
       | approach somehow. My temp table wouldn't benefit from an index so
       | I'm left wondering about parallelism.
       | 
       | If i remember correctly though, there's no concurrency, and
       | therefore no opportunity for parallelism for transactions in the
       | same session.
        
         | punnerud wrote:
         | Temp table as in using WITH?
        
           | Albert_Camus wrote:
            | Author of the original article here. Temporary tables are
            | different from WITH clauses (which introduce common table
            | expressions, or CTEs). In many database engines, you can
            | create a temporary table that persists for a single session.
            | The syntax is the same as ordinary table creation; it just
            | starts with CREATE TEMPORARY TABLE ....
           | 
           | More info in the PostgreSQL docs [1]:
           | 
           | > If specified, the table is created as a temporary table.
           | Temporary tables are automatically dropped at the end of a
           | session, or optionally at the end of the current transaction
           | (see ON COMMIT below). Existing permanent tables with the
           | same name are not visible to the current session while the
           | temporary table exists, unless they are referenced with
           | schema-qualified names. Any indexes created on a temporary
           | table are automatically temporary as well.
           | 
           | [1]: https://www.postgresql.org/docs/13/sql-createtable.html
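The same idea sketched in SQLite via Python (the table here is invented); like Postgres, SQLite scopes a temporary table to the current connection/session:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("INSERT INTO employees VALUES (1, 'Ada'), (2, 'Lin')")

# Same syntax as ordinary table creation, just prefixed with TEMPORARY.
# The table is dropped automatically when this connection ends.
con.execute("CREATE TEMPORARY TABLE staff_ids AS SELECT id FROM employees")
count = con.execute("SELECT COUNT(*) FROM staff_ids").fetchone()[0]
print(count)  # 2
```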
        
             | ewalk153 wrote:
              | If this is a database for reporting, using a temporary
              | table is probably fine and a union all wouldn't concern me.
              | 
              | On MySQL, using a UNION ALL creates a temp table, which can
              | perform catastrophically when the database is under load.
              | I've seen a UNION ALL query with zero rows in the second
              | half render a database server unresponsive under high load,
              | causing a service disruption. We ended up rewriting the
              | UNION ALL query as two database fetches and have not seen a
              | single problem in that area since.
             | 
             | I was shocked by this union all behavior, but it is
             | apparently a well known thing on MySQL.
             | 
             | I can't speak to Postgres behavior for this kind of query.
        
               | Albert_Camus wrote:
               | In the past we worked on a system that used MySQL 8. We
               | used UNION (not UNION ALL, but I assume it doesn't
               | matter) in several places, applying it to improve
               | performance as we described in the article. There were
               | definitely cases in the system where one side of the
               | UNION would return zero rows, but we never ran into any
               | of the types of issues you're describing.
        
               | jeffbee wrote:
               | Yeah, you _really_ have to watch out for this. I once
               | spent months chasing down a serious but rare performance
               | problem in a large-scale mysql 5.6 deployment, which was
               | eventually root-caused to kernel slab reclaim pressure,
               | caused by very high XFS metadata mutation rate, caused by
               | MySQL creating an on-disk temporary ISAM file for every
               | query with a union, which was most of the traffic.
        
         | AdrianB1 wrote:
          | MS SQL will do most of the work in parallel, but you can
          | control whether that is what you want.
        
         | EdwardDiego wrote:
         | I'm pretty sure later versions of PG can indeed execute both in
         | parallel.
        
           | Izkata wrote:
           | Not sure about each side of the UNION, but postgres can do
           | parallel index scans for even simple criteria:
           | https://www.postgresql.org/docs/10/parallel-
           | plans.html#PARAL...
           | 
           | Also the article says these tests were done on postgres 12.6
        
       | kroolik wrote:
        | The author didn't mention the reason why the query is slow.
        | 
        | I'd challenge the claim that the non-optimal query is something
        | that a person with advanced SQL skills would write. A seasoned
        | engineer knows that by joining two tables, you are looking at,
        | worst case, NxM comparisons.
        | 
        | The problem is that you are joining the customers table with
        | THREE other tables - the merged result of stores, employees and
        | markouts. No matter how hard you try, you can't escape the fact
        | that you are joining far more rows than just stores and
        | customers.
        | 
        | I would call it a layman's mistake that SQL beginners learn the
        | hard way. Advanced SQL users know the very basics of the DB they
        | are working with.
        
         | ncallaway wrote:
         | > I would call it a layman issue SQL beginners learn the hard
         | way. Advanced SQL users know the very basics of the db they are
         | working with.
         | 
         | So?
         | 
          | Does every blog post have to be for expert users of the system?
          | What's wrong with a blog post that explains mistakes a beginner
          | might make, and a method for mitigating them?
        
           | kroolik wrote:
           | Completely nothing is wrong with it! That's not what I'm
           | arguing with. I'm challenging the statement in the Conclusion
           | section:
           | 
           | > Arriving at query #2 to get the combined results was the
           | intuitive way of thinking through the problem, and something
           | that someone with intermediate or advanced SQL skills could
           | come up with.
           | 
            | The issue in question is simple. It's a very good lesson for
            | people early in their SQL journey. It would be a superb read
            | if the author had dug into the details of why it's slow.
            | 
            | But advanced engineers should know that LEFT JOINing 6+
            | tables will be huge. I disagree that it's a mistake an
            | experienced person would make.
        
             | ncallaway wrote:
             | Ah, fair play. I must've missed that part of the article,
             | or just didn't have it top of mind when reading your
             | comment.
        
         | jiggawatts wrote:
            | I don't even understand the thought process that would lead
            | to the non-optimal query. It would simply never occur to me
            | to use anything other than UNION ALL for this kind of
            | scenario, because it is the only appropriate translation of
            | the English-language business request into a SQL query.
        
           | dwd wrote:
            | After working 20+ years in the industry, I'm always surprised
            | by the aversion to (or lack of thought about) drawing up a
            | Venn diagram on a whiteboard to visualise the relation, let
            | alone then writing it out using set notation.
        
           | avereveard wrote:
           | Probably the result of some early attempt at either sorting
           | or removing duplicates
        
           | robertlagrant wrote:
           | It could be that the query was gradually modified. It's
           | pretty simple to spot the union when the problem's being
           | focused on, laid out nicely in a blog post, and the answer
           | already given!
        
       | gigatexal wrote:
       | Would be cool to post the sqldump of the data somewhere so we
       | could play with the queries and the database with the same data.
        
       | Justsignedup wrote:
        | Just to note, the easiest heuristic I've been using to figure out
        | when optimization might be useful is when you have an OR clause
        | spanning more than one table. Column A OR column B is fine, but
        | table A's column A OR table B's column A is gonna cause
        | interesting problems.
       | 
       | Also:
       | 
       | union = distinct values
       | 
       | union all = no distinct
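A tiny demonstration of that difference, in SQLite via Python:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# UNION removes duplicate rows across both sides; UNION ALL keeps them.
distinct = con.execute("SELECT 1 AS v UNION SELECT 1").fetchall()
everything = con.execute("SELECT 1 AS v UNION ALL SELECT 1").fetchall()
print(distinct)    # [(1,)]
print(everything)  # [(1,), (1,)]
```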
        
         | croes wrote:
         | Important difference between union and union all, especially if
         | you don't query a unique id column
        
           | [deleted]
        
       | 02020202 wrote:
       | i use projections so simple query fuck the what.
        
       | zeroimpl wrote:
       | I suspect query #2 would have performed better if they used a
       | COALESCE instead of an OR condition. One of the most important
       | optimization lessons you can learn is not to use OR for any join
       | conditions.
       | 
        | Eg instead of this:
        | 
        |     ON (customer_order_items.meal_item_id = meal_items.id
        |         OR employee_markouts.meal_item_id = meal_items.id)
        | 
        | Do this:
        | 
        |     ON meal_items.id = COALESCE(employee_markouts.meal_item_id,
        |         customer_order_items.meal_item_id)
       | 
        | The latter can be optimized properly. It's probably still slower
        | than the UNION, but would be more in line with expected
        | performance. It might also be more useful in certain cases, such
        | as when building a view.
        
       | twoodfin wrote:
       | I know of at least one commercial DBMS that will perform this
       | transformation automatically in some circumstances, and I'd be
       | surprised if most of the other major systems couldn't do the
       | same.
        
         | tpetry wrote:
         | All database systems could implement many more optimizations.
         | Heck, all these tricks to get the queries to complete faster
         | are only needed because the database engine did not find this
         | fast execution plan.
         | 
          | The problem is that the more optimizations you implement, the
          | more possibilities the query planner has to evaluate, which may
          | result in (1) a too-long query planning phase or (2) the
          | possibility of de-optimizing a good query because an
          | optimization was falsely predicted to be faster.
        
         | posedge wrote:
          | Clever optimization engines are IMO no excuse for blindly
          | throwing queries at them without knowing what's going on. Every
          | optimizer has corner cases in which it fails, and I'd argue
          | that as a good developer you should know when they can happen.
         | 
         | Additionally, in this example, the rewritten query is easier to
         | read and maintain.
        
       | danvk wrote:
       | Why is it so important to express this as one SQL query? Why not
       | just run the two simple queries? It feels as though two different
       | types of things are being joined.
        
         | Daynil wrote:
         | Depends on your goals. If you're a software engineer, it won't
         | make a difference, just send two queries to the DB and combine
         | the results before sending them off. If you're an analyst,
         | you'll often get repeated ad-hoc requests. If you have to run 2
         | queries, then stitch them together, each time you are asked for
         | a report, you're wasting tons of (really boring) time. Even if
         | you ultimately create some code to run and stitch the results
         | together for you, it's just faster and easier to write 1 query
         | wherever possible. Just let the database do the work!
        
           | Albert_Camus wrote:
           | > If you're a software engineer, it won't make a difference,
           | just send two queries to the DB and combine the results
           | before sending them off.
           | 
            | This isn't true if you have certain use cases in your
            | application, pagination being one of them. There's no simple
            | way to implement pagination when you have two or more queries
            | that return an unknown number of results and you're tasked
            | with maintaining a consistent order across page changes.
           | 
           | If you make a single query do all the work, it's easy to
           | implement paging in any number of ways, the most performant
           | being to filter by the last id the client saw. If your users
           | aren't likely to paginate too far into the data, LIMIT and
           | OFFSET will work fine as well.
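A rough sketch of that filter-by-last-seen-id (keyset) pagination, using SQLite via Python with a made-up items table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO items (name) VALUES ('a'), ('b'), ('c'), ('d'), ('e');
""")

def page(last_seen_id, size=2):
    # Seek past the last id the client saw via the primary-key index,
    # instead of scanning and discarding rows as OFFSET would.
    return con.execute(
        "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (last_seen_id, size),
    ).fetchall()

first = page(0)
second = page(first[-1][0])
print(first)   # [(1, 'a'), (2, 'b')]
print(second)  # [(3, 'c'), (4, 'd')]
```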
        
         | drstewart wrote:
         | You're falling into the trap of thinking like the developer
         | looking at the implementation, not the end user. The goal is:
         | 
         | >In order to audit inventory, the logistics team at the
         | corporate headquarters requests a tool that can generate a
         | report containing all meal_items that left a given store's
         | inventory on a particular day
         | 
         | That's one thing. The end user doesn't see these as two
         | separate requests just because that's how it's modeled in the
         | database.
        
           | robertlagrant wrote:
           | The end user here being the person writing the union SQL
           | query?
        
       | fifilura wrote:
        | Is the article essentially saying that [1] is faster than [2]?
       | 
       | I have used [1] many times for that reason although [2] is
       | probably more intuitive for what you want to do.
       | 
        | [1]
        | 
        |     SELECT vegetable_id, SUM(price) AS price, SUM(weight) AS weight
        |     FROM (
        |         SELECT vegetable_id, price, NULL AS weight FROM prices
        |         UNION ALL
        |         SELECT vegetable_id, NULL AS price, weight FROM weights
        |     ) AS combined
        |     GROUP BY vegetable_id
        | 
        | [2]
        | 
        |     SELECT prices.vegetable_id, price, weight
        |     FROM prices
        |     JOIN weights
        |         ON prices.vegetable_id = weights.vegetable_id
        
         | baix777 wrote:
          | The relative performance of these two queries will vary with
          | data volume. Swap in a sales table for the weights table - a
          | massive sales table at that - and joining it to the much
          | smaller prices table can be much faster than a group by. Stated
          | differently, a join can be faster than a group by. This is even
          | more true when the small table fits into memory and a hash join
          | can be used, while the data in the group by can't fit into
          | memory.
        
           | fifilura wrote:
           | Yes I'd say that I would intuitively do that only if the
           | tables were both sufficiently large.
           | 
           | I assume somewhere there is a similar assumption in TFA.
        
         | sandfly wrote:
         | Do you lose the indices on the columns after the UNION?
        
           | ComputerGuru wrote:
           | Definitely, but that is more than made up for by the much
           | smaller size.
        
         | croes wrote:
          | These queries have different results. [2] retrieves only the
          | vegetable_ids which are in both tables; [1] gives all ids which
          | are in prices or weights. If a NULL vegetable_id exists in
          | either table, [1] results in an extra row with id NULL; this
          | doesn't happen for query [2].
        
           | fifilura wrote:
            | If I replace JOIN with FULL OUTER JOIN, you'll get what you
            | describe. It was just a quick example, but you are right.
           | 
            | There are also things to say about what happens if either
            | table has duplicate vegetable_ids. At some point it is
            | assumed that you have processed the data in such a way that
            | the table is well formed for the purpose.
        
       | blunte wrote:
       | Not speaking negatively about the author of this post, but union
       | should have been the obvious tool to use in a scenario like this.
       | 
       | SQL is very powerful and really not so complicated. Schema,
       | index, and constraint design can be quite challenging, depending
       | on the needs, but that's true whether using SQL or an ORM.
       | 
        | The problem I see often is that people use ORMs and never learn
        | SQL. Also, one often has to work hard to understand how their ORM
        | handles certain situations to avoid problems like N+1 queries or
        | cartesian products.
       | 
        | In the end, it seems that rather than learning SQL, the developer
        | ends up learning just as much or more about a specific ORM to get
        | the same quality and performance they would have gotten by using
        | direct SQL.
        
       | jjtheblunt wrote:
       | I wonder if i'm the only one who sees the phrase "order(s) of
       | magnitude" and expects a vague story.
       | 
       | I don't mean to denigrate this article, just wonder if others
       | have the same reaction.
        
       | noisy_boy wrote:
        | Everyone the store is selling to is a customer; an employee is a
        | customer too, except the price might be zero. What if a field is
        | added to the customers table: employee_id (null by default, as
        | most customers would not be employees)? Then the query can be run
        | without any union, bypassing stores/employees/employee_markouts
        | etc.:
        | 
        |     SELECT meal_items.*,
        |            customer_id,
        |            customers.employee_id
        |     FROM customers
        |         INNER JOIN customer_orders
        |             ON customers.id = customer_orders.customer_id
        |         INNER JOIN customer_order_items
        |             ON customer_orders.id = customer_order_items.customer_order_id
        |         INNER JOIN meal_items
        |             ON customer_order_items.meal_item_id = meal_items.id
       | 
       | Of course this means that customer_id will never be null but that
       | kind of is the point.
       | 
       | This can also be approached by adding a customer_id field to the
       | employees table instead (in this example the no. of employees and
       | customers are both 20000 but typically the no. of employees is
       | less than the no. of customers so adding customer_id to employees
       | table would be more typical - in which case the query would
       | obviously change).
        
         | kroolik wrote:
          | I think this is called denormalization, right? Instead of
          | modifying the domain for this specific use case, you could try
          | creating a read model (if you don't want to pollute your
          | domain). Sometimes it's very easy to write data, but reading is
          | a pain. For such cases, you can prepare read models that
          | contain all the necessary data.
        
           | [deleted]
        
           | noisy_boy wrote:
           | This is not specifically denormalization because I'm not
           | combining/flattening two models - instead, I'm moving the
           | special treatment for employees from a totally separate path
           | to a property of the existing customer model without
            | removing/merging the employee model. Regarding read
            | optimizations, there are several options, some vendor-
            | specific: e.g. PostgreSQL/Oracle allow materialized views,
            | which can make accessing this data simple if it's needed
            | often (though they have their own issues).
        
       | rrrrrrrrrrrryan wrote:
       | Keep your matching logic simple.
       | 
        | With SQL, matching is (by far) the most computationally expensive
        | part of most queries. Your goal should be to use only sargable[1]
        | operators, _and eliminate any functions_, in all your WHERE
        | clauses and JOINs.
       | 
        | Complicated matching criteria will force the engine to make tough
        | trade-off decisions, which can be affected by out-of-date table
        | statistics, lousy indexes, and a million other things, and will
        | often force the engine to rule out more optimized algorithms and
        | fall back to cruder, more primitive ones.
       | 
       | [1] For most queries, "OR" is not usually a sargable operator:
       | https://en.wikipedia.org/wiki/Sargable
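To illustrate, here is a small SQLite (via Python) comparison of a non-sargable predicate (a function wrapped around the column) versus a sargable range on the same column; the table and index names are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, placed_at TEXT)")
con.execute("CREATE INDEX idx_placed ON orders (placed_at)")

# Non-sargable: wrapping the column in a function hides it from the index.
scan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders "
    "WHERE date(placed_at) = '2021-03-20'"
).fetchone()[3]

# Sargable: a bare range comparison on the column can seek via the index.
seek = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders "
    "WHERE placed_at >= '2021-03-20' AND placed_at < '2021-03-21'"
).fetchone()[3]
print(scan)  # full table scan
print(seek)  # search using idx_placed
```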
        
         | setr wrote:
         | Shouldn't OR operations in an otherwise sargable[0] query just
         | constitute merging two sargable queries? I don't see why OR
         | should affect the property
         | 
         | [0] new term for me, and wow do I not like it
        
           | kroolik wrote:
            | This depends on the RDBMS engine. Even if it might make
            | sense, the planner could go with nested loops because of
            | statistics and cost estimation.
           | 
            | In the article, the devil is in how the second query explodes
            | in size. We are suddenly cross-joining employees with
            | customers. The left join of stores, employees and markouts
            | gives you more than 2 rows: it gives you all the employees
            | assigned to the store, even though only 2 have markouts.
            | 
            | Next, you join the full customers table with it, and keep on
            | multiplying the size by further left joins. The result size
            | might quickly exceed the memory limits for efficient joins,
            | and the RDBMS might have to resort to poor nested loops.
        
         | dragonwriter wrote:
         | > Your goal should be to use only sargable[1] operators, and
         | eliminate any functions, in all your WHERE clauses and JOINs.
         | 
         | Or just use indexes properly; it's fairly common for modern
         | databases to support indexes on functions of column values;
         | using an indexed (deterministic) function in WHERE/JOIN
         | criteria isn't problematic. The Wikipedia article you link
         | addresses valid concerns, but the specifics it suggests don't
         | seem accurate for the features of modern RDBMSs.
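For instance, SQLite (like Postgres) supports indexes on expressions; a sketch via Python, with invented names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")

# Index on a deterministic function of the column; a WHERE clause using
# the same expression can then use the index instead of scanning.
con.execute("CREATE INDEX idx_email_lower ON users (lower(email))")

plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM users WHERE lower(email) = 'a@b.c'"
).fetchone()[3]
print(plan)  # search using idx_email_lower
```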
        
           | jbverschoor wrote:
           | Also common for ancient RDBMSes
        
           | haimez wrote:
            | Really seems like you don't know what sargable means (and
            | didn't bother to find out before writing this reply).
        
             | epr wrote:
              | No, it's pretty clear from reading his entire response that
              | he understands exactly what sargable means. "Just use
              | indexes properly" is pretty ambiguous, but the rest of his
             | comment provides context. It doesn't make much sense to me
             | to be indexing on fn(x) for predicates unless an
             | alternative sargable predicate is too complex or not
             | possible, but that doesn't mean that it's never the best
             | solution.
        
             | fastball wrote:
             | I think it might be the other way around.
        
         | slt2021 wrote:
          | The problem with the author's slow query is that it joins 7
          | tables (forcing the SQL engine to compute the Cartesian product
          | of 7 tables); this is the primary source of the slowdown.
          | 
          | The faster query with UNION joins 4 tables and unions that with
          | a 5-table join, so it is two orders of magnitude faster,
          | because the SQL engine has to compute the Cartesian product of
          | 5 tables instead of 7.
        
           | Albert_Camus wrote:
            | Author here. Between this and your other response, where you
            | expound on the same point, I think you're being far too hand-
            | wavy about what causes performance issues. The number of
            | joins alone doesn't have much to do with the performance
            | characteristics of any query.
           | 
           | What's more important for performance of queries on larger
           | data sets than the number of joins is that there are indexes
           | and that the query is written in a way that can utilize
           | indexes to avoid multiplicative slowdowns. The reason the
           | UNION query is fast is because the query on either side
           | effectively utilizes the indexes so that the database engine
           | can limit the number of rows immediately, rather than filter
           | them out after multiplying them all together. I can expand
           | this schema to have a UNION query with two 10-table joins and
           | it would still perform better than the 7 table query.
           | 
            | I think someone new to SQL is likely to read your statement
            | and think _"okay, joins are slow, so I guess I should avoid
            | joins"_. This is not true, and this belief that joins are
            | slow leads people down the path that ends at _"SQL just
            | doesn't scale"_ and _"let's build a complicated Redis-backed
            | caching layer"_.
           | 
            | SQL performance is a complex topic. The point of our post was
            | to illustrate that a UNION query can simplify how you join
            | your tables and allow you to write constituent queries that
            | have better performance characteristics. Morphing this into
            | "the number of joins is smaller so the performance is better"
            | is just incorrect.
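A toy version of that point in SQLite via Python, loosely mirroring the article's schema (data invented): each side of the UNION ALL is a plain indexed join, so each branch can seek on its own index immediately rather than filtering an OR condition over a multiplied row set.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE meal_items (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE customer_order_items (meal_item_id INTEGER);
    CREATE TABLE employee_markouts (meal_item_id INTEGER);
    CREATE INDEX idx_coi ON customer_order_items (meal_item_id);
    CREATE INDEX idx_em ON employee_markouts (meal_item_id);
    INSERT INTO meal_items VALUES (1, 'burger'), (2, 'salad'), (3, 'soup');
    INSERT INTO customer_order_items VALUES (1), (2);
    INSERT INTO employee_markouts VALUES (3);
""")

# Two independently indexable joins, glued together with UNION ALL.
rows = con.execute("""
    SELECT meal_items.id, meal_items.name
    FROM meal_items
    JOIN customer_order_items
        ON customer_order_items.meal_item_id = meal_items.id
    UNION ALL
    SELECT meal_items.id, meal_items.name
    FROM meal_items
    JOIN employee_markouts
        ON employee_markouts.meal_item_id = meal_items.id
    ORDER BY 1
""").fetchall()
print(rows)  # [(1, 'burger'), (2, 'salad'), (3, 'soup')]
```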
        
             | slt2021 wrote:
              | I invite you to share the execution plans of queries #2 and
              | #4 with the public, so that people can decide what is
              | actually slow - whether it is really the Cartesian product
              | or something else.
        
             | slt2021 wrote:
              | Sorry if I was hand-wavy. I was trying to give other people
              | a simple framework that I use myself (Big O calculated from
              | the number of rows in each table). Although I don't know
              | how many rows you have in each table, the Big O framework
              | still works, because the Cartesian product of the tables is
              | dominated by two huge, different datasets (customers'
              | orders being joined to employees' orders). The UNION simply
              | calculates them separately, rather than taking the
              | Cartesian product of two completely different datasets that
              | represent different entities.
             | 
              | Well-written predicates and indexes can help, just as
              | poorly written predicates can make things worse. So there
              | is a balance. This is not a shortcut or a silver bullet; it
              | is a trade-off. More indexes -> faster selects and slower
              | updates/inserts. One bad index -> a failed insert and
              | possibly lost customer data (happened to me once).
             | 
             | I agree with you that "SQL performance is a complex topic."
             | and one should definitely study query Execution Plan to
             | understand the bottlenecks and make optimization decisions
        
               | aidos wrote:
                | SQL queries never really adhere to that simple Big O
                | analysis, though. Sure, if you had zero indexes, so every
                | operation was a nested sequential scan, then it's
                | Cartesian, but that's never the case. Most often you get
                | really fast index lookups, by design.
        
             | kroolik wrote:
              | I think it would be good to note in the article that the
              | second query is slow because the RDBMS doesn't use the
              | indexes (and which ones). Currently, the text hand-waves
              | the problem away and moves on.
             | 
             | If the article had, instead, listed indexes, shown they
             | were used in simple cases, shown they weren't used in the
             | second query, dug into why they weren't (maybe they were
             | but it was still hella slow) - that would be a ton of
             | value!
        
       | darig wrote:
        | It would be even faster if you switched to a series of queries
        | without any JOINs at all, and then stitched them together with
        | code other than SQL after the queries have run.
        | 
        | Getting below 15 ms would be trivial.
        
       | tech2 wrote:
       | Perhaps I'm missing something, but why is query 2's meal_items a
       | left join when the where clause then forces it to act as if it
       | were an inner join? Would changing that have any impact?
        
         | Albert_Camus wrote:
          | Author here, you are indeed correct that query #2's final join
          | can be an INNER JOIN. However, I just tested it against our
          | test data set and it has no impact on performance.
        
           | tech2 wrote:
           | Ah well, thanks for humouring me. Good thing those query
           | optimisers do the right thing on occasion.
        
       | toast0 wrote:
       | UNION is also helpful for reducing round trips between the db
       | client and db server.
       | 
        | I've run into lots of cases where a SQL join is clear to write
        | but slow, while a simple query to get ids, plus a union of
        | fetches for each id, is fast. Likewise, a union of single-id
        | fetches is often faster than a single query with IN (x, y, z).
        | We can wish the SQL engine could figure it out, but my experience
        | from a few years ago is that they usually don't.
        
         | ahmad_muzakkir wrote:
         | I'm very interested to learn what you are saying.
         | 
         | Can you give an example ?
        
         | onlyrealcuzzo wrote:
         | Does anyone with enough db experience know why the engine can't
         | figure it out? Is it easy to explain?
         | 
         | I'm not a db expert, but this seems like something that should
         | be solvable.
        
           | isoprophlex wrote:
            | Yeah agreed, at the very least the thing should be able to
            | switch to this... let's call it "WHERE unrolling"...
            | internally when asked for "WHERE id IN (x, y, z)"
           | 
           | I'd be delighted to learn what's going on
        
             | AdrianB1 wrote:
             | In the tests I've done, WHERE id IN (x,y,z) was unrolled as
             | WHERE id = x OR id = y OR id = z. It is fine if the list is
             | short, I had to find solutions when a developer in my team
             | built a dynamic query where the IN list had ~ 4000
             | elements. A JOIN solved the problem fast and easy in that
             | particular case.
        
               | paulmd wrote:
               | if you need to do very large "IN" clauses, one option can
               | be a global temporary table where you preload values, so
               | instead of doing an IN you do a join against the GTT.
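Roughly what this looks like, with SQLite's session-local TEMP table standing in for a global temporary table (schema and data are invented for illustration):

```python
import sqlite3

# Illustrative only: sqlite3 stands in for the engine, and a
# session-local TEMP table stands in for a global temporary table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, i * 10.0) for i in range(1, 5001)])

wanted = range(1, 4001)  # stands in for the ~4000-element IN list

# Instead of WHERE id IN (<4000 literals>), preload the values...
conn.execute("CREATE TEMP TABLE wanted_ids (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO wanted_ids VALUES (?)",
                 [(i,) for i in wanted])

# ...and join against the temp table.
rows = conn.execute("""
    SELECT o.id, o.total
    FROM orders o
    JOIN wanted_ids w ON w.id = o.id
""").fetchall()

assert len(rows) == 4000
```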
        
             | toast0 wrote:
             | Where unrolling is an excellent term, thanks.
             | 
              | The best I can understand is the engines are (or were) just
              | not set up to do repeated single-key lookups, which is
              | really the right strategy for a WHERE ... IN. As a result,
              | they do some kind of scan, which looks at a lot more data.
        
               | tpetry wrote:
                | The problem is these repeated single-key lookups are
                | random I/O for the database engine. So the engine has to
                | predict the threshold at which a scan or random I/O is
                | cheaper, which is very hard to get right, and your I/O
                | layer changes this threshold drastically. A spinning disk
                | may always be faster with sequential I/O, and flash-based
                | systems show a wide variety of performance profiles.
               | 
               | To tackle this problem postgresql has a setting where you
               | can tune the cost factor for random io.
        
               | arp242 wrote:
               | As an aside, this setting is almost certainly too high
               | for you by default. The default of random_page_cost=4
               | assumes that random pages are 4 times more expensive to
               | read than sequential pages, but with SSDs this cost is
               | much much lower and chances are you're using an SSD these
               | days.
               | 
               | I managed to speed up queries by about 10x by just
               | adjusting this. One Weird Trick To Speed Up Your
               | PostgreSQL And Put DBAs Out Of a Job!
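For reference, the setting in question can be changed per session or persisted; 1.1 is a commonly suggested value for SSDs, not a universal answer:

```sql
-- Default is 4.0; values near seq_page_cost (1.0) better match SSDs.
SET random_page_cost = 1.1;               -- current session only

ALTER SYSTEM SET random_page_cost = 1.1;  -- persist in postgresql.auto.conf
SELECT pg_reload_conf();
```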
        
               | guenthert wrote:
               | Thanks for the hint. I wonder where the value of 4 comes
               | from. Is this an old value derived from a (then) fast
               | RAID across many spinning rust disks? As you pointed out,
               | today --- using SSDs -- the cost should be lower; I'd
               | think in some cases (e.g. here a DB used for a catalog of
               | a back-up system) the backing store is exactly one
               | spinning rust drive, where I'd expect a _much_ higher
                | cost for random access (about 80 MB/s / (100/s * 8 KiB)).
               | 
               | Ah, [1] has the answer: they acknowledge that random
               | accesses might be much slower, but expect a good share of
               | accesses to be satisfied by the cache (that'll be a
               | combination of various caches).
               | 
               | [1] https://www.postgresql.org/docs/13/runtime-config-
               | query.html
        
             | btilly wrote:
             | That one is annoying. IN tends to be implemented as a loop.
              | I've sped many queries up by putting the IN values into a
              | temporary table, then joining against it.
             | 
              | I have actually seen it be a win to combine it like this:
              | 
              |     SELECT ...
              |     FROM foo
              |       JOIN (
              |         select 'a' as value
              |         union all
              |         select 'b' as value
              |         ...
              |       ) bar
        
           | btilly wrote:
           | _I 'm not a db expert, but this seems like something that
           | should be solvable._
           | 
           | It is very highly database specific. MySQL tends to be
           | terrible. PostgreSQL tends to be quite good. The reasons why
           | it did not figure it out in particular cases tend to be
           | complicated.
           | 
           | Except for MySQL, in which case the problem is simple. There
           | is a tradeoff between time spent optimizing a query and time
           | spent running it. Other databases assume that you prepare a
           | query and then use it many times so can spend time on
           | optimizations. MySQL has an application base of people who
           | prepare and use queries just once. Which means that they use
           | simple heuristics and can't afford the overhead of a complex
           | search for a better plan.
        
             | bawolff wrote:
              | I was under the impression that mysql was smart enough to
              | skip parts of the range that aren't needed.
             | https://dev.mysql.com/doc/refman/8.0/en/range-
             | optimization.h...
        
             | matwood wrote:
             | > It is very highly database specific.
             | 
             | I was about to comment the same thing. Every engine has its
             | quirks, and the programmer learns them over time. I went
             | from years on MSSQL to MySQL and it was a bit rough to be
             | generous. But now I know many of the MySQL quirks and it's
             | fine.
        
             | kristianp wrote:
             | "MySQL has an application base of people who prepare and
             | use queries just once". Is it really likely that mysql is
             | unique in its usage patterns? Isn't it more likely that the
             | devs have decided to keep things simple because that's the
             | historical story for mysql. For example it originally
             | didn't even support foreign key constraints if I recall
             | correctly (that's a while ago now of course).
        
               | btilly wrote:
               | Yeah, I'm pretty sure that it is unique to MySQL and
               | SQLite.
               | 
               | Back in the 90s and early 2000s, MySQL told people to
               | just run queries. PHP encouraged the same.
               | 
               | Every other database was competing on ability to run
               | complex queries. And so it became common knowledge that
               | applications written for MySQL didn't port well to, say,
               | Oracle. Exactly because of this issue.
               | 
               | Those applications and more like them are out there. And
               | mostly expect to run on MySQL. So that use case has to be
               | supported.
        
               | ComputerGuru wrote:
               | I agree on MySQL but hard disagree on SQLite because the
               | API makes it rather obvious very quickly to any developer
               | that there is a benefit to preparing beforehand.
        
               | dragonwriter wrote:
               | > "MySQL has an application base of people who prepare
               | and use queries just once". Is it really likely that
               | mysql is unique in its usage patterns?
               | 
               | It's quite plausible that it is distinct in the usage
               | patterns it has chosen to optimize for, and that that
               | design choice has been instrumental in determining what
               | use cases it became popular for and which use cases led
               | people to migrate off of it for a different engine, yes.
               | 
               | To the extent that is at play here it is somewhat of an
               | oversimplification (even if the target audience was a
                | factor in the design decision) to describe that as simple
               | unidirectional causation from usage to design, and
               | possibly more accurate (though still oversimplified) the
               | other way around.
        
           | bawolff wrote:
           | I imagine its bad query optimizer doing the wrong thing.
           | Maybe a different db engine would do better.
        
             | slt2021 wrote:
              | I don't like it when people who cannot write efficient
             | SQL/design efficient database schema start blaming "bad"
             | query optimizer.
             | 
             | pro SQL developers extremely rarely have a problem with
             | query optimizer, and when they have it is extremely exotic.
             | 
             | it is usually ORM users that have a problem with SQL
             | performance and blame the optimizer
        
               | matwood wrote:
               | I agree they shouldn't start there, but you don't
               | necessarily need exotic queries to run into edge cases
               | that work differently across say MySQL, MSSQL, and
               | Oracle. The classic example being exists vs left join
               | null.
        
               | bawolff wrote:
               | The comment i'm responding to is asking why two
               | equivalent queries have different runtimes. The answer
               | pretty obviously is because the optimizer is smart enough
                | to choose the best query plan in one case and not the
                | other. I think it's fairly obvious that in theory a better
                | optimizer could figure out the right plan in both cases.
                | 
                | That's not to say that it's impossible to work around or
               | that in a real application that you would ever be "stuck"
               | by this. At the end of the day you deal with the software
               | you have, but the optimizer can still have weaknesses
               | without it totally derailing your app.
        
               | gfody wrote:
               | the optimizer is constrained, it still has to give you
               | what you asked for. i can't even translate into english
               | what's being asked for in query #2 but it's only a
               | coincidence that it has the same result as the more
               | precise and correct thing being asked for in the final
               | query.
        
               | lixtra wrote:
               | There are two possible cases:
               | 
               | a) with everything known to the database (i.e. ids not
               | null) the two queries produce the same result.
               | 
               | b) for some contents of the schema the results differ,
               | but in the presented case some additional condition (not
               | known to the db) holds that makes them equivalent.
               | 
               | Only in case b) it can be called coincidence. In case a)
               | it's fair to ask if the optimizer can't do better. After
               | all it tries to avoid cartesian products for simple joins
               | even though that's the text book definition.
               | 
               | "Not worth" to improve the optimizer is still a valid
               | answer.
        
               | gfody wrote:
               | the query in question is definitely case b - it's a group
               | by query with an explicit granularity, there's just no
               | actual aggregates being asked for but the very presence
               | or absence of a row in the results carries some meaning.
               | the explicit grain is something like employee markouts
               | plus customers from stores with at least one employee
               | without markouts.
        
           | sixdimensional wrote:
           | There could be a lot of reasons that are highly engine
           | dependent, for this specific case.
           | 
           | A general answer, perhaps... not specific to the case you've
           | specified.
           | 
           | Query optimization is a science with multiple dimensions [1].
           | I'd wager every problem in computer science plays a role
           | somehow in query optimization.
           | 
           | Query performance is based on a combination of actions you
           | take to optimize the design of your system to get the best
           | performance (e.g. data modelling, index design, hardware,
           | query style, and more), and the patterns the system can
           | recognize based on your inputs and the data itself, with the
           | resources it has available, to optimize your queries.
           | 
           | There are known patterns for optimization that are discovered
           | over the years, many hard learned from practical experience.
           | This is why older "popular" engines sometimes are more mature
           | and more performant - they have optimizations built for the
           | common use cases over long periods of time. That is not to
           | say older engines are always better, just that they have
           | often had more exposure to the variety of problems that
           | occur.
           | 
           | The reason why the engine "can't figure it out" is that most
           | engines, even the best ones, are quite complex - combinations
           | of known rules as well as more fuzzy logic, where the engine
           | uses a combination of information and heuristics to
           | essentially explore a possible solution space, to try to find
           | the optimal execution plan. Making the right decision, well,
           | can be difficult and given the nature of these things,
           | sometimes the optimizer makes the wrong decision (this is why
           | "hints" exist, sometimes, you can force the optimizer to do
           | what you see is obvious - but this is suboptimal for you).
           | 
           | In some cases, finding an optimal execution plan can actually
           | be quite computationally expensive, and/or quite time
           | consuming, or the engine in question may simply have no logic
           | coded to handle the case. Optimization is all about finding
           | the balance between finding the most performant query plan,
           | but in the least amount of time, with the least computational
           | and I/O impact to the overall system, that returns the right
           | result. Optimizers are also highly depending on the
           | capabilities of the engineering teams that build them.
           | 
           | It is not an easy problem, and it is an area which one could
           | liken to almost machine learning/artificial intelligence, in
           | one way. There are so many possible options, the problem
           | space so big, with so many different ways to approach a given
           | scenario, that it can be difficult for the "engine" to
           | decide.
           | 
           | This is why known patterns were created, for example,
           | dimensional data models for analytical queries vs. 3rd normal
           | form. Dimensional data models enable certain optimizations,
           | for example, star schemas [2]. If you take a combination of
           | implementing known patterns, along with optimizers written by
           | engineers that exploit those patterns, you can get to a world
           | of better performance.
           | 
           | However, in a world that is, let's say.. more "open ended" -
           | for example, the world of data in a "data lake", where data
           | models are not optimized, data comes in unpredictable
           | multiple shapes/sizes, then it often comes down to
           | combinations of elegant/complex engines that can interpret
           | the shapes of data, cardinality, and other characteristics,
           | make use of much larger distributed compute and system
           | performance, and in some cases - often brute force to arrive
           | at the best query plan or performance possible.
           | 
           | There are so many levels of optimization.. for example, if
           | you were to look at things like Trino [3], which started its
           | genesis as PrestoDb in Facebook - you will see special CPU
           | optimizations (e.g. SIMD instructions), vectorized/pipelined
           | operations - there are storage engine optimizations, memory
           | optimizations, etc. etc. It truly is a complex and
           | fascinating problem.
           | 
           | Source: I was a technical product manager for a federated
           | query engine.
           | 
           | [1] https://en.wikipedia.org/wiki/Query_optimization
           | 
           | [2] https://en.wikipedia.org/wiki/Star_schema
           | 
           | [3] https://trino.io/
        
         | Daynil wrote:
         | I often read things that say that if you just write the query
         | properly, you can trust the query optimizer to select the right
         | plan. Just trust the optimizer! Unfortunately, in my
         | experience, the more complex your query gets, the more
         | opportunities the optimizer has to get it terribly wrong. I've
         | learned to accept that, if the "clear/clean" version of the SQL
         | (that the human likes to read and write) does not produce
         | acceptable results, you just have to drop more explicit hints
         | to the optimizer about what you want it to do. Query optimizers
         | are truly awesome pieces of software, you just have to learn to
         | work around their limitations when you hit them.
        
       | slt2021 wrote:
        | This is the ABC of SQL: every SQL developer has to learn this
        | lesson sooner or later: every additional JOIN you make creates a
        | cartesian product of all previously joined records and the number
        | of rows grows exponentially => you join tables A to B to C to D
        | -> the SQL engine has to create AxBxCxD rows (each letter
        | representing the number of rows in that table)
       | 
       | Of course specifying useful predicates that filter down dataset
       | and having indexes on these columns helps avoid row explosion,
       | but you can't have an index on each and every column, especially
       | on OLTP database.
       | 
        | in example #2 the author joined 7 tables (multiply the row counts
        | of the 7 tables and this gives you an idea of how many rows the
        | SQL engine has to churn through) - the big O complexity of the
        | query is O(AxBxCxDxExFxG), of course it will be SLOW.
       | 
       | in example #4 he joins 4 tables and unions with 5 table join, so
       | the big O complexity of query is AxBxCxD(1+E).
       | 
        | same as in programming, there is Big O complexity in your SQL
        | queries, so it helps to know the O() complexity of the queries
        | that you write
        | 
        | TLDR: learn the SQL Big-O and stop complaining about the query
        | planner, pls
        
         | ineedasername wrote:
         | I don't understand what you mean by saying each join creates a
         | cartesian product. Join criteria avoid this. If I left outer
         | join 7 rows on table A and 5 on table B and join on a unique
         | key then I still only get 7 rows, not the cartesian product of
         | 7x5 rows. I query across a dozen or more tables on a regular
         | basis with aggregates, window functions, with statements,
         | having clauses and so on. Performance is rarely a problem.
        
         | jaredsohn wrote:
         | I've found denormalization to be helpful for such performance
         | issues in that it reduces the number of joins. It can also be
         | useful for debugging if you want to keep track of values from
         | other tables at the time that data was processed within the
         | primary table.
        
           | epr wrote:
           | denormalization should only be seen as a last resort
        
           | slt2021 wrote:
           | denormalization aka memoization aka "dynamic programming of
           | SQL" :-) - all three provide speed-up because they compute
           | value once and reuse it later
        
             | mtone wrote:
             | Except that your source of truth aka your raw data is now
             | memoized, hardly a win. This is not a goal of memoization
             | or dynamic programming at all.
             | 
             | Perfect for a reporting database though.
        
           | monoideism wrote:
           | A certain amount of denormalization can be helpful, but often
           | there's a trade-off in terms of maintainability.
        
             | rovr138 wrote:
             | Not who you replied to.
             | 
             | But yes,
             | 
             | As an example, we keep our data normalized and add extra
             | tables to do the operations we need. It's expensive since
             | we have to precompute, etc. But then on lookups it's
             | quicker.
             | 
             | Like everything, it depends on what you want to optimize
             | for.
        
               | jaredsohn wrote:
               | Yep, in my case we wanted to make user-facing queries
               | fast (i.e. a reports page). The extra complexity (not
               | much; we just have to remember when to update certain
               | values in other tables when they change) is definitely
               | worth it since load time dropped from 30 seconds
               | sometimes to almost instant.
               | 
               | Denormalizing everything is definitely a pain; keeping
               | data in sync in mongodb (which lacks traditional joins)
               | on a previous project was not fun; now using postgresql.
        
               | [deleted]
        
               | [deleted]
        
         | ericHosick wrote:
         | > Every additional JOIN you make creates a cartesian product of
         | all previously joined records and the number of rows grows
         | exponentially => you join tables A to B to C to D -> SQL engine
         | has to create AxBxCxD number rows, each letter representing
         | number of rows in that table)
         | 
          | I think a cartesian product would be:
          | 
          |     SELECT * FROM A, B;
          | 
          | Given sizes a, b the resulting number of rows would be a x b,
          | or O(N^2).
          | 
          | With a join:
          | 
          |     SELECT * FROM A INNER JOIN B ON A.id = B.id;
          | 
          | the resulting rows would be MAX(a, b), or O(N).
          | 
          | A JOIN is a predicate that filters down the dataset.
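The contrast in row counts can be checked directly (toy tables, sqlite3 for convenience):

```python
import sqlite3

# Toy tables to check the row counts being contrasted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE b (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO a VALUES (?)",
                 [(i,) for i in range(1, 6)])  # 5 rows
conn.executemany("INSERT INTO b VALUES (?)",
                 [(i,) for i in range(1, 4)])  # 3 rows

# Cartesian product: every pairing, 5 x 3 = 15 rows.
cross = conn.execute("SELECT * FROM a, b").fetchall()

# Equi-join on the key: the predicate filters the pairings down to 3 rows.
joined = conn.execute("SELECT * FROM a JOIN b ON a.id = b.id").fetchall()

assert len(cross) == 15
assert len(joined) == 3
```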
        
           | slt2021 wrote:
           | all queries with JOINs can be rewritten as cartesians: select
            | * from A join B on col1=col2 will be rewritten by the query
           | optimizer into:
           | 
           | select * from A, B /* this is cartesian that slows down your
           | query */ WHERE a.col1=B.col2
           | 
           | in fact both queries produce the same execution plan if you
           | check yourself
        
             | whimsicalism wrote:
             | ? not how it works behind the scenes at all
        
             | ineedasername wrote:
             | All queries could technically be expressed as a cartesian
             | product but that is not necessary and not what happens in
             | practice. Both of the above might produce the same plan
             | because they are both treated as joins. One is expressed as
             | an explicit join, the other as an implicit join, but
             | neither requires the query engine to produce a cartesian
             | product on execution. If it did, queries I run in seconds
             | would require more storage or RAM than probably exists in
             | my entire workplace, and I'm not using anything that would
             | usually be considered "big data".
        
               | slt2021 wrote:
               | >>queries I run in seconds would require more storage or
               | RAM than probably exists in my city
               | 
                | my explanation for this: CPUs are very fast. My CPU is
               | 4Ghz so a single core can do a lot of computations in one
               | second, and a SQL engine is smart enough to make
               | cartesian computation (as part of a query plan) and
               | discard the result if row does not meet predicate
               | condition.
               | 
               | in fact I agree that not entire cartesian is being
               | computed, if you specify enough predicates. But the query
               | still multiplies rows. In the author's article he is
                | joining employees when the customer_id column is NULL, so
               | this is technically a cartesian, because NULL predicate
               | is not very selective (=there are a lot of rows with
               | value of NULL)
        
               | whimsicalism wrote:
               | Cartesian product means a very specific thing - testing
               | the predicates on every possible combination. This is not
               | what occurs in practice
        
               | ineedasername wrote:
               | You are not using the term cartesian product correctly as
               | applied to how database engines execute queries. Multiple
               | people have detailed the problems with your ideas on how
               | such things work. A fast processor cannot overcome the
                | need for multiple terabytes of RAM that would be
               | required to process common queries if databases worked as
               | you describe. Databases are significantly more likely to
               | get bottlenecked by RAM than CPU, and your incorrect
               | understanding of how databases work would require as much
               | RAM as a large "big data" cluster to run queries on
               | moderately sized data sets. Then even if it had that RAM,
               | the bandwidth of the system's bus channels would be
               | incapable of transporting it fast enough to return
               | results in 500ms. Certainly not on my system with a few
               | dozen other people working at the same time, and
               | especially not when I have to query the live production
               | system with a few thousand end users instead of the data
               | warehouse. Databases do not work this way.
        
               | slt2021 wrote:
               | multiple people can be wrong, so it is not a valid
                | argument. I can write 10 nested loops and overwrite the
                | value of a single cpu register with a single value so many
               | times, that it will exceed any big data cluster capacity.
               | This is what I meant by CPUs are fast.
               | 
               | if you study engine internals, especially MergeJoin,
                | HashJoin, NestedLoopJoin - they all do compute a cartesian
               | while simultaneously applying predicates. Some operations
               | are faster because of sorting and hashing, but they still
               | do cartesian.
        
               | zeroimpl wrote:
               | Umm.. virtually no join on tables with more than a
               | handful of rows is done as a cartesian product. A
               | suitable set of columns is used to sort or hash each side
               | of the join, such that the actual join logic is basically
               | linear performance (each row in table A is probably only
               | ever paired with approximately 1 row in table B when
               | evaluating the predicates). A cartesian product would
               | involve testing the predicates on every combination of A
               | and B (each row in table A being paired with each row in
               | table B)
               | 
               | (Note when I say linear performance, I mean the join
               | predicate is executed a linear amount of times, but the
               | initial sort/hash operation was probably more like O(n
               | log n))
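A toy version of the build/probe strategy being described (just the shape of it, not how any real engine is implemented):

```python
# A toy hash join: build a hash table on one side, then probe it once
# per row of the other side, so the join predicate runs a linear number
# of times instead of |A| x |B| times.
def hash_join(left, right, left_key, right_key):
    # Build phase: index one side by its join key.
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    # Probe phase: one hash lookup per row of the other side.
    out = []
    for row in left:
        for match in index.get(row[left_key], []):
            out.append({**row, **match})
    return out

# Invented sample data for illustration.
orders = [{"order_id": 1, "cust": 10}, {"order_id": 2, "cust": 11}]
customers = [{"cust": 10, "name": "ada"}, {"cust": 11, "name": "bob"}]
rows = hash_join(orders, customers, "cust", "cust")
assert [r["name"] for r in rows] == ["ada", "bob"]
```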
        
             | [deleted]
        
             | gfody wrote:
             | I think the point is rather that a join with a condition
             | (which is the norm) is almost never actually executed as a
              | cartesian product. take for example from tfa:
              | 
              |     select * from customer_order_items
              | 
              | the grain is one row per customer order item, and the next
              | two joins don't change that
              | 
              |     select * from customer_order_items
              |     join customer_orders on order_id
              | 
              | the grain is still one row per customer order item
              | 
              |     select * from customer_order_items
              |     join customer_orders on order_id
              |     join customers on customer_id
              | 
              | the grain is still one row per customer order item
             | 
             | ..etc..
             | 
             | of course later on they screw it up and take the cartesian
              | product of customer_order_items by employee_markouts, but
             | it's just 2 big factors not 7 - their query did finish
             | after a few seconds. usually mistakes involving cartesian
             | products with 7 factors just run for hours and eventually
             | throw out of memory.
        
         | gfody wrote:
         | > you join tables A to B to C to D -> SQL engine has to create
         | AxBxCxD number rows
         | 
         | this isn't really true. not every join is a cross join, and not
         | every join even changes the grain of the result set. it is
         | important to pay attention when your joins do change the grain
         | of the result set, especially when they grow it by some factor
         | - then it's more like a product as you say.
        
           | slt2021 wrote:
           | all queries with JOINs can be rewritten as cartesians: select
            | * from A join B on col1=col2 will be rewritten by the query
           | optimizer into: select * from A, B /* this is cartesian that
           | slows down your query */ WHERE a.col1=B.col2
           | 
           | in fact both queries produce the same execution plan if you
           | check yourself
        
             | gfody wrote:
             | the query is not the only input into the plan generator,
             | the other inputs are the schema (with constraints and
             | indexes) and their statistics - so you can't really say
             | "check the plan for yourself" without all the inputs. and
             | the only time the plan for "select * from A, B" is going to
             | be the same as "select * from A, B where A.col1=B.col2" is
             | when col1 and col2 are both neither primary keys nor
             | foreign keys and that would be a weird condition under
             | which to run such a query.
        
               | slt2021 wrote:
               | pls reread my post, I wrote that query select * from A
                | join B on col1=col2 gives the same plan as for select *
               | from a, b where col1=col2
               | 
               | this is true regardless of schema and statistics, it is
               | simply how query planner works under the hood
               | 
               | and you can check it yourself on any database
        
               | iaabtpbtpnn wrote:
               | Those are both inner joins, just a different syntax for
               | writing them.
        
               | ohlookabird wrote:
               | They might be logically equivalent, but they are not
               | identical if more than two tables are joined. For
               | instance, the current Postgres docs say [1]:
               | 
               | > This latter equivalence does not hold exactly when more
               | than two tables appear, because JOIN binds more tightly
               | than comma. For example FROM T1 CROSS JOIN T2 INNER JOIN
               | T3 ON condition is not the same as FROM T1, T2 INNER JOIN
               | T3 ON condition because the condition can reference T1 in
               | the first case but not the second.
               | 
               | [1] https://www.postgresql.org/docs/current/queries-
               | table-expres...
        
         | tehlike wrote:
         | Something that helps with exploding cartesian problem is using
         | json_agg for the joined tables. IT's available in postgres, it
         | works great.
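
For instance, a sketch with hypothetical orders/order_items tables — instead of one output row per (order, item) pair, each order's items collapse into a JSON array:

```sql
-- One row per order; joined rows are aggregated into a JSON array,
-- so the join no longer multiplies the grain of the result set.
SELECT o.id, json_agg(oi) AS items
FROM orders o
JOIN order_items oi ON oi.order_id = o.id
GROUP BY o.id;
```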
        
         | meritt wrote:
         | Your understanding is flawed, these aren't cross joins.
        
       | boulos wrote:
        | Huh. Like other commenters, I'm surprised to see a multi-part
        | JOIN be the answer for "I want the results of Query A as well
        | as the results of Query B".
       | 
       | If anything, I think many folks would say "whatever, make the two
       | requests in parallel and zip them together in the application".
       | If not, WITH expressions to make the two independent queries and
       | then a final SELECT to merge them however you prefer would be
       | reasonable (whether that's a join flavor or
       | UNION/INTERSECT/etc.), but the super explosion JOIN just doesn't
       | cross my mind.
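
The CTE shape would be something like this sketch (table names from the article's schema; filters and column lists elided for brevity):

```sql
-- Two independent sub-queries, then one final SELECT to merge them.
WITH employee_items AS (
  SELECT meal_item_id FROM employee_markouts   -- store/date filters elided
),
customer_items AS (
  SELECT meal_item_id FROM customer_order_items  -- store/date filters elided
)
SELECT meal_item_id FROM employee_items
UNION ALL
SELECT meal_item_id FROM customer_items;
```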
       | 
       | I think there must be a different example that they tried to
       | anonymize for this post, otherwise they don't need the IDs (and
       | probably want a sum of quantity). If the author is around, what
       | was the more realistic example behind:
       | 
       | "In order to audit inventory, the logistics team at the corporate
       | headquarters requests a tool that can generate a report
       | containing all meal_items that left a given store's inventory on
       | a particular day. This requires a query that includes items that
       | were both sold to customers as well as recorded as employee
       | markouts for the specified store on the specified day."?
        
         | IgorPartola wrote:
          | Where this fails (and I don't see the article addressing this
          | concern) is if you want to do pagination. Say you are
          | retrieving page 17 of 8000. Will the UNION still be performant
          | then? After all, it will require a stable sort order that may
          | depend on one of the branches. You certainly can't paginate
          | without doing a whole lot of logic in your application if you
          | use two separate queries. My typical solution to this is to do
          | one query for employee IDs, another for customer IDs, and then
          | use an IN clause. Or you could use an intermediate ID of some
          | kind that collapses your diamond branch into a single IN.
        
           | catmanjan wrote:
           | In my experience this is the number one driver for infinite
           | scroll, dropping paging simplifies a lot
        
             | sahila wrote:
              | Having infinite scroll doesn't avoid paging results from
              | a query. Like the OP mentioned, if you have 8000 pages'
              | worth of records, you may only want to retrieve 50
              | records at a time to pass back to the client. In that
              | case, you still need a way to query for the next set /
              | page. Infinite scroll is a UX implementation detail.
        
               | catmanjan wrote:
               | Yes - very true, however when we used paging we found
               | users were going "I know my results are on page 19, skip
               | to there" but when they were faced with the task of
               | scrolling to page 19, their behaviour changed and they
               | used filtering
        
               | ajmurmann wrote:
               | Did that make the user experience worse or better?
        
               | catmanjan wrote:
               | Haha no comment
        
               | OJFord wrote:
               | > "I know my results are on page 19, skip to there"
               | 
               | What on Earth is your application where that's the case?
               | ..A forum maybe? (Revisiting a thread looking for a
               | particular reply I mean.)
        
               | vletal wrote:
               | Have you ever used pornhub?
        
               | [deleted]
        
               | catmanjan wrote:
               | Document registry, but I've heard users doing this in
               | just about every application that allows it - just doing
               | a broad search and memorising the page that the thing
               | they actually want is on
        
               | rini17 wrote:
               | If users often search for known things then why not
               | optimize the engine/UI to the use case instead of
               | removing the use case?
               | 
               | Like, remembering the association <search term - selected
               | results> and prominently display these?
        
               | catmanjan wrote:
               | Search results were boosted based on usage, but the
               | performance benefits of infinite scroll outweighed the
               | use case
        
       | Black101 wrote:
        | speeding up SQL queries isn't hard, but it takes time and it's
        | worth it.
        
       | mongol wrote:
       | What is the best book to learn performant SQL querying? I
       | recently needed to query a large database for exploratory
       | purposes. The database has historical data about many years of
       | sales transactions of a huge company. It is using partitioning
       | keys. I know pretty well how to write queries to get what I want
       | but given the size of the database I need to write them
       | performantly. How can I learn that best?
        
         | sqldba wrote:
         | SQL Performance Explained, hands-down https://sql-performance-
         | explained.com/
         | 
         | Covers all the main DBMS, explained very simply from the bottom
         | up, maintained constantly and has been read and recommended by
         | people for years and years.
        
       | croes wrote:
       | No need to wrap the union queries in a new select to order the
       | result. An order by after the last select in the unions orders
       | the whole result set as long as the column names match in each
       | part of the union.
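
For example (table names borrowed from the article's schema; columns illustrative):

```sql
SELECT id, created FROM employee_markouts
UNION ALL
SELECT id, created FROM customer_orders
ORDER BY created;  -- sorts the combined result, not just the last SELECT
```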
        
         | AdrianB1 wrote:
         | That works in MS SQL, not guaranteed in others.
        
           | croes wrote:
           | Works in mysql, postgres, oracle, MS, DB2
        
         | rrrrrrrrrrrryan wrote:
         | This depends on the database engine. The syntax will error out
         | in some (perhaps most).
        
           | croes wrote:
           | Works in mysql, postgres, oracle, MS, DB2. Can you name an
           | engine where it fails?
        
       | balfirevic wrote:
        | I'm now curious (but too lazy to set up and test it) how the
        | following query would fare:
        | 
        |     SELECT meal_items.*, employee_markouts.employee_id,
        |            customer_orders.customer_id
        |     FROM meal_items
        |     LEFT JOIN employee_markouts
        |       ON employee_markouts.meal_item_id = meal_items.id
        |     LEFT JOIN employees
        |       ON employees.id = employee_markouts.employee_id
        |     LEFT JOIN customer_order_items
        |       ON customer_order_items.meal_item_id = meal_items.id
        |     LEFT JOIN customer_orders
        |       ON customer_orders.id = customer_order_items.customer_order_id
        |     LEFT JOIN customers
        |       ON customers.id = customer_orders.customer_id
        |     WHERE (employees.store_id = 250
        |            AND employee_markouts.created >= '2021-02-03'
        |            AND employee_markouts.created < '2021-02-04')
        |        OR (customers.store_id = 250
        |            AND customer_orders.created >= '2021-02-03'
        |            AND customer_orders.created < '2021-02-04')
        
         | Albert_Camus wrote:
         | Author here. This doesn't give the correct results. It produces
         | meal_items that have both customer_id and employee_id. Here's
         | an excerpt (the full result set is thousands of rows, as
         | opposed to the expected 45):                   id |label
         | |price|employee_id|customer_id|
         | ---|------------|-----|-----------|-----------|         ...
         | 344|Poke        | 4.18|       3772|      13204|
         | 344|Poke        | 4.18|       3313|      13204|
         | 344|Poke        | 4.18|       2320|      13204|
         | 344|Poke        | 4.18|        632|      13204|
         | 344|Poke        | 4.18|       4264|      13204|
         | 344|Poke        | 4.18|        699|      13204|
         | 344|Poke        | 4.18|       1070|      13204|
         | 344|Poke        | 4.18|       3022|      13204|
         | 344|Poke        | 4.18|       1501|      13204|
         | 344|Poke        | 4.18|        808|      13204|
         | 344|Poke        | 4.18|       2793|      13204|
         | 344|Poke        | 4.18|       1660|      13204|
         | 344|Poke        | 4.18|        932|      13204|         ...
         | 
          | To be clear, there are ways to write this query without UNION
          | that have both good performance and give the correct results,
          | but they're very fiddly and harder to reason about than just
          | writing the two comparatively simple queries and then mashing
          | the results together.
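
For readers following along, the UNION ALL shape being discussed is roughly the following — reconstructed from the schema visible in the sibling comments, not copied verbatim from the article:

```sql
-- Each branch is a simple, independently-optimizable query; UNION ALL
-- glues the two result sets together without a cross-branch join.
SELECT meal_items.*, employee_markouts.employee_id, NULL AS customer_id
FROM meal_items
JOIN employee_markouts ON employee_markouts.meal_item_id = meal_items.id
JOIN employees ON employees.id = employee_markouts.employee_id
WHERE employees.store_id = 250
  AND employee_markouts.created >= '2021-02-03'
  AND employee_markouts.created < '2021-02-04'
UNION ALL
SELECT meal_items.*, NULL AS employee_id, customer_orders.customer_id
FROM meal_items
JOIN customer_order_items ON customer_order_items.meal_item_id = meal_items.id
JOIN customer_orders ON customer_orders.id = customer_order_items.customer_order_id
JOIN customers ON customers.id = customer_orders.customer_id
WHERE customers.store_id = 250
  AND customer_orders.created >= '2021-02-03'
  AND customer_orders.created < '2021-02-04';
```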
        
           | balfirevic wrote:
           | > Author here. This doesn't give the correct results.
           | 
           | Right you are! :) Should have given it more thought before
           | posting. Thanks for jumping in.
        
         | gfody wrote:
          | this is probably how I'd write it (assuming pg properly
          | pushes the predicates down to the subquery in the lateral
          | join):
          | 
          |     select mi.*, x.employee_id, x.customer_id
          |     from meal_items mi
          |     join lateral (
          |       select e.store_id, em.created, em.employee_id,
          |              null as customer_id
          |       from employee_markouts em
          |       join employees e on e.id = em.employee_id
          |       where em.meal_item_id = mi.id
          |       union all
          |       select c.store_id, co.created, null as employee_id,
          |              co.customer_id
          |       from customer_order_items coi
          |       join customer_orders co on co.id = coi.customer_order_id
          |       join customers c on c.id = co.customer_id
          |       where coi.meal_item_id = mi.id
          |     ) x on true
          |     where x.store_id = 250
          |       and x.created >= '2021-02-03'
          |       and x.created < '2021-02-04'
          |     order by mi.id
        
         | knuthsat wrote:
         | OR really messes things up. UNION instead makes things fast.
        
           | balfirevic wrote:
           | Oh, I wouldn't presume that it will be faster than the union
           | but I wonder how it compares to the original query in the
           | article (the one with OR).
        
       | gfody wrote:
       | > Retrieving both employee and customer meal items
       | 
       | When you're getting two separate sets of things in one query you
       | should expect to use union. This isn't so much a performance
       | optimization as just expressing the query properly.
       | 
       | Imagine starting with just this weird SQL that gets both employee
       | and customer meal items using only joins, and then trying to
       | infer the original intent.
        
       | [deleted]
        
       | clcaev wrote:
       | The author doesn't describe why they thought a cross product of
       | two large tables, using a group by to remove duplicates, would be
       | a good basis for this query.
        
         | gfody wrote:
         | The author(s?) almost certainly have no idea what they're
         | talking about. Wtf are "diamond schemas" and "multi-branch
         | joins"? This blog is the sole google result for that made up
         | nonsense that's supposedly there to imply that it's somehow
         | normal to put an OR expression with columns from two different
         | tables into a join clause. They're using select * in a group by
         | query with no aggregates. And they're talking about keywords as
         | operations as if their SQL is directly executed and doing stuff
         | rather than merely being the input for a plan generator (this
         | despite the first paragraph - maybe one of them knows better
         | than the other?).
         | 
         | They say that query #2 is "the intuitive way of thinking
         | through the problem, and something that someone with
         | intermediate or advanced SQL skills could come up with" and is
         | slow because it "eliminated the optimizer's ability to create
         | an efficient query plan" - and they're wrong on both counts.
         | That query is a nightmare no advanced SQL expert would write,
         | and the plan that was generated was probably perfectly optimal
         | to give them what they asked for. They just don't realize they
         | asked for ridiculous nonsense that accidentally matches their
         | expected results.
        
           | idoubtit wrote:
           | > The author(s?) almost certainly have no idea what they're
           | talking about.
           | 
           | > They're using select * in a group by query with no
           | aggregates.
           | 
           | The author wrote something akin to
           | 
            | SELECT A.*, B.x FROM A JOIN B GROUP BY A.id, B.x
            | 
            | This is perfectly valid SQL, as per the SQL standard: "Each
            | <column reference> in the <group by clause> shall
            | unambiguously reference a column of T. A column referenced
            | in a <group by clause> is a grouping column". From the
            | SQL-92 reference, section 7.7,
            | http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt
            | 
            | When using offensive language, you'd better be sure of the
            | technical quality of what you write if you don't want to
            | paint an ugly portrait of yourself.
        
             | matwood wrote:
              | A.* assumes every column in A is functionally dependent
              | on A.id. This may be the case, but it is a huge source of
              | bugs I have seen related to MySQL, and it would set my
              | spidey sense off if I saw it in code. The problem is that
              | older versions of MySQL allowed non-functionally-dependent
              | columns in the select list and would put an arbitrary one
              | of the values in the result set (every other sane database
              | at the time properly errored out). At this point, I would
              | consider it bad practice to run MySQL in anything other
              | than ONLY_FULL_GROUP_BY mode.
             | 
             | See here: https://dev.mysql.com/doc/refman/5.7/en/group-by-
             | handling.ht...
             | 
             | And here: https://www.percona.com/blog/2019/05/13/solve-
             | query-failures...
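
Enabling the strict mode looks like this (MySQL; a sketch that appends to whatever modes are already set for the session):

```sql
-- Reject non-functionally-dependent select-list columns in GROUP BY
-- queries instead of silently picking an arbitrary value:
SET SESSION sql_mode = CONCAT(@@sql_mode, ',ONLY_FULL_GROUP_BY');
```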
        
             | gfody wrote:
             | > This is perfectly valid SQL
             | 
                | no, it's not. the query is unambiguously invalid under
                | SQL-92; it only runs because postgres implements the
                | optional SQL:1999 feature T301 (functional
                | dependencies).
             | 
             | > When using offensive language
             | 
             | fwiw I regret my tone but this article deserves to be
             | criticized. I love and respect rdbms technology and it's
             | exciting to see anything "SQL" in a headline on HN - but
             | then it bums me out to see bogus terms tossed around and
             | insane queries presented as normal. if this were some
             | beginner blog writing about lessons learned that's one
             | thing, but this is a professional consultancy firm writing
             | with the air of "you can trust us, we're experts".
        
       ___________________________________________________________________
       (page generated 2021-03-21 23:03 UTC)