[HN Gopher] Best practices for writing SQL queries
___________________________________________________________________
Best practices for writing SQL queries
Author : ReginaDeiPirati
Score : 244 points
Date : 2021-04-10 15:33 UTC (7 hours ago)
(HTM) web link (www.metabase.com)
(TXT) w3m dump (www.metabase.com)
| macando wrote:
| This article nudge me to google "SQL optimization tool". I found
| one that says: "Predict performance bottlenecks and optimize SQL
| queries, using AI". Basically, it gives you suggestions on how to
| improve your queries.
|
| I wonder what the results would be if I ran the queries from this
| article through that tool.
| SigmundA wrote:
| >LIKE compares characters, and can be paired with wildcard
| operators like %, whereas the = >operator compares strings and
| numbers for exact matches. The = can take advantage of indexed
| columns.
|
| Unless this specific to certain databases, LIKE can take
| advantage of indexes too, without wildcards LIKE should be nearly
| identical in performance to = both seeking the index.
|
| >Using wildcards for searching can be expensive. Prefer adding
| wildcards to the end of strings. Prefixing a string with a
| wildcard can lead to a full table scan.
|
| Which is contradictory to the first quote, it seems you recognize
| that a wildcard at the end can take advantage of an index. Full
| table scan is the same thing as not taking advantage of an index,
| hence LIKE can take advantage of normal indexes so long as there
| are characters before the first wildcard or has no wildcards.
| Hjfrf wrote:
| LIKE 'abc%' will use indexes but LIKE '%abc' will not.
|
| At least for the latest versions of every database. If you go
| back to a version from 10+ years ago there's no guarantees.
| SigmundA wrote:
| Pedantically if your database supports index scans it can use
| the index on the column to scan for '%abc' rather than the
| whole table which can be much faster while not as a fast as a
| seek.
|
| It can only do a seek if there are character before the
| wildcard: 'ab%c', 'abc%' and 'abc' getting progressively
| faster due to less index entries transversed.
| tomnipotent wrote:
| > it can use the index on the column to scan for '%abc'
|
| Using an index would just mean more overhead to fetch data
| later, so optimizers will prioritize a table scan in these
| cases since it would have less cost.
| deathanatos wrote:
| I think it would depend, wouldn't it? If the query can be
| answered directly from an index (there exists some index
| containing all of the columns required by the query) then
| an index scan would suffice and be faster by virtue of
| not having to scan _all_ the data (the index would be
| smaller by not including all columns). I believe most
| modern DB query optimizers are capable of this.
|
| If there _isn 't_ such an index, then it's a toss up:
| yes, going to the main table to fetch a row has a cost,
| but if there are only a few rows answered by the query,
| then it might be worth it. If there are many rows, that
| indirection will probably outweigh the benefit of the
| index scan & we'd be better off with a table scan. This
| would require an optimizer to estimate the number of rows
| the query would find. I don't know if modern DB query
| optimizers would do this or not. (And my naive guess
| would be "they don't", specifically, that the statistics
| kept are not sufficiently detailed to answer any
| generalized LIKE expression.)
| tomnipotent wrote:
| > I think it would depend
|
| Not for LIKE clauses using suffix wildcards, unless you
| create an index specifically using such a condition
| (CREATE INDEX IX_blah ON table (column) WHERE column LIKE
| '%abc');
| SigmundA wrote:
| Depends on estimated selectivity and if the index covers
| the result as well.
|
| If the criteria would fetch few rows out of many it can
| be faster to scan the index then retrieve the few
| matching results and even better if the index covers the
| results it never touches the table itself (index only
| scan).
| tomnipotent wrote:
| > Depends on estimated selectivity
|
| This can't be determined with LIKE suffix wildcards and
| that's not how any of the commonly-used index data
| structures work (b-tree, hash, gist, or bitmap). Index
| metadata will not help in eliminating leaf pages, and
| every row is going to need to be scanned.
| SigmundA wrote:
| Yes every row of the _index_ needs to be scanned not
| every row of the table which is faster than scanning the
| table.
|
| I am most familiar with MS SQL server and it will most
| certainly do an index scan for what it thinks is a highly
| selective predicate with "suffix wildcards" and it can
| return results faster than scanning the table.
|
| If the index covers the result columns it will scan the
| index and never touch the table otherwise it will do a
| key lookup to the table.
| tomnipotent wrote:
| B-tree's do not work that way. They are inherently
| ordered, and contain min/max that help to determine if
| you can skip the page for a given condition. The min/max
| cannot be used for suffix wildcards.
|
| Unless the index contains all the columns you're dealing
| with, the optimizer will determine that just scanning the
| table will cost less than scanning an index AND then
| looking up the data in the table (bookmark lookups in
| MSSQL).
| SigmundA wrote:
| B-trees have little to do with it, if the table has many
| columns its cheaper to scan the index for the value
| because it occupies less pages, thats all, less I/O more
| cache hits etc, goes from top to bottom on the index
| scanning for the result. This is the distinction between
| scan and seek.
|
| I just ran a common one I see and yep MS SQL is still
| doing a index scan then key lookup to get result with a
| select * from table where col LIKE '%abc' type query.
| yardstick wrote:
| > Although it's possible to join using a WHERE clause (an
| implicit join), prefer an explicit JOIN instead, as the ON
| keyword can take advantage of the database's index.
|
| Don't most databases figure this out as part of the query planner
| anyway? Postgres has no problems using indexes for joins inside
| WHERE.
| Hjfrf wrote:
| Yes all databases will use indexes for joins. There's quite a
| few mistakes like that.
|
| My guess is the author heard something about not using implicit
| inner joins (deprecated decades ago) and misunderstood.
|
| E.g. This old syntax- SELECT * FROM a, b WHERE a.id = b. a_id
| eirki wrote:
| > Avoid SELECT title, last_name,
| first_name FROM books LEFT JOIN authors ON
| books.author_id = authors.id
|
| > Prefer SELECT b.title,
| a.last_name, a.first_name FROM books AS b
| LEFT JOIN authors AS a ON b.author_id = a.id
|
| Couldn't disagree more. One letter abbreviations hurt readability
| IMO.
| hobs wrote:
| Most of your queries shouldn't be so verbose as to confuse
| which aliases you are talking about, but I agree - here's the
| best format :) (because why put ON on another line anyway?)
| SELECT bo.title, au.last_name,
| au.first_name FROM books AS bo LEFT JOIN authors AS
| au ON bo.author_id = au.id
| dragonwriter wrote:
| actually, I'd say the best is "column aliases are like full
| identifiers that name what the entity is in the context of
| the query".
|
| In simple cases that may be just the singular of the table
| name, e.g.: SELECT book.title,
| author.last_name, author.first_name FROM books
| AS book LEFT JOIN authors AS author ON
| book.author_id = author.id
|
| But in other cases, it will be different, e.g.:
| SELECT manager.last_name || ', ' ||
| manager.first_name AS manager_name,
| coalesce(employee.title, 'All Titles') AS staff_title,
| count(employee.id) AS count FROM employees AS manager
| LEFT JOIN employees AS employee ON manager.id =
| employee.manager_id GROUP BY manager.id,
| ROLLUP(employee.title) HAVING employee.title IS NOT
| NULL OR GROUPING(employee.title)=1
| mulmen wrote:
| ON should be on another line because it gives context to
| bo.author_id = au.id.
| hobs wrote:
| You never write NOT ON, so no, I'd disagree that its
| required for all the inequalities you express.
| mulmen wrote:
| Huh?
|
| ON is required by the syntax. Why start every line with a
| keyword that describes the line _except_ the ON portion
| of a JOIN? It 's inconsistent and has no clear benefit.
| zoomablemind wrote:
| The usefulness of this advice depends on the schema or design
| of the database. If the data is normalized, then it's quite
| reasonable to design for unambiguous field names in the queries
| after all relevant joins.
|
| Sure it does not help to understand the origins of a given
| field without aliases, unless someone is very familiar with the
| schema.
| heavyset_go wrote:
| Like naming variables, choose better names for your table
| aliases, then.
| trollied wrote:
| Came here to post exactly this.
|
| If the tables names are long, by all means abbreviate them a
| little, but never just use 1 letter aliases.
|
| I wonder if the author has ever worked with a system that has
| more than a handful of tables.
| dragonwriter wrote:
| I agree with the source that the latter (explicit table
| specification in the SELECT list, whether using aliases or not)
| is to be preferred to the former; at the same time (while I am
| sometimes guilty of using them) I agree that single-character
| aliases are generally a poor choice for the same reasons that's
| generally true of single character identifier names; column
| aliases are variable (well, constant) names and the usual rules
| of meaningful identifier names apply.
| GordonS wrote:
| It seems to be a matter of personal preference, but I've never
| liked single-character aliases myself, and never understood why
| so many seem to.
| dspillett wrote:
| Lazy typing: t is shorter than tableWithTheDataIWantIn
|
| I prefer descriptive table and other object names, and
| abbreviate them in aliases within queries (though usually not
| to single letters).
| markmark wrote:
| It's not just about lazy typing it's about removing
| unnecessary clutter from large queries that makes things
| _harder_ to read. In the author /books example, repeating
| the words author and books a dozen times doesn't convey any
| information that a and b don't, but clutters up the query
| making it harder to see the useful parts.
| hn_throwaway_99 wrote:
| While I semi disagree, I think the author's primary point was
| that you should always scope your column names.
| prepend wrote:
| I prefer having some meaningful alias because trying to
| remember what a,b,c,d, etc gets annoying.
| chrishas35 wrote:
| Author probably could have chosen a better example such that
| it doesn't look like the author chose letters sequently. In
| this case the letters are meaningful as they are the same as
| the first of the table name, a common convention,
| unfortunately that happens to be the first two letters of the
| alphabet...which yes, would be very annoying.
| [deleted]
| dspillett wrote:
| While single letter aliases can be bad outside of small
| examples like that, even worse being: SELECT
| t1.thing , t2.stuff , t3.stuffagain
| , t4.more FROM SomeTable t1
| JOIN TableThatLinksToSelf t2 ON <join predicate> JOIN
| TableThatLinksToSelf t3 ON <join predicate> JOIN
| AnotherTable t4 ON <join predicate>
|
| that is not the point that is being made here. The point is
| that explicitly naming tables is beneficial to understanding
| and reducing issues later. Short alias is preferable to not
| specifying column sources at all.
| [deleted]
| cornel_io wrote:
| NB: this post is mostly performance advice, and it only applies
| to traditional databases. Specifically, it is not good advice for
| big data columnar DBs, for instance a limit clause doesn't help
| you at all on BigQuery and grabbing fewer columns _really_ does.
| jitans wrote:
| not even all the "traditional" databases, each Engine has his
| own peculiarities.
| the_arun wrote:
| Anyone using Metabase? Is it worth having self hosted and
| managing it?
| ineedasername wrote:
| _prefer an explicit JOIN_
|
| Yes absolutely, and not just for performance benefits. It's much
| easier to track what, how, and why you're joining to something
| when it's not jumbled together in a list of a dozen conditions in
| the WHERE clause.
|
| I can't tell you how much bad data I've had to fix because when I
| break apart the implicit conditions into explicit joins it is
| absolutely not doing what the original author intended and it
| would have been obvious with an explicit join.
|
| And then in the explicit join, always be explicit about the join
| type. don't just use JOIN when you want an INNER JOIN. Otherwise
| I have to wonder if the author accidentally left off something.
| c2h5oh wrote:
| CTE advice is somewhat questionable, as it is database specific.
|
| CTEs were for a very long time an optimization fence in
| PostgreSQL, were not inlined and behaved more like temporary
| materialized views.
|
| Only with release of PostgreSQL 12 some CTE inlining is happening
| - with limitations: not recursive, no side-effects and are only
| referenced once in a later part of a query.
|
| Mode info: https://hakibenita.com/be-careful-with-cte-in-postgre-
| sql
| johnvaluk wrote:
| Overall an enjoyable read, but as someone who includes SQL
| queries in code, I disagree with two points:
|
| I despise table aliases and usually remove them from queries. To
| me, they add a level of abstraction that obscures the purpose of
| the query. They're usually meaningless strings generated
| automatically by the tools used by data analysts who rarely
| inspect the underlying SQL for readability. I fully agree that
| you should reference columns explicitly with the table name,
| which I think is the real point they're trying to make in the
| article.
|
| While it's true that sorting is expensive, the downstream
| benefits can be huge. The ability to easily diff sorted result
| sets helps with troubleshooting and can also save significant
| storage space whenever the results are archived.
| alex_anglin wrote:
| To each their own, but in the case of ETL/ELT, you would just
| be asking for pain not using aliases.
| mulmen wrote:
| In the case of ETL you should only be referencing those
| tables a few times because you are integrating them into
| friendly analytic models. In that case you probably have a
| lot of columns to wrangle and complex transformation logic.
| In those cases I prefer to use no alias at all to avoid the
| scrolling around to get context, even when table names are
| very long.
| gizmodo59 wrote:
| Even there someone needs to read them eventually than just
| the person who wrote it. Single letter aliases are just evil.
| In some ways it's the same as doing: String x = "Hello"
| dragonwriter wrote:
| > Even there someone needs to read them eventually than
| just the person who wrote it.
|
| That's not an argument against table aliases, its an
| argument against _unclear_ table aliases.
|
| Single letter table aliases are better than just using
| unqualified column names, both of which are worse than
| table aliases guided by the same naming rules you'd use for
| semantically-meaningful identifiers in regular program
| code.
| barrkel wrote:
| Table aliases make queries much less verbose than using the
| full table name, and more readable as a result. Aliases are
| unavoidable when you're joining the table more than once. Not
| using qualified identifiers is just asking for trouble.
|
| Short aliases - I tend to use the first letter of each word in
| the table name - work best, IMO.
| learn_more wrote:
| I'm the opposite. I like the short aliases, esp. when there are
| many tables. Short aliases can all be the same length, and
| therefore align better for better readability, and they don't
| pollute the visibility as overly-verbose table names do.
|
| In code, I like the length of the variable name to be
| proportional to the size of the scope. Small scope -- short
| variable names.
| magicalhippo wrote:
| I got descriptive table names like WhsTransactionGoodsItems and
| WhsTransactionGoodsItemPackages.
|
| I feel it would be rather noisy to have to specify such table
| names in front of the 15+ column references in a query,
| compared to using aliases.
|
| Then again I've never had to diff the result sets, so I guess
| our usage is quite different.
| mulmen wrote:
| My editor has tab completion, it's not like you have to type
| every character. I prefer it for readability but it's
| definitely debatable.
| magicalhippo wrote:
| Readability is weird heh, I'm exactly the opposite. For me
| long lines become a blur, so better to have a short alias.
| mulmen wrote:
| I know what you mean. I think this is really a matter of
| readability vs comprehension. I want all the context on
| one line so I can understand it, even if that does
| technically make it harder to read that line.
|
| This is especially a problem in huge queries.
| magicalhippo wrote:
| I'd add "be aware of window functions"[1]. Certain gnarly
| aggregates and joins can often be much better expressed using
| window functions.
|
| And at least for the database we use at work, if the sole reason
| for a join is to reduce the data, prefer EXISTS.
|
| [1]: https://www.sqltutorial.org/sql-window-functions/
| AdrianB1 wrote:
| Sorry to rain on your parade, but there is nothing in that
| article that is not included in the basic SQL manuals like Itzik
| Ben-Gan's.
|
| Also a few things are dead wrong: the "make the haystack small"
| is optimization (it should be at the end, as the first rule
| says), the "prefer UNION ALL to UNION" is missing the context (a
| good dev knows what is needed, not what to prefer) and the usage
| of CTEs is nice, but sometimes slower that other options and in
| SQL slower can easily be orders of magnitude, so nice is not
| enough. Same for 'avoid sorting where possible, especially in
| subqueries' or "use composite indexes" (really? it's a basic
| thing, not a best practice).
|
| In the past few months I interviewed and hired several DBAs, this
| list is ok-ish for a junior but a fail for a senior. I am not
| working for FAANG, so the bar is pretty low, this article would
| not even pass for a junior there.
| [deleted]
| iblaine wrote:
| "Best practices for writing SQL queries in metabase" should be
| the title here.
|
| 10 or so years ago when SQL Server, Oracle & MySQL dominated the
| industry, you could talk about SQL optimization with the
| expectation that all advice was good advice. There are too many
| flavors of databases to do that today.
| nerdbaggy wrote:
| Does anybody else like putting from first? I find it makes the
| auto complete sooo much better and easier to read.
| 7952 wrote:
| I think it would be easier to order things in terms of when
| they are executed. And perhaps it would be easier to teach SQL
| if the different parts where more obviously separate. As the
| different parts are actually distinct and don't really cross
| over. But to a newbie would seem procedural when it's not.
| chubot wrote:
| I like it, but sqlite doesn't seem to accept it? At least not
| in the version on my Ubuntu machine. Is putting from first
| standard SQL?
| [deleted]
| SigmundA wrote:
| Xquery and Linq both use FLWOR like syntax which puts the
| "FROM" first and helps auto complete, wish SQL had ordered
| things this way:
|
| https://en.wikipedia.org/wiki/FLWOR
|
| SELECT first_name FROM person WHERE first_name LIKE 'john'
|
| becomes:
|
| FROM person WHERE first_name LIKE 'john' SELECT first_name
|
| SQL reads more English like while from first is more Yoda speak
| but the auto-complete is worth more to me.
| carbocation wrote:
| Personal habit is to start my WHERE clause with a TRUE or a FALSE
| so that adding or removing clauses becomes seamless:
| SELECT foo FROM bar WHERE TRUE AND baz
| > boom
|
| For OR conditions it's a bit different: SELECT
| foo FROM bar WHERE FALSE OR baz > boom
| magicalhippo wrote:
| Yeah, I almost always do "where 1=1" with the actual
| expressions AND'ed below.
|
| For OR, I like to keep the "1=1" and do AND
| (1=2 OR ... )
| mactrey wrote:
| I'm so attached to starting all my where clauses with a TRUE
| (1=1 since SQL Server doesn't have boolean literals) that I do
| this when I need some OR clauses: SELECT foo
| FROM bar WHERE 1=1 AND (1<>1
| OR baz > boom OR fizz >= bang )
| AND foo is not null
|
| So you can comment out lines starting with OR individually.
| Some people might hate it but it makes sense conceptually for
| me since almost every query I write takes a chain of ANDs in
| the where clause as a starting point.
| warent wrote:
| this seems like taking on a pretty huge risk for a minor
| convenience. the difference between those two queries can mean
| the difference between protecting someone's PII
| carbocation wrote:
| I'm not sure that I follow. The two queries are to
| demonstrate difference in form; they are not intended to be
| equivalent.
|
| If you're already writing: WHERE foo=bar
| AND biz=baz
|
| It's not clear to me how: WHERE TRUE
| AND foo=bar AND biz=baz
|
| is worse.
| shazzdeeds wrote:
| He's saying if someone gets in the habit of using that
| style they have to be very careful. If they forget to
| change True to False when using an OR that it could have
| major consequences. Performance being the least of
| concerns.
| Hjfrf wrote:
| Lots of mistakes (or at least rare opinions going against the
| crowd) here.
|
| Here's a better general performance tuning handbook -
| https://use-the-index-luke.com/
| willvarfar wrote:
| More that its a dumbed-down general guide aimed at meta base
| users?
|
| Use-the-index-luke is an altogether deeper, more technical
| article aimed at data engineers and going into the details and
| differences between databases.
| hn_throwaway_99 wrote:
| This is an aside, but a colleague years back showed me his
| preferred method formatting SQL statements, and I've always found
| it to be the best in terms of readability, I just wish there was
| more automated tool support for this format. The idea is to line
| up the first value from each clause. Visually it makes it
| extremely easy to "chunk" the statement by clause, e.g.:
| SELECT a.foo, b.bar, g.zed FROM alpha a JOIN beta
| b ON a.id = b.alpha_id LEFT JOIN gamma g ON b.id =
| g.beta_id WHERE a.val > 1 AND b.col < 2
| ORDER BY a.foo
| karmakaze wrote:
| I find it more readable as SELECT a.foo, b.bar,
| g.zed FROM alpha a JOIN beta b ON a.id = b.alpha_id
| LEFT JOIN gamma g ON b.id = g.beta_id WHERE a.val > 1
| AND b.col < 2 ORDER BY a.foo
|
| Usually only the conditions get deep and can also use extra
| indented parenthesized parts.
|
| reminder: don't use 'OUTER' it's pure noise
| AdrianB1 wrote:
| We are doing something very similar, except we add extra
| indentation to the joins like you did for the second
| condition in the WHERE. This is because we strongly separate
| each block - SELECT, FROM, WHERE, <ORDER>, so everything that
| is in the FROM block is indented. Same for the SELECT if it
| so long it goes to a second or third row.
| sodapopcan wrote:
| My problem with formatting any code like this is that it can
| make diffs painful. I agree that this looks better but I would
| say only marginally so. And I really have no problems reading
| code that isn't lined up like this. I don't really have a high
| care level, though. I'm happy to go with the team on this one.
| hn_throwaway_99 wrote:
| I don't see why you think it would make diffs painful. If
| anything, in my experience it makes diffs _easier_ because
| each chunk can be put on it 's own, independent line so that
| if you change anything it is constrained to the relevant
| line.
| dragonwriter wrote:
| It makes diffs harder because maintaining the indentation
| rule (sometimes, depending on what is on _other_ lines)
| requires changing every line of the query if you go from
| "INNER JOIN" (equally, outer /right/cross join) to "LEFT
| JOIN" (equally, full join).
| fiddlerwoaroof wrote:
| Nearly every diff tool has -w for this, though: main main
| annoyance with GitHub is that I can't enable this as the
| default diff mode.
| sodapopcan wrote:
| Yep--`-w` is the default when I blame in my editor but
| ya, github is really the problem.
| majewsky wrote:
| At least GitHub now has a UI for enabling it. I remember
| the dark ages when you had to put ?w=1 on the URL _like
| some sort of animal_.
| hn_throwaway_99 wrote:
| The indentation rules never change. Fortunately, "SELECT"
| at six letters is as long as the longest first word that
| starts a clause, which is why when doing, for example, a
| "LEFT JOIN" you line up the "LEFT" and not the "JOIN,
| e.g. SELECT a.foo FROM alpha a
| JOIN beta b ...
|
| would become SELECT a.foo FROM
| alpha a LEFT JOIN beta b ...
|
| Any diff tool correctly highlights the only change is the
| LEFT.
| sodapopcan wrote:
| Ya, was thinking after I submitted that SELECT and DELETE
| are always going to be the longest anyway, so it always
| works for SQL!
| Hjfrf wrote:
| Does that still look ok if you're selecting 10+ columns with
| functions, or would you split out the first line situationally?
| hn_throwaway_99 wrote:
| Another commenter showed how this works:
| SELECT a.foo , b.bar , g.zed
| FROM ...
|
| While the comma placement may seem weird, it makes this
| exactly identical to the "AND" or "OR" placement in WHERE
| clauses, and the primary benefit is that it's easy to comment
| out any column except the first.
| mulmen wrote:
| It's always easy to comment out any column except the first
| or last. Leading commas make it easy to comment out the
| _last_ column, trailing commas make it easy to comment out
| the first.
|
| Personally I don't think that optimization is worth the
| price. Trailing commas look nicer visually so I prefer
| them.
| jolmg wrote:
| > While the comma placement may seem weird
|
| It's not completely unconventional. Haskell is typically
| styled with that kind of comma usage, too. For example,
| [ 1 , 2 ] { foo = 1 , bar = 2
| }
|
| Coincidentally, SQL and Haskell are the only languages I
| know that use `--` for comments.
| mulmen wrote:
| With the curly braces this makes a lot more sense. In SQL
| it offends my eyes (personal preference) but here it
| seems more clear.
| weaksauce wrote:
| what about something like this?
| http://www.eslinstructor.net/vkbeautify/
| matwood wrote:
| This is close to what I've settled on for 20 years. I'll also
| indent again if there are a lot of joins/clauses in the on.
| dspillett wrote:
| I've taken to using a similar format too, though some seem to
| dislike it significantly. Other things I like for clarity and
| editing ease are prefix commas and lining up like parts, using
| something like your example: SELECT a.foo
| , b.bar , g.zed FROM alpha a
| JOIN beta b ON a.id = b.alpha_id AND a.another = b.thing
| LEFT JOIN gamma g ON b.id = g.beta_id WHERE a.val
| > 1 AND b.col < 2 ORDER BY a.foo
|
| or SELECT a.foo , b.bar
| , g.zed FROM alpha a JOIN beta b
| ON a.id = b.alpha_id AND a.another =
| b.thing LEFT JOIN gamma g ON b.id =
| g.beta_id WHERE a.val > 1 AND b.col
| < 2 ORDER BY a.foo
|
| I'm not consistent with the layout of my joining predicates - I
| go for whatever seems clearer given the current circumstances
| and that varies due to several factors (number of parts, length
| of column names and/or functions, ...). How sub-queries and
| instances of CASE are broken into lines and indented is
| something I also vary on.
| thih9 wrote:
| Maintaining alignment in these queries seems a pain. I'd
| prefer the regular, newlines and fixed indentation; e.g.:
| SELECT a.foo, b.bar, g.zed FROM alpha a JOIN
| beta b ON a.id = b.alpha_id AND a.another = b.thing
| LEFT JOIN gamma g ON b.id = g.beta_id WHERE a.val >
| 1 AND b.col < 2 ORDER BY a.foo
|
| (bonus: "AND" got accidentally aligned with the end of
| "WHERE")
| akira2501 wrote:
| > Maintaining alignment in these queries seems a pain.
|
| I use tabs. SELECT t.foo, t.bar
| FROM a_table t
| zikzak wrote:
| This escalated quickly.
| jolmg wrote:
| > though some seem to dislike it significantly
|
| I can see why. The indentation of the whole statement is not
| determined by the first line, but by the 6th on the first and
| the 8th on the second on a `JOIN` clause. It's really
| arbitrary, and when you have that statement between other
| code, it's going to be weird how the start of the statement
| is much more indented than its preceding code. I really
| dislike it, too.
|
| I prefer the use of indentation to signal what's inside
| another syntax structure. So, for example, I also dislike how
| you aligned `ON` and `AND` when the `AND` is inside the `ON`
| expression. It makes it seem like the two lines are on the
| same syntactic level.
|
| Here's how I do it: SELECT a.foo ,
| b.bar , g.zed FROM alpha a JOIN
| beta b ON a.id = b.alpha_id AND
| a.another = b.thing LEFT JOIN gamma g ON
| b.id = g.beta_id WHERE a.val > 1 AND b.col
| < 2 ORDER BY a.foo
|
| You might also notice that I removed the padding you used to
| align the `=` signs. I dislike big changes where the only
| thing that changed for a line is the whitespace padding. It
| obscures the real change. It might not seem like a big thing
| when you only have 2 lines in alignment, but it's a real
| bother when reading a diff that does that for more lines. You
| have to compare between the - and + lines to find what really
| changed instead of the diff telling you outright.
| arh68 wrote:
| I like your second version. My own style, still evolving, is
| to write more lines and align further left
| select a.foo , b.bar , g.zed
| from alpha a inner join beta b on b.alpha_id =
| a.id and b.thing = a.another left join gamma g
| on g.beta_id = b.id where a.val > 1
| and b.col < 2 order by a.foo
| mulmen wrote:
| I like the idea in general, I have tried something similar
| before. But I've never understood the appeal of leading
| commas. It screws up your alignment and just looks messy.
| ZeroClickOk wrote:
| You can easily add remove columns like this, just
| removing the entire line. If you add the comma the
| traditional way, you will change 2 lines (the end of
| previous line with comma, and the new line). It's nice
| for maintenance and diff'ing
| mulmen wrote:
| That is only true of the first or last line.
| col1, col2, col3 col1 ,col2
| ,col3
|
| You can remove col2 from either of those examples and
| have valid syntax.
| ZeroClickOk wrote:
| Reference: https://www.sqlstyle.guide/
| saila wrote:
| This is the style I've settled on lately, where all the major
| keywords are left-aligned and the clauses are consistently
| indented. It uses a bit more vertical space, but I find it
| easier to read than any other formatting style I've seen (in
| the wild or produced by formatters). select
| a.foo, b.bar, g.zed from
| alpha a join beta b on b.alpha_id =
| a.id left join gamma g on g.beta_id
| = b.id where a.val > 1 and b.col <
| 2 order by a.foo
|
| It's really easy, for me anyway, to get an overview of the
| query with this style compared to styles that are more cramped
| or that are inconsistently wrapped/indented.
|
| For simpler queries, I think this is okay too, but only if the
| clauses easily fit on a single line: select
| ... from ... join ... where ...
| dbatten wrote:
| My company style guide explicitly specifies this.
|
| https://github.com/republicwireless-open/sql-style-guide
|
| I think Mozilla's does as well.
| croes wrote:
| It's also easier if you want to comment out certain parts of
| your code during debugging.
| scottmcdot wrote:
| On this, I like to also use Where 1 = 1
| And... And...
|
| Which makes it easy to comment out specific filters.
| Icathian wrote:
| 10 years messing with data across 5 major flavors of sql and
| this is the format I've settled on and advocated for my team.
| It seems to flow the best and be easiest for people to get
| used to.
| u801e wrote:
| I've found that using hanging indents and using prefixed
| binary operators, like you did here, is the most readable
| form and when adding additional clauses, makes for a readable
| diff.
| supernova87a wrote:
| Is there a good place to read from an advanced casual "lay
| user's" perspective what SQL query optimizers do in the
| background after you submit the query?
|
| I would love to know, so that I can know what optimizations and
| WHERE / JOIN conditions I should really be careful about making
| more efficient, versus others that I don't have to worry because
| the optimizer will take care of it.
|
| For example, if I'm joining 2 long tables together, should I be
| very careful to create 2 subtables with restrictive WHERE
| conditions first, so that it doesn't try to join the whole thing,
| or is the optimizer taking care of that if lump that query all
| into one entire join and only WHERE it afterwards? How do you
| tell what columns are indexed and inexpensive to query
| frequently, and which are not? Is it better to avoid joining on
| floating point value BETWEEN conditions?
|
| And other questions like this.
| fabian2k wrote:
| My experience is with Postgres, this might vary for other
| databases. As already said, using EXPLAIN ANALYZE is very
| useful to see what the planner is doing. This might be hard to
| read for more complex queries, but it is quite understandable
| for simple ones.
|
| One of the more important parts is simply understanding which
| indexes can be used in a query. The other part is understanding
| when the database will intentionally not use an index, this is
| mostly related to column statistics. The basics of indexes are
| pretty simple, but then there is a whole bunch of subtle
| details that can mean the index can't actually be used for your
| query.
|
| Another useful part to understand is how much IO a query
| requires, EXPLAIN (ANALYZE, BUFFERS) is helpful for that. But
| you also need to understand a bit the layout Postgres uses to
| store data, how it is stored in pages, TOAST and related stuff.
|
| For Postgres I'd really start with reading the manual on index
| types and on the statistics collector. After that I'd just play
| with explain analyze for queries you're writing.
|
| The order of JOINS is optimized automatically in Postgres, but
| only up to a pointf, for a large number of joins it has to fall
| back to heuristics.
| wolf550e wrote:
| This is a good explanation about join order optimization:
| https://www.sqlite.org/queryplanner-ng.html
|
| No database can find perfect join order when you have more
| than about 8 to 10 tables in the join.
| wolf550e wrote:
| You basically only need to know one thing to answer all your
| questions: use EXPLAIN PLAN. Postgresql has "explain analyze",
| which is even better than simple "explain", but all SQL
| databases have "explain", because they are kinda useless
| without it. The database will tell you what it's going to do
| (or what it did) and you will decide whether that's ok or
| whether it's doing something stupid (e.g. full table scan when
| only 1% of rows is needed), and then you can try things to get
| the plan that you want (ensuring statistics are up to date,
| adding indexes, changing the query, etc).
|
| Databases have ways to query the schema which includes the
| index definitions, so you can know which columns and indexed
| (and the order of the columns in those indexes).
|
| Unless you materialize a temporary table or materialized view
| or use a CTE with a planner that doesn't look inside CTEs, the
| planner will just "inline" your subqueries (what are
| "subtables"?) and it will not affect the way the join is
| performed.
|
| Join on floating point value is quite rare. Why do you need to
| do that?
| supernova87a wrote:
| _> Join on floating point value is quite rare. Why do you
| need to do that?_
|
| Ah, thanks for noticing this. They are, for example, (1)
| tables of timestamped events, and (2) tables of time ranges
| in which those events need to be associated with (but which
| unfortunately were not created with that in mind at the
| time)...
|
| So for example FROM tableA LEFT JOIN tableB ON (timestampA
| BETWEEN timestampB1 AND timestampB2)
|
| (and where the timestamps can be either floating point or
| integer nanoseconds)
| wolf550e wrote:
| Since it's a left join, you will get all the rows from
| tableA, and for each row the matching rows in tableB. If
| the ranges in tableB are non-overlapping, maybe you have
| names for time ranges and you want the name of the time
| range for each row in tableA?
|
| If tableB is large, I don't know what any particular query
| planner will do with such a query and whether an index on
| (timestampB1, timestampB2) will help. It should, but use
| "explain" to check. If tableB has many rows and also has
| many columns and you only need a few columns, a covering
| index on (timestampB1, timestampB2) that only has the
| columns you need can improve perf a lot, because it won't
| need to refer to tableB itself.
|
| If you use this construction to translate timestamp ranges
| into calendar ranges, your database might have a function
| to do that efficiently (convert unix timestamp into
| datetime, extract year/month/day/etc from the datatime). Or
| you might need to write a user defined function to do that,
| in whatever way your database allows (even C). This should
| be better than a join, IMO.
|
| One alternative rewriting of your query which you maybe did
| not think of, and which might be crazy or might be
| plausible, is to use a case statement in the select part,
| instead of a join. Basically use the info in tableB to
| generate the SQL for a computed column. If tableB has many
| rows, this might be worse than a join.
|
| If you want to use "names" from tableB to filter rows in
| tableA (inner join), and the query should result in a small
| proportion of the rows from tableA, an index on timestampA
| is needed. If tableA is really large, it might need to be
| partitioned on timestampA to filter out whole partitions,
| but only if you regularly query in such a way that whole
| partitions can be filtered out at query planning time.
| haolez wrote:
| Metabase is an amazing product, but I'm using Superset[0] in my
| company because it supports Azure AD SSO, which became a
| necessity for us. But as soon as this feature appears in
| Metabase, we are switching.
|
| [0] https://superset.apache.org/
| colonwqbang wrote:
| > Although it's possible to join using a WHERE clause (an
| implicit join), prefer an explicit JOIN instead, as the ON
| keyword can take advantage of the database's index.
|
| This implies that WHERE style join can't use indices.
|
| I can understand why some would prefer either syntax for
| readability/style reasons. But the idea that one uses indices and
| the other not, seems highly dubious.
|
| Looking at the postgres manual [1], the WHERE syntax is clearly
| presented as the main way of inner joining tables. The JOIN
| syntax is described as an "alternative syntax":
|
| > This [INNER JOIN] syntax is not as commonly used as the one
| above, but we show it here to help you understand the following
| topics.
|
| Maybe some database somewhere cannot optimise queries properly
| unless JOIN is used? Or is this just FUD?
|
| [1] https://www.postgresql.org/docs/13/tutorial-join.html
| SigmundA wrote:
| Yes the optimizer should end up with the same plan either way,
| although the ON syntax is SQL standard.
| joelcollinsdc wrote:
| Maybe not for indexes but what about using a sql syntax that is
| more common and extensible?
| FranzFerdiNaN wrote:
| I don't think I have ever seen that way of doing an inner join
| in the wild, despite working as a DBA or data engineer for the
| past 15 years, 10 of those Postgres-only roles.
| matwood wrote:
| I started my dba roles back with mssql 6.5, and the join
| using where was all that was supported. I found the join
| syntax much more clear around intent and moved as soon as it
| was available.
| majewsky wrote:
| I prefer the JOIN syntax because I don't have to rewrite
| everything when I realize later on that I need an OUTER JOIN.
| mtone wrote:
| > This syntax is not as commonly used as the one above
|
| This is going to need some sources. Is it true today? And why
| did they put parentheses in the ON condition?
|
| Worth nothing that there were variants of the WHERE syntax to
| support left joins using vendor-specific operators such as A +=
| B, A = B (+) -- those are clearly deprecated today. [1] [2]
|
| I have a really hard time finding any source on the internet
| that recommends using the WHERE style joins. So by extension, I
| wouldn't expect to be used much anymore except for legacy
| projects. MS SQL Server docs docs mention ON syntax being
| "preferred" [3], and MySQL says "Generally, the ON clause
| serves for conditions that specify how to join tables, and the
| WHERE clause restricts which rows to include in the result
| set." [4]
|
| The PostgreSQL docs seem misleading and outdated to me.
|
| [1] https://docs.microsoft.com/en-
| us/archive/blogs/wardpond/depr...
|
| [2]
| https://docs.oracle.com/cd/B19306_01/server.102/b14200/queri...
|
| [3] https://docs.microsoft.com/en-us/sql/relational-
| databases/pe...
|
| [4] https://dev.mysql.com/doc/refman/5.7/en/join.html
| colonwqbang wrote:
| I'm just quoting the manual here. I have no idea which style
| is really prevalent in the wild, now or X years ago.
|
| My goal was only to cast doubt on the idea that WHERE clauses
| in general can't use indices.
|
| Sure, let's debate what the nicest style is. But let's not
| claim that our preferred style somehow makes the DB go faster
| (without some kind of proof).
| mtone wrote:
| Indeed, they will perform the same or very close (the query
| plan might differ a bit due to the different orderings).
| Not sure where the author got that from. I'm complaining
| about the docs only.
| Justsignedup wrote:
| some of these are inaccurate.
|
| "a = 'foo'" is exactly the same performance as "a like 'foo'" and
| very close to the performance as "a like 'foo%'" and is fully
| indexed. When you put a wildcard in the front, the entire index
| is avoided, so you gotta switch to full text search.
| jjice wrote:
| Does anyone have any good resources for practicing SQL queries? I
| recently had an interview where I did well on a project and the
| programming portions, but fumbled on the more SQL queries that
| were above basic joins. I didn't realize how much I need to learn
| and practice. I don't know if my lack of knowledge was enough to
| cost me the position or not yet, but I'd like to prepare for the
| future either way.
|
| I've seen a few websites, but I don't know which ones to use. Or
| maybe there is a dataset with practice question I could download?
|
| Edit: I found https://pgexercises.com and it's been fantastic so
| far. Much more responsive than other sites, clear questions, and
| free.
| bob1029 wrote:
| This seems like reasonable discussion, but you would get far more
| traction if you have an opportunity to write an entire schema
| from scratch in the proper way.
|
| Not having to fight assumptions along lines of improperly
| denormalized columns (i.e. which table is source of truth for a
| specific fact) can auto-magically simplify a lot of really
| horrible joins and other SQL hack-arounds that otherwise wouldn't
| be necessary. The essential vs accidental complexity battle
| begins right here with domain modeling.
|
| You should be seeking something around 3rd normal form when
| developing a SQL schema for any arbitrary problem domain. Worry
| about performance after it's actually slow. A business expert who
| understands basic SQL should be able to look at and understand
| what every fact & relation table in your schema are for. They
| might even be able to help confirm the correctness of business
| logic throughout or even author some of it themselves. SQL can be
| an extremely powerful contract between the technology wizards and
| the business people.
|
| More along lines of the original topic - I would strongly
| advocate for views in cases where repetitive, complex queries are
| being made throughout the application. These serve as single
| points of reference for a particular projection of facts and can
| dramatically simplify downstream queries.
| grzm wrote:
| One of the strengths of Metabase is that it can plug into a
| variety of data sources, not just RDBMS. For example, AWS
| Athena over data in S3 buckets. Good design can still make
| things easier, of course, but not always an option. Relational
| purity is not going to be an option in such circumstances, so
| are in my opinion correctly not addressed in the piece.
| mulmen wrote:
| Metabase sounds like a great tool for building clean analytic
| schemas then. You still need to design those schemas though.
| tremon wrote:
| _Avoid functions in WHERE clauses_
|
| Avoid them on the column-side of expressions. This is called
| sargability [1], and refers to the ability of the query engine to
| limit the search to a specific index entry or data range. For
| example, WHERE SUBSTRING(field, 1, 1) = "A" will still cause a
| full table scan and the SUBSTRING function will be evaluated for
| every row, while WHERE field LIKE "A%" can use a partial index
| scan, provided an index on the _field_ column exists.
|
| _Prefer = to LIKE_
|
| And therefore this advice is wrong. As long as your LIKE
| expression doesn't start with a wildcard, LIKE can use an index
| just fine.
|
| _Filter with WHERE before HAVING_
|
| This usually isn't an issue, because the search terms you would
| use under HAVING can't be used in the WHERE clause. But yes, the
| other way around is possible, so the rule of thumb is: if the
| condition can be evaluated in the WHERE clause, it should be.
|
| _WITH_
|
| Be aware that not all database engines perform predicate
| propagation across CTE boundaries. That is, a query like this:
| WITH allRows AS ( SELECT id, result =
| difficult_calculation(col) FROM table) SELECT
| result FROM allRows WHERE id = 15;
|
| might cause the database engine to perform
| difficult_calculation() on all rows, not just row 15. All big
| databases support this nowadays, but it's not a given.
|
| [1] https://en.wikipedia.org/wiki/Sargable
| adrianmonk wrote:
| Also, some databases allow you to index the result of a
| function. Oracle calls them "function-based indexes".
| PostgreSQL seems to call them "indexes on expressions".
|
| And MySQL seems to support "generated columns" which can be
| "virtual" and can have indexes. (Although in that case the
| expression lives in the column definition, so it's not actually
| in a where clause.)
|
| Also, I guess some databases probably let you have an index on
| a view, which could be another way.
|
| So if you really need a function in your where clause, there
| may very well be a way to do it efficiently. Of course, the
| usual caveat applies that it requires more I/O to maintain more
| indexes.
| matt-snider wrote:
| MySQL 8.0 actually added support for functional indexes, but
| I found out the hard way that they don't work the same way as
| indexing generated columns. For me the biggest issue were the
| caveats around using the indexed column in conditions, which
| resulted in the index not being used in surprising
| situations. Anyways, I had to revert to generated columns
| which was a shame because it was a feature I had looked
| forward to using.
|
| I found this article on the topic to be helpful:
|
| https://saveriomiroddi.github.io/An-introduction-to-
| function...
| tehlike wrote:
| Postgres and the likes allows trigram indexes for like queries,
| and expression based indexes just fine. This argument doesn't
| really pass the smell test.
| richardeb wrote:
| Advice would have to be tailored to specific database
| technologies and probably specific versions.
|
| For example, in Apache Impala and Spark, "Prefer = to LIKE" is
| good advice, especially in join conditions, where an equijoin
| would allow the query planner to use a Hash Join, whereas a non
| equijoin limits the query planner to a Nested Loop join.
| musingsole wrote:
| This is ultimately my problem with databases. We use the term
| as a catchall, but every implementation is different and is
| unified only in that they store tables and can respond to
| SQL.
|
| People treat deciding your app will have a database as a
| design decision when in reality it is only about 10% of a
| design decision.
| arrosenberg wrote:
| That's a challenge with meatspace infrastructure too. You
| can have a standardized design for (e.g.) an airport, but
| "commercial jets" actually represents a wide variety of
| vehicles with different needs and tolerances, so all
| designs have to be adapted to the specific circumstances.
___________________________________________________________________
(page generated 2021-04-10 23:00 UTC)