[HN Gopher] Sqlfluff the SQL Linter for Humans
___________________________________________________________________
Sqlfluff the SQL Linter for Humans
Author : rbanffy
Score : 169 points
Date : 2021-10-06 11:11 UTC (11 hours ago)
(HTM) web link (www.sqlfluff.com)
(TXT) w3m dump (www.sqlfluff.com)
| slimebwoy wrote:
| We use this at my company to lint dbt sql code within our shared
| repositories. We are able to easily craft a set of rules that fit
| our style guides and ignore the rest.
| KronisLV wrote:
| This is a nice idea, but my problem with most of these tools is
| that they work with pure SQL/JS/whatever files, which isn't
| always the case.
|
| For example, if you use something like myBatis, you'll probably
| build the SQL dynamically, which could be stored in a mapper XML
| file: https://mybatis.org/mybatis-3/dynamic-sql.html
|
| Not only that, but certain frameworks have you embedding SQL
| inside of the main application code, when you're not using a DSL
| like HQL:
| https://www.tutorialspoint.com/hibernate/hibernate_query_lan...
| and
| https://www.tutorialspoint.com/hibernate/hibernate_native_sq...
|
| Furthermore, there seems to be this odd separation between
| "regular SQL" and "procedural SQL" in some DBVSes, such as in
| Oracle - there you might have to use "DECLARE ... BEGIN ... END;"
| in certain contexts, or maybe "EXECUTE IMMEDIATE". Then again,
| Oracle isn't among the supported dialects, which is perfectly
| understandable, because of both how the DBVS is positioned, as
| well as because of how large an undertaking supporting it would
| be: https://docs.sqlfluff.com/en/stable/dialects.html
|
| In short, the only passable tools that i've found for checking
| SQL are those that are integrated within the IDE, or using a
| specialized IDE with any plugins for specific technologies, like
| JetBrains DataGrip (commercial project):
| https://www.jetbrains.com/datagrip/
|
| But then you run into the issue of code style preferences: some
| companies have style guides where you have to use reserved words
| in uppercase with the dev created content in lowercase, like:
| SELECT some_column, another_column, yet_another_column AS
| one_more FROM some_table WHERE some_column >
| SYSDATE
|
| while certain tools (like SQL Developer) are more than happy to
| use autocomplete which doesn't work well with that:
| SELECT (start typing in "som") FROM some_table ... Oracle
| will autocomplete to: SOME_COLUMN
|
| And then you run into people's personal preferences, for example
| some format the columns that they want to select like this:
| SELECT some_column , another_column ,
| yet_another_column FROM ...
|
| which makes sense when you want to edit single lines, but could
| have been avoided by the SQL standard allowing dangling commas,
| like some other languages do, for example: SELECT
| some_column, another_column, yet_another_column,
| FROM ...
|
| but as it currently stands, that's not possible and the other way
| of formatting code also tends to break linters. And since the
| actual formatting of SQL doesn't matter as much as it would in
| Python, we end up with every IDE and other tool having their own
| preferences.
|
| I guess the lesson here is to avoid complexity as much as
| possible and just settle on whatever works.
| amshank wrote:
| You're totally right that there isn't any consistency in this
| space at the moment. It's part of the justification for tools
| like SQLFluff starting to align people on a common way of
| writing SQL. None of the incumbent players have an incentive to
| do that.
| furstenheim wrote:
| I haven't used sqlfluff, but they support jinja templating and
| in those cases they ignore spacing problems. So maybe your use
| case can be supported?
| jdub wrote:
| Yeah, you need to read more of the sqlfluff documentation. :-)
| epberry wrote:
| Fascinating to see the focus on SQL and continuous integration,
| https://docs.sqlfluff.com/en/stable/inthewild.html#inthewild....
| DBT has really changed things in that regard. @tomhallett's YT
| video seems to describe using this with a pre-release build of
| DBT to tightly couple it to sqlfluff.
| tomhallett wrote:
| I am spec'ing a data pipeline right now and the team is so
| excited to try this out. Here's a great intro video about the
| tool and their motivations:
| https://m.youtube.com/watch?v=veYB9uh0RCM&t=1105s
|
| My goal is to use it with a Rockset database, so I'll see how far
| I can get with one of the existing dialects, or how hard it is to
| extend an existing one to make one for Rockset
| ErunamoJAZZ wrote:
| I wanted to use this linter, but they don't support many things
| that are used in real production code. By example, a simple
| postgres function like this one can't be parsed:
| CREATE OR REPLACE FUNCTION public.setof_test() RETURNS
| SETOF text LANGUAGE sql STABLE STRICT AS
| $function$ select unnest(array['hi', 'test'])
| $function$ ;
| tunetheweb wrote:
| Looks like we didn't support SETOF.
|
| Added this here: https://github.com/sqlfluff/sqlfluff/pull/1522
|
| Demonstrating how easy it is to add this sort of thing to the
| project thanks to how the code is structured! :-)
| munk-a wrote:
| I really personally dislike how you fail to shout the keywords
| in the above function "that actually matter" - all the set
| dressing and obvious stuff is capitalized - but then the meat
| of the function is all in lowercase hiding in the middle.
| munro wrote:
| The code looks really great [1], very declarative, looks like
| it should be pretty easy to add better PostgreSQL support. I
| also wonder if how easy it would be to plug it into VS Code,
| and build in an auto formatter (o ). At a glance it looks like
| it's parsing AST, which is really hard to find for SQL, so auto
| formatter could be possible to build from this. I notice SQLite
| could use some improving too, seems like a great project to
| coalesce around.
|
| [1]
| https://github.com/sqlfluff/sqlfluff/blob/main/src/sqlfluff/...
| tunetheweb wrote:
| There already is a VS Code extension for this: https://market
| place.visualstudio.com/items?itemName=dorzey.v...
|
| https://github.com/sqlfluff/vscode-sqlfluff
| amshank wrote:
| Auto-formatter is already built in for some simple
| violations:
| https://docs.sqlfluff.com/en/stable/cli.html#sqlfluff-fix
| [deleted]
| twoodfin wrote:
| Most of the rules seem reasonable, but what's the justification
| for this?
|
| _Rule_L031: Avoid table aliases in from clauses and join
| conditions._
|
| I can't see what makes aliases clarifying in `select` or `where`
| but not in `join`. Also, inlined subqueries in a join must be
| given an alias, so there are cases where this rule can't be
| applied consistently.
| cryptonector wrote:
| This rule is no good.
| pwildenhain wrote:
| I think it depends on if your table names are readable or not.
| If you table names are "sales" and "orders" then I would argue
| that makes the sql a lot more readable i.e
|
| sales.amount orders.due_date
|
| vs
|
| s.amount ord.due_date
|
| In the second example I have to flip back and forth between the
| join statement in order to know where the columns are coming
| from.
|
| If you have awful table names though like qq123xc then I think
| aliasing would be appropriate
| cryptonector wrote:
| The moment you have the same table used more than once as a
| table source (in a JOIN) then this goes out the window.
|
| Moreover, there's nothing wrong with saying `SELECT ... FROM
| sales sales ...`.
| Orou wrote:
| Wholeheartedly agree with this. The problem isn't aliasing,
| it's lazy aliasing which is more I think what GP was
| referring to (e.g. emp.id instead of employee.id).
| mulmen wrote:
| Aliases rarely add clarity in my experience. An alias is just
| an extra lookup when I need context. It's a waste of mental
| swap space.
|
| Inlined subqueries are creating entirely new projections so
| they need descriptive names. If you are doing this you should
| consider a WITH anyway.
| tptacek wrote:
| I wondered how much good you could do with an SQL linter that
| didn't have any access to the underlying schema, and the answer
| seems to be "mostly whitespace and capitalization issues".
| laumars wrote:
| Makes you wonder if this tool would be better a code formatter
| rather than linter?
| forgetfulness wrote:
| It seems to have a formatter option in the `fix` argument you
| can give it.
|
| The authors should probably be selling it on that point since
| it seems to be its strength, it doesn't have many rules along
| the lines of the static analysis that people look for in
| linters.
|
| Some people (tech leads, CTOs) care _a lot_ about formatting
| but I wouldn 't make teammates go through a linter for that,
| it's a rather surface concern.
| amshank wrote:
| The reason that those people care about formatting is that
| it matters when the code is used to communicate with other
| people, not just with machines. Having a consistent style
| within the team, means the team can communicate more easily
| - even if the net effect for an individual working on a
| codebase alone is minimal.
| forgetfulness wrote:
| Yes but realistically, save for the code that someone
| who's really inexperienced or sloppy would write,
| fixating over whether someone is writing
| select a, b, c from foo where bar =
| 'quux'
|
| rather than SELECT a,
| b, c FROM foo
| WHERE bar = 'quux'
|
| is petty and immaterial to what the code communicates.
|
| Worse if it's about details within the roughly the same
| style.
|
| Edit: but if there's a code formatter that can be
| seamlessly added to the development environment, it makes
| sense to use it. It's just that if there's no tool
| support, such things are of little contribution.
| munk-a wrote:
| Code isn't meant to be read by computers - it's meant to
| be read by humans (including yourself in three years when
| you've forgotten everything). Code sugaring and style
| adherence are vital portions of retaining a maintainable
| and flexible codebase.
| forgetfulness wrote:
| Yeah I know the spirit. I'm saying it amounts to
| absolutely nothing in readability. If there's any
| formatting at all, it's structure and not particular
| whitespace and comma placement conventions that make or
| break readability, the rest is code quality theater.
| lazide wrote:
| It does actually matter quite a bit in readability in my
| experience - when you have a bunch of noobs, and a bunch
| of senior folks, and a bunch of interns, and a bunch of
| contractors all slinging code around for review to each
| other.
|
| Many folks will get sloppy (in syntax and cleanliness) or
| misunderstand things - which requires the reviewer be
| able to clearly and quickly see what they are trying to
| do to catch that - especially as deadlines approach. The
| more it slips, the more slipping becomes the norm, and
| the messier and harder to understand everything gets -
| which makes later work harder as well.
|
| If what you're showing is the standard, I guarantee 25%
| or more of the requested changes (which already won't
| meet whatever standard ANYONE sets strictly when first
| proposed) will be even worse. If that even worse becomes
| standard, etc, etc.
|
| Part of their job is to set the 'reasonable' ideal, and
| attempt to enforce it. It won't happen universally, or
| even necessarily often, but it pushes things more towards
| maintainability and obviousness, which is opposite of the
| normal trend in any group of people. The larger the
| group, the more of a problem it is, and the harder they
| need to work to do it.
| laumars wrote:
| In trivial examples when we are all relaxed then yes, you
| would be right. It's non-trivial occasions when we are
| stressed and trying to read someone elses code at two in
| the morning when I really value strict code style
| guidelines.
|
| The ability to read complex code when we are in
| situations where we are not functioning at 100% is vastly
| under valued when one talks about developers capacity for
| comprehension.
| tomnipotent wrote:
| I think a not-so-small-part of the success of JS/CSS linting
| was that large companies like AirBnB, GitHub, and Google made
| their rules public and helped with sane defaults. Almost every
| project I came across used one of these "defaults" and added
| custom rules as needed.
|
| Sqlfluff seems to ship with 48 "built-in" rules and opinions on
| spacing and capitalization. If it catches on, I think the
| larger community will mostly benefit after other companies
| "doing work" with it release their own collection of rules.
| snidane wrote:
| There surely is a need for a sql linter, but if it wants to be
| strict about syntax style, it should better familiarize itself
| with not-noob styles first.
|
| Eg. some people like to align the clauses vertically and use a
| leading, not trailing, comma. select a
| , b , count(*) as count from table
| where cond1 and cond2 group by a, b
| order by count desc
| data_ders wrote:
| could a synonym for "more not-noob" perhaps be "more esoteric"?
| rob74 wrote:
| This whole discussion here shows that the best time to
| introduce a linter/autoformatter is at the same time as the
| language it's supposed to lint/format (like Go did). Otherwise
| you will have interminable discussions and everyone will want
| their favorite style supported. I'm afraid that ship has sailed
| for SQL though...
| forgetfulness wrote:
| Leading comma is configurable, but I wouldn't know if the
| linter allows for that layout. Can't say I've seen it before,
| is it specific to a particular development environment?
| laumars wrote:
| It's an informal way some people write SQL if they're
| developing a script.
|
| I've done it this way before because it was quicker to edit,
| and and remove lines. But it's not a format I'd particularly
| like to share with others because it looks weird.
| mrbungie wrote:
| I've seen it a lot in Data Science/Analytics environments, as
| it allows for faster query iterations than trailing commas.
| forgetfulness wrote:
| Yes you can find sources advocating for leading commas and
| it's commonly used. I have "Google BigQuery: The Definitive
| Guide" on my desktop, it was written by senior Google
| employees involved in the steering and use of the product
| inside, for what it's worth, and it advocates for leading
| comma and SHOUTING THE KEYWORDS.
|
| I personally disobey both recommendations (I mean, the book
| is worth it for other reasons) but both those conventions
| are very commonly used.
|
| I meant more the the "center" layout that the commenter
| above was describing, it'd seem quite laborious to maintain
| by hand in most environments, if not using a particular one
| that supports it out of the box.
| y4mi wrote:
| jetbrain IDEs have it configured by default for most
| situations, so its generally just an auto-format away if
| you're actually writing sql and not embedding sql queries
| in some other code.
| darksaints wrote:
| If SQL would just allow a trailing comma on the last column,
| none of this leading comma shit would be necessary. I still use
| trailing commas, because code readability is more important
| that writeability IMO, but the fact that I'm forced to
| sympathize with such an ugly shit sandwich is really
| frustrating.
|
| Can't sympathize with the clause alignment though. And I can't
| figure out why someone who does leading commas for editability
| would chose a style that forces you to use different
| indentation levels for different lines. Oh wait, you wanna make
| that join a left join? Time to go re-indenting my whole damn
| query...
|
| It's not even an improvement in readability...who actually
| reads text like it's a column format? It's not a spreadsheet.
| nmstoker wrote:
| Yes, despite knowing how to do it right those damn comma
| issues always come up at the worst time and to add insult to
| injury they're reported by most of the tools I have available
| like they're totally insoluble to the computer (even though
| far more taxing things are routinely fixed automatically).
| marcosdumay wrote:
| The code you get with a trailing comma is valid SQL just
| like the one you get without, and both have different
| meanings.
|
| The language is just too lenient with syntax, what is quite
| on the line for a "human legible" language from the late
| 70's and earlier 80's, but a really undesirable
| characteristic in practice. Unfortunately, nobody is doing
| a "better SQL", just radically different things that often
| not even support the same DB engines. And those have a real
| hard time getting popular, so we are stuck with the late
| 70's speculative innovations.
| extr wrote:
| Yes! I write SQL this way. I grant the comma thing is more a
| matter of taste, but aligning clauses vertically and maximizing
| rivers is super useful to me and it forever mystifies me that
| it isn't standard. This is an extreme example, but I like to
| keep join clauses vertically aligned as well:
| select a , b , count(*) as count
| from table join table2 on table.key1
| = table2.key1 and table.key2
| = table2.key2 and DATEADD('day', 1,
| table.key3) = table2.key3 where cond1 and
| cond2 group by a, b order by count desc
|
| For me this lets me read the SQL query almost like a picture at
| first glance, identifying blocks of rivers as semantically
| connected groups of statements. Some SQL formatters out there
| love to put join clauses at the same indentation level as the
| join itself, drives me nuts.
| new_guy wrote:
| That formatting makes my eyes bleed.
| RussianCow wrote:
| Do you format your code this way by hand or does your IDE
| help you? Also, I can only imagine how noisy your diffs would
| be when you make changes and have to indent everything.
| extr wrote:
| I'll use PyCharm's SQL formatter sometimes to massage
| spaghetti code that someone else wrote, but mostly I just
| write it this way by hand.
|
| I will say my use case is probably different than others,
| I'm a Data Scientist so I write a ton of throwaway SQL and
| stuff that will only be seen by me. If I'm dealing with an
| existing SQL codebase or production stuff, definitely try
| to keep it more easily diff-able (like left justifying the
| keywords, really only caring about rivers in the join/where
| clauses), or just follow whatever patterns are already
| there.
| Orou wrote:
| > it forever mystifies me that it isn't standard
|
| I can give a few thoughts on why here:
|
| - it doesn't work for collaborative efforts unless everyone
| is using the same linter or IDE with the same configuration.
| Not friendly to edits outside of that ecosystem (e.g. open
| source). - there are other ways of writing SQL (e.g. dbt
| style guide) which are still very readable while also being
| much faster to write by hand without the need for an auto-
| formatter. - some of these patterns such as the from and join
| indentations have nothing to do with readability and are
| purely stylistic. Nothing wrong with that, but there's
| nothing objectively better about this pattern - it just comes
| down to how you're used to reading SQL.
| marcosdumay wrote:
| This looks nearly the same, but it's much easier to write and
| generates less spurious diffs: select a
| , b , count(*) as count from table
| join table2 on table.key1 = table2.key1
| and table.key2 = table2.key2 and DATEADD('day, 1,
| table.key3) = table2.key3 where cond1 and
| cond2 group by a, b order by count desc
|
| The largest difference is that the equal operators aren't
| aligned anymore, but on my opinion, aligning them is really
| not worth the cost.
| kcolford wrote:
| I hate you and and the awful layout you have just introduced me
| to... Also, that is how I will be formatting all my SQL from
| now on.
| fnord123 wrote:
| alt styling SELECT a, b,
| COUNT(*) as count FROM table
| WHERE cond1 AND cond2 GROUP BY
| a, b ORDER BY count DESC
| munk-a wrote:
| Capitalized keywords and section declarations on their own
| lines helps with quick comprehension for me at least
| EamonnMR wrote:
| That's the one I'm familiar with. The capitalized keywords
| help give you that Relational Database feel.
| snidane wrote:
| I've seen this one in the wild and have objections to it.
| - why capitalize KEYWORDS. We're no longer in 1970s. We use
| colors - why waste so much space. Why not just put
| table on the same line with from. With joins this just
| bloats up - for complicated conditions they have to
| be put on their own lines. Do I give each AND a separate
| line and waste ever more space?
| bntyhntr wrote:
| My editors and code review tools do not format or
| highlight sql for me when it's embedded in other files -
| eg java. I find capitalized keywords to be very
| convenient. Of course at that point, I can't run sqfluff
| on it either but this thread has already devolved fairly
| quickly.
| wodenokoto wrote:
| - Half my SQL queries are written inside some other
| script, like Python or bash, so capitalization is nice.
|
| - Space is not wasted when it it makes code more
| readable.
|
| - Yes, and also give the columns in select their own
| lines. So nice to read. Space well spend!
| snidane wrote:
| Good points. Maybe there is a difference between a SQL
| script embedded occasionally somewhere in python, for
| which this style probably makes sense.
|
| For heavy SQL codebases found in data analytics, there
| are better styles.
| fiddlerwoaroof wrote:
| I've always formatted my SQL like this, but I don't remember
| learning it from anyone else or seeing anyone else do it.
| mulmen wrote:
| Please do not use leading commas. It doesn't make any sense.
| Where else do you ever use them? Do you use them with an IN
| clause? What about in JSON?
| amshank wrote:
| Most of the SQLFluff style was based around the Mozilla SQL
| style guide and the dbt labs style guide:
| https://docs.telemetry.mozilla.org/concepts/sql_style.html
| https://github.com/dbt-labs/corp/blob/master/dbt_style_guide...
|
| The GitLab SQL style guide isn't far off either:
| https://about.gitlab.com/handbook/business-technology/data-t...
|
| Rather than supporting _all_ ways of formatting SQL, the aim of
| the tool was to fairly flexibly support the most mainstream
| ones with a view to becoming more opinionated in future to
| converge on a more unified style in future. Similar to what
| black has done with python.
| dugmartin wrote:
| Or the ultimate pedant (pedultimate?) order which matches the
| SQL order of operations (which for some queries can be
| illuminating): from ... where ...
| group ... having ... select ... order
| by ... limit ...
| wodenokoto wrote:
| Is that legal SQL? I'd love to write in that order.
|
| When working in Big Query, it would be so nice to first
| declare the table, so the online IDE can autocomplete
| columns.
| forgetfulness wrote:
| Standard SQL and most dialects enforce that in the syntax
| anyway no?
|
| Edit: misread and select was in the middle in that comment
| CamouflagedKiwi wrote:
| No, all databases support the common order of SELECT ...
| FROM ... WHERE ....
| forgetfulness wrote:
| You're right, I somehow missed that select was in a
| different order.
| ThePadawan wrote:
| This is a linter which complains about spaces and newlines, but
| lets "WHERE col = NULL" pass (which can never be meaningful) just
| fine?
|
| Obviously this depends on the rules configuration, but just
| judging that first smell test, it seems very ill thought out.
| tunetheweb wrote:
| Seems like a good thing to check for! So we added a rule for
| this: https://github.com/sqlfluff/sqlfluff/pull/1527
|
| Including autofixing if wanted.
|
| Will be in the next release.
| justusw wrote:
| I guess it depends on what you want out of a linter e.g. if you
| want it to check just syntax or semantics as well. Is "= NULL"
| syntactically correct?
| danachow wrote:
| The compiler checks syntax - if that's all a linter did it
| wouldn't be very useful.
| munk-a wrote:
| `= NULL` is perfectly legal syntax - but it isn't doing
| what you think it is.
| ThePadawan wrote:
| That's what parent is saying. He is contrasting a
| compiler (which this isn't) to a linter (which this
| arguably also isn't).
| ThePadawan wrote:
| A linter is supposed to tell you "you are probably making a
| mistake."
|
| "= NULL" is as syntactically correct in SQL as "if (x =
| null)" in C.
|
| If it wasn't, code highlighting or compiling would break,
| both of which are much much more obvious.
| [deleted]
| Genmutant wrote:
| In MS SQL Server, with ANSI_NULLS set to OFF, it will select
| all rows having a NULL value for col. AFAIK not used that
| often.
| marcosdumay wrote:
| That is a serious gotcha of SQL Server. It (and Oracle) has
| some serious problems with unique constraints because of that
| kind of stuff.
| jdub wrote:
| I've been using sqlfluff with dbt for a while now, and not only
| has it been a delight to use, the maintainers are incredibly
| responsive.
| nycjay wrote:
| Sorry - didn't get a chance to go through the documentation much
| yet... A problem we are looking to catch early in the development
| process is for things we don't want folks to do, even if they are
| technically correct. For example, dropping a column or altering
| the datatype/nullability in such a way that the table goes into
| reorg pending. I know we can write our own regex to look for that
| specific syntax, but I've not yet found a tool with those kinds
| of rules either build out already, or easy to add/maintain.
| amshank wrote:
| It's pretty new functionality in sqlfluff, but it now supports
| user defined plugins for org-specific rules if you want to
| forbid something more obscure. Documentation is sketchy, but
| you can see the proof of concept here:
| https://github.com/sqlfluff/sqlfluff/tree/main/plugins/sqlfl...
| Aeolun wrote:
| I've been using this for a while and I've much enjoyed both the
| autoformatting as well as the codebase (when submitting PRs).
| sails wrote:
| I think this is primarily aimed at analytics and data modelling
| SQL, the type of SQL in dbt [], though I'd be happily corrected
| by the team.
|
| As far as I understand, this has been quite a long time coming,
| and is far from trivial due to the diversity of SQL dialect.
|
| My use case would be to have this attached to a dbt project that
| ensures the dbt SQL files all confirm to a passable degree of
| consistency when working across a team. Especially as a CICD
| automated bot action thing on github.
|
| [] https://www.getdbt.com/
___________________________________________________________________
(page generated 2021-10-06 23:01 UTC)