[HN Gopher] Postgres Language Server: Implementing the Parser
___________________________________________________________________
Postgres Language Server: Implementing the Parser
Author : todsacerdoti
Score : 123 points
Date : 2023-12-08 12:33 UTC (10 hours ago)
(HTM) web link (supabase.com)
(TXT) w3m dump (supabase.com)
| kiwicopple wrote:
| (i'm on the supabase team)
|
| this is an update from the launch here:
|
| https://news.ycombinator.com/item?id=37020610
|
| The focus for the past few months has been getting the Parser
| right. It has been a lot of work. I'll ping Philipp and get him
| to join the discussion if there are any questions
|
| We'd also love more contributors, if this sort of thing is up
| your alley
| specialist wrote:
| > _...getting the Parser right. It has been a lot of work._
|
| True.
|
| I have an ANTLR4 grammar for SQL. Am noob. Getting things
| "correct" was some effort. I have the quixotic goal of
| accepting multiple dialects. It's a lot of time referencing
| docs, playing with SQL fiddles, and learning how other grammars
| work.
|
| Now working on performance. Just as tough for noob me. TIL That
| means removing any potential ambiquities at runtime. ANTLR4's
| ALL(*) algorithm does some runtime magic whenever the happy
| path fails. Which tanks performance. Great for making "good
| enough" grammars. Not so great for chewing thru a corpus of
| tests.
|
| SQLite has a terrific test suite. Including generated 'random'
| machine queries. My naive grammar took > 1000ms on these
| monsters, when it worked. After a few, it'd just ABEND with out
| of memory errors. Whoops!
|
| By removing ambiquities, those monsters now take < 100ms. I
| really shouldn't make any claims until all the tests pass. (I'm
| still trying to figure out if I can handle both T-SQL style
| bracketed [names] and Postgres style name[index] array
| references, in the same grammar, without symmantic predicates.)
|
| Any way. This experience has piqued my interest in PEG. I'm
| worried my LL(k) grammar is too brittle. Making it hard to for
| any one to maintain and adapt. As you well know, for sure. (For
| a taste of this challenge, peek at some of the other available
| ANTLR SQL grammars.) Oh well; that's a concern for later.
|
| I'm looking forward to checking out Supabase's grammar. I'll
| share mine (Show HN) when I think it's good enough.
| steinroe wrote:
| that's a huge task to take upon, looking forward to go
| through it! compared to you, we have gone the "easy" way and
| use the actual parser from the Postgres server. so no grammar
| definition and the like. our work was mainly around adapting
| libpg_query (which is build to parse executable SQL) for our
| use case.
| CoastalCoder wrote:
| > use the actual parser from the Postgres server. so no
| grammar definition and the like.
|
| Doesn't Postgres use Lex/Yacc to describe the grammar?
| sjansen wrote:
| It actually uses Flex/Bison, and does take advantage of
| the extended features they provide.
| atonse wrote:
| If we're all being postgres-specific, isn't there presumably
| a grammar somewhere in postgres's code that is used to parse
| the query SQL? Wouldn't you all want to use that same file as
| a source so it's 100% identical to what Postgres is actually
| doing?
|
| Genuine question. But maybe the tooling is incompatible.
|
| Also, have you looked at Azure Data Studio? I'm wondering how
| much you can reuse from the work they did.
| SR2Z wrote:
| Not necessarily. It might be a recursive descent parser (to
| generate better error messages).
| claytongulick wrote:
| Only tangentially related, but if it's of any interest I'm
| about 90% done with a PEG grammar for the PostgREST query DSL
| for similar reasons. Happy to share it if it'd be useful.
| steinroe wrote:
| hey, author here. Thanks for posting it!
|
| A bit of background: a few months ago we announced a Postgres
| language server[0]. A language server adds features like syntax
| error diagnostic and autocomplete to your editor (vscode, neovim,
| etc). We have iterated a lot on the parser over the past few
| months and want to share an update today.
|
| the parser is a core piece of any language server that constructs
| syntax trees from the raw input string. Usually first an untyped
| concrete syntax tree (cst) that represents the syntactic
| structure of the input, and subsequently a typed abstract syntax
| tree (ast) containing the meaning of the source.
|
| In our implementation, we leverage the actual Postgres parser to-
| do the heavy lifting. However, the parser is designed to parse
| executable SQL -- not to provide language intelligence. For
| example, it does not handle incomplete inputs, and outputs just
| the ast, not the cst. To use it for a language server we had to
| work around these limitations as good as possible.
|
| While we leverage procedural macros in rust to generate a lot of
| the repetitive parser code, there remains a portion that requires
| a bit of manual work. But the groundwork is completed, and we can
| finally start working on the data model and the actual server
| next. Our aim is to bring this to a usable state as swiftly as
| possible.
|
| Huge shout-out to pg_analyze for creating and maintaining
| libpg_query[1], without which this project would not be possible!
|
| [0] https://news.ycombinator.com/item?id=37020610
|
| [1] https://github.com/pganalyze/libpg_query
| kageiit wrote:
| Have you considered using something like a treesitter grammar?
| It could solve the editor specific uses cases like highlighting
| and even linting as it creates asts that are more amenable for
| a language server implementation
| chatmasta wrote:
| At Splitgraph we compiled tree-sitter to wasm for Postgres
| autocomplete. Indeed, one major reason we opted for it is
| that it has error handling so it can work with incomplete
| syntax (which is what your code is 90% of the time you're
| editing it).
|
| https://www.splitgraph.com/blog/parsing-pgsql-with-tree-
| sitt...
| steinroe wrote:
| thanks for the link, very interesting read! and you are
| right, libpg_query has its limitations.
|
| the idea is to first implement the parser with libpg_query
| and work around its limitations as good as possible. Since
| the scan api also returns all tokens for invalid sql, the
| language server will then have basic features and syntax
| error diagnostics for invalid statements, and advanced
| features for valid ones. once the server itself is done, we
| want to go back to the parser and replace the libpg_query-
| based parser with a more resilient alternative statement by
| statement. ultimately, the libpg_query-based parser should
| just be the fallback.
|
| that being said, very excited that there is so much
| development in postgres dx.
| chatmasta wrote:
| Yes, I think the optimal solution here is a combination
| of tree-sitter for real time (as-you-type) with a
| fallback to libpg_query. I mean it's technically the
| other way around, since libpg_query is preferred when it
| parses correctly. But yeah I think you inevitably need a
| combination. Pretty sure TS does similar things in
| VSCode.
| gen220 wrote:
| Thank you for doing this!
|
| Every year I get frustrated with a PostgreSQL formatter, look
| out into the webs for hope, and begrudgingly return to my sub-
| par editing experience.
|
| Please take your time to do a good job, and thank you, again!
| anarazel wrote:
| Have you investigated generating your parser from postgres'
| gram.y instead of basically basing it on the bison output?
| steinroe wrote:
| that's a very interesting idea!
|
| for now, our goal is to take the "easy" route with
| libpg_query and build a language server that provides basic
| lsp features for invalid sql, and advanced lsp features for
| valid sql as fast as possible. we then want to go back to the
| parser and replace the libpg_query-based approach with a more
| resilient alternative. as of now, the plan is to implement a
| handwritten recursive-descent statement by statement. will
| definitely do research to what extend we could leverage
| gram.y there, especially to potentially fast-track it.
| anarazel wrote:
| If some annotations or such would make that easier, it
| might be possible to do that upstream. Postgres currently
| has hand-generated code for tab-completion in psql and
| that's uh, not great (it has gotten large enough that msvc
| has trouble compiling the file).
| zubairq wrote:
| I Really like this!
| netcraft wrote:
| I can't wait for this. Just getting out a great AST would be
| amazing for things like formatters, but I want to build things
| like a library that knows everywhere you use a table in a
| particular way or reference a column, or linting that can help
| you identify problems with the way you do a join or forgot to
| apply the right filter for multi-tenant queries.
| steinroe wrote:
| that is definitely the goal, both a formatter and a linter. we
| want to add something like squawk and plpgsql_check directly to
| the language server, so you get eslint-like dx. with both the
| ast and the database schema at hand, you can basically add any
| rule you like.
|
| and this is far out, but eventually we are maybe even able to
| combine the language server with declarative schema management
| and have go-to-definition etc working.
|
| [0] https://github.com/sbdchd/squawk/tree/master [1]
| https://github.com/okbob/plpgsql_check
| mhh__ wrote:
| For a formatter you only need tokens and rules. ASTs are
| unnecessary unless you want to do transformations at the same
| time.
| 2mm3mm3 wrote:
| When will PG finally start using tasks instead of
| superheavyweight threads? When will they automatically
| precompile/hash statements to avoid reparsing? And when
| materialized views will be automatically updated? Where is an
| alternative to SQL Server Always On Availability Group released
| >10 years ago? PostgreSQL is the only database where guys
| recommend you to configure a separate connection pooler, because
| their database is unable to handle even mid. size number of
| connections.
| kiwicopple wrote:
| it's a great list. No database is perfect, and this is a good
| set of features that are missing from Postgres
|
| that said, the flexibility/extensibility of Postgres still
| makes it a great choice. You could easily point at other
| databases and produce a list similar (or longer) than this.
| Yasuraka wrote:
| I'll take vertical sharding until no longer possible and then
| dropping all further requests over touching Always On
| Availability Groups ever again
| anarazel wrote:
| When will you contribute?
| paulddraper wrote:
| lol, salty
| epgui wrote:
| Out of curiosity, why are parser combinators not the choice
| approach for something like this? Is it mainly a performance
| thing?
|
| PS-- this is super exciting!
| steinroe wrote:
| in our specific case, we needed something handwritten for the
| statement-level parser anyways. And the requirements for the LL
| parser are very simple: extract individual sql statements from
| a source input. we just compare the next n tokens with a list
| of tokens from which any statement starts, which is
| straightforward to implement with a handwritten LL parser.
|
| so after all, I would say its a decision based on the special
| requirements we have working around the limitation of
| libpg_query. I think its also the fastest one, but this was not
| the main reason.
| epgui wrote:
| Thanks for your response!
|
| I'm a little confused by it though... Maybe I am
| misunderstanding what you're telling me.
|
| I would definitely consider a parser implemented with parser
| combinators to be "handwritten". The main difference for me
| is that parser combinators allow you to express a language
| grammar in a more declarative way, which makes refactoring
| and correctness verification much simpler. I think what you
| describe as "straightforward" without combinators would be
| "completely trivial" with combinators.
|
| I know from reading recent comp sci papers (particularly in
| the Journal of Functional Programming) that "the parser
| problem" is not fully solved yet (!) and that there are
| various theoretical limits to all currently known formal
| approaches. I suspect that performance could be a limiting
| factor for certain types of syntaxes, although I'm not sure
| if Postgres' would be particularly problematic in this
| regard.
| steinroe wrote:
| sorry, I think I misunderstood your question!
|
| after all, we did not implement a "real" parser. we just
| use libpg_query, the actual Postgres parser, and work
| around its limitations as good as possible. The
| implementation thereby required maximum flexibility. we
| never define any grammar other than "a select statement
| starts with a SELECT keyword".
___________________________________________________________________
(page generated 2023-12-08 23:00 UTC)