[HN Gopher] Hacking the Postgres wire protocol
___________________________________________________________________
Hacking the Postgres wire protocol
Author : levkk
Score : 141 points
Date : 2025-04-15 14:33 UTC (8 hours ago)
(HTM) web link (pgdog.dev)
(TXT) w3m dump (pgdog.dev)
| skeptrune wrote:
| >we need to ... understand SQL syntax using a parser, Rust
| ecosystem has a great library called pg_query
|
| I'm impressed both that the Rust ecosystem had the right library
| available and that it was high enough quality code for the
| author's team to use it as a reference for building their own
| additional functionality.
| craigkerstiens wrote:
| Pgquery was created by the pganalyze team for their own
| purposes I believe initially for features like index
| recommendation tooling, but immediately planned as open source.
| It is indeed a very high quality project with the underlying C
| implementation having several wrappers that exist for a number
| of languages[1].
|
| [1]
| https://github.com/pganalyze/libpg_query/blob/15-latest/READ...
| levkk wrote:
| We're using it to rewrite queries too. It's a pretty cool
| library.
| bri3d wrote:
| Oddly, it actually originates from the Ruby ecosystem -
| `pganalyze` initially created `libpg_query` to parse Postgres
| in Ruby. `libpg_query` in C does the "magic" of reformulating
| Postgres's query parser into an AST generator, and serializes
| the resulting AST into a Protobuf. Then, `pg_query` bridges
| that into Rust.
| PaulHoule wrote:
| The best part I think is "pg_query is special. It doesn't
| actually implement parsing SQL. It works by extracting C source
| code directly from Postgres and wraps it with a nice Rust
| interface. This allows PgDog to understand all queries that
| Postgres can."
|
| One of the things that puts metaprogramming on wheels is being
| able to pull in grammars, schemas and such for external systems.
| Think how many cool tools we could have if we could do this in a
| more general way.
| pphysch wrote:
| I agree. Does anyone know much heavy lifting is done by
| pg_query in wrapping the Postgres code vs. Postgres in
| expressing that code in a manner that makes pg_query possible?
| ZiiS wrote:
| Tends to be a matter of opinion. Postgres does not expose the
| relevant functions, so
| https://github.com/pganalyze/libpg_query has to do some heavy
| lifting to convert their source code into a nice library.
| Conversely, Postgres is very well written code, in an
| extremely common language, with a stable release cadence, and
| such a long track record it is seen as the reference
| implementation for correctly parsing SQL.
| lfittl wrote:
| Yeah, as one of the main authors of libpg_query, I think
| the primary things that make this easier is that Postgres
| has good abstractions internally, and the parser works
| independently from other parts (e.g. the community
| discourages adding settings that affect parser behavior).
|
| Over the years we've only had to maintain a small set of
| patches on top of the Postgres source [0], together with
| some mocks and our libclang-based extraction logic [1]. Of
| course it would be nice if Postgres just packaged this
| directly like it packages the client libraries, but there
| is non-trivial effort involved to do that. From what I
| recall, the main issue is that error handling and memory
| allocations work differently in the client-side libraries
| (and so that would have to either also be moved out of the
| backend source, or use some other abstraction).
|
| [0]: https://github.com/pganalyze/libpg_query/tree/17-lates
| t/patc...
|
| [1]: https://github.com/pganalyze/libpg_query/blob/17-lates
| t/scri...
| mdaniel wrote:
| I had such high hopes for tree-sitter but once it went all "and
| then, $CC -c -o" all was lost :-(
| jitl wrote:
| What do you mean?
| boomskats wrote:
| Supabase's postgres LSP works in a similar way iirc.
| bri3d wrote:
| Yes, the same way. It's all based on the extremely useful
| `https://github.com/pganalyze/libpg_query` project, which is
| where the "extracted the parser from Postgres" part comes in.
|
| Supabase's LSP also uses tree-sitter for corrections and
| autocomplete, because one drawback of using the server's
| source is that pg_query only works on well-formed/executable
| SQL - when it detects a malformed query, it formulates an
| error and exits, since that's what you want in an SQL server.
| So for partially-correct syntax and fill-forward, tree-sitter
| covers the gaps.
| thenonameguy wrote:
| > One of the things that puts metaprogramming on wheels is
| being able to pull in grammars, schemas and such for external
| systems.
|
| That's what I've been trying to do with:
| https://github.com/schemamap/schemamap
|
| For a well-constrained Postgres schema, a deterministic SQL
| compiler can be built (with plenty of logic
| programming/constraint-solving/hypergraph-analysis) that can
| integrate arbitrary external systems data.
|
| While this is NP-hard, with some clever use of embeddings and
| ontologies, and use of every single constraint AST within the
| DB to reduce the search space, this becomes a feasible problem.
|
| For any Clojurists interested, I've packaged `pg_query`, so you
| can use it in your applications:
| https://github.com/schemamap/pg-query-clj
|
| At the moment I'm saving up for the next development cycle, so
| not only PG->PG schema mappings can be solved for (JSON schema
| is next!). Hope this sounds interesting :)
| ustad wrote:
| Has someone done this for MySQL?
| achanda358 wrote:
| https://vitess.io/
| grep_it wrote:
| > More complex examples, like IN (1, 2, 3) or id != 25 can be
| handled as well. For the former, we can hash all values and route
| the query to the matching shards. For the latter, we can do the
| opposite.
|
| This makes it sound like the query would be routed to all shards
| except the one matching HASH(25). But wouldn't id != 25 need to
| go to all shards, since there could be values in the id set that
| hash to the same shard as 25?
| levkk wrote:
| You're right. Good catch, I'll fix this in the article.
| jedberg wrote:
| Wrapping up the actual Postgres code is a brilliant idea. Then
| you never have to worry about updating your code to keep up with
| their changing API.
|
| And the fact that this works at all is another example of why
| Postgres is the default data store on the internet. It's battle
| tested and has a massive ecosystem of support tools, like this
| one.
| avinassh wrote:
| I am long on this project and excited about it.
|
| I am wondering, why it took so long for something like this to
| happen for Postgres, where as Vitess existed for MySQL more than
| a decade. I am not an expert, but I hear Citus is not really a
| fair comparison with Vitess.
| rubiquity wrote:
| pgdog represents maybe just the VTGate portion of Vitess.
| There's a whole lot more to Vitess or any replication-on-top of
| a database type service than just the query router.
| avinassh wrote:
| > pgdog represents maybe just the VTGate portion of Vitess
|
| That's today. The project is developing fast, so I am sure
| more things will be added :)
| levkk wrote:
| Exactly :)
| hamburglar wrote:
| It would be nice if they could go more into detail about how they
| decide whether a query is read or write. The casual mention of
| this makes it sound like they originally took a too-naive
| approach (looking at the first word) but even ignoring CTEs, you
| can still have an UPDATE within a SELECT subquery, which makes
| the problem very hard without parsing the whole query. Perhaps
| the C++ code extracted from pg handles this for them and returns
| some metadata that indicates whether there are any writes buried
| in the query?
| levkk wrote:
| Forgot to add that to the article, but yes, we parse the
| statement and route SelectStmt[1] to replicas. If there is an
| UPDATE in the SELECT, we don't handle that at the moment, but
| can be easily added.
|
| [1]
| https://docs.rs/pg_query/6.0.0/pg_query/protobuf/struct.Sele...
| hamburglar wrote:
| Do you have to traverse the syntax tree to determine that?
| levkk wrote:
| Yup.
| antirez wrote:
| > with 4 bytes added for itself
|
| Why one would want to do that? Only reason I can think of is, so
| that values < 4 can be reserved for future users, like new
| versions of client-server that agree on 64 bit payload if the
| first 4 bytes are zero or alike. But there are better ways.
| levkk wrote:
| I'm guessing so it's never zero and can't be confused with
| something else, like a bunch of NULs. There are a few messages
| that's don't have a payload, e.g. ParseComplete.
| antirez wrote:
| Makes sense, thanks.
___________________________________________________________________
(page generated 2025-04-15 23:01 UTC)