[HN Gopher] I wrote a new JIT compiler for PostgreSQL
       ___________________________________________________________________
        
       I wrote a new JIT compiler for PostgreSQL
        
       Author : mattashii
       Score  : 200 points
       Date   : 2024-03-18 11:57 UTC (11 hours ago)
        
 (HTM) web link (www.pinaraf.info)
 (TXT) w3m dump (www.pinaraf.info)
        
       | pinaraf wrote:
        | Author here. Thanks for submitting my article to Hacker News.
        | I'll do my best to answer any questions.
        
         | frizlab wrote:
         | Not a question, but I love this. I'm eager to see its
         | evolution.
        
         | pgaddict wrote:
          | Would be a great topic for pgconf.eu in June (pgcon moved to
          | Vancouver). Too bad the CfP is over, but there's the
          | "unconference" part (though the topics are decided at the
          | event, so no guarantees).
        
           | mattashii wrote:
           | Did you mean pgconf.dev in May (which has the unconference),
           | or pgconf.eu in October (which doesn't have an unconference,
           | but the CfP will open sometime in the - hopefully near -
           | future)?
        
             | pgaddict wrote:
             | Yeah, I meant May. Sorry :-( Too many conferences around
             | that time, I got confused.
             | 
             | That being said, submitting this into the pgconf.eu CfP is
             | a good idea too. It's just that it seems like a nice
             | development topic, and the pgcon unconference was always a
             | great place to discuss this sort of stuff. There are a
             | couple more topics in the JIT area, so having a session or
             | two to talk about those and how to move that forward would
             | be beneficial.
        
         | can3p wrote:
          | Nice post, thanks! Do I read it right that using the JIT
          | results in the worst max times? What could be the reason, in
          | your opinion?
        
           | pinaraf wrote:
            | Two parts. First, I ran the benchmark on a laptop and didn't
            | spend enough time pinning its runtime power management to a
            | fixed state; I'll run a real pgbench on my desktop once I
            | implement all the opcodes it requires. Second, JIT
            | compilation has a minimum cost (about 300us in my tests), so
            | on such small runtimes that overhead can quickly outweigh
            | the benefits: a query that only runs for a few hundred
            | microseconds can't recoup a ~300us compilation step.
        
         | winternewt wrote:
          | Is there a fundamental difference between copy-and-patch with
          | C and what compilers do when they target intermediate
          | representations? It seems to me that traditional compilation
          | methods are also "copy and patch", just with an intermediate
          | language other than C.
        
           | tetha wrote:
           | I think conceptually, there is no real difference. In the
           | end, a compiler outputting machine code uses very small
           | stencils, like "mov _ _", which are rather simple to patch.
           | 
            | Practically though, it's an enormous difference, as the
            | copy-and-patch approach reuses the years of work that went
            | into clang/gcc: platform support, per-platform optimizations,
            | and so on. The approach enables a much larger pool of people
            | ("people capable of writing C" vs "people capable of writing
            | assembly / machine code") to implement very decent JIT
            | compilers.
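            | 
            | As a rough illustration (not code from the article), a
            | copy-and-patch stencil can be written as plain C in which
            | extern symbols act as "holes"; the compiler turns them into
            | relocations that the JIT patches after copying the machine
            | code:
            | 
            |     /* Hypothetical stencil, compiled ahead of time into an
            |      * object file. The extern symbols become relocations
            |      * ("holes") that the JIT fills with concrete constants
            |      * and addresses before the copied bytes run. */
            |     extern char HOLE_CONST;              /* patched immediate */
            |     extern long NEXT_STENCIL(long acc);  /* patched jump */
            | 
            |     long stencil_add_const(long acc)
            |     {
            |         acc += (long) &HOLE_CONST;   /* hole: a constant */
            |         return NEXT_STENCIL(acc);    /* chain the next op */
            |     }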
        
             | pinaraf wrote:
             | The real difference is in the possible optimizations. If
             | you consider the full scope of JIT compilation in for
             | instance a web browser or the JVM, you could use copy and
             | patch as a tier 0 compiler, and once really hot paths are
             | identified, trigger a complete compiler with all the
              | optimizer steps. Some optimizations are more complicated to
              | implement with copy-and-patch, especially if you can't use
              | all the tricks described in the paper (for instance they
              | use the ghccc calling convention to get a much finer
              | register allocation, but from the documentation I don't
              | think that one is going to make it into PostgreSQL).
              | 
              | But as you say, yes, this enables people capable of writing
              | C and reading assembly (unless you're perfect and never
              | have to step through your compiled code in gdb), and it
              | makes the job so much faster and easier... Writing several
              | machine-code emitters is painful, and having the required
              | optimization strategies for each ISA is quickly out of
              | reach.
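              | 
              | As a hedged sketch of the chaining trick (clang-specific,
              | made-up names, using clang's musttail attribute instead of
              | the paper's ghccc setup): the evaluation state travels in
              | argument registers and each stencil ends in a guaranteed
              | tail call, so the compiler never spills it and the call
              | becomes a plain jump into the next stencil.
              | 
              |     /* Hypothetical stencil chaining. The musttail
              |      * attribute forces a tail call, so the caller's frame
              |      * is reused and "acc"/"slots" stay in registers. */
              |     extern long NEXT_STENCIL(long acc, long *slots);
              | 
              |     long stencil_load_slot0(long acc, long *slots)
              |     {
              |         acc = slots[0];     /* the opcode's actual work */
              |         __attribute__((musttail))
              |         return NEXT_STENCIL(acc, slots);
              |     }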
        
       | adzm wrote:
        | I'm still surprised there isn't a query/plan cache for
        | PostgreSQL. I could easily see these two approaches working in
        | harmony once one exists, as frequent queries could end up cached
        | and more aggressively optimized, with the cache offsetting the
        | compilation cost. Of course that adds a whole new layer of
        | complexity and trouble.
        
         | pinaraf wrote:
          | Honestly I thought the same as you, then I wrote this, and I
          | now understand it's going to be really hard to do. To make it
          | very simple: there are pointers to query parts "leaking"
          | everywhere across the execution engine. Removing them will
          | require a significant overhaul of the execution engine, the
          | planner and who knows what else. Even in a single session, two
          | compiled queries will have different compiled code because of
          | that (both LLVM and my copyjit have to inject the addresses of
          | various structs into the asm code).
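          | 
          | A minimal sketch of that address injection (illustrative, not
          | the article's code): after copying the stencil bytes, the JIT
          | writes the absolute address of a per-query struct straight
          | into a relocation hole, so the resulting machine code is only
          | valid for that one query in that one process.
          | 
          |     #include <stdint.h>
          |     #include <string.h>
          | 
          |     /* Fill an 8-byte hole in the copied stencil with the
          |      * absolute address of a runtime struct. Another query, or
          |      * another backend process, gets a different address, so
          |      * the code blobs can never be shared or cached. */
          |     static void patch_abs64(uint8_t *code, size_t hole_off,
          |                             const void *target)
          |     {
          |         uint64_t addr = (uint64_t) (uintptr_t) target;
          |         memcpy(code + hole_off, &addr, sizeof(addr));
          |     }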
        
           | adzm wrote:
           | Just going to say, I'm blown away by how simple this JIT is
           | though. Really quite a beautiful JIT approach.
        
             | pinaraf wrote:
              | Same for me, that's why I did this after finding that
              | research paper. With the proper compiler settings and a few
              | small tricks you can remove some parts and already end up
              | faster than the interpreter (because you remove some
              | branches and a few memory accesses), and it's even possible
              | to create "super-stencils" covering typical opcode series
              | and optimizing them further. Or the opposite,
              | "sub-stencils", in order to do some loop unrolling for
              | instance.
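              | 
              | Conceptually (illustrative names, not PostgreSQL's actual
              | expression opcodes), a super-stencil just fuses a common
              | opcode sequence into one pre-compiled unit, so the
              | hand-offs between the individual stencils disappear:
              | 
              |     extern long NEXT_STENCIL(long acc, long *slots);
              | 
              |     /* Fused "fetch, fetch, add" sequence: one stencil
              |      * instead of three, letting the ahead-of-time
              |      * compiler optimize across the boundary. */
              |     long superstencil_var_var_add(long acc, long *slots)
              |     {
              |         long a = slots[0];
              |         long b = slots[1];
              |         acc = a + b;
              |         return NEXT_STENCIL(acc, slots);
              |     }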
        
         | williamdclt wrote:
          | (The article goes a bit above my head, so apologies if I'm a
          | bit off-topic.)
          | 
          | There is a form of query plan caching in PG: for prepared
          | statements, if PG determines that the actual parameter values
          | won't affect the query plan much, it uses a "generic plan" and
          | reuses the same plan for every execution of the prepared
          | statement
          | (https://www.postgresql.org/docs/current/sql-prepare.html, see
          | "Notes").
        
           | pinaraf wrote:
            | Indeed, and right now it's the only possible way since it
            | stays within a single session; doing otherwise would be very
            | hard.
        
             | cbsmith wrote:
             | Unless you count stored procs...
        
           | SigmundA wrote:
            | Yes, it's manual and per-session. DBs like MSSQL have that
            | as well, but it's very rarely used anymore because MSSQL got
            | automatic plan caching about 20 years ago, which basically
            | eliminates any advantage of manually preparing. It's
            | actually better, since the cache can be shared across all
            | sessions.
        
         | aeyes wrote:
         | The plan cache on Oracle in combination with prepared
         | statements where the optimizer can't peek into the parameters
         | has been really problematic for me in the past. I usually had
         | to go in and either add hints or force a plan.
         | 
         | Even simple queries like SELECT * FROM t WHERE x = TRUE; could
         | turn into a nightmare depending on the distribution of the x
         | values in the table.
         | 
         | With Postgres I rarely encountered such problems but I must
         | admit that I haven't used Postgres with prepared statements.
         | 
         | I have seen some queries with slow planning time (>100ms) where
         | a cache could have been useful but I don't remember ever really
         | needing to optimize one.
        
           | SigmundA wrote:
            | Is x a parameter? Because it doesn't look like it. MSSQL has
            | parameter sniffing and will make multiple plans based on
            | incoming parameters; I would be surprised if Oracle does not
            | do the same. Sniffing parameters can actually be problematic
            | at times, and it can be disabled with a hint!
        
             | aeyes wrote:
             | I just made up something simple but yes, I had these
             | problems with bind variables in prepared statements.
             | 
             | > SELECT * FROM t WHERE x = :var;
             | 
             | But I haven't used Oracle in years and back in the day,
             | there was no bind variable peeking.
        
       | weliveindetail wrote:
       | > There is one way to make the LLVM JIT compiler more usable, but
       | I fear it's going to take years to be implemented: being able to
       | cache and reuse compiled queries.
       | 
       | Actually, it's implemented in LLVM for years :)
       | https://github.com/llvm/llvm-project/commit/a98546ebcd2a692e...
        
         | pinaraf wrote:
         | Yeah, well, sorry, I should have been more explicit here: the
         | issue is with PostgreSQL, not LLVM. The JIT compiler has to
         | inject direct memory addresses, making the generated code
         | specific to your query and process.
        
           | weliveindetail wrote:
           | Interesting, because we store relocatable objects. And
           | process symbols can be resolved by name if you really want.
           | It might be yet another performance trade-off though.
        
         | SigmundA wrote:
          | Since PG uses one process per connection and the LLVM JIT code
          | is process-specific, the code can't be shared amongst all
          | connections to the DB.
          | 
          | Plans themselves suffer from this, since they are in-memory
          | data structures not designed to be shared amongst different
          | processes.
          | 
          | DBs like MSSQL don't have this issue since they use a single
          | process with threads, which is also why they can handle more
          | concurrent connections without an external pooler. MSSQL can
          | also serialize plans to a non-process-specific representation
          | and store them in the DB for things like plan locking.
        
           | hans_castorp wrote:
           | > Plans themselves suffer from this since they are in memory
           | data structures not designed to be shared amongst different
           | processes.
           | 
           | Oracle uses a process-per-connection model as well (at least
           | on Linux), and they are able to share execution plans across
           | connections. They put all the plans into the "global" shared
           | memory.
        
             | SigmundA wrote:
             | Looks like you can change that with THREADED_EXECUTION to
             | make it act like it does on Windows with a single process
             | and threads:
             | 
             | >On UNIX, starting with Oracle Database 12c Release 2
             | (12.2), Oracle Database can use an operating system process
             | or an operating system thread to implement each background
             | task such as database writer (DBW0), log writer (LGWR),
             | shared server process dispatchers, and shared servers.
             | 
             | The use of operating system threads instead of processes
             | allow resource sharing and reduce resource consumption.
             | 
             | On Windows, each background process is implemented as a
             | thread inside a single, large process.
             | 
             | https://docs.oracle.com/en/database/oracle/oracle-
             | database/1...
             | 
              | Processes on Windows are typically much more expensive than
              | on Unix, so using threads has always been preferred to
              | multi-process designs; perhaps that's why MSSQL only has
              | that option, with an almost fully recreated internal
              | process model that you can list, kill, etc.
              | 
              | Even Oracle says it helps with resource usage, even on
              | Unix/Linux. It also looks like Oracle has had some kind of
              | shared mode for a long time, where it basically has a
              | built-in pooler to keep the actual OS process count down,
              | not 1:1 like PG.
              | 
              | Sharing plans can obviously be done using shared memory,
              | but it's not as simple as just creating some C++ object
              | model (which I believe is what PG has internally) for the
              | plan: it must have a process-agnostic data format that is
              | then executed, probably by deserializing it into an
              | executable model from shared memory. Fully JITted code is
              | even trickier than a set of logical plan operations. With
              | threads you just share executable code.
        
           | hinkley wrote:
           | Way back on Oracle 9i, we had a mystery stall problem. We
           | couldn't saturate the network, the CPU, or the fiber channel
           | links. We were stuck at ~50% and stumped. Some fuckery was
           | going on and we had to call in a professional.
           | 
           | Turned out 9i could only run queries that currently resided
           | in the query cache, and some idiot (who was now my boss) had
           | fucked up our query builder code so that we were getting too
           | many unique queries. Not enough bind variables.
           | 
           | So it's clear Oracle was using a shared cache back then, but
           | like other people here, I'm scratching my head how this would
           | work with Postgres's flavor of MVCC. Maybe share query plans
           | when the transaction completes?
           | 
           | I feel like that would get you 90% of the way but with some
           | head of queue nastiness.
        
       | miohtama wrote:
        | Back in the 386 era, there was a concept of self-modifying code
        | (in assembly). It was similar to the stencils presented here,
        | but because the code was a singleton, a copy was rarely made.
        | 
        | E.g. Doom on DOS used this optimisation technique, because
        | otherwise you could not squeeze enough performance out of tight
        | rendering loops on old CPUs.
        
       ___________________________________________________________________
       (page generated 2024-03-18 23:00 UTC)