[HN Gopher] Why SQLite Uses Bytecode
       ___________________________________________________________________
        
       Why SQLite Uses Bytecode
        
       Author : todsacerdoti
       Score  : 703 points
       Date   : 2024-04-30 02:48 UTC (20 hours ago)
        
 (HTM) web link (sqlite.org)
 (TXT) w3m dump (sqlite.org)
        
       | chrisaycock wrote:
       | SQLite's design docs were the first time I had seen a database
       | use a virtual machine instead of walking a tree. I later noticed
       | VMs in libraries, embedded DSLs, and other applications outside
       | of large general-purpose programming languages. That really drove
       | home for me that VMs could be anywhere and were often a useful
       | step in handling a user's expressions.
        
         | runlaszlorun wrote:
         | SQLite is the first one I've looked at the internals of. Do
         | others walk an AST of the query instead?
        
           | __s wrote:
           | Yes, postgres for example. It maps pretty close to what you
           | see from `explain` where the ops are pretty high level,
           | reducing interpreter overhead. JIT's big gain is speeding up
           | reading values out of row data
        
           | hn_throwaway_99 wrote:
           | FTA:
           | 
           | > Tree-Of-Objects - The input SQL is translated in a tree of
           | objects that represent the processing to be done. The SQL is
           | executed by walking this tree. This is the technique used by
           | MySQL and PostgreSQL.
        
             | Sesse__ wrote:
             | Note that this only holds true for fairly recent MySQL
             | (MySQL 8.x, not including the oldest 8.0.x releases). 5.7
             | and older, and by extension MariaDB, instead models pretty
             | much everything as a large fixed-function recursive
             | function that calls itself for each new table in the join,
             | plus some function pointers for handling of GROUP BY and
             | such.
             | 
             | TBH, when it comes to query execution (A JOIN B JOIN C
             | GROUP BY ...), I think the difference between SQLite's
             | bytecode and a tree of iterators (the classical Volcano
             | executor) is fairly small in practice; they are quite
             | interchangeable in terms of what they can do, and similar
             | when it comes to performance. The difference between
             | bytecode and tree structure is much larger when it comes to
             | evaluation of individual expressions (a + b * cos(c)),
             | especially since that involves much more of a type system;
             | that is more obviously in the "bytecode is the way to go"
             | camp to me.
        
           | runlaszlorun wrote:
           | Sorry, I'd read the article a couple years ago and forgot he
           | goes into depth on the other approaches. My bad.
        
         | whateveracct wrote:
         | Everything is either a compiler or interpreter
        
           | Zambyte wrote:
           | That's why SICP is often brought up in contexts that are not
           | explicitly about interpreters :)
        
         | bane wrote:
         | VMs really can be. They have a long history in code
         | portability. In the old days it really wasn't uncommon to use
         | an approach centered around some interpretable byte code
         | running in a vm, where the reusable vm was all that needed
         | porting to different architectures and operating systems. This
         | all happened well before Java.
         | 
          | It was really big in gaming: Zork, Sierra games, LucasArts
         | games, and even a few more "action" games like Flashback were
         | all designed around VMs to make porting somewhat sane. Running
         | the byte code is usually not the performance bottleneck in
         | these cases but drawing stuff to the screen is, and the VM had
         | to handle that.
        
           | runlaszlorun wrote:
           | And Pascal p-code! Not the first, I've heard, but I believe
           | it's close to being the first.
        
             | pjmlp wrote:
             | One of the first was Burroughs B5000,
             | 
             | https://en.wikipedia.org/wiki/Burroughs_Large_Systems
             | 
              | It used an almost memory-safe systems language, ESPOL, with
              | zero Assembly; all CPU capabilities were exposed via
              | intrinsics, in one of the first recorded uses of unsafe code
              | blocks; there was tagging and capabilities support, and the
              | CPUs were microcoded. All of this in 1961, a decade before C
              | came to be.
              | 
              | ESPOL was quickly replaced by NEWP, although there is very
              | little data on when it happened, probably a couple of years
              | later.
              | 
              | Nowadays it is still sold by Unisys, under the guise of being
              | a mainframe system for those that value security above all,
              | as ClearPath MCP, and you can get the NEWP manual.
             | 
             | https://www.unisys.com/solutions/enterprise-
             | computing/clearp...
             | 
             | https://public.support.unisys.com/aseries/docs/ClearPath-
             | MCP...
        
               | kragen wrote:
                | the b5000 was one of the first _non_-virtual stack
               | machines, but its instruction set isn't any more of a
               | virtual machine than the 8088's or the pdp-10's. there
               | were a number of interpreted stack languages in the 50s,
               | though not nearly as many as there would be later
        
               | pjmlp wrote:
                | When one does digital archeology, it is quite common to
               | see Assembly referred to as bytecode, when the CPUs are
               | actually interpreters written in microcode.
               | 
               | Another example, all the Xerox PARC workstations, which
               | loaded the respective interpreter (Smalltalk, Interlisp,
               | Mesa, Mesa/Cedar) into the CPU as first step during the
               | boot process.
        
               | whartung wrote:
               | According to 5s on Google, the x86 has always been
               | microcoded. I guess the argument is that "machine
               | language" is the public API of the CPU, and "assembly
               | language" is the higher level script used to, mostly,
               | directly create machine language.
               | 
               | Is Intel microcode able to be changed by anyone? I
               | understand that even CPUs get field updates nowadays.
        
               | kragen wrote:
               | i have never seen assembly, or the code generated from
               | it, referred to as 'bytecode' unless there was no
               | physical hardware or microcode that could interpret it --
               | _except_ in the case of smalltalk, whose bytecode was, as
               | you say, interpreted by microcode on the alto and the
               | d-machines. but possibly your archeology has delved into
                | cultures mine has not? i'd be very interested in reading
               | more, if you happen to have any links handy
        
           | gpderetta wrote:
           | Don't know about Flashback, but Another World famously was VM
           | based.
        
             | bane wrote:
             | I might be wrong, but I've read on other project pages that
             | it is. https://github.com/sylefeb/a5k
        
         | hiAndrewQuinn wrote:
         | VMs are a persistently underrated tool by people who think the
         | conversation begins and ends at VirtualBox. They are a
         | phenomenal way to constrain the surface area of what one's own
         | code has to target, and that's always a plus.
        
         | kqr wrote:
          | Or indeed any time complexity needs to be layered. A VM design
         | doesn't even have to have an explicit bytecode compilation
         | stage -- you can write an interpreter that runs as the
         | instructions are issued.
         | 
         | David Parnas talks a lot about this as a way to reliable
         | modularisation: https://two-wrongs.com/deliberate-
         | abstraction.html
        
         | anon291 wrote:
         | Stack-based VMs, like SQLite's (I think), ARE trees. A stack
         | based VM's bytecode (without DUP and POP) is just the post-
         | order depth-first traversal of the corresponding expression
         | tree. With DUP you have a connected acyclic DAG. With POP you
         | have an acyclic DAG with one or more components. With loops you
         | have a full graph.
         | 
         | When looked at this way, a VM makes the most sense actually
         | because a pointer-heavy tree implementation is just terrible
         | for caching and is super wasteful. Also, most SQL plans are
         | trees (Until you get to WITH RECURSIVE).
         | 
         | I'm working on a CEP in C right now and a stack-based bytecode
         | is simply the most 'obvious' way to represent a tree when you
         | don't want to have to deal with lots of memory allocations.
         | Just pre-allocate a buffer large enough and increment and go.
         | 
          | Either way, try taking something like the following and
          | printing an expression tree like the one below
          | 
          |     LITERAL 2
          |     LITERAL 3
          |     LITERAL 4
          |     MUL
          |     PLUS
          | 
          | now, write something to produce on a terminal, the following:
          | 
          |     Plus
          |         Mul
          |             3
          |             4
          |         2
         | 
         | You should be able to do this in a simple iteration through the
         | ops. Try it! Then try it with the variations above.
         | 
         | Now, try writing ops to manipulate and rewrite it. It's
         | actually really a fun way to represent trees.
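          | 
          | (For illustration, a minimal C sketch of the exercise -- the
          | names and the enum are made up, not anon291's code: since the
          | postfix ops are the post-order traversal, the last op is the
          | root, and walking the array backwards recovers the tree.)
          |     #include <stdio.h>
          |     
          |     typedef enum { LITERAL, MUL, PLUS } Op;
          |     typedef struct { Op op; int value; } Instr;
          |     
          |     /* Print the node at ops[i], then its children, walking the
          |      * postfix array backwards; returns the index just before the
          |      * subtree it consumed.  Children come out in pop order
          |      * (most recently pushed first). */
          |     static int print_node(const Instr *ops, int i, int depth) {
          |         for (int d = 0; d < depth; d++) printf("  ");
          |         switch (ops[i].op) {
          |         case LITERAL: printf("%d\n", ops[i].value); return i - 1;
          |         case MUL:     printf("Mul\n");  break;
          |         case PLUS:    printf("Plus\n"); break;
          |         }
          |         int j = print_node(ops, i - 1, depth + 1); /* right operand */
          |         return print_node(ops, j, depth + 1);      /* left operand  */
          |     }
          |     
          |     int main(void) {
          |         Instr prog[] = { {LITERAL, 2}, {LITERAL, 3}, {LITERAL, 4},
          |                          {MUL, 0}, {PLUS, 0} };
          |         print_node(prog, 4, 0);  /* start at the last op, the root */
          |         return 0;
          |     }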
        
           | Scaevolus wrote:
           | SQLite's VM is register-based, not stack-based.
        
             | ojosilva wrote:
             | Does SQLite's VM have an API? I mean, one where I can build
             | and issue opcodes into directly.
        
               | OskarS wrote:
               | Not a public one, because they want to have the freedom
               | to change how it works. But if you type in `EXPLAIN
               | CREATE TABLE x(y);` or `EXPLAIN SELECT 3;` or whatever
               | into `sqlite3`, you can see it. And it's reasonably
               | straightforward to dig into the source code (SQLite is
               | very well-written C) to hook that up yourself if you
               | really wanted to.
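                | 
                | (A minimal sketch of peeking at the same thing through the
                | C API instead of the shell -- error handling omitted, and
                | the exact columns and opcodes vary by SQLite version since
                | this is not a stable interface:)
                |     #include <stdio.h>
                |     #include <sqlite3.h>
                |     
                |     int main(void) {
                |         sqlite3 *db;
                |         sqlite3_stmt *stmt;
                |         sqlite3_open(":memory:", &db);
                |         /* EXPLAIN turns a statement into a dump of its VDBE program */
                |         sqlite3_prepare_v2(db, "EXPLAIN SELECT 3;", -1, &stmt, NULL);
                |         int ncol = sqlite3_column_count(stmt);
                |         while (sqlite3_step(stmt) == SQLITE_ROW) {
                |             for (int i = 0; i < ncol; i++) {
                |                 const char *v = (const char *)sqlite3_column_text(stmt, i);
                |                 printf("%s%s", i ? " | " : "", v ? v : "");
                |             }
                |             printf("\n");
                |         }
                |         sqlite3_finalize(stmt);
                |         sqlite3_close(db);
                |         return 0;
                |     }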
        
             | bch wrote:
             | It's been both - was stack, converted to register[0][1].
             | 
             | [0] https://www.sqlite.org/vdbe.html
             | 
             | [1] https://www.sqlite.org/src/info/051ec01f2799e095
        
             | notpeter wrote:
             | This was the question I was going to ask! I've recently
             | been diving into Lua 5.x internals and have been extremely
             | impressed with Lua's register-based bytecode
             | implementation. Lua has a stable byte code interface
             | between patch releases (5.4.0 -> 5.4.1, etc) but not
             | between major/minor revisions (5.3 -> 5.4). SQLite on the
             | other hand does not consider this a public interface at
             | all!
             | 
             | > "Remember: The VDBE opcodes are not part of the interface
             | definition for SQLite. The number of opcodes and their
             | names and meanings change from one release of SQLite to the
             | next." https://www.sqlite.org/opcode.html#the_opcodes
             | 
             | For anyone interested in understanding how a register-based
             | VM operates I highly recommend:
             | 
             | A No-Frills Introduction to Lua 5.1 VM Instructions by
             | Kein-Hong Man.
             | https://www.mcours.net/cours/pdf/hasclic3/hasssclic818.pdf
        
           | Sesse__ wrote:
           | > Also, most SQL plans are trees (Until you get to WITH
           | RECURSIVE).
           | 
           | WITH RECURSIVE can also generally be implemented tree-style,
           | you just loop a bunch in one of the iterators.
           | 
           | There are some databases, like Umbra, which can have DAG
           | query plans for the benefit of more efficient groupjoin
           | (GROUP BY pushed down into a join). IIRC some of the
           | unnesting patterns can also benefit from non-tree plans.
        
         | surfingdino wrote:
         | Doesn't MS Word internally run a Forth-like VM? I remember
         | reading an article by someone who decompiled an early MS-DOS
         | version of Word only to discover that there was a VM inside.
        
           | jpfr wrote:
            | They called it p-code at the time. The purpose was
            | (purportedly) to simplify the porting between architectures.
           | 
           | https://casadevall.pro/articles/2020/11/compiling-word-
           | for-w...
        
         | bitwize wrote:
          | Wozniak found that 16-bit arithmetic was much easier to perform
          | on a 16-bit machine than on the Apple II's 8-bit 6502. So he
          | wrote SWEET16, a VM with its own bytecode, to work with 16-bit
          | pointers in Apple's Integer BASIC.
         | 
         | Similarly, the TI-99/4 series of computers had a cumbersome way
         | of accessing memory, because _most_ of the RAM in the machine
         | was exclusively controlled by the video chip and could only be
         | accessed by poking a few registers. So the designers
         | implemented a bytecode VM in ROM called GPL (Graphics
         | Programming Language) that would make accesses to this video
         | RAM seem like straightforward RAM accesses. TI encouraged the
         | use of GPL to develop games and other applications for the
         | system. Unfortunately, the fact that it was interpreted by the
         | CPU, plus the fact that the CPU could only access video-chip
         | memory during the horizontal and vertical retrace intervals,
          | made GPL execution very slow. And since the machine's in-ROM
         | BASIC interpreter was written in GPL, you now know why BASIC
         | programs on the TI-99/4 and /4A crept along at the most
         | leisurely of paces.
         | 
         | So VMs were definitely a thing used to make certain kinds of
         | tradeoffs, even way back in the 8-bit days when systems had
         | mere kilobytes of memory. Who knows how many might be lurking
         | in a modern system!
        
           | krallja wrote:
           | The unfortunate history of SWEET16 is that it was never used
           | for its intended purpose -
           | https://retrocomputing.stackexchange.com/a/14513/11579
        
         | naasking wrote:
         | > That really drove home for me that VMs could be anywhere and
         | were often a useful step in handling a user's expressions.
         | 
         | I think it's a special case of the fact that finite state
         | machines are everywhere you look.
        
         | atombender wrote:
         | MonetDB is another database that uses a VM [1], using an
         | instruction set called the MonetDB Assembly Language (MAL). I
         | believe it pioneered the technique [2]. The VM is basically
         | designed to express the query as vectorized functions on
         | columns.
         | 
         | [1] https://stratos.seas.harvard.edu/files/MonetDebull2012.pdf
         | 
         | [2]
         | https://www.researchgate.net/publication/220538804_Database_...
        
       | userbinator wrote:
       | _The problem of rendering a tree-of-objects as a table is
       | sufficiently difficult that nobody does it, as far as I know.
       | Hence, no tree-of-objects database engine provides the level of
       | detail in their "EXPLAIN" output that SQLite provides._
       | 
       | I believe Microsoft SQL Server uses an object tree internally,
       | and yet its query plan output is a table:
       | 
       | https://learn.microsoft.com/en-us/sql/t-sql/statements/set-s...
        
         | jiggawatts wrote:
         | Typically you get XML for showplan because it can represent the
         | tree structure better.
        
           | kruador wrote:
           | You get XML for estimated execution plans if you use SET
           | SHOWPLAN_XML ON. You get the actual execution plan XML along
           | with your results if you use SET STATISTICS XML. You can see
           | cached execution plans in the sys.dm_exec_query_plan dynamic
           | management view.
           | 
           | The estimated/actual execution plan feature in SQL Server
           | Management Studio was changed to use XML in SQL Server 2005.
           | The SQL Server team are only adding new information to the
           | XML representation, not the text representation.
           | 
           | The name "estimated" execution plan is a bit misleading. If
           | you executed the statement(s) with the current indexes and
           | statistics, that's how it would be executed. I think the
           | "estimated" part of the name is because the number of rows
           | returned from each node, and the execution time of each node,
           | is an estimate.
           | 
           | Assuming that the indexes and statistics don't change, you
           | should expect the "actual" execution plan to have the same
           | shape. It will just be annotated with the number of rows that
           | were actually retrieved as well as the estimated numbers. SQL
           | Server doesn't re-plan if it turns out the number of rows was
           | greater (or fewer) than expected.
           | 
           | As a SQLite and SQL Server user, I find that I would like
           | something more detailed from SQLite than EXPLAIN QUERY PLAN
           | (which just tells you the join order), but less detailed than
           | the bytecode. Particularly I want to know _why_ it chose that
           | join order or to use that index.
        
             | yread wrote:
              | I haven't used SQL Server much since the 2008 days but it
             | used to happen quite often that the estimated and actual
             | execution plan differed a lot. I think the estimated part
             | also includes how many rows get returned from a subquery
             | and that informs which joining algorithm is used.
             | 
             | I second your longing for something more detailed than
             | explain query plan in Sqlite though.
             | 
             | In fact I complained about it here and some sqlite
             | developer pointed me to these resources:
             | 
             | https://news.ycombinator.com/item?id=30405275
             | 
             | I've never actually tried yet. Queries are still too fast
             | to do all that:)
        
         | slaymaker1907 wrote:
         | It can also execute a query incrementally. In fact, certain
         | features rely on this behavior. "Watch live data" (in SSMS) for
         | extended events is actually an infinite table-valued function.
         | It's not exactly rocket science to do in a database, you just
         | need to model data sources and operators as iterators.
        
         | bryancoxwell wrote:
         | I don't doubt the author, but what is it that makes rendering a
         | tree of objects to a table a difficult problem? Is that not
         | what browsers do when they render a table element?
        
           | IsTom wrote:
           | At least they're hard to display on a terminal. Reading
           | EXPLAIN ANALYZE diagrams from psql does not spark joy.
        
           | mfoc wrote:
           | Quote from the article:
           | 
           | "A tree-of-objects representation is more difficult to
           | publish in a human-readable form. The objects that comprise
           | the tree tend to all be very different, and thus it is tricky
           | to come up with a consistent and simple table representation
            | with which to display the objects. And any such table
           | representation that you do come up with would almost
           | certainly have more than six columns, probably many more. The
           | problem of rendering a tree-of-objects as a table is
           | sufficiently difficult that nobody does it"
           | 
           | To further elaborate on this important point.
           | 
           | There is an 'impedance mismatch' (conceptual difficulty
           | mapping between the two logic models) between the tree
           | abstract data type and the table abstract data type.
           | Specifically, there are four key differences between the
           | simple table data structure and the more complex tree data
           | structure that makes mapping between them a non-trivial
           | operation.
           | 
           | Hierarchical: A table has a flat representation; a tree has a
           | hierarchical representation.
           | 
           | Order: The order of the rows in a table typically do not
           | matter (they may have a unique rowid). The order of branches
            | (nodes) and leaves in a tree is important, and the ordering
            | itself is an encoding of valuable information.
           | 
           | Semi-structured: A table has a fixed structure (rows
           | multiplied by columns). A tree has a flexible structure - an
           | arbitrary combination of branches (internal nodes) and leaves
            | (terminal nodes). Semi-structured data has a structure that
            | may not necessarily be known in advance; the tree has an
            | irregular and variable formation and may have branches with
            | missing or supplementary nodes.
           | 
           | Meta-data: The information describing the meaning of the data
           | in a table is typically stored separately from the table -
           | consequently a schema is often mandatory. A schema is
           | optional for a tree abstract data type.
           | 
           | As an aside, I have been visiting hacker news almost daily
           | since 2010. This is my first comment on hacker news. I want
           | to say thank you to the community for the many valuable
            | insights over the years.
        
             | galaxyLogic wrote:
             | Congratulations for your first comment!
             | 
             | What I want to do is represent my tree-structure as a table
             | in a relational database and then be able to efficiently
             | get the tree structure back by transforming the table-
             | representation back into a tree.
             | 
             | Further I would like to do that in plain standard SQL. This
             | must be a common problem, any documented solutions out
             | there?
        
               | mfoc wrote:
               | Thank you.
               | 
               | Great question - you have touched on the key difference
               | between a labeling scheme and an encoding scheme for tree
               | data structures.
               | 
               | As mentioned previously, the tree is an abstract data
               | type, that is to say, a conceptual model that defines the
               | nodes in a tree, and their relationships.
               | 
                | To be able to evaluate an expression that processes a
               | tree, one needs a labeling scheme. The purpose of a
               | labeling scheme is to assign unique labels to each node
               | in the tree and these labels must facilitate node
               | ordering, and often (but not always) a labeling scheme
               | will permit the reconstruction of the tree structure.
               | 
               | However, no labeling scheme captures the node type, names
               | or the content stored at the nodes. For that we need an
               | encoding scheme. An encoding scheme is constructed upon a
               | labeling scheme and augments it with the information
               | necessary to fully represent the tree in a table-like
               | data structure. In answer to your question, it also
               | permits the full transformation from the table
               | representation to the original tree structure.
               | 
               | Thus, it sounds like what you are looking for is an
               | encoding scheme.
               | 
               | There are many different labeling schemes out there for
               | tree structure data, and virtually all of them can be
               | augmented with additional information to construct a
               | complete encoding scheme. Concerning documented solutions
               | - I have not been active in this space for a number of
               | years, so off the bat - I don't have a recommended
                | documented solution to point you to.
               | 
               | But to help, I will put a link to my PhD thesis [1] which
               | gives a more in-depth understanding of labeling schemes
               | and encoding schemes for tree structured data with an
               | example of a simple implementation of an encoding scheme
               | enabling the full transformation from the table
               | representation to the original tree structure (pages 5-9)
               | and a survey of the advantages and disadvantages of
               | existing labeling schemes concerning their usefulness to
               | be part of an encoding scheme you could use in your
               | solution (see chapter 2)
               | 
               | Caveat 1: My thesis was written in the context of
               | updating dynamic (XML) trees but it addresses the
               | transformation between tree and table data structures.
               | 
               | Caveat 2: The thesis was written 11 years ago, but every
               | now and then I have kept in touch with the latest
               | developments in the area, and to my knowledge, there have
               | been no major developments since.
               | 
               | I hope it helps.
               | 
               | [1]: https://doras.dcu.ie/19316/
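                | 
                | (For the simple case asked about above, one widely
                | documented encoding is an adjacency-list table plus a
                | recursive CTE -- standard SQL since SQL:1999 and supported
                | by SQLite's WITH RECURSIVE. A minimal sketch through the
                | SQLite C API, with made-up table and column names and no
                | error handling:)
                |     #include <stdio.h>
                |     #include <sqlite3.h>
                |     
                |     int main(void) {
                |         sqlite3 *db;
                |         sqlite3_stmt *st;
                |         sqlite3_open(":memory:", &db);
                |         sqlite3_exec(db,
                |             "CREATE TABLE node(id INTEGER PRIMARY KEY, parent INT, label TEXT);"
                |             "INSERT INTO node VALUES (1,NULL,'Plus'),(2,1,'Mul'),"
                |             "                        (3,2,'3'),(4,2,'4'),(5,1,'2');",
                |             NULL, NULL, NULL);
                |         /* Rebuild the tree: depth gives indentation, path gives order. */
                |         sqlite3_prepare_v2(db,
                |             "WITH RECURSIVE walk(id, label, depth, path) AS ("
                |             "  SELECT id, label, 0, printf('%08d', id)"
                |             "    FROM node WHERE parent IS NULL"
                |             "  UNION ALL"
                |             "  SELECT n.id, n.label, w.depth + 1, w.path || printf('%08d', n.id)"
                |             "    FROM node AS n JOIN walk AS w ON n.parent = w.id)"
                |             "SELECT depth, label FROM walk ORDER BY path;",
                |             -1, &st, NULL);
                |         while (sqlite3_step(st) == SQLITE_ROW)
                |             printf("%*s%s\n", 2 * sqlite3_column_int(st, 0), "",
                |                    (const char *)sqlite3_column_text(st, 1));
                |         sqlite3_finalize(st);
                |         sqlite3_close(db);
                |         return 0;
                |     }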
        
             | bryancoxwell wrote:
             | Hey, this is excellent. Thanks for the thorough response.
        
         | winternewt wrote:
         | Same with PostgreSQL. You can choose between text output (one
         | table row per line), XML or JSON. But none of them are ordered
         | according to a strict execution order like bytecode would be.
          | To me that makes perfect sense though because it represents
         | what can be parallelized in the plan. The bytecode
         | representation in SQLite seems to have the inherent limitation
         | that it is single threaded.
        
         | yayr wrote:
         | Actually, SAP HANA has a nice tool called PlanViz for that in
         | Eclipse. A google image search gives you a good impression how
         | that works. There are also some blogs about it, e.g.
         | https://community.sap.com/t5/enterprise-resource-planning-bl...
        
           | Cthulhu_ wrote:
           | I love how the example SQL query uses abbreviated columns
            | that sound like runes:
            | 
            |     select
            |     --    count(*)
            |           a.mandt, b.rldnr, a.bukrs, a.gjahr, a.belnr, b.docln, a.buzei
            |     --     *
            |     from       BSEG    as a
            |     innerjoin  ACDOCA  as b
            |     on    b.rclnt  = a.mandt
            |     and   b.rbukrs = a.bukrs
            |     and   b.gjahr  = a.gjahr
            |     and   b.belnr  = a.belnr
            |     where a.mandt  = '715'
            |     and   b.rldnr  = '0L'
            |     and   b.docln  = '000001'
            |     and   a.buzei  = '001'
            |     and   a.gjahr  = '2018'
            |     --and   a.gjahr = '2017'
            |     --and   a.gjahr = '2019'
            |     --order by a.mandt, b.rldnr, a.bukrs, a.gjahr, a.belnr, b.docln, a.buzei
            |     limit 200;
           | 
           | Just realised this is probably from German too, "jahr" being
           | year.
        
             | yayr wrote:
             | yeah, these accounting table field names have not changed
              | in around 30 years, which is not necessarily a bad
             | thing...
             | 
             | for a human readable view of this you can look e.g. here
             | https://docsfortec.com/sap/S4/tables/ACDOCA
        
               | deanishe wrote:
               | They're a bit of a bilingual mess.
               | 
               | Did they start out with German table names and decide to
               | give new tables English names at some point?
        
         | zachmu wrote:
         | Dolt's EXPLAIN output prints the execution tree directly. E.g.:
          |     explain select * from xy join uv on (x = u and u > 0) where u < 2;
          |     Project
          |      +- columns: [xy.x:2!null, xy.y:3, uv.u:0!null, uv.v:1]
          |      +- LookupJoin
          |          +- IndexedTableAccess(uv)
          |          |   +- index: [uv.u]
          |          |   +- static: [{(0, 2)}]
          |          |   +- colSet: (3,4)
          |          |   +- tableId: 2
          |          |   +- Table
          |          |       +- name: uv
          |          |       +- columns: [u v]
          |          +- IndexedTableAccess(xy)
          |              +- index: [xy.x]
          |              +- keys: [uv.u:0!null]
          |              +- colSet: (1,2)
          |              +- tableId: 1
          |              +- Table
          |                  +- name: xy
          |                  +- columns: [x y]
        
       | Dwedit wrote:
       | Running bytecode is much lower latency than compiling into native
       | code. If you're not bottlenecked by running the bytecode (as
       | opposed to memory or disk speed), you don't really have to JIT it
       | any further into native code.
        
         | winrid wrote:
         | Yeah, but nobody is seriously considering that unless maybe for
         | huge prepared statements. The argument is usually bytecode vs
         | parser and associated data structures.
        
           | _flux wrote:
           | PostgreSQL is not only considering it, they're doing it!
           | https://www.postgresql.org/docs/current/jit-reason.html
           | 
           | I don't have personal experience on it, but I've read that in
           | practice it's not worth the effort--at least not yet.
           | Apparently there are some issues with it and it barely speeds
           | up queries (except perhaps certain ones?). I imagine this
           | could be in big part because LLVM is not really a good fit
           | for JIT where you want to spend very little time to do the
           | compilation itself.
        
             | masklinn wrote:
             | Yeah my experience of the pg jit is mostly negative, it's
             | quite slow and it has a hard time estimating the cost of
             | interpreted v compilation + compiled execution so more
             | often than not it's actively detrimental. It might fare
             | better if you make heavy use of a limited number of
             | prepared statements.
        
             | winrid wrote:
             | Interesting, thanks. I imagine they will have a query
             | planner for the query planner to determine to JIT or not :)
        
         | rhdunn wrote:
         | Which is why JavaScript engines (and JIT compilers for other
         | languages) are typically designed to:
         | 
         | 1. start interpreting the code once it has been parsed, and
         | start jitting a function being called;
         | 
         | 2. generate naive bytecode for the function that generates
         | native code equivalents of the run actions of the AST
         | (including some fast-path code for simple cases such as adding
         | two 32-bit integers, and falling back to function calls to
         | perform the add on more complex types);
         | 
         | 3. generate more optimal code for sequences of instructions in
         | the background, such as entire for/while loops, then patch in
         | calls to those fast-path versions when ready.
         | 
         | That way you can start running the code immediately after
         | parsing it, and can switch to the faster versions when they are
         | ready if the code takes longer to run in the slower version.
        
         | gpderetta wrote:
         | Depends how much compilation you want to do. For example
         | converting threaded code into native direct jumps need not be
         | more expensive than plain bytecode. Of course the gains are
         | also limited.
        
       | MrBuddyCasino wrote:
       | I was surprised the text didn't mention one major difference
       | between the byte code approach vs AST: coupling.
       | 
       | When your database engine runs in-process, there is no
       | possibility of the server and the client library having diverging
       | versions. But this is common with traditional databases.
       | 
        | Once you bake in the execution steps ("how to execute") instead
        | of describing the query via AST ("what to execute"), an
       | important part of the execution logic now lives in the driver.
       | 
       | I suspect this complicates version upgrades and bugfixes, because
       | the database is less self-contained.
       | 
       | Not an issue for sqlite, potentially disastrous for mysql.
        
         | tadfisher wrote:
         | Do clients typically communicate with the server in some AST
         | representation instead of, well, SQL? I'd be surprised if that
         | much parsing/planning happens on the client.
        
           | MrBuddyCasino wrote:
           | Since prepared statements are created by the driver, I was
           | assuming this was the case - but I might be completely wrong
           | here.
        
             | _flux wrote:
              | Converting a SELECT to a PREPARE does not really require
              | parsing the complete query--or even if it did, some small
              | concessions for this could be implemented in the line
              | protocol to enable the server to build the prepared statement
              | out of the client query.
             | 
             | I don't believe *any* SQL client library actually tries to
             | parse e.g. PostgreSQL itself at any point of processing.
             | You can read what the PostgreSQL protocol does here:
             | https://www.postgresql.org/docs/current/protocol-
             | flow.html#P...
        
             | layer8 wrote:
             | No, prepared statements are server-side.
        
         | evanelias wrote:
         | > an important part of the execution logic now lives in the
         | driver
         | 
         | > potentially disastrous for mysql
         | 
         | This is completely incorrect. The MySQL client side isn't
         | responsible for execution steps and this isn't part of its wire
         | protocol at all.
         | 
         | With prepared statements, the server parses the statement and
         | gives the client back a handle, which is basically just a
         | unique identifier that the client can use to invoke the
         | prepared statement.
        
         | Sesse__ wrote:
         | > Not an issue for sqlite, potentially disastrous for mysql.
         | 
         | MySQL _completely_ switched execution paradigm through the
         | course of a couple of 8.0.x sub-releases, generally without
         | anyone noticing.
        
       | megadal wrote:
       | Perhaps my understanding is off, but I am pretty sure parsing and
       | translating SQL into bytecode still involves an AST.
       | 
       | Just that query processing itself is done from the bytecode
       | (produced presumably from an AST or something similar) rather
       | than directly from the AST itself.
       | 
       | If I'm right I can't really see how this performs better unless
       | you're excluding the parsing step from benchmarks
        
         | lifthrasiir wrote:
         | You don't need one, Lua is another example where no AST is ever
         | generated. In some sense the resulting bytecode closely
         | corresponds to the AST that would have been generated though.
        
           | megadal wrote:
           | Genuinely asking as parsing without an AST is something I've
           | never seen explained: How do you go from source code to
           | bytecode without an AST?
           | 
           | Isn't the bytecode just a flattened representation of an AST
           | obtained by some sort of tree traversal?
           | 
           | This seems to imply an AST is involved in the generation of
           | the bytecode
        
             | lifthrasiir wrote:
             | Have you used any parser generator like yacc/bison? They
             | have "actions", which are arbitrary codes that will be
             | executed when some grammar production is detected. For
             | example, `expr ::= expr mulop expr { some code }` will
             | execute `some code` when a multiplicative expression is
             | detected, where intermediate results from two `expr`s and
             | `mulop` are available to that code. This concept of actions
              | generally applies to all sorts of parsers, not just
             | generated parsers.
             | 
             | Those actions would typically allocate and build (partial)
             | ASTs, but you can do anything with them. You can for
             | example directly evaluate the subexpression if your grammar
             | is simple enough. Likewise bytecodes can be generated on
             | the fly; the only concern here is a backward reference,
             | which has to be patched after the whole expression block
             | gets generated, but otherwise you don't have to build any
             | tree-like structure here. (Most practical parsers only need
             | a stack to function.)
        
             | kazinator wrote:
             | How you go from source to target code without an AST is
             | that the syntax-directed translation step in your
             | implementation (that which would build the AST) doesn't
             | bother with that and just builds the output code instead.
             | The extra traversal is skipped; replaced by the original
             | parser's traversal of the raw syntax.
             | 
             | E.g. pseudo-Yacc rules for compiling the while loop in a
              | C-like notation.
              | 
              |     while_loop : WHILE '(' expr ')' statement
              |        {
              |            let back = get_label();
              |            let fwd = get_label();
              |            let expr = $3; // code for expr recursively generated
              |            let stmt = $5; // code for statement, recursively generated
              |            $$ = make_code(stmt.reg(),
              |                           `$back:\n`
              |                           `${expr.code()}\n`
              |                           `BF ${expr.reg()}, $fwd\n`  // branch if false
              |                           `${stmt.code()}\n`
              |                           `JMP $back\n`
              |                           `$fwd:\n`)
              |        }
              |        ;
             | 
             | Every code fragment coming out of a rule has .code() and
             | .reg(): the generated code, which is just a string, and the
             | output register where it leaves its value. Such
             | representational details are decided by the compiler
             | writer.
             | 
             | The while statement produces no value, so we just borrow
             | the statement's .reg() as a dummy in the call to make_code;
             | our rule is that every code fragment has an output
             | register, whether it produces a value or not.
             | 
             | When the LALR(1) parser reduces this while loop rule to the
             | while_loop grammar symbol, the expr and statements have
             | already been processed; so the rule action has ready access
             | to the code objects for them. We just synthesize a new code
             | object. We grab a pair of labels that we need for the
             | forward and back jump.
             | 
             | I'm assuming we have a vaguely JS-like programming language
             | being used for the grammar rule actions, in which we have
             | template strings with interpolation, and adjacent strings
             | get merged into one. The bytecode assembly is line
             | oriented, so we have \n breaks.
             | 
             | One possible kind of expression is a simple integer
              | constant, INTEGER:
              | 
              |     expr : INTEGER
              |       {
              |           let reg = allocate_reg();
              |           let val = $1
              |           $$ = make_code(reg, `LOAD $reg, #$val\n`)
              |       }
             | 
              | One possible statement is an empty statement denoted by
              | empty curly braces:
              | 
              |     statement : '{' '}'
              |       {
              |           $$ = make_code(R0,  // dedicated zero register
              |                          ""); // no-code solution
              |       }
             | 
             | So then when we have while (1) { } we might get R1
              | allocated in the expr rule, like this:
              | 
              |     LOAD R1, #1\n   ; output register is R1
              | 
              | then in the while loop, things get put together like this:
              | 
              |     L0:\n           ; back label
              |     LOAD R1, #1\n   ; code for expr
              |     BF R1, L1\n     ; branch if false to L1
              |                     ; no code came from empty statement
              |     JMP L0          ; back branch
              |     L1:\n           ; fwd label
        
             | thewakalix wrote:
             | See also: DOM versus streaming.
        
             | kccqzy wrote:
             | Another way to think about this is to imagine you are
             | working in a lazy-by-default language like Haskell. The
             | code might seem to build up an AST and then evaluate it,
             | but (disregarding parse errors) the AST is never
             | materialized in memory at once: the tree might only have
             | one level, with its children represented by thunks that
             | continue to parse more of the source code. (If you are
             | unfamiliar with Haskell laziness, imagine that the so-
             | called tree has children that are _functions_ to produce
             | the next level of the tree.) Of course you will need a
             | carefully designed bytecode in order to generate bytecode
             | from AST where the children is not yet known.
        
             | ufo wrote:
             | It's as if they generated an AST and then immediately
             | converted it to bytecode, all at once.
             | 
             | The key restriction is that you must be able to generate
             | the bytecode in a single pass over the source code. For
             | example, the compilation of a function at the top of the
             | file must not depend on declarations further down the file.
        
               | vidarh wrote:
               | Depending on the form the dependencies takes you can
               | handle that too. E.g. if it's "just" about knowing
               | addresses, then just building up a list of relocation
               | records where you need to patch in addresses as they
               | become known works just fine. Of course the more work
               | like that you have to do, the less you save over just
               | building an AST anyway.
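                | 
                | (A minimal C sketch of that relocation-record idea -- the
                | names and opcodes are made up: emit the jump with a
                | placeholder operand, remember where the operand lives, and
                | patch it once the target address is known.)
                |     #include <stdint.h>
                |     
                |     enum { OP_JUMP_IF_FALSE = 1 /* ...other opcodes... */ };
                |     
                |     static uint8_t code[4096];    /* bytecode being emitted          */
                |     static int     here = 0;      /* next free slot in the buffer    */
                |     static int     fixups[256];   /* operand slots awaiting a target */
                |     static int     nfixups = 0;
                |     
                |     static int emit_jump(uint8_t op) {
                |         code[here++] = op;
                |         fixups[nfixups++] = here; /* remember where to patch later */
                |         code[here++] = 0xFF;      /* 2-byte placeholder target     */
                |         code[here++] = 0xFF;
                |         return nfixups - 1;
                |     }
                |     
                |     static void patch_jump(int fixup, int target) {
                |         int at = fixups[fixup];
                |         code[at]     = (uint8_t)(target & 0xFF);  /* little-endian */
                |         code[at + 1] = (uint8_t)((target >> 8) & 0xFF);
                |     }
                |     
                |     /* While compiling "if (cond) body":
                |      *   ...emit code for cond...
                |      *   int f = emit_jump(OP_JUMP_IF_FALSE);
                |      *   ...emit code for body...
                |      *   patch_jump(f, here);  // fall-through address now known
                |      */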
        
             | lolinder wrote:
             | The second half of Crafting Interpreters [0] does this--
             | rather than outputting the AST, the bytecode interpreter
             | outputs bytecode directly. Essentially, the process of
             | creating the AST is already very similar to the tree
             | traversal you'll do to unpack it (recursive descent parsing
             | _is_ a traversal of the tree, just in the call stack rather
              | than over a real data structure), so if you don't need the
             | AST you can skip it and just start outputting bytecode as
             | if you were already walking the tree.
             | 
             | [0] http://craftinginterpreters.com/
        
             | vidarh wrote:
             | Consider that if the bytecode is just a flattened
             | representation, then instead of generating and traversing
             | the tree, you can just traverse what would have been turned
             | into the tree in the same traversal order directly.
             | 
             | E.g. imagine you're in a leaf production for some simple
             | grammar for an operator that can only take integers, w/all
              | error handling ignored:
              | 
              |     parse_fooexpr() {
              |         left = parse_int
              |         emit_int(left)
              |         expect_token(FOO_OP)
              |         right = parse_int
              |         emit_int(right)
              |         emit_foo_op()
              |     }
             | 
             | Whether emit_int outputs a "push <someint>" VM instruction
             | to a bytecode buffer, or constructs an IntNode and pushes
             | it to an argument stack, and whether "emit_foo_op" emits a
             | "foo_op" bytecode instruction or pops the argument stack
             | and constructs a FooOpNode(...argument stack values) AST
             | node is an implementation detail in terms of the parser.
             | 
             | Wirth's compilers usually didn't use an AST, and his book
             | on compiler construction contains a thorough walkthrough on
             | generating code as you parse, if you want to see it
             | explained in a lot more detail and w/tricks to generate
             | better code than the most naive approaches will.
             | 
             | The difficulty with this approach comes when you want to
             | apply optimisations or where it's inconvenient to generate
              | the code in parse-order because the language tries to be smart
             | about what feels better to write, but you can get pretty
             | far with this method.
        
             | ksherlock wrote:
             | Take a look at the evergreen Let's Build a Compiler, by
             | Jack Crenshaw
             | 
             | https://compilers.iecc.com/crenshaw/
             | 
             | Chapter 2 and 3, expression parsing, should give you the
             | idea. Instead of building the AST, you build code. So
             | you're cutting out the middle man (AST).
        
             | UK-AL wrote:
             | Parse Tree and AST are slightly different. A parse tree can
              | be implicit. A recursive descent parser can explore the
              | parse tree without there being an explicit parse tree data
             | structure.
        
         | miloignis wrote:
         | An AST being generated as an intermediate step is mentioned in
         | the article, at least in passing in section 2.4.
         | 
         | The reason bytecode is generally faster (not just for SQL, but
         | in most interpreted languages you may use (Python, etc)) is
         | that walking the AST is relatively expensive and doesn't treat
         | caches nicely. Bytecode is generally located next to each other
         | in memory, and you can make the bytecode fetch/dispatch pretty
         | tight, relative to indirect function calls on an object in an
         | AST.
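          | 
          | (A minimal sketch of that shape in C -- a made-up stack VM, not
          | SQLite's register-based VDBE: the whole program is one
          | contiguous array and dispatch is one tight loop, versus chasing
          | child pointers through heap-allocated AST nodes.)
          |     #include <stdio.h>
          |     
          |     enum { OP_PUSH, OP_MUL, OP_ADD, OP_HALT };
          |     
          |     static int run(const int *code) {
          |         int stack[64], sp = 0;
          |         for (const int *pc = code; ; ) {
          |             switch (*pc++) {          /* sequential fetch, cache friendly */
          |             case OP_PUSH: stack[sp++] = *pc++; break;
          |             case OP_MUL:  sp--; stack[sp-1] *= stack[sp]; break;
          |             case OP_ADD:  sp--; stack[sp-1] += stack[sp]; break;
          |             case OP_HALT: return stack[sp-1];
          |             }
          |         }
          |     }
          |     
          |     int main(void) {
          |         /* 2 + 3 * 4, the same shape as the postfix example upthread */
          |         int prog[] = { OP_PUSH, 2, OP_PUSH, 3, OP_PUSH, 4,
          |                        OP_MUL, OP_ADD, OP_HALT };
          |         printf("%d\n", run(prog));  /* prints 14 */
          |         return 0;
          |     }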
        
           | rhdunn wrote:
           | Another advantage to that is that it avoids the branching of
           | the while loop in the interpreter that iterates over the AST,
            | providing better instruction pipelining by having all the
           | run code next to each other.
           | 
           | The downside -- especially for dynamic languages like
           | JavaScript -- is that you need to keep all of the type checks
           | and fast-paths in the code, resulting in larger code blocks.
           | With more type analysis you could group fast-path
           | instructions together (e.g. within a while or for loop) but
           | that takes time, which is typically why a JIT engine uses
           | multiple passes -- generate the slower machine code first,
           | then improve the fast-path blocks for code that is long
           | running.
        
             | epcoa wrote:
             | > Another advantage to that is that it avoids the branching
             | of the while loop in the interpreter that iterates over the
             | AST
             | 
             | Huh? A bytecode interpreter still ends up branching based
             | on the decoded value of the byte codes. The VM bytecodes
             | are not directly executed by the CPU (at least not usually,
             | Jazelle and some older P-code stuff being rare exceptions).
             | 
             | This is what it looks like for the example being discussed
             | ("a massive switch statement"): https://github.com/sqlite/s
             | qlite/blob/b11daa50f9ea11c332bb59...
             | 
             | Perhaps confusing bytecode generation and interpretation
             | with JIT? JIT is often paired with a bytecode but neither
             | is dependent on the other.
        
               | rhdunn wrote:
               | The parent mentioned avoiding the dispatch of a virtual
               | function call used to evaluate the AST/bytecode so I was
               | assuming it was a minimal JIT instead of running the
                | resulting bytecode in an interpretive loop.
               | 
               | But yes, if you are interpreting the intermediate VM
               | bytecode you will still have a switch or dispatch branch
               | when evaluating each instruction.
        
         | katzenversteher wrote:
         | Quote from the article:
         | 
         | "The bytecode generated by SQLite is usually smaller than the
         | corresponding AST coming out of the parser. During initial
         | processing of SQL text (during the call to sqlite3_prepare()
         | and similar) both the AST and the bytecode exist in memory at
         | the same time and so more memory is used then. But that is a
         | transient state. The AST is quickly discarded and its memory
         | recycled [...]"
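          | 
          | (A minimal sketch of that life cycle through the C API -- error
          | handling omitted: sqlite3_prepare_v2() is where parsing and
          | bytecode generation happen, and the resulting statement can
          | then be reset and re-run without re-parsing.)
          |     #include <stdio.h>
          |     #include <sqlite3.h>
          |     
          |     int main(void) {
          |         sqlite3 *db;
          |         sqlite3_stmt *stmt;
          |         sqlite3_open(":memory:", &db);
          |         sqlite3_exec(db, "CREATE TABLE t(x INT);"
          |                          "INSERT INTO t VALUES (1),(2),(3);",
          |                      NULL, NULL, NULL);
          |     
          |         /* Parse + codegen happen once, here; the AST is gone afterwards. */
          |         sqlite3_prepare_v2(db, "SELECT sum(x) FROM t WHERE x >= ?;",
          |                            -1, &stmt, NULL);
          |     
          |         for (int lo = 1; lo <= 3; lo++) {
          |             sqlite3_bind_int(stmt, 1, lo);
          |             while (sqlite3_step(stmt) == SQLITE_ROW)  /* runs the bytecode */
          |                 printf("lo=%d sum=%d\n", lo, sqlite3_column_int(stmt, 0));
          |             sqlite3_reset(stmt);         /* rewind the program, keep it */
          |         }
          |         sqlite3_finalize(stmt);
          |         sqlite3_close(db);
          |         return 0;
          |     }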
        
       | wolf550e wrote:
       | The page is the result of this exchange on Twitter:
       | 
       | https://twitter.com/gorilla0513/status/1784756577465200740
       | 
       | I was surprised to receive a reply from you, the author. Thank
       | you :) Since I'm a novice with both compilers and databases,
       | could you tell me what the advantages and disadvantages are of
       | using a VM with SQLite?
       | 
       | https://twitter.com/DRichardHipp/status/1784783482788413491
       | 
       | It is difficult to summarize the advantages and disadvantages of
       | byte-code versus AST for SQL in a tweet. I need to write a new
       | page on this topic for the SQLite documentation. Please remind me
       | if something does not appear in about a week.
        
         | paulddraper wrote:
         | There are three approaches:
         | 
         | 1. interpreted code
         | 
         | 2. compiled then interpreted bytecode
         | 
         | 3. compiled machine code
         | 
         | The further up, the simpler.
         | 
         | The further down, the faster.
        
           | branko_d wrote:
           | > 2. compiled then interpreted bytecode
           | 
           | You can also compile to bytecode (when building), and then
           | compile that bytecode to the machine code (when running).
           | That way, you can take advantage of the exact processor that
           | is running your program.
           | 
           | This is the approach taken by .NET and Java, and I presume
           | most other runtime environments that use bytecode.
        
             | gpderetta wrote:
             | There is also the option of native code generation from
              | bytecode at install time as opposed to runtime.
        
               | usrusr wrote:
                | Aka missing out on all the delicious profiling
                | information that approaches with later translation enjoy.
               | There's no simple best answer.
        
               | 0x457 wrote:
               | I'm probably wrong, but I always thought for platforms
               | like .Net and JVM, AoT delivers the same performance as
               | JIT in most cases. Since a lot of information is
                | available before runtime, unlike JS where the VM always
                | needs to be ready to ditch optimized byte code and go back
                | to the whiteboard.
        
               | neonsunset wrote:
               | There are a few concerns here that make it confusing for
               | the general public to reason about AOT vs JIT
               | performance.
               | 
               | Different JIT compilers have different levels of
               | sophistication, being able to dynamically profile code at
               | runtime or otherwise, and the range of applied
               | optimizations of this type may also vary significantly.
               | 
                | Then, AOT imposes a different set of restrictions in the
                | form of what types of modularity the language/runtime wants
                | to support (Swift ABI vs no AOT ABI in .NET NativeAOT or
                | GraalVM Native Image), what kind of operations modifying the
                | type system are allowed, if any, and how the binary is
               | produced.
               | 
               | In more trivial implementations, where JIT does not
               | perform any dynamic recompilation and profile based
               | optimizations, the difference might come down to simply
               | pre-JITing the code, with the same quality of codegen. Or
                | the JIT may be sophisticated but the AOT mode might still
                | be a plain pre-JIT, which makes dynamic profiling
                | optimizations off-limits (which was _historically_ the case
                | for .NET's ReadyToRun and partially NGEN; IIRC the JVM
                | situation was similar up until GraalVM Native Image).
               | 
               | In more advanced implementations, there are many concerns
               | which might be at odds with each other: JIT compilers
               | have to maintain high throughput in order to be
               | productive but may be able to instrument intermediate
               | compilations to gather profile data, AOT compilers, in
               | the absence of static profile data (the general case),
               | have to make a lot more assumptions about the code, but
               | can reason about compiled code statically, assuming such
               | compilation comes with "frozen world" type of packaging
               | (rather than delegating dynamic parts to emulation with
               | interpreter).
               | 
               | And then there are smaller details - JIT may not be able
               | to ever emit pure direct calls for user code where a jump
               | is performed to an immediate encoded in the machine code,
               | because JIT may have to retain the ability to patch and
               | backpatch the callsites, should it need to de/reoptimize.
               | Instead, the function addresses may be stored in the
               | memory, and those locations are encoded instead where
               | calls are emitted in the form of dereference of a
               | function pointer from a static address and then a jump to
               | it.
               | 
               | JIT may also be able to embed the values of static
               | readonly fields as JIT constants in codegen, which .NET
               | does aggressively, but is unable to pre-initialize such
               | values by interpreting static initialization at compile
               | time as aggressively as AOT can (constexpr style).
               | 
               | So in general, a lot of it comes down to offering a
               | different performance profile. A lot of the beliefs in
               | AOT performance stem from the fact that lower-level languages
               | rely on it, and the compilers offering it are very
               | advanced (GCC and Clang mainly), which can expend very
               | long compilation times on hundreds of optimization passes
               | that JIT compilers cannot. But otherwise, JIT compilers can
               | and do compete, just a lot of modern day advanced
               | implementations are held back by the code that they are
               | used for, in particular in Java land where OpenJDK is
               | really, really good, but happens to be hampered by being
               | targeted by Java and JVM bytecode abstraction, which is
               | not as much of a limitation in C# that can trade blows
               | with C++ and Rust quite confidently the moment the
               | abstraction types match (when all use templates/struct
               | generics for example).
               | 
               | More on AOT and JIT optimizations (in the case of .NET):
               | 
               | - https://devblogs.microsoft.com/dotnet/performance-
               | improvemen...
               | 
               | - https://migeel.sk/blog/2023/11/22/top-3-whole-program-
               | optimi...
               | 
               | If someone has similar authoritative content on what
               | GraalVM Native Image does - please post it.
        
           | DeathArrow wrote:
           | There is also JIT bytecode.
        
             | paulddraper wrote:
             | Yes, the timing of that compilation of bytecode and machine
             | code can be either AOT or JIT.
             | 
             | For example, Java/JVM compiles bytecode AOT and compiles
             | machine code JIT.
        
           | tanelpoder wrote:
           | 4. lots of small building blocks of static machine code
           | precompiled/shipped with DB software binary, later iterated &
           | looped through based on the query plan the optimizer came up
           | with. Oracle does this with their columnar/vector/SIMD
           | processing In-Memory Database option (it's not like LLVM as
           | it doesn't compile/link/rearrange the existing binary
           | building blocks, just jumps & loops through the existing ones
           | in the required order)
           | 
           | Edit: It's worth clarifying that the _entire_ codebase does
           | _not_ run like that, not even the entire plan tree - just the
           | scans and tight vectorized aggregation/join loops on the
           | columns/data ranges that happen to be held in RAM in a
           | columnar format.
        
             | FridgeSeal wrote:
             | That's a really cool idea! Is there any writeups or
             | articles about this in more detail?
        
               | UncleEntity wrote:
               | Sounds like 'copy & patch compilation' is a close cousin.
        
             | lolinder wrote:
             | Isn't what you're describing just an interpreter, bytecode
             | or otherwise? An interpreter walks through a data structure
             | in order to determine which bits of precompiled code to
             | execute in which order and with which arguments. In a
             | bytecode interpreter the data structure is a sequential
             | array, in a plain interpreter it's something more
             | complicated like an AST.
             | 
             | Is this just the same as "interpret the query plan" or am I
             | missing something?
        
             | paulddraper wrote:
             | > lots of small building blocks of static machine code
             | precompiled/shipped with DB software binary
             | 
             | You might call those interpreter operations.
        
           | GartzenDeHaes wrote:
           | You can also parse the code into an augmented syntax tree or
           | code DOM and then directly interpret that. This approach
           | eliminates the intermediate code and bytecode machine at the
           | cost of slower interpretation. It's slower due to memory
           | cache and branch prediction issues rather than algorithmic
           | ones.
        
           | fancyfredbot wrote:
           | For simple functions which are not called repeatedly you have
           | to invert this list - interpreted code is fastest and
           | compiling the code is slowest.
           | 
           | The compiled code would still be faster if you excluded the
           | compilation time. It's just that the overhead of compiling it
           | is sometimes higher than the benefit.
        
             | galaxyLogic wrote:
             | I wonder: in actual business scenarios, isn't the SQL fully
             | known before an app goes into production? So couldn't it
             | make sense to compile it all the way down?
             | 
             | There are situations where the analyst enters SQL into the
             | computer interactively. But in those cases the overhead of
             | compiling does not really matter since this is infrequently
             | done, and there is only a single user asking for the
             | operation of running the SQL.
        
           | SQLite wrote:
           | Performance analysis indicates that SQLite spends very little
           | time doing bytecode decoding and dispatch. Most CPU cycles
           | are consumed in walking B-Trees, doing value comparisons, and
           | decoding records - all of which happens in compiled C code.
           | Bytecode dispatch is using less than 3% of the total CPU
           | time, according to my measurements.
           | 
           | So at least in the case of SQLite, compiling all the way down
           | to machine code might provide a performance boost 3% or less.
           | That's not very much, considering the size, complexity, and
           | portability costs involved.
           | 
           | A key point to keep in mind is that SQLite bytecodes tend to
           | be very high-level (create a B-Tree cursor, position a B-Tree
           | cursor, extract a specific column from a record, etc). You
           | might have had prior experience with bytecode that is lower
           | level (add two numbers, move a value from one memory location
           | to another, jump if the result is non-zero, etc). The ratio
           | of bytecode overhead to actual work done is much higher with
           | low-level bytecode. SQLite also does these kinds of low-level
           | bytecodes, but most of its time is spent inside of high-level
           | bytecodes, where the bytecode overhead is negligible compared
           | to the amount of work being done.
        
             | blacklion wrote:
             | Looks like SQLite bytecode is similar to IBM RPG language
             | from "IBM i" platform, which is used directly, without
             | translation from SQL :)
             | 
             | Edit: PRG->RPG.
        
               | wglb wrote:
               | I'm curious. Would you have a pointer to the
               | documentation of PRG language?
        
               | blacklion wrote:
               | Oooops, it is IBM RPG, not PRG! My bad!
               | 
               | There are some links in Wikipedia.
               | 
               | I never used it, but only read big thread about SQL vs
               | RPG on Russian-speaking forum, and there was a ton of
               | examples from person who works with IBM i platform in
               | some big bank. Basic operations look like SQLite
               | bytecodes: open table, move cursor after record with key
               | value "X", get some fields, update some field, plus
               | loops, if's, basic arithmetic, etc.
        
               | kayodelycaon wrote:
               | For those that aren't familiar, IBM i is ye olde AS/400.
               | 
               | To be fair, RPG has been continuously updated and the
               | version I've seen has Java integration.
               | 
               | It's somewhat amusing that a single developer using
               | Python can run circles around entire teams of RPG
               | programmers. Technology marches on. :)
        
             | beeboobaa3 wrote:
             | > Bytecode dispatch is using less than 3% of the total CPU
             | time, according to my measurements
             | 
             | > compiling all the way down to machine code might provide
             | a performance boost 3% or less
             | 
             | This logic doesn't seem sound? Because the application now
             | spends 3% on bytecode dispatch doesn't tell us anything
             | about how long it would instead spend on e.g. interpreting
             | SQL.
        
               | asimpletune wrote:
               | He's saying in the best case scenario that 3% would go to
               | 0. Therefore the reality would probably be even less.
        
               | vlovich123 wrote:
               | Unfortunately you can't do performance analysis this way
               | but I think the overall point that it's a fraction
               | probably still stands as you'd expect the I/O work to
               | dominate.
               | 
               | The reason you can't do the analysis the way that you
               | explicitly stated (which, to be fair, is what was implied) is
               | that when you lower the code to machine code you
               | typically get rid of branches in the execution path.
               | Since branches slow down execution by a disproportionate
               | amount vs how long they themselves take, it's easy to get
               | a >3% boost getting rid of code paths that seem like
               | there's only 3% of room.
               | 
               | What the author also failed to mention is that they've
               | gone on many optimization hunts eking out less than a
               | percent here or there to get a cumulative boost, so 3%
               | isn't anything to sneeze at.
               | 
               | That being said, the broader point which is valid is that
               | the maintenance isn't worth the performance profile that
               | SQLite targets, not to mention that JITs open up an attack
               | vector for security exploits. So the cost vs benefit is
               | squarely in the interpreted side for now.
               | 
               | Edit: Forgot to mention that performance tuning a JIT to
               | work well is also hard - for small queries you'll spend
               | more time compiling the byte code than you would just
               | executing. That's why all the big JS engines do a tiered
               | approach where each tier optimizes a bit more.
        
             | temporarely wrote:
             | A while back I was musing whether it was possible to come up with
             | something resembling the instruction set for a CPU for an
             | abstract relational-engine. Is this basically what SQLite
             | is doing with bytecodes?
        
               | rkrzr wrote:
               | No. The SQLite bytecode is much higher-level than the
               | instruction set of a CPU.
        
             | titzer wrote:
             | Part of the benefit of compiling bytecode (or anything) is
             | specializing code to the context (types, values, etc) in
             | which it appears. While I don't doubt your analysis, it
             | could be the case that the compiled C code in question is full
             | of branches that can be folded away when specialized to the
             | context of the query, such as the structure of the rows,
             | the type and values of columns, etc.
             | 
             | Basically all of what you are saying about high-level
             | bytecodes applies to dynamic languages, too. But they
             | benefit highly from specializing each bytecode given static
             | and dynamic context, and shortcutting dataflow through
             | local variables.
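             | 
             | As a tiny sketch of that folding effect (hypothetical value
             | representation, nothing to do with SQLite's actual code): a
             | generic comparison branches on type tags for every row,
             | while a version specialized to the known column types has
             | no branches left to mispredict.
             | 
             |       #include <stdint.h>
             |       #include <stdio.h>
             |       #include <string.h>
             |       
             |       /* Hypothetical tagged value, for illustration only. */
             |       typedef struct {
             |           int type;
             |           union { int64_t i; double r; const char *s; } u;
             |       } Value;
             |       enum { T_INT, T_REAL, T_TEXT };
             |       
             |       /* Generic path: type branches taken on every call. */
             |       static int cmp_generic(const Value *a, const Value *b) {
             |           if (a->type == T_INT && b->type == T_INT)
             |               return (a->u.i > b->u.i) - (a->u.i < b->u.i);
             |           if (a->type == T_REAL && b->type == T_REAL)
             |               return (a->u.r > b->u.r) - (a->u.r < b->u.r);
             |           return strcmp(a->u.s, b->u.s);
             |       }
             |       
             |       /* Specialized path a compiler could emit once the
             |        * query's column types are known: branches fold away. */
             |       static int cmp_int64(int64_t a, int64_t b) {
             |           return (a > b) - (a < b);
             |       }
             |       
             |       int main(void) {
             |           Value x = { T_INT, { .i = 3 } }, y = { T_INT, { .i = 7 } };
             |           printf("%d %d\n", cmp_generic(&x, &y), cmp_int64(3, 7));
             |           return 0;
             |       }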
        
             | paulddraper wrote:
             | > A key point to keep in mind is that SQLite bytecodes tend
             | to be very high-level
             | 
             | That's right. "Bytecode" is a spectrum. SQLite's bytecode
             | is higher-level than e.g. WASM, JVM, Python, etc. (Notably,
             | because the original source code is higher-level.)
        
             | wolf550e wrote:
             | It is possible to construct a worst case scenario for
             | bytecode execution, for example very complex expressions in
             | the WHERE and/or SELECT clauses that compute values, and a
             | query plan that performs a full table scan over a table
             | with say 100 million rows that is cached in RAM (or maybe
             | use generate_series, whatever works best).
             | 
             | Computing the expressions should dominate execution time,
             | right?
             | 
             | Then, to compare against the best possible case, we can
             | write a custom C program that uses sqlite internals to
             | perform the same task (full scan of the table, extract
             | values from row) and does not use the bytecode VM and
             | computes the complex expressions in regular C code (e.g. a
             | function that accepts floats and returns a float or
             | whatever).
             | 
             | Then comparing the two implementations will tell us how
             | much faster sqlite can be if it had a "perfect" JIT.
        
             | cryptonector wrote:
             | > Most CPU cycles are consumed in walking B-Trees, doing
             | value comparisons, and _decoding records_
             | 
             | Emphasis added. This is because of SQLite3's varint
             | encoding method for numbers. Performance-wise it was
             | probably a mistake, though it's a form of compression,
             | which might have paid off in terms of space.
             | 
             | (I seem to remember seeing something, possibly by you,
             | about this before.)
             | 
             | I wonder if it would be possible to replace the varint
             | encoding... Yes, it'd be a compatibility break in that
             | older SQLite3s couldn't open newer DBs.
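             | 
             | For reference, the decode is roughly this shape (a sketch,
             | not SQLite's actual routine): big-endian, 7 bits per byte
             | with the high bit meaning "more bytes follow", and a 9th
             | byte, if reached, contributing a full 8 bits.
             | 
             |       #include <stdio.h>
             |       
             |       /* Returns the number of bytes consumed (1..9). */
             |       static int varint_get(const unsigned char *p,
             |                             unsigned long long *out) {
             |           unsigned long long v = 0;
             |           for (int i = 0; i < 8; i++) {
             |               v = (v << 7) | (p[i] & 0x7f);
             |               if ((p[i] & 0x80) == 0) { *out = v; return i + 1; }
             |           }
             |           *out = (v << 8) | p[8];  /* 9th byte: all 8 bits */
             |           return 9;
             |       }
             |       
             |       int main(void) {
             |           unsigned char buf[] = { 0x81, 0x00 };  /* encodes 128 */
             |           unsigned long long v;
             |           int n = varint_get(buf, &v);
             |           printf("%llu (%d bytes)\n", v, n);  /* 128 (2 bytes) */
             |           return 0;
             |       }
             | 
             | Every record column decode pays some version of that
             | per-byte loop, which is presumably part of why record
             | decoding shows up in the profile.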
        
           | AtlasBarfed wrote:
           | optimizing runtimes (usually bytecode) can beat statically
           | compiled code, because they can profile the code over time,
           | like the JVM.
           | 
           | ... which isn't going to close the cold startup execution
           | gap, which all the benchmarks/lies/benchmarks will measure,
           | but it is a legitimate thing.
           | 
           | I believe Intel CPUs actually sort of are #2. All your x86
           | instructions are converted to micro-ops, sometimes optimized
           | on the fly (I remember Intel discussing "micro-op fusion" a
           | decade or so ago).
        
             | ehaliewicz2 wrote:
             | Yeah, for example, movs between registers are generally
             | effectively no-ops and handled by the register renaming
             | hardware.
        
           | vmchale wrote:
           | APL interpreters are tree-walking, but they're backed by C
           | procedures. Then GCC optimizes quite well and you get
           | excellent performance! With stuff like vector instructions.
           | 
           | Getting on par with GCC/Clang with your own JIT is pretty
           | hefty.
        
         | coolandsmartrr wrote:
         | I am amazed that the author (D. Richard Hipp) made an effort
         | to find and respond to a tweet that was neither (1)
         | directed/"@tted" at him nor (2) written in his native language
         | of English (the original tweet is in Japanese[1]).
         | 
         | [1] https://twitter.com/gorilla0513/status/1784623660193677762
        
           | bena wrote:
           | He might have a google alert for sqlite.org
        
       | alex_smart wrote:
       | I recently implemented my own expression evaluator in java for
       | in-memory data frames, and once you think about doing that
       | deeply, you very quickly understand the need for bytecode. If you
       | directly evaluate the expression using a tree representation, you
       | basically have to do a whole lot of branching (either via switch
       | statements or polymorphism) for every single line of useful
       | operation. Yes, the branch predictor kicks in and it means that
       | your code wouldn't be as slow as if it didn't, but it is still
       | measurably slower than if you converted the expression into
       | bytecode once and just ran that on all rows instead.
        
         | eklavya wrote:
         | Bytecode will only impact packing, so it's more efficient RAM-,
         | cache- and CPU-wise. But I don't understand how it would help
         | with branching? You still have to make the same decisions? As
         | in, the bytecode executor still needs to do different things
         | based on the op code; it's not in hardware.
        
           | hickelpickle wrote:
           | There is threaded bytecode as well, which uses direct jumping
           | vs a switch for dispatch. This can improve branch prediction,
           | though it is a debated topic and may not offer much
           | improvement for modern processors.
        
             | immibis wrote:
             | How does it know where to jump to?
        
               | chuckadams wrote:
               | The jump target is compiled into the bytecode, so rather
               | than return to the big switch statement, it jumps
               | straight to the next opcode's implementation. The process
               | is called "direct threading". These days a decent switch-
               | based interpreter should fit in cache, so I'm not sure
               | direct threading is much of a win anymore.
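               | 
               | A minimal sketch of the idea, using the GCC/Clang
               | labels-as-values extension (toy opcodes, not any real
               | interpreter): each slot of the program holds the address
               | of its handler, and every handler ends by jumping
               | straight to the next one.
               | 
               |       #include <stdio.h>
               |       
               |       int main(void) {
               |           /* "Program": push 1, push 1, add, print, halt */
               |           void *prog[] = { &&op_push1, &&op_push1, &&op_add,
               |                            &&op_print, &&op_halt };
               |           void **pc = prog;
               |           int stack[16], *sp = stack;
               |       
               |           goto **pc++;                    /* start the chain */
               |       
               |       op_push1: *sp++ = 1;                goto **pc++;
               |       op_add:   sp--; sp[-1] += *sp;      goto **pc++;
               |       op_print: printf("%d\n", sp[-1]);   goto **pc++;
               |       op_halt:  return 0;
               |       }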
        
             | kaba0 wrote:
             | Do you have perhaps some links/references on that?
             | 
             | I have once tried benchmarking it by writing a tiny VM
             | interpreter and a corresponding threaded one with direct
             | jumps in Zig (which can force inline a call, so I could do
             | efficient direct jumps) and I have - to my surprise -
             | found that the naive while-switch loop was faster, even
             | though the resulting assembly of the second approach seemed
             | right.
             | 
             | I wasn't sure if I saw it only due to my tiny language and
             | dumb example program, or if it's something deeper. E.g. the
             | JVM does use direct threaded code for their interpreter.
        
           | alex_smart wrote:
           | Consider evaluating an expression like "col1 + col2 == col3"
           | 
           | A tree based evaluator has to do something like -
           | 
           |       for (int i = 0; i < df.size(); i++) {
           |           // switch on expression type (column, binary operator, unary operator etc)
           |           // switch on expression operator type (add, subtract, equals etc)
           |           // switch on column type (int, long, float etc)
           |           // actual evaluation
           |       }
           | 
           | whereas a bytecode based evaluator is able to run something
           | equivalent to -
           | 
           |       for (int i = 0; i < df.size(); i++) {
           |           result[i] = (df.getLong(column1Index, i) + df.getLong(column2Index, i))
           |                           == df.getLong(column3Index, i);
           |       }
        
             | naasking wrote:
             | You still have to branch on opcode selection which you're
             | omitting in your translation. Branching on "expression
             | type" and column type in your first example is also
             | unnecessary. Bytecodes have different opcodes to operate on
             | different types and different arities, so in both cases you
             | have only one unpredictable branch for each
             | operation/opcode.
             | 
             | The main benefit of bytecodes is then cache friendliness,
             | by removing all of those unpredictable loads and stores to
             | obtain the expression details (opcode, arguments, etc.).
             | All of those are inlined into the bytecode array which is
             | already in cache.
        
               | alex_smart wrote:
               | > Branching on "expression type" and column type in your
               | first example is also unnecessary
               | 
               | It's not?
        
               | naasking wrote:
               | It is unnecessary, only branching on expression operator
               | type is strictly necessary. You're trying to abstract too
               | much in your AST which yields an inefficient interpreter.
               | I can invent an inefficient bytecode too, but that
               | wouldn't be a correct evaluation of how fast bytecode
               | interpretation could be when done right.
        
         | sitkack wrote:
         | You should look at https://janino-compiler.github.io/janino/ - it
         | can compile Java into class files in memory that can be
         | directly executed.
        
           | alex_smart wrote:
           | I found this by looking at the Spark source code to figure
           | out what they were doing under the hood to solve this
           | problem. :)
        
       | adontz wrote:
       | Looks like SQLite could benefit from copy-and-patch JIT compiler.
        
         | Someone wrote:
         | I think that would make it cross a border where 'Lite' no
         | longer applies.
         | 
         | It also would be quite a challenge given their long-term-
         | support statement (https://www.sqlite.org/lts.html):
         | 
         |  _"Cross-platform Code - SQLite runs on any platform with an
         | 8-bit byte, two's complement 32-bit and 64-bit integers, and a
         | C compiler. It is actively tested on all currently popular CPUs
         | and operating systems. The extreme portability of the SQLite
         | code and file format will help it remain viable on future
         | platforms."_
        
           | adontz wrote:
           | There is no implied requirement to support JIT everywhere.
        
         | samatman wrote:
         | I think this is unlikely for SQLite (other comments cover why
         | it probably wouldn't happen even if it were likely to benefit),
         | but I happened to have the copy and patch paper open in a tab,
         | so I'll take the opportunity to share it here.
         | 
         | https://sillycross.github.io/assets/copy-and-patch.pdf
         | 
         | It has great potential for many applications, if not this
         | particular one.
        
       | bambax wrote:
       | Typo: _A prepared statement is an object that represents the
       | steps needed_ [to] _accomplish the input SQL_.
        
         | sgbeal wrote:
         | Fixed, thank you for the report. It won't be visible on the
         | site until the next time it's rebuilt.
        
       | jiehong wrote:
       | SQLite EXPLAIN plan is indeed represented as a table, but I don't
       | find it necessarily that much easier to read and understand.
       | 
       | I often miss having the cardinality and the number of bytes read
       | for each part of the query like Oracle query plans provide.
       | 
       | Or is everyone really that happy with SQLite query plans?
        
         | bvrmn wrote:
         | There is EXPLAIN QUERY PLAN, which outputs the usual high-level
         | plan description. But there are no easily reached disk/cache
         | usage stats.
        
       | manx wrote:
       | I'm wondering if one could write this bytecode directly (or with
       | a higher level imperative language) instead of SQL. Often, the
       | programmer knows exactly which index lookups need to happen in a
       | loop, while it seems like a burden to express that in SQL. This
       | might also be an opportunity to create a different type-safe DSL
       | for database access.
        
         | thayne wrote:
         | I was wondering the same thing. And in particular if a new
         | query language that avoided many of the pitfalls of SQL could
         | compile down to that bytecode and avoid having to deal with SQL
         | as an intermediate representation. Also, if you can compile to
         | bytecode ahead of time, then that could save the time needed to
         | parse the text of a sql query to bytecode at runtime.
        
           | nraynaud wrote:
           | I'm putting my wish list here:
           | 
           | - be able to put the projection in a variable and reuse it
           | (and I think ORM people might love it)
           | 
           | - have a quick way to forward the non-aggregated fields of
           | the projection to a GROUP BY (maybe with the aforementioned
           | variables)
        
             | TachyonicBytes wrote:
             | The DuckDB API I was talking about above seems to already
             | meet your use-cases?
             | 
             | - Does this[1] solve the group by wish?
             | 
             | - Depending on what you mean by projection, maybe this[2]
             | or this[3]?
             | 
             | [1] https://duckdb.org/docs/api/python/relational_api#aggre
             | gatee...
             | 
             | [2] https://duckdb.org/docs/api/python/relational_api#proje
             | ctexp...
             | 
             | [3] https://duckdb.org/docs/api/python/expression#column-
             | express...
        
           | forinti wrote:
           | A client-server database would hash the SQL and cache the
           | output of the parser. The end result is the same.
           | 
           | Maybe SQLite could have a similar mechanism, but with the
           | cache kept on disk or in an external memory cache.
        
             | gwbas1c wrote:
             | In the C# library for SQLite, a DbCommand with
             | parameterized SQL can be reused, thus reusing the bytecode.
        
           | astrobe_ wrote:
           | > Also, if you can compile to bytecode ahead of time, then
           | that could save the time needed to parse the text of a sql
           | query to bytecode at runtime.
           | 
           | I think we already kind of have that; one can prepare a
           | query/statement and then use it multiple times. Regex
           | engines also do that, except that in the case of SQLite one
           | can bind different parameter values.
           | 
           | Programmers worried about SQL lexing/parsing times can
           | compile their most-used queries once and for all at program
           | startup. Plus, one can sometimes break a query with
           | annoyingly variable parts into smaller queries glued with
           | SQLite API calls.
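           | 
           | In the C API that looks roughly like this (error handling
           | omitted; the table and query are made up): prepare once,
           | then bind/step/reset as many times as you like.
           | 
           |       #include <sqlite3.h>
           |       #include <stdio.h>
           |       
           |       int main(void) {
           |           sqlite3 *db;
           |           sqlite3_stmt *stmt;
           |           sqlite3_open(":memory:", &db);
           |           sqlite3_exec(db,
           |               "CREATE TABLE users(id INTEGER PRIMARY KEY, name TEXT);"
           |               "INSERT INTO users VALUES (1,'ada'),(2,'bob');",
           |               0, 0, 0);
           |       
           |           /* Parse/plan (compile to bytecode) once... */
           |           sqlite3_prepare_v2(db,
           |               "SELECT name FROM users WHERE id = ?1", -1, &stmt, 0);
           |       
           |           /* ...then reuse the statement with new bindings. */
           |           for (int id = 1; id <= 2; id++) {
           |               sqlite3_bind_int(stmt, 1, id);
           |               while (sqlite3_step(stmt) == SQLITE_ROW)
           |                   printf("%s\n",
           |                          (const char *)sqlite3_column_text(stmt, 0));
           |               sqlite3_reset(stmt);  /* rewind the VM, keep the program */
           |               sqlite3_clear_bindings(stmt);
           |           }
           |           sqlite3_finalize(stmt);
           |           sqlite3_close(db);
           |           return 0;
           |       }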
        
         | wruza wrote:
         | I was advocating for it for decades, but everyone dismisses it
         | with "you don't know better than rdbms". That's almost a
         | religion. Despite the fact that most of the tables most people
         | have are never that big (except for a few OLTP systems that
         | you ought to leave to dedicated SQL server wizards anyway) and
         | you usually
         | _do have_ the idea how it should work. SQL is a cool solution
         | for a set of problems that has a very non-zero xor with a set
         | of problems that we actually have. And most of our complex
         | queries are trivially expressible in imperative loops. Index
         | lookup and a complex plan building are key features of an
         | rdbms, everything else it steals from you and forces to use an
         | arcane inconvenient language that in practice isn't even a
         | standard. Make plan building and indexing core ops and leave
         | everything else to some sort of a general purpose bytecode,
         | everyone will be happier and will create dozens of DSLs for all
         | their needs.
        
           | lqet wrote:
           | > I was advocating for it for decades, but everyone dismisses
           | it with "you don't know better than rdbms". That's almost a
           | religion.
           | 
           | Which is simply not true. Query optimization is done
           | heuristically, for the simple reason that you usually _need
           | to run the query_ to get the information required for
           | "perfect" optimization. If the RDBMS really knew better, it
           | wouldn't offer query hints.
        
             | kayodelycaon wrote:
             | Postgres doesn't offer query hints. ;)
        
         | TachyonicBytes wrote:
         | Something akin to this is available[1] in DuckDB, which itself
         | started as a "shell" over SQLite. Using Python, you can compose
         | your SQL plans as you like.
         | 
         | I recall a previous discussion about the SQLite bytecode and
         | potential alternatives, where the main idea was that, if SQLite
         | had gone the other way, you could have a much more general
         | engine, where you could achieve something like DuckDB itself
         | just as a set of extensions.
         | 
         | Reading the draft, it doesn't seem that extensibility of SQLite
         | was a main factor in deciding. Maybe this trade-off also covers
         | extensions somehow, which would be nice to see in the final
         | document.
         | 
         | [1] https://duckdb.org/docs/api/python/expression
        
         | asvitkine wrote:
         | The main downside is now you're making the bytecode an API
         | which means all future changes need to be backwards compatible.
        
       | Retr0id wrote:
       | > Hence, no tree-of-objects database engine provides the level of
       | detail in their "EXPLAIN" output that SQLite provides.
       | 
       | I've seen some cool graphical visualisation tools for postgres
       | http://tatiyants.com/postgres-query-plan-visualization/
        
       | chucke1992 wrote:
       | Well SQLite is basically a religious cult at this point. I am
       | always impressed by it.
        
       | euroderf wrote:
       | Doesn't the Go language also execute bytecode? There's gotta be
       | a hyperefficient hybrid lurking in there somewhere.
        
         | zadokshi wrote:
         | Go compiles to binary files and libraries. It's more like C
         | than Java.
        
         | euroderf wrote:
         | Aah OK my bad.
         | 
         | https://github.com/teh-cmc/go-internals/blob/master/chapter1...
        
       | matharmin wrote:
       | What I find interesting is that this approach means the query
       | plan is chosen when "compiling" the statement, not when running
       | it. However, there are still some cases where SQLite will
       | recompile the statement while running, such as when the schema
       | has changed.
        
       | groue wrote:
       | Out of curiosity, I asked on the SQLite forum how the query
       | optimizer fits in this schema:
       | https://sqlite.org/forum/forumpost/93f9bfcec0
        
       | carterschonwald wrote:
       | So one thing that's not often talked about is that with the right
       | invariants/program transformations, the two approaches are
       | equivalent :)
        
       | zoomablemind wrote:
       | How does this reflect on SQLite's possibility of parallelizing
       | some of the operations, like aggregates, for example?
        
         | Ericson2314 wrote:
         | At the bottom, it says that this is a downside.
        
       | dittonedo wrote:
       | I just wonder why hackers can't write homoiconic (if that's the
       | word) languages that take array tokens and provide results like
       | so. I often wonder: what if sqlite had also been designed to
       | take its statements like ('select '* 'from 'table)? It would
       | also have been better for third-party libraries implementing
       | their ORMs.
       | 
       | All of this (bytecode, ASTs) is done by almost every programming
       | language I have ever explored (Python, Ruby, Lua, ...). It would
       | have been awesome if they were easily parseable, I guess.
       | 
       | Most ORM implementations are very error-prone because of those
       | design decisions (especially for SQL databases).
        
       | pizlonator wrote:
       | That's awesome, I didn't know SQLite had a bytecode.
       | 
       | Based on my experience writing interpreters, I know that bytecode
       | is faster to interpret.
       | 
       | (Source: I wrote JavaScriptCore's interpreter and its optimizing
       | JITs. I've also worked on other language implementations. I've
       | seen the AST vs bytecode thing play out more than once.)
        
         | andruc wrote:
         | Faster to interpret than what?
        
           | jjice wrote:
           | I assume they meant faster to interpret than a tree of
           | objects, since that's the other concept discussed in the
           | article.
        
             | pizlonator wrote:
             | Yes
        
       | pdubroy wrote:
       | I think most people associate bytecode VMs / interpreters with
       | general-purpose programming languages, but it's a surprisingly
       | useful concept in other contexts.
       | 
       | Sometimes bytecode VMs appear in unexpected places! A few that
       | I'm aware of:
       | 
       | - eBPF, an extension mechanism in the Linux kernel
       | 
       | - the DWARF expression language, with interpreters in debuggers
       | like GDB and LLDB
       | 
       | - the RAR file format, which includes a bytecode encoding for
       | custom data transformation
       | 
       | More here: https://dubroy.com/blog/bytecode-vms-in-surprising-
       | places/
        
         | WickedSmoke wrote:
         | Bytecode is great for tiny domain languages and I use them in
         | many projects.
         | 
         | Ultima IV used one to animate sprites on its title screen map.
         | For the next version of XU4 I implemented three bytecode
         | interpreters to script the entire title sequence. There's a
         | high level presentation interpreter, a GPU rendering
         | interpreter, and one for the Ultima IV bytecode.
        
         | xymostech wrote:
         | The original TeX fonts stored their metrics in TFM (short for
         | TeX font metrics) files, which contain bytecode for
         | calculating ligatures and kerning between characters. I
         | learned about that when I tried reading the files myself.
         | 
         | From what I can tell, modern fonts using OpenType just have
         | tables to accomplish something similar now, in the form of the
         | GSUB and GPOS tables?
         | 
         | Documentation for the TFM format here:
         | https://tug.org/TUGboat/Articles/tb02-1/tb02fuchstfm.pdf
         | (search for lig/kern)
        
           | atombender wrote:
           | TrueType (and OpenType, which is an evolution of TT)
           | absolutely includes a bytecode instruction set:
           | https://developer.apple.com/fonts/TrueType-Reference-
           | Manual/...
        
         | astrobe_ wrote:
         | Also regular expression engines.
        
         | omoikane wrote:
         | I believe these fall under "interpreter pattern":
         | 
         | https://en.wikipedia.org/wiki/Interpreter_pattern
         | 
         | https://gameprogrammingpatterns.com/bytecode.html
         | 
         | https://wiki.c2.com/?InterpreterPattern
         | 
         | I first read about it from the "Design Patterns" book, which I
         | would recommend to everyone.
        
         | Narishma wrote:
         | They are frequently used in games to run scripting languages.
        
         | UncleEntity wrote:
         | Yep, Greenspun's tenth rule:
         | 
         | Any sufficiently complicated C or Fortran program contains an
         | ad hoc, informally-specified, bug-ridden, slow implementation
         | of half of Common Lisp.
        
       | naasking wrote:
       | On incrementally running bytecode:
       | 
       | > This is more difficult to achieve in a tree-of-objects design.
       | When the prepared statement is a tree-of-objects, execution is
       | normally accomplished by walking the tree. To pause the statement
       | in the middle of a computation means unwinding the stack back up
       | to the caller, all the while saving enough state to resume
       | evaluation where it last left off. This is not impossible to do,
       | but it is sufficiently difficult that I have never seen it
       | actually done.
       | 
       | I don't think it's too difficult. You can avoid stack unwinding
       | by just not using the native stack in the first place and using
       | an explicit stack. This is one step towards a bytecode VM though,
       | as you're virtualizing the stack. Definitely easier in bytecode
       | though as it already makes all of that state explicit.
       | 
       | For the article overall, isn't it a bit trivial to see that a
       | bytecode interpreter must be faster and more compact? Write out
       | the expression tree to an array as a post-order traversal and this
       | is basically an executable bytecode. It will be more compact
       | because you no longer need hidden malloc/object headers and
       | pointers for subexpressions (they're preceding indices!), and it
       | will be faster because it's basically all in cache, so no cache
       | misses from pointer chasing.
       | 
       | I can't imagine a situation in which the tree would ever be
       | faster for execution, in general. At best, the overhead might be
       | just too small to care about. We use trees in compilers because
       | we need to perform incremental rewrites. This is trivial with
       | trees but can get expensive when expanding and contracting
       | arrays. Execution doesn't need to perform rewrites though, it
       | needs linear, cache-friendly access patterns.
       | 
       | However, the article does make a good point that the query plan
       | may be tuned on the fly during execution, eg. the small rewrites
       | I mentioned above occurring while executing. I'm not fully
       | convinced this couldn't be done in bytecode though.
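       | 
       | As a toy illustration of that flattening (made-up opcodes): the
       | tree for (1 + 2) * 4 written out post-order into a flat array
       | and executed with an explicit stack - no object headers, no
       | pointer chasing, and pausing/resuming is just saving pc and sp.
       | 
       |       #include <stdio.h>
       |       
       |       enum { PUSH, ADD, MUL, HALT };
       |       typedef struct { int op; int arg; } Ins;
       |       
       |       static int run(const Ins *code) {
       |           int stack[32], sp = 0;
       |           for (int pc = 0; ; pc++) {
       |               switch (code[pc].op) {
       |                 case PUSH: stack[sp++] = code[pc].arg;      break;
       |                 case ADD:  sp--; stack[sp-1] += stack[sp];  break;
       |                 case MUL:  sp--; stack[sp-1] *= stack[sp];  break;
       |                 case HALT: return stack[sp-1];
       |               }
       |           }
       |       }
       |       
       |       int main(void) {
       |           const Ins prog[] = { {PUSH,1}, {PUSH,2}, {ADD,0},
       |                                {PUSH,4}, {MUL,0},  {HALT,0} };
       |           printf("%d\n", run(prog));  /* prints 12 */
       |           return 0;
       |       }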
        
       ___________________________________________________________________
       (page generated 2024-04-30 23:01 UTC)