[HN Gopher] Why SQLite Uses Bytecode
___________________________________________________________________
Why SQLite Uses Bytecode
Author : todsacerdoti
Score : 703 points
Date : 2024-04-30 02:48 UTC (20 hours ago)
(HTM) web link (sqlite.org)
(TXT) w3m dump (sqlite.org)
| chrisaycock wrote:
| SQLite's design docs were the first time I had seen a database
| use a virtual machine instead of walking a tree. I later noticed
| VMs in libraries, embedded DSLs, and other applications outside
| of large general-purpose programming languages. That really drove
| home for me that VMs could be anywhere and were often a useful
| step in handling a user's expressions.
| runlaszlorun wrote:
| SQLite is the first one I've looked at the internals of. Do
| others walk an AST of the query instead?
| __s wrote:
| Yes, postgres for example. It maps pretty close to what you
| see from `explain` where the ops are pretty high level,
| reducing interpreter overhead. JIT's big gain is speeding up
| reading values out of row data
| hn_throwaway_99 wrote:
| FTA:
|
| > Tree-Of-Objects - The input SQL is translated into a tree of
| objects that represent the processing to be done. The SQL is
| executed by walking this tree. This is the technique used by
| MySQL and PostgreSQL.
| Sesse__ wrote:
| Note that this only holds true for fairly recent MySQL
| (MySQL 8.x, not including the oldest 8.0.x releases). 5.7
| and older, and by extension MariaDB, instead models pretty
| much everything as a large fixed-function recursive
| function that calls itself for each new table in the join,
| plus some function pointers for handling of GROUP BY and
| such.
|
| TBH, when it comes to query execution (A JOIN B JOIN C
| GROUP BY ...), I think the difference between SQLite's
| bytecode and a tree of iterators (the classical Volcano
| executor) is fairly small in practice; they are quite
| interchangeable in terms of what they can do, and similar
| when it comes to performance. The difference between
| bytecode and tree structure is much larger when it comes to
| evaluation of individual expressions (a + b * cos(c)),
| especially since that involves much more of a type system;
| that is more obviously in the "bytecode is the way to go"
| camp to me.
| runlaszlorun wrote:
| Sorry, I'd read the article a couple years ago and forgot he
| goes into depth on the other approaches. My bad.
| whateveracct wrote:
| Everything is either a compiler or interpreter
| Zambyte wrote:
| That's why SICP is often brought up in contexts that are not
| explicitly about interpreters :)
| bane wrote:
| VMs really can be. They have a long history in code
| portability. In the old days it really wasn't uncommon to use
| an approach centered around some interpretable byte code
| running in a vm, where the reusable vm was all that needed
| porting to different architectures and operating systems. This
| all happened well before Java.
|
| It was really big in gaming, Zork, Sierra games, LucasArts
| games, and even a few more "action" games like Flashback were
| all designed around VMs to make porting somewhat sane. Running
| the byte code is usually not the performance bottleneck in
| these cases but drawing stuff to the screen is, and the VM had
| to handle that.
| runlaszlorun wrote:
| And Pascal p-code! Not the first, I've heard, but I believe
| it's close to being the first.
| pjmlp wrote:
| One of the first was Burroughs B5000,
|
| https://en.wikipedia.org/wiki/Burroughs_Large_Systems
|
| It used an almost memory-safe systems language, ESPOL, zero
| Assembly, all CPU capabilities were exposed via intrinsics,
| one of the first recorded uses of unsafe code blocks, there
| was tagging and capabilities support, the CPUs were
| microcoded. All of this in 1961, a decade before C came to
| be.
|
| ESPOL was quickly replaced by NEWP, although there is very
| little data on when it happened, probably a couple of years
| later.
|
| Nowadays it is still sold by Unisys under the guise of
| being a mainframe system for those that value security
| above all, as ClearPath MCP, and you can get NEWP manual.
|
| https://www.unisys.com/solutions/enterprise-
| computing/clearp...
|
| https://public.support.unisys.com/aseries/docs/ClearPath-
| MCP...
| kragen wrote:
| the b5000 was one of the first _non_-virtual stack
| machines, but its instruction set isn't any more of a
| virtual machine than the 8088's or the pdp-10's. there
| were a number of interpreted stack languages in the 50s,
| though not nearly as many as there would be later
| pjmlp wrote:
| When one does digital archeology, it is quite common to
| see Assembly referred to as bytecode, when the CPUs are
| actually interpreters written in microcode.
|
| Another example, all the Xerox PARC workstations, which
| loaded the respective interpreter (Smalltalk, Interlisp,
| Mesa, Mesa/Cedar) into the CPU as first step during the
| boot process.
| whartung wrote:
| According to 5s on Google, the x86 has always been
| microcoded. I guess the argument is that "machine
| language" is the public API of the CPU, and "assembly
| language" is the higher level script used to, mostly,
| directly create machine language.
|
| Is Intel microcode able to be changed by anyone? I
| understand that even CPUs get field updates nowadays.
| kragen wrote:
| i have never seen assembly, or the code generated from
| it, referred to as 'bytecode' unless there was no
| physical hardware or microcode that could interpret it --
| _except_ in the case of smalltalk, whose bytecode was, as
| you say, interpreted by microcode on the alto and the
| d-machines. but possibly your archeology has delved into
| cultures mine has not? i'd be very interested in reading
| more, if you happen to have any links handy
| gpderetta wrote:
| Don't know about Flashback, but Another World famously was VM
| based.
| bane wrote:
| I might be wrong, but I've read on other project pages that
| it is. https://github.com/sylefeb/a5k
| hiAndrewQuinn wrote:
| VMs are a persistently underrated tool by people who think the
| conversation begins and ends at VirtualBox. They are a
| phenomenal way to constrain the surface area of what one's own
| code has to target, and that's always a plus.
| kqr wrote:
| Or indeed any time complexity needs to be layered. A VM design
| doesn't even have to have an explicit bytecode compilation
| stage -- you can write an interpreter that runs as the
| instructions are issued.
|
| David Parnas talks a lot about this as a way to achieve reliable
| modularisation: https://two-wrongs.com/deliberate-abstraction.html
| anon291 wrote:
| Stack-based VMs, like SQLite's (I think), ARE trees. A stack
| based VM's bytecode (without DUP and POP) is just the post-
| order depth-first traversal of the corresponding expression
| tree. With DUP you have a connected DAG. With POP you
| have a DAG with one or more components. With loops you
| have a general (possibly cyclic) graph.
|
| When looked at this way, a VM makes the most sense actually
| because a pointer-heavy tree implementation is just terrible
| for caching and is super wasteful. Also, most SQL plans are
| trees (Until you get to WITH RECURSIVE).
|
| I'm working on a CEP in C right now and a stack-based bytecode
| is simply the most 'obvious' way to represent a tree when you
| don't want to have to deal with lots of memory allocations.
| Just pre-allocate a buffer large enough and increment and go.
|
| Either way, try taking something like the following and
| printing an expression tree like the one below:
|
|     LITERAL 2  LITERAL 3  LITERAL 4  MUL  PLUS
|
| now, write something to produce on a terminal, the following:
|
|     Plus
|       Mul
|         3
|         4
|       2
|
| You should be able to do this in a simple iteration through the
| ops. Try it! Then try it with the variations above.
|
| Now, try writing ops to manipulate and rewrite it. It's
| actually really a fun way to represent trees.
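|
| Here's a minimal C sketch of that exercise (illustrative opcode
| names and fixed arities, nothing like SQLite's actual opcodes):
| one forward pass links each operator to the indices of its
| operands, then a recursive print walks the tree that was sitting
| in the flat array all along.
|
|     #include <stdio.h>
|
|     typedef enum { LITERAL, MUL, PLUS } Op;
|     typedef struct { Op op; int val; int lhs, rhs; } Insn;
|
|     /* One forward pass over the postfix ops: a literal pushes its own
|        index, an operator pops the indices of its two operands, so each
|        operator ends up knowing where its children live in the array. */
|     static void link_children(Insn *code, int len) {
|         int stack[64], sp = 0;
|         for (int i = 0; i < len; i++) {
|             if (code[i].op != LITERAL) {
|                 code[i].rhs = stack[--sp];   /* top of stack = right operand */
|                 code[i].lhs = stack[--sp];
|             }
|             stack[sp++] = i;
|         }
|     }
|
|     static void print_tree(const Insn *code, int i, int depth) {
|         printf("%*s", depth * 2, "");
|         if (code[i].op == LITERAL) { printf("%d\n", code[i].val); return; }
|         printf("%s\n", code[i].op == MUL ? "Mul" : "Plus");
|         print_tree(code, code[i].lhs, depth + 1);
|         print_tree(code, code[i].rhs, depth + 1);
|     }
|
|     int main(void) {
|         Insn code[] = { {LITERAL, 2}, {LITERAL, 3}, {LITERAL, 4},
|                         {MUL}, {PLUS} };
|         link_children(code, 5);
|         print_tree(code, 4, 0);   /* last instruction is the root */
|         return 0;
|     }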
| Scaevolus wrote:
| SQLite's VM is register-based, not stack-based.
| ojosilva wrote:
| Does SQLite's VM have an API? I mean, one where I can build
| and issue opcodes into directly.
| OskarS wrote:
| Not a public one, because they want to have the freedom
| to change how it works. But if you type in `EXPLAIN
| CREATE TABLE x(y);` or `EXPLAIN SELECT 3;` or whatever
| into `sqlite3`, you can see it. And it's reasonably
| straightforward to dig into the source code (SQLite is
| very well-written C) to hook that up yourself if you
| really wanted to.
| bch wrote:
| It's been both - was stack, converted to register[0][1].
|
| [0] https://www.sqlite.org/vdbe.html
|
| [1] https://www.sqlite.org/src/info/051ec01f2799e095
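|
| For a feel for the difference, a tiny hypothetical illustration
| (not SQLite's actual opcodes): the expression 2 + 3 * 4, first as
| stack bytecode, then as three-address register bytecode.
|
|     ; stack form: operands are implicit (top of an operand stack)
|     PUSH 2
|     PUSH 3
|     PUSH 4
|     MUL
|     ADD
|
|     ; register form: every instruction names its operands, which
|     ; removes push/pop traffic at the cost of wider instructions
|     LOAD r1, 2
|     LOAD r2, 3
|     LOAD r3, 4
|     MUL  r2, r2, r3        ; r2 = r2 * r3
|     ADD  r1, r1, r2        ; r1 = r1 + r2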
| notpeter wrote:
| This was the question I was going to ask! I've recently
| been diving into Lua 5.x internals and have been extremely
| impressed with Lua's register-based bytecode
| implementation. Lua has a stable byte code interface
| between patch releases (5.4.0 -> 5.4.1, etc) but not
| between major/minor revisions (5.3 -> 5.4). SQLite on the
| other hand does not consider this a public interface at
| all!
|
| > "Remember: The VDBE opcodes are not part of the interface
| definition for SQLite. The number of opcodes and their
| names and meanings change from one release of SQLite to the
| next." https://www.sqlite.org/opcode.html#the_opcodes
|
| For anyone interested in understanding how a register-based
| VM operates I highly recommend:
|
| A No-Frills Introduction to Lua 5.1 VM Instructions by
| Kein-Hong Man.
| https://www.mcours.net/cours/pdf/hasclic3/hasssclic818.pdf
| Sesse__ wrote:
| > Also, most SQL plans are trees (Until you get to WITH
| RECURSIVE).
|
| WITH RECURSIVE can also generally be implemented tree-style,
| you just loop a bunch in one of the iterators.
|
| There are some databases, like Umbra, which can have DAG
| query plans for the benefit of more efficient groupjoin
| (GROUP BY pushed down into a join). IIRC some of the
| unnesting patterns can also benefit from non-tree plans.
| surfingdino wrote:
| Doesn't MS Word internally run a Forth-like VM? I remember
| reading an article by someone who decompiled an early MS-DOS
| version of Word only to discover that there was a VM inside.
| jpfr wrote:
| They called it p-code at the time. The purpose was
| (purportedly) to simplify the porting between architectures.
|
| https://casadevall.pro/articles/2020/11/compiling-word-
| for-w...
| bitwize wrote:
| Wozniak found that 16-bit arithmetic was much easier to perform
| on a 16-bit machine. So he wrote SWEET16, a VM with its own
| bytecode, to work with 16-bit pointers in Apple's Integer
| BASIC.
|
| Similarly, the TI-99/4 series of computers had a cumbersome way
| of accessing memory, because _most_ of the RAM in the machine
| was exclusively controlled by the video chip and could only be
| accessed by poking a few registers. So the designers
| implemented a bytecode VM in ROM called GPL (Graphics
| Programming Language) that would make accesses to this video
| RAM seem like straightforward RAM accesses. TI encouraged the
| use of GPL to develop games and other applications for the
| system. Unfortunately, the fact that it was interpreted by the
| CPU, plus the fact that the CPU could only access video-chip
| memory during the horizontal and vertical retrace intervals,
| made GPL execution very slow. And since the machine's in-ROM
| BASIC interpreter was written in GPL, you now know why BASIC
| programs on the TI-99/4 and /4A crept along at the most
| leisurely of paces.
|
| So VMs were definitely a thing used to make certain kinds of
| tradeoffs, even way back in the 8-bit days when systems had
| mere kilobytes of memory. Who knows how many might be lurking
| in a modern system!
| krallja wrote:
| The unfortunate history of SWEET16 is that it was never used
| for its intended purpose -
| https://retrocomputing.stackexchange.com/a/14513/11579
| naasking wrote:
| > That really drove home for me that VMs could be anywhere and
| were often a useful step in handling a user's expressions.
|
| I think it's a special case of the fact that finite state
| machines are everywhere you look.
| atombender wrote:
| MonetDB is another database that uses a VM [1], using an
| instruction set called the MonetDB Assembly Language (MAL). I
| believe it pioneered the technique [2]. The VM is basically
| designed to express the query as vectorized functions on
| columns.
|
| [1] https://stratos.seas.harvard.edu/files/MonetDebull2012.pdf
|
| [2]
| https://www.researchgate.net/publication/220538804_Database_...
| userbinator wrote:
| _The problem of rendering a tree-of-objects as a table is
| sufficiently difficult that nobody does it, as far as I know.
| Hence, no tree-of-objects database engine provides the level of
| detail in their "EXPLAIN" output that SQLite provides._
|
| I believe Microsoft SQL Server uses an object tree internally,
| and yet its query plan output is a table:
|
| https://learn.microsoft.com/en-us/sql/t-sql/statements/set-s...
| jiggawatts wrote:
| Typically you get XML for showplan because it can represent the
| tree structure better.
| kruador wrote:
| You get XML for estimated execution plans if you use SET
| SHOWPLAN_XML ON. You get the actual execution plan XML along
| with your results if you use SET STATISTICS XML. You can see
| cached execution plans in the sys.dm_exec_query_plan dynamic
| management view.
|
| The estimated/actual execution plan feature in SQL Server
| Management Studio was changed to use XML in SQL Server 2005.
| The SQL Server team are only adding new information to the
| XML representation, not the text representation.
|
| The name "estimated" execution plan is a bit misleading. If
| you executed the statement(s) with the current indexes and
| statistics, that's how it would be executed. I think the
| "estimated" part of the name is because the number of rows
| returned from each node, and the execution time of each node,
| is an estimate.
|
| Assuming that the indexes and statistics don't change, you
| should expect the "actual" execution plan to have the same
| shape. It will just be annotated with the number of rows that
| were actually retrieved as well as the estimated numbers. SQL
| Server doesn't re-plan if it turns out the number of rows was
| greater (or fewer) than expected.
|
| As a SQLite and SQL Server user, I find that I would like
| something more detailed from SQLite than EXPLAIN QUERY PLAN
| (which just tells you the join order), but less detailed than
| the bytecode. Particularly I want to know _why_ it chose that
| join order or to use that index.
| yread wrote:
| I haven't used SQL Server much since the 2008 days but it
| used to happen quite often that the estimated and actual
| execution plan differed a lot. I think the estimated part
| also includes how many rows get returned from a subquery
| and that informs which joining algorithm is used.
|
| I second your longing for something more detailed than
| explain query plan in Sqlite though.
|
| In fact I complained about it here and some sqlite
| developer pointed me to these resources:
|
| https://news.ycombinator.com/item?id=30405275
|
| I've never actually tried yet. Queries are still too fast
| to do all that:)
| slaymaker1907 wrote:
| It can also execute a query incrementally. In fact, certain
| features rely on this behavior. "Watch live data" (in SSMS) for
| extended events is actually an infinite table-valued function.
| It's not exactly rocket science to do in a database, you just
| need to model data sources and operators as iterators.
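|
| A minimal C sketch of that iterator shape (the classic
| open/next/close "Volcano" interface; hypothetical names, not any
| engine's actual code): each operator pulls rows from its child on
| demand, so results can stream out incrementally.
|
|     #include <stdbool.h>
|     #include <stdio.h>
|
|     typedef struct Row { long a, b; } Row;
|
|     /* Every operator implements the same pull interface. */
|     typedef struct Operator {
|         void (*open)(struct Operator *self);
|         bool (*next)(struct Operator *self, Row *out);  /* false = exhausted */
|         void (*close)(struct Operator *self);
|         struct Operator *child;                         /* NULL for a scan   */
|         void *state;                                    /* operator-specific */
|     } Operator;
|
|     /* Scan: a "table" backed by a static array. */
|     typedef struct { const Row *rows; int n, pos; } ScanState;
|
|     static void scan_open(Operator *self)  { ((ScanState *)self->state)->pos = 0; }
|     static void scan_close(Operator *self) { (void)self; }
|     static bool scan_next(Operator *self, Row *out) {
|         ScanState *st = self->state;
|         if (st->pos >= st->n) return false;
|         *out = st->rows[st->pos++];
|         return true;
|     }
|
|     /* Filter: keep pulling from the child until a row matches. */
|     typedef struct { bool (*pred)(const Row *); } FilterState;
|
|     static void filter_open(Operator *self)  { self->child->open(self->child); }
|     static void filter_close(Operator *self) { self->child->close(self->child); }
|     static bool filter_next(Operator *self, Row *out) {
|         FilterState *st = self->state;
|         while (self->child->next(self->child, out))
|             if (st->pred(out)) return true;  /* emit one row, yield control */
|         return false;
|     }
|
|     static bool a_greater_than_1(const Row *r) { return r->a > 1; }
|
|     int main(void) {
|         const Row table[] = { {1, 10}, {2, 20}, {3, 30} };
|         ScanState   ss = { table, 3, 0 };
|         FilterState fs = { a_greater_than_1 };
|         Operator scan   = { scan_open,   scan_next,   scan_close,   NULL,  &ss };
|         Operator filter = { filter_open, filter_next, filter_close, &scan, &fs };
|
|         /* The caller just pulls from the root; rows stream out one at a time. */
|         Row r;
|         filter.open(&filter);
|         while (filter.next(&filter, &r))
|             printf("a=%ld b=%ld\n", r.a, r.b);
|         filter.close(&filter);
|         return 0;
|     }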
| bryancoxwell wrote:
| I don't doubt the author, but what is it that makes rendering a
| tree of objects to a table a difficult problem? Is that not
| what browsers do when they render a table element?
| IsTom wrote:
| At least they're hard to display on a terminal. Reading
| EXPLAIN ANALYZE diagrams from psql does not spark joy.
| mfoc wrote:
| Quote from the article:
|
| "A tree-of-objects representation is more difficult to
| publish in a human-readable form. The objects that comprise
| the tree tend to all be very different, and thus it is tricky
| to come up with a consistent and simple table representation
| with which to display the objects. And any such table
| representation that you do come up with would almost
| certainly have more than six columns, probably many more. The
| problem of rendering a tree-of-objects as a table is
| sufficiently difficult that nobody does it"
|
| To further elaborate on this important point.
|
| There is an 'impedance mismatch' (conceptual difficulty
| mapping between the two logic models) between the tree
| abstract data type and the table abstract data type.
| Specifically, there are four key differences between the
| simple table data structure and the more complex tree data
| structure that makes mapping between them a non-trivial
| operation.
|
| Hierarchical: A table has a flat representation; a tree has a
| hierarchical representation.
|
| Order: The order of the rows in a table typically does not
| matter (they may have a unique rowid). The order of branches
| (nodes) and leaves in a tree is important, and the ordering
| itself is an encoding of valuable information.
|
| Semi-structured: A table has a fixed structure (rows
| multiplied by columns). A tree has a flexible structure - an
| arbitrary combination of branches (internal nodes) and leaves
| (terminal nodes). Semi-structured data has a structure that
| may not necessarily be known in advance; the tree has an
| irregular and variable formation and may have branches
| with missing or supplementary nodes.
|
| Meta-data: The information describing the meaning of the data
| in a table is typically stored separately from the table -
| consequently a schema is often mandatory. A schema is
| optional for a tree abstract data type.
|
| As an aside, I have been visiting hacker news almost daily
| since 2010. This is my first comment on hacker news. I want
| to say thank you to the community for the many valuable
| insights over the years.
| galaxyLogic wrote:
| Congratulations for your first comment!
|
| What I want to do is represent my tree-structure as a table
| in a relational database and then be able to efficiently
| get the tree structure back by transforming the table-
| representation back into a tree.
|
| Further I would like to do that in plain standard SQL. This
| must be a common problem, any documented solutions out
| there?
| mfoc wrote:
| Thank you.
|
| Great question - you have touched on the key difference
| between a labeling scheme and an encoding scheme for tree
| data structures.
|
| As mentioned previously, the tree is an abstract data
| type, that is to say, a conceptual model that defines the
| nodes in a tree, and their relationships.
|
| To be able to evaluate an expression that processes a
| tree, one needs a labeling scheme. The purpose of a
| labeling scheme is to assign unique labels to each node
| in the tree and these labels must facilitate node
| ordering, and often (but not always) a labeling scheme
| will permit the reconstruction of the tree structure.
|
| However, no labeling scheme captures the node type, names
| or the content stored at the nodes. For that we need an
| encoding scheme. An encoding scheme is constructed upon a
| labeling scheme and augments it with the information
| necessary to fully represent the tree in a table-like
| data structure. In answer to your question, it also
| permits the full transformation from the table
| representation to the original tree structure.
|
| Thus, it sounds like what you are looking for is an
| encoding scheme.
|
| There are many different labeling schemes out there for
| tree structure data, and virtually all of them can be
| augmented with additional information to construct a
| complete encoding scheme. Concerning documented solutions
| - I have not been active in this space for a number of
| years, so off the bat - I don't have a recommended
| documented solution to point you to.
|
| But to help, I will put a link to my PhD thesis [1] which
| gives a more in-depth understanding of labeling schemes
| and encoding schemes for tree structured data with an
| example of a simple implementation of an encoding scheme
| enabling the full transformation from the table
| representation to the original tree structure (pages 5-9)
| and a survey of the advantages and disadvantages of
| existing labeling schemes concerning their usefulness to
| be part of an encoding scheme you could use in your
| solution (see chapter 2)
|
| Caveat 1: My thesis was written in the context of
| updating dynamic (XML) trees but it addresses the
| transformation between tree and table data structures.
|
| Caveat 2: The thesis was written 11 years ago, but every
| now and then I have kept in touch with the latest
| developments in the area, and to my knowledge, there have
| been no major developments since.
|
| I hope it helps.
|
| [1]: https://doras.dcu.ie/19316/
| bryancoxwell wrote:
| Hey, this is excellent. Thanks for the thorough response.
| winternewt wrote:
| Same with PostgreSQL. You can choose between text output (one
| table row per line), XML or JSON. But none of them are ordered
| according to a strict execution order like bytecode would be.
| To me that makes perfect sense though because it represents
| what can be parallelized in the plan. The bytecode
| representation in SQLite seems to have the inherent limitation
| that it is single threaded.
| yayr wrote:
| Actually, SAP HANA has a nice tool called PlanViz for that in
| Eclipse. A google image search gives you a good impression how
| that works. There are also some blogs about it, e.g.
| https://community.sap.com/t5/enterprise-resource-planning-bl...
| Cthulhu_ wrote:
| I love how the example SQL query uses abbreviated columns
| that sound like runes:
|
|     select
|       -- count(*)
|       a.mandt, b.rldnr, a.bukrs, a.gjahr, a.belnr, b.docln, a.buzei
|       -- *
|     from BSEG as a
|     inner join ACDOCA as b
|       on b.rclnt = a.mandt
|       and b.rbukrs = a.bukrs
|       and b.gjahr = a.gjahr
|       and b.belnr = a.belnr
|     where a.mandt = '715'
|       and b.rldnr = '0L'
|       and b.docln = '000001'
|       and a.buzei = '001'
|       and a.gjahr = '2018'
|       --and a.gjahr = '2017'
|       --and a.gjahr = '2019'
|     --order by a.mandt, b.rldnr, a.bukrs, a.gjahr, a.belnr, b.docln, a.buzei
|     limit 200;
|
| Just realised this is probably from German too, "jahr" being
| year.
| yayr wrote:
| yeah, these accounting table field names have not changed
| in around 30 years, which is not necessarily a bad
| thing...
|
| for a human readable view of this you can look e.g. here
| https://docsfortec.com/sap/S4/tables/ACDOCA
| deanishe wrote:
| They're a bit of a bilingual mess.
|
| Did they start out with German table names and decide to
| give new tables English names at some point?
| zachmu wrote:
| Dolt's EXPLAIN output prints the execution tree directly. E.g.:
|
|     explain select * from xy join uv on (x = u and u > 0) where u < 2;
|     Project
|      +- columns: [xy.x:2!null, xy.y:3, uv.u:0!null, uv.v:1]
|      +- LookupJoin
|          +- IndexedTableAccess(uv)
|          |   +- index: [uv.u]
|          |   +- static: [{(0, 2)}]
|          |   +- colSet: (3,4)
|          |   +- tableId: 2
|          |   +- Table
|          |       +- name: uv
|          |       +- columns: [u v]
|          +- IndexedTableAccess(xy)
|              +- index: [xy.x]
|              +- keys: [uv.u:0!null]
|              +- colSet: (1,2)
|              +- tableId: 1
|              +- Table
|                  +- name: xy
|                  +- columns: [x y]
| Dwedit wrote:
| Running bytecode is much lower latency than compiling into native
| code. If you're not bottlenecked by running the bytecode (as
| opposed to memory or disk speed), you don't really have to JIT it
| any further into native code.
| winrid wrote:
| Yeah, but nobody is seriously considering that unless maybe for
| huge prepared statements. The argument is usually bytecode vs
| parser and associated data structures.
| _flux wrote:
| PostgreSQL is not only considering it, they're doing it!
| https://www.postgresql.org/docs/current/jit-reason.html
|
| I don't have personal experience on it, but I've read that in
| practice it's not worth the effort--at least not yet.
| Apparently there are some issues with it and it barely speeds
| up queries (except perhaps certain ones?). I imagine this
| could be in big part because LLVM is not really a good fit
| for JIT where you want to spend very little time to do the
| compilation itself.
| masklinn wrote:
| Yeah my experience of the pg jit is mostly negative, it's
| quite slow and it has a hard time estimating the cost of
| interpretation vs. compilation + compiled execution, so more
| often than not it's actively detrimental. It might fare
| better if you make heavy use of a limited number of
| prepared statements.
| winrid wrote:
| Interesting, thanks. I imagine they will have a query
| planner for the query planner to determine to JIT or not :)
| rhdunn wrote:
| Which is why JavaScript engines (and JIT compilers for other
| languages) are typically designed to:
|
| 1. start interpreting the code once it has been parsed, and
| start jitting a function being called;
|
| 2. generate naive bytecode for the function that generates
| native code equivalents of the run actions of the AST
| (including some fast-path code for simple cases such as adding
| two 32-bit integers, and falling back to function calls to
| perform the add on more complex types);
|
| 3. generate more optimal code for sequences of instructions in
| the background, such as entire for/while loops, then patch in
| calls to those fast-path versions when ready.
|
| That way you can start running the code immediately after
| parsing it, and can switch to the faster versions when they are
| ready if the code takes longer to run in the slower version.
| gpderetta wrote:
| Depends how much compilation you want to do. For example
| converting threaded code into native direct jumps need not be
| more expensive than plain bytecode. Of course the gains are
| also limited.
| MrBuddyCasino wrote:
| I was surprised the text didn't mention one major difference
| between the byte code approach vs AST: coupling.
|
| When your database engine runs in-process, there is no
| possibility of the server and the client library having diverging
| versions. But this is common with traditional databases.
|
| Once you bake in the execution steps ("how to execute") instead
| of describing the query via AST ("what to execute"), an
| important part of the execution logic now lives in the driver.
|
| I suspect this complicates version upgrades and bugfixes, because
| the database is less self-contained.
|
| Not an issue for sqlite, potentially disastrous for mysql.
| tadfisher wrote:
| Do clients typically communicate with the server in some AST
| representation instead of, well, SQL? I'd be surprised if that
| much parsing/planning happens on the client.
| MrBuddyCasino wrote:
| Since prepared statements are created by the driver, I was
| assuming this was the case - but I might be completely wrong
| here.
| _flux wrote:
| Converting a SELECT to a PREPARE does not really require
| parsing the complete query--or even if it did, some small
| concessions for this could be implemented in the line
| protocol to enable the server to build the prepared statement
| out of the client's query.
|
| I don't believe *any* SQL client library actually tries to
| parse e.g. PostgreSQL itself at any point of processing.
| You can read what the PostgreSQL protocol does here:
| https://www.postgresql.org/docs/current/protocol-
| flow.html#P...
| layer8 wrote:
| No, prepared statements are server-side.
| evanelias wrote:
| > an important part of the execution logic now lives in the
| driver
|
| > potentially disastrous for mysql
|
| This is completely incorrect. The MySQL client side isn't
| responsible for execution steps and this isn't part of its wire
| protocol at all.
|
| With prepared statements, the server parses the statement and
| gives the client back a handle, which is basically just a
| unique identifier that the client can use to invoke the
| prepared statement.
| Sesse__ wrote:
| > Not an issue for sqlite, potentially disastrous for mysql.
|
| MySQL _completely_ switched execution paradigm through the
| course of a couple of 8.0.x sub-releases, generally without
| anyone noticing.
| megadal wrote:
| Perhaps my understanding is off, but I am pretty sure parsing and
| translating SQL into bytecode still involves an AST.
|
| Just that query processing itself is done from the bytecode
| (produced presumably from an AST or something similar) rather
| than directly from the AST itself.
|
| If I'm right I can't really see how this performs better unless
| you're excluding the parsing step from benchmarks
| lifthrasiir wrote:
| You don't need one, Lua is another example where no AST is ever
| generated. In some sense the resulting bytecode closely
| corresponds to the AST that would have been generated though.
| megadal wrote:
| Genuinely asking as parsing without an AST is something I've
| never seen explained: How do you go from source code to
| bytecode without an AST?
|
| Isn't the bytecode just a flattened representation of an AST
| obtained by some sort of tree traversal?
|
| This seems to imply an AST is involved in the generation of
| the bytecode
| lifthrasiir wrote:
| Have you used any parser generator like yacc/bison? They
| have "actions", which are arbitrary codes that will be
| executed when some grammar production is detected. For
| example, `expr ::= expr mulop expr { some code }` will
| execute `some code` when a multiplicative expression is
| detected, where intermediate results from two `expr`s and
| `mulop` are available to that code. This concept of actions
| generally applies to all sort of parsers, not just
| generated parsers.
|
| Those actions would typically allocate and build (partial)
| ASTs, but you can do anything with them. You can for
| example directly evaluate the subexpression if your grammar
| is simple enough. Likewise bytecodes can be generated on
| the fly; the only concern here is a backward reference,
| which has to be patched after the whole expression block
| gets generated, but otherwise you don't have to build any
| tree-like structure here. (Most practical parsers only need
| a stack to function.)
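|
| For a concrete illustration (a hand-written recursive-descent
| parser here rather than a generated one, but the actions work the
| same way), a minimal C sketch that compiles 2+3*4 into
| stack-machine bytecode in a single pass, never building a tree,
| and then runs it; all names and opcodes are made up for the
| example.
|
|     #include <stdio.h>
|
|     /* Bytecode: OP_PUSH is followed by one literal byte; the rest
|        take no operand. */
|     enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };
|
|     static const char *src;            /* cursor into the source text */
|     static unsigned char code[256];    /* emitted bytecode             */
|     static int pc;
|
|     static void emit(int b) { code[pc++] = (unsigned char)b; }
|
|     /* factor ::= digit */
|     static void factor(void) {
|         emit(OP_PUSH);
|         emit(*src++ - '0');            /* single-digit literals keep it tiny */
|     }
|
|     /* term ::= factor { '*' factor } -- operands first, operator after */
|     static void term(void) {
|         factor();
|         while (*src == '*') { src++; factor(); emit(OP_MUL); }
|     }
|
|     /* expr ::= term { '+' term } */
|     static void expr(void) {
|         term();
|         while (*src == '+') { src++; term(); emit(OP_ADD); }
|     }
|
|     int main(void) {
|         src = "2+3*4";
|         expr();
|         emit(OP_HALT);
|
|         /* A tiny stack VM runs the bytecode the parse just produced. */
|         int stack[32], sp = 0;
|         for (int ip = 0; code[ip] != OP_HALT; ip++) {
|             switch (code[ip]) {
|             case OP_PUSH: stack[sp++] = code[++ip];       break;
|             case OP_ADD:  sp--; stack[sp-1] += stack[sp]; break;
|             case OP_MUL:  sp--; stack[sp-1] *= stack[sp]; break;
|             }
|         }
|         printf("%d\n", stack[0]);      /* prints 14 */
|         return 0;
|     }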
| kazinator wrote:
| How you go from source to target code without an AST is
| that the syntax-directed translation step in your
| implementation (that which would build the AST) doesn't
| bother with that and just builds the output code instead.
| The extra traversal is skipped; replaced by the original
| parser's traversal of the raw syntax.
|
| E.g. pseudo-Yacc rules for compiling the while loop in a
| C-like notation.
|
|     while_loop : WHILE '(' expr ')' statement
|                  {
|                    let back = get_label();
|                    let fwd = get_label();
|                    let expr = $3;  // code for expr recursively generated
|                    let stmt = $5;  // code for statement, recursively generated
|                    $$ = make_code(stmt.reg(),
|                                   `$back:\n`
|                                   `${expr.code()}\n`
|                                   `BF ${expr.reg}, $fwd\n`  // branch if false
|                                   `${stmt.code()}\n`
|                                   `JMP $back\n`
|                                   `$fwd:\n`)
|                  }
|                ;
|
| Every code fragment coming out of a rule has .code() and
| .reg(): the generated code, which is just a string, and the
| output register where it leaves its value. Such
| representational details are decided by the compiler
| writer.
|
| The while statement produces no value, so we just borrow
| the statement's .reg() as a dummy in the call to make_code;
| our rule is that every code fragment has an output
| register, whether it produces a value or not.
|
| When the LALR(1) parser reduces this while loop rule to the
| while_loop grammar symbol, the expr and statements have
| already been processed; so the rule action has ready access
| to the code objects for them. We just synthesize a new code
| object. We grab a pair of labels that we need for the
| forward and back jump.
|
| I'm assuming we have a vaguely JS-like programming language
| being used for the grammar rule actions, in which we have
| template strings with interpolation, and adjacent strings
| get merged into one. The bytecode assembly is line
| oriented, so we have \n breaks.
|
| One possible kind of expression is a simple integer
| constant, INTEGER:
|
|     expr : INTEGER
|            {
|              let reg = allocate_reg();
|              let val = $1
|              $$ = make_code(reg, `LOAD $reg, #$val\n`)
|            }
|
| One possible statement is an empty statement denoted by
| empty curly braces:
|
|     statement : '{' '}'
|                 {
|                   $$ = make_code(R0,   // dedicated zero register
|                                  "");  // no-code solution
|                 }
|
| So then when we have while (1) { } we might get R1
| allocated in the expr rule, like this:
|
|     LOAD R1, #1\n   ; output register is R1
|
| then in the while loop, things get put together like this:
|
|     L0:\n           ; back label
|     LOAD R1, #1\n   ; code for expr
|     BF R1, L1\n     ; branch if false to L1
|                     ; no code came from empty statement
|     JMP L0          ; back branch
|     L1:\n           ; fwd label
| thewakalix wrote:
| See also: DOM versus streaming.
| kccqzy wrote:
| Another way to think about this is to imagine you are
| working in a lazy-by-default language like Haskell. The
| code might seem to build up an AST and then evaluate it,
| but (disregarding parse errors) the AST is never
| materialized in memory at once: the tree might only have
| one level, with its children represented by thunks that
| continue to parse more of the source code. (If you are
| unfamiliar with Haskell laziness, imagine that the so-
| called tree has children that are _functions_ to produce
| the next level of the tree.) Of course you will need a
| carefully designed bytecode in order to generate bytecode
| from AST where the children is not yet known.
| ufo wrote:
| It's as if they generated an AST and then immediately
| converted it to bytecode, all at once.
|
| The key restriction is that you must be able to generate
| the bytecode in a single pass over the source code. For
| example, the compilation of a function at the top of the
| file must not depend on declarations further down the file.
| vidarh wrote:
| Depending on the form the dependencies takes you can
| handle that too. E.g. if it's "just" about knowing
| addresses, then just building up a list of relocation
| records where you need to patch in addresses as they
| become known works just fine. Of course the more work
| like that you have to do, the less you save over just
| building an AST anyway.
| lolinder wrote:
| The second half of Crafting Interpreters [0] does this--
| rather than outputting the AST, the bytecode interpreter
| outputs bytecode directly. Essentially, the process of
| creating the AST is already very similar to the tree
| traversal you'll do to unpack it (recursive descent parsing
| _is_ a traversal of the tree, just in the call stack rather
| than over a real data structure), so if you don't need the
| AST you can skip it and just start outputting bytecode as
| if you were already walking the tree.
|
| [0] http://craftinginterpreters.com/
| vidarh wrote:
| Consider that if the bytecode is just a flattened
| representation, then instead of generating and traversing
| the tree, you can just traverse what would have been turned
| into the tree in the same traversal order directly.
|
| E.g. imagine you're in a leaf production for some simple
| grammar for an operator that can only take integers, w/all
| error handling ignored:
|
|     parse_fooexpr() {
|       left = parse_int
|       emit_int(left)
|       expect_token(FOO_OP)
|       right = parse_int
|       emit_int(right)
|       emit_foo_op()
|     }
|
| Whether emit_int outputs a "push <someint>" VM instruction
| to a bytecode buffer, or constructs an IntNode and pushes
| it to an argument stack, and whether "emit_foo_op" emits a
| "foo_op" bytecode instruction or pops the argument stack
| and constructs a FooOpNode(...argument stack values) AST
| node is an implementation detail in terms of the parser.
|
| Wirth's compilers usually didn't use an AST, and his book
| on compiler construction contains a thorough walkthrough on
| generating code as you parse, if you want to see it
| explained in a lot more detail and w/tricks to generate
| better code than the most naive approaches will.
|
| The difficulty with this approach comes when you want to
| apply optimisations or where it's inconvenient to generate
| the code in parse-order because the language tries to be smart
| about what feels better to write, but you can get pretty
| far with this method.
| ksherlock wrote:
| Take a look at the evergreen Let's Build a Compiler, by
| Jack Crenshaw
|
| https://compilers.iecc.com/crenshaw/
|
| Chapter 2 and 3, expression parsing, should give you the
| idea. Instead of building the AST, you build code. So
| you're cutting out the middle man (AST).
| UK-AL wrote:
| Parse Tree and AST are slightly different. A parse tree can
| be implicit. A recursive descent parser can explore the
| parse tree without there being an explicit parse tree data
| structure.
| miloignis wrote:
| An AST being generated as an intermediate step is mentioned in
| the article, at least in passing in section 2.4.
|
| The reason bytecode is generally faster (not just for SQL, but
| in most interpreted languages you may use (Python, etc)) is
| that walking the AST is relatively expensive and doesn't treat
| caches nicely. Bytecode is generally located next to each other
| in memory, and you can make the bytecode fetch/dispatch pretty
| tight, relative to indirect function calls on an object in an
| AST.
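|
| To make the layout difference concrete, a rough sketch of the two
| representations (hypothetical, heavily simplified types): the
| tree spreads nodes across the heap and every step is a pointer
| dereference plus an indirect call, while the bytecode is one flat
| array that the dispatch loop streams through.
|
|     #include <stdint.h>
|
|     /* Tree-of-objects: each node is a separate heap allocation;
|        evaluation follows child pointers, usually with an indirect
|        call per node. */
|     typedef struct Expr {
|         int kind;                   /* ADD, MUL, COLUMN, LITERAL, ...    */
|         struct Expr *left, *right;  /* heap pointers -> cache misses     */
|         int64_t value;              /* used by LITERAL / COLUMN nodes    */
|     } Expr;
|
|     /* Bytecode: the same program as one contiguous array of small
|        fixed-size instructions, so the next opcode is almost always
|        already in cache. */
|     typedef struct {
|         uint8_t op;                 /* ADD, MUL, LOAD_COLUMN, LOAD_CONST */
|         uint8_t dst, src1, src2;    /* register numbers                  */
|         int64_t imm;                /* literal / column index            */
|     } Insn;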
| rhdunn wrote:
| Another advantage to that is that it avoids the branching of
| the while loop in the interpreter that iterates over the AST,
| providing better instruction pipelining by having all the
| run code next to each other.
|
| The downside -- especially for dynamic languages like
| JavaScript -- is that you need to keep all of the type checks
| and fast-paths in the code, resulting in larger code blocks.
| With more type analysis you could group fast-path
| instructions together (e.g. within a while or for loop) but
| that takes time, which is typically why a JIT engine uses
| multiple passes -- generate the slower machine code first,
| then improve the fast-path blocks for code that is long
| running.
| epcoa wrote:
| > Another advantage to that is that it avoids the branching
| of the while loop in the interpreter that iterates over the
| AST
|
| Huh? A bytecode interpreter still ends up branching based
| on the decoded value of the byte codes. The VM bytecodes
| are not directly executed by the CPU (at least not usually,
| Jazelle and some older P-code stuff being rare exceptions).
|
| This is what it looks like for the example being discussed
| ("a massive switch statement"): https://github.com/sqlite/s
| qlite/blob/b11daa50f9ea11c332bb59...
|
| Perhaps confusing bytecode generation and interpretation
| with JIT? JIT is often paired with a bytecode but neither
| is dependent on the other.
| rhdunn wrote:
| The parent mentioned avoiding the dispatch of a virtual
| function call used to evaluate the AST/bytecode so I was
| assuming it was a minimal JIT instead of running the
| resulting bytecode in an interpretive loop.
|
| But yes, if you are interpreting the intermediate VM
| bytecode you will still have a switch or dispatch branch
| when evaluating each instruction.
| katzenversteher wrote:
| Quote from the article:
|
| "The bytecode generated by SQLite is usually smaller than the
| corresponding AST coming out of the parser. During initial
| processing of SQL text (during the call to sqlite3_prepare()
| and similar) both the AST and the bytecode exist in memory at
| the same time and so more memory is used then. But that is a
| transient state. The AST is quickly discarded and its memory
| recycled [...]"
| wolf550e wrote:
| The page is the result of this exchange on Twitter:
|
| https://twitter.com/gorilla0513/status/1784756577465200740
|
| I was surprised to receive a reply from you, the author. Thank
| you :) Since I'm a novice with both compilers and databases,
| could you tell me what the advantages and disadvantages are of
| using a VM with SQLite?
|
| https://twitter.com/DRichardHipp/status/1784783482788413491
|
| It is difficult to summarize the advantages and disadvantages of
| byte-code versus AST for SQL in a tweet. I need to write a new
| page on this topic for the SQLite documentation. Please remind me
| if something does not appear in about a week.
| paulddraper wrote:
| There are three approaches:
|
| 1. interpreted code
|
| 2. compiled then interpreted bytecode
|
| 3. compiled machine code
|
| The further up, the simpler.
|
| The further down, the faster.
| branko_d wrote:
| > 2. compiled then interpreted bytecode
|
| You can also compile to bytecode (when building), and then
| compile that bytecode to the machine code (when running).
| That way, you can take advantage of the exact processor that
| is running your program.
|
| This is the approach taken by .NET and Java, and I presume
| most other runtime environments that use bytecode.
| gpderetta wrote:
| There is also the option of native code generation from
| bytecode at install time as opposed of runtime.
| usrusr wrote:
| Aka missing out on all the delicious profiling
| information that approaches with later translation enjoy.
| There's no simple best answer.
| 0x457 wrote:
| I'm probably wrong, but I always thought for platforms
| like .Net and JVM, AoT delivers the same performance as
| JIT in most cases. Since a lot of information is
| available before runtime, unlike JS where the VM always needs
| to be ready to ditch optimized byte code and go back to the
| whiteboard.
| neonsunset wrote:
| There are a few concerns here that make it confusing for
| the general public to reason about AOT vs JIT
| performance.
|
| Different JIT compilers have different levels of
| sophistication, being able to dynamically profile code at
| runtime or otherwise, and the range of applied
| optimizations of this type may also vary significantly.
|
| Then, AOT imposes different restrictions in the form of
| what types of modularity the language/runtime wants to
| support (Swift ABI vs no AOT ABI in .NET NativeAOT or
| GraalVM Native Image), what kind of operations modifying
| type system are allowed, if any, and how the binary is
| produced.
|
| In more trivial implementations, where JIT does not
| perform any dynamic recompilation and profile based
| optimizations, the difference might come down to simply
| pre-JITing the code, with the same quality of codegen. Or
| the JIT may be sophisticated but the AOT mode might still be a
| plain pre-JIT which makes dynamic profiling optimizations
| off-limits (which was _historically_ the case for .NET's
| ReadyToRun and partially NGEN; IIRC the JVM situation was
| similar up until GraalVM Native Image).
|
| In more advanced implementations, there are many concerns
| which might be at odds with each other: JIT compilers
| have to maintain high throughput in order to be
| productive but may be able to instrument intermediate
| compilations to gather profile data, AOT compilers, in
| the absence of static profile data (the general case),
| have to make a lot more assumptions about the code, but
| can reason about compiled code statically, assuming such
| compilation comes with "frozen world" type of packaging
| (rather than delegating dynamic parts to emulation with
| interpreter).
|
| And then there are smaller details - JIT may not be able
| to ever emit pure direct calls for user code where a jump
| is performed to an immediate encoded in the machine code,
| because JIT may have to retain the ability to patch and
| backpatch the callsites, should it need to de/reoptimize.
| Instead, the function addresses may be stored in the
| memory, and those locations are encoded instead where
| calls are emitted in the form of dereference of a
| function pointer from a static address and then a jump to
| it.
|
| JIT may also be able to embed the values of static
| readonly fields as JIT constants in codegen, which .NET
| does aggressively, but is unable to pre-initialize such
| values by interpreting static initialization at compile
| time in such an aggressive way that AOT can (constexpr
| style).
|
| So in general, a lot of it comes down to offering a
| different performance profile. A lot of the beliefs in
| AOT performance stem from the fact lower-level languages
| rely on it, and the compilers offering it are very
| advanced (GCC and Clang mainly), which can expend very
| long compilation times on hundreds of optimization passes
| JIT compilers cannot. But otherwise, JIT compilers can
| and do compete, just a lot of modern day advanced
| implementations are held back by the code that they are
| used for, in particular in Java land where OpenJDK is
| really, really good, but happens to be hampered by being
| targeted by Java and JVM bytecode abstraction, which is
| not as much of a limitation in C# that can trade blows
| with C++ and Rust quite confidently the moment the
| abstraction types match (when all use templates/struct
| generics for example).
|
| More on AOT and JIT optimizations (in the case of .NET):
|
| - https://devblogs.microsoft.com/dotnet/performance-
| improvemen...
|
| - https://migeel.sk/blog/2023/11/22/top-3-whole-program-
| optimi...
|
| If someone has similar authoritative content on what
| GraalVM Native Image does - please post it.
| DeathArrow wrote:
| There is also JIT bytecode.
| paulddraper wrote:
| Yes, the timing of that compilation of bytecode and machine
| code can be either AOT or JIT.
|
| For example, Java/JVM compiles bytecode AOT and compiles
| machine code JIT.
| tanelpoder wrote:
| 4. lots of small building blocks of static machine code
| precompiled/shipped with DB software binary, later iterated &
| looped through based on the query plan the optimizer came up
| with. Oracle does this with their columnar/vector/SIMD
| processing In-Memory Database option (it's not like LLVM as
| it doesn't compile/link/rearrange the existing binary
| building blocks, just jumps & loops through the existing ones
| in the required order)
|
| Edit: It's worth clarifying that the _entire_ codebase does
| _not_ run like that, not even the entire plan tree - just the
| scans and tight vectorized aggregation /join loops on the
| columns/data ranges that happen to be held in RAM in a
| columnar format.
| FridgeSeal wrote:
| That's a really cool idea! Is there any writeups or
| articles about this in more detail?
| UncleEntity wrote:
| Sounds like 'copy & patch compilation' is a close cousin.
| lolinder wrote:
| Isn't what you're describing just an interpreter, bytecode
| or otherwise? An interpreter walks through a data structure
| in order to determine which bits of precompiled code to
| execute in which order and with which arguments. In a
| bytecode interpreter the data structure is a sequential
| array, in a plain interpreter it's something more
| complicated like an AST.
|
| Is this just the same as "interpret the query plan" or am I
| missing something?
| paulddraper wrote:
| > lots of small building blocks of static machine code
| precompiled/shipped with DB software binary
|
| You might call those interpreter operations.
| GartzenDeHaes wrote:
| You can also parse the code into an augmented syntax tree or
| code DOM and then directly interpret that. This approach
| eliminates the intermediate code and bytecode machine at the
| cost of slower interpretation. It's slower due to memory
| cache and branch prediction issues rather than algorithmic
| ones.
| fancyfredbot wrote:
| For simple functions which are not called repeatedly you have
| to invert this list - interpreted code is fastest and
| compiling the code is slowest.
|
| The compiled code would still be faster if you excluded the
| compilation time. It's just that the overhead of compiling it
| is sometimes higher than the benefit.
| galaxyLogic wrote:
| I wonder, in actual business scenarios, isn't the SQL fully
| known before an app goes into production? So couldn't it
| make sense to compile it all the way down?
|
| There are situations where the analyst enters SQL into the
| computer interactively. But in those cases the overhead of
| compiling does not really matter since this is infrequently
| done, and there is only a single user asking for the
| operation of running the SQL.
| SQLite wrote:
| Performance analysis indicates that SQLite spends very little
| time doing bytecode decoding and dispatch. Most CPU cycles
| are consumed in walking B-Trees, doing value comparisons, and
| decoding records - all of which happens in compiled C code.
| Bytecode dispatch is using less than 3% of the total CPU
| time, according to my measurements.
|
| So at least in the case of SQLite, compiling all the way down
| to machine code might provide a performance boost of 3% or less.
| That's not very much, considering the size, complexity, and
| portability costs involved.
|
| A key point to keep in mind is that SQLite bytecodes tend to
| be very high-level (create a B-Tree cursor, position a B-Tree
| cursor, extract a specific column from a record, etc). You
| might have had prior experience with bytecode that is lower
| level (add two numbers, move a value from one memory location
| to another, jump if the result is non-zero, etc). The ratio
| of bytecode overhead to actual work done is much higher with
| low-level bytecode. SQLite also does these kinds of low-level
| bytecodes, but most of its time is spent inside of high-level
| bytecodes, where the bytecode overhead is negligible compared
| to the amount of work being done.
| blacklion wrote:
| Looks like SQLite bytecode is similar to IBM RPG language
| from "IBM i" platform, which is used directly, without
| translation from SQL :)
|
| Edit: PRG->RPG.
| wglb wrote:
| I'm curious. Would you have a pointer to the
| documentation of PRG language?
| blacklion wrote:
| Oooops, it is IBM RPG, not PRG! My bad!
|
| There are some links in Wikipedia.
|
| I never used it, but only read a big thread about SQL vs
| RPG on a Russian-speaking forum, and there were a ton of
| examples from a person who works with the IBM i platform in
| some big bank.
| bytecodes: open table, move cursor after record with key
| value "X", get some fields, update some field, plus
| loops, if's, basic arithmetic, etc.
| kayodelycaon wrote:
| For those that aren't familiar, IBM i is ye olde A/S 400.
|
| To be fair, RPG has been continuously updated and the
| version I've seen has Java integration.
|
| It's somewhat amusing that a single developer using
| Python can run circles around entire teams of RPG
| programmers. Technology marches on. :)
| beeboobaa3 wrote:
| > Bytecode dispatch is using less than 3% of the total CPU
| time, according to my measurements
|
| > compiling all the way down to machine code might provide
| a performance boost 3% or less
|
| This logic doesn't seem sound? Because the application now
| spends 3% on bytecode dispatch doesn't tell us anything
| about how long it would instead spend on e.g. interpreting
| SQL.
| asimpletune wrote:
| He's saying in the best case scenario that 3% would go to
| 0. Therefore the reality would probably be even less.
| vlovich123 wrote:
| Unfortunately you can't do performance analysis this way
| but I think the overall point that it's a fraction
| probably still stands as you'd expect the I/O work to
| dominate.
|
| The reason you can't do the analysis the way that you
| explicitly stated (which fairly is what was implied) is
| that when you lower the code to machine code you
| typically get rid of branches in the execution path.
| Since branches slow down execution by a disproportionate
| amount vs how long they themselves take, it's easy to get
| a >3% boost getting rid of code paths that seem like
| there's only 3% of room.
|
| What the author also failed to mention is that they've
| gone on many optimization hunts eking out less than a
| percent here or there to get a cumulative boost, so 3%
| isn't anything to sneeze at.
|
| That being said, the broader point which is valid is that
| the maintenance isn't worth the performance profile that
| SQLite targets not to mention that JITs open up an attack
| vector for security exploits. So the cost vs benefit is
| squarely in the interpreted side for now.
|
| Edit: Forgot to mention that performance tuning a JIT to
| work well is also hard - for small queries you'll spend
| more time compiling the byte code than you would just
| executing. That's why all the big JS engines do a tiered
| approach where each tier optimizes a bit more.
| temporarely wrote:
| A while back I was musing whether it was possible to come up with
| something resembling the instruction set for a CPU for an
| abstract relational-engine. Is this basically what SQLite
| is doing with bytecodes?
| rkrzr wrote:
| No. The SQLite bytecode is much higher-level than the
| instruction set of a CPU.
| titzer wrote:
| Part of the benefit of compiling bytecode (or anything) is
| specializing code to the context (types, values, etc) in
| which it appears. While I don't doubt your analysis, it
| could be the case that compiled C code in question is full
| of branches that can be folded away when specialized to the
| context of the query, such as the structure of the rows,
| the type and values of columns, etc.
|
| Basically all of what you are saying about high-level
| bytecodes applies to dynamic languages, too. But they
| benefit highly from specializing each bytecode given static
| and dynamic context, and shortcutting dataflow through
| local variables.
| paulddraper wrote:
| > A key point to keep in mind is that SQLite bytecodes tend
| to be very high-level
|
| That's right. "Bytecode" is a spectrum. SQLite's bytecode
| is higher-level than e.g. WASM, JVM, Python, etc. (Notably,
| because the original source code is higher-level.)
| wolf550e wrote:
| It is possible to construct a worst case scenario for
| bytecode execution, for example very complex expressions in
| the WHERE and/or SELECT clauses that compute values, and a
| query plan that performs a full table scan over a table
| with say 100 million rows that is cached in RAM (or maybe
| use generate_series, whatever works best).
|
| Computing the expressions should dominate execution time,
| right?
|
| Then, to compare against the best possible case, we can
| write a custom C program that uses sqlite internals to
| perform the same task (full scan of the table, extract
| values from row) and does not use the bytecode VM and
| computes the complex expressions in regular C code (e.g. a
| function that accepts floats and returns a float or
| whatever).
|
| Then comparing the two implementations will tell us how
| much faster sqlite can be if it had a "perfect" JIT.
| cryptonector wrote:
| > Most CPU cycles are consumed in walking B-Trees, doing
| value comparisons, and _decoding records_
|
| Emphasis added. This is because of SQLite3's varint
| encoding method for numbers. Performance-wise it was
| probably a mistake, though it's a form of compression,
| which might have paid off in terms of space.
|
| (I seem to remember seeing something, possibly by you,
| about this before.)
|
| I wonder if it would be possible to replace the varint
| encoding... Yes, it'd be a compatibility break in that
| older SQLite3s couldn't open newer DBs.
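|
| For reference, a sketch of the decode step in question, following
| the documented varint format (1-9 bytes, big-endian, 7 payload
| bits per byte with the high bit as a continuation flag, and all 8
| bits of the ninth byte). This is a simplified illustration, not
| SQLite's actual code.
|
|     #include <stdint.h>
|     #include <stdio.h>
|
|     /* Decode one SQLite-style varint.  Up to eight bytes contribute
|        7 bits each while their high bit is set; a ninth byte, if
|        reached, contributes all 8 bits.  Returns bytes consumed. */
|     static int varint_decode(const uint8_t *p, uint64_t *out) {
|         uint64_t v = 0;
|         for (int i = 0; i < 8; i++) {
|             v = (v << 7) | (p[i] & 0x7f);
|             if ((p[i] & 0x80) == 0) {   /* high bit clear: last byte */
|                 *out = v;
|                 return i + 1;
|             }
|         }
|         *out = (v << 8) | p[8];         /* ninth byte: all 8 bits */
|         return 9;
|     }
|
|     int main(void) {
|         /* 0x81 0x23 encodes (1 << 7) | 0x23 = 163. */
|         const uint8_t buf[] = { 0x81, 0x23 };
|         uint64_t v;
|         int n = varint_decode(buf, &v);
|         printf("decoded %llu in %d bytes\n", (unsigned long long)v, n);
|         return 0;
|     }
|
| A byte-at-a-time loop like this runs for the serial types and
| integer values in each record, which is part of why record
| decoding, rather than opcode dispatch, shows up in the profile.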
| AtlasBarfed wrote:
| optimizing runtimes (usually bytecode) can beat statically
| compiled code, because they can profile the code over time,
| like the JVM.
|
| ... which isn't going to close the cold startup execution
| gap, which all the benchmarks/lies/benchmarks will measure,
| but it is a legitimate thing.
|
| I believe Intel CPUs actually sort of are #2. All your x86 is
| converted to microcode ops, sometimes optimized on the fly (I
| remember Intel discussing "micro ops fusion" a decade ago I
| think).
| ehaliewicz2 wrote:
| Yeah, for example, movs between registers are generally
| effectively no-ops and handled by the register renaming
| hardware.
| vmchale wrote:
| APL interpreters are tree-walking, but they're backed by C
| procedures. Then GCC optimizes quite well and you get
| excellent performance! With stuff like vector instructions.
|
| Getting on par with GCC/Clang with your own JIT is pretty
| hefty.
| coolandsmartrr wrote:
| I am amazed that the author (D Richard Hipp) made an effort to
| find and respond to a tweet that was neither (1) directed/"@tted"
| at him nor (2) written in his native language of English
| (original tweet is in Japanese[1]).
|
| [1] https://twitter.com/gorilla0513/status/1784623660193677762
| bena wrote:
| He might have a google alert for sqlite.org
| alex_smart wrote:
| I recently implemented my own expression evaluator in Java for
| in-memory data frames, and once you think deeply about doing
| that, you very quickly understand the need for bytecode. If you
| directly evaluate the expression using a tree representation, you
| basically have to do a whole lot of branching (either via switch
| statements or polymorphism) for every single useful operation.
| Yes, the branch predictor kicks in, so your code isn't as slow as
| it would be otherwise, but it is still measurably slower than if
| you converted the expression into bytecode once and just ran that
| on all rows instead.
| eklavya wrote:
| Bytecode will only impact packing, so it's more efficient RAM-,
| cache- and CPU-wise. But I don't understand how it would help
| with branching? You still have to make the same decisions? As in,
| the bytecode executor still needs to do different things based on
| the opcode; it's not in hardware.
| hickelpickle wrote:
| There is threaded bytecode as well, which uses direct jumping
| vs a switch for dispatch. This can improve branch prediction,
| though it is a debated topic and may not offer much
| improvement for modern processors.
| immibis wrote:
| How does it know where to jump to?
| chuckadams wrote:
| The jump target is compiled into the bytecode, so rather
| than return to the big switch statement, it jumps
| straight to the next opcode's implementation. The process
| is called "direct threading". These days a decent switch-
| based interpreter should fit in cache, so I'm not sure
| direct threading is much of a win anymore.
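|
| For the curious, the shape of it with the GCC/Clang
| labels-as-values extension is roughly this (a toy sketch, not how
| SQLite or any production VM actually does it):
|
|     #include <stdio.h>
|
|     /* Direct-threaded toy interpreter: the "bytecode" is an array
|        of handler addresses plus operands, so each handler ends by
|        jumping straight to the next handler instead of returning
|        to a switch. Label addresses are only valid inside the
|        function, so the program is assembled there. */
|     long run_demo(void) {
|       long stack[16]; int sp = 0;
|       void *prog[] = {
|         &&op_push, (void *)2L,
|         &&op_push, (void *)40L,
|         &&op_add,
|         &&op_halt
|       };
|       void **ip = prog;
|       goto **ip++;
|
|     op_push:
|       stack[sp++] = (long)*ip++;   /* operand follows the opcode */
|       goto **ip++;
|     op_add:
|       sp--; stack[sp-1] += stack[sp];
|       goto **ip++;
|     op_halt:
|       return stack[sp-1];
|     }
|
|     int main(void) { printf("%ld\n", run_demo()); return 0; }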
| kaba0 wrote:
| Do you have perhaps some links/references on that?
|
| I once tried benchmarking it by writing a tiny VM interpreter
| and a corresponding threaded one with direct jumps in Zig (which
| can force-inline a call, so I could do efficient direct jumps),
| and, surprisingly to me, I found that the naive while-switch loop
| was faster, even though the resulting assembly of the second
| approach looked right.
|
| I wasn't sure if I saw that only because of my tiny language and
| dumb example program, or if it's something deeper. E.g. the JVM
| does use direct-threaded code for its interpreter.
| alex_smart wrote:
| Consider evaluating an expression like "col1 + col2 == col3"
|
| A tree-based evaluator has to do something like:
|
|     for (int i = 0; i < df.size(); i++) {
|       // switch on expression type (column, binary operator,
|       //   unary operator etc)
|       // switch on expression operator type (add, subtract,
|       //   equals etc)
|       // switch on column type (int, long, float etc)
|       // actual evaluation
|     }
|
| whereas a bytecode-based evaluator is able to run something
| equivalent to:
|
|     for (int i = 0; i < df.size(); i++) {
|       result[i] = (df.getLong(column1Index, i)
|                    + df.getLong(column2Index, i))
|                   == df.getLong(column3Index, i);
|     }
| naasking wrote:
| You still have to branch on opcode selection which you're
| omitting in your translation. Branching on "expression
| type" and column type in your first example is also
| unnecessary. Bytecodes have different opcodes to operate on
| different types and different arities, so in both cases you
| have only one unpredictable branch for each
| operation/opcode.
|
| The main benefit of bytecodes is then cache friendliness,
| by removing all of those unpredictable loads and stores to
| obtain the expression details (opcode, arguments, etc.).
| All of those are inlined into the bytecode array which is
| already in cache.
| alex_smart wrote:
| > Branching on "expression type" and column type in your
| first example is also unnecessary
|
| It's not?
| naasking wrote:
| It is unnecessary, only branching on expression operator
| type is strictly necessary. You're trying to abstract too
| much in your AST which yields an inefficient interpreter.
| I can invent an inefficient bytecode too, but that
| wouldn't be a correct evaluation of how fast bytecode
| interpretation could be when done right.
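|
| Concretely, by specialized opcodes I mean something like this
| (hypothetical names, just to illustrate): the type and arity are
| baked into the opcode when the expression is compiled, so the
| inner loop branches once per operation, on the opcode alone.
|
|     typedef enum {
|       OP_LOAD_LONG_COL,   /* push a long column value       */
|       OP_ADD_LONG,        /* pop two longs, push their sum  */
|       OP_EQ_LONG,         /* pop two longs, push a == b     */
|       OP_HALT
|     } Op;
|
|     typedef struct { Op op; int arg; } Insn;
|
|     /* "col1 + col2 == col3" compiles once into:
|          LOAD_LONG_COL 0, LOAD_LONG_COL 1, ADD_LONG,
|          LOAD_LONG_COL 2, EQ_LONG, HALT
|        and the per-row loop just dispatches on insn.op. */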
| sitkack wrote:
| You should look at https://janino-compiler.github.io/janino/
| which can compile Java to class files in memory that can be
| executed directly.
| alex_smart wrote:
| I found this by looking at the Spark source code to figure out
| what they were doing under the hood to solve this problem. :)
| adontz wrote:
| Looks like SQLite could benefit from a copy-and-patch JIT
| compiler.
| Someone wrote:
| I think that would make it cross a border where 'Lite' no
| longer applies.
|
| It also would be quite a challenge given their long-term-
| support statement (https://www.sqlite.org/lts.html):
|
| _"Cross-platform Code - SQLite runs on any platform with an
| 8-bit byte, two's complement 32-bit and 64-bit integers, and a
| C compiler. It is actively tested on all currently popular CPUs
| and operating systems. The extreme portability of the SQLite
| code and file format will help it remain viable on future
| platforms."_
| adontz wrote:
| There is no implied requirement to support JIT everywhere.
| samatman wrote:
| I think this is unlikely for SQLite (other comments cover why
| it probably wouldn't happen even if it were likely to benefit),
| but I happened to have the copy and patch paper open in a tab,
| so I'll take the opportunity to share it here.
|
| https://sillycross.github.io/assets/copy-and-patch.pdf
|
| It has great potential for many applications, if not this
| particular one.
| bambax wrote:
| Typo: _A prepared statement is an object that represents the
| steps needed_ [to] _accomplish the input SQL_.
| sgbeal wrote:
| Fixed, thank you for the report. It won't be visible on the
| site until the next time it's rebuilt.
| jiehong wrote:
| SQLite EXPLAIN plan is indeed represented as a table, but I don't
| find it necessarily that much easier to read and understand.
|
| I often miss having the cardinality and the number of bytes read
| for each part of the query, like Oracle query plans provide.
|
| Or is everyone really that happy with SQLite query plans?
| bvrmn wrote:
| There is EXPLAIN QUERY PLAN, which outputs the usual high-level
| plan description. But there are no easily reached disk/cache
| usage stats.
| manx wrote:
| I'm wondering if one could write this bytecode directly (or in a
| higher-level imperative language) instead of SQL. Often, the
| programmer knows exactly which index lookups need to happen in a
| loop, while it seems like a burden to express that in SQL. This
| might also be an opportunity to create a different, type-safe DSL
| for database access.
| thayne wrote:
| I was wondering the same thing. And in particular if a new
| query language that avoided many of the pitfalls of SQL could
| compile down to that bytecode and avoid having to deal with SQL
| as an intermediate representation. Also, if you can compile to
| bytecode ahead of time, then that could save the time needed to
| parse the text of a sql query to bytecode at runtime.
| nraynaud wrote:
| I'm putting my wish list here:
|
| - be able to put the projection in a variable and reuse it
| (and I think ORM people might love it)
|
| - have a quick way to forward the non-aggregated fields of the
| projection to a group by (maybe with the aforementioned
| variables)
| TachyonicBytes wrote:
| The DuckDB API I was talking about above seems to already meet
| your use cases?
|
| - Does this[1] solve the group by wish?
|
| - Depending on what you mean by projection, maybe this[2]
| or this[3]?
|
| [1] https://duckdb.org/docs/api/python/relational_api#aggre
| gatee...
|
| [2] https://duckdb.org/docs/api/python/relational_api#proje
| ctexp...
|
| [3] https://duckdb.org/docs/api/python/expression#column-
| express...
| forinti wrote:
| A client-server database would hash the SQL and cache the
| output of the parser. The end result is the same.
|
| Maybe SQLite could have a similar mechanism, but with the cache
| kept on disk or in an external memory cache.
| gwbas1c wrote:
| In the C# library for SQLite, a DbCommand with
| parameterized SQL can be reused, thus reusing the bytecode.
| astrobe_ wrote:
| > Also, if you can compile to bytecode ahead of time, then
| that could save the time needed to parse the text of a sql
| query to bytecode at runtime.
|
| I think we already kind of have that; one can prepare a
| query/statement and then use it multiple times. Regex engines
| also do that, except that in the case of SQLite one can bind
| different parameter values.
|
| Programmers worried about SQL lexing/parsing times can compile
| their most-used queries once and for all at program startup.
| Plus, one can sometimes break a query with annoyingly variable
| parts into smaller queries glued together with SQLite API calls.
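|
| In SQLite's C API the prepare-once, run-many pattern looks
| roughly like this (a sketch; the users table and name column are
| made up, and error handling is elided):
|
|     #include <sqlite3.h>
|
|     /* Parse/compile the SQL to bytecode once, then bind, step and
|        reset the same prepared statement for every lookup. */
|     void lookup_many(sqlite3 *db, const sqlite3_int64 *ids, int n)
|     {
|       sqlite3_stmt *stmt;
|       sqlite3_prepare_v2(db,
|           "SELECT name FROM users WHERE id = ?1", -1, &stmt, 0);
|
|       for (int i = 0; i < n; i++) {
|         sqlite3_bind_int64(stmt, 1, ids[i]);
|         while (sqlite3_step(stmt) == SQLITE_ROW) {
|           /* use sqlite3_column_text(stmt, 0) */
|         }
|         sqlite3_reset(stmt);           /* rewind the program     */
|         sqlite3_clear_bindings(stmt);  /* optional: drop params  */
|       }
|       sqlite3_finalize(stmt);
|     }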
| wruza wrote:
| I was advocating for it for decades, but everyone dismisses it
| with "you don't know better than the rdbms". That's almost a
| religion. Despite the fact that most of the tables most people
| have are never that big (except for a few oltps that you ought
| to leave to specific sql server wizards anyway), and you usually
| _do have_ an idea of how it should work. SQL is a cool solution
| for a set of problems that has a very non-zero xor with the set
| of problems that we actually have. And most of our complex
| queries are trivially expressible as imperative loops. Index
| lookup and complex plan building are the key features of an
| rdbms; everything else it steals from you and forces you to use
| an arcane, inconvenient language that in practice isn't even a
| standard. Make plan building and indexing core ops and leave
| everything else to some sort of general-purpose bytecode, and
| everyone will be happier and will create dozens of DSLs for all
| their needs.
| lqet wrote:
| > I was advocating for it for decades, but everyone dismisses
| it with "you don't know better than rdbms". That's almost a
| religion.
|
| Which is simply not true. Query optimization is done
| heuristically, for the simple reason that you usually _need
| to run the query_ to get the information required for
| "perfect" optimization. If the RDBMS really knew better, it
| wouldn't offer query hints.
| kayodelycaon wrote:
| Postgres doesn't offer query hints. ;)
| TachyonicBytes wrote:
| Something akin to this is available[1] in DuckDB, which itself
| started as a "shell" over SQLite. Using Python, you can compose
| your SQL plans as you like.
|
| I recall a previous discussion about the SQLite bytecode and
| potential alternatives, where the main idea was that, if SQLite
| had gone the other way, you could have a much more general
| engine, where you could achieve something like DuckDB itself
| just as a set of extensions.
|
| Reading the draft, it doesn't seem that the extensibility of
| SQLite was a main factor in the decision. Maybe this trade-off
| also covers extensions somehow, which would be nice to see in the
| final document.
|
| [1] https://duckdb.org/docs/api/python/expression
| asvitkine wrote:
| The main downside is that now you're making the bytecode an API,
| which means all future changes need to be backwards compatible.
| Retr0id wrote:
| > Hence, no tree-of-objects database engine provides the level of
| detail in their "EXPLAIN" output that SQLite provides.
|
| I've seen some cool graphical visualisation tools for postgres
| http://tatiyants.com/postgres-query-plan-visualization/
| chucke1992 wrote:
| Well SQLite is basically a religious cult at this point. I am
| always impressed by it.
| euroderf wrote:
| Doesn't the Go language also execute bytecode? There's gotta be
| a hyperefficient hybrid lurking in there somewhere.
| zadokshi wrote:
| Go compiles to binary files and libraries. It's more like C
| than Java.
| euroderf wrote:
| Aah OK my bad.
|
| https://github.com/teh-cmc/go-internals/blob/master/chapter1...
| matharmin wrote:
| What I find interesting is that this approach means the query
| plan is chosen when "compiling" the statement, not when running
| it. However, there are still some cases where SQLite will
| recompile the statement at run time, such as when the schema has
| changed.
| groue wrote:
| Out of curiosity, I asked on the SQLite forum how the query
| optimizer fits in this schema:
| https://sqlite.org/forum/forumpost/93f9bfcec0
| carterschonwald wrote:
| So one thing that's not often talked about is that with the right
| invariants/program transformations, the two approaches are
| equivalent :)
| zoomablemind wrote:
| How does this reflect on SQLite's possibility of parallelizing
| some of the operations, like aggregates, for example?
| Ericson2314 wrote:
| At the bottom, it says that is a downside.
| dittonedo wrote:
| I just wonder why hackers can't write homoiconic (if that's the
| word) languages that take array tokens and produce results like
| so. I often wonder: what if SQLite had also been designed to have
| its statements written like ('select '* 'from 'table)? It would
| also have been better for third-party libraries implementing
| their ORMs.
|
| Since all of this (bytecode, AST) is done by almost every
| programming language I've ever explored (Python, Ruby, Lua, ...),
| it would have been awesome if statements were easily parseable, I
| guess.
|
| Most of the ORM implementations are very error-prone because of
| those design decisions. (Especially for SQL databases.)
| pizlonator wrote:
| That's awesome, I didn't know SQLite had a bytecode.
|
| Based on my experience writing interpreters, I know that bytecode
| is faster to interpret.
|
| (Source: I wrote JavaScriptCore's interpreter and its optimizing
| JITs. I've also worked on other language implementations. I've
| seen the AST vs bytecode thing play out more than once.)
| andruc wrote:
| Faster to interpret than what?
| jjice wrote:
| I assume they meant faster to interpret than a tree of
| objects, since that's the other concept discussed in the
| article.
| pizlonator wrote:
| Yes
| pdubroy wrote:
| I think most people associate bytecode VMs / interpreters with
| general-purpose programming languages, but it's a surprisingly
| useful concept in other contexts.
|
| Sometimes bytecode VMs appear in unexpected places! A few that
| I'm aware of:
|
| - eBPF, an extension mechanism in the Linux kernel
|
| - the DWARF expression language, with interpreters in debuggers
| like GDB and LLDB
|
| - the RAR file format, which includes a bytecode encoding for
| custom data transformations
|
| More here: https://dubroy.com/blog/bytecode-vms-in-surprising-places/
| WickedSmoke wrote:
| Bytecode is great for tiny domain languages and I use them in
| many projects.
|
| Ultima IV used one to animate sprites on its title screen map.
| For the next version of XU4 I implemented three bytecode
| interpreters to script the entire title sequence. There's a
| high level presentation interpreter, a GPU rendering
| interpreter, and one for the Ultima IV bytecode.
| xymostech wrote:
| The original TeX fonts stored their metrics in TFM (short for
| TeX font metric) files, which contain bytecode for calculating
| ligatures and kerning between characters. I learned about that
| when I tried reading the files myself.
|
| From what I can tell, modern fonts using OpenType just have
| tables to accomplish something similar now, in the form of the
| GSUB and GPOS tables?
|
| Documentation for the TFM format here:
| https://tug.org/TUGboat/Articles/tb02-1/tb02fuchstfm.pdf
| (search for lig/kern)
| atombender wrote:
| TrueType (and OpenType, which is an evolution of TT)
| absolutely includes a bytecode instruction set:
| https://developer.apple.com/fonts/TrueType-Reference-
| Manual/...
| astrobe_ wrote:
| Also regular expression engines.
| omoikane wrote:
| I believe these fall under "interpreter pattern":
|
| https://en.wikipedia.org/wiki/Interpreter_pattern
|
| https://gameprogrammingpatterns.com/bytecode.html
|
| https://wiki.c2.com/?InterpreterPattern
|
| I first read about it from the "Design Patterns" book, which I
| would recommend to everyone.
| Narishma wrote:
| They are frequently used in games to run scripting languages.
| UncleEntity wrote:
| Yep, Greenspun's tenth rule:
|
| Any sufficiently complicated C or Fortran program contains an
| ad hoc, informally-specified, bug-ridden, slow implementation
| of half of Common Lisp.
| naasking wrote:
| On incrementally running bytecode:
|
| > This is more difficult to achieve in a tree-of-objects design.
| When the prepared statement is a tree-of-objects, execution is
| normally accomplished by walking the tree. To pause the statement
| in the middle of a computation means unwinding the stack back up
| to the caller, all the while saving enough state to resume
| evaluation where it last left off. This is not impossible to do,
| but it is sufficiently difficult that I have never seen it
| actually done.
|
| I don't think it's too difficult. You can avoid stack unwinding
| by just not using the native stack in the first place and using
| an explicit stack. This is one step towards a bytecode VM though,
| as you're virtualizing the stack. Definitely easier in bytecode
| though as it already makes all of that state explicit.
|
| For the article overall, isn't it a bit trivial to see that a
| bytecode interpreter must be faster and more compact? Write the
| expression tree out to an array as a post-order traversal and
| that is basically an executable bytecode. It will be more compact
| because you no longer need hidden malloc/object headers and
| pointers for subexpressions (they're preceding indices!), and it
| will be faster because it's basically all in cache, so no cache
| misses from pointer chasing.
|
| I can't imagine a situation in which the tree would ever be
| faster for execution, in general. At best, the overhead might be
| just too small to care about. We use trees in compilers because
| we need to perform incremental rewrites. This is trivial with
| trees but can get expensive when expanding and contracting
| arrays. Execution doesn't need to perform rewrites though, it
| needs linear, cache-friendly access patterns.
|
| However, the article does make a good point that the query plan
| may be tuned on the fly during execution, e.g. the small rewrites
| I mentioned above occurring while executing. I'm not fully
| convinced this couldn't be done in bytecode though.
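|
| A toy sketch of what I mean: flatten the tree post-order into an
| array and evaluate it left to right with an explicit value stack.
| Pausing also becomes trivial, because the whole machine state is
| just (pc, stack).
|
|     #include <stddef.h>
|
|     /* Post-order-flattened expression: operands precede their
|        operator, so one linear, cache-friendly pass with a small
|        stack evaluates it. */
|     typedef enum { N_CONST, N_ADD, N_MUL } Kind;
|     typedef struct { Kind kind; double value; } Node;
|
|     double eval_flat(const Node *nodes, size_t n) {
|       double stack[64]; size_t sp = 0;
|       for (size_t pc = 0; pc < n; pc++) {
|         switch (nodes[pc].kind) {
|         case N_CONST: stack[sp++] = nodes[pc].value;   break;
|         case N_ADD:   sp--; stack[sp-1] += stack[sp];  break;
|         case N_MUL:   sp--; stack[sp-1] *= stack[sp];  break;
|         }
|       }
|       return stack[0];
|     }
|
|     /* (1 + 2) * 3 flattens to:
|        { {N_CONST,1}, {N_CONST,2}, {N_ADD,0},
|          {N_CONST,3}, {N_MUL,0} }                       */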
___________________________________________________________________
(page generated 2024-04-30 23:01 UTC)