https://sqlite.org/draft/whybytecode.html
SQLite
*** DRAFT ***
* Home
* Menu
* About
* Documentation
* Download
* License
* Support
* Purchase
* Search
* About
* Documentation
* Download
* Support
* Purchase
[Search Documentation] [ ] [Go]
Why SQLite Uses Bytecode
> Table Of Contents
1. Introduction
1.1. How To Provide Feedback
1.2. Definition Of "Bytecode"
1.3. Definition Of "Abstract Syntax Tree" or "AST"
1.4. Dataflow Programming
2. Advantages To Compiling Into Bytecode
2.1. Bytecode Is Easier To Understand
2.2. Bytecode Is Easier To Debug
2.3. Bytecode Can Be Run Incrementally
2.4. Bytecode Is Smaller
2.5. Bytecode Is Faster
3. Advantages Of Compiling Into A Tree Of Objects
3.1. Query Planning Decisions Can Be Deferred Until Runtime
3.2. Dataflow Programs Are Easy To Parallelize
1. Introduction
Every SQL database engine works in roughly the same way: It first
translates the input SQL text into a "prepared statement". Then it
"executes" the prepared statement to generate a result.
A prepared statement is an object that represents the steps needed to
accomplish the input SQL. Or, to think of it in another way, the
prepared statement is the SQL statement translated into a form that
is more easily understood by the computer.
In SQLite, a prepared statement is an instance of the sqlite3_stmt
object. In other systems, the prepared statement is usually an
internal data structure that is not directly visible to the
application programmer. Developers of other SQL database engines do
not necessarily call these objects "prepared statements". But such
objects exists, whatever they might be called. This paper will use
the term "prepared statement".
There are countless ways of implementing a prepared statement. This
paper will look at the two most common methods:
1. Bytecode - The input SQL is translated into a virtual machine
language that is then run by a virtual machine interpreter. This
is the technique used by SQLite.
2. Tree-Of-Objects - The input SQL is translated in a tree of
objects that represent the processing to be done. The SQL is
executed by walking this tree. This is the technique used by
MySQL and PostgreSQL.
There are advantages and disadvantages to each of these
representations of a prepared statement. The purpose of this paper is
to articulate some of those advantages and disadvantages.
1.1. How To Provide Feedback
This document is written from the perspective of the original author
of SQLite. If you disagree with any of the opinions offered in this
document, you are welcomed to offer corrections and/or contrary views
on the SQLite Forum. Or you can email the author directly.
1.2. Definition Of "Bytecode"
The bytecode generated by SQLite might be a little different from
what many readers think of as bytecode. The bytecode used (for
example) by the Java virtual machine or by WebAssembly consists
almost entirely of low-level operations, similar to what physical
CPUs implement: basic math operators, comparisons, conditional jumps,
and instructions to move content between different memory locations.
SQLite bytecode has these kinds of low-level instructions, too. But
SQLite bytecode also contains some high-level operations that are
specific to the needs of a database engine. Here are just a few
examples:
* OP_Column - Extract the value from the N-th column of the
database row that a particular cursor is currently pointing at.
* OP_CreateBtree - Allocate space for a new B-Tree in the database
file.
* OP_ParseSchema - Reread and reparse all or part of the
sqlite_schema table and refresh internal symbol tables
accordingly.
* OP_SeekGE - Move a cursor on a particular B-Tree to the first
entry that is greater than or equal to a given key.
* OP_Next - Advance a cursor on a particular B-Tree to the next
entry in the B-Tree and jump, or fall through if there are no
more entries in that B-Tree.
In other words, the "bytecode" used by SQLite is not so much a set of
CPU instructions as it is a list of database primitives that are to
be run in a particular order.
1.3. Definition Of "Abstract Syntax Tree" or "AST"
An "Abstract Syntax Tree" or AST is a data structure that describes a
program or statement in some kind of formal language. In our context,
the formal language is SQL. An AST is typically implemented as a tree
of objects where each object represents one small part of the overall
SQL statement. ASTs emerge naturally from parsers for formal
languages. The usual technique is to use an LALR(1) parser. With such
a parser, each terminal symbol holds metadata that will become a leaf
of the AST, and each non-terminal symbol holds metadata that will
become a sub-branch of the overall AST. As rules of the grammar are
"reduced" by the parser, new nodes of the AST are allocated and
connected to subnodes. After the parse completes, the start-symbol of
the grammar is left holding the root of the AST.
An AST is a tree of objects. But an AST is not a suitable form for a
prepared statement. After being generated, an AST first needs to be
transformed in various ways before it can executed. Symbols need to
be resolved. Semantic rules need to be checked. Optimizations need to
be applied that transform input SQL statement into different forms
that execute more quickly. Finally, the AST needs to be translated
into an alternative representation that is more amenable to
execution.
Some people refer to the tree of objects that is used as the
executable form for MySQL and PostgreSQL as an AST. This is probably
a misuse of the term "AST", because by the time the tree of objects
is ready to be executed, it has been changed so much that it has
little resemblance to the original SQL text. The confusion arises in
part because both the final prepared statement object and the
original AST are both trees of objects. The usual technique is for
the original AST that comes directly out of the parser to be
transformed little by little, in multiple passes, until at the end it
is fully converted into an tree of objects that is no longer strictly
an AST but that can be evaluated to generate a result. There is not
necessarily a clear point during this process when the
tree-of-objects ceases to be an AST and becomes a prepared statement
instead. And because there is no clear boundary between an AST and a
prepared statement, people often refer to a prepared statement that
is represented as a tree of objects as an "AST", even though that
description is not precise.
1.4. Dataflow Programming
Dataflow programming is a style of programming in which individual
nodes specialize in doing one small part of the overall computation.
Each node receives inputs from other nodes and sends its output to
other nodes. Thus the nodes form a directed graph that carry inputs
into outputs.
A "dataflow program" is perhaps a better description than "AST" for
the tree of objects that an SQL database engine uses as a prepared
statement.
2. Advantages To Compiling Into Bytecode
SQLite compiles to bytecode, and the SQLite developers are very happy
with this approach. Here is why:
2.1. Bytecode Is Easier To Understand
A flat list of opcodes can be easily printed to see exactly how an
SQL statement is being implemented. This is what happens in SQLite
when you preface an SQL statement with the "EXPLAIN" keyword: Instead
of actually running the SQL, the result is a listing of the bytecode
that would have been used to implement that SQL.
Bytecode lends itself to this because a bytecode program is easily
represented as a table. In SQLite bytecode, each instruction has one
opcode and five operands. Thus a prepared statement can be rendered
as if it were a query against a six-column table.
A tree-of-objects representation is more difficult to publish in a
human-readable form. The objects that comprise the tree tend to all
be very different, and thus it is tricky to come up with a consistent
and simple table representation with which to display the objects.
Any such table representation that you do come up with would almost
certainly have more than six columns, probably many more. The problem
of rendering a tree-of-objects as a table is sufficiently difficult
that nobody does it, as far as I know. Hence, no tree-of-objects
database engine provides the level of detail in their "EXPLAIN"
output that SQLite provides.
2.2. Bytecode Is Easier To Debug
Bytecode provides a clear separation between front-end parsing and
analysis and back-end evaluation of an SQL statement. When problems
arise (incorrect answers and/or poor performance), the developers can
examine the bytecode to quickly determin if the source of the trouble
is either the front-end analysis or the back-end data storage section
of the product.
In debugging builds of SQLite, the PRAGMA vdbe_trace=ON; command will
cause a trace of the bytecode execution to appear on the console.
2.3. Bytecode Can Be Run Incrementally
SQL statements written in bytecode can be evaluated incrementally.
For example, a statement can be run until it generates just its first
row of output. The statement then pauses until it is stepped again.
It is not necessary to run the statement to completion before
examining the first row of output.
This is more difficult to achieve in a tree-of-objects design. When
the prepared statement is a tree-of-objects, execution is normally
accomplished by walking the tree. To pause the statement in the
middle of a computation means unwinding the stack back up to the
caller, all the while saving enough state to resume evaluation where
it last left off. This is not impossible to do, but is sufficiently
difficult that I have never seen it actually done.
Most SQL database engines do not really need to do incremental
execution of prepared statements because most SQL database engines
are client/server. In client/server engines, a single SQL statement
is sent to the server, then the complete reply comes back over the
wire all at once. Thus each statement runs to completion in a single
go. But SQLite is not client/server. SQLite is a library that runs in
the same address space and using the same stack as the application.
Being able to easily and reliably perform incremental execution of an
SQL statement is important to SQLite.
2.4. Bytecode Is Smaller
The bytecode generated by SQLite is usually smaller than the
corresponding AST coming out of the parser. During initial processing
of SQL text (during the call to sqlite3_prepare() and similar) both
the AST and the bytecode exist in memory at the same time, so more
memory is used then. But that is a transient state. The AST is
quickly discarded and its memory recycled, even before the call to
sqlite3_prepare() returns, so the resulting prepared statement ends
up consuming less memory in its bytecode representation than it did
as an AST. This is important because calls to sqlite3_prepare() are
transient, but prepared statements are often cached for possible
reuse and persist in memory for a long time.
2.5. Bytecode Is Faster
I believe that a bytecode representation of a prepared statement runs
faster, because fewer decisions need to be made for each step of the
computation. Emphasis on "believe" in the previous sentence - it is
difficult to verify this claim experimentally since nobody has ever
put in the multiple years of effort necessary to generate equivalent
bytecode and tree-of-object representations of a prepared statement
to see which one actually runs faster. We do know that SQLite is very
fast, but we do not have good side-by-side comparisons with other SQL
databases since the other databases spend a lot of time doing client/
server message processing, and it is difficult to untangle the
message round-trip overhead from the actual processing time.
3. Advantages Of Compiling Into A Tree Of Objects
The SQLite developers think that the bytecode approach is best, at
least for the use cases the SQLite tries to fill, but the
tree-of-objects approach to processing SQL does have some advantages
over bytecode. There are always tradeoffs.
3.1. Query Planning Decisions Can Be Deferred Until Runtime
When a prepared statement is bytecode, once the bytecode has been
generated, the algorithm is fixed and cannot be subsequently changed
without completely rewriting the bytecode. This is not the case with
a tree-of-objects prepared statement. A tree-of-objects is easier to
modify on-the-fly. The query plan is mutable and can be tweaked as it
is running, based on the progress of the query. Thus a query can be
dynamically self-tuning.
3.2. Dataflow Programs Are Easy To Parallelize
In a dataflow program, each processing node can be assigned to a
different thread. There needs to be some kind of threadsafe queuing
mechanism for transferring intermediate results from one node to the
next. But no synchronization primitives are typically needed within
each node of the program. Node schedule is trivial: A node becomes
eligible to run when it has data available and there is space in its
output queue.
This is an important consideration for database engines that are
designed to run large analytic queries (OLAP) on large multi-core
servers. The primary focus of SQLite is transaction processing (OLTP)
on the internet-of-things, so there is less need to represent
prepared statements as dataflow programs in SQLite.
This page last modified on 2024-04-30 18:42:17 UTC
*** DRAFT ***