# Testing consistency of rqlite

By Nienke Eijsvogel, Ruben van Baarle and Daan de Graaf.

## What is rqlite?

Rqlite is an easy-to-use, lightweight, distributed relational database that uses SQLite as its storage engine. Each rqlite node exposes an HTTP API through which data can be inserted into and read back from the database. (An HTTP API is simply an API that uses the Hypertext Transfer Protocol as the communication protocol between the two systems.)

Rqlite provides data durability via the Raft consensus algorithm. Every change made to the SQLite database is written to the Raft log, and that log is by default persisted to disk. The Raft log contains SQL statements and is used to distribute those statements to all nodes. Once a log entry reaches a majority of nodes, the SQL statement can be applied to the SQLite database on each node. The statements must therefore be deterministic. In the event of a restart or recovery operation, the SQLite database is completely rebuilt from the data contained in the Raft log.
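To give a feel for that API: a write is an HTTP POST of a JSON array of SQL statements to `/db/execute`, and a read is a GET against `/db/query`, where `level=strong` requests a linearizable read that goes through Raft. The snippet below is only a minimal illustration of that shape, using the clj-http and cheshire libraries and a made-up node address; our actual tests talk to rqlite through the rqlite-java client instead.

```clojure
(require '[clj-http.client :as http]
         '[cheshire.core :as json])

(def node "http://10.0.0.1:4001") ; hypothetical address of any rqlite node

;; Write: POST a JSON array of SQL statements to /db/execute.
(http/post (str node "/db/execute")
           {:content-type :json
            :body (json/generate-string
                    ["CREATE TABLE Test (id INT PRIMARY KEY, val INT)"
                     "INSERT INTO Test VALUES (0, 42)"])})

;; Read: GET /db/query; level=strong asks for a linearizable read.
(http/get (str node "/db/query")
          {:query-params {"q"     "SELECT val FROM Test WHERE id = 0"
                          "level" "strong"}})
```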
## Test Setup

For the tests, a cluster of 5 Vagrant virtual machines running Debian 11 "Bullseye" is set up on a fully connected private network. Rqlite version 7.3.1 is used. As the test framework we use Jepsen together with the Knossos checker, and the rqlite-java client library to talk to the database.

## Jepsen Tests

### Comments

The Comments test checks for strict serializability. Strict serializability guarantees that operations take place atomically: a transaction's sub-operations do not appear to interleave with sub-operations from other transactions. The test looks for a strict serializability anomaly in which T1 < T2, but T2 is visible without T1.

```sql
CREATE TABLE Test ( id INT PRIMARY KEY, key INT );
INSERT INTO Test VALUES (0, 1), (0, 2), /* ... */, (5, 98), (5, 99);
```

We perform concurrent blind inserts across n tables and meanwhile perform reads of those tables in a transaction. To verify, we replay the history, tracking the writes which are known to have completed before the invocation of any write w_i. If w_i is visible, and some w_j < w_i is not visible, we have found a violation of strict serializability. Keys are split up over different tables to make sure they fall in different shard ranges.

* Read: `"SELECT id FROM table WHERE key = " k`
* Write: `"INSERT INTO table VALUES ('" id "," k "')"`

*(Figure: Comments example)*

To check correctness, the history is replayed as described above: if some write w_i is visible but one of the writes known to have finished before it is not, a violation has been found.

### Sequential

The Sequential test checks for sequential consistency: every execution is equivalent to one in which all read/write operations are executed in some global order, and the operations of each client process appear in the order specified by its program. If a process' order enforces that x must be visible before y, we should always read both or neither.

*(Figure: Sequential example)*

```clojure
"CREATE TABLE " t " (key varchar(255) primary key)"
```

We emit sequential integer keys for writes, logging the most recent n keys in a given atom wrapping a PersistentQueue.

* Write: `"INSERT INTO " table " VALUES ('" k "')"`

We use the last-written atom to perform a read of a randomly selected, recently written value.

* Read: `"SELECT key FROM " (key->table table-count k) " WHERE key = '" k "'"`

A custom checker then scans the history for violations.

### Register

The Register test checks linearizability of accesses to independent registers, i.e. individual rows in a table. We start by creating a simple table with a few rows:

```sql
CREATE TABLE Test ( id INT PRIMARY KEY, val INT );
INSERT INTO Test VALUES (0, 0), (1, 0), /* ... */, (99, 0);
```

We then concurrently issue a mix of read, write and compare-and-swap operations:

* Read: `SELECT val FROM Test WHERE id = 0`
* Write: `UPDATE Test SET val = x WHERE id = 0`
* Compare-and-swap: `UPDATE Test SET val = x WHERE id = 0 AND val = y`

All these operations target the first row of Test for now; we will need the other rows later. In our current setup the operations are issued concurrently by 5 different clients, so at most 5 transactions will be in flight at any given time. We record the start and end time of each transaction, producing a log similar to this:

*(Figure: timeline view for a history generated by the Register test)*

The height of a box marks the duration of the corresponding transaction. Operations can succeed (blue), fail (pink), or be inconclusive (orange, not shown here). The only failures in the diagram above are compare-and-swap operations: they only succeed if the expected value matches the one present in the database (the format is `cas [...]`).

Now that we have generated a transaction history, we must check whether it is linearizable. Thankfully the Knossos linearizability checker provides a compare-and-swap register model for exactly this setup, so we can define the checker simply as:

```clojure
(checker/linearizable {:model     (model/cas-register 0)
                       :algorithm :linear})
```

One caveat is that linearizability checking is not cheap; it is in fact NP-complete. If our history contains too many transactions, Knossos will be unable to verify linearizability. A way around this is to periodically switch to a different register. Since transactions on different registers are independent, this reduces the search space significantly, while still being nearly as effective at testing as the original approach. In our setup we switch registers after every 1000 operations, a history size for which Knossos easily verifies linearizability in a few seconds on the author's laptop.
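Jepsen ships helpers for exactly this many-independent-keys pattern in its `jepsen.independent` namespace. The sketch below shows how the pieces could fit together; it is illustrative rather than a copy of our test code, and the `r`/`w`/`cas` generator functions, the value range and the key sequence are assumptions.

```clojure
(require '[jepsen.checker :as checker]
         '[jepsen.generator :as gen]
         '[jepsen.independent :as independent]
         '[knossos.model :as model])

;; Invoke-operation generators for the register workload
;; (names and the value range are illustrative).
(defn r   [_ _] {:type :invoke, :f :read,  :value nil})
(defn w   [_ _] {:type :invoke, :f :write, :value (rand-int 5)})
(defn cas [_ _] {:type :invoke, :f :cas,   :value [(rand-int 5) (rand-int 5)]})

;; Move to a fresh register after 1000 operations, and check each
;; register's sub-history independently.
{:generator (independent/sequential-generator
              (range)                                       ; register ids 0, 1, 2, ...
              (fn [_k] (gen/limit 1000 (gen/mix [r w cas]))))
 :checker   (independent/checker
              (checker/linearizable {:model     (model/cas-register 0)
                                     :algorithm :linear}))}
```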
That concludes our tests for linearizability and sequential consistency. In the next section we cover how to inject faults into the tests, and check that rqlite handles them correctly.

## Nemeses

So far we have been testing the happy path of rqlite, where communication is reliable and fast, and nodes diligently handle all of their requests immediately. These conditions are common in deployments and regular integration tests, and thus are likely to be handled correctly by databases. However, we expect a database to uphold its consistency guarantees even in the presence of network issues or node crashes, so we augment our tests to inject faults using a so-called 'nemesis'. The configured nemesis toggles its effects every 5 seconds of the test run, so half of the time the test runs under ideal conditions, and the other half it experiences faults. We have implemented four different nemeses:

* **Network partitioning**: randomly partitions the nodes into two groups, each of which can communicate amongst themselves but not with the other group.
* **Pause rqlited**: randomly selects one of the nodes and temporarily halts its rqlite daemon process.
* **Network delay**: adds an average 1-second delay to messages exchanged between nodes.
* **Flaky network**: randomly drops 20% of network packets sent between nodes.

Next we discuss each of them in more detail.

### Network partitioning

When this nemesis is activated, it creates two clusters of roughly the same size and isolates them from each other. In our case we use 5 nodes, so there will be a cluster of three nodes and a smaller one with just two. Within a cluster all nodes remain fully connected with no network issues, and connections between nodes and clients likewise proceed uninterrupted. Between the two clusters, however, existing connections are dropped immediately, and no new connections can be made for the duration of the nemesis. The picture below shows an illustration of two example clusters.

*(Figure: example partition of the cluster into two groups)*

This has important consequences for the Raft consensus algorithm:

* If the current leader is in the smaller cluster, it will not be able to reach quorum, and new transactions will be stuck until a new leader is elected in the larger cluster (where a quorum is still possible).
* If the larger cluster processes any transactions, data in the smaller cluster becomes stale. To remain consistent, the smaller cluster must therefore not serve any read requests (and obviously no writes or compare-and-swaps either). Nodes in the small cluster are effectively just spinning their wheels trying to elect a leader with quorum.
* A faulty Raft implementation may lead to a situation where two leaders are simultaneously active, either because a previous leader in the small cluster does not realise it has lost quorum, or because a new leader is elected in the small cluster even though no quorum can be obtained there. This is known as a split brain, and can quickly cause the two clusters to diverge, with undefined results when the network is finally healed.
* When the network is healed and nodes in the small cluster can communicate with the leader again, they must make sure to replicate any transactions processed while the network was partitioned.

The Jepsen library provides us with an implementation of this nemesis, which we can use directly.
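Wiring such a nemesis into a Jepsen test typically looks like the sketch below, in the style of the Jepsen documentation. The 5-second cadence matches the toggle interval described above, but the exact code in our repository may differ.

```clojure
(require '[jepsen.generator :as gen]
         '[jepsen.nemesis :as nemesis])

;; The nemesis partitions the cluster into two random halves on :start
;; and heals the network again on :stop.
{:nemesis   (nemesis/partition-random-halves)
 :generator (gen/nemesis
              (cycle [(gen/sleep 5)
                      {:type :info, :f :start}
                      (gen/sleep 5)
                      {:type :info, :f :stop}]))}
```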
### Daemon pause

Instead of interfering with the network, this nemesis only affects the rqlite daemon process. When activated, it selects one of the nodes and sends a SIGSTOP signal to temporarily suspend the daemon, rendering it unable to service any requests. After 5 seconds it resumes execution by sending SIGCONT. The suspended daemon may then need to handle some uncommon situations correctly:

* If the node was a leader before suspension, it may have since lost that status, but it will resume believing it is still the leader. If it can successfully convince some nodes that it is a leader, this may even lead to a split-brain situation like the one we saw before.
* It may have missed transactions while it was suspended, which it will need to retrieve before it can process new requests.

This nemesis is also provided by Jepsen.

### Network delay

In our test setup all nodes are virtual machines on a single physical device, so their network latency is extremely low, even compared to a production environment. This nemesis adds an artificial delay of 1 second on average. Such a delay can help uncover replication bugs. For example, a write request sent to one node might normally be replicated to other nodes faster than it takes a read request to arrive at a second node, masking a potential bug. With the delay introduced, the second node is forced to either return stale data, indicating a bug, or wait until the changes have replicated. If the delay is high enough, it might also trigger a heartbeat timeout on one or more nodes, prompting a leader election.

Jepsen provides some building blocks for a delay nemesis, but those were found to crash on our particular nodes running Debian Bullseye, so we implemented our own based on Linux Traffic Control.

### Packet loss

Not only do production systems have higher latency than a virtual test setup, their networks are also less reliable and sometimes drop packets entirely. This can lead to missed updates on some nodes and trigger edge cases in the replication logic where messages need to be retransmitted. Our nemesis simulates an extreme case of this, dropping as much as 1 in 5 packets. Much like the delay nemesis, the Jepsen library provides part of an implementation, but it did not work for our test setup, so we implemented a custom nemesis, again using the Traffic Control APIs.
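Both custom nemeses essentially boil down to running a pair of `tc` commands on the affected nodes. The helpers below sketch that idea; the `eth0` interface name, the exact netem parameters and the function names are assumptions, and the real implementation lives in our repository.

```clojure
(require '[jepsen.control :as c])

;; Add a 1-second netem delay to all outgoing packets on a node
;; (run in a control session on that node, as root).
(defn add-delay! []
  (c/su (c/exec :tc :qdisc :add :dev :eth0 :root :netem :delay "1s")))

;; Drop 20% of outgoing packets instead.
(defn add-loss! []
  (c/su (c/exec :tc :qdisc :add :dev :eth0 :root :netem :loss "20%")))

;; Remove the netem discipline again, restoring normal networking.
(defn heal! []
  (c/su (c/exec :tc :qdisc :del :dev :eth0 :root)))
```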
## Results

We are happy to report that with strong read consistency enabled, rqlite passed all our tests with flying colours. This is not a proof that rqlite provides linearizability, as it may still violate it in situations we have not tested here. Even though this is not explicitly documented, rqlite also appears to respect client order when processing transactions, as this is one of the assumptions required for the Sequential test to pass.

One thing we have observed is that with nemeses enabled, rqlite struggles to maintain acceptable tail latency. The graph below shows the throughput of the cluster during the Register test with the partitioning nemesis enabled.

*(Figure: Register test throughput)*

Based on this graph, rqlite achieves up to 40 transactions per second, which is considerably less than the 50 Hz at which our tests send transactions. The reason appears to be that a few transactions take extremely long, as shown in the latency graph below.

*(Figure: Register test latency)*

It appears that while rqlite correctly handles faults, it introduces high tail latency in doing so. Low tail latency is crucial for large-scale systems, so we do not recommend rqlite for such deployments. Luckily, rqlite is designed for smaller deployments, for which this may not be as important.

## Conclusion

In this project we have created Jepsen tests for the rqlite distributed database to check linearizability and sequential consistency of the system. We conclude that in common situations, as well as in exceptional ones triggered by faults, rqlite maintains linearizability of record insertion and modification, provided strong read consistency is enabled, as evidenced by the tests and nemeses we have developed. This by no means constitutes a formal proof that rqlite upholds linearizability in the general case, but it does suggest that rqlite is designed to provide linearizability, and it should give users of the system more confidence in the implementation.

There is room for more Jepsen tests to be added to our setup. Common examples from other Jepsen test suites include a 'bank' test for transfers between rows of a shared table and a 'set' test for concurrent unique appends to a table. Similarly, more nemeses could be introduced, such as one that randomly kills and restarts nodes. We intend to discuss with rqlite's maintainers whether the Jepsen tests could be added to their testing setup and maintained there.

Our implementation is available on GitHub.