http://muratbuffalo.blogspot.com/2022/03/cockroachdb-resilient-geo-distributed.html

CockroachDB: The Resilient Geo-Distributed SQL Database

March 04, 2022

This paper appeared in SIGMOD 2020. There is a 10-minute, but extremely useful, SIGMOD presentation, and also a 1-hour extended presentation. CockroachDB is open source and available on GitHub. The core features of the database are under a Business Source License (BSL), which converts to a fully open-source Apache 2.0 license after three years.

Storage layer

CockroachDB (or CRDB for short) consists of a distributed SQL layer on top of a distributed key-value store. The key-value store is laid out as a single monolithic ordered key space. The logical key space is physically realized by dividing it into contiguous chunks of keys, called ranges. Ranges are about 64 megabytes in size. Ranges start empty, grow, split when they get too large, and merge with their neighbors when they get too small. The ranges are kept sorted; I will explain why in the SQL discussion later.

As we will discuss soon, CRDB uses multi-version concurrency control (MVCC). Hybrid logical clock (HLC) timestamps are used for versioning. Values are never updated in place, and tombstones are used for deletion. This way the multi-version store provides a snapshot to each transaction. CRDB uses RocksDB (now Pebble) for storing these key-value ranges at the storage layer.
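To make this concrete, here is a minimal Go sketch of a multi-versioned key-value map of this kind: each key holds multiple versions ordered by an HLC-style timestamp, deletes write tombstones, and a read returns the newest version at or below the read timestamp, which is what gives each transaction a consistent snapshot. This is only an illustration of the idea; the types and names are made up, not CockroachDB's actual storage code (which sits on top of Pebble).

```go
package main

import (
	"fmt"
	"sort"
)

// Timestamp is a simplified HLC timestamp: physical wall time plus a
// logical counter for ordering events within the same wall-clock tick.
type Timestamp struct {
	WallTime int64
	Logical  int32
}

func (t Timestamp) Less(o Timestamp) bool {
	if t.WallTime != o.WallTime {
		return t.WallTime < o.WallTime
	}
	return t.Logical < o.Logical
}

// version is one MVCC version of a key. A nil value marks a tombstone.
type version struct {
	ts    Timestamp
	value []byte // nil => deletion tombstone
}

// mvccStore keeps all versions of every key; values are never updated in place.
type mvccStore struct {
	data map[string][]version // versions kept sorted by descending timestamp
}

func newMVCCStore() *mvccStore { return &mvccStore{data: map[string][]version{}} }

// Put writes a new version at ts; Delete writes a tombstone at ts.
func (s *mvccStore) Put(key string, ts Timestamp, val []byte) {
	vs := append(s.data[key], version{ts: ts, value: val})
	sort.Slice(vs, func(i, j int) bool { return vs[j].ts.Less(vs[i].ts) })
	s.data[key] = vs
}

func (s *mvccStore) Delete(key string, ts Timestamp) { s.Put(key, ts, nil) }

// Get returns the newest version visible at the read timestamp, giving each
// transaction a consistent snapshot of the store.
func (s *mvccStore) Get(key string, readTS Timestamp) ([]byte, bool) {
	for _, v := range s.data[key] {
		if !readTS.Less(v.ts) { // v.ts <= readTS
			return v.value, v.value != nil
		}
	}
	return nil, false
}

func main() {
	s := newMVCCStore()
	s.Put("dog/1", Timestamp{WallTime: 10}, []byte("Sunny"))
	s.Put("dog/1", Timestamp{WallTime: 20}, []byte("Sunny v2"))
	v, ok := s.Get("dog/1", Timestamp{WallTime: 15})
	fmt.Println(string(v), ok) // "Sunny" true: a snapshot at ts=15 sees the older version
}
```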
Ranges are the unit of replication. CRDB uses the Raft consensus protocol for replication, and each range is a Raft group. Again, each range is a Raft group: replication is not at the level of nodes but at the level of ranges. This provides fine-grained control over data placement. For example, here you can see a database of dogs, where the orange range is a Raft group replicated over the 3 nodes it is currently assigned to.

Raft provides a useful building block: atomic replication. Commands are proposed by the leaseholder, which you can assume is the same as the Raft leader. A command is accepted when a quorum of replicas acknowledge it.

Distributed transactions

Transactions can span arbitrary ranges, and they follow a conversational protocol to support SQL, where the full set of operations may not be known up front. Transactions in CockroachDB use the serializable isolation level. To guarantee atomicity even for transactions that span multiple ranges, CockroachDB takes advantage of the range-level atomicity that Raft provides. Each transaction is associated with a transaction record, which is itself stored in a range. Just like other data updates, writes to the transaction record go through Raft, which is how atomicity is supported for multi-range transactions.

Let's look at this simple example from the presentation to understand how transactions work. Consider the older, suboptimal version of the protocol first. This is basically 2PC over Paxos groups. An important point is the use of a transaction record that is flipped atomically to maintain the status of the transaction.

The example shows a cluster with three ranges spread across four nodes. The leaseholders for each range are highlighted with a black outline. The insert statement inserts two rows, Sunny and Ozzie, into our dogs table. To begin, the client connects to a gateway node, which connects to the leaseholder for the range containing Sunny, since Sunny is the first key written as part of the transaction. The transaction record is created on the range containing Sunny. To replicate the transaction record, the leaseholder proposes a Raft command which writes the record to itself and the follower replicas. Once the leaseholder and at least one of the followers accept the creation of the transaction record, the transaction is in progress. Next, the same leaseholder proposes a Raft command that writes Sunny to itself and the followers. It again waits for a quorum before moving on.

Next we write Ozzie. The leaseholder for Ozzie proposes a Raft command to write the record to itself and its followers; one of the followers acknowledges, and since we have a quorum the write is complete. As a final step, the gateway node commits the transaction by updating the transaction record from pending to committed via another round of consensus.

Optimized transactions

The above protocol requires multiple consensus round trips. CockroachDB's protocol evolved into one that can commit a distributed transaction at the latency of a single consensus round trip, with additional asynchronous round trips to perform cleanup. The idea is to pipeline the writes, write the transaction record with a "staging" status, and flip it to committed once quorums acknowledge the writes.

Here is the explanation from the presentation. The first write combines two operations: begin transaction and write Sunny. With the serial protocol we had to wait for each write to complete before initiating the next one: write the transaction record, wait for it to complete; write Sunny, wait for it to complete. With pipelining we can fire off the write to Sunny without waiting for it to complete. The transaction record is also not written until later. This is safe because there is a small grace window where the transaction can have a non-existent transaction record, and that means it is in progress rather than aborted. With pipelining, we can initiate a second write before waiting for the first to complete, and the commit is also pipelined. But in this case we need to mark the transaction record as staging rather than committed. Staging basically means that the transaction is finished, but it is not clear yet whether the commit was successful. Once all the writes complete successfully, the gateway node knows that the commit was successful and can return to the client.
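Below is a rough Go sketch of this pipelined commit flow, under the assumption that a proposeWrite call stands in for replicating an intent write through Raft: the gateway fires off all intent writes, marks the transaction record as staging in parallel, and acknowledges the client as soon as every write has been acknowledged by its quorum; the flip to committed (and intent cleanup) happens asynchronously. The names and structure are illustrative, not CockroachDB's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

type txnStatus int

const (
	PENDING txnStatus = iota
	STAGING
	COMMITTED
	ABORTED
)

// TxnRecord is the per-transaction record whose status is flipped atomically
// (via Raft, in the real system) to decide the fate of all the txn's intents.
type TxnRecord struct {
	ID       string
	Status   txnStatus
	InFlight []string // keys whose intent writes must succeed for an implicit commit
}

// proposeWrite stands in for a Raft proposal that replicates an intent write
// to a quorum of the range's replicas. Here it always succeeds.
func proposeWrite(key, value string) error {
	fmt.Printf("replicated intent %s=%s to a quorum\n", key, value)
	return nil
}

// commitPipelined fires off the intent writes without waiting on each other,
// records the transaction as staging, and returns once all writes succeed.
func commitPipelined(rec *TxnRecord, writes map[string]string) error {
	var wg sync.WaitGroup
	errs := make(chan error, len(writes))

	for k, v := range writes {
		rec.InFlight = append(rec.InFlight, k)
		wg.Add(1)
		go func(k, v string) {
			defer wg.Done()
			errs <- proposeWrite(k, v)
		}(k, v)
	}
	// The staging record is written in parallel with the intents (in the real
	// system this is itself a replicated write listing the in-flight keys).
	rec.Status = STAGING

	wg.Wait()
	close(errs)
	for err := range errs {
		if err != nil {
			rec.Status = ABORTED
			return err
		}
	}
	// All writes succeeded: the transaction is implicitly committed and the
	// client can be acknowledged now. The explicit flip to COMMITTED and the
	// cleanup of intents happen asynchronously in the real system.
	rec.Status = COMMITTED
	return nil
}

func main() {
	rec := &TxnRecord{ID: "txn-1"} // zero value of Status is PENDING
	_ = commitPipelined(rec, map[string]string{"dog/1": "Sunny", "dog/2": "Ozzie"})
	fmt.Println("acknowledged commit to client")
}
```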
Concurrency control

To achieve serializable transactions, concurrency control is required to detect conflicts and reorder transactions as needed. Below I use the description from the paper, in order not to make any mistakes in my description.

An atomic commit for a transaction is achieved by considering all of its writes provisional until commit time. CRDB calls these provisional values write intents. An intent is a regular MVCC KV pair, except that it is preceded by metadata indicating that what follows is an intent. This metadata points to a transaction record, which is a special key (unique per transaction) that stores the current disposition of the transaction: pending, staging, committed, or aborted. The transaction record serves to atomically change the visibility of all the intents at once, and is durably stored in the same range as the first write of the transaction.

As mentioned before, CRDB is an MVCC system and each transaction performs its reads and writes at its commit timestamp. This results in a total ordering of all transactions in the system, representing a serializable execution. However, conflicts between transactions may require adjustments of the commit timestamp. The situations in which they arise are described below; whenever the commit timestamp does change, the transaction typically tries to prove that its prior reads remain valid at the new timestamp (Section 3.4 of the paper), in which case it can simply continue forward at the updated timestamp.

Write-read conflicts. A read running into an uncommitted intent with a lower timestamp will wait for the earlier transaction to finalize. Waiting is implemented using in-memory queue structures. A read running into an uncommitted intent with a higher timestamp ignores the intent and does not need to wait.

Read-write conflicts. A write to a key at timestamp ta cannot be performed if there has already been a read on the same key at a higher timestamp tb >= ta. CRDB forces the writing transaction to advance its commit timestamp past tb.

Write-write conflicts. A write running into an uncommitted intent with a lower timestamp will wait for the earlier transaction to finalize (similar to write-read conflicts). If it runs into a committed value at a higher timestamp, it advances its timestamp past it (similar to read-write conflicts). Write-write conflicts may also lead to deadlocks in cases where different transactions have written intents in different orders. CRDB employs a distributed deadlock-detection algorithm to abort one transaction from a cycle of waiters.

Certain types of conflicts described above require advancing the commit timestamp of a transaction. To maintain serializability, the read timestamp must be advanced to match the commit timestamp. Advancing a transaction's read timestamp from ta to tb > ta is possible if we can prove that none of the data that the transaction read at ta has been updated in the interval (ta, tb]. If the data has changed, the transaction needs to be restarted. To determine whether the read timestamp can be advanced, CRDB maintains the set of keys in the transaction's read set (up to a memory budget). A "read refresh" request validates that the keys have not been updated in a given timestamp interval (Algorithm 1, lines 11 to 14 in the paper). This involves re-scanning the read set and checking whether any MVCC values fall in the given interval.

Advancing the transaction's read timestamp is also required when a scan encounters an uncertain value: a value on a key at a timestamp above the transaction's provisional commit timestamp but within its uncertainty interval. In this case the transaction performs an uncertainty restart, moving its provisional commit timestamp above the uncertain value but keeping the upper bound of its uncertainty interval fixed. This corresponds to treating all values in a transaction's uncertainty window as past writes. As a result, the operations on each key performed by transactions take place in an order consistent with the real-time ordering of those transactions.
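The read-refresh check described above can be phrased as a small validation loop over the transaction's read set, in the spirit of the paper's Algorithm 1: the read timestamp can be advanced from ta to tb only if no key in the read set has a committed version in the interval (ta, tb]. A minimal Go sketch, with simplified timestamps and a toy version store (the names are mine, not CockroachDB's):

```go
package main

import "fmt"

// Timestamp is a simplified timestamp (wall time only, for brevity).
type Timestamp int64

// versions maps each key to the timestamps of its committed MVCC versions.
type versions map[string][]Timestamp

// refreshReads checks whether a transaction that read the keys in readSet at
// timestamp ta can advance its read timestamp to tb > ta. This is allowed only
// if none of those keys has a committed version in the interval (ta, tb].
func refreshReads(store versions, readSet []string, ta, tb Timestamp) bool {
	for _, key := range readSet {
		for _, ts := range store[key] {
			if ts > ta && ts <= tb {
				return false // a prior read would change at tb: refresh fails, restart the txn
			}
		}
	}
	return true // prior reads are still valid at tb: continue at the new timestamp
}

func main() {
	store := versions{
		"dog/1": {5, 12}, // written at ts 5 and again at ts 12
		"dog/2": {7},
	}
	// A txn read dog/2 at ta=10 and now must push its commit timestamp to tb=20.
	fmt.Println(refreshReads(store, []string{"dog/2"}, 10, 20)) // true: dog/2 unchanged in (10, 20]
	// The same push fails if the txn had also read dog/1, which changed at ts 12.
	fmt.Println(refreshReads(store, []string{"dog/1", "dog/2"}, 10, 20)) // false
}
```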
Spanner versus CockroachDB

The CRDB commit protocol is similar to Spanner's commit protocol (2012) in that Spanner is also 2PL + 2PC over replication groups. However, in Spanner the commit timestamp does not move up: Spanner sets it once the locks are taken, and then the transaction commit waits out the clock uncertainty of the commit timestamp as needed. Spanner can afford to do this by using atomic clocks with very small uncertainty intervals. CRDB does not rely on specialized hardware for clock synchronization, so it can run on off-the-shelf servers in public and private clouds with software-level clock synchronization services such as NTP. Each node within a CRDB cluster maintains a hybrid logical clock (HLC), which provides timestamps that are a combination of physical and logical time. HLCs provide causality tracking through their logical component upon each inter-node exchange, strict monotonicity within and across restarts on a single node, and self-stabilization in the presence of isolated transient clock skew fluctuations. However, since the physical clock synchronization uncertainty can be high, CRDB still requires the read-refresh and uncertainty-restart techniques above to ensure serializability of transactions.

SQL layer

OK, let's start by revisiting the question of why CRDB maintains sorted ranges. Sorting is useful for SQL, in order to do efficient range scans for SQL queries using indexing. The range index (the addressing information for locating ranges) is itself maintained as a range; it is treated specially and kept as the first range in the system. When a range splits, the same transaction that updates the range also updates this index.

To run SQL over CRDB, SQL's tabular data is mapped to key-value data, with keys of the form /table/index/key. For example, /inventory/primary/1. The key prefix (table and index) is stored in a compact encoded form for storage savings. Let's consider a secondary index, and examples of how it maps to keys (the primary key is appended as a suffix to keep index keys unique):

INDEX name_idx(name)
/inventory/name_idx/"Bat"/1
/inventory/name_idx/"Bat"/4
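Here is a toy Go encoder that mirrors this mapping, producing primary-index keys like /inventory/primary/1 and secondary-index keys like /inventory/name_idx/"Bat"/1 (with the primary key appended to keep index entries unique). It is only illustrative; real CockroachDB keys use numeric table and index IDs and an order-preserving binary encoding rather than printable strings.

```go
package main

import "fmt"

// Row is a row of a toy "inventory" table with a primary key id and a name.
type Row struct {
	ID   int
	Name string
}

// primaryKey encodes the row's primary index entry: /table/primary/<id>.
func primaryKey(table string, r Row) string {
	return fmt.Sprintf("/%s/primary/%d", table, r.ID)
}

// secondaryKey encodes an entry for a secondary index on name:
// /table/<index>/<name>/<id>. The primary key id is appended as a suffix so
// that two rows with the same name still get distinct index keys.
func secondaryKey(table, index string, r Row) string {
	return fmt.Sprintf("/%s/%s/%q/%d", table, index, r.Name, r.ID)
}

func main() {
	rows := []Row{{1, "Bat"}, {4, "Bat"}}
	for _, r := range rows {
		fmt.Println(primaryKey("inventory", r))
		fmt.Println(secondaryKey("inventory", "name_idx", r))
	}
	// Output:
	// /inventory/primary/1
	// /inventory/name_idx/"Bat"/1
	// /inventory/primary/4
	// /inventory/name_idx/"Bat"/4
}
```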
The SQL execution layer turns a transaction into a series of key-value operations. It has the following phases: Scan -> Filter -> Project -> Results. If there is an index, the filter gets pushed down into the scan. The SQL query optimizer has the phases Parse -> Prep -> Search -> Execute, and it can do cost-based search.

Lessons learned

The paper has an interesting lessons learned section. CockroachDB doesn't support any isolation level lower than serializable. The paper says the following: "Since CRDB was primarily designed for SERIALIZABLE, we initially expected that offering just snapshot isolation by removing the check for write skews would be simple. However, this proved not to be the case. The only safe mechanism to enforce strong consistency under snapshot isolation is pessimistic locking, via the explicit locking modifiers FOR SHARE and FOR UPDATE on queries. To guarantee strong consistency across concurrent mixed isolation levels, CRDB would need to introduce pessimistic locking for any row updates, even for SERIALIZABLE transactions. To avoid this pessimization of the common path, we opted to eschew true support for SNAPSHOT, keeping it as an alias to SERIALIZABLE instead."

The long presentation includes a passing comment: "Raft is not any easier to implement than another Paxos protocol." The paper has a similar comment: "We initially chose Raft as the consensus algorithm for CRDB due to its supposed ease of use and the precise description of its implementation. In practice, we have found there are several challenges in using Raft in a complex system like CRDB."

The lessons learned section discusses many Raft improvements. They make Raft less chatty by coalescing the heartbeat messages into one per node to save on per-RPC overhead, and by pausing Raft groups that have seen no recent write activity. They also implemented a change to the Raft reconfiguration protocol, called Joint Consensus, which is similar to ZooKeeper's reconfiguration protocol.

The discussion about PostgreSQL compatibility is also interesting: "We chose to adopt PostgreSQL's SQL dialect and network protocol in CRDB to capitalize on the ecosystem of client drivers. This choice initially boosted adoption and still results today in enhanced focus and decision-making in the engineering team. However, CRDB behaves differently from PostgreSQL in ways that require intervention in client-side code. For example, clients must perform transaction retries after an MVCC conflict and configure result paging. Reusing PostgreSQL drivers as-is requires us to teach developers how to deploy CRDB-specific code at a higher level, anew in every application. This is a recurring source of friction which we had not anticipated. As a result, we are now considering the gradual introduction of CRDB-specific client drivers."

Labels: databases, distributed transactions, distSQL, newsql

Comments

Tobin Baker said...
I don't know how the claim that snapshot isolation requires pessimistic concurrency control ever got through review. There are plenty of MVCC databases that use OCC (I know because I've written one, but there are many others). To claim the contrary is not just inaccurate; it shows a misunderstanding of elementary database principles. There are legitimate reasons to use PCC rather than OCC for an MVCC database (e.g., for reducing wasted work due to aborts in highly contended workloads), but it is not remotely a requirement.
March 5, 2022 at 1:02 PM