Strict-serializability, but at what cost, for what purpose?

August 03, 2022

Strict-serializability guarantees that transactions appear to occur in an order consistent with the "real-time" ordering of those transactions: if transaction T1 commits before transaction T2 is invoked, then the commit timestamp of T1 precedes the commit timestamp of T2. This is, in fact, the real-time constraint from linearizability, but applied across transactions, not just per key. A strict-serializable system satisfies both serializability (transactions appear to occur as if they are executed one at a time in isolation) and linearizability per key (after all, single-key reads/writes are transactions over one item). The figure below is from https://jepsen.io/consistency.

[Figure: consistency model hierarchy, from jepsen.io/consistency]

However, this is a one-way implication; the other direction does not hold. You can satisfy both serializability across transactions and linearizability per key, yet fail to satisfy strict-serializability. (Below I give an example, accompanied by a TLA+ specification to check it.) This is because, in strict-serializability, linearizability must apply across transactions, not just per key.

Linearizability alone can be cheap, because it is per key. It is often achieved by assigning a key to a single shard and serializing access to that shard via Paxos/Raft. In contrast, strict-serializability is expensive, because it needs to hold across arbitrary keys. Now you need to coordinate across all shards in the system to ensure that the assignment of timestamps does not violate the real-time ordering of transactions. We will consider the cost of doing this in the "at what cost" section.

This raises the question: what application requires strict-serializability? We should give customers strict-serializability if they need it. But does the customer really need it, or do they just want to cover all bases and be safe? Reasoning about isolation and consistency is especially hard for distributed databases, so customers may simply opt for the highest level offered. We will discuss this in the "for what purpose" section.

An execution that is serializable but not strict-serializable

Consider two keys, x and y. Initially, x=0 and y=0. Consider three transactions.

T1: r(y) r(x)
T2: w(x,1)
T3: w(y,1)

Suppose this is their timestamped ordering:

T1 gets start-ts=1
T2 gets start-ts=1 and commit-ts=2
T3 gets start-ts=3 and commit-ts=4
T1 gets commit-ts=5

Assume T1 returns y=1 and x=0 when it commits. This is not acceptable under a strict-serializable execution. If T1 saw the y=1 update from T3, it should also have seen the x=1 update from T2, because T2 has a smaller commit timestamp than T3, and this real-time ordering should be respected.

On the other hand, it is possible to find a serializable execution for T1, T2, and T3 where T1 reads y=1 and x=0. Simple: pretend the order is

1. T3 starts and commits
2. T1 starts and commits
3. T2 starts and commits

All serializability provides is the guarantee that transactions appear to occur as if they are executed one at a time in isolation.
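To see both claims concretely, here is a minimal brute-force sketch in Python (not the TLA+ model from the appendix): it enumerates serial orders of the three transactions and checks whether T1 reading y=1 and x=0 can be explained by any serial order (serializability), and by any serial order that also respects the real-time constraints implied by the start/commit timestamps (strict-serializability).

```python
from itertools import permutations

# name -> (start_ts, commit_ts, observed reads, writes)
txns = {
    "T1": (1, 5, {"y": 1, "x": 0}, {}),
    "T2": (1, 2, {},               {"x": 1}),
    "T3": (3, 4, {},               {"y": 1}),
}

def explains(order):
    """Replay a serial order from x=0, y=0 and check every observed read matches."""
    state = {"x": 0, "y": 0}
    for name in order:
        _, _, reads, writes = txns[name]
        if any(state[k] != v for k, v in reads.items()):
            return False
        state.update(writes)
    return True

def respects_real_time(order):
    """If A commits before B starts (by timestamp), A must precede B in the serial order."""
    pos = {name: i for i, name in enumerate(order)}
    for a, (_, a_commit, _, _) in txns.items():
        for b, (b_start, _, _, _) in txns.items():
            if a != b and a_commit < b_start and pos[a] > pos[b]:
                return False
    return True

orders = list(permutations(txns))
print("serializable?       ", any(explains(o) for o in orders))                            # True, via (T3, T1, T2)
print("strict-serializable?", any(explains(o) and respects_real_time(o) for o in orders))  # False: T2 must precede T3
```

The only real-time constraint in this example is that T2 precede T3 (T2's commit-ts=2 is before T3's start-ts=3; T1 is concurrent with both), and no serial order satisfying that constraint explains T1's reads.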
Serializability is not concerned with the real-time ordering of the transactions' timestamps, so it is allowed to order T3 before T2, ignoring the timestamps assigned to their commits. Who knows, maybe this is the real ordering, and the timestamps had errors in them: y is assigned to a shard whose host has a clock that is ahead of "real time", so T3 is assigned (start-ts=3, commit-ts=4) even though it happens first. T1, with clocks close to real time, executes on another host, gets (start-ts=1, commit-ts=5), and reads T3's update. Finally T2 starts on another host, whose clock is behind "real time", and gets assigned (start-ts=1, commit-ts=2). This is plausible, because it is hard to tightly synchronize clocks across all hosts. However, if the client had faith in the timestamps and expected results to follow the timestamp ordering, this behavior will look anomalous to the client. This anomaly is called the causal-reverse anomaly, and this post has a detailed discussion of how it applies to CockroachDB in practice.

It is all about managing expectations

Maybe the right way to think about this is as follows. What is an anomaly? An anomaly means that your mental model does not match the isolation model provided by the database.

Even strict-serializability has anomalies. Yes! Some users think that if they start T1 first and then start T2, the commit order should be T1 followed by T2, and are surprised when strict-serializability does not provide it. In this case, T1 and T2 are concurrent (T1's commit does not come before T2 starts), so strict-serializability does not apply. This case does not even fall under the jurisdiction of "external consistency", which guarantees T1 comes before T2 if T1 commits before T2 starts to commit. Yet the user is surprised, so this is an anomaly to them. The user will then have to address this inadequacy of strict-serializability by realizing that if T2 is semantically related to T1, they should not start T2 before T1 commits. They can insert a synchronization read in T2 to see T1 take effect.

Serializability has the causal-reverse anomaly, as we discussed in detail in the T1, T2, T3 example above. There would be no anomaly to the user here if the user did not put blind trust in the commit timestamps. But if this matters to the user, the user can close the gap by including within T2 a read of a key T1 wrote, whenever T2 is semantically related to T1. Another way to achieve this is to use a type system, such as a parent-child relationship between semantically related keys. In other words, the user makes the semantic relation explicit, and achieves the desired purpose while living with serializability isolation. (Instead of the abstract/conceptual definition of serializability, here I am thinking of the underlying sharded distributed database model including linearizability per key, which gives Strong Partition Serializability.) A minimal sketch of this synchronization-read pattern appears at the end of this section.

Snapshot isolation has the write-skew anomaly, as we discussed in an earlier post. The solution is, yes, again the same: if T2 is semantically related to T1, include within T2 a read of the key T1 updated, and make the semantic relation explicit.

The bottom line is that it is all about managing expectations. You need to choose an isolation level you can live with and learn about the "anomalies" it allows. Remember, even strict-serializability has anomalies if you had different expectations than what it provides. There is no shortcut around being educated about isolation levels.
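To make the synchronization-read idea concrete, here is a minimal sketch assuming a hypothetical key-value transaction API (db.begin(), txn.read(), txn.write(), txn.commit() are illustrative names, not the interface of any particular database). T2 first waits until T1's effect is visible, and then reads the key T1 wrote inside its own transaction, so the dependency is explicit in the serialization order.

```python
import time

# Hypothetical client API, for illustration only:
#   db.begin() -> txn;  txn.read(key) -> value;  txn.write(key, value);  txn.commit()

def run_t1(db):
    txn = db.begin()
    txn.write("x", 1)                  # the update that T2 semantically depends on
    txn.commit()

def wait_until_visible(db, key, expected, poll_interval=0.05):
    """Synchronization read: do not start T2 until T1's effect on `key` is visible."""
    while True:
        txn = db.begin()
        value = txn.read(key)
        txn.commit()
        if value == expected:
            return
        time.sleep(poll_interval)

def run_t2(db):
    wait_until_visible(db, "x", 1)     # addresses "don't start T2 before T1 commits"
    txn = db.begin()
    x = txn.read("x")                  # reading the key T1 wrote inside T2 makes the
    txn.write("y", x + 1)              # T1 -> T2 dependency explicit, ruling out a
    txn.commit()                       # causal reverse between them under serializability
```

The wait covers the "user started T2 too early" surprise from the strict-serializability discussion, and the in-transaction read of x covers the causal-reverse anomaly for this pair of keys, without paying for linearizability across all keys.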
Strict-serializability, but at what cost?

At this point, we have a pretty good grounding in what strict-serializability provides and what it does not provide. Now let's consider its cost.

Strict-serializability is often implemented by imposing a bounded clock skew between any pair of hosts. This assumption is needed to guarantee that if transaction T1 commits before transaction T2 is invoked, then the commit timestamp of T1 precedes the commit timestamp of T2. Spanner takes this approach with TrueTime timestamps, which use atomic clocks. Atomic clocks provide small clock skews, but networking over switches inevitably introduces nondeterminism in timing and increases the clock uncertainty bound. The Spanner paper gave this bound as 7 milliseconds, even with some networking-layer protocol integration. This has definitely improved in the 10 years since Spanner's publication, but it is expensive to do full-stack integration to drive the clock uncertainty bound really low.

The protocol cost of using synchronized clocks is that Spanner has to wait out the maximum clock uncertainty bound before committing a transaction (a sketch of this commit wait is given at the end of this section). This has to be a really pessimistic, worst-case bound. And even then, the guarantee is essentially probabilistic: due to faults, it may be possible to violate the maximum clock uncertainty bound and leak a causal reverse.

Another way of implementing this is to have a logically-centralized ordering service, realized via event dependency graph tracking or via a quorum-based timestamping service. In either case, you get a timestamp from the ordering service when starting a transaction and check back in with the service when committing, so it can keep track of dependencies. This inevitably introduces coordination (hence latency) across all hosts managing different shards, even when the keys are not semantically related. I find the clock-bounding approach preferable to this one: network latency does not improve much with hardware improvements, but tighter clock synchronization does, as we have been witnessing.

Calvin is similar in gist to the second approach of logically-centralized ordering, but it comes with a deterministic-execution twist. Calvin pre-orders transactions into logs in batches, and then deterministically executes each log in the same order on the replicas. Deterministic execution helps eliminate the need for distributed commit protocols. The downsides are the loss of conversational transaction execution as in SQL, and high latency due to the batch-based processing. As the price of determinism, Calvin can only do non-conversational transactions: transactions that must perform reads in order to determine their full read/write sets are not natively supported, since Calvin's deterministic locking protocol requires advance knowledge of all transactions' read/write sets before execution can begin.
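The commit-wait rule that makes bounded clock uncertainty sufficient can be written in a few lines. Below is a minimal sketch assuming a TrueTime-style interface that returns an interval [earliest, latest] guaranteed to contain true time; tt_now and EPSILON are illustrative stand-ins, not Spanner's actual API.

```python
import time

EPSILON = 0.007  # assumed maximum clock uncertainty bound (the 7 ms figure from the Spanner paper)

def tt_now():
    """TrueTime-style interval (earliest, latest) that is guaranteed to contain true time.
    Faked here with the local clock widened by the assumed uncertainty bound."""
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def commit(apply_writes):
    """Assign a commit timestamp, then wait out the uncertainty before acknowledging."""
    _, latest = tt_now()
    commit_ts = latest                   # a timestamp no earlier than true time
    apply_writes(commit_ts)              # make the transaction's writes durable at commit_ts
    while tt_now()[0] <= commit_ts:      # commit wait: until earliest > commit_ts,
        time.sleep(0.001)                # i.e., commit_ts is definitely in the past everywhere
    return commit_ts                     # only now acknowledge the commit to the client
```

Any transaction invoked after this acknowledgment will pick its timestamp from an interval whose latest bound exceeds commit_ts, so the real-time commit order is preserved; the price is the wait itself, which is why driving the uncertainty bound down matters so much.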
Strict-serializability, but for what purpose?

What application requires strict-serializability? What is the business reason to insist on linearizability across all keys? What is the reason to require precise real-time ordering across updates to any items, as if any update may be semantically related to any other? If strict-serializability is needed, it likely applies to a subset of items. The entire system should not be paying the cost of waiting out maximum clock uncertainty bounds just to provide that for a subset of items. The application developer would know those items, and can add synchronization explicitly via a read, as we discussed in the "managing expectations" section. (Or it would be nice if the system could give strict-serializability on demand for certain operation types or data types.)

Is there a graph-based application that may require strict-serializability across all items? I won't argue against a convenience argument. Maybe the customer likes to pay the price of strict-serializability at the OLTP service so they don't need to be bothered with hassles in analytics. But what is that application in the real world? My intuition is that even graph analytics does not go around taking consistent snapshots using timestamps and evaluating predicates on those snapshots.

Maybe there is a "build it and they will come" argument for providing strict-serializability. But since strict-serializable distributed SQL systems have been around for some time, some applications must have arrived by now, right? Have you seen someone using strict-serializability over a distributed SQL system whose purpose would not be served by serializability? I am asking about real-world, in-production use cases, not conceptual/hypothetical scenarios. If you know one, DM me on Twitter or email me.

We talked about the technical cost of strict-serializability in terms of latency and throughput. There is also a monetary cost associated with it. If you can tell me about your use case, I can tell you whether you really need it, or whether you can avoid it without any inconvenience and save money by going with a weaker isolation level. Win-win, I say.

Appendix: TLA+ model

Here is the TLA+ model for the example I gave above. It uses the client-centric isolation checking module I discussed last time. You can see that the execution I discussed above satisfies serializability but fails to satisfy strict-serializability. At the end of the model, check what happens if I make T2 and T3 slightly overlap! With the timestamps2 assignment, strict-serializability of the execution becomes possible again, because it becomes possible for T3 to update before T2, and for T1 to get serialized for commit in that brief instant. With strict-serializability being this fickle and fragile with respect to ordering requirements, it is hard for applications to take hard dependencies on real-time ordering.

[Figure: TLA+ model]

Labels: cloud computing, consistency, databases, distributed transactions, distSQL

Comments

Tobin Baker said...
Note that you can get strict serializability WRT begin timestamps by using classic timestamp ordering. (Timestamp ordering actually turns out to be useful for non-monotonic commit timestamps, e.g. derived from read/write sets a la TicToc/Sundial, or from loosely synchronized local clocks a la Silo/Cicada/Meerkat.)
August 3, 2022 at 4:43 PM