Strict-serializability, but at what cost, for what purpose?

August 03, 2022

Strict-serializability guarantees that transactions appear to occur in an order consistent with the "real-time" ordering of those transactions: if transaction T1 commits before transaction T2 is invoked, then the commit timestamp of T1 precedes the commit timestamp of T2. This is, in fact, the real-time constraint from linearizability, but applied across transactions, not just per key. A strict-serializable system satisfies both serializability (transactions appear to occur as if they are executed one at a time in isolation) and linearizability per key (after all, single-key reads/writes are transactions over one item). The figure below is from https://jepsen.io/consistency.

[Figure: consistency model hierarchy, from jepsen.io/consistency]

However, this is a one-way implication; the other direction does not hold. You can satisfy both serializability across transactions and linearizability per key, yet fail to satisfy strict-serializability. (Below I give an example, accompanied by a TLA+ specification to check it.) This is because, in strict-serializability, linearizability must apply across transactions, not just per key.

Linearizability alone can be cheap, because it is per key. It is often achieved by assigning a key to a single shard and serializing access to that shard via Paxos/Raft. In contrast, strict-serializability is expensive, because it needs to hold across arbitrary keys. Now you need to coordinate across all shards in the system to ensure that the assignment of timestamps does not violate the real-time ordering of transactions. We will consider the cost of doing this in the "at what cost" section.

This raises the question: what application requires strict-serializability? We should give customers strict-serializability if they need it. But does the customer really need it, or do they just want to cover all bases and be safe? Reasoning about isolation and consistency is especially hard for distributed databases, so customers may simply opt for the highest level offered. We will discuss this in the "for what purpose" section.

An execution that is serializable but not strict-serializable

Consider two keys, x and y. Initially, x=0 and y=0. Consider three transactions.

T1: r(y) r(x)
T2: w(x,1)
T3: w(y,1)

Suppose this is their timestamped ordering:

T1 gets start-ts=1
T2 gets start-ts=1 and commit-ts=2
T3 gets start-ts=3 and commit-ts=4
T1 gets commit-ts=5

Assume T1 returns y=1 and x=0 when it commits. This is not acceptable under a strict-serializable execution. If T1 saw the y=1 update from T3, it should also have seen the x=1 update from T2, because T2 has a smaller commit timestamp than T3, and this real-time ordering should be respected.

On the other hand, it is possible to find a serializable execution for T1, T2, and T3 where T1 reads y=1 and x=0. Simple: pretend the order is

1. T3 starts and commits
2. T1 starts and commits
3. T2 starts and commits

All serializability provides is the guarantee that transactions appear to occur as if they are executed one at a time in isolation.
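To see both claims concretely, here is a minimal brute-force sketch in Python (not the TLA+ model from the appendix): it enumerates serial orders of the three transactions and checks whether T1 reading y=1 and x=0 can be explained by any serial order (serializability), and by any serial order that also respects the real-time constraints implied by the start/commit timestamps (strict-serializability).

```python
from itertools import permutations

# name -> (start_ts, commit_ts, observed reads, writes)
txns = {
    "T1": (1, 5, {"y": 1, "x": 0}, {}),
    "T2": (1, 2, {},               {"x": 1}),
    "T3": (3, 4, {},               {"y": 1}),
}

def explains(order):
    """Replay a serial order from x=0, y=0 and check every observed read matches."""
    state = {"x": 0, "y": 0}
    for name in order:
        _, _, reads, writes = txns[name]
        if any(state[k] != v for k, v in reads.items()):
            return False
        state.update(writes)
    return True

def respects_real_time(order):
    """If A commits before B starts (by timestamp), A must precede B in the serial order."""
    pos = {name: i for i, name in enumerate(order)}
    for a, (_, a_commit, _, _) in txns.items():
        for b, (b_start, _, _, _) in txns.items():
            if a != b and a_commit < b_start and pos[a] > pos[b]:
                return False
    return True

orders = list(permutations(txns))
print("serializable?       ", any(explains(o) for o in orders))                            # True, via (T3, T1, T2)
print("strict-serializable?", any(explains(o) and respects_real_time(o) for o in orders))  # False: T2 must precede T3
```

The only real-time constraint in this example is that T2 precede T3 (T2's commit-ts=2 is before T3's start-ts=3; T1 is concurrent with both), and no serial order satisfying that constraint explains T1's reads.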
Serializability is not concerned with the real-time ordering of the transactions' timestamps, so it is allowed to order T3 before T2, ignoring the timestamps assigned to their commits. Who knows, maybe this is the real ordering, and the timestamps had errors in them: y is assigned to a shard whose host has a clock that is ahead of "real time", so T3 is assigned (start-ts=3, commit-ts=4) even though it happens first. T1, with clocks close to real time, executes on another host, gets (start-ts=1, commit-ts=5), and reads T3's update. Finally T2 starts on another host, whose clock is behind "real time", and gets assigned (start-ts=1, commit-ts=2). This is plausible, because it is hard to tightly synchronize clocks across all hosts. However, if the client had faith in the timestamps and expected results to follow the timestamp ordering, this behavior will look anomalous to the client. This anomaly is called the causal-reverse anomaly, and this post has a detailed discussion of how it applies to CockroachDB in practice.

It is all about managing expectations

Maybe the right way to think about this is as follows. What is an anomaly? An anomaly means that your mental model does not match the isolation model provided by the database.

Even strict-serializability has anomalies. Yes! Some users think that if they start T1 first and then start T2, the commit order should be T1 followed by T2, and are surprised when strict-serializability does not provide it. In this case, T1 and T2 are concurrent (T1's commit does not come before T2 starts), so strict-serializability does not apply. This case does not even fall under the jurisdiction of "external consistency", which guarantees T1 comes before T2 if T1 commits before T2 starts to commit. Yet the user is surprised, so this is an anomaly to them. The user will then have to address this inadequacy of strict-serializability by realizing that if T2 is semantically related to T1, they should not start T2 before T1 commits. They can insert a synchronization read in T2 to see T1 take effect.

Serializability has the causal-reverse anomaly, as we discussed in detail in the T1, T2, T3 example above. There would be no anomaly to the user here if the user did not put blind trust in the commit timestamps. But if this matters to the user, the user can close the gap by including within T2 a read of a key T1 wrote, whenever T2 is semantically related to T1. Another way to achieve this is to use a type system, such as a parent-child relationship between semantically related keys. In other words, the user makes the semantic relation explicit, and achieves the desired purpose while living with serializability isolation. (Instead of the abstract/conceptual definition of serializability, here I am thinking of the underlying sharded distributed database model including linearizability per key, which gives Strong Partition Serializability.) A minimal sketch of this synchronization-read pattern appears at the end of this section.

Snapshot isolation has the write-skew anomaly, as we discussed in an earlier post. The solution is, yes, again the same: if T2 is semantically related to T1, include within T2 a read of the key T1 updated, and make the semantic relation explicit.

The bottom line is that it is all about managing expectations. You need to choose an isolation level you can live with and learn about the "anomalies" it allows. Remember, even strict-serializability has anomalies if you had different expectations than what it provides. There is no shortcut around being educated about isolation levels.
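To make the synchronization-read idea concrete, here is a minimal sketch assuming a hypothetical key-value transaction API (db.begin(), txn.read(), txn.write(), txn.commit() are illustrative names, not the interface of any particular database). T2 first waits until T1's effect is visible, and then reads the key T1 wrote inside its own transaction, so the dependency is explicit in the serialization order.

```python
import time

# Hypothetical client API, for illustration only:
#   db.begin() -> txn;  txn.read(key) -> value;  txn.write(key, value);  txn.commit()

def run_t1(db):
    txn = db.begin()
    txn.write("x", 1)                  # the update that T2 semantically depends on
    txn.commit()

def wait_until_visible(db, key, expected, poll_interval=0.05):
    """Synchronization read: do not start T2 until T1's effect on `key` is visible."""
    while True:
        txn = db.begin()
        value = txn.read(key)
        txn.commit()
        if value == expected:
            return
        time.sleep(poll_interval)

def run_t2(db):
    wait_until_visible(db, "x", 1)     # addresses "don't start T2 before T1 commits"
    txn = db.begin()
    x = txn.read("x")                  # reading the key T1 wrote inside T2 makes the
    txn.write("y", x + 1)              # T1 -> T2 dependency explicit, ruling out a
    txn.commit()                       # causal reverse between them under serializability
```

The wait covers the "user started T2 too early" surprise from the strict-serializability discussion, and the in-transaction read of x covers the causal-reverse anomaly for this pair of keys, without paying for linearizability across all keys.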
Strict-serializability, but at what cost?

At this point, we have a pretty good grounding in what strict-serializability provides and what it does not provide. Now let's consider its cost.

Strict-serializability is often implemented by imposing a bounded clock skew between any pair of hosts. This assumption is needed to guarantee that if transaction T1 commits before transaction T2 is invoked, then the commit timestamp of T1 precedes the commit timestamp of T2. Spanner takes this approach with TrueTime timestamps, which use atomic clocks. Atomic clocks provide small clock skews, but networking over switches inevitably introduces nondeterminism in timing and increases the clock uncertainty bound. The Spanner paper gave this bound as 7 milliseconds, even with some networking-layer protocol integration. This has definitely improved in the 10 years since Spanner's publication, but it is expensive to do full-stack integration to drive the clock uncertainty bound really low.

The protocol cost of using synchronized clocks is that Spanner has to wait out the maximum clock uncertainty bound before committing a transaction (a sketch of this commit wait is given at the end of this section). This has to be a really pessimistic, worst-case bound. And even then, the guarantee is essentially probabilistic: due to faults, it may be possible to violate the maximum clock uncertainty bound and leak a causal reverse.

Another way of implementing this is to have a logically-centralized ordering service, realized via event dependency graph tracking or via a quorum-based timestamping service. In either case, you get a timestamp from the ordering service when starting a transaction and check back in with the service when committing, so it can keep track of dependencies. This inevitably introduces coordination (hence latency) across all hosts managing different shards, even when the keys are not semantically related. I find the clock-bounding approach preferable to this one: network latency does not improve much with hardware improvements, but tighter clock synchronization does, as we have been witnessing.

Calvin is similar in gist to the second approach of logically-centralized ordering, but it comes with a deterministic-execution twist. Calvin pre-orders transactions into logs in batches, and then deterministically executes each log in the same order on the replicas. Deterministic execution helps eliminate the need for distributed commit protocols. The downsides are the loss of conversational transaction execution as in SQL, and high latency due to the batch-based processing. As the price of determinism, Calvin can only do non-conversational transactions: transactions that must perform reads in order to determine their full read/write sets are not natively supported, since Calvin's deterministic locking protocol requires advance knowledge of all transactions' read/write sets before execution can begin.
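The commit-wait rule that makes bounded clock uncertainty sufficient can be written in a few lines. Below is a minimal sketch assuming a TrueTime-style interface that returns an interval [earliest, latest] guaranteed to contain true time; tt_now and EPSILON are illustrative stand-ins, not Spanner's actual API.

```python
import time

EPSILON = 0.007  # assumed maximum clock uncertainty bound (the 7 ms figure from the Spanner paper)

def tt_now():
    """TrueTime-style interval (earliest, latest) that is guaranteed to contain true time.
    Faked here with the local clock widened by the assumed uncertainty bound."""
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def commit(apply_writes):
    """Assign a commit timestamp, then wait out the uncertainty before acknowledging."""
    _, latest = tt_now()
    commit_ts = latest                   # a timestamp no earlier than true time
    apply_writes(commit_ts)              # make the transaction's writes durable at commit_ts
    while tt_now()[0] <= commit_ts:      # commit wait: until earliest > commit_ts,
        time.sleep(0.001)                # i.e., commit_ts is definitely in the past everywhere
    return commit_ts                     # only now acknowledge the commit to the client
```

Any transaction invoked after this acknowledgment will pick its timestamp from an interval whose latest bound exceeds commit_ts, so the real-time commit order is preserved; the price is the wait itself, which is why driving the uncertainty bound down matters so much.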
Strict-serializability, but for what purpose?

What application requires strict-serializability? What is the business reason to insist on linearizability across all keys? What is the reason to require precise real-time ordering across updates to any items, as if any update may be semantically related to any other? If strict-serializability is needed, it likely applies to a subset of items. The entire system should not be paying the cost of waiting out maximum clock uncertainty bounds just to provide that for a subset of items. The application developer would know those items, and can add synchronization explicitly via a read, as we discussed in the "managing expectations" section. (Or it would be nice if the system could give strict-serializability on demand for certain operation types or data types.)

Is there a graph-based application that may require strict-serializability across all items? I won't argue against a convenience argument. Maybe the customer likes to pay the price of strict-serializability at the OLTP service so they don't need to be bothered with hassles in analytics. But what is that application in the real world? My intuition is that even graph analytics does not go around taking consistent snapshots using timestamps and evaluating predicates on those snapshots.

Maybe there is a "build it and they will come" argument for providing strict-serializability. But since strict-serializable distributed SQL systems have been around for some time, some applications must have arrived by now, right? Have you seen someone using strict-serializability over a distributed SQL system whose purpose would not be served by serializability? I am asking about real-world, in-production use cases, not conceptual/hypothetical scenarios. If you know one, DM me on Twitter or email me.

We talked about the technical cost of strict-serializability in terms of latency and throughput. There is also a monetary cost associated with it. If you can tell me about your use case, I can tell you whether you really need it, or whether you can avoid it without any inconvenience and save money by going with a weaker isolation level. Win-win, I say.

Appendix: TLA+ model

Here is the TLA+ model for the example I gave above. It uses the client-centric isolation checking module I discussed last time. You can see that the execution I discussed above satisfies serializability but fails to satisfy strict-serializability. At the end of the model, check what happens if I make T2 and T3 slightly overlap! With the timestamps2 assignment, strict-serializability of the execution becomes possible again, because it becomes possible for T3 to update before T2, and for T1 to get serialized for commit in that brief instant. With strict-serializability being this fickle and fragile with respect to ordering requirements, it is hard for applications to take hard dependencies on real-time ordering.

[Figure: TLA+ model]

Labels: cloud computing, consistency, databases, distributed transactions, distSQL

Comments

Tobin Baker said...
Note that you can get strict serializability WRT begin timestamps by using classic timestamp ordering. (Timestamp ordering actually turns out to be useful for non-monotonic commit timestamps, e.g. derived from read/write sets a la TicToc/Sundial, or from loosely synchronized local clocks a la Silo/Cicada/Meerkat.)
August 3, 2022 at 4:43 PM