[HN Gopher] Anna: A key-value store for any scale
___________________________________________________________________
Anna: A key-value store for any scale
Author : ingve
Score : 57 points
Date : 2022-04-29 17:03 UTC (5 hours ago)
(HTM) web link (muratbuffalo.blogspot.com)
(TXT) w3m dump (muratbuffalo.blogspot.com)
| sdfgdfgbsdfg wrote:
| I am very confused by Table I, claiming Redis and HStore have a
| multi-key consistency of "Serializable". What meaning is it
| trying to convey? Serializable is a level of isolation, not of
| consistency, so how are they in the same category?
| staticassertion wrote:
| Are there existing, production-ready KV stores that take
| advantage of CALM properties? I'm currently having to layer CALM
| optimizations on top of other dbs because my _data model_ is
| built on associative and commutative operations, but databases
| seem totally split between "transactional" and "eventually
| consistent", with seemingly no native support for strongly
| eventually consistent operations.
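|
| (For the unfamiliar: "associative and commutative" here means
| merges like set union, where replicas can combine updates in any
| order and still agree. A minimal Python sketch, not Anna's API:)
|
|     # Set-union merge: associative, commutative, and idempotent,
|     # so replicas converge however updates are ordered or replayed.
|     def merge(a: set, b: set) -> set:
|         return a | b
|
|     r1 = merge(merge({"x"}, {"y"}), {"z"})  # replica 1's order
|     r2 = merge({"z"}, merge({"y"}, {"x"}))  # replica 2, reversed
|     assert r1 == r2 == {"x", "y", "z"}      # same state either way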
| noncoml wrote:
| > The codebase (including the lattice library, all the
| consistency levels, the server code, and client proxy code)
| amounts to about 2000 lines of C++
|
| Impressively low
| stingraycharles wrote:
| I was impressed reading that as well, but after looking at the
| code, I wonder if it's really 2000 lines - it seems like more.
|
| Regardless, the code is clearly well-organized, easy to read
| and to the point. I like it.
|
| https://github.com/hydro-project/anna
| junon wrote:
| Appears abandoned?
| staticassertion wrote:
| Most code written for a paper is effectively abandoned before
| or soon after the paper is released.
| rektide wrote:
| It's such a pity development rolled off. Very bold paper.
|
| Other pieces of hydro-project still seem to be actively developed.
|
| https://github.com/hydro-project/anna
| rmbyrro wrote:
| > Following the Dynamo design, Anna applies a CRC32 hash on the
| id to assign the actor to a position on the hash ring and applies
| the same hash function to a key in order to determine the actors
| responsible for storing the key.
|
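| (For concreteness, that placement scheme boils down to something
| like the following Python sketch; the names are hypothetical,
| not Anna's actual code.)
|
|     import zlib
|     from bisect import bisect_right
|
|     actors = ["actor-0", "actor-1", "actor-2"]
|     # Each actor id is hashed to a position on a 32-bit ring.
|     ring = sorted((zlib.crc32(a.encode()), a) for a in actors)
|
|     def owner(key: str) -> str:
|         # Keys are hashed with the same function; the first actor
|         # clockwise from the key's position is responsible for it.
|         h = zlib.crc32(key.encode())
|         i = bisect_right(ring, (h, "")) % len(ring)
|         return ring[i][1]
|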
| I think this choice is what led to the "hot-partition" hell that
| early versions of DynamoDB suffered from, is that correct?
|
| Does anyone know how DynamoDB handles this nowadays? All I know
| is they eliminated the hot-partition issue, at least from the
| user/developer perspective.
| dastbe wrote:
| (this may be out of date)
|
| dynamodb differs from the dynamo architecture quite a bit.
| ddb's partitioning of the hash ring is contiguous and each
| partition is backed by a set of replicas. when a partition gets
| hot for whatever reason the partition is split and each half
| gets its own set of replicas. this historically led to two
| issues
|
| 1. splitting also split RCUs/WCUs proportionally, so if you
| split a bunch and then scaled down, your db would have a very
| low amount of RCUs/WCUs available per partition, leading to
| surprising throttling (see the sketch after this list). if you
| did a load test right before your launch, you would be better
| off trashing the db to avoid this.
|
| 2. responsiveness to hot spotting. dynamodb wasn't as good
| about burst capacity management, and so between when you hit a
| hot spot and when dynamodb reacted, you'd be up a creek.
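|
| (To make issue 1 concrete, with hypothetical numbers:)
|
|     # Provisioned throughput was divided evenly across partitions,
|     # so splitting diluted each partition's share.
|     provisioned_wcu = 1000
|     partitions = 50  # e.g. splits left over from a load test
|     print(provisioned_wcu / partitions)  # 20 WCU each: easy to throttle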
| hintymad wrote:
| I also have a similar question: how could a consistent hash
| ring "infinitely scale"? As the number of nodes increases, the
| chances that some nodes flip-flop their membership will
| increase, and the system will become less reliable. Such
| temporary instability is very annoying for operations (alert
| mitigation in particular), especially when the request rate is
| high.
|
| I think there's a reason that large-scale KV stores use range-
| based partitions, such as FB's Tectonic and Google's Colossus
| (Colossus uses Spanner for metadata, and Spanner uses
| hierarchical range-based partitioning).
| luhn wrote:
| Fundamentally they can do it because partitions are relatively
| small, so you'll have hundreds of partitions from hundreds of
| different tables on a single machine and a handful of hot
| partitions are only a tiny blip. (There is of course a limit to
| how hot a partition can be; they'll cut you off before you
| start consuming a large amount of resources on the host.)
|
| Unfortunately single-tenant systems are more constrained in
| this, because you don't have millions of partitions to spread
| across thousands of machines.
| endisneigh wrote:
| With respect to key-value stores, I'm not sure why FoundationDB
| isn't the clear winner - strongly consistent, scalable, and
| ACID. Sure, it's not the fastest, but it's still pretty fast,
| and for most use cases that seems like all you really need.
| sdfgdfgbsdfg wrote:
| FoundationDB is operationally very complex: there's a multitude
| of processes, no reference architectures for at-scale
| deployments (although those do exist, at least at Snowflake and
| Apple), no docker images with helm charts, and no cloud-managed
| solution.
|
| It's also extremely low-level: at the query layer, it's simply
| an ordered map of (byte[], byte[]). The sibling comment also
| mentions optimistic concurrency control as a built-in primitive
| that can scale poorly under high contention. These issues and
| others like them are left for the query layer to resolve.
| FoundationDB is not really intended to be used directly as a
| general-purpose key/value store, but to provide a distributed
| transaction and storage layer for a query layer to be
| implemented on top of it.
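|
| (A minimal sketch of how bare that surface is, using the
| official Python bindings against a local cluster; the key names
| are made up:)
|
|     import fdb
|     fdb.api_version(630)
|
|     db = fdb.open()  # reads the default cluster file
|
|     @fdb.transactional
|     def set_and_scan(tr):
|         # The entire data model: an ordered map of byte[] -> byte[].
|         tr[b"user/1"] = b"alice"
|         tr[b"user/2"] = b"bob"
|         return [(kv.key, kv.value)
|                 for kv in tr.get_range(b"user/", b"user0")]
|
|     print(set_and_scan(db))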
|
| I think it's interesting to think of FoundationDB as sitting
| one layer above RocksDB, which has quickly become the de facto
| standard for the single-node storage layer. For building highly
| scalable, globally distributed ACID systems, FoundationDB could
| very well be the clear winner like you say, although systems in
| this category have opted to implement their own consensus and
| transaction layers. At least, all the Spanner and Calvin
| derivatives I know of (Cloud Spanner, YugabyteDB, CockroachDB,
| FaunaDB) implement their own. But for use cases where strict
| serializability is not required, there's ample room for
| innovation, and I think this is where things like Anna can
| potentially fit in.
| vvern wrote:
| Here are two reasons that come to mind:
|
| * Operationally, there's no turn-key, hosted solution. Getting
| it set up locally for testing and CI is not trivial. You have
| to really want to use it. This criticism 100% applies to Anna
| too.
|
| * FoundationDB provides serializable transactions (yay!), but
| it doesn't provide the tools to make this work well in the face
| of contention. FoundationDB uses fully optimistic concurrency
| control. Suppose you have a workload where, say, you want users
| to do something as simple as incrementing a counter. In
| FoundationDB, you're going to have a bad time if there are lots
| of concurrent writes. You might instead need to reorganize the
| problem so that each increment just writes a row, and a
| background process reads the older rows and folds them into a
| single increment in a single-threaded way. In Anna, you can
| define your counter data structure inside the system. That's
| cool. More generally though, Anna pushes the problem of
| consistency further into the database, which, while cool, is
| also not super convenient. It does scale. Other databases use
| pessimistic locking to deal with contention. That will be lower
| throughput because the work has to serialize, but under heavy
| contention it's still higher throughput and lower latency than
| a fully optimistic approach.
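|
| (To illustrate why a counter defined inside the system avoids
| the contention problem: a minimal sketch of a grow-only counter
| in the CRDT style Anna's lattices enable; hypothetical code, not
| Anna's API.)
|
|     # G-counter: each replica increments its own slot; merge takes
|     # the per-slot max, so concurrent writers never conflict.
|     def increment(state: dict, replica: str) -> dict:
|         return {**state, replica: state.get(replica, 0) + 1}
|
|     def merge(a: dict, b: dict) -> dict:
|         return {r: max(a.get(r, 0), b.get(r, 0))
|                 for r in a.keys() | b.keys()}
|
|     def value(state: dict) -> int:
|         return sum(state.values())
|
|     r1 = increment(increment({}, "r1"), "r1")  # replica 1: +2
|     r2 = increment({}, "r2")                   # replica 2: +1
|     assert value(merge(r1, r2)) == value(merge(r2, r1)) == 3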
| opendomain wrote:
| From the docs: (Spoiler: Anna can give you only up to causal
| consistency; it cannot provide strong consistency at the key
| level, and nothing stronger than read-committed at the
| multi-key level.)
|
| this is not scalable
| jchook wrote:
| Does this mean Anna favors availability and partition tolerance
| over consistency?
| staticassertion wrote:
| This means it is highly available and eventually consistent,
| but with additional consistency guarantees.
|
| Anna is _strongly eventually consistent_. Meaning that,
| regardless of the order or number of times operations are
| replayed, the answer will always converge to the same value
| given time. For example, imagine you have a default value of
| "False", and a convergent value of "True". You may observe
| "False" after the value has been set to True, but you will
| never observe True and _then_ have to worry about it flipping
| back to False.
|
| Another example might be an integer that only grows. You can
| write to that integer:
|
| 1, 2, 3, 4
|
| In any order, in parallel, or 100x each. But eventually the
| value will be 4 - not 1, 2, or 3. It may be 1, 2, or 3 at any
| time _before_ it is 4, but the system will _eventually_
| converge to 4.
|
| This can be very useful. Let's say you have a query "X < 3".
| This query happens after a write of 4, but you _observe_ 3.
| You know that, given that the value only ever grows, X could
| be 3 or higher, but it _definitely_ isn't _less_ than 3.
|
| So you can answer that query with the stale read. In an
| eventually consistent system without _strong_ eventual
| consistency, an older write could be replayed after the 4 is
| written and make the value go back to 2.
|
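| (A tiny sketch of that grow-only integer as a max lattice, plus
| the monotone threshold check; illustrative Python, not Anna's
| API:)
|
|     from functools import reduce
|
|     # Grow-only integer: merge = max, so replay order and
|     # duplication don't matter; replicas converge to the
|     # largest write.
|     writes = [1, 2, 3, 4]
|     assert reduce(max, reversed(writes)) == reduce(max, writes * 100) == 4
|
|     def definitely_not_less_than(observed: int, bound: int) -> bool:
|         # A stale read can safely answer "X < bound?" with False,
|         # because the real value only ever grows past observed.
|         return observed >= bound
|
|     assert definitely_not_less_than(3, 3)  # stale read of 3
|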
| This has some obvious benefits. You can cache values forever,
| without invalidation logic. If I had read '3' from a cache I
| could answer the query directly. If the read had returned '2'
| I would have to go fetch the latest value to know if it had
| grown since then.
|
| You may be asking, "but what if I need to know the value right
| then?" The answer is that you can put a strongly consistent
| store behind your strongly eventually consistent store. This
| is what's proposed in the CURP paper:
| https://www.usenix.org/conference/nsdi19/presentation/park
| carterschonwald wrote:
| When do you need anything stronger than causal consistency?
| E.g., when I was at JPMorgan, actual customer financial
| transactions in practice just needed causal consistency.
| Anything stronger than that was really about consistent
| snapshot reads for reporting purposes.
| ctvo wrote:
| > this is not scalable
|
| what
| rmbyrro wrote:
| I disagree. Most user-facing apps I've seen in my career don't
| really need strong consistency.
|
| The question is how quickly _eventually_ arrives when
| propagating changes. If it's at the sub-second level, I'd say
| 99% of apps (of those that don't need strong consistency) can
| live with that. Single-digit seconds could still serve many
| purposes.
| treebot wrote:
| Plenty of user-facing apps do need strong consistency. We use
| Cassandra at Ripple, for API access to the XRP ledger, and
| strong consistency is a requirement. Records refer to other
| records, and not being able to reason about when things are
| going to show up in the database would lead to a ton of
| complexity. For instance, I read a transaction, then I want to
| read the account that sent or received that transaction. I need
| to know the record for the account is going to be there in the
| database, and is consistent with the transaction, as opposed to
| stale. Or, a client might be paging through data over several
| calls, where we return a marker to allow the client to resume.
| This wouldn't work if some nodes might not know about the
| marker.
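|
| (For reference, the usual way to get this from Cassandra is
| quorum reads and writes, so that the read and write replica
| sets overlap. A minimal sketch with the DataStax Python driver;
| the keyspace and table are made up:)
|
|     from cassandra.cluster import Cluster
|     from cassandra import ConsistencyLevel
|     from cassandra.query import SimpleStatement
|
|     session = Cluster(["127.0.0.1"]).connect("ledger")
|
|     # With RF=3, a QUORUM write and a QUORUM read overlap on at
|     # least one replica, so the read observes the write.
|     write = SimpleStatement(
|         "INSERT INTO accounts (id, balance) VALUES (%s, %s)",
|         consistency_level=ConsistencyLevel.QUORUM)
|     session.execute(write, ("acct-1", 100))
|
|     read = SimpleStatement(
|         "SELECT balance FROM accounts WHERE id = %s",
|         consistency_level=ConsistencyLevel.QUORUM)
|     row = session.execute(read, ("acct-1",)).one()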
___________________________________________________________________
(page generated 2022-04-29 23:00 UTC)