[HN Gopher] Erasure Coding versus Tail Latency
___________________________________________________________________
Erasure Coding versus Tail Latency
Author : usrme
Score : 53 points
Date : 2024-03-27 08:47 UTC (1 day ago)
(HTM) web link (brooker.co.za)
(TXT) w3m dump (brooker.co.za)
| siscia wrote:
| Nice to see this talked about here and Marc being public about
| it.
|
| AWS is such a big place that even after a bit of tenure you still
| have places to look to find interesting technical approaches, and
| when I was introduced to this scheme for Lambda storage I was
| surprised.
|
| As Marc mentions, it is such a simple and powerful idea, and it is
| definitely not mentioned enough.
| mjb wrote:
| If folks are interested in more details about Lambda's
| container storage scheme, they can check out our ATC'23 paper
| here:
| https://www.usenix.org/conference/atc23/presentation/brooker
| pjdesno wrote:
| Note that the fast_read option on erasure-coded Ceph pools
| uses the same technique. I'm not sure which release it was
| in, but I see commits referencing it from 2015.
| dmw_ng wrote:
| Nice to see this idea written about in detail. I had thought
| about it in the context of terrible-availability bargain-bucket
| storage (iDrive E2), where the cost of (3,2) erasure coding an
| object and distributing each segment to one of 3 regions would
| still be dramatically lower than paying for more expensive and
| more reliable storage.
|
| Say one chunk lives in each of Germany, Ireland, and the US. The
| client races GETs to all 3 regions and cancels the request to the
| slowest to respond (which may also be down). Final client latency
| is equivalent to that of the 2nd-slowest region, with
| substantially better availability due to the ability to tolerate
| any single region being down.
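|
| Roughly, the racing client could look something like this sketch
| (Python; the region endpoints and the reconstruct() helper are
| made-up placeholders, not a real E2 API):
|
|     # Race GETs for the 3 coded fragments and rebuild the object
|     # from the first 2 responses, cancelling the straggler.
|     import asyncio
|     import aiohttp
|
|     REGIONS = [
|         "https://de.example-e2.invalid/bucket/obj.part0",
|         "https://ie.example-e2.invalid/bucket/obj.part1",
|         "https://us.example-e2.invalid/bucket/obj.part2",
|     ]
|
|     def reconstruct(parts):
|         """Placeholder: rebuild the object from any 2 of 3 parts."""
|         raise NotImplementedError
|
|     async def fetch(session, url):
|         async with session.get(url) as resp:
|             resp.raise_for_status()
|             return url, await resp.read()
|
|     async def racing_get():
|         async with aiohttp.ClientSession() as session:
|             pending = {asyncio.create_task(fetch(session, u))
|                        for u in REGIONS}
|             done = []
|             while len(done) < 2 and pending:
|                 finished, pending = await asyncio.wait(
|                     pending, return_when=asyncio.FIRST_COMPLETED)
|                 done.extend(t for t in finished if not t.exception())
|             for t in pending:       # drop the slowest (or dead) region
|                 t.cancel()
|             if len(done) < 2:
|                 raise RuntimeError("fewer than 2 regions answered")
|             return reconstruct([t.result() for t in done[:2]])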
|
| Still wouldn't recommend using E2 for anything important, but ^
| was one potential approach to dealing with its terribleness. It
| still doesn't address the reality that when E2 regions go down,
| it is often for days and reportedly sometimes weeks at a time. So
| reliable writing in this scenario would necessitate some kind of
| queue with capacity for weeks of storage.
|
| There are variants of this scheme where you could potentially
| balance the horribly unreliable storage with some expensive
| reliable storage as part of the same system, but I never got that
| far in thinking about how it would work.
| ddorian43 wrote:
| > Client races GETs to all 3 regions and cancels the request to
| the slowest to respond
|
| Dropbox does this internally in Magic Pocket (see their eng
| blog).
| derefr wrote:
| Having not heard of E2 before (it never seems to come up in the
| comparisons the other object-storage providers do to make
| themselves look good) -- do you know _why_ it has such bad
| availability? "Weeks of downtime" sounds crazy for a business
| focused on storage.
|
| If they aren't _also_ failing at durability, then it wouldn't
| be any of the classical problems associated with running a
| storage cluster. Do they just... not bother with online
| maintenance / upgrades / hardware transitions?
| dmw_ng wrote:
| My best guess is their primary product is a Windows (I think)
| backup client that has a place for smarts allowing them to
| paper over problems, or something along those lines. Feels
| like an "oh Germany is down again, when is Frank back from
| holiday so he can catch a plane to replace the switch?" type
| affair. If you google around the Reddit data hoarder
| communities you'll find plenty of war stories about E2.
| ot wrote:
| It is worth noting that this does not come for free, and it would
| have been nice for the article to mention the trade-off:
| reconstruction is not cheap on CPU if you use something like
| Reed-Solomon.
|
| Usually the codes used for erasure coding are in _systematic_
| form: there are k "preferential" parts out of M that are just
| literal fragments of the original blob, so if you get those you
| can just concatenate them to get the original data. If you get
| any other k-subset, you need to perform expensive reconstruction.
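|
| As a toy illustration with k=2, M=3 (XOR parity rather than
| Reed-Solomon, just to show the systematic fast path versus
| reconstruction):
|
|     # Shards 0 and 1 are literal halves of the blob, shard 2 is
|     # their XOR. Any 2 of 3 shards recover the data, but only the
|     # "preferential" pair (0, 1) is a plain concatenation.
|     def encode(blob: bytes) -> list[bytes]:
|         half = (len(blob) + 1) // 2
|         d0, d1 = blob[:half], blob[half:].ljust(half, b"\0")
|         parity = bytes(a ^ b for a, b in zip(d0, d1))
|         return [d0, d1, parity]
|
|     def decode(shards: dict[int, bytes], orig_len: int) -> bytes:
|         if 0 in shards and 1 in shards:     # systematic fast path
|             return (shards[0] + shards[1])[:orig_len]
|         if 0 in shards:                     # rebuild d1 = d0 ^ parity
|             d1 = bytes(a ^ b for a, b in zip(shards[0], shards[2]))
|             return (shards[0] + d1)[:orig_len]
|         d0 = bytes(a ^ b for a, b in zip(shards[1], shards[2]))
|         return (d0 + shards[1])[:orig_len]
|
| With real Reed-Solomon, the non-systematic path is a matrix solve
| over GF(2^8) rather than a single XOR, which is where the CPU
| cost comes from.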
| mjb wrote:
| It's true that Reed-Solomon isn't free.
|
| The first two codes (N of N+1 and N of N+2) are nearly trivial
| and can be done very fast indeed. On my hardware, the N of N+1
| code (which is an XOR) can be arranged to be nearly as fast as a
| memcpy (which obviously isn't free either). They can also be
| done in a streaming way, which can save the memcpy if you're
| feeding a stream into a parser (e.g. JSON or something) or into
| decryption.
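|
| A sketch of what that streaming path can look like (toy code,
| not the Lambda implementation):
|
|     # As the N+1 coded fragments arrive, XOR each one into a
|     # running accumulator. If exactly one fragment never shows
|     # up, the accumulator *is* that missing fragment, with no
|     # extra pass over the data.
|     class XorReassembler:
|         def __init__(self, n_plus_1: int, frag_len: int):
|             self.expected = set(range(n_plus_1))
|             self.acc = bytearray(frag_len)
|             self.frags = {}
|
|         def feed(self, index: int, frag: bytes):
|             self.frags[index] = frag
|             self.expected.discard(index)
|             for i, b in enumerate(frag):
|                 self.acc[i] ^= b
|
|         def data_fragments(self, n: int) -> list[bytes]:
|             if len(self.expected) > 1:
|                 raise ValueError("more than one fragment missing")
|             if self.expected:
|                 # XOR of everything received equals the one
|                 # missing fragment (parity of all N+1 is zero).
|                 self.frags[self.expected.pop()] = bytes(self.acc)
|             return [self.frags[i] for i in range(n)]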
|
| > Usually the codes used for erasure coding are in systematic
| form: there are k "preferential" parts out of M that are just
| literal fragments of the original blob, so if you get those you
| can just concatenate them to get the original data.
|
| Yeah, that's true. If you're CPU bound, it may be worth waiting
| a little longer for these 'diagonal' components to come back.
| pjdesno wrote:
| There are new Intel instructions (GFNI) which accelerate
| things a lot, as well as various hacks to make it go fast.
| See https://www.reddit.com/r/ceph/comments/17z1w08/but_is_my_cpu...
| for some quick and dirty benchmarks on jerasure, one of the EC
| plugins for Ceph, IIRC not using GFNI. (TLDR: 25 GB/s on a Ryzen
| 7700X)
| dragontamer wrote:
| Reed Solomon is closer to "perfect" but is unnecessary.
|
| IIRC, Turbo Codes and LDPCs are less perfect (they cannot
| offer strict guarantees like Reed-Solomon can), but as simple
| XOR-based operations they are extremely fast to implement.
|
| LDPC has a high probability of fixing errors (near the Reed-
| Solomon level), which is good enough in practice, especially
| since LDPC's simple XOR-based operations are far faster:
| roughly O(n) instead of Reed-Solomon's matrix-based O(n^2)
| algorithm.
|
| The state of the art has moved forward. Reed-Solomon is great
| for proving the practice and providing strict assurances
| (likely better for storage, where you have strict size limits
| and need strong guarantees for MTBF or other such
| statistics). But for a "faster" algorithm (i.e. trying to
| avoid repeated packets in a communication stream like TCP or
| a similar protocol), LDPC and/or Turbo codes are likely a
| better solution.
|
| -----
|
| Reed-Solomon is probably best for "smaller" codes, where the
| matrix is smaller and O(n^2) hasn't gotten out of hand yet.
| But as codes increase in size, the O(n) "less than perfect"
| codes (such as Turbo codes or LDPC codes) become better and
| better ideas.
|
| That being said: I can imagine some crazy GPU / SIMD
| algorithm where we have such cheap compute and so little
| bandwidth that the O(n^2) operation might serve as a better
| basis than the cheap XOR operation. The future of computers
| is going to be more compute and relatively less memory
| bandwidth after all, so the pendulum may swing the other way
| depending on how future machines end up.
| mjb wrote:
| That's true too; this approach isn't limited to Reed-
| Solomon (or other MDS codes). For non-MDS codes the fetch logic
| becomes a little more complicated (you need to wait for a
| subset you can reconstruct from rather than just the first
| k), but that's not a huge increase in complexity.
| nsguy wrote:
| I hadn't heard about LDPCs before. Thanks!
|
| Do they serve the same use case though? With Reed-Solomon
| the idea is to recover from complete loss of a fragment of
| data (erasure coding); isn't LDPC strictly for error
| correction/"noise" (e.g. certain bits flipping but the data
| overall exists)?
| dragontamer wrote:
| I admit that I haven't thought it all the way through,
| but in general, all error-correction codes I'm aware of
| have a simpler erasure-code version available too.
|
| Reed-Solomon is traditionally an error-correction code,
| for example, but it has common implementations as a
| simplified erasure-only code. (E.g. fixing "lost data" is
| far easier than fixing "contradictory data".)
|
| I'm fairly certain that LDPC erasure decoding is as simple
| as asking "is there only one missing erasure in this
| particular check?" and, if so, "missing-data = check-bits
| XOR (other data)", roughly the peeling step sketched
| below.
|
| EDIT: The "hard part" is the exact composition of (other
| data), of which there are many styles and different
| methodologies with tons of different tradeoffs.
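|
| Something like this toy peeling decoder (the check layout
| here is arbitrary rather than a real LDPC construction):
|
|     # Each check stores the XOR of a small subset of data
|     # blocks. Decoding repeatedly finds a check with exactly
|     # one erased member and fills it in.
|     def xor_bytes(a: bytes, b: bytes) -> bytes:
|         return bytes(x ^ y for x, y in zip(a, b))
|
|     def peel_decode(blocks, checks):
|         """blocks: list of data blocks, None where erased.
|            checks: list of (covered indices, XOR of those)."""
|         progress = True
|         while progress and any(b is None for b in blocks):
|             progress = False
|             for idxs, parity in checks:
|                 missing = [i for i in idxs if blocks[i] is None]
|                 if len(missing) == 1:   # one unknown: solvable
|                     acc = parity
|                     for i in idxs:
|                         if blocks[i] is not None:
|                             acc = xor_bytes(acc, blocks[i])
|                     blocks[missing[0]] = acc
|                     progress = True
|         return blocks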
| aero_code wrote:
| I'm interested in using LDPC (or Turbo Codes) in software
| for error correction, but most of the resources I've found
| only cover soft-decision LDPC. When I've found LDPC papers,
| it's hard for me to know how efficient the algorithms are
| and whether it's worth spending time on them. Reed-Solomon
| has more learning resources that are often more
| approachable (plus open source libraries). Do you have more
| information on how to implement LDPC decoding using XOR-
| based operations?
| dragontamer wrote:
| Alas, no. I'm mostly spitballing here, since I know that
| XOR-based codes are obviously much faster than the Reed-
| Solomon matrix-solving approach to erasure codes.
|
| There's a lot of names thrown out for practical LDPC
| erasure codes. Raptor Codes, Tornado Codes, and the like.
| Hopefully those names can give you a good starting point?
|
| EDIT: I also remember reading a paper on an LDPC fountain
| code (e.g. keep sending data + LDPC check bits until the
| other side has got enough to reconstruct the data), as a
| kind of "might as well keep sending data while waiting for
| the ACK" thing, which should cut down on latency.
|
| --------
|
| I'm personally in the "finished reading my book on Reed-
| Solomon codes, figuring out what to study next" phase.
| There are a lot of codes out there, and LDPC is a huge
| class...
|
| Then again, the project at work that benefited from these
| error (erm... erasure) correcting codes is complete, and
| my Reed-Solomon implementation was good enough and doesn't
| really need to be touched anymore. So it's not like I have
| a real reason to study this stuff anymore. It was just a
| little bit of extra data that allowed the protocol to cut
| off some latency and reduce the number of resends in a
| very noisy channel we had. The good ol' "MVP into shelved
| code" situation, lol. Enough code to prove it works, made
| a nice demo that impressed the higher-ups, and then no one
| ended up caring about the idea.
|
| If I were to productize the concept, I'd research these
| more modern, faster codes (like LDPC, Raptor, Tornado,
| etc.) and implement a state-of-the-art erasure-correction
| solution, ya know? But at this point, the project's just
| dead.
|
| But honestly, the blog post's situation (cutting down on
| latency with forward error correction) is seemingly a
| common problem that's solved again and again in our
| industry. But at the same time, there's so much to learn
| in the world of Comp. Sci. that sometimes it's important
| to "be lazy" and "learn it when it's clearly going to be
| useful" (and not learn it to hypothetically improve a dead
| project, lol).
| loeg wrote:
| Yeah. And you get the storage for free if your distributed design
| also uses the erasure-encoded chunks for durability. Facebook's
| Warm Storage infrastructure does something very similar to what
| this article describes.
| sujayakar wrote:
| this is a really cool idea.
|
| one followup I was thinking of is whether this can generalize to
| queries other than key value point lookups. if I'm understanding
| correctly, the article is suggesting to take a key value store,
| and for every `(key, value)` in the system, split `value` into
| fragments that are stored on different shards with some `k` of
| `M` code. then at query time, we can split a query for `key` into
| `k` subqueries that we send to the relevant shards and reassemble
| the query results into `value`.
|
| so, if we were to do the same business for an ordered map with
| range queries, we'd need to find a way to turn a query for
| `interval: [start, end]` into some number of subqueries that we
| could send to the different shards and reassemble into the final
| result. any ideas?
| ddorian43 wrote:
| > so, if we were to do the same business for an ordered map
| with range queries, we'd need to find a way to turn a query for
| `interval: [start, end]` into some number of subqueries that we
| could send to the different shards and reassemble into the
| final result. any ideas?
|
| For DBs that are backed by S3-like storage, the storage layer
| does this for you, but for blocks of, say, 1MB, and not per-KV
| (too much overhead).
|
| Think: you use RocksDB in your DB, and erasure-code the
| SSTables.
| ddorian43 wrote:
| There's https://ydb.tech/, an open-source DB that uses erasure
| coding for replication in a single zone/region.
| eivanov89 wrote:
| In YDB with block-4+2 erasure coding, you need half the disk
| space compared to the mirror-3-dc schema (1.5x the data versus
| 3x). Meanwhile CPU usage is a little bit higher, so in high-
| throughput tests mirror-3-dc wins. Indeed, as mentioned in the
| post, there might be a tail latency win in latency runs, but if
| your task is throughput with reasonable latencies, replication
| might be a better choice.
| pjdesno wrote:
| If you only care about throughput, just fetch the data chunks
| and reads should be the same speed as triple-replicated.
|
| For writing, triple-rep has to write 2x as much data or
| more, so it's going to be slower unless your CPUs are
| horribly slow compared to your drives.
| ddorian43 wrote:
| I expect it to save a lot of CPU by only needing 1/3 of the
| compactions. You might want to do a benchmark on that ;).
| An example is Quickwit (building inverted indexes is very
| expensive).
| jeffbee wrote:
| I do not follow. How is it possible that the latency is lower in
| a 4-of-5 read of a coded stripe compared to a 1-of-4 replicated
| stripe?
| ec109685 wrote:
| Aren't they comparing 4 of 5 with 4 of 4?
| jeffbee wrote:
| If so that doesn't make a great deal of sense because the
| alternative to a coded stripe isn't a unique unreplicated
| stripe, it's a replicated one. So the alternative would be to
| race multiple reads of all known replicas, or to hedge them.
| ec109685 wrote:
| Good point. I guess it's really 2 out of 2 versus 4 out of
| 6.
| benlivengood wrote:
| Full replication is probably always lower latency, but it costs
| N times the data you want to store, whereas a k-of-M erasure
| code costs only M/k (e.g. 3x for 3-way replication versus 1.5x
| for 4-of-6). The cost of replicating write traffic is similar.
| jeffbee wrote:
| Certainly, but there are other tradeoffs. You can write a
| byte to a replicated stripe, but encoded stripes either
| demand full-stripe writes or writing a byte is hilariously
| expensive. It's not clear in the blog if we are talking about
| sealed read-only data or what.
| benlivengood wrote:
| Practically all modern storage has a minimum sector write
| size; now commonly 4096 bytes for HDD and megabytes for
| NAND. Most systems with fancy erasure codes address this by
| gathering multiple writes into single stripes and are
| effectively append-only filesystems with garbage collection
| to support snapshots.
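|
| A minimal sketch of that write gathering (the stripe size and
| the encode/flush hook are arbitrary placeholders):
|
|     # Small writes accumulate in a buffer and are only encoded
|     # and flushed once a full stripe's worth of data is present.
|     STRIPE_SIZE = 4 * 1024 * 1024
|
|     class StripeGatherer:
|         def __init__(self, flush_stripe):
|             self.buf = bytearray()
|             self.flush_stripe = flush_stripe  # encode + write stripe
|
|         def write(self, payload: bytes):
|             self.buf += payload
|             while len(self.buf) >= STRIPE_SIZE:
|                 self.flush_stripe(bytes(self.buf[:STRIPE_SIZE]))
|                 del self.buf[:STRIPE_SIZE]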
|
| A large number of RS erasure codes are over GF(2^8), which
| could allow individual byte updates to a stripe for e.g.
| Optane as storage. Generally, of course, most filesystems
| checksum at a block level and so byte-writes are rarely
| going to be implemented as such.
| vlovich123 wrote:
| Small correction, I think. The 4KiB for HDDs is fairly new,
| and afaik most NAND is also going to be 4KiB, with the
| exception of QLC which last I saw had a 64KiB minimum
| sector write size. If you have a source that NAND flash
| is using MiB minimum sector size, can you point me to
| that link?
| benlivengood wrote:
| The next level of efficiency is using nested erasure codes. The
| outer code can be across regions/zones/machines/disks while the
| inner code is across chunks of a stripe. Chunk unavailability is
| fast to correct with an extra outer chunk and bit rot or
| corruption can be fixed by the inner code without an extra fetch.
| In the fast path only data chunks need to be fetched.
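|
| A toy sketch of that nesting with plain XOR parities at both
| levels (the zone and sub-block counts are arbitrary choices
| for illustration):
|
|     # Outer code: 3 data chunks + 1 XOR parity chunk, one per
|     # zone. Inner code: 4 sub-blocks + 1 XOR parity inside each
|     # chunk, so local corruption is repaired without touching
|     # another zone.
|     def xor_all(parts):
|         acc = bytearray(len(parts[0]))
|         for p in parts:
|             for i, b in enumerate(p):
|                 acc[i] ^= b
|         return bytes(acc)
|
|     def encode_stripe(data: bytes, zones: int = 3, subs: int = 4):
|         chunk_len = len(data) // zones    # assume it divides evenly
|         chunks = [data[i*chunk_len:(i+1)*chunk_len]
|                   for i in range(zones)]
|         chunks.append(xor_all(chunks))    # outer parity chunk
|         placed = []
|         for chunk in chunks:
|             sub_len = len(chunk) // subs
|             blocks = [chunk[i*sub_len:(i+1)*sub_len]
|                       for i in range(subs)]
|             blocks.append(xor_all(blocks))  # inner parity per chunk
|             placed.append(blocks)
|         return placed                     # placed[zone][sub_block]
|
|     def repair_sub_block(chunk_blocks, bad: int):
|         # Inner repair: rebuild one bad sub-block from the rest
|         # of its own chunk.
|         return xor_all([b for i, b in enumerate(chunk_blocks)
|                         if i != bad])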
| dragontamer wrote:
| That's just LDPC but not as extreme.
|
| If you're going to "nest" erasure codes, you might as well make
| them XOR-based (the fastest operation on modern CPUs and/or
| hardware), calculate a randomization scheme that has a very
| high probability (99.9%+) of fixing errors, and get other such
| benefits.
|
| To provide assurances that you have enough LDPC, you then run
| LDPC on your LDPC check bits. Then you run LDPC on your LDPC-
| LDPC check bits. Then you run LDPC on your LDPC-LDPC-LDPC check
| bits, until you've got the target probability / chance of
| fixing errors that you desire.
|
| --------
|
| LDPC's brilliance is that XOR-as-erasure-code is exceptionally
| fast, and this "repeated hierarchy" of error-correction codes
| leads to a high probability (99.9999%+) of successful
| correction of erasures.
___________________________________________________________________
(page generated 2024-03-28 23:01 UTC)