[HN Gopher] Erasure Coding versus Tail Latency
       ___________________________________________________________________
        
       Erasure Coding versus Tail Latency
        
       Author : usrme
       Score  : 53 points
       Date   : 2024-03-27 08:47 UTC (1 days ago)
        
 (HTM) web link (brooker.co.za)
 (TXT) w3m dump (brooker.co.za)
        
       | siscia wrote:
       | Nice to see this talked about here and Marc being public about
       | it.
       | 
        | AWS is such a big place that even after a bit of tenure you
        | still have places to look to find interesting technical
        | approaches, and when I was introduced to this scheme for
        | Lambda storage I was surprised.
       | 
        | As Marc mentions, it is such a simple and powerful idea that
        | is definitely not mentioned enough.
        
         | mjb wrote:
         | If folks are interested in more details about Lambda's
         | container storage scheme, they can check out our ATC'23 paper
         | here:
         | https://www.usenix.org/conference/atc23/presentation/brooker
        
           | pjdesno wrote:
           | Note that the fast_read option on erasure-coded Ceph pools
           | uses the same technique. I'm not sure which release it was
           | in, but I see commits referencing it from 2015.
        
       | dmw_ng wrote:
       | Nice to see this idea written about in detail. I had thought
       | about it in the context of terrible availability bargain bucket
       | storage (iDrive E2), where the cost of (3,2) erasure coding an
       | object and distributing each segment to one of 3 regions would
       | still be dramatically lower than paying for more expensive and
       | more reliable storage.
       | 
        | Say one chunk lives in each of Germany, Ireland, and the US.
        | The client races GETs to all 3 regions and cancels the
        | request to the slowest to respond (which may also be down).
        | Final client latency is equivalent to that of the 2nd-slowest
        | region, with substantially better availability thanks to the
        | ability to tolerate any single region being down.
       | 
        | Still wouldn't recommend using E2 for anything important, but
        | ^ was one potential approach to dealing with its
        | terribleness. It still doesn't address the reality that when
        | E2 regions go down, it is often for days and reportedly
        | sometimes weeks at a time. So reliable writing in this
        | scenario would necessitate some kind of queue with capacity
        | for weeks of storage.
       | 
       | There are variants of this scheme where you could potentially
       | balance the horrible reliability storage with some expensive
       | reliable storage as part of the same system, but I never got that
       | far in thinking about how it would work
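        | 
        | Roughly the read path I had in mind, as a toy sketch (the
        | region names and the get_fragment stub are made up; a real
        | client would issue ranged GETs and hand any 2 fragments to
        | the (3,2) decoder):
        | 
        |   import concurrent.futures as cf
        |   import random, time
        | 
        |   REGIONS = ["eu-de", "eu-ie", "us-or"]   # made-up names
        | 
        |   def get_fragment(region, key):
        |       # stand-in for the real GET; just sleeps a bit
        |       time.sleep(random.uniform(0.02, 0.5))
        |       return region, b"<fragment bytes>"
        | 
        |   def read(key, need=2):
        |       pool = cf.ThreadPoolExecutor(max_workers=3)
        |       futs = [pool.submit(get_fragment, r, key)
        |               for r in REGIONS]
        |       got = []
        |       for fut in cf.as_completed(futs):
        |           got.append(fut.result())
        |           if len(got) == need:    # 2 of 3 is enough
        |               break
        |       # a real client would abort the straggler's HTTP
        |       # request here; this toy just stops waiting for it
        |       pool.shutdown(wait=False, cancel_futures=True)
        |       return got              # hand to the (3,2) decoder
        | 
        |   print(read("some/object"))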
        
         | ddorian43 wrote:
         | > Client races GETs to all 3 regions and cancels the request to
         | the slowest to respond
         | 
         | Dropbox does this internally in magic pocket (see their eng
         | blog)
        
         | derefr wrote:
         | Having not heard of E2 before (it never seems to come up in the
         | comparisons the other object-storage providers do to make
         | themselves look good) -- do you know _why_ it has such bad
         | availability?  "Weeks of downtime" sounds crazy for a business
         | focused on storage.
         | 
          | If they aren't _also_ failing at durability, then it wouldn't
         | be any of the classical problems associated with running a
         | storage cluster. Do they just... not bother with online
         | maintenance / upgrades / hardware transitions?
        
           | dmw_ng wrote:
           | My best guess is their primary product is a Windows (I think)
           | backup client that has a place for smarts allowing them to
           | paper over problems, or something along those lines. Feels
           | like an "oh Germany is down again, when is Frank back from
           | holiday so he can catch a plane to replace the switch?" type
           | affair. If you google around the Reddit data hoarder
           | communities you'll find plenty of war stories about E2
        
       | ot wrote:
       | It is worth noting that this does not come for free, and it would
       | have been nice for the article to mention the trade-off:
       | reconstruction is not cheap on CPU, if you use something like
       | Reed-Solomon.
       | 
       | Usually the codes used for erasure coding are in _systematic_
        | form: there are k "preferential" parts out of M that are just
       | literal fragments of the original blob, so if you get those you
       | can just concatenate them to get the original data. If you get
       | any other k-subset, you need to perform expensive reconstruction.
        
         | mjb wrote:
         | It's true that Reed-Solomon isn't free.
         | 
         | The first two codes (N of N+1 and N of N+2) are nearly trivial
         | and can be done very fast indeed. On my hardware, the N of N+1
          | code (which is an XOR) can be arranged to be nearly as fast
          | as a memcpy (which obviously isn't free either). They can
          | also be done in a streaming way, which can save the memcpy
          | if you're feeding a stream into a parser (e.g. JSON or
          | something) or into decryption.
         | 
         | > Usually the codes used for erasure coding are in systematic
         | form: there are k "preferential" parts out of M that are just
         | literal fragments of the original blob, so if you get those you
         | can just concatenate them to get the original data.
         | 
         | Yeah, that's true. If you're CPU bound, it may be worth waiting
         | a little longer for these 'diagonal' components to come back.
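          | 
          | To make that concrete, here's a toy sketch of the N of N+1
          | case (illustrative Python, not the Lambda code): the parity
          | is just the XOR of the data fragments, so decode is either
          | a straight concatenation or a single XOR pass.
          | 
          |   import functools
          | 
          |   def xor(a, b):
          |       return bytes(x ^ y for x, y in zip(a, b))
          | 
          |   def encode(data, n=4):
          |       # n equal data fragments plus one XOR parity
          |       size = -(-len(data) // n)
          |       data = data.ljust(n * size, b"\0")
          |       frags = [data[i*size:(i+1)*size] for i in range(n)]
          |       return frags + [functools.reduce(xor, frags)]
          | 
          |   def decode(frags, n=4):
          |       # frags: dict {index: fragment}, any n of n+1 pieces
          |       if all(i in frags for i in range(n)):
          |           # systematic fast path: just concatenate
          |           return b"".join(frags[i] for i in range(n))
          |       missing = next(i for i in range(n) if i not in frags)
          |       # one XOR pass over the n pieces we did get
          |       frags[missing] = functools.reduce(xor, frags.values())
          |       return b"".join(frags[i] for i in range(n))
          | 
          |   msg = b"a tail latency friendly payload"
          |   have = {i: f for i, f in enumerate(encode(msg)) if i != 2}
          |   assert decode(have).rstrip(b"\0") == msg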
        
           | pjdesno wrote:
           | There are new Intel instructions (GFNI) which accelerate
           | things a lot, as well as various hacks to make it go fast.
           | See https://www.reddit.com/r/ceph/comments/17z1w08/but_is_my_
           | cpu... for some quick and dirty benchmarks on jerasure, one
           | of the EC plugins for Ceph, IIRC not using GFNI. (TLDR:
           | 25GB/s on a Ryzen 7700X)
        
           | dragontamer wrote:
           | Reed Solomon is closer to "perfect" but is unnecessary.
           | 
            | IIRC, Turbo Codes and LDPCs are less perfect (they cannot
            | offer strict guarantees like Reed-Solomon can), but as
            | simple XOR-based operations they are extremely fast to
            | implement.
            | 
            | LDPC has a high probability of fixing errors (near Reed-
            | Solomon level), which is good enough in practice.
            | Especially since LDPC's simple XOR-based operation is
            | far faster, roughly O(n), compared to Reed-Solomon's
            | matrix-based O(n^2) decoding.
           | 
           | The state of the art has moved forward. Reed Solomon is great
           | for proving the practice and providing strict assurances
           | (likely better for storage where you have strict size limits
           | and need strong guarantees for MTBF or other such
            | statistics). But for a "faster" algorithm (ie: trying to
            | avoid retransmitted packets in a communication stream
            | like TCP or a similar protocol), LDPC and/or Turbo codes
            | are likely a better solution.
           | 
           | -----
           | 
           | Reed Solomon is probably best for "smaller" codes where the
           | matrix is smaller and O(n^2) hasn't gotten out of hand yet.
           | But as codes increase in size, the O(n) "less than perfect"
           | codes (such as Turbo codes or LDPC codes) become better-and-
           | better ideas.
           | 
           | That being said: I can imagine some crazy GPU / SIMD
           | algorithm where we have such cheap compute and low bandwidth
           | where the O(n^2) operation might serve as a better basis than
           | the cheap XOR operation. The future of computers is going to
           | be more compute and less relative memory bandwidth after all,
           | so the pendulum may swing the other way depending on how
           | future machines end up.
        
             | mjb wrote:
             | That's true too, this approach isn't limited to Reed-
             | Solomon (or MDS codes). For non-MDS codes the fetch logic
             | becomes a little more complicated (you need to wait for a
             | subset you can reconstruct from rather than just the first
             | k), but that's not a huge increase in complexity.
        
             | nsguy wrote:
             | I haven't heard about LDPCs before. Thanks!
             | 
             | Do they serve the same use case though? With Reed-Solomon
             | the idea is to recover from complete loss of a fragment of
              | data (erasure coding), isn't LDPC strictly for error
             | correction/"noise" (e.g. certain bits flipping but the data
             | overall exists)?
        
               | dragontamer wrote:
               | I admit that I haven't thought it all the way through,
               | but in general, all error-correction codes I'm aware of
               | have a simpler erasure-code version available too.
               | 
                | Reed Solomon traditionally is an error-correction
                | code, for example. But it has common implementations
                | of its simplified erasure-only variant. (Ex: fixing
                | "lost data" is far easier than fixing "contradictory
                | data".)
                | 
                | I'm fairly certain that LDPC erasure decoding is as
                | simple as asking "is there only one missing fragment
                | in this particular check?" and, if so, "missing data
                | = check bit XOR (other data)".
               | 
               | EDIT: The "hard part" is the exact composition of (other
               | data), of which there's many styles and different
               | methodologies with tons of different tradeoffs.
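                | 
                | Something like this toy "peeling" pass, I think
                | (made-up checks, nothing like a real LDPC
                | construction):
                | 
                |   # each parity byte is the XOR of a few symbols
                |   # (made-up checks, not a real LDPC graph)
                |   CHECKS = [((0, 1), 0), ((1, 2), 1), ((2, 3), 2)]
                | 
                |   def peel(data, parity):
                |       # data: {index: byte, or None if erased}
                |       progress = True
                |       while progress:
                |           progress = False
                |           for idxs, p in CHECKS:
                |               lost = [i for i in idxs
                |                       if data[i] is None]
                |               if len(lost) != 1:
                |                   continue
                |               val = parity[p]
                |               for i in idxs:
                |                   if data[i] is not None:
                |                       val ^= data[i]
                |               data[lost[0]] = val
                |               progress = True
                |       return data
                | 
                |   d = {0: 7, 1: None, 2: None, 3: 9}
                |   par = [7 ^ 5, 5 ^ 3, 3 ^ 9]  # for data 7,5,3,9
                |   print(peel(d, par))          # fills in 5 and 3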
        
             | aero_code wrote:
             | I'm interested in using LDPC (or Turbo Codes) in software
             | for error correction, but most of the resources I've found
             | only cover soft-decision LDPC. When I've found LDPC papers,
             | it's hard for me to know how efficient the algorithms are
             | and whether it's worth spending time on them. Reed-Solomon
             | has more learning resources that are often more
             | approachable (plus open source libraries). Do you have more
             | information on how to implement LDPC decoding using XOR-
             | based operations?
        
               | dragontamer wrote:
               | Alas, no. I'm mostly spitballing here since I know that
                | XOR-based codes are obviously much faster than the
                | Reed-Solomon erasure code / matrix-solving
                | methodology.
               | 
               | There's a lot of names thrown out for practical LDPC
               | erasure codes. Raptor Codes, Tornado Codes, and the like.
               | Hopefully those names can give you a good starting point?
               | 
                | EDIT: I also remember reading a paper on an LDPC
                | Fountain Code (ex: keep sending data + LDPC checkbits
                | until the other side has got enough to reconstruct
                | the data), as a kind of "might as well keep sending
                | data while waiting for the ACK" thing, which should
                | cut down on latency.
               | 
               | --------
               | 
               | I'm personally on the "Finished reading my book on Reed-
               | Solomon codes. Figuring out what to study next" phase.
               | There's a lot of codes out there, and LDPC is a huge
               | class...
               | 
               | Then again, the project at work that I had that benefited
               | from these error (erm... erasure) correcting codes was
                | complete and my Reed-Solomon implementation was good
               | enough and doesn't really need to be touched anymore. So
                | it's not like I have a real reason to study this stuff
               | anymore. Just a little bit of extra data that allowed the
               | protocol to cut off some latency and reduce the number of
               | resends in a very noisy channel we had. The good ol' "MVP
               | into shelved code" situation, lol. Enough code to prove
               | it works, made a nice demo that impressed the higher-ups,
               | and then no one ended up caring for the idea.
               | 
               | If I were to productize the concept, I'd research these
               | more modern, faster codes (like LDPC, Raptor, Tornado,
               | etc. etc.) and implement a state-of-the-art erasure
               | correction solution, ya know? But at this point, the
                | project's just dead.
               | 
               | But honestly, the blog-post's situation (cut down on
               | latency with forward error correction) is seemingly a
               | common problem that's solved again and again in our
               | industry. But at the same time, there's so much to learn
                | in the world of Comp. Sci that sometimes it's
                | important to "be lazy" and "learn it when it's
                | clearly going to be useful" (and not learn it to
                | hypothetically improve a dead project, lol).
        
       | loeg wrote:
       | Yeah. And you get the storage for free if your distributed design
       | also uses the erasure-encoded chunks for durability. Facebook's
       | Warm Storage infrastructure does something very similar to what
       | this article describes.
        
       | sujayakar wrote:
       | this is a really cool idea.
       | 
       | one followup I was thinking of is whether this can generalize to
       | queries other than key value point lookups. if I'm understanding
       | correctly, the article is suggesting to take a key value store,
       | and for every `(key, value)` in the system, split `value` into
       | fragments that are stored on different shards with some `k` of
       | `M` code. then at query time, we can split a query for `key` into
       | `k` subqueries that we send to the relevant shards and reassemble
       | the query results into `value`.
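        | 
        | as a sanity check on my reading of the point-lookup case, a
        | toy sketch (in-memory dicts standing in for shards, a 2-of-3
        | XOR code, all names made up):
        | 
        |   import hashlib
        | 
        |   K = 2                  # data fragments per value
        |   SHARDS = [dict() for _ in range(8)]   # fake servers
        | 
        |   def shard_for(key, i):
        |       h = hashlib.sha256(key.encode()).digest()[0]
        |       return SHARDS[(h + i) % len(SHARDS)]
        | 
        |   def put(key, value):
        |       half = -(-len(value) // K)
        |       value = value.ljust(K * half, b"\0")
        |       frags = [value[:half], value[half:]]
        |       # one parity fragment makes it a 2-of-3 code
        |       frags.append(bytes(a ^ b for a, b in zip(*frags)))
        |       for i, f in enumerate(frags):
        |           shard_for(key, i)[key] = f
        | 
        |   def get(key):
        |       # fan out subqueries; this toy takes the two data
        |       # shards (systematic fast path). a real client would
        |       # take whichever 2 of the 3 answer first and decode.
        |       frags = [shard_for(key, i)[key] for i in range(K)]
        |       return b"".join(frags).rstrip(b"\0")
        | 
        |   put("user:42", b"some value bytes")
        |   print(get("user:42"))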
       | 
       | so, if we were to do the same business for an ordered map with
       | range queries, we'd need to find a way to turn a query for
       | `interval: [start, end]` into some number of subqueries that we
       | could send to the different shards and reassemble into the final
       | result. any ideas?
        
         | ddorian43 wrote:
         | > so, if we were to do the same business for an ordered map
         | with range queries, we'd need to find a way to turn a query for
         | `interval: [start, end]` into some number of subqueries that we
         | could send to the different shards and reassemble into the
         | final result. any ideas?
         | 
          | For DBs that are backed by S3-like storage, the storage
          | does this for you, but for blocks of, say, 1MB, and not
          | per-KV (which would have high overhead).
          | 
          | Think of using RocksDB in your DB and erasure-coding the
          | SSTables.
        
         | ddorian43 wrote:
          | There is https://ydb.tech/ , an open source DB that uses
          | erasure coding for replication in a single zone/region.
        
           | eivanov89 wrote:
            | In YDB with block 4+2 erasure coding, you need half the
            | disk space compared to the mirror-3-dc schema (6 chunks
            | per 4 chunks of data is 1.5x raw space, versus 3x for
            | three full copies). Meanwhile CPU usage is just a little
            | bit higher, so in high-throughput tests mirror-3-dc wins.
            | Indeed, as mentioned in the post, there might be a tail
            | latency win in latency runs, but if your task is
            | throughput with reasonable latencies, replication might
            | be a better choice.
        
             | pjdesno wrote:
              | If you only care about throughput, just fetch the data
              | chunks and reads should be the same speed as triple-
              | replicated.
             | 
             | For writing, triple-rep has to write 2x as much data or
             | more, so it's going to be slower unless your CPUs are
             | horribly slow compared to your drives.
        
             | ddorian43 wrote:
              | I expect it to save a lot of CPU by only needing 1/3 of
              | the compactions. You might want to do a benchmark on
              | that ;).
             | An example is quickwit (building inverted indexes is very
             | expensive).
        
       | jeffbee wrote:
       | I do not follow. How is it possible that the latency is lower in
       | a 4-of-5 read of a coded stripe compared to a 1-of-4 replicated
       | stripe?
        
         | ec109685 wrote:
         | Aren't they comparing 4 of 5 with 4 of 4?
        
           | jeffbee wrote:
           | If so that doesn't make a great deal of sense because the
           | alternative to a coded stripe isn't a unique unreplicated
           | stripe, it's a replicated one. So the alternative would be to
           | race multiple reads of all known replicas, or to hedge them.
        
             | ec109685 wrote:
             | Good point. I guess it's really 2 out of 2 versus 4 out of
             | 6.
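              | 
              | Toy numbers to see why the 4-of-6 side can still win
              | on the tail (assuming independent per-fragment
              | latencies and a made-up latency distribution):
              | 
              |   import random
              | 
              |   def fetch_ms():   # latency of one fragment fetch
              |       # mostly fast, occasionally very slow
              |       return 1000 if random.random() < 0.01 else 10
              | 
              |   def two_of_two():
              |       return max(fetch_ms() for _ in range(2))
              | 
              |   def four_of_six():
              |       # 4th-fastest of 6 racing fragment fetches
              |       return sorted(fetch_ms() for _ in range(6))[3]
              | 
              |   def p99(trial, n=100_000):
              |       xs = sorted(trial() for _ in range(n))
              |       return xs[int(0.99 * n)]
              | 
              |   print("2 of 2 p99:", p99(two_of_two))
              |   print("4 of 6 p99:", p99(four_of_six))
              |   # 4 of 6 only goes slow when 3+ fragments are
              |   # slow at once, so its p99 stays on the fast path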
        
         | benlivengood wrote:
          | Full replication is probably always lower latency, but it
          | costs N times the data you want to store, whereas a
          | k-of-M erasure code costs only M/k. The cost of
          | replicating write traffic is similar.
        
           | jeffbee wrote:
           | Certainly, but there are other tradeoffs. You can write a
           | byte to a replicated stripe, but encoded stripes either
           | demand full-stripe writes or writing a byte is hilariously
           | expensive. It's not clear in the blog if we are talking about
           | sealed read-only data or what.
        
             | benlivengood wrote:
             | Practically all modern storage has a minimum sector write
             | size; now commonly 4096 bytes for HDD and megabytes for
             | NAND. Most systems with fancy erasure codes address this by
             | gathering multiple writes into single stripes and are
             | effectively append-only filesystems with garbage collection
             | to support snapshots.
             | 
             | A large number of RS erasure codes are over GF(2^8) which
             | could allow individual byte updates to a stripe for e.g.
              | Optane as storage. Generally, of course, most filesystems
             | checksum at a block level and so byte-writes are rarely
             | going to be implemented as such.
        
               | vlovich123 wrote:
               | Small correction I think. The 4KiB for HDDs is fairly new
               | and afaik most NAND is going to also be 4KiB with the
               | exception of QLC which last I saw had a 64KiB minimum
               | sector write size. If you have a source that NAND flash
               | is using MiB minimum sector size, can you point me to
               | that link?
        
       | benlivengood wrote:
       | The next level of efficiency is using nested erasure codes. The
       | outer code can be across regions/zones/machines/disks while the
       | inner code is across chunks of a stripe. Chunk unavailability is
       | fast to correct with an extra outer chunk and bit rot or
       | corruption can be fixed by the inner code without an extra fetch.
       | In the fast path only data chunks need to be fetched.
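        | 
        | A toy sketch of the shape of it, with XOR parities at both
        | levels just to keep it short (real systems use wider codes,
        | and checksums to detect the corruption in the first place):
        | 
        |   def xor(blocks):
        |       out = bytearray(len(blocks[0]))
        |       for b in blocks:
        |           for i, v in enumerate(b):
        |               out[i] ^= v
        |       return bytes(out)
        | 
        |   def encode_chunk(chunk, sub=4):
        |       # inner code: sub-blocks plus one local parity
        |       size = -(-len(chunk) // sub)
        |       chunk = chunk.ljust(sub * size, b"\0")
        |       subs = [chunk[i*size:(i+1)*size] for i in range(sub)]
        |       return subs + [xor(subs)]
        | 
        |   def encode_stripe(data, k=4):
        |       # outer code: k data chunks plus one cross-zone parity
        |       size = -(-len(data) // k)
        |       data = data.ljust(k * size, b"\0")
        |       chunks = [data[i*size:(i+1)*size] for i in range(k)]
        |       chunks.append(xor(chunks))
        |       # each chunk (one per zone) gets its own inner parity
        |       return [encode_chunk(c) for c in chunks]
        | 
        |   zones = encode_stripe(b"a stripe spread over zones")
        | 
        |   # bit rot in zone 0, sub-block 1: repaired locally from
        |   # the inner code, no cross-zone fetch needed
        |   z0 = zones[0]
        |   assert xor([s for j, s in enumerate(z0) if j != 1]) == z0[1]
        | 
        |   # zone 2 unavailable: rebuild its chunk from the other
        |   # zones via the outer parity (one extra outer chunk)
        |   rebuilt = xor([b"".join(z[:-1])
        |                  for i, z in enumerate(zones) if i != 2])
        |   assert rebuilt == b"".join(zones[2][:-1])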
        
         | dragontamer wrote:
         | That's just LDPC but not as extreme.
         | 
         | If you're going to "nest" erasure codes, might as well make
         | them XOR-based (fastest operation on modern CPUs and/or
         | hardware), calculate a randomization scheme that has very-very
         | high probability (99.9%+) of fixing errors, and other such
         | benefits.
         | 
         | To provide assurances that you have enough LDPC, you then run
         | LDPC on your LDPC-check bits. Then you run LDPC on your LDPC-
          | LDPC-check bits. Then you run LDPC on your LDPC-LDPC-LDPC-check
         | bits, until you've got the target probability / chance of
         | fixing errors that you desire.
         | 
         | --------
         | 
         | LDPC's brilliance is that XOR-as-erasure-code is exceptionally
         | fast, and this "repeated hierarchy" of error-correction codes
          | leads to a high probability (99.9999%+) of successful
         | correction of erasures.
        
       ___________________________________________________________________
       (page generated 2024-03-28 23:01 UTC)