[HN Gopher] Reed-Solomon error recovery in RAID-6
       ___________________________________________________________________
        
       Reed-Solomon error recovery in RAID-6
        
       Author : signa11
       Score  : 95 points
       Date   : 2021-01-07 03:13 UTC (19 hours ago)
        
 (HTM) web link (anadoxin.org)
 (TXT) w3m dump (anadoxin.org)
        
       | miahi wrote:
       | The talk about striping (the fact that all the disks contain both
       | data and the two levels of parity) is glossed over, but what it
        | means is that there are not 7 types of failure/recovery;
        | there are just two: when one drive fails, and when two drives
        | fail. And the recovery actually covers all cases all the time
        | (cases 1-3 for one disk, cases 4-7 for two disks).
        
       | redis_mlc wrote:
       | I'm a storage engineer.
       | 
        | There's a huge difference between theory and practice.
       | 
       | - rebuilds can timeout
       | 
       | - load can really cause timeouts
       | 
       | - raid controllers can have tested modes that work, and some that
       | don't
       | 
       | - batches of drives can all fail at the same time, esp. during
       | rebuilds
       | 
       | - the nvram might not be portable across controllers.
        
         | mchristen wrote:
         | More than once I've lost a RAID array during a rebuild. That's
         | what backups are for, I suppose.
        
           | sliken wrote:
           | 1st lesson on using a RAID should be that RAID is not a
           | replacement for backups.
        
           | klodolph wrote:
           | This is to be expected. RAID 5 is not safe. It is common for
           | a RAID 5 array to be running in a degraded state during
           | "normal" operation without the operator's knowledge. Then,
           | when one drive fails, rebuilding the array is impossible; you
           | have already lost data.
           | 
           | This is common because there is often no good way to repair a
           | RAID 5 system which is running in degraded state, and there
           | is often no monitoring to respond to array degradation (if
           | you can't fix it, why monitor it?)
           | 
           | In other words, RAID 5 does not protect very well against
           | drive failures. RAID 6 is more durable but inefficient (there
           | are systems which are both more durable and more efficient
           | than RAID 6). RAID is generally optimized for implementation
           | simplicity over other concerns.
        
             | tutfbhuf wrote:
             | > there are systems which are both more durable and more
             | efficient than RAID 6
             | 
             | RAID 10?
        
               | speakeron wrote:
               | RAID 10 is more efficient (if that equates to performant)
               | than RAID 6, but not more durable. In fact it's less.
               | 
                | As an example, a 4-disk array (which has the same
                | capacity in both RAID 6 and RAID 10) has 6 possible
                | 2-disk failures. RAID 6 survives all 6 without losing
                | data, whereas RAID 10 survives only 4 of them: it
                | loses data whenever both halves of a mirror pair
                | fail.
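The failure-mode counting above can be verified by brute force. A short sketch, assuming the RAID 10 array is laid out as two mirrored pairs (0,1) and (2,3):

```python
# Enumerate every 2-disk failure of a 4-disk array and check which
# ones RAID 10 (two mirrored pairs) survives. RAID 6 survives any 2.
from itertools import combinations

disks = [0, 1, 2, 3]
pairs = [{0, 1}, {2, 3}]          # assumed mirror layout

failures = list(combinations(disks, 2))   # all 2-disk failures

def raid10_survives(failed):
    # Data is lost only if both members of one mirror pair fail.
    return not any(set(failed) == p for p in pairs)

survived = [f for f in failures if raid10_survives(f)]
print(len(failures), len(survived))   # 6 total, RAID 10 survives 4
```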
        
               | klodolph wrote:
                | Efficiency is how much overhead you have. RAID 10 has
                | 50% efficiency (50% of your storage is used for
                | data), while RAID 6 with N data disks has N/(N+2)
                | efficiency. Since RAID 10 requires at least 4 drives,
                | RAID 10 is never more efficient than RAID 6 for the
                | same number of drives.
               | 
               | The three main things we can optimize for are storage
               | efficiency, I/O performance, and data durability. For
               | most situations, RAID 6 is not Pareto-optimal. In other
               | words, you can come up with solutions that have better
               | storage efficiency, better I/O performance, or better
               | data durability, without sacrificing anything.
               | 
               | The only reason you'd use RAID 6 is because it's easy to
               | use. It's measurably worse on any other axis. Combine
               | this with various low-quality RAID implementations and
               | RAID is even worse. Software RAID in the Linux kernel is
               | fairly robust but there are many low-quality hardware
               | RAID implementations.
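The storage-efficiency comparison can be checked in a few lines; "efficiency" here is the fraction of raw capacity available for data, as the comment defines it:

```python
# Storage efficiency: fraction of raw capacity available for data.
# RAID 10 is always 1/2; RAID 6 with N+2 total disks holds N disks'
# worth of data, i.e. N/(N+2) of the raw capacity.

def raid6_efficiency(total_disks):
    return (total_disks - 2) / total_disks   # two disks' worth of parity

def raid10_efficiency(total_disks):
    return 0.5                               # everything is mirrored

for n in (4, 6, 8, 12):
    print(n, raid10_efficiency(n), round(raid6_efficiency(n), 3))
```

At 4 disks the two are equal; beyond that, RAID 6's storage efficiency pulls ahead.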
        
               | jandrese wrote:
               | Efficiency is not just drive efficiency but controller
               | time. The 4 disk RAID1_0 array may have the same drive
               | space overhead as the RAID6 version, but it may be much
               | slower due to the controller overhead.
               | 
               | One of the often overlooked aspects of RAID is that the
               | more complexity there is in the controller the higher the
               | risk of losing your data due to
               | hardware/firmware/software error. The worst part is these
               | errors often don't appear until your array is already
               | compromised in some way, so everything is fine until a
               | single disk failure triggers a previously unknown bug
               | that causes the rebuild procedure to clobber all of the
               | data on all of the drives.
               | 
               | I've known some professionals who use RAID1_0 whenever
               | they can get away with it because they don't trust RAID
               | controllers to handle anything more complex.
        
               | Dylan16807 wrote:
                | RAID 6 is pretty darn simple; I'm not sure what
                | controller issues you'd have.
               | 
               | That said, RAID 10 is great for many professional use
               | cases. It's very fast, scores well on reliability, the
               | cost premium isn't too bad, and you need backups anyway.
        
               | klodolph wrote:
               | > Efficiency is not just drive efficiency but controller
               | time.
               | 
               | Sorry, I should be clear. When I say efficiency, I mean
               | "storage efficiency" only. Controller time I count as a
               | part of performance. Perhaps my comment makes more sense
               | with that definition in mind.
        
       | eps wrote:
       | There's also this paper, which is quite good -
       | 
       | https://mirrors.edge.kernel.org/pub/linux/kernel/people/hpa/...
       | 
       | Also there are simpler, xor-based encoding schemes that allow
       | recovering from double failures (e.g. EvenOdd and such), but the
       | problem is that they are all patented.
        
         | [deleted]
        
         | est31 wrote:
         | This E-Mail about increased numbers of parities is also
         | interesting: https://www.mail-archive.com/linux-
         | btrfs@vger.kernel.org/msg...
        
           | huhtenberg wrote:
           | That's the SnapRAID dev!
           | 
            | Also, check out this bit -
            | 
            |     >> Could you perhaps describe how the Cauchy matrix
            |     >> is derived, and under what conditions it would
            |     >> become singular?
            |     > The Cauchy matrix has the mathematical property ...
            |     > ...
            |     > Besides the mathematical proof, I've also inverted
            |     > all the 377,342,351,231 possible submatrices for up
            |     > to 6 parities and 251 data disks, and got an
            |     > experimental confirmation of this.
           | 
           | "Trust, but verify" :)
           | 
           | [0] https://www.mail-archive.com/linux-
           | btrfs@vger.kernel.org/msg...
        
       | alanfranz wrote:
       | Is there any system for error recovery in userspace? Something
       | like, I've got directory DATA, i want to create a recovery file
       | for all the contents in such directory. Ideally I'd like to be
       | able to tune the size of the recovery, larger recovery = recover
       | more errors
        
         | sliken wrote:
          | This isn't a turnkey utility, but it's relatively easy to
          | call from Go. It allows you to set the number of data
          | shards and parity shards, and it's pretty fast. It's at
          | https://github.com/klauspost/reedsolomon
          | 
          | Seems like a pretty common solution; I suspect there are
          | other libraries for popular languages/platforms.
        
         | auxym wrote:
         | We used to use this when creating backups on CD-Rs and later
         | DVDs, before hard drives became cheap...
         | 
         | https://en.wikipedia.org/wiki/Parchive
        
           | bestham wrote:
            | I used a program named Dvdisaster for the same use case.
            | You could use it in two modes: (1) One disc with
            | additional out-of-band Reed-Solomon ECC. Create an ISO
            | that is smaller by a certain percentage than the medium
            | you want to burn it to. Dvdisaster will create a new ISO
            | that uses the full capacity of the medium and fills the
            | remaining space with parity information that protects the
            | whole file system and all user files. (2) Put your parity
            | information on the next disc you burn. A parity file is
            | created for the first disc you burn, and this is later
            | added to the file system of the second disc you burn.
            | Repeat indefinitely. You always have the last yet-to-be-
            | burned parity file on your computer and can thus recover
            | bit errors in every disc you have burned.
        
         | dmayle wrote:
         | What you're looking for is called Parchive. It uses Reed-
         | Solomon coding in order to generate redundant files that can be
         | used to recover the original. Look for PAR2 utilities (the
         | current major version of the format). It has multiple
         | implementations, open source and closed, across platforms and
         | architectures.
         | 
         | https://en.wikipedia.org/wiki/Parchive
        
           | thomasmg wrote:
            | Another option is to use an IBLT (Invertible Bloom Lookup
            | Table). It is easy to implement. I recently wrote a file
           | repair tool that uses it: https://github.com/thomasmueller/ti
           | nyStats/blob/master/src/m... (not production ready)
        
           | brnt wrote:
            | I wrote a GUI for it, to help deal with many files being in
           | various states: https://github.com/brenthuisman/par2deep
        
       | dragontamer wrote:
       | On the one hand, this is a great "brute force" method to learn
       | how one implementation of RAID6 could happen. This blogpost
       | achieves understanding of error-correction in the 2-parity case
       | through brute force alone, and that's actually quite commendable.
       | 
        | On the other hand, it's a very, very bad way to learn Reed Solomon
       | codes. Reed Solomon codes "in general" are a concept that's
       | broader than just 2-parity symbols... and also broader than just
       | "protection vs erasures". Reed Solomon is designed to correct
       | errors as well (a far "harder" computation than simply an
       | erasure).
       | 
       | I've fortunately found a good "ELI5" version of Reed Solomon:
       | https://www.youtube.com/watch?v=xE4jEKx9fTM
       | 
       | That's a very non-technical talk, but that talk gives the full
       | concepts of "Polynomial in Finite Field" that Reed Solomon codes
       | are based off of.
       | 
        | For a proper computer-based Reed Solomon code, you choose the
        | GF(2^8) extension field for 8-bit symbols (Parchive uses 2^16
        | for 16-bit symbols). Extension fields are extremely
        | complicated to understand, however; it will take a few days
        | of abstract algebra study to understand how they work.
       | 
       | But once you know how a finite extension field works, you then
       | extend a polynomial into the extension field, and bam, you have
       | Reed Solomon codes.
       | 
        | That is: a quadratic equation can protect 3 data symbols. All
        | quadratic equations are ax^2 + bx + c, so any 3 points define
        | a quadratic equation. Generate (x,y) coordinates on a graph
        | as (0, data0), (1, data1), (2, data2).
       | 
       | Then, solve for a quadratic equation f(x) = ax^2 + bx + c.
       | 
       | Then, to generate parity, f(3) is your first parity symbol, f(4)
       | is your next parity symbol, f(5) is the parity symbol after that.
       | Etc. etc. Note in a 8-bit finite field, you can only go up to
       | f(255) before "looping back" to f(0).
       | 
       | From there: 2-check symbols will protect against either
       | 2-erasures or 1-error. So Quadratic Equation with 4-check symbols
       | (aka: 4 "other points" of the quadratic equation passed with the
       | data) can protect against 4 erasures... or 2-errors, or
       | 2-erasures + 1 error.
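The construction this comment describes (fit a polynomial through the data points, evaluate it at further x's for parity, re-interpolate from any 3 surviving points) can be sketched in a few lines of Python. The field arithmetic below uses the primitive polynomial 0x11d, a common choice; this is a toy sketch of the erasure case, not a production encoder:

```python
# Toy sketch of the polynomial view of Reed-Solomon over GF(2^8):
# addition is XOR, multiplication is carry-less with reduction by
# the primitive polynomial 0x11d (an assumed, common choice).

def gf_mul(a, b):
    # "Russian peasant" carry-less multiply, reducing by 0x11d.
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return r

def gf_inv(a):
    # a^254 = a^-1, since the multiplicative group has order 255.
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r

def interpolate_eval(points, x):
    # Lagrange interpolation through `points`, evaluated at x.
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = gf_mul(num, x ^ xj)
                den = gf_mul(den, xi ^ xj)
        total ^= gf_mul(yi, gf_mul(num, gf_inv(den)))
    return total

data = [0x42, 0x17, 0xA9]                  # three data symbols
pts = list(enumerate(data))                # (0,d0), (1,d1), (2,d2)
parity = [interpolate_eval(pts, x) for x in (3, 4)]

# Recover d0 from any three surviving symbols, e.g. d1, d2, parity0:
survivors = [(1, data[1]), (2, data[2]), (3, parity[0])]
print(interpolate_eval(survivors, 0) == data[0])   # True
```

Correcting unknown-location errors (rather than known-location erasures) needs more machinery on top of this, which is the "far harder computation" the comment mentions.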
        
         | klodolph wrote:
         | Note that the size of the field only has to be as large as the
         | size of the redundancy set (data + code). So if you are
         | striping across 16 or fewer disks, you can use GF(2^4), but if
         | you are striping across 17 disks you can't use GF(2^4).
         | 
         | Well, if I remember the math correctly.
        
           | dragontamer wrote:
           | > Well, if I remember the math correctly.
           | 
           | You're off by one. There's a "minus 1" because Reed Solomon
           | is over the cyclical multiplication group, not over the
           | cyclical additive group. (TL;DR: The number 0 cannot
           | participate in... some portion of the formulation of the
           | polynomials)
           | 
           | GF(2^3) is over blocksize of 7. GF(2^4) is over blocksize of
           | 15. GF(2^8) is over 255.
           | 
           | ---------
           | 
           | GF(2^8) is the smallest size in practice, because most
           | people's data is organized into 8-bits-at-a-time today. (aka:
           | a byte). But you're mostly right (once you account for the
           | off-by-one error): there's a limit to 255 elements per block
           | in GF(2^8) Reed Solomon.
           | 
           | But GF(2^3) or GF(2^4) are great for hand-exercises, the
           | smaller size makes it far easier to compute calculations by
           | hand and allow for the human to gain understanding. (aka:
           | Octal 2^3 numbers or Hexadecimal 2^4 numbers). GF(2^2) is
           | technically possible... but all the matrix math is over 1x1
            | matrices and not really a good opportunity to learn from.
           | GF(2^1) is literally impossible for Reed Solomon to work in.
           | So GF(2^3) is the smallest practical hand-exercise size
            | (RS(7, 5) over GF(2^3) has a 5x7 generator matrix and a
            | 2x7 check matrix. Small enough to do by hand though a bit
           | tedious).
           | 
           | If you actually wanted to "split symbols up" so that you
           | better detect bit-errors, then LDPC or Turbocodes are
           | straight up superior over Reed Solomon for "small symbol
           | sets". Reed Solomon is best used over "large symbols" (aka:
           | 2^8 or larger).
           | 
           | ----------
           | 
           | Any Reed Solomon code can be punctured (aka: fewer parity
           | symbols), or shortened (aka: fewer data symbols) by simply
           | setting locations of the Reed-Solomon generator and/or check
           | matrix to 0.
           | 
           | As such, if you organize "smaller sets" into 2^8 (aka: 8-bit
            | bytes), it's just very natural to use a punctured + shortened
           | Reed Solomon code instead of trying to get a 2^4 Reed Solomon
           | code to work.
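The "minus 1" in this comment comes from the nonzero elements of GF(2^m) forming a cyclic multiplicative group of order 2^m - 1 (15 for GF(2^4), 255 for GF(2^8)). A quick sketch for GF(2^4), assuming the common primitive polynomial x^4 + x + 1:

```python
# Generate the multiplicative group of GF(2^4) by repeatedly
# multiplying by x, reducing by x^4 + x + 1 (0x13). The cycle
# length is 2^4 - 1 = 15, which is the RS block-length limit.

def gf16_powers():
    elems, a = [], 1
    while True:
        elems.append(a)
        a <<= 1                  # multiply by x
        if a & 0x10:
            a ^= 0x13            # reduce mod x^4 + x + 1
        if a == 1:               # cycled back to the identity
            return elems

powers = gf16_powers()
print(len(powers))                            # 15
print(sorted(powers) == list(range(1, 16)))   # True: every nonzero element
```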
        
             | klodolph wrote:
             | You're right! Off by one.
             | 
             | > GF(2^8) is the smallest size in practice, because most
             | people's data is organized into 8-bits-at-a-time today.
             | 
             | IIRC GF(2^4) is actually used in practice. Long story
             | short, the CPU overhead of rebuilding data is a real
             | concern.
        
               | dragontamer wrote:
               | > IIRC GF(2^4) is actually used in practice. Long story
               | short, the CPU overhead of rebuilding data is a real
               | concern.
               | 
               | Interesting. I don't know about how these systems are
               | implemented in practice, but I'd be surprised if GF(2^8)
               | was "too big".
               | 
               | GF(2^8) was small enough that it was implemented in 1977
               | on the Voyager probe project. Yeah, it took a literal
               | supercomputer to decode that RS-code in the 1980s, but
               | modern computers are basically as fast as 1980s-era
               | supercomputers.
               | 
               | Modern papers are talking about GF(2^32) to GF(2^128)
               | these days.
               | (https://www.ccis.northeastern.edu/home/alina/papers/tos-
               | gf.p...). Most implementations don't need anything near
               | that size, so GF(2^8) is sufficient.
               | 
               | ---------
               | 
               | I mean, I've done GF(2^4) RS codes by hand. That's a
               | very, very, very small system, human calculable with just
               | pen and paper. So you're right that its far easier to
               | calculate. I'm just kind of surprised that anyone would
                | bother with a computer RS code of that minuscule size.
               | 
               | Ex: RS(15, 9) is a 9x15 generator matrix and a 6x15 check
                | matrix over GF(2^4). It's on the large side of hand-doable
               | calculations, but absolutely doable by pen-and-paper (I
               | recommend graph paper). Every symbol is a 2^4 hexadecimal
               | symbol, so a single hand-written character
               | (0123456789ABCDEF) is stored on a single cell in graph
               | paper.
               | 
               | -------
               | 
               | Are you sure those systems are working in GF(16) fields?
                | Not GF(2^16) AKA GF(65536) fields? GF(2^16) is more of
               | the size I'd expect a computer to do, more so than
               | GF(2^4).
        
               | klodolph wrote:
               | Yes, I am sure that it's GF(2^4) and not GF(2^16). The
               | problem is not that GF(2^8) or GF(2^16) are "too big",
               | it's just that GF(2^4) is faster and uses less CPU.
               | 
               | The purpose of using GF(2^128) or other large sizes is
               | usually cryptographic, which has different tradeoffs. For
               | example, Galois counter mode. The linked paper describes
               | secure storage mechanisms.
               | 
               | Also note that you're not just doing one field operation
               | per byte, but several. If I'm remembering correctly,
               | let's say you're using an (11,8) code, then you're doing
               | 24 multiply and 21 addition operations to calculate the
               | encoded message. You might think of various ways to
               | accelerate this which benefit from a smaller field size,
               | especially if you are doing the calculations on commodity
               | CPUs. The only point I'm making here is that the
               | computation consumes some nontrivial amount of CPU
               | resources, and so it can be fruitful to optimize.
               | 
               | The fact that space probes like Voyager use a large field
               | size is just because the cost of resources is different.
               | For space probes, bandwidth is a precious resource. If
               | you spent $250M on a single space probe, you can afford
               | to use a bit more CPU to encode and decode your puny 100
               | kHz downlink. On the other hand, if you are running a
               | data center, the power consumption of CPUs is one of your
               | largest costs.
        
               | dragontamer wrote:
               | > Yes, I am sure that it's GF(2^4) and not GF(2^16). The
               | problem is not that GF(2^8) or GF(2^16) are "too big",
               | it's just that GF(2^4) is faster and uses less CPU.
               | 
               | A multiply in GF(2^8) is just a single multiplication
               | lookup. 256 x 256 == 64kB lookup table. I'm not sure how
               | much faster you can get than a single 8-bit lookup on a
               | table small enough to fit in a typical CPU's cache.
               | 
               | GF(2^8) is so small that you don't even need to do the
               | log-table / antilog-table trick. Just store the entire
               | multiplication table, since the multiplication table fits
               | in L2 cache.
               | 
                | A GF(2^4) multiply would be a single lookup into a
                | 16 x 16, 256-byte (0.25 kB) table. Very, very tiny,
                | but probably no faster than the GF(2^8) lookup.
               | 
               | > Also note that you're not just doing one field
               | operation per byte, but several. If I'm remembering
               | correctly, let's say you're using an (11,8) code, then
               | you're doing 24 multiply and 21 addition operations to
               | calculate the encoded message.
               | 
               | That RS(11,8) code (probably shortened / punctured) could
               | be a GF(2^8) code. The question is: do you want to
               | organize your data in 8-bit bytes in GF(2^8), or 4-bit
               | nibbles in GF(2^4)?
               | 
               | I expect that 8-bit bytes is actually faster for modern
               | computers than 4-bit nibbles. Even if you are using a
               | severely shortened / punctured code, there are strong
               | benefits to sticking with 8-bit bytes on a modern
               | computer.
               | 
               | That's the thing: computers usually have 8-bit bytes as
               | the smallest pragmatic unit to use. GF(2^8) lines up with
               | that perfectly, and is used in a variety of high-
               | performance applications (including AES).
               | 
               | > The fact that space probes like Voyager use a large
               | field size is just because the cost of resources is
               | different. For space probes, bandwidth is a precious
               | resource. If you spent $250M on a single space probe, you
               | can afford to use a bit more CPU to encode and decode
               | your 100 kHz downlink.
               | 
                | The Voyager probe launched in 1977 with an utterly
                | crap CPU by today's standards. Your USB power cable
                | has more CPU power and RAM than the Voyager probe...
                | that tiny USB CPU is used to negotiate the voltage
                | level (5V vs 12V USB power negotiation).
               | 
               | --------
               | 
               | Voyager launched with 32k 18-bit words of RAM and 4MHz
               | for its 6x parallel CPUs (handling different parts of the
               | craft: flight vs imaging unit vs radio unit... etc.
                | etc.). I think you're severely overestimating the
                | strength of the Voyager probe, or the level of
                | technology available in 1977.
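The full-table approach described in this thread is easy to sketch: build exp/log tables once, then expand them into the 64 KB product table so each multiply is a single indexed load (again assuming the common 0x11d reduction polynomial):

```python
# Build GF(2^8) exp/log tables, then a flat 256x256 multiplication
# table (64 KB) where table[(x << 8) | y] == x * y in the field.

exp = [0] * 255
log = [0] * 256
a = 1
for i in range(255):
    exp[i] = a
    log[a] = i
    a <<= 1
    if a & 0x100:
        a ^= 0x11d               # assumed primitive polynomial

def mul_slow(x, y):
    # The log/antilog trick: x * y = exp[(log x + log y) mod 255].
    if x == 0 or y == 0:
        return 0
    return exp[(log[x] + log[y]) % 255]

table = bytes(mul_slow(x, y) for x in range(256) for y in range(256))
print(len(table))                # 65536 bytes = 64 KB
print(table[(2 << 8) | 3])       # 6, since x * (x + 1) = x^2 + x
```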
        
               | klodolph wrote:
               | > A multiply in GF(2^8) is just a single multiplication
               | lookup. 256 x 256 == 64kB lookup table. I'm not sure how
               | much faster you can get than a single 8-bit lookup on a
               | table small enough to fit in a typical CPU's cache.
               | 
               | 64 KB often won't fit in L1 cache anyway.
               | 
               | > A GF(2^4) multiply would be a single multiplication
               | lookup 16 x 16 or 256 byte (0.25kB) lookup table. Very,
               | very tiny, probably no faster than the GF(2^8) lookup.
               | 
               | Or you can do _three_ operations using a single 16 x 16 x
               | 16 x 16 lookup table, and introduce fewer data
               | dependencies into your CPU pipeline.
               | 
               | The operations work on half as many bits, but complete
               | three times as many steps. Sounds like a win to me.
               | 
               | > I expect that 8-bit bytes is actually faster for modern
               | computers than 4-bit nibbles. Even if you are using a
               | severely shortened / punctured code, there are strong
               | benefits to sticking with 8-bit bytes on a modern
               | computer.
               | 
               | You keep saying this. I understand how CPUs are byte-
               | addressed. When you operate on smaller sizes, you have to
               | do packing / unpacking. If your CPU core spends most of
               | its time doing table lookups, then a little bit of
               | packing / unpacking is nearly free, since your ALUs are
               | otherwise fairly idle, assuming you have a couple
               | registers free.
        
               | dragontamer wrote:
               | > Or you can do three operations using a single 16 x 16 x
               | 16 x 16 lookup table, and introduce fewer data
               | dependencies into your CPU pipeline.
               | 
               | Gotcha.
               | 
               | So to summarize: you've seen GF(2^4) calculations
               | performed SIMD style. A singular 16-bit lookup can be 4x
               | GF(2^4) lookups in parallel.
               | 
               | Seems legit to me. A non-obvious implementation, but
               | clearly possible and faster.
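The nibble-parallel trick summarized here might look like the following sketch: one table indexed by a 16-bit value that packs two GF(2^4) operand pairs, so a single lookup yields two products at once. The exact packing is an assumption; the comments above only describe the general idea:

```python
# One 2^16-entry table performing two GF(2^4) multiplies per lookup.
# table[(a << 12) | (b << 8) | (c << 4) | d] packs a*b and c*d.

def gf16_mul(a, b):
    # Bitwise GF(2^4) multiply, reducing by x^4 + x + 1 (0x13).
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x10:
            a ^= 0x13
        b >>= 1
    return r

table = bytes(
    (gf16_mul(i >> 12, (i >> 8) & 0xF) << 4)
    | gf16_mul((i >> 4) & 0xF, i & 0xF)
    for i in range(1 << 16)
)

packed = (3 << 12) | (7 << 8) | (2 << 4) | 9   # wants 3*7 and 2*9
result = table[packed]
print(result >> 4 == gf16_mul(3, 7), result & 0xF == gf16_mul(2, 9))
```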
        
       | layoutIfNeeded wrote:
       | Edit: I was wrong
        
         | bfuclusion wrote:
         | Correct me if I'm wrong, but I think that scheme will protect
         | you from at most one erasure.
         | 
            | Fail D1 and P1: there's no way to recover D1's data,
            | since P2 has no data from D1 in it.
        
           | jakubp wrote:
           | There is no mention of "P1" in the entire article. What did
           | you mean?
        
           | layoutIfNeeded wrote:
           | Actually, you're right!
        
       ___________________________________________________________________
       (page generated 2021-01-07 23:01 UTC)