[HN Gopher] Reed-Solomon error recovery in RAID-6
___________________________________________________________________
Reed-Solomon error recovery in RAID-6
Author : signa11
Score : 95 points
Date : 2021-01-07 03:13 UTC (19 hours ago)
(HTM) web link (anadoxin.org)
(TXT) w3m dump (anadoxin.org)
| miahi wrote:
| The article glosses over striping (the fact that all the
| disks contain both data and the two levels of parity), but
| what striping means is that there are not 7 types of
| failure/recovery, there are just two: one drive fails, or
| two drives fail. And a real recovery routine has to handle
| all the cases at once (cases 1-3 for one disk, cases 4-7 for
| two disks).
| redis_mlc wrote:
| I'm a storage engineer.
|
| There's a huge difference between theory and practice.
|
| - rebuilds can time out
|
| - load can really cause timeouts
|
| - RAID controllers can have well-tested modes that work, and
| some that don't
|
| - batches of drives can all fail at the same time, especially
| during rebuilds
|
| - the NVRAM might not be portable across controllers.
| mchristen wrote:
| More than once I've lost a RAID array during a rebuild. That's
| what backups are for, I suppose.
| sliken wrote:
| The first lesson of using RAID should be that RAID is not a
| replacement for backups.
| klodolph wrote:
| This is to be expected. RAID 5 is not safe. It is common for
| a RAID 5 array to be running in a degraded state during
| "normal" operation without the operator's knowledge. Then,
| when another drive fails, rebuilding the array is
| impossible; you have already lost data.
|
| This is common because there is often no good way to repair
| a RAID 5 array that is running in a degraded state, and
| there is often no monitoring to respond to array degradation
| (if you can't fix it, why monitor it?).
|
| In other words, RAID 5 does not protect very well against
| drive failures. RAID 6 is more durable but inefficient (there
| are systems which are both more durable and more efficient
| than RAID 6). RAID is generally optimized for implementation
| simplicity over other concerns.
| tutfbhuf wrote:
| > there are systems which are both more durable and more
| efficient than RAID 6
|
| RAID 10?
| speakeron wrote:
| RAID 10 is more efficient (if that equates to performant)
| than RAID 6, but not more durable. In fact it's less.
|
| As an example, a 4-disk array (which has the same
| capacity in both RAID 6 and RAID 10) has 4 2-disk failure
| modes. RAID 6 can handle all of these failure modes
| without losing data, whereas RAID 10 can only handle 2 of
| them.
| klodolph wrote:
| Efficiency is how much overhead you have. RAID 10 has 50%
| efficiency (50% of your storage is used for data), while
| RAID 6 with N data drives has N/(N+2) efficiency. Since RAID
| 10 requires at least 4 drives, and a 4-drive RAID 6 is
| already at 50%, RAID 10 is never more efficient than RAID 6
| for the same number of drives. (With 8 drives, for instance,
| RAID 10 stores 4 drives' worth of data; RAID 6 stores 6.)
|
| The three main things we can optimize for are storage
| efficiency, I/O performance, and data durability. For
| most situations, RAID 6 is not Pareto-optimal. In other
| words, you can come up with solutions that have better
| storage efficiency, better I/O performance, or better
| data durability, without sacrificing anything.
|
| The only reason you'd use RAID 6 is because it's easy to
| use. It's measurably worse on any other axis. Combine
| this with various low-quality RAID implementations and
| RAID is even worse. Software RAID in the Linux kernel is
| fairly robust but there are many low-quality hardware
| RAID implementations.
| jandrese wrote:
| Efficiency is not just drive efficiency but controller
| time. The 4 disk RAID1_0 array may have the same drive
| space overhead as the RAID6 version, but it may be much
| slower due to the controller overhead.
|
| One of the often overlooked aspects of RAID is that the
| more complexity there is in the controller the higher the
| risk of losing your data due to
| hardware/firmware/software error. The worst part is these
| errors often don't appear until your array is already
| compromised in some way, so everything is fine until a
| single disk failure triggers a previously unknown bug
| that causes the rebuild procedure to clobber all of the
| data on all of the drives.
|
| I've known some professionals who use RAID 10 whenever they
| can get away with it because they don't trust RAID
| controllers to handle anything more complex.
| Dylan16807 wrote:
| RAID 6 is pretty darn simple; I'm not sure what controller
| issues you'd have.
|
| That said, RAID 10 is great for many professional use
| cases. It's very fast, scores well on reliability, the
| cost premium isn't too bad, and you need backups anyway.
| klodolph wrote:
| > Efficiency is not just drive efficiency but controller
| time.
|
| Sorry, I should be clear. When I say efficiency, I mean
| "storage efficiency" only. Controller time I count as a
| part of performance. Perhaps my comment makes more sense
| with that definition in mind.
| eps wrote:
| There's also this paper, which is quite good -
|
| https://mirrors.edge.kernel.org/pub/linux/kernel/people/hpa/...
|
| Also, there are simpler, XOR-based encoding schemes that
| allow recovering from double failures (e.g. EVENODD and
| such), but the problem is that they are all patented.
| [deleted]
| est31 wrote:
| This e-mail about supporting higher numbers of parities is also
| interesting: https://www.mail-archive.com/linux-
| btrfs@vger.kernel.org/msg...
| huhtenberg wrote:
| That's the SnapRAID dev!
|
| Also, check out this bit [0]:
|
|   >> Could you perhaps describe how the Cauchy matrix is
|   >> derived, and under what conditions it would become
|   >> singular?
|
|   > The Cauchy matrix has the mathematical property ...
|   > ...
|   > Besides the mathematical proof, I've also inverted all
|   > the 377,342,351,231 possible submatrices for up to 6
|   > parities and 251 data disks, and got an experimental
|   > confirmation of this.
|
| "Trust, but verify" :)
|
| [0] https://www.mail-archive.com/linux-
| btrfs@vger.kernel.org/msg...
| alanfranz wrote:
| Is there any system for error recovery in userspace?
| Something like: I've got a directory DATA, and I want to
| create a recovery file for all of its contents. Ideally I'd
| like to be able to tune the size of the recovery file;
| larger recovery file = more errors recoverable.
| sliken wrote:
| This isn't a turnkey utility, but it's relatively easy to
| call from Go. It lets you set the number of data shards and
| parity shards, and it's pretty fast:
| https://github.com/klauspost/reedsolomon
|
| This seems like a pretty common approach; I suspect there
| are similar libraries for other popular languages/platforms.
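|
| A minimal sketch of how the library is used, based on its
| README (the shard counts and payload here are just
| illustrative):
|
|     package main
|
|     import (
|         "fmt"
|
|         "github.com/klauspost/reedsolomon"
|     )
|
|     func main() {
|         // 4 data shards + 2 parity shards: any 2 of the 6
|         // shards can be lost without losing data.
|         enc, err := reedsolomon.New(4, 2)
|         if err != nil {
|             panic(err)
|         }
|
|         // Split allocates 6 equally-sized shards (4 data +
|         // 2 empty parity); Encode fills in the parity.
|         shards, err := enc.Split([]byte("bytes from DATA..."))
|         if err != nil {
|             panic(err)
|         }
|         if err := enc.Encode(shards); err != nil {
|             panic(err)
|         }
|
|         // Simulate losing a data shard and a parity shard.
|         shards[1], shards[5] = nil, nil
|
|         // Reconstruct rebuilds the missing shards in place.
|         if err := enc.Reconstruct(shards); err != nil {
|             panic(err)
|         }
|         ok, _ := enc.Verify(shards)
|         fmt.Println("all shards consistent again:", ok)
|     }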
| auxym wrote:
| We used to use this when creating backups on CD-Rs and later
| DVDs, before hard drives became cheap...
|
| https://en.wikipedia.org/wiki/Parchive
| bestham wrote:
| I used a program named dvdisaster for the same use case. You
| could use it in two modes. (1) One disc with additional
| out-of-band Reed-Solomon ECC: create an ISO that is smaller
| by a certain percentage than the medium you want to burn it
| to, and dvdisaster will create a new ISO that uses the full
| capacity of the medium, filling the remaining space with
| parity information that protects the whole file system and
| all user files. (2) Put your parity information on the next
| disc you burn: a parity file is created for the first disc
| you burn, and this is later added to the file system of the
| second disc you burn. Repeat indefinitely. You always have
| the last yet-to-be-burned parity file on your computer and
| can thus recover bit errors on every disc you have burned.
| dmayle wrote:
| What you're looking for is called Parchive. It uses Reed-
| Solomon coding in order to generate redundant files that can be
| used to recover the original. Look for PAR2 utilities (the
| current major version of the format). It has multiple
| implementations, open source and closed, across platforms and
| architectures.
|
| https://en.wikipedia.org/wiki/Parchive
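|
| For a sense of the workflow with par2cmdline (the -r flag
| sets the redundancy percentage, which is exactly the "tune
| the size of the recovery" knob): "par2 create -r10
| recovery.par2 DATA/*" writes the parity files, and "par2
| repair recovery.par2" later verifies the protected files and
| repairs any that are damaged or missing.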
| thomasmg wrote:
| Another option is to use an IBLT (Invertible Bloom Lookup
| Table); it is easy to implement. I recently wrote a file
| repair tool that uses it: https://github.com/thomasmueller/ti
| nyStats/blob/master/src/m... (not production ready)
| brnt wrote:
| I wrote a GUI for it, to help deal with many files being in
| various states: https://github.com/brenthuisman/par2deep
| dragontamer wrote:
| On the one hand, this is a great "brute force" way to learn
| how one implementation of RAID 6 might work. This blog post
| reaches an understanding of error correction in the 2-parity
| case through brute force alone, and that's actually quite
| commendable.
|
| On the other hand, it's a very, very bad way to learn Reed
| Solomon codes. Reed Solomon codes "in general" are a concept
| broader than just 2 parity symbols... and also broader than
| just protection against erasures. Reed Solomon is designed
| to correct errors as well (a far "harder" computation than
| handling a simple erasure).
|
| I've fortunately found a good "ELI5" version of Reed Solomon:
| https://www.youtube.com/watch?v=xE4jEKx9fTM
|
| That's a very non-technical talk, but it gives the full
| "polynomial over a finite field" concept that Reed Solomon
| codes are based on.
|
| For a proper computer-based Reed Solomon code, you choose
| the GF(2^8) extension field for 8-bit symbols (Parchive uses
| GF(2^16) for 16-bit symbols). Extension fields are extremely
| complicated to understand, however; it will take a few days
| of abstract algebra study to understand how they work.
|
| But once you know how a finite extension field works, you
| then define your polynomial over the extension field, and
| bam, you have Reed Solomon codes.
|
| That is: a quadratic equation can protect 3 data symbols.
| All quadratic equations are ax^2 + bx + c, so any 3 points
| define a quadratic equation. Generate (x,y) coordinates on a
| graph as (0, data0), (1, data1), (2, data2).
|
| Then, solve for a quadratic equation f(x) = ax^2 + bx + c.
|
| Then, to generate parity, f(3) is your first parity symbol, f(4)
| is your next parity symbol, f(5) is the parity symbol after that.
| Etc. etc. Note that in an 8-bit finite field, you can only
| go up to f(255) before "looping back" to f(0).
|
| From there: 2 check symbols will protect against either 2
| erasures or 1 error. So a quadratic equation with 4 check
| symbols (aka: 4 "other points" of the quadratic equation
| passed along with the data) can protect against 4
| erasures... or 2 errors, or 2 erasures + 1 error.
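|
| To make that concrete, here is a small sketch in Go of the
| erasure-only case. The 0x11d field polynomial and the
| evaluation points are my choices (conventional, but not the
| only options):
|
|     package main
|
|     import "fmt"
|
|     // gfMul multiplies in GF(2^8) with reducing polynomial
|     // x^8+x^4+x^3+x^2+1 (0x11d), shift-and-add style.
|     func gfMul(a, b byte) byte {
|         var p byte
|         for b > 0 {
|             if b&1 == 1 {
|                 p ^= a
|             }
|             hi := a & 0x80
|             a <<= 1
|             if hi != 0 {
|                 a ^= 0x1d // fold the overflowed x^8 term
|             }
|             b >>= 1
|         }
|         return p
|     }
|
|     // gfInv computes a^-1 as a^254 (slow, fine for a demo).
|     func gfInv(a byte) byte {
|         r := byte(1)
|         for i := 0; i < 254; i++ {
|             r = gfMul(r, a)
|         }
|         return r
|     }
|
|     // evalAt interpolates the polynomial through the points
|     // (xs[i], ys[i]) in Lagrange form and evaluates it at x.
|     // In GF(2^n), addition and subtraction are both XOR.
|     func evalAt(xs, ys []byte, x byte) byte {
|         var acc byte
|         for i := range xs {
|             num, den := byte(1), byte(1)
|             for j := range xs {
|                 if j != i {
|                     num = gfMul(num, x^xs[j])
|                     den = gfMul(den, xs[i]^xs[j])
|                 }
|             }
|             acc ^= gfMul(ys[i], gfMul(num, gfInv(den)))
|         }
|         return acc
|     }
|
|     func main() {
|         // Three data symbols sit at x = 0, 1, 2.
|         data := []byte{0x42, 0x13, 0x37}
|         xs := []byte{0, 1, 2}
|         p3 := evalAt(xs, data, 3) // parity symbol f(3)
|         p4 := evalAt(xs, data, 4) // parity symbol f(4)
|
|         // Erase data[1]: any 3 surviving points pin down
|         // the same quadratic, so re-interpolate it and
|         // read off f(1).
|         got := evalAt([]byte{0, 2, 3},
|             []byte{data[0], data[2], p3}, 1)
|         fmt.Printf("parity=%02x,%02x recovered=%02x want=%02x\n",
|             p3, p4, got, data[1])
|     }
|
| Correcting errors at unknown locations takes more machinery
| (syndromes, Berlekamp-Massey, etc.); the sketch above only
| handles known erasures.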
| klodolph wrote:
| Note that the size of the field only has to be as large as the
| size of the redundancy set (data + code). So if you are
| striping across 16 or fewer disks, you can use GF(2^4), but if
| you are striping across 17 disks you can't use GF(2^4).
|
| Well, if I remember the math correctly.
| dragontamer wrote:
| > Well, if I remember the math correctly.
|
| You're off by one. There's a "minus 1" because Reed Solomon
| works over the cyclic multiplicative group, not the cyclic
| additive group. (TL;DR: the number 0 cannot participate in
| some portion of the formulation of the polynomials.)
|
| GF(2^3) gives a block size of 7, GF(2^4) a block size of 15,
| and GF(2^8) a block size of 255.
|
| ---------
|
| GF(2^8) is the smallest size in practice, because most
| people's data is organized into 8-bits-at-a-time today
| (aka: a byte). But you're mostly right (once you account for
| the off-by-one error): there's a limit of 255 elements per
| block in GF(2^8) Reed Solomon.
|
| But GF(2^3) or GF(2^4) are great for hand exercises: the
| smaller size makes it far easier to do the calculations by
| hand and lets a human build understanding (aka: octal 2^3
| digits or hexadecimal 2^4 digits). GF(2^2) is technically
| possible... but all the matrix math is over 1x1 matrices,
| which is not really a good opportunity to learn from.
| GF(2^1) is literally impossible for Reed Solomon to work in.
| So GF(2^3) is the smallest practical hand-exercise size
| (RS(7, 5) over GF(2^3) has a 5x7 generator matrix and a 2x7
| check matrix; small enough to do by hand, though a bit
| tedious).
|
| If you actually wanted to "split symbols up" to better
| detect bit errors, then LDPC or turbo codes are straight-up
| superior to Reed Solomon for "small symbol sets". Reed
| Solomon is best used over "large symbols" (aka: 2^8 or
| larger).
|
| ----------
|
| Any Reed Solomon code can be punctured (aka: fewer parity
| symbols) or shortened (aka: fewer data symbols) by simply
| setting positions of the Reed Solomon generator and/or check
| matrix to 0.
|
| As such, if you organize "smaller sets" into 2^8 (aka: 8-bit
| bytes), it's just very natural to use a punctured +
| shortened Reed Solomon code instead of trying to get a
| GF(2^4) Reed Solomon code to work.
| klodolph wrote:
| You're right! Off by one.
|
| > GF(2^8) is the smallest size in practice, because most
| people's data is organized into 8-bits-at-a-time today.
|
| IIRC GF(2^4) is actually used in practice. Long story
| short, the CPU overhead of rebuilding data is a real
| concern.
| dragontamer wrote:
| > IIRC GF(2^4) is actually used in practice. Long story
| short, the CPU overhead of rebuilding data is a real
| concern.
|
| Interesting. I don't know about how these systems are
| implemented in practice, but I'd be surprised if GF(2^8)
| was "too big".
|
| GF(2^8) was small enough that it was implemented in 1977
| on the Voyager probe project. Yeah, it took a literal
| supercomputer to decode that RS-code in the 1980s, but
| modern computers are basically as fast as 1980s-era
| supercomputers.
|
| Modern papers are talking about GF(2^32) to GF(2^128)
| these days.
| (https://www.ccis.northeastern.edu/home/alina/papers/tos-
| gf.p...). Most implementations don't need anything near
| that size, so GF(2^8) is sufficient.
|
| ---------
|
| I mean, I've done GF(2^4) RS codes by hand. That's a very,
| very, very small system, human-calculable with just pen and
| paper. So you're right that it's far easier to calculate.
| I'm just kind of surprised that anyone would bother with a
| computer RS code of that minuscule size.
|
| Ex: RS(15, 9) is a 9x15 generator matrix and a 6x15 check
| matrix over GF(2^4). It's on the large side of hand-doable
| calculations, but absolutely doable with pen and paper (I
| recommend graph paper). Every symbol is a 2^4 hexadecimal
| symbol, so a single handwritten character
| (0123456789ABCDEF) is stored in a single cell of the graph
| paper.
|
| -------
|
| Are you sure those systems are working in GF(16) fields?
| Not GF(2^16), aka GF(65536), fields? GF(2^16) is more the
| size I'd expect a computer to use, rather than GF(2^4).
| klodolph wrote:
| Yes, I am sure that it's GF(2^4) and not GF(2^16). The
| problem is not that GF(2^8) or GF(2^16) are "too big",
| it's just that GF(2^4) is faster and uses less CPU.
|
| The purpose of using GF(2^128) or other large sizes is
| usually cryptographic, which has different tradeoffs. For
| example, Galois counter mode. The linked paper describes
| secure storage mechanisms.
|
| Also note that you're not just doing one field operation
| per byte, but several. If I'm remembering correctly,
| let's say you're using an (11,8) code, then you're doing
| 24 multiply and 21 addition operations to calculate the
| encoded message. You might think of various ways to
| accelerate this which benefit from a smaller field size,
| especially if you are doing the calculations on commodity
| CPUs. The only point I'm making here is that the
| computation consumes some nontrivial amount of CPU
| resources, and so it can be fruitful to optimize.
|
| The fact that space probes like Voyager use a large field
| size is just because the cost of resources is different.
| For space probes, bandwidth is a precious resource. If
| you spent $250M on a single space probe, you can afford
| to use a bit more CPU to encode and decode your puny 100
| kHz downlink. On the other hand, if you are running a
| data center, the power consumption of CPUs is one of your
| largest costs.
| dragontamer wrote:
| > Yes, I am sure that it's GF(2^4) and not GF(2^16). The
| problem is not that GF(2^8) or GF(2^16) are "too big",
| it's just that GF(2^4) is faster and uses less CPU.
|
| A multiply in GF(2^8) is just a single multiplication
| lookup. 256 x 256 == 64kB lookup table. I'm not sure how
| much faster you can get than a single 8-bit lookup on a
| table small enough to fit in a typical CPU's cache.
|
| GF(2^8) is so small that you don't even need to do the
| log-table / antilog-table trick. Just store the entire
| multiplication table, since the multiplication table fits
| in L2 cache.
|
| A GF(2^4) multiply would be a single lookup into a 16 x 16,
| or 256-byte (0.25 kB), table. Very, very tiny, but probably
| no faster than the GF(2^8) lookup.
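|
| A sketch of both table schemes, assuming the usual 0x11d
| polynomial with generator 0x02:
|
|     package main
|
|     import "fmt"
|
|     var (
|         expT [512]byte      // antilog, doubled to skip a modulo
|         logT [256]byte      // discrete log base 2 (logT[0] unused)
|         mulT [256][256]byte // the full 64 kB product table
|     )
|
|     func init() {
|         x := byte(1)
|         for i := 0; i < 255; i++ {
|             expT[i] = x
|             logT[x] = byte(i)
|             // multiply x by the generator 0x02, mod 0x11d
|             hi := x & 0x80
|             x <<= 1
|             if hi != 0 {
|                 x ^= 0x1d
|             }
|         }
|         for i := 255; i < 510; i++ {
|             expT[i] = expT[i-255]
|         }
|         // Expand log/antilog into the flat table, so that a
|         // multiply becomes one load instead of three.
|         for a := 1; a < 256; a++ {
|             for b := 1; b < 256; b++ {
|                 mulT[a][b] = expT[int(logT[a])+int(logT[b])]
|             }
|         }
|     }
|
|     func main() {
|         a, b := byte(0x57), byte(0x13)
|         viaLog := expT[int(logT[a])+int(logT[b])]
|         fmt.Printf("%02x == %02x\n", mulT[a][b], viaLog)
|     }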
|
| > Also note that you're not just doing one field
| operation per byte, but several. If I'm remembering
| correctly, let's say you're using an (11,8) code, then
| you're doing 24 multiply and 21 addition operations to
| calculate the encoded message.
|
| That RS(11,8) code (probably shortened / punctured) could
| be a GF(2^8) code. The question is: do you want to
| organize your data in 8-bit bytes in GF(2^8), or 4-bit
| nibbles in GF(2^4)?
|
| I expect that 8-bit bytes is actually faster for modern
| computers than 4-bit nibbles. Even if you are using a
| severely shortened / punctured code, there are strong
| benefits to sticking with 8-bit bytes on a modern
| computer.
|
| That's the thing: computers usually have 8-bit bytes as
| the smallest pragmatic unit to use. GF(2^8) lines up with
| that perfectly, and is used in a variety of high-
| performance applications (including AES).
|
| > The fact that space probes like Voyager use a large
| field size is just because the cost of resources is
| different. For space probes, bandwidth is a precious
| resource. If you spent $250M on a single space probe, you
| can afford to use a bit more CPU to encode and decode
| your 100 kHz downlink.
|
| The Voyager probe launched in 1977 with an utterly crap CPU
| by today's standards. Your USB power cable has more CPU
| power and RAM than the Voyager probe... and that tiny USB
| CPU is just there to negotiate the voltage level (5V vs 12V
| USB power negotiation).
|
| --------
|
| Voyager launched with 32k 18-bit words of RAM and 4 MHz
| across its 6 parallel CPUs (handling different parts of the
| craft: flight vs imaging unit vs radio unit... etc. etc.). I
| think you're severely overestimating the strength of the
| Voyager probe, or the level of technology available in 1977.
| klodolph wrote:
| > A multiply in GF(2^8) is just a single multiplication
| lookup. 256 x 256 == 64kB lookup table. I'm not sure how
| much faster you can get than a single 8-bit lookup on a
| table small enough to fit in a typical CPU's cache.
|
| 64 KB often won't fit in L1 cache anyway.
|
| > A GF(2^4) multiply would be a single lookup into a 16 x
| 16, or 256-byte (0.25 kB), table. Very, very tiny, but
| probably no faster than the GF(2^8) lookup.
|
| Or you can do _three_ operations using a single 16 x 16 x
| 16 x 16 lookup table, and introduce fewer data
| dependencies into your CPU pipeline.
|
| The operations work on half as many bits, but complete
| three times as many steps. Sounds like a win to me.
|
| > I expect that 8-bit bytes is actually faster for modern
| computers than 4-bit nibbles. Even if you are using a
| severely shortened / punctured code, there are strong
| benefits to sticking with 8-bit bytes on a modern
| computer.
|
| You keep saying this. I understand how CPUs are byte-
| addressed. When you operate on smaller sizes, you have to
| do packing / unpacking. If your CPU core spends most of
| its time doing table lookups, then a little bit of
| packing / unpacking is nearly free, since your ALUs are
| otherwise fairly idle, assuming you have a couple
| registers free.
| dragontamer wrote:
| > Or you can do three operations using a single 16 x 16 x
| 16 x 16 lookup table, and introduce fewer data
| dependencies into your CPU pipeline.
|
| Gotcha.
|
| So to summarize: you've seen GF(2^4) calculations performed
| SIMD-style. A single 16-bit lookup can serve as 4 GF(2^4)
| lookups in parallel.
|
| Seems legit to me. A non-obvious implementation, but
| clearly possible and faster.
| layoutIfNeeded wrote:
| Edit: I was wrong
| bfuclusion wrote:
| Correct me if I'm wrong, but I think that scheme will
| protect you from at most one erasure.
|
| Fail D1 and P1: there's no way to recover D1's data, since
| P2 has no data from D1 in it.
| jakubp wrote:
| There is no mention of "P1" in the entire article. What did
| you mean?
| layoutIfNeeded wrote:
| Actually, you're right!
___________________________________________________________________
(page generated 2021-01-07 23:01 UTC)