[HN Gopher] Non-ECC memory corrupted my hard drive image [video]
___________________________________________________________________
Non-ECC memory corrupted my hard drive image [video]
Author : zeristor
Score : 82 points
Date : 2022-12-25 11:17 UTC (11 hours ago)
(HTM) web link (www.youtube.com)
(TXT) w3m dump (www.youtube.com)
| tibbydudeza wrote:
| Got a Gen 10 HPE Microserver for my NAS; it runs some AMD dual-core
| SoC, came factory-fitted with ECC memory, and runs Unraid.
|
| Think it mysteriously crashed once or twice in the 4 years I had
| it and the HP diagnostic light came on.
| encryptluks2 wrote:
| I watched the full video. It was long but very informative. The
| humor at times made up for the length and the presenter showed a
| lot of deep knowledge that most people won't have. My biggest
| gripe is that they just didn't try replacing the RAM sticks in
| the first place. I get that they wanted to do a root cause
| analysis, but geez the time and patience they had to do all those
| memory tests. No wonder they did a video about it, cause
| otherwise that lost time would have been painful. I was baffled
| as well that dd and ddrescue work differently in how they utilize
| the RAM. Caught me off guard.
|
| Onto the discussion of ECC RAM. In a perfect world, all memory
| would be ECC... but try finding some high performance 16GB sticks
| of ECC DDR4 RAM like what you'll see on gaming computers. I don't
| even think they make anything comparable in terms of speed and
| definitely not costs. I guess you don't really know that you
| needed ECC until it's too late.
| morelikeborelax wrote:
| > " I guess you don't really know that you needed ECC until
| it's too late."
|
| I spent many years on hardware consultation and was amazed at
| all the times I had to explain that it was just "what if"
| insurance, like any other risk their business was mitigating
| against. Sometimes they'd even decide they needed to save
| costs with non-ECC RAM when the difference was $4 a GB, or
| (during the FB-DIMM era) when there wasn't even an option to
| avoid it.
|
| Never really understood the resistance towards it.
|
| Maybe the lack of evidence before the Google study and people
| thinking RAM manufacturers were trying to rip them off or
| something.
|
| The "never had a problem, so why would I need it" attitude,
| with no way to know if an issue was caused by a bit flip, was
| most baffling.
| simoncion wrote:
| > ...but try finding some high performance 16GB sticks of ECC
| DDR4 RAM like what you'll see on gaming computers.
|
| Here ya go:
|
| https://nemixram.com/16gb-ddr4-3200-pc4-25600-ecc-udimm-2rx8...
|
| It doesn't have pretty lights on it, but it does seem to be in
| the same speed class that gets called "gaming RAM" by a _whole_
| bunch of retailers.
| jeroenhd wrote:
| It's good that with DDR5 consumer memory will get some super
| basic ECC on die, so hopefully the next generation of memory will
| make the problematic sticks more obvious (or prevent damage at
| the very least). ECC won't save you from memory corruption, but
| it'll save your data at least.
|
| Personally, I would've just checksummed the individual failing
| files rather than the disk image and only back up the bad files
| separately. There are all kinds of ways for a disk image to fail
| and I wouldn't spend a second longer on it than absolutely
| necessary. The whole memtest permutation setup also would've been
| too much work for me; I would just declare the motherboard faulty
| when two sticks that otherwise pass the test fail in specific
| configurations. A new motherboard is cheaper than super specific
| RAM sticks.
| tibbydudeza wrote:
| Afaik ECC memory is slower than normal memory, so it does not
| impress the folks who base their purchase decisions on benchmark
| scores rather than utility and best bang for the buck.
| ksec wrote:
| This is especially true for NAS. And why you need BTRFS or
| preferably ZFS. Unfortunately none of the consumer NAS offers
| ZFS, and BTRFS is still not a default option. Neither Synology nor
| Qnap seems to care.
| aborsy wrote:
| Synology offers ECC in 2023 consumer models such as 923.
|
| Still, the experience that Synology btrfs provides is nowhere near
| as good as ZFS (due to a lot of limitations).
| metadat wrote:
| I bought a 6 disk Synology a few months ago and it came with
| ECC by default. I did a cursory web search about this just
| now and ECC support appears to be the norm for 22 (as in the
| year 2022) model revisions and newer (thankfully!).
| layer8 wrote:
| It's because they use AMD CPUs now (already in the 21
| models). The trade-off is that those CPUs have worse
| hardware codec support than the Intel ones they previously
| used, if you want to do video transcoding.
| gmokki wrote:
| I had RAM go bad after running 18T (5 HDDs in raid1) btrfs
| system in closet for years. Btrfs of course noticed it and
| fixed most of them automatically when some of blocks were
| corrupt. But eventually the system failed: the tree that
| contains the checksums for all the other trees corrupted itself
| on both copies of one node. Fixed the HW problem and then had
| to use hex editor to set the checksum manually to correct value
| (I modified the kernel to print the expected value). Now the
| system has been again stable for 3 years.
| moloch-hai wrote:
| We don't have ECC mainly because Intel has long been hostile to
| "consumer" access to ECC.
|
| _Apparently_ this was conceived as a market segmentation scheme:
| people outfitting servers could get ECC when they pay a huge
| premium. They would thereby not be tempted to cheap out and buy
| consumer-grade equipment, otherwise wholly adequate to meet all
| their needs at a radically cheaper price.
|
| That we cannot get laptops or even desk machines with ECC, and so
| have them crash frequently, is seen as a trivial side effect of
| the strategy. If you did not hate Intel enough before, you may
| increase your hatred accordingly. Intel doesn't hate you back;
| they simply care not even a little how you feel.
|
| (Historically, just running Microsoft software was overwhelmingly
| more likely to be the cause of a crash than a memory bit-flip;
| and there were orders of magnitude fewer RAM bits at risk.
| Microsoft succeeded in getting customers to accept and even
| expect frequent crashes; before MS, a program crashing was
| grounds for a refund.)
| helf wrote:
| I see this a lot but I really don't think it was/is that
| simplistic.
|
| It's added complexity and cost for something that rarely would
| benefit most consumers. Now, you can argue that the complexity
| and cost is a nonissue on modern setups and I would probably
| agree.
|
| But Intel has long had desktop grade hardware with ECC support.
| The 440GX chipset supported ECC and I ran a Dell GX1 SFF with
| 768MB of ECC PC100 for yeeeears with a 450 MHz P3 and later
| upgraded to 1.4ghz tualatin-256 via a slotket adapter.
|
| The 440HX /socket 7/ chipset supported ECC. And that's a
| Pentium 1 chipset.
|
| The 440BX/GX and 450NX supported ECC and that's with desktop
| pentium 2 and 3 chips.
|
| The 820/820E/840 supported ECC with desktop celeron and
| pentium2/3 chips
|
| 845/845e/850/850e/860 pentium4 chipsets support ECC
|
| 875/e7205/e7221/e7230 did with desktop pentium 4 and pentium d
| chips
|
| 925/925xe/955x/975x did with desktop pentium 4/pentium d/core 2
|
| It's more sparse now that they moved to the IMC, granted. But
| Intel has long had multiple chipsets per generation with ECC
| support for desktop grade hardware.
| eternityforest wrote:
| I used to be a big fan of Intel, up until the latest chips from
| other companies that seem to have beat them on
| performance/watt. My next laptop will probably be AMD if the
| situation hasn't changed.
| vladvasiliu wrote:
| I'm happy with my AMD based laptop. But I haven't seen any
| that support ECC.
|
| But I did see a Lenovo model, IIRC, that had some kind of
| Xeon and ECC. Not sure what the noise and battery life
| situations on that thing were, though.
| OJFord wrote:
| I realise that's blurrier when it comes to laptops, but
| AIUI it's more a case of whether the motherboard
| supports it than about the AMD chip. i.e. given a desktop
| CPU, as far as I know you can put it in a motherboard that
| either does or does not support ECC RAM.
| eternityforest wrote:
| I'm surprised nobody has made RAM with the ECC logic
| built into the ram itself, that just looks like normal
| ram to the CPU.
| IYasha wrote:
| Having ECC being checked inside the CPU is actually
| useful as data loss may be induced by EMI (and other
| factors) on PCB data lines.
| AdrianB1 wrote:
| It is called DDR5 - it has ECC built in the module
| itself. Making such a module does not make much sense if
| you cannot report the rate of errors, so if it just
| hides that you have a bad RAM stick, there is only so much
| value in having ECC.
| simoncion wrote:
| It is my understanding that the ECC that you're talking
| about only protects data-in-flight between the module and
| whatever is reading or writing the data. It does not
| protect against corruption of data-at-rest, which is what
| is protected with ECC in DDR4 and older.
|
| It's also my understanding that the DDR5 data-in-flight
| ECC is a _mandatory_ feature because the link between the
| memory modules and everything else is so error-prone that
| the system would simply not function without it.
| eternityforest wrote:
| Taking a quick glance at the articles, I think it's the
| opposite, DDR5 protects data at rest only, because they
| want to make the chips so unreliable it can't work
| without it, not the bus.
|
| But in practice, it will probably be more reliable than
| DDR4 without ECC, since now you need 2 cosmic ray flips,
| or 1 plus a manufacturing defect flip, and the defect
| flips will probably be uncommon-ish.
|
| It's too bad data in flight isn't protected without old
| fashioned ECC on top of that, but it will probably be a
| big step up, the same way that flash memory is now very
| reliable even though the actual uncorrected errors are
| probably worse under the hood.
| toast0 wrote:
| The problem with the DDR5 approach is there's no
| reporting mechanism, so while it will reduce the error
| rate of a marginal module, it doesn't let you know so you
| can replace it. In my experience with ECC modules, a
| module with some errors is a lot more likely to get more
| errors than one that's operating with zero errors.
| my123 wrote:
| Ryzen APUs, which include almost all AMD laptops,
| actually have ECC fused off in silicon unless you buy the
| "Pro" variant.
| vladvasiliu wrote:
| Huh. I didn't know that.
|
| My particular laptop does have a "pro" CPU. However, I
| would be surprised to no end to learn that it supports
| ECC. This particular model sports an MBP-level price tag
| [0], but is absurdly cheaply built. Even for "customer
| facing components", that are easy to compare, such as the
| screen (terrible colors) and case (creaks if you look at
| it wrong). HP doesn't offer ECC RAM, not even as an
| upgrade, so I really don't think the additional lines are
| physically present.
|
| ---
|
| [0] I don't remember the specific number, but it was
| within 100 EUR of a 14" M1 MBP with 32 GB RAM and 512 GB
| SSD. That's counting a RAM (8 -> 32) and SSD (256 -> 512)
| upgrade which were made with components bought separately
| (though they were rather high-end).
| vladvasiliu wrote:
| That's right, but seeing how laptops seem to do the bare
| minimum, I would be really surprised to learn that a
| random model, _which doesn't advertise it_, actually
| supports it.
| erk__ wrote:
| > That we cannot get laptops or even desk machines with ECC
|
| The Xeon series of laptop processors does support ECC just at a
| quite large premium.
| gruez wrote:
| > That we cannot get laptops or even desk machines with ECC,
| and so have them crash frequently, is seen as a trivial side
| effect of the strategy
|
| I'm not sure what you mean by "frequently", but my non-ECC
| machines definitely do not crash "frequently".
|
| > before MS, a program crashing was grounds for a refund
|
| Source?
| Brian_K_White wrote:
| The problem with untrustworthy memory (or any other component)
| is not that your system crashes, it's that it _doesn't_.
| gruez wrote:
| I don't doubt that non-ECC hardware experiences some non-
| zero number of bitflips per year. I'm just doubting the
| parent commenter's claim that non-ECC ram is causing
| computers to crash "frequently".
| layer8 wrote:
| And the parent is pointing out that _not_ crashing on bit
| flips is exactly the problem.
| AdrianB1 wrote:
| "frequently" is very subjective or relative in this context.
| 25 years ago I had a crash per hour on almost any regular
| computer, but zero crashes per month on servers with ECC. In
| the past couple of years I think I had a few cases of frozen
| apps, but I don't remember any OS-level problem. At the
| same time, on servers I see from time to time ECC fixing a
| bit, but on the desktop or laptop I have no idea how many
| times corrupted bits went undetected and what is the
| consequence.
| gruez wrote:
| >25 years ago I had a crash per hour on almost any regular
| computer, but zero crashes per month on servers with ECC
|
| If it's crashing once per hour, it's probably unstable
| drivers/software or flaky hardware that needs to be RMAed,
| not random bitflips.
| navjack27 wrote:
| I think something else is wrong if you've had a "crash" per
| hour.
| IYasha wrote:
| For this reason for my first truly made-from-scratch home NAS I
| went AMD64 with ECC UDIMMs. It was some very basic Athlon64,
| but it COULD do ECC. Since then I moved to Opterons and Xeons
| but I still remember that choice.
| Gordonjcp wrote:
| > That we cannot get laptops or even desk machines with ECC,
| and so have them crash frequently, is seen as a trivial side
| effect of the strategy.
|
| How frequently would you say you encounter a crash that you can
| pin down to a lack of ECC memory in your laptop or desktop?
| Filligree wrote:
| You can't, that's the thing, right?
|
| I have a Ryzen desktop with ECC, and it registers about one
| bit-flip per week. I don't know how many of those would
| become crashes, but I'm more worried about the ones that
| wouldn't.
| dale_glass wrote:
| Yup, been there.
|
| Way back I had a Pentium 133 doing firewall duty in a closet. It
| did approximately nothing besides iptables, but of course any
| machine has logs, updates and so on going on.
|
| After running fine for months one day it suddenly died. I
| rebooted it. A few days later it died again. Another reboot. Then
| it died for the last time and failed to boot at all. Examination
| showed the disk was corrupt and couldn't be mounted. Further
| examination showed that one of the memory modules was loose for
| some reason, could be that it was never firmly in and I just
| bumped the box when messing with something else.
|
| Then came the wasted weekend of dealing with the fact that my
| normal internet connection relied on the thing that was now
| completely broken.
|
| And that was the luckiest case I can imagine, when the broken
| machine contains no data of actual value. Since then I'm very
| paranoid, always run memtest on any new RAM I buy overnight, and
| have ECC where it's possible to have it.
| vladvasiliu wrote:
| > Since then I'm very paranoid, always run memtest on any new
| RAM I buy overnight, and have ECC where it's possible to have
| it.
|
| Yeah, I do the same, but I've learned that you have to do it
| regularly.
|
| In one of my desktop machines, the RAM ran fine for like two
| years. Then, all of a sudden, random Firefox segfaults, etc.
|
| Whipped up a memtest ISO, and sure enough, one of the sticks
| was bad.
| dale_glass wrote:
| That's the nice thing about ECC, it acts like an always
| running memory test.
|
| You normally have a scrub time that can be configured in the
| BIOS, which also adds a regular verification of the entire
| RAM at regular intervals, just in case something goes wrong
| in some rarely used part of the memory.
| IYasha wrote:
| Unfortunately, background scrubbing significantly increases
| power consumption and impacts performance as well.
| metadat wrote:
| Do you want (1) a higher rate of stable and correct
| computations to be performed at a slightly higher energy
| cost, or (2) a demonstrably less reliable device at a
| slightly lower energy efficiency?
|
| I'll go for #1 in most cases, as long as the system is to
| be relied upon for anything deemed important.
| IYasha wrote:
| Me too, of course. I'm just highlighting downsides so
| people know what to expect.
| ilyt wrote:
| Weirdly enough I had same case but memory turned out to be
| fine, replacing power supply fixed the issue. I ran test, saw
| memory is bad, replaced sticks, same problem, put the sticks
| back in and decided to just run it (it was gaming PC).
|
| A few months later the power supply outright died (it was ~8 years
| old at that point); replaced it with a good one, no memory errors.
| vladvasiliu wrote:
| In my case it was clearly a bad RAM stick. Took it out, OK.
| Switched them around: errors. Replaced it with a new one,
| back to OK.
|
| In this particular case, a bad PSU would be the end of the
| PC. It's an HP desktop mini. Basically a laptop without a
| screen, powered by an external adaptor that puts out a
| single 12V line. All further conversions are done on the
| motherboard somehow.
| consp wrote:
| Badly socketed ram was one of the reasons my PC started failing
| after being on for a while. When everything was cool it was all
| fine, when the case and everything heated up a bit it failed
| eventually. Re-seating the ram fixed it and ran for quite a
| time without issues. This was in the early Athlon days though.
| lizardactivist wrote:
| I'm curious what the actual manufacturing cost of ECC DRAM is
| compared to regular DRAM. Is it considerably more expensive, or
| just the usual over-charge because it's better?
| jeffbee wrote:
| It's exactly 1/8th more.
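One way to read the 1/8 figure, assuming a conventional ECC DIMM that stores 8 check bits alongside every 64 data bits (72 bits in total): that is how many extra bits a Hamming-style SEC-DED code needs. The sketch below only counts bits. The function name is made up for illustration, and it says nothing about what the extra chip costs to manufacture or sell.

```python
def secded_check_bits(data_bits):
    """Check bits needed by a Hamming SEC-DED code over `data_bits` data bits."""
    r = 0
    # Single-error correction needs 2**r >= data_bits + r + 1,
    # so every bit position (plus "no error") gets a distinct syndrome.
    while 2 ** r < data_bits + r + 1:
        r += 1
    # One extra overall parity bit adds double-error detection.
    return r + 1


data_bits = 64
check_bits = secded_check_bits(data_bits)
overhead = check_bits / data_bits
print(f"{check_bits} check bits per {data_bits} data bits, overhead {overhead:.3f}")
```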
| aortega wrote:
| I have 128 GB of non-ecc memory in my notebook, never detected a
| single error, and has been on 24/7 for more than 4 years.
|
| Unless you live more than 4000 meters above sea level, like to
| compile while flying or live close to an unshielded nuclear
| reactor, you don't need ECC.
|
| And most memory problems you can fix by better cooling, and
| better shielding.
| H8crilA wrote:
| This problem has no ultimate solution. I've seen all components
| flip bits, CPUs, networking cards, RAM, most often you just can't
| know for sure what did it. You can remedy it a bit (like with
| ECC), but ultimately there will always be corruption if you
| process hundreds of petabytes of data. Get used to it, your
| computer executes an instruction with a probability extremely
| close to 1, but not equal to 1.
|
| Deep in the archives of a well known tech company is a very well
| documented case of a bit flip that caused the wrong function to
| be executed in a C++ v-table. The big oof was that this function
| was the equivalent of an SQL "drop table", and just happened to
| be 32 bytes off of a very benign function that did something like
| stat(). Really funny stuff once the crisis is over :)
| dale_glass wrote:
| ECC isn't a terribly complicated technology, and can be used in
| all those cases.
|
| In limited cases, a checksum is good enough. If you checksum
| outgoing data, and verify it on reception, then it being
| corrupted in transit whether on the network card or the cable
| can be detected and transparently compensated for.
|
| Really, we can do much better than to "get used to it".
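The checksum-and-verify idea in the comment above can be sketched roughly like this, assuming CRC-32 as the checksum and a simple resend on mismatch. All function names here are hypothetical, and real network stacks do this in hardware and protocol layers rather than in application code.

```python
import zlib
from typing import Callable, Optional


def frame(payload: bytes) -> bytes:
    """Append a CRC-32 of the payload (4 bytes, big-endian)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def unframe(packet: bytes) -> Optional[bytes]:
    """Return the payload if its CRC-32 matches, otherwise None."""
    payload, received = packet[:-4], int.from_bytes(packet[-4:], "big")
    return payload if zlib.crc32(payload) == received else None


def flip_bit(data: bytes, bit: int) -> bytes:
    """Return a copy of `data` with one bit inverted (simulated corruption)."""
    buf = bytearray(data)
    buf[bit // 8] ^= 1 << (bit % 8)
    return bytes(buf)


def send_with_retry(payload: bytes,
                    channel: Callable[[bytes], bytes],
                    max_tries: int = 3) -> bytes:
    """Send over `channel`, resending until the receiver's check passes."""
    for _ in range(max_tries):
        result = unframe(channel(frame(payload)))
        if result is not None:
            return result
    raise IOError("payload still corrupt after retries")


# Demo: a channel that corrupts only the first transmission.
attempts = []


def flaky_channel(packet: bytes) -> bytes:
    attempts.append(packet)
    return flip_bit(packet, 3) if len(attempts) == 1 else packet


received = send_with_retry(b"hello", flaky_channel)
print(received, "after", len(attempts), "attempts")
```

This only covers corruption between `frame` and `unframe`. If the payload is already wrong in the sender's memory when the checksum is computed, the check passes on bad data.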
| H8crilA wrote:
| You are under the impression that CPUs and other chips always
| perform the same instructions as are written in the code, and
| only RAM can flip bits because DRAM is DRAM :)
|
| It can (and should! whenever possible) be improved, not
| fixed. There's always that pesky gamma that can hit a
| specific transistor, even if it is deep underground. Gamma
| cannot be fully stopped. At certain scales data corruption
| becomes directly measurable. And yes, corruption levels vary
| between pieces of hardware.
| ilyt wrote:
| Sure but once your registers, cache, data bus and address
| bus has ECC you have vastly smaller area that can flip.
|
| You can even _just buy_ (well, chipaggedon aside) ARM cores
| that have 2 chips running in parallel and faulting when the
| result is different
| my123 wrote:
| > You can even just buy (well, chipaggedon aside) ARM
| cores that have 2 chips running in parallel and faulting
| when the result is different
|
| See dual-core lock-step Arm chips (used for automotive).
| dale_glass wrote:
| Of course not. I'm not saying we can have perfection. I'm
| saying that we can do much better, using methods and
| technologies that are very old at this point.
|
| The reason why we don't is laziness and market
| segmentation, mostly.
| sobriquet9 wrote:
| The probability of a bit flip depends on the size of the
| transistor used. RAM tends to pack many small transistors.
| don-code wrote:
| Is this something that's documented publicly? I'd love to read
| more.
| fpoling wrote:
| One can process petabytes without bit flips if one uses proper
| checksums and error correction codes. While that does impose
| overhead, it is not big and, thanks to Shannon's theorem, can be
| made arbitrarily small with sufficiently big blocks.
| water8 wrote:
| [dead]
| IYasha wrote:
| It would be so nice if dd and ddrescue calculated hashes while
| copying.
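A minimal sketch of what hashing while copying could look like, assuming SHA-256 and ordinary file I/O. This is not how dd or ddrescue work, and both function names are invented for illustration.

```python
import hashlib
import os
import tempfile


def copy_with_hash(src_path, dst_path, chunk_size=1 << 20):
    """Copy src to dst; return the SHA-256 hex digest of the bytes read from src."""
    digest = hashlib.sha256()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
            dst.write(chunk)
    return digest.hexdigest()


def file_hash(path, chunk_size=1 << 20):
    """SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file standing in for a disk image.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "disk.img")
    dst = os.path.join(tmp, "copy.img")
    with open(src, "wb") as f:
        f.write(os.urandom(3 * 1024 * 1024 + 17))
    copied = copy_with_hash(src, dst)
    verified = file_hash(dst) == copied
    print("copy verified:", verified)
```

Two caveats: the re-read of the destination may be served from the page cache rather than the disk, and a bit that flips in the buffer before both the hash update and the write sees it will still verify. Comparing against a second, independent read of the source narrows that gap.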
| chunk_waffle wrote:
| For anyone curious (as I was once and looked into it) Dell and
| Lenovo both ship laptops with Xeons and ECC memory though they
| are very expensive.
| thekombustor wrote:
| Yep, you can get it on the higher end trims of the P-series
| Thinkpads.
| TacticalCoder wrote:
| > Why ECC Memory Is So Important
|
| Except that it's not for many use cases. It's great for servers
| but for people on their personal and/or work computer, it's
| simply not that useful.
|
| Seriously: which percentage of developers have ECC on their
| development machine(s)?
|
| As developers we live in a world of SSH, cryptographic hashes,
| checksums everywhere, Git repositories (that is a big one),
| Merkle trees, digital signatures, reproducible builds (which are
| gaining traction), etc.
|
| Heck, I'm torrenting the latest Debian or Devuan .iso image. My
| torrent client is using every known trick under the sun to make
| sure that should anything go wrong, the broken data shall be
| discarded and re-downloaded. Download is done, I dd the image to
| some installation medium. I can then verify its checksum matches
| the official one. A bit flip didn't slip by unnoticed.
|
| All the music I carefully ripped from my audio CDs? They're all
| cross-checked with an online DB of known bit-perfect rips.
| There's an accompanying file containing each song's hash and I
| can verify at anytime that all my files are 100% correct.
|
| But really most of all I live in a world of Git repositories. My
| entire Emacs config is versioned under Git (I know YMMV but I
| like it that way). Some people version under Git their entire
| user dir.
|
| Tell me how my lack of ECC is going to really make life miserable
| here?
|
| I have nothing _against_ ECC... But if I want to upgrade my AMD
| 3700X to a 7700X, apparently I cannot get ECC.
|
| And that's totally fine: I certainly won't discard the 7700X
| because I cannot get ECC for it.
|
| And if _anything_ looks suspicious, running Memtest is the first
| thing you should do.
|
| I've had bad RAM at times. I'm still there.
| craftkiller wrote:
| > Seriously: which percentage of developers have ECC on their
| development machine(s)?
|
| Every single one of my non-mini computers uses ECC ram except
| for my laptop. If someone would release a framework laptop
| motherboard that supports ECC ram (and preferably risc-v) I'd
| finally be able to close the reliability gap. It blows my mind
| that we say "Well sure, we COULD make infallible ram, but that
| would cost a tiny bit extra so instead let's just hope nothing
| bad happens." That's right up there with not wearing a seat-
| belt because I haven't needed one yet.
| digitallyfree wrote:
| I develop on VMs on my server (with ECC and RAIDZ) which I
| access using SSH or VDI protocols. I would love to have ECC
| on all my machines but that isn't feasible until the status
| quo changes, so I stick with the remote approach. To me
| that's an acceptable tradeoff as the non-ECC desktops/laptops
| are just used as dumb terminals while the real work happens
| on the reliable server.
|
| I can't speak for ECC at the moment but ZFS has definitely
| saved me from data corruption that would have been left to
| manifest otherwise.
| Brian_K_White wrote:
| "If someone would release a framework laptop motherboard that
| supports ECC ram..."
|
| This. If I could, I would. The only reason my laptop doesn't
| have ecc is because the manufacturer doesn't offer the option
| in any machines I otherwise want.
|
| That comment was very misguided in trying to suggest that
| there is any valid excuse to tolerate unreliable execution
| hardware. git and ssh and md5sums do _not_ mean that it's ok
| if your very brain can't be trusted to deliver data from one
| part to another within itself, or spit back the same data
| that was put in a cell. Everything else is built _upon_ that!
| leguminous wrote:
| Checksums don't save you from memory corruption. If your data
| gets corrupted in memory, you will just end up checksumming and
| committing bad data. Or your checksum could get corrupted, and
| you commit a checksum that doesn't match your data. Checksums
| are more useful for safeguarding against disk or network
| corruption (although you shouldn't have network corruption
| issues over TLS or SSH).
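The point about checksumming bad data can be shown with a tiny simulation, assuming the bit flip happens in RAM before the hash is computed. The flip is injected by hand and the variable names are illustrative only.

```python
import hashlib

original = bytearray(b"important data " * 4)

# Simulate a single bit flip in memory *before* anything is checksummed.
in_memory = bytearray(original)
in_memory[5] ^= 0x01

# The application now stores the (already corrupt) data and its checksum.
stored_data = bytes(in_memory)
stored_digest = hashlib.sha256(stored_data).hexdigest()

# Any later verification succeeds, because the digest describes the corrupt bytes.
verifies = hashlib.sha256(stored_data).hexdigest() == stored_digest
matches_original = stored_data == bytes(original)

print("checksum verifies:", verifies)
print("data matches what was intended:", matches_original)
```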
|
| Apparently Ryzen 7000 cpus can use ECC. I've heard reports that
| AMD needs to release an AGESA update, though, and ECC DDR5
| memory availability is terrible. I'm hopeful that the situation
| will improve, because I also want to update my desktop. I've
| been using ECC memory since losing a filesystem on a desktop
| when a DIMM went bad.
| dale_glass wrote:
| > But really most of all I live in a world of Git repositories.
| My entire Emacs config is versioned under Git (I know YMMV but
| I like it that way). Some people version under Git their entire
| user dir.
|
| > Tell me how my lack of ECC is going to really make life
| miserable here?
|
| Git will break if RAM is bad just like anything else. That it
| checksums everything won't save you from checking in corrupt
| data, the filesystem itself being corrupt, or some internal git
| structure becoming corrupt. Losing your repo because something
| in it was written wrong is very much a possibility.
|
| Having multiple machines involved helps, but it's not a
| complete fix, because the possibility exists of something
| damaged being transmitted from a broken machine to a good one,
| ensuring there's no good copy anywhere.
|
| There's really nothing software can do to operate correctly
| with bad RAM all of the time. Instructions for the software are
| in RAM. The OS that the software expects to behave right is in
| RAM. Various buffers used for disk access and networking are in
| RAM. An application like git assumes all of that is performing
| correctly, and can't compensate for every possible malfunction
| that could happen.
| everybodyknows wrote:
| How trustworthy is git-fsck for detecting random-bit
| corruption?
| dale_glass wrote:
| Depends on how you look at it.
|
| For actual verification, my understanding is that it's very
| trustworthy unless there's a bug. But by that point
| it's already too late. Okay, you know something is broken,
| but that won't give your good data back.
|
| But that only tells you that Git data is intact and that
| all the hashes match. If git got a corrupted file to start
| with, then correctly hashed it, everything will verify 100%
| and still be broken.
| rhn_mk1 wrote:
| > which percentage of developers have ECC on their development
| machine(s)?
|
| That question won't help you evaluate the demand for ECC simply
| because the supply is strangled. Those who want it have to make
| compromises to get ECC: get a Xeon, or get one of few AMD
| motherboards with matching CPUs and overpriced RAM without a
| guarantee that it will end up working.
| simoncion wrote:
| > ...get one of few AMD motherboards with matching CPUs and
| overpriced RAM...
|
| I mean, if you want a _guarantee_, then sure, get one of
| those certified-for-ECC motherboards.
|
| But, like, as far as I know, going _at least_ as far back as the
| Phenom II (released in 2008) AMD desktop processors have
| always supported ECC RAM. And -as far as I know- ASUS
| motherboards for said processors have always supported
| dropping in ECC RAM (and Linux and memtest and friends have
| always agreed that ECC was enabled and functioning in such a
| system).
|
| Source: Personal experience with Phenom II, Threadripper, and
| Ryzen 5 CPUs and ASUS motherboards, and looking-from-a-
| distance at the rest of the AMD CPUs between the Phenom and
| the Ryzen 5.
| tasubotadas wrote:
| When people complain about these random OS crashes and freezes
| it's usually RAM corruption at fault.
|
| Per 1gb of RAM, you can expect to see 266 bit errors per
| month[1-2] if you are using your PC 16h per day. Multiply that
| by 64GB or 128GB of RAM and it's crazy to think that you won't
| run into any of the stability issues.
|
| [1-2]
| https://static.googleusercontent.com/media/research.google.c...
| [1-2] https://en.wikipedia.org/wiki/ECC_memory#Research
| pflanze wrote:
| > Per 1gb of RAM
|
| The Google study you linked says "25,000 to 70,000 errors per
| billion device hours per Mbit".
|
| Assuming bits, {25000 to 70000}/1e9 * 30 * 16 * 1000 = 12 to
| 33.6 bit errors per month. Assuming bytes, it's 96 to 268
| bit errors per month.
|
| Apparently you've meant bytes, not bits. (I'm a pedant, but
| was also just unsure and interested in the numbers.)
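The arithmetic above can be reproduced as follows, assuming the same inputs used in the comment: 25,000 to 70,000 errors per billion device hours per Mbit, 16 hours of use per day, 30 days per month, and 1 Gbit taken as 1000 Mbit. The constant and function names are made up for this sketch.

```python
HOURS_PER_MONTH = 30 * 16          # 30 days at 16 hours of use per day
MBIT_PER_GBIT = 1000               # "1gb" read as gigabit
MBIT_PER_GBYTE = 8 * MBIT_PER_GBIT # "1gb" read as gigabyte


def errors_per_month(rate_per_billion_hours_per_mbit, mbits):
    """Expected errors per month for `mbits` of memory at the given rate."""
    return rate_per_billion_hours_per_mbit / 1e9 * HOURS_PER_MONTH * mbits


low_rate, high_rate = 25_000, 70_000

per_gbit = (errors_per_month(low_rate, MBIT_PER_GBIT),
            errors_per_month(high_rate, MBIT_PER_GBIT))
per_gbyte = (errors_per_month(low_rate, MBIT_PER_GBYTE),
             errors_per_month(high_rate, MBIT_PER_GBYTE))

print(f"per Gbit:  {per_gbit[0]:.1f} to {per_gbit[1]:.1f} errors/month")
print(f"per Gbyte: {per_gbyte[0]:.1f} to {per_gbyte[1]:.1f} errors/month")
```

As other comments in the thread note, these are fleet-wide averages; errors tend to cluster on a few bad modules rather than spread evenly.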
|
| FWIW, a comment I ran across that confirms toast0's view[1]:
| "It's a bimodal distribution - you either have many errors
| (due to a defect somewhere) or basically zero. If you're on
| the good side of the distribution, with only extremely rare
| errors, then you probably don't need ECC. But without ECC,
| you don't know whether you need ECC!"
|
| [1] https://www.realworldtech.com/forum/?threadid=198497&curpost...
| ubercow13 wrote:
| >266 bit errors per month
|
| That seems like an overestimate. How can memtest ever pass on
| non-ECC RAM if errors are that frequent?
| moloch-hai wrote:
| Because memtest only looks at values it wrote out very
| recently, before they have had a chance to flip.
|
| Memtest is looking for reliable failures, not evanescent
| one-off events.
| willis936 wrote:
| SDRAM is continuously refreshing all cells. How long ago
| data was written doesn't make a big difference (aside
| from the case where you're reading data immediately after
| writing or reading that data).
| moloch-hai wrote:
| This does not, of course, make any sense. Any given
| memory cell will be read once per, say, millisecond, and
| the value written back. Once it has flipped once, the
| wrong value will then be written back, and after that the
| wrong value is read back out and rewritten again,
| indefinitely. Errors are sticky, and accumulate.
| (Flipping back again is negligibly likely.)
|
| With ECC in the refresh path, such an error could be
| corrected and the right value would be overwritten over
| top of the bad one. Then errors would not accumulate, but
| would instead be "scrubbed". Mainframe machines scrub
| their RAM. Disks too.
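To illustrate the scrubbing idea, here is a toy sketch using a Hamming(7,4) code: a stored word picks up a single flipped bit, and a scrub pass finds the bad position from the syndrome and writes the corrected value back. Real ECC memory uses wider SEC-DED codes and the memory controller does this in hardware; the function names here are invented.

```python
def encode(nibble):
    """Encode 4 data bits as a 7-bit Hamming codeword (list of bits, positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 unused so indices match positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # parity over positions 4, 5, 6, 7
    return code[1:]


def syndrome(word):
    """XOR of the positions of all set bits; 0 for a valid codeword."""
    s = 0
    for pos, bit in enumerate(word, start=1):
        if bit:
            s ^= pos
    return s


def scrub(word):
    """Correct a single-bit error in place; return its position (0 if none)."""
    s = syndrome(word)
    if s:
        word[s - 1] ^= 1
    return s


def decode(word):
    """Extract the 4 data bits from positions 3, 5, 6, 7."""
    return word[2] | (word[4] << 1) | (word[5] << 2) | (word[6] << 3)


memory = encode(0b1011)
memory[4] ^= 1                  # simulated bit flip at position 5
read_without_ecc = decode(memory)
fixed_position = scrub(memory)  # scrub pass repairs the stored word
read_after_scrub = decode(memory)
print(bin(read_without_ecc), fixed_position, bin(read_after_scrub))
```

A second flip in the same word before the next scrub would defeat this code, which is why scrubbing periodically, before errors accumulate, matters.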
| ilyt wrote:
| > SDRAM is continuously refreshing all cells
|
| ...so? We're not talking about slow deterioration (which
| is why refresh is needed), we're talking about bit flips
| from cosmic rays, where a cell changes state completely.
| [deleted]
| toast0 wrote:
| I don't think this rate is reasonable to use in this manner.
| I've run thousands of servers, with lots of ECC ram (quite a
| few servers had 768GB of ram, but most were more like 32 or
| 64). The vast majority ran for years with zero reported
| errors. A small handful would have major errors of thousands
| in an hour, but we would replace for hundreds in an hour. A
| couple servers developed a periodic report of one or two
| (correctable) errors per day. If 266 bit errors per month per
| 1Gb was a usable rate, all of our servers would have been
| throwing ECC correctable errors all the time.
|
| But I didn't have time or desire to publish a study on our
| experience, so there are no hard numbers.
| kevingadd wrote:
| I went out of my way for ECC because losing files can mean
| losing days of work. With how much a sweng gets paid it's worth
| the ECC premium if it saves me a few days of work over the life
| of the machine.
|
| Recently I discovered that one of my SSDs was quietly failing
| without setting off any warnings; doing a chkdsk showed that
| some files had already gotten corrupted. One of them was my
| backblaze backup index!
|
| Even though I have automated backups (backblaze + macrium
| backups to a NAS), recovering files from them is non-trivial.
| If I were to lose work to non-ECC ram who knows how long it
| would take me to reconstruct a known-good work environment and
| file set. Imagine if you're working on something huge and hard
| to validate like neural net weights where corruption can occur
| silently and be hard to detect after it's happened?
| newZWhoDis wrote:
| My favorite is when data silently corrupts, and then is
| happily propagated to your off-site recovery :/
| kevingadd wrote:
| Yeah, I don't know when the drive started failing, so in
| practice my backups are probably all screwed too. My only
| option would be to get a spare drive, restore a backup to
| it, then do a block level diff of the current and backup
| volumes and try to figure out whether any of the
| differences are file corruption.
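
A minimal sketch of the block-level diff described above, assuming both volumes are available as image files. The 4096-byte block size, the function names, and the file names `current.img` and `backup.img` are illustrative assumptions, not anything the commenter specified. It only reports which blocks differ; deciding whether a difference is corruption or a legitimate change still needs filesystem-level knowledge.

```python
import os
import tempfile

BLOCK_SIZE = 4096  # assumed block size; match it to the filesystem in practice


def differing_blocks(path_a, path_b, block_size=BLOCK_SIZE):
    """Return the indices of blocks whose contents differ between two images."""
    diffs = []
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        index = 0
        while True:
            a = fa.read(block_size)
            b = fb.read(block_size)
            if not a and not b:
                break  # both images exhausted
            if a != b:
                diffs.append(index)  # also catches a length mismatch
            index += 1
    return diffs


def demo():
    """Build two small fake images that differ by one bit and diff them."""
    with tempfile.TemporaryDirectory() as tmp:
        current = os.path.join(tmp, "current.img")  # hypothetical name
        backup = os.path.join(tmp, "backup.img")    # hypothetical name
        data = bytearray(os.urandom(4 * BLOCK_SIZE))
        with open(backup, "wb") as f:
            f.write(data)
        data[2 * BLOCK_SIZE + 10] ^= 0x04  # flip a single bit inside block 2
        with open(current, "wb") as f:
            f.write(data)
        return differing_blocks(current, backup)


if __name__ == "__main__":
    print("differing blocks:", demo())
```

Mapping a differing block index back to a file depends on the filesystem and is out of scope for this sketch.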
| Filligree wrote:
| Probably I'm saying nothing you haven't thought of, but
| ZFS is great for preventing this. If you also pair it
| with ECC, then you've eliminated most ways to cause
| corruption.
| rwmj wrote:
| But if I open up an editor and write a document then that could
| be corrupted in RAM and then the corrupted data saved to disk.
| The document is likely to be more important than some ripped
| CDs or a git repo that I can download again.
|
| The CPU, RAM and mobo manufacturers need to get together and
| make ECC RAM mandatory. It's absurd that we have machines with
| gigabytes of storage using microscopic (nanoscopic??)
| capacitors that don't have this basic protection. And
| honestly this should have happened _years_ ago.
|
| (Edit: And before anyone says DDR5 is ECC by default, that's
| not quite true although the difference is a bit subtle:
| https://en.wikipedia.org/wiki/DDR5_SDRAM#DIMMs_versus_memory...)
| ilyt wrote:
| >But if I open up an editor and write a document then that
| could be corrupted in RAM and then the corrupted data saved
| to disk. The document is likely to be more important than
| some ripped CDs or a git repo that I can download again.
|
| Devil's advocate: if just a few bits in a character are
| flipped it's entirely recoverable, while flipping a few bits
| in a compressed video stream would corrupt more.
| rwmj wrote:
| Agreed. One problem with RAM errors (which I've actually
| experienced) is they are insidious. You probably won't
| notice them immediately so the error can be propagated, and
| they're very very difficult to diagnose if you do notice
| them.
|
| Back in the 90s we had a database server which had a stuck
| bit in memory normally mapped to the page cache. This
| caused sectors to be written to the backing software RAID
| which couldn't be read back in (because I think some
| checksum was corrupted when written and then failed when
| read back). It took an absolute age to diagnose this. I
| think I only worked it out by eliminating everything else.
| justsomehnguy wrote:
| And condoms are only 99.8% effective, so let's not use them? You
| can always pull out in time or verify checksu^W^W Plan B?
|
| > But really most of all I live in a world of Git repositories
|
| That's great _for you_, but 99% don't even know what Git is,
| just like _checksums, cryptographic hashes, Merkle trees,
| digital signatures and reproducible builds_.
|
| You just miss the one important thing: ECC isn't that helpful
| where _you have_ the means to check and verify the data. ECC is
| the only way to at least know that something is happening with
| data when there is no way to check.
|
| To give you a slight idea, I'll tell you an anecdote from my
| L1 life almost two decades ago:
|
| I visited a client who claimed that the PC was working
| erratically and constantly threw weird error messages.
|
| Welp, the usual deal, just some ugly software or a virus. In
| the first 3 minutes I got like 5 errors about failing to load a
| .dll from C:\WINDOWS\SYSTEM33\USER32.DLL, nothing unusual, just
| need to.. WAIT. Why system _33_? Stupid virus masking for a
| well-known folder? Doubt. So I go to C:\Windows and I see:
| System32 System33 System34
|
| If you are a smart fellow you have probably already understood
| that this was a bit-flip error which somehow ended up in the
| in-memory copy of the MFT of the system drive. And this is the
| only reason the user noticed it - because sometimes programs
| wouldn't start and sometimes there were weird error messages.
| If that error had been in the data area of some program, it
| wouldn't have been discovered at all.
| [deleted]
| dmitrybrant wrote:
| Hang on... to go from "System32" to "System33" is one bit
| flip, but to go from "System33" to "System34" is three bit
| flips at once. Doesn't that seem astronomically more likely
| to be a malicious program than bad RAM?
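
The bit-flip counts in the comment above can be checked with a short sketch. The helper name `bit_flips` is made up for illustration, and ASCII encoding is an assumption (NTFS stores names as UTF-16, but for these characters the differing bits are the same).

```python
def bit_flips(a: str, b: str) -> int:
    """Count differing bits between two equal-length ASCII strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    pairs = zip(a.encode("ascii"), b.encode("ascii"))
    return sum(bin(x ^ y).count("1") for x, y in pairs)


for old, new in [
    ("System32", "System33"),
    ("System33", "System34"),
    ("System32", "System34"),
]:
    print(old, "->", new, ":", bit_flips(old, new))
```

The first two pairs give 1 and 3, matching the comment. Going directly from "System32" to "System34" is 2 flips, so the third name need not have been derived from the second.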
| stefantalpalaru wrote:
| [dead]
| baeaz wrote:
| [flagged]
| mrlonglong wrote:
| We've got a server that keeps rebooting due to a bad ECC DIMM
| chip. I thought the whole point of ECC was to keep the server
| going until we can replace the DIMM?
| AaronFriel wrote:
| ECC can typically correct 1-bit errors and usually detect (and
| fault on) 2-bit errors.
|
| Your server is faulting and preventing itself from corrupting
| data.
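
The "correct 1 bit, detect 2 bits" behaviour can be illustrated with a toy single-error-correcting, double-error-detecting (SECDED) code. This is a sketch on 4 data bits using an extended Hamming(8,4) code; real ECC DIMMs protect much wider words and their exact codes vary by platform, so the function names and layout here are assumptions for illustration only.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # code[1..7] is Hamming(7,4); code[0] is overall parity
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # makes total parity even
    return code


def decode(code: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, nibble); nibble is None if the error is uncorrectable."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    odd_parity = sum(code) % 2
    fixed = list(code)
    if syndrome == 0 and odd_parity == 0:
        status = "ok"
    elif odd_parity == 1:
        # Odd number of flips: treat as a single error at `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even parity: two flips, detect and fault.
        return "uncorrectable", None
    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    one = list(word)
    one[5] ^= 1
    two = list(one)
    two[2] ^= 1
    print(decode(word), decode(one), decode(two))
```

The "uncorrectable" branch corresponds to the server faulting rather than continuing with data it cannot trust.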
___________________________________________________________________
(page generated 2022-12-25 23:00 UTC)