[HN Gopher] ECC matters
___________________________________________________________________
ECC matters
Author : rajesh-s
Score : 624 points
Date : 2021-01-03 15:38 UTC (7 hours ago)
(HTM) web link (www.realworldtech.com)
(TXT) w3m dump (www.realworldtech.com)
| sys_64738 wrote:
| ECC memory is predominantly used in servers, where failures
| absolutely must be identified and logged. The desktop market
| uses it to a lesser extent, since few mission-critical tasks
| run there.
| dijit wrote:
| There are situations, though, where you're working on a
| document and the document's "save" format is a memory dump.
| Corruption in files of that type (Adobe RAW, for example)
| would destroy data.
|
| It might present itself as a one-pixel colour difference, but
| it could be more damaging (incorrect finances in accounting
| software, for example). Software trusts memory, but memory
| can lie.
|
| That's dangerous.
| MaxBarraclough wrote:
| That's an interesting point. In an extreme case, an order or
| money transfer might be placed for an incorrect quantity, or
| to an incorrect recipient.
| KingMachiavelli wrote:
| Well, maybe. Rather than trusting memory completely, it
| would be better to use a binary format where each bit is
| verifiable, so that at least a single bit flip would be
| immediately obvious. For example, a bit flip in a TLS
| session causes the whole session to fail rather than a
| random page element to change.
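|
| A minimal sketch of that idea in Python (a hypothetical
| format; one CRC per block is just one way to do it): each
| block stores a checksum, so a flipped bit fails verification
| on read.
|     import struct
|     import zlib
|
|     def pack_block(payload: bytes) -> bytes:
|         # Layout: 4-byte CRC32 header, then the payload itself.
|         return struct.pack("<I", zlib.crc32(payload)) + payload
|
|     def unpack_block(block: bytes) -> bytes:
|         crc = struct.unpack("<I", block[:4])[0]
|         payload = block[4:]
|         if zlib.crc32(payload) != crc:
|             raise ValueError("checksum mismatch: block corrupted")
|         return payload
|
|     block = bytearray(pack_block(b"account balance: 1000"))
|     block[10] ^= 0x04              # simulate a single bit flip
|     unpack_block(bytes(block))     # raises ValueError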
| knorker wrote:
| That doesn't help if the memory is corrupted before the
| verification code is applied (the code will simply put a
| signature on incorrect data), or after it's been checked
| (time-of-check vs. time-of-use).
| MaxBarraclough wrote:
| Right, exactly. TCP protects us from data corruption in
| network streams, and ECC protects us from data corruption
| in RAM. I doubt any sort of software solution could
| practically compete with hardware ECC; even if it could be
| done, it would presumably be disastrous for performance.
| knorker wrote:
| The best integrity checking is "end to end". The problem
| with non-ECC is that there are no "ends" that are
| trustworthy.
|
| I guess in theory some software could produce signed data
| in CPU cache, and "commit" it to RAM as a verified block.
|
| But the overhead would be enormous. Would you slow down
| your CPU by half in order to not pay 12.5% more for RAM?
|
| Hmm, I wonder what SGX and similar do about this.
| mark-r wrote:
| That's the principle behind Gray Code counting:
| https://en.wikipedia.org/wiki/Gray_code
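|
| A quick illustration of binary-reflected Gray code (a sketch):
|     def gray(n: int) -> int:
|         # Consecutive values differ in exactly one bit.
|         return n ^ (n >> 1)
|
|     for i in range(8):
|         print(f"{i}: {gray(i):03b}")
|     # 0: 000, 1: 001, 2: 011, 3: 010,
|     # 4: 110, 5: 111, 6: 101, 7: 100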
| sys_64738 wrote:
| Those corner cases might occur rarely and are probably
| inconsequential when you weigh frequency of occurrence
| against criticality - for most people it doesn't justify
| the markup. In a data center you're processing millions of
| transactions per minute, so an occurrence is much more
| impactful.
| knorker wrote:
| I would EASILY pay 12.5% more (that's the bit overhead) for
| memory that actually works.
|
| If my data is fine being corrupted to save 12.5% on RAM
| costs, then why am I even bothering processing the data?
| Apparently it's worthless.
|
| People today weigh the cost of maybe 16 vs 32GB on a mid-
| tier desktop. ~doubling the cost for twice the RAM. Yes,
| paying 12.5% more for ECC RAM is a no-brainer.
| xxs wrote:
| You need 1/8 more memory - that's the real cost. The
| segmentation is pretty much Intel's fault.
| jkbbwr wrote:
| To be fair, if your save mechanism is just a straight memory
| dump with no checksums or validation, you have bigger
| issues.
| dijit wrote:
| That happens more often than you'd think. Most* things that
| output PNG build an in-memory data structure and dump it
| to disk.
| xxs wrote:
| Why does it matter if it HAD a checksum? The numbers would
| have been altered before the save, so you'd store a one but
| read back a two later - under a valid checksum. If the
| format calculated immediate checksums on blocks, it would at
| best detect memory corruption. The extreme downside is that
| such a part is untestable under normal conditions, hard to
| maintain, and costs more to develop than the ECC.
| projektfu wrote:
| Perhaps consumer-grade software that needs guarantees of
| correctness should be using error correction in software. For
| example, database records for financial software, DNS, e-mail
| addresses, etc.
| wicket wrote:
| Over the years, I don't think I've ever been able to explain
| to anyone that their memory error could have been caused by a
| cosmic ray without being laughed at.
| amelius wrote:
| Does Apple use ECC in its M1 laptop?
| dijit wrote:
| No. It uses a unified package of LPDDR4X SDRAM.
| my123 wrote:
| LPDDR4X systems with ECC exist, but it indeed looks like
| Apple M1 systems aren't among them...
| graeme wrote:
| This is my one worry. I have an iMac Pro, and anecdotally it
| has been a LOT more reliable than my old MacBook Pro. The
| iMac Pro has ECC.
| dijit wrote:
| I hear this every time this conversation comes up - it's the
| same answer: "I don't see a problem".
|
| It's so easy to chalk these kinds of errors up to other
| issues: a little corruption here, a running program goes
| berserk there. Could be a buggy program or a little
| accidental memory overwrite; a reboot will fix it.
|
| But I ran many thousands of physical machines, petabytes of
| RAM, and I tracked memory bit-flip errors: they were
| _common_. Common even in less dense memory, in thick metal
| enclosures surrounded by mesh - and density and shielding
| affect bitflips a lot.
|
| My own experience tracking bitflips across my fleet led me to
| buy a Xeon laptop with ECC memory (a Precision 5520), and it
| has (anecdotally) been significantly more reliable than my
| desktop.
| [deleted]
| derefr wrote:
| Were you around for enough DRAM generations to notice an effect
| of DRAM _density_ / cell-size on reported ECC error rate?
|
| I've always believed that, ECC aside, DRAM made intentionally
| with big cells would be less prone to spurious bit-flips (and
| that this is one of the things NASA means when they talk about
| "radiation hardening" a computer: sourcing memory with ungodly-
| large DRAM cells, willingly trading off lower memory capacity
| for higher per-cell level-shift activation-energy.)
|
| _If_ that's true, then that would mean that the per-cell error
| rate would have actually been _increasing_ over the years, as
| DRAM cell-size decreased, in the same way cell-size decrease
| and voltage-level tightening have increased error rate for
| flash memory. Combined with the fact that we just have N times
| more memory now, you'd think we'd be seeing a _quadratic_
| increase in faults compared to 40 years ago. But do we? It
| doesn't seem like it.
|
| I've _also_ heard a counter-effect proposed, though: maybe
| there really are far more "raw" bit-flips going on -- but far
| less of main memory is now in the causal chain for corrupting a
| workload than it used to be. In the 80s, on an 8-bit micro,
| POKEing any random address might wreck a program, since there's
| only 64k addresses to POKE and most of the writable ones are in
| use for something critical. Today, most RAM is some sort of
| cache or buffer that's going to be used once to produce some
| ephemeral IO effect (e.g. the compressed data for a video
| frame, that might decompress incorrectly, but only cause 16ms
| of glitchiness before the next frame comes along to paper over
| it); or, if it's functional data, it's part of a fault-tolerant
| component (e.g. a TCP packet, that's going to checksum-fail
| when passed to the Ethernet controller and so not even be sent,
| causing the client to need to retry the request; or, even if
| accidentally checksums correctly, the server will choke on the
| malformed request, send an error... and the client will need to
| retry the request. One generic retry-on-exception handler
| around your net request, and you get memory fault-tolerance for
| free!)
|
| If both effects are real, this would imply that regular PCs
| without ECC _should_ still seem quite stable -- but that it
| would be a far worse idea to run a non-ECC machine as a
| densely-packed multitenant VM hypervisor today (i.e. to tile
| main memory with OS kernels), than it would have been ~20 years
| ago when memory densities were lower. Can anyone attest to
| this?
|
| (I'd just ask for actual numbers on whether per-cell per-second
| errors have increased over the years, but I don't expect anyone
| has them.)
| jeffreygoesto wrote:
| Sorry, I don't have the numbers you asked for. But AFAIK one
| other effect is that "modern" semiconductor processes like
| FinFET and fully-depleted silicon-on-insulator are less prone
| to single-event upsets, and in particular tend to produce
| only a single flipped bit rather than a whole region of
| transistors drained by a single alpha particle.
| mlyle wrote:
| I think it's been quadratic with a pretty low contribution
| from the order 2 term.
|
| Think of the number of events that can flip a bit. If you
| make bits smaller, you get a modestly larger number of events
| in a given area capable of flipping a bit, spread across a
| larger number of bits in that area.
|
| That is, it's flip event rate * memory die area, not flip
| event rate * number of memory bits.
|
| In recent generations, I understand it's even been a bit
| paradoxical-- smaller geometries mean less of the die is
| actual memory bits, so you can actually end up with _fewer_
| flips from shrinking geometries.
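|
| A toy model of that area-versus-bits point (all numbers are
| invented for illustration, not measured):
|     area_mm2 = 60.0                  # die area stays fixed
|     events_per_mm2_hour = 1e-6       # hypothetical upset rate
|     for cell_um2 in (0.01, 0.005, 0.0025):   # shrinking cells
|         bits = area_mm2 * 1e6 / cell_um2     # bits on the die
|         events = events_per_mm2_hour * area_mm2
|         print(f"cell {cell_um2} um^2: {bits:.1e} bits, "
|               f"{events / bits:.2e} flips/bit/hour")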
|
| And sure, your other effect is true: there's a whole lot
| fewer bitflips that "matter". Flip a bit in some framebuffer
| used in compositing somewhere-- and that's a lot of my
| memory-- and I don't care.
| smoyer wrote:
| There is no guarantee of state at the quantum level ... just
| a high degree of assurance in a state. After 40 years in the
| electronics, optics, and software business, I've learned that
| there is absolutely the possibility of unexplained "blips".
| loeg wrote:
| Yeah, it's real obnoxious of Intel to silo ECC support off into
| the Xeon line, isn't it? I switched to ECC memory in 2013 or
| 2014 with a Xeon E3 (fundamentally a Core i7 without the ECC
| support fused off) and of course a Xeon-supporting motherboard
| (with weird "server board" quirks: e.g., no on-board sound
| device).
|
| I love that AMD doesn't intentionally break ECC on its
| consumer desktop platforms; I upgraded to a Threadripper in
| 2017.
| defanor wrote:
| I've considered using an AMD CPU instead of Intel's Xeon on
| the primary desktop computer, but even low-end Ryzen
| Threadripper CPUs have TDP of 180W, which is a bit higher
| than I'd like. And though ECC is not disabled in Ryzen CPUs,
| AFAIK it's not tested in (or advertised for) those, so one
| won't be able to return/replace a CPU if it doesn't work with
| ECC memory, AIUI, making it risky. Though I don't know how
| common it is for ECC to not be handled properly in an
| otherwise functioning CPU; are there any statistics or
| estimates around?
| BlueTemplar wrote:
| > one won't be able to return/replace a CPU if it doesn't
| work with ECC memory
|
| I don't know where you live, but around here, (if you buy
| new?), the vendor MUST take back items up to 15 days after
| they were delivered, for ANY reason.
|
| So, as long as you synchronize your buying of CPU, RAM,
| (motherboard), you should be fine.
| marcosdumay wrote:
| Keep in mind that Intel lies about its TDP.
| magila wrote:
| There's been a lot of misinformation spread about what
| TDP means for modern CPUs. In Intel's case TDP is the
| steady state power consumption of the CPU in its default
| configuration while executing a long running workload.
| Long meaning more than a minute or two. The CPU
| implements this by keeping an exponentially weighted
| moving average (EWMA) of the CPU's power consumption. The
| CPU will modulate its frequency to keep this moving
| average at-or-below the TDP.
|
| One consequence of using a moving average is that if the
| CPU has been idle for a long time then starts running a
| high power workload instantaneous power consumption can
| momentarily exceed the TDP while the average catches up.
| This is often misleadingly referred to as "turbo mode" by
| hardware review sites. It's not a mode, there's no state
| machine at work here, it's just a natural result of using
| a moving average. The use of EWMA is meant to model the
| heat capacity of the cooling solution. When the CPU has
| been idle for a while and the heatsink is cool, the CPU
| can afford to use more power while the heatsink warms up.
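|
| A toy model of that EWMA behavior (illustrative only; the
| constants here are invented, not Intel's):
|     TDP = 65.0      # watts, steady-state limit
|     ALPHA = 0.01    # smoothing factor per time step
|
|     def step(avg: float, demand: float) -> tuple[float, float]:
|         # Run at the demanded power only while the moving
|         # average stays under TDP; otherwise clamp to TDP.
|         power = demand if avg < TDP else TDP
|         return (1 - ALPHA) * avg + ALPHA * power, power
|
|     avg = 5.0  # the CPU has been idling, so the average is low
|     for t in range(1000):
|         avg, p = step(avg, demand=120.0)
|         # Early iterations run at 120 W (the "turbo" effect);
|         # as the average nears TDP, power settles around 65 W.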
|
| Another factor which confuses things is motherboard
| firmware disabling power limits without the user's
| knowledge. Motherboards marketed to enthusiasts often do
| this to make the boards look better in review benchmarks.
| This is where a lot of the "Intel is lying" comes from,
| but it's really the motherboard manufacturers being
| underhanded.
|
| The situation on the AMD side is of course a bit
| different. AMD's power and frequency scaling is both more
| complex and much less documented than Intel's so it's
| hard to say exactly what the CPU is doing. What is known
| is that none of the actual power limits programmed into
| the CPU align with the TDP listed in the spec. In
| practice the steady state power consumption of AMD CPUs
| under load is typically about 1.35x the TDP.
|
| Unlike Intel, firmware for AMD motherboards does not mess
| with the CPU's power limit settings unless the user does
| so explicitly. Presumably this is because AMD's CPU
| warranty is voided by changing those settings, while
| Intel's is not.
| xxs wrote:
| Intel measures TDP at base frequency... that's
| disingenuous.
| colejohnson66 wrote:
| They don't. They just measure it differently than AMD.
| Intel measures at base clock, but AMD measures at
| sustained max clock IIRC. It's definitely deceptive, but
| it's not a lie as long as Intel tells you (which they
| do).
| wtallis wrote:
| Intel's TDP numbers are at best an indicator of which
| product segment a chip falls into. They are wildly
| inaccurate and unreliable indicators of power draw under
| _any_ circumstance. For example, here's a "58W" TDP
| Celeron that can't seem to get above 20W:
| https://twitter.com/IanCutress/status/1345656830907789312
|
| And on the flip side, if you're building a desktop PC
| with a more high-end Intel processor, you will usually
| have to change a _lot_ of motherboard firmware settings
| to get the behavior to resemble Intel's own
| recommendations that their TDP numbers are supposedly
| based on. Without those changes, lots of consumer retail
| motherboards default to having most or all of the power
| limits effectively disabled. So out of the box, a "65W"
| i7-10700 and a "125W" i7-10700K will both hit 190-200W
| when all 8 cores/16 threads are loaded.
|
| If a metric can in practice be off by a factor of three
| in either direction, it's really quite useless and should
| not be quantified with a scientific unit like Watts.
| marcosdumay wrote:
| Well, it's a power measurement that isn't total and can't
| be used for design... So, it's a lie.
|
| If they gave it some other name, it would be only
| misleading. Calling it TDP is a lie.
| ksec wrote:
| It is a lie when they change the definition of TDP without
| telling you first, and later redefine the word to mean
| something different once they get caught.
|
| Maybe we should use a new term for it - something like iTDP.
| mlyle wrote:
| They both lie, but Intel lies worse :D
| paulmd wrote:
| Nah. Both brands pull more than TDP when boosting at max,
| AMD will pull up to 30% above the specified TDP for an
| indefinite period of time (they call this number the
| "PPT" instead).
|
| Intel mobile processors actually obey this better than AMD
| processors do - Tiger Lake has a hard limit: when you
| configure a 15W TDP, it really is 15W once steady-state
| boost expires, while AMD mobile products will pull up to
| _50%_ more than configured.
|
| https://images.anandtech.com/doci/16084/Power%20-%2015W%20Co...
|
| "the brands measure it differently" is true but not in
| the sense people think.
|
| On AMD it is literally just a number they pick that goes
| into the boost algorithm. Robert Hallock did some dumb
| handwavy shit about how it's measured with some delta-t
| above ambient with a reference cooler but the fact is
| that the chip itself basically determines how high it'll
| boost based on the number they configure, so that is a
| self-fulfilling prophecy, the delta-t above ambient is
| dependent on the number they configure the chip to run
| at.
|
| In practice: what's the difference between a 3600 and a
| 3600X? One is configured with a TDP of 65W and one is
| configured with a TDP of 95W, the latter lets you boost
| higher and therefore it clocks higher.
|
| Intel nominally states that it's measured as a worst-case
| load at base clocks, something like Prime95 that
| absolutely nukes the processor (and even then many
| processors do not actually hit it). But really it is also
| just a number that they pick. The number has shifted over
| time, previously they used to undershoot a lot, now they
| tend to match the official TDP. It's not an actual
| measurement, it's just a "power category" that they
| classify the processors as; it's _informed_ by real
| numbers, but it's ultimately a human decision which tier
| they put them in.
|
| Real-world you will always boost above base clocks on
| both brands at stock TDP, at least on real-world loads.
| You won't hit full boost on either brand without
| exceeding TDP, the "AMD measures at full boost" is
| categorically false despite the fact that it's commonly
| repeated. AMD PPT lets them boost above the official TDP
| for an unlimited period of time, they cannot run full
| boost when limited to official TDP.
| numlock86 wrote:
| Can you cite something? Sounds interesting.
| colejohnson66 wrote:
| It's not true. Sort of. Intel measures at base clock,
| while AMD does so at sustained peak clock. Deceptive? Yes.
| A lie? No.
| CydeWeys wrote:
| > but even low-end Ryzen Threadripper CPUs have TDP of
| 180W, which is a bit higher than I'd like.
|
| Why does it matter? It doesn't idle that high; it only goes
| that high if you're using it flat out, in which case the
| extra power usage is justified because it's giving that
| much more performance over a 100 W TDP CPU. Now I totally
| get it if you don't want to go Threadripper just for ECC
| because it's more _expensive_ , but max power draw, which
| you don't even have to use? I've never seen anyone shop a
| desktop CPU by TDP, rather than by performance and price.
| defanor wrote:
| I prefer to pick PSU and fans (for both CPU and chassis)
| that can handle it comfortably (preferably while staying
| silent and with some reserve) with maximum TDP in mind,
| and given that I don't need that many cores or high clock
| speed either, a powerful CPU with high TDP is undesirable
| because it just makes picking other parts harder. I've
| mentioned TDP explicitly because I wouldn't mind if it
| was a (possibly even high-end) Threadripper that somehow
| didn't produce as much heat. Although price also matters,
| indeed.
| phkahler wrote:
| >> I've never seen anyone shop a desktop CPU by TDP,
| rather than by performance and price.
|
| Oh oh, me! Back in the day I bought a 65W CPU for a
| system that could handle 90W. I wanted quiet and
| figured that would keep fan noise down at a modest
| performance penalty. It should also last longer, being
| the same design but running cooler. I ran that from 2005
| until a few years ago (it still runs fine but is in
| storage).
|
| Planning to continue this strategy. I suspect it's common
| among SFF enthusiasts.
| koolba wrote:
| SFF?
| lostlogin wrote:
| The Intel NUC and Mac mini are good examples of this -
| however, the NUC doesn't have its PSU inside; it's a
| brick. Great for fixing failures, horrible in general, as
| a built-in PSU is so much tidier.
| oconnor663 wrote:
| "small form factor" as far as I can tell
| sam_lowry_ wrote:
| Hm... My 2013 NUC in a fanless Akasa enclosure runs 24/7
| on a 6W CPU. I recently looked at the options, and the
| 2019 6W offering changes little in performance. Yes,
| memory got faster, but that's it.
|
| My passively cooled desktop is also running a slightly
| throttled-down 65W CPU.
|
| So yes, there are people who choose their hardware by
| TDP.
| francis-io wrote:
| When looking for a CPU for a server that sits in my
| living room, I went down the thought process of getting a
| low TDP. I don't have a quote, but I seem to remember
| coming to the conclusion that TDP is the maximum thermal
| threshold, not the consistent power draw. With a computer
| idling, I believe you won't see a difference in
| temperature between CPUs, but you will have the
| performance when you need it.
|
| These days, a quiet PWM fan with good thermal paste (and
| maybe some Linux CPU throttling) more than achieves my
| needs for a "silent" PC 99% of the time.
|
| I would love to be told my above assumptions are wrong if
| they are.
| mlyle wrote:
| Yah-- one should look at performance within a given power
| envelope. Being able to dissipate more and then either
| end up with the fan running or the processor throttling
| back somewhat is good, IMO.
|
| The worst bit is, AMD and Intel define TDP differently--
| neither is the maximum power the processor can draw--
| though Intel is far more optimistic.
| mlyle wrote:
| On AMD, with Ryzen Master, you can set the TDP-envelope
| of the processor to what you want. Then the
| boost/frequency/voltage envelope it chooses to operate in
| under sustained load is different.
|
| IMO, shopping by performance/watt makes sense. Shopping
| by TDP doesn't. (Especially since there is no comparing
| the AMD and Intel TDP numbers as they're defined
| differently; neither is the maximum the processor can
| draw, and Intel significantly exceeds the specified TDP
| on normal workloads).
| ReactiveJelly wrote:
| Back when my daily driver was a Core 2 laptop, someone
| told me that capping the clock frequency would make it
| unusable.
|
| As a petty "Take that", I dropped the max frequency from
| 2.0 GHz to 1.0 GHz. I ran a couple benchmarks to prove
| the cap was working, and then just kept it at 1.0 for a
| few months, to prove my point.
|
| It made a bigger difference on my ARM SBC, where I tried
| capping the 1,000 MHz chip to 200 or 400 MHz. That chip
| was already CPU-bound for many tasks and could barely
| even run Firefox. Amdahl's Law kicked in - Halving the
| frequency made _everything_ twice as slow, because almost
| everything was waiting on the CPU.
| mlyle wrote:
| The funny thing is, on modern processors-- throttling TDP
| only affects when running flat out all-core workloads. A
| subset of cores can still boost aggressively, and you can
| run all-core max-boost for short intervals.
|
| And the relationship between power and performance isn't
| linear as processor voltages climb trying to squeeze out
| the last bit of performance.
|
| So if you want to take a 105W CPU and ask it to operate
| in a 65W envelope, you're not giving up even 1/3rd of
| peak performance, and much less than that of typical
| performance.
| vvanders wrote:
| TDP matters a fair bit in SFF(Small Form Factor) PCs. For
| instance the 3700x is a fantastic little CPU since it has
| a 65W TDP but pretty solid performance.
|
| In a sandwich style case you're usually limited to low
| profile coolers like Noctua L9i/L9a since vertical height
| is pretty limited.
| mlyle wrote:
| Performance/watt matters. You can just set TDP to what
| you want with throttling choices.
|
| If you want a 45W TDP from the 3700X, you can just pop
| into Ryzen Master and ask for a 45W TDP. Boom, you're
| running in that envelope.
|
| I think shopping based on TDP is not the best, because
| it's not comparable between manufacturers and because
| it's something you can effectively "choose".
| mongol wrote:
| How do you do that? Is it a setting in the bios? Or can
| it be done runtime? If so, how? It sounds interesting if
| I can run a beefy rig as a power efficient device, for
| always-on scenarios, and then boost it when I need.
| mlyle wrote:
| > How do you do that? Is it a setting in the bios? Or can
| it be done runtime?
|
| On AMD, it's a utility you run. I believe you may require
| a reboot to apply it. On some Intel platforms, it's been
| settings in the BIOS.
|
| > It sounds interesting if I can run a beefy rig as a
| power efficient device, for always-on scenarios, and then
| boost it when I need.
|
| This is what the processor is doing internally anyways.
| It throttles voltage and frequency and gates cores based
| on demanded usage. Changing the TDP doesn't change the
| performance under a light-to-moderate workload scenario
| at all.
|
| Ryzen Master lets you change some of the tuning for the
| choices it makes about when and how aggressively to
| boost, though, too.
| Cloudef wrote:
| Ryzen Master doesn't seem to be available for Linux, so
| you end up with a bunch of unofficial hacks that may or
| may not work. I run an SFF setup myself; I originally
| wanted to get a 3600, but it was out of stock, and the
| next TDP-friendly processor was the 3700X.
| mlyle wrote:
| That's an annoyance, but on Linux you have infinitely more
| control of thermal throttling, and you can get whatever
| thermal behavior you want. Thermald has been really good
| on Intel, and now that Google has contributed RAPL support
| you can get the same benefits on AMD - pick exactly your
| power envelope and thermal limits.
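|
| A minimal sketch of reading and setting a power limit
| through the Linux powercap sysfs interface (paths vary by
| platform and kernel; this assumes an intel-rapl package
| domain is exposed):
|     from pathlib import Path
|
|     DOMAIN = Path("/sys/class/powercap/intel-rapl:0")
|     LIMIT = DOMAIN / "constraint_0_power_limit_uw"
|
|     def get_limit_watts() -> float:
|         return int(LIMIT.read_text()) / 1e6
|
|     def set_limit_watts(watts: float) -> None:
|         # Needs root; constraint 0 is the long-term limit.
|         LIMIT.write_text(str(int(watts * 1e6)))
|
|     print(f"long-term limit: {get_limit_watts():.0f} W")
|     # set_limit_watts(45.0)  # e.g. cap the package at 45 W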
| vvanders wrote:
| Yeah but can I get a metric ton of benchmarks at that 45w
| setpoint?
|
| I don't really see the reason in paying for a 100w TDP
| premium if I'm just going to scale it down to 65w.
| bayindirh wrote:
| > I've never seen anyone shop a desktop CPU by TDP,
| rather than by performance and price.
|
| That's me. When I start to plan for a new system, I
| select the processor first and read its thermal design
| guidelines (Intel used to have nice load vs. max temp
| graphs in their docs) and select every component around
| it for sustained max load.
|
| This results in a more silent system for idle and peace
| of mind for loading it for extended duration.
| 411111111111111 wrote:
| That's not necessarily correct.
|
| You can passively cool Threadrippers if you underclock
| them enough and have good ventilation in the case.
| bayindirh wrote:
| If my only interest were ECC, I might do that, but I
| develop scientific software for research purposes. I need
| every bit of performance from my system.
|
| In my case, loading means maxing out all cores, and the
| extended period of time can be anything from five minutes
| to hours.
| mlyle wrote:
| The problem is-- you can't compare the TDP nor even the
| system cooling design guidelines between AMD and Intel.
|
| Both are optimistic lies, but-- if you look at the
| documents it looks like currently AMD needs more cooling,
| but actually dissipates less power in most cases and
| definitely has higher performance/watt.
| bayindirh wrote:
| > The problem is-- you can't compare the TDP nor even the
| system cooling design guidelines between AMD and Intel.
|
| Doesn't matter for me since I'm not interested in
| comparing them.
|
| > Both are optimistic lies, but-- if you look at the
| documents it looks like currently AMD needs more cooling,
| but actually dissipates less power in most cases and
| definitely has higher performance/watt.
|
| I'm aware of the situation, and I always inflate the
| numbers by 10-15% to add headroom in my systems. The code
| I'm running is not _most case_ code: it's an FPU-heavy,
| "I will abuse all your cores and memory bandwidth" type of
| heavily optimized scientific software. I can sometimes
| hear my system swearing at me for repeatedly running
| tests.
|
| I don't like to add this paragraph, but I'm one of the
| administrators of one of the biggest HPC clusters in my
| country. I know how a system can surpass its TDP, and how
| CPU manufacturers can skew these TDP numbers to fit into
| envelopes. We make these servers blow flames from their
| exhausts.
| ethanpil wrote:
| Built a NAS. My #1 concern for choosing CPU was TDP. This
| machine is on 24/7 and power use is a primary concern
| where I live because electricity is NOT cheap.
| mlyle wrote:
| This is a poor way to make the choice. TDP is supposed to
| specify the highest power you can get the processor to
| dissipate, not typical or idle use. And since different
| manufacturers specify TDP differently, you can't even
| compare the number.
|
| Performance/watt metrics and idle consumption would have
| been a far better way to make this choice.
|
| If you have a choice between A) something that can
| dissipate 65W peak for 100 units of performance, but
| would dissipate 4W average under your workload, and B)
| something that can dissipate 45W peak for 60 units of
| performance, but would dissipate 4.5W under your
| workload... I'm not sure why you'd ever pick B.
| mongol wrote:
| Is there a metric to look for to understand what power
| consumption is at "idle" or something close to that? That
| is what confuses me. I don't want to spend a lot of money
| on something that will be always on, and usually idling,
| and finding that its power usage is way higher than I
| thought. But perhaps there is a metric that tells that. I
| have not looked closely at it.
|
| Also, even though the CPU may draw less, can the power
| supply still waste more, just because it is beefy?
| Comparing with a sports car: they have great performance,
| but also use more gas in ordinary traffic. Can a computer
| be compared with that?
| mlyle wrote:
| > Is there a metric to look for to understand what power
| consumption is at "idle" or something close to that? That
| is what confuses me. I don't want to spend a lot of money
| on something that will be always on, and usually idling,
| and finding that its power usage is way higher than I
| thought.
|
| Community benchmarks, from Tom's Hardware, etc.
|
| The vendor numbers are make believe-- you can't use them
| for power supply sizing or for thermal path sizing. If
| you look at the cited TDP numbers today-- it can be
| misleading-- e.g. often Intel 45W TDP parts use more
| power at peak than AMD 65W parts.
|
| On modern systems, almost none of the idle consumption is
| the processor. The power supply's idle use and
| motherboard functions dominate.
|
| > Also, even though the CPU may draw less, can still the
| power supply waste more, just because it is beefy?
|
| Yes, having to select a larger power supply can result in
| more idle consumption, though this is more of a problem
| on the very low end.
| vvanders wrote:
| I don't think Threadripper is a hard requirement for ECC.
| There's some pretty reasonable TDP processors if you step
| down from Threadripper.
| usefulcat wrote:
| It's not. I have a low end Epyc machine with ECC. It has
| a TDP of something like 30 watts.
| defanor wrote:
| I didn't consider embedded CPUs (I guess that's about an
| embedded EPYC, not a server one), those look neat. But
| there's no official ECC support (i.e., it's similar to
| Ryzen CPUs), is there?
|
| Edit: as detaro mentioned in the reply, there is, and
| here's the source [0] -- that's what they mean by "RAS"
| on promotional pages [1]. That indeed looks like a nice
| option.
|
| [0] https://www.amd.com/system/files/documents/updated-3000-fami...
|
| [1] https://www.amd.com/en/products/embedded-epyc-3000-series
| loeg wrote:
| RAS covers more than just DRAM, but yes. Historically,
| the reporting interface is called MCA (Machine Check
| Architecture) / MCE. I think both AMD and Intel have
| extensions with other names, but MCA/MCE points you in
| the right direction.
| detaro wrote:
| All EPYC, including the embedded ones, do officially have
| ECC support
| adrian_b wrote:
| For embedded applications, there is official ECC support
| for all CPUs named Epyc or Ryzen Vxxxx or Ryzen Rxxxx.
|
| There are computers in the Intel NUC form factor, with
| ECC support (e.g. with Ryzen V2718), e.g from ASRock
| Industrial.
| detaro wrote:
| what kind of machine is that? Been vaguely looking for
| one a while back, and everything seemed difficult to get
| (since the main target is large-volume customers I guess)
| cuu508 wrote:
| I haven't seen definite details and test results on these
| (but haven't looked recently).
|
| What specific configurations (CPU, MB, RAM) are known to
| work?
|
| Let's say I have a Ryzen system, how can I check if ECC
| really works? Like, can I see how many bit flips got
| corrected in, say, last 24h?
| xxs wrote:
| Every Ryzen (non-APU) supports it.* Check the motherboard
| of your choice; they declare it in big bold letters,
| e.g. [0].
|
| *Not officially, and the memory controller provides no
| report of 'fixed' errors.
|
| 0: http://www.asrock.com/mb/AMD/X570%20Taichi/
| cturner wrote:
| Regarding verification. There is a debian package called
| edac-utils. As I recall you overclock your RAM and run
| your system at load in order to generate failures.
|
| Looking back at my notes, the output of journalctl -b
| should say something like, "Node 0: DRAM ECC
| enabled."
|
| Then 'edac-ctl --status' should tell you that drivers are
| loaded.
|
| Then you run 'edac-util -v' to report on what it has seen:
|     mc0: 0 Uncorrected Errors with no DIMM info
|     mc0: 0 Corrected Errors with no DIMM info
|     mc0: csrow2: 0 Uncorrected Errors
|     mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
|     mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
|     mc0: csrow3: 0 Uncorrected Errors
|     mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
|     mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
|     edac-util: No errors to report.
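|
| If you would rather poll the kernel's EDAC counters directly
| (the same data edac-util reads), a minimal sketch over sysfs:
|     from pathlib import Path
|
|     EDAC = Path("/sys/devices/system/edac/mc")
|     for mc in sorted(EDAC.glob("mc[0-9]*")):
|         ce = int((mc / "ce_count").read_text())  # corrected
|         ue = int((mc / "ue_count").read_text())  # uncorrected
|         print(f"{mc.name}: {ce} corrected, {ue} uncorrected")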
| a1369209993 wrote:
| > As I recall you overclock your RAM and run your system
| at load in order to generate failures.
|
| You can also use memtest86+ for this, although I don't
| recall if it requires specific configuration for ECC
| testing.
| p_l wrote:
| All AMD CPUs with integrated memory controllers support
| ECC. The CPU also exposes an interface usable by the
| operating system to verify ECC works - the same interface
| is used to provide monitoring of memory fault data
| provided by ECC.
|
| They aren't tested for it, so it's possible to get a dud,
| but the chance is minuscule and not worth worrying about.
|
| Now, to _actual_ issues you can encounter: _motherboards_
|
| The problem is that ECC means you need, IIRC, 8 more data
| lines between the CPU and the memory module, which of
| course means more physical connections (I don't remember
| how many right now). Those also need to be properly routed
| and tested, and you might encounter a motherboard where
| that wasn't done. Not sure how common that is,
| unfortunately.
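|
| Those extra lines carry the check bits. As a minimal sketch
| of the principle, here is a tiny Hamming(7,4) code in Python
| (real DIMMs use a wider SECDED code over 64 data bits plus 8
| check bits; this is illustrative only):
|     def encode(d):  # d: list of 4 data bits
|         p1 = d[0] ^ d[1] ^ d[3]
|         p2 = d[0] ^ d[2] ^ d[3]
|         p3 = d[1] ^ d[2] ^ d[3]
|         return [p1, p2, d[0], p3, d[1], d[2], d[3]]
|
|     def decode(c):  # corrects any single flipped bit
|         s = (c[0] ^ c[2] ^ c[4] ^ c[6]) \
|           | (c[1] ^ c[2] ^ c[5] ^ c[6]) << 1 \
|           | (c[3] ^ c[4] ^ c[5] ^ c[6]) << 2
|         if s:              # syndrome = 1-based error position
|             c[s - 1] ^= 1
|         return [c[2], c[4], c[5], c[6]]
|
|     word = encode([1, 0, 1, 1])
|     word[4] ^= 1                       # flip one bit in flight
|     assert decode(word) == [1, 0, 1, 1]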
|
| Another issue is motherboard firmware. Even though AMD
| supplies the memory init code, the configuration can be
| tweaked by the motherboard vendor, and they might simply
| break ECC support accidentally (even by something as
| simple as making a toggle default to _false_ and then
| forgetting to expose it in the configuration menu).
|
| Those are the two issues you can encounter.
|
| The difference with, AFAIK, Threadripper PRO and EPYC is
| that AMD includes ECC in its test and certification
| programs for them, which effectively enforces support.
| jtl999 wrote:
| > Another issue is motherboard firmware. Even though AMD
| supplies the memory init code, the configuration can be
| tweaked by motherboard vendor, and they might simply
| break ECC support accidentally (even by something as
| simple as making a toggle default to false then forgot to
| expose it in configuration menu).
|
| I think some Gigabyte boards are infamous for this in
| certain circles.
|
| OTOH: Gigabyte _might_ have a Threadripper PRO
| motherboard (WRX80 chipset) coming out in the future
| p_l wrote:
| Gigabyte is also infamous for trying to claim that they
| implemented UEFI by dropping a build of DUET (UEFI that
| boots on top of BIOS, used for early development) into the
| BIOS image...
| adrian_b wrote:
| All desktop Ryzen CPUs without integrated GPU, i.e. with
| the exception of APUs, support ECC.
|
| You must check the specifications of the motherboard to
| see if ECC memory is supported.
|
| As a rule, all ASRock MBs support ECC and also some ASUS
| MBs support ECC, e.g. all ASUS workstation motherboards.
|
| I have no experience with Windows and Ryzen, but I assume
| that ECC should work also there.
|
| With Linux, you must use a kernel with all the relevant
| EDAC options enabled, including CONFIG_EDAC_AMD64.
|
| For the new Zen 3 CPUs, i.e. Ryzen 5xxx, you must use a
| kernel 5.10 or later, for ECC support.
|
| On Linux, there are various programs, e.g. edac-utils, to
| monitor the ECC errors.
|
| To be more certain that the ECC error reporting really
| works, the easiest way is to change the BIOS settings to
| overclock the memory, until memory errors appear.
| theevilsharpie wrote:
| On Windows, to check if ECC is working, run the command
| 'wmic memphysical get memoryerrorcorrection':
|     PS C:\> wmic memphysical get memoryerrorcorrection
|     MemoryErrorCorrection
|     6
|
| SuperUser has a convenient decoder[1], but modern systems
| will report "6" here if ECC is working.
|
| When Windows detects a memory error, it will record it in
| the system event log, under the WHEA source. As a side
| note, this is also how memory errors within the CPU's
| caches are reported under Windows.
|
| [1] https://superuser.com/questions/893560/how-do-i-tell-if-my-m...
| stefan_ wrote:
| I don't understand. Whatever the TDP of Intel processors,
| you are straight up getting less bang per watt, given their
| ancient process. It's the same reason smartphones burst to
| high clocks and power: getting the task done faster is on
| average much more efficient.
| loeg wrote:
| > I've considered using an AMD CPU instead of Intel's Xeon
| on the primary desktop computer, but even low-end Ryzen
| Threadripper CPUs have TDP of 180W, which is a bit higher
| than I'd like.
|
| Any apples-to-apples comparable Intel CPU will have
| comparable power use. The difficulty is that Intel didn't
| really have anything like Threadripper -- their i9 series
| was the most comparable (high clocks and moderate core
| counts), but i9 explicitly did not support ECC memory,
| nullifying the comparison.
|
| You're looking at 2950X, probably? That's a Zen+ (previous
| gen) model. 16 core / 32 thread, 3.5 GHz base clock,
| launched August 2018.
|
| Comparable Intel Xeon timeline is Coffee Lake at the
| latest, Kaby Lake before that. As far as I can tell, _no_
| Kaby Lake or Coffee Lake Xeons even have 16 cores.
|
| The closest Skylake I've found is an (OEM) Xeon Gold 6149:
| 16/32 core/thread, 3.1 GHz base clock, 205W nominal TDP
| (and it's a special OEM part, not available for you). The
| closest buyable part is probably Xeon Gold 6154 with 18/36
| core/threads, 3GHz clock, and 200W nominal TDP.
|
| Looking at i9 from around that time, you had Skylake-X and
| a single Coffee Lake-S (i9-9900K). The 9900K only has 8
| cores.
| The Skylake i9-9960X part has 16/32 cores/threads, base
| clock of 3.1GHz, and a nominal TDP of 165W. That's somewhat
| comparable to the AMD 2950X, ignoring ECC support.
|
| Another note that might interest you: you could run the
| Threadripper part at substantially lower power by
| sacrificing a small amount of performance, if thermals are
| the most important factor and you are unwilling to trust
| Ryzen ECC:
| http://apollo.backplane.com/DFlyMisc/threadripper.txt
|
| Or just buy an Epyc, if you want a low-TDP ECC-definitely-
| supported part: EPYC 7302P has 16/32 cores, 3GHz base
| clock, and 155W nominal TDP. EPYC 7282 has 16/32 cores, 2.8
| GHz base, and 120W nominal TDP. These are all zen2 (vs
| 2950X's zen+) and will outperform zen+ on a clock-for-clock
| basis.
|
| > And though ECC is not disabled in Ryzen CPUs, AFAIK it's
| not tested in (or advertised for) those, so one won't be
| able to return/replace a CPU if it doesn't work with ECC
| memory, AIUI, making it risky.
|
| If your vendor won't accept defective CPU returns, buy
| somewhere else.
|
| > Though I don't know how common it is for ECC to not be
| handled properly in an otherwise functioning CPU; are there
| any statistics or estimates around?
|
| ECC support requires motherboard support; that's the main
| thing to be aware of shopping for Ryzen ECC setups. If the
| board doesn't have the traces, there's nothing the CPU can
| do.
| theevilsharpie wrote:
| > And though ECC is not disabled in Ryzen CPUs, AFAIK it's
| not tested in (or advertised for) those
|
| ECC isn't validated by AMD for AM4 Ryzen models, but it's
| present and supported if the motherboard also supports it.
| Many motherboards have ECC support (the manual will say for
| sure), and a handful of models even explicitly advertise it
| as a feature.
|
| I have a Ryzen 9 3900X on an ASRock B450M Pro4 and 64 GB of
| ECC DRAM, and ECC functionality is active and working.
| colejohnson66 wrote:
| What do you mean by "validated"? There's the silicon, but
| they don't test it?
| Laforet wrote:
| More like "The feature is present in silicon but
| motherboard makers are not required to turn it on". At
| the end of the day, ECC support does require extra copper
| traces in the PCB and some low end models may
| deliberately choose to skip them, thus the expectation
| has to be managed.
| loeg wrote:
| IMO, "validated" is intentionally wishy-washy and mostly
| means that AMD would prefer it if enterprises paid them
| more money by buying EPYC (or Ryzen Pro) parts instead of
| consumer Ryzen parts. Much like how Intel prefers selling
| higher-margin Xeons over Core i5. It's market
| segmentation, but friendlier to consumers than Intel's
| approach.
| cturner wrote:
| I went through this about a year ago, to build a low-TDP
| ECC workstation. I do not have stats on failure rates, just
| this anecdotal experience: ASRock and ASUS seem to be the
| boards to get. For RAM, I got two sticks of Samsung
| M391A4G43MB1, and verified ECC operation. The advice I
| remember from the forums was to stick to unbuffered RAM
| (UDIMMs).
| everybodyknows wrote:
| Did you consider any off-the-shelf ECC boxes?
|
| Found some here -- bottom of the EPYC product line starts
| at $2849 ...!
|
| https://www.velocitymicro.com/wizard.php?iid=337
| loeg wrote:
| Yes, the consumer parts only support UDIMMs. If you want
| RDIMMs, you have to pay for EPYC.
| CalChris wrote:
| Yeah, the iMac Pro has the Xeon W and ECC. T'would be nice if
| the Apple Silicon MacBook Pro had it. There's not much of a
| reason to pay for the Pro over the Air. But like Linus, I'm
| going to blame Intel for this situation in the market. Maybe
| Apple will strike out on its own with Apple Silicon but since
| their dominant use case is phones, I'll not hold my breath.
| DCKing wrote:
| Unless something weird happens, the next generation of the
| Apple M-line will use LPDDR5 memory instead of the LPDDR4X
| used in the Apple M1. While it probably won't support
| error-correction _monitoring_, LPDDR5 has built-in error
| correction that silently corrects single-bit flips. That
| alone should be a huge reliability improvement.
|
| LPDDR5 will enable some much needed level of error
| correction in a metric ton of other future SoC designs too.
| I look forward to the future Raspberry Pi with built in
| error correction capabilities.
| rhn_mk1 wrote:
| Doesn't intel make ECC available on the i3 line of CPUs?
| xxs wrote:
| Not any more[0] (the i3-10300). It used to[1] (the i3-9300):
|
| 0: https://ark.intel.com/content/www/us/en/ark/products/199281/...
|
| 1: https://ark.intel.com/content/www/us/en/ark/products/134886/...
| minot wrote:
| I was going to say no, but I just checked, and at least ONE
| latest-generation i3 processor supports ECC:
|
| https://ark.intel.com/content/www/us/en/ark/compare.html?pro...
|
| https://ark.intel.com/content/www/us/en/ark/products/208074/...
|
| The problem is that this is an embedded processor, so it's
| probably not for us:
|
| > Industrial Extended Temp, Embedded Broad Market Extended
| Temp
|
| My understanding is Intel does not support ECC on the
| desktop unless you pay extra.
| hollerith wrote:
| That i3 is for file servers.
| makomk wrote:
| Yeah, that appears to be a BGA-packaged processor
| designed to be permanently soldered to the board of some
| embedded device, not something that you can install in
| your desktop at all. I'm not sure why Intel decided to
| brand their embedded processors with ECC as i3, though I
| suspect the reason this range exists at all is because
| companies were going with competitors like AMD instead
| due to their across-the-board ECC support.
| opencl wrote:
| They used to support ECC in the desktop i3 lineup, current
| gen does not have ECC except in some embedded SKUs.
|
| https://ark.intel.com/content/www/us/en/ark/products/199280/...
| vbezhenar wrote:
| You can find non-Xeons with ECC support, but they are rare
| and usually suited to some kinds of micro servers.
| fortran77 wrote:
| While it's true that Intel only has ECC support on Xeon (and
| several other chips targeted at the embedded market) it's not
| true that ECC is supported well on AMD.
|
| We _only_ use Xeons on developer desktops and production
| machines here, precisely because of ECC. It's about 1 bit
| flip/month/gigabyte. That's too much risk when doing
| something critical for a client.
| loeg wrote:
| > it's not true that ECC is supported well on AMD.
|
| That's an extreme claim. Why do you say so?
| theevilsharpie wrote:
| > it's not true that ECC is supported well on AMD
|
| ECC is supported on most Ryzen models[1], as long as the
| motherboard supports it. In fact, ASUS and ASRock (possibly
| others) have Ryzen motherboards designed for
| workstation/server use where ECC support is specifically
| advertised.
|
| [1] The only exception is the Ryzen CPUs with integrated
| graphics.
| js2 wrote:
| Depends what you mean by supported. Semi-officially:
|
| _ECC is not disabled. It works, but not validated for
| our consumer client platform.
|
| Validated means run it through server/workstation grade
| testing. For the first Ryzen processors, focused on the
| prosumer / gaming market, this feature is enabled and
| working but not validated by AMD. You should not have
| issues creating a whitebox homelab or NAS with ECC memory
| enabled._
|
| https://old.reddit.com/r/Amd/comments/5x4hxu/we_are_amd_crea...
| loeg wrote:
| Your quote is for consumer platforms (Ryzen) only; GP's
| statement was that ECC is not well-supported on AMD _at
| all_ , which is obviously false (EPYC, Threadripper).
| adrian_b wrote:
| Yes, there is a risk of buying a Ryzen CPU with non-
| functional ECC.
|
| However, I use only computers with ECC - previously only
| Xeons, but in recent years I have replaced many of them
| with Ryzens, all of which work OK with ECC memory.
|
| When having to choose between a very small risk of losing
| the price of a CPU and the certainty of using, for many
| years, an Intel CPU with half the speed of the AMD, the
| choice was very obvious for me.
| theevilsharpie wrote:
| AMD may claim not to validate ECC on Ryzen, but it's
| working well enough for major motherboard vendors to
| market Ryzen motherboards with ECC advertised as a
| feature.
|
| ECC support not being "validated," for all practical
| purposes, simply means that board vendors can advertise a
| board lacking ECC support as compatible with AMD's AM4
| platform, without getting a nasty letter from AMD's
| lawyers.
| jeffbee wrote:
| > While it's true that Intel only has ECC support on Xeon
|
| That's not true. There are Core i3, Atom, Celeron, and
| Pentium SKUs with ECC. E.g. the Core i3-9300
|
| https://en.wikichip.org/wiki/intel/core_i3/i3-9300
| lighttower wrote:
| Can you get decent battery life with this ECC memory in a
| laptop?
| dijit wrote:
| Yes. ECC memory uses only marginally more power than non-ECC
| memory, and memory isn't the largest consumer of battery
| life by a country mile.
|
| The screen, Wi-Fi, and to a much lesser extent (unless under
| load) the CPU are the main culprits of low battery life.
| indolering wrote:
| It can actually reduce power consumption, because refresh
| rates don't need to be so high:
|
| https://media-www.micron.com/-/media/client/global/documents...
| hosteur wrote:
| How did you track memory errors across thousands of physical
| machines?
| core-questions wrote:
| https://github.com/netdata/netdata/issues/1508
|
| Looks like `mcelog --client` might be a starting place? Feed
| that into your metrics pipeline and alert on it like anything
| else...
| jeffbee wrote:
| Newer Linux has replaced mcelog with edac-util. I think
| most shops operating systems at that scale are getting
| their ECC errors out of band with the IPMI SEL, though.
| gsvelto wrote:
| It's rasdaemon these days:
| https://www.setphaserstostun.org/posts/monitoring-ecc-memory...
| ikiris wrote:
| The same way you do it with everything else: export the
| telemetry and store it in a time series...
| incrudible wrote:
| When you say bitflips were "common" across thousands of
| physical machines, does that mean you observed thousands of
| bitflips?
|
| Otherwise, I would think that an unlikely event becoming
| 1000x more likely through sheer numbers would have warped
| your perception.
|
| I believe that hardware reliability is mostly irrelevant,
| because software reliability is already far worse. It doesn't
| matter whether a bitflip (unlikely) or some bug (likely) causes
| a node to spuriously fail, what matters is that this failure is
| handled gracefully.
| ikiris wrote:
| It's enough that the graphs can show you solar weather.
|
| I can't give my source, but it's far higher than most
| people think. Just pay the money.
| dkersten wrote:
| Another comment[1] mentioned 1 bitflip per gigabyte per
| month. If you have a lot of RAM, that's rather a lot.
|
| > It doesn't matter whether a bitflip (unlikely) or some bug
| (likely) causes a node to spuriously fail
|
| Except that a bitflip can go undetected. It _may_ crash your
| software or system, but it also may simply leak errors into
| your data, which can be far more catastrophic.
|
| [1] https://news.ycombinator.com/item?id=25623206
| jhasse wrote:
| So can a bug.
| dkersten wrote:
| Yes. And? That doesn't suddenly make bitflips benign.
| incrudible wrote:
| The point is that you can't prevent failure by just
| buying something. You have to deal with the fact that
| failure _can not be prevented_.
|
| In other words, if a single defective DIMM somewhere in
| your deployment is causing catastrophic failure, your
| mistake was not buying the wrong RAM modules. Your
| mistake was relying on a single point of failure for
| mission critical data.
| tyoma wrote:
| It depends where the failure happens. Sometimes you really
| lose the "failure in the wrong place" lottery. For example,
| in a domain name: http://dinaburg.org/bitsquatting.html
| jjeaff wrote:
| Ya, I'm not buying that bitflips are a problem. Or maybe
| modern software can correct for this better? Because I use
| my desktop all day, every day, running tons of software on
| 64 GB of RAM, and I don't get errors or crashes often
| enough to remember ever having one.
| ChrisLomont wrote:
| > I'm not buying that bitflips are a problem.
|
| Google it and read up - it is a problem: it has killed
| people, thrown election results, and much more.
|
| It's such a common problem that bitsquatting is a real
| thing :)
|
| Want to do an experiment? Pick a bitsquatted domain for a
| common site, and see how often you get hits.
|
| https://en.wikipedia.org/wiki/Bitsquatting
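|
| An illustrative bitsquat generator (a sketch; real studies
| also filter for names that can actually be registered):
|     import string
|
|     VALID = set(string.ascii_lowercase + string.digits + "-")
|
|     def bitsquats(domain: str) -> set[str]:
|         name, _, tld = domain.partition(".")
|         out = set()
|         for i, ch in enumerate(name):
|             for bit in range(8):       # flip each bit in turn
|                 c = chr(ord(ch) ^ (1 << bit))
|                 if c in VALID:
|                     out.add(name[:i] + c + name[i + 1:] + "." + tld)
|         out.discard(domain)
|         return out
|
|     print(sorted(bitsquats("example.com"))[:5])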
| incrudible wrote:
| Nobody denies that bitflips _happen_. On the whole, you
| fail to make the case that preventing bitflips is the
| solution to a problem. Bitsquatting is not a real problem;
| it's a curiosity.
|
| As for the case of bitflips killing someone: bitflips are
| not the root cause there. The root cause is that somebody
| engineered something life-critical that mistakenly assumed
| hardware cannot fail. Bitflips are just one of many reasons
| for hardware failure.
| ChrisLomont wrote:
| >Bitflips are not the root cause here.
|
| So those systems didn't fail when a bitflip happened?
|
| > The root cause is that somebody engineered something
| life-critical that mistakenly assumed hardware can not
| fail.
|
| The systems I am aware of were designed with bitflips in
| mind. NO software can handle arbitrary amounts of
| bitflips. ALL software designed to mitigate bitflips only
| lower the odds via various forms of redundancy. (For
| context, I've written code for NASA, written a few
| proposals on making things more radiation hardened, and
| my PhD thesis was on a new class of error correcting
| codes - so I do know a little about making redundant
| software and hardware specifically designed to mitigate
| bitflips).
|
| By claiming a bitflip didn't kick off the problems, and
| trying to push the cause elsewhere, you may as well blame
| all of engineering for making a device that can kill on
| failure.
|
| So your argument is a red herring.
|
| >On the whole, you fail to make a case that preventing
| bitflips is the solution to a problem
|
| Yes, had those bitflips been prevented, or not happened,
| those fatalities would not have happened.
|
| >Ya, I'm not buying that bitflips are a problem.
|
| If bitflips are not a problem then we don't need ECC ram
| (or ECC almost anything!) which is clearly used a lot. So
| bitflips are enough of a problem that a massively
| widespread technology is in place to handle precisely
| that problem.
|
| I guess you've never written a program and watched bits
| flip on computers you control? You should try it - it's a
| good exercise to see how often it does happen.
|
| I guess you define something being a problem differently
| than I or the ECC ram industry do.
| dkersten wrote:
| Crashes aren't such a big problem. You can detect them and
| reboot or whatever. Silent data corruption is the real
| issue IMHO.
|
| See also this comment above:
| https://news.ycombinator.com/item?id=25623764
| adrian_b wrote:
| On a single computer with a large memory, e.g. 32 GB or more,
| the time between errors can be a few months, if you are lucky
| enough to have good modules. Moreover, some of the errors
| will have no effect, if they happen to hit free memory.
|
| Nevertheless, anyone who uses the computer for anything else
| besides games or movie watching, will greatly benefit from
| having ECC memory, because that is the only way to learn when
| the memory modules become defective.
|
| Modern memories have a shorter lifetime than old memories,
| and very frequently they begin to show occasional bit errors
| long before breaking down completely.
|
| Without ECC, you will become aware that a memory module is
| defective only when the computer crashes or no longer boots
| and severe data corruption in your files could have happened
| some months before that.
|
| For myself, this was the most obvious reason why ECC was
| useful, because I was able in several cases to replace memory
| modules that began to have frequent correctable errors, after
| many years with little or no errors, without losing any
| precious data and without downtime.
| ikiris wrote:
| The "good modules" bit is important. I'm told by some
| colleagues that most of the bit flips come from alpha
| particles emitted by the RAM casings, surprisingly enough.
| petermcneeley wrote:
| I would also add that rowhammer attacks are much harder
| against ECC.
|
| When I first tried to replicate the rowhammer attack, I was
| not getting any results. It turned out I was doing this on
| ECC; on non-ECC memory, the same test easily replicated the
| attack.
|
| https://en.wikipedia.org/wiki/Row_hammer
| rahimiali wrote:
| I have trouble extracting information from this rant. Is
| someone willing to translate it into an argument (a string
| of facts tied together by logical steps)?
| mark-r wrote:
| 1. Linux sometimes crashes not due to software errors but
| because of memory glitches.
|
| 2. ECC would prevent those memory glitches.
|
| 3. ECC is hard to find on desktop PCs because Intel uses the
| feature to differentiate desktop CPUs from server CPUs, so
| it can charge more for servers.
|
| 4. Even when someone like AMD makes the feature available,
| the market doesn't have ECC DRAM modules or motherboards
| readily available, because Intel killed the demand for
| them.
| phh wrote:
| I don't know if ECC is that important, but reliability of RAM (or
| any storage) feels pretty crazy to me. 128GB being refreshed
| every second for a month error requires that the per-bit refresh
| process has a reliability of 99.9999999999999999% to be flawless.
| Considering we are dealing with quantum effects (which are
| inherently probabilistic), I wouldn't trust myself to design
| anything like that.
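|
| A quick back-of-envelope check of that figure (a sketch in
| Python; it assumes one refresh per bit per second, as above,
| and asks what per-event error probability still leaves a 50%
| chance of a flawless month):
|
|     import math
|
|     bits = 128 * 2**30 * 8      # 128 GB in bits, ~1.1e12
|     seconds = 30 * 24 * 3600    # one month, ~2.6e6 s
|     events = bits * seconds     # per-bit refreshes, ~2.8e18
|
|     # solve (1 - p) ** events = 0.5 for p
|     p = -math.expm1(math.log(0.5) / events)
|     print(f"{p:.1e}")           # ~2.4e-19, i.e. ~18 nines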
|
| Now back to ECC, I'll probably be corrected, but I don't think
| ECC helps gain more than two orders of magnitude, so we still
| need incredibly reliable RAM. If we move to ECC RAM by default
| everywhere, aren't we simply going to get less reliable RAM at
| the end?
| formerly_proven wrote:
| RAM is not as reliable as you think. Some ECC memory hardly
| ever finds an error, some machines see them at a very
| consistent rate, e.g. 50 errors per TB-day. That would
| translate to 1-2 errors per day in a 32 GB PC. Without ECC you
| cannot know in which bucket you are.
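|
| (The 32 GB figure is just a linear scaling of that rate, e.g.
|
|     print(50 * 32 / 1024)   # 50 err/TB-day at 32 GB: ~1.6/day
|
| which is where "1-2 errors per day" comes from.)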
| trevyn wrote:
| If true, that seems like... a very straightforward bucket to
| test if you're in.
| toast0 wrote:
| The bucket can change over time though. If you want to be
| sure, you need to test often, which gets in the way of
| using the computer.
| bitcharmer wrote:
| A system on Earth, at sea level, with 4 GB of RAM has a 96%
| chance of having a bit error in three days without ECC RAM.
| With ECC RAM, that goes down to 1.67e-10, or about one chance
| in six billion.
|
| So I'd say ECC _is_ not only important but insanely impactful.
| There's a reason why many organizations don't even want to
| hear about getting rigs with non-ECC memory.
| gzalo wrote:
| That number is flawed, and the author did a follow-up with
| better results: http://lambda-diode.com/opinion/ecc-memory-2
|
| "33 to 600 days to get a 96% chance of getting a bit error."
| Still, it seems way too high. I guess anyone with ECC RAM
| could confirm whether they are seeing that sort of
| corrected-error rate?
| mrlala wrote:
| So, I hear what you are saying. But, on the other hand, I
| have been using 2 non-ECC desktops for a workstation/server
| for the past ~6 years.. and I would be hard pressed to come
| up with a single situation where either of the machines
| randomly crashed or applications did anything 'unexpected'
| (to my knowledge, of course).
|
| My point is, when you say there is a "96% chance of having an
| error in THREE DAYS", one would EXPECT to be having issues
| like.. all the time? So I'm not disagreeing with you, but
| with the amount of non-ECC machines all over the world and
| how insanely stable modern machines are, it still seems like
| a very low risk.
|
| Now of course I agree that if you want to take every
| precaution, go ECC, but simple observation suggests that this
| "problem" can't be as bad as the numbers are saying.
| bitcharmer wrote:
| Your questions are perfectly valid. It's just that out of
| all the random bit flips that happen over a period of time
| on a non-ECC platform only a minuscule percentage will
| manifest to you in any noticeable way.
|
| Most will escape your attention.
| johndough wrote:
| I ran a memory test for two weeks straight on a consumer
| laptop with 8 GB RAM and could not get a single bit flip, so
| your mileage may vary.
| bitcharmer wrote:
| How did you run those tests? From what I understand on the
| topic, for your results to be statistically significant you
| need at least hundreds of machines and very rigid testing
| methodology.
| avian wrote:
| As someone who also ran a similar test and hasn't seen a
| bit flip, I'm likewise skeptical of the 96% figure.
|
| I'm too lazy to run the exact numbers right now, but with
| "4 GB, 96% percent chance, three days" as the hypothesis,
| I think you'll find that an experimental result of "8 GB,
| 0% chance, 14 days" is highly statistically significant.
|
| Edit: rough back of napkin estimate - you're not seeing
| an event in roughly 10x trials (2x number of bits and ~5x
| number of days). Given hypothesis is true your
| experimental result has a probability of (1-0.96)^10 =
| very very small. Conclusion: hypothesis is false.
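|
| Or in code, with the same rough model (treating the
| 14-day/8 GB run as ~10 independent 3-day/4 GB trials):
|
|     # chance of zero errors across all trials, if each trial
|     # really had a 96% chance of at least one error
|     print((1 - 0.96) ** 10)   # ~1e-14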
| bitcharmer wrote:
| The 96% figure comes from Google and was obtained in a
| large scale experiment over many months. I've been in
| this business long enough to have witnessed adverse
| effects of cosmic rays on non-ECC memory multiple times
| myself. I don't think your sample gets anywhere near
| statistical significance. Not to mention the testing
| methodology.
| toast0 wrote:
| My anecdotal evidence is far from rigorous, but the
| Google data from ten years ago doesn't match up with my
| experience running thousands of ECC enabled servers up to
| a few years ago. Their rates seem a lot higher than what
| my servers experienced; we would page on any ram errors,
| correctable or not (uncorrectable would halt the machine,
| so we would have to inspect the console to confirm; when
| we knowingly tried machines with uncorrectable errors
| after a halt, they nearly all failed again within 24
| hours, so those we didn't inspect the console of probably
| were counted on their second failure), and while there
| were pages from time to time, it felt like a lot less than
| 8% of the machines having an error per year.
|
| There's a lot of variables that go into RAM errors,
| including manufacturing quality and condition of the ram,
| the dimm, the dimm slot, the motherboard generally, the
| power supply, the wiring, and the temperature of all of
| those. Google was known for cost cutting in their
| servers, especially early on; so I wouldn't be surprised
| if some of that resulted in higher bitflip rate than
| running in commercially available servers. Things like
| running bare motherboards, supported only on the edges
| cause excess strain and can impact resistance and
| capacitance of traces on the board (and in extreme cases,
| break the traces).
| tomxor wrote:
| I like when people back up their claims with numbers, but
| would you mind describing roughly what that 96% probability
| of error is based upon?
|
| I understand altitude has some kind of proportionality to
| cosmic ray exposure, and number of bits will multiply the
| probability of _an_ error. I'm presuming there is also an
| inherent error rate to DRAM separate from environment. But
| what are those numbers?
| bitcharmer wrote:
| Apologies, you're totally right. I should have linked to
| the source:
|
| http://lambda-diode.com/opinion/ecc-memory#:~:text=A%20syste....
| tomxor wrote:
| Great thanks!
|
| [edit]
|
| Looks like the calculation was revised [0] after
| criticism:
|
| > Under these assumptions, you'll have to wait about 33
| to 600 days to get a 96% chance of getting a bit error.
|
| What's more worrying is the variance: the above
| calculation assumes well-behaved DRAM, yet
| some computers just seem to have manufacturing defects
| that make the incidence of errors high enough to be a
| regular problem.
|
| [0] http://lambda-diode.com/opinion/ecc-memory-2
| dejj wrote:
| And even higher in the vicinity of radioactive cattle:
| https://www.jakepoz.com/debugging-behind-the-iron-curtain/
| davidw wrote:
| Could you measure altitude with memory?
| asimpletune wrote:
| That's a very interesting idea, and I think you totally
| could. You run some benchmarks, measure the bit flips, and
| after enough runs you'd be able to say with a degree of
| confidence what your altitude is. I wonder though what
| accuracy could be achieved with this?
| cyberlurker wrote:
| If the 96% every 3 days is true, you could approximate
| based on that. But it would be a really slow measurement.
| tomxor wrote:
| :D yes, although I expect you would need either a
| prohibitively large quantity of memory or an extremely slow
| rate of change in altitude to effectively measure it.
| rafaelturk wrote:
| A little bit offtopic: again it seems that Intel (what?!) is
| the one lowering the bar.
| b0rsuk wrote:
| I browsed some online listings for ECC memory modules, and they
| seem to be sold one module at a time. Standard DDR4 modules are
| sold in pairs, to benefit from dual channel mode.
|
| Does ECC memory support dual channel?
| KingMachiavelli wrote:
| Is there such a thing as 'software' ECC where a segment in memory
| also has a checksum stored in memory and the CPU just verifies it
| when the memory segment is accessed?
|
| It would be a lot slower than real ECC but it could just be used
| for operations that would be especially vulnerable to bit flips.
| It would also not know for certain if the memory segment of data
| or the memory segment holding the checksum was corrupted besides
| their relative sizes (checksum is much smaller so more unlikely
| to have had a bit flip in its memory region).
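|
| Something in that spirit is easy to sketch (detection only;
| correction would need a real code such as Hamming). A toy
| Python version, with the class and names invented for
| illustration, and with the time-of-check/time-of-use hole
| discussed elsewhere in this thread:
|
|     import zlib
|
|     class CheckedBuffer:
|         """Toy software 'ECC': a CRC kept beside the data,
|         verified on every access. Detects, never corrects."""
|
|         def __init__(self, data: bytes):
|             self._data = bytearray(data)
|             self._crc = zlib.crc32(self._data)
|
|         def read(self) -> bytes:
|             if zlib.crc32(self._data) != self._crc:
|                 raise RuntimeError("bit flip detected")
|             return bytes(self._data)
|
|         def write(self, data: bytes) -> None:
|             self._data = bytearray(data)
|             self._crc = zlib.crc32(self._data)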
| a1369209993 wrote:
| Actually... there _is_ a word of memory that you already have
| to _read_ every time you access a region of memory: the page
| table entry for that region. If you have 64-byte cache lines,
| that's 64 lines per (4KB) page, so you could load a second
| 64-bit word from the page table[0], and use that as a parity
| bit for each cache line, storing it back on write the same way
| you store active and dirty bits in the PTE proper. Actual
| E[_correcting_]C would require inflating the effective PTEs
| from 8 (original) or 16 (with parity) bytes to about 64 (7 bits
| per line, insufficient) or 128 (15, excessive), which is
| probably untenable,
| but you could at least get parity checks this way.
|
| There's also the obvious tactic of just storing every logical
| 64-bit word as 128 bits of physical memory, which gives you
| room for all kinds of crap[1], at the expense of halving your
| effective memory and memory bandwidth.
|
| 0: This is extremely cheap since you're loading a 128- vs
| 64-bit value, with no extra round trip time, and it still fits
| in a cache line, so you're likely just paying extra memory use
| from larger page tables.
|
| 1: Offhand, I think you could fit triple or even quadruple
| error _correction_ into that kind of space (there's room for
| _eight_ layers of SECDED, but I don't remember how well bit-
| level ECC scales).
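|
| The bookkeeping is simple enough to model in a few lines (a
| toy model of the proposal above, nothing like a real MMU):
|
|     def line_parity(line: bytes) -> int:
|         # even parity of one 64-byte cache line
|         x = int.from_bytes(line, "little")
|         return bin(x).count("1") & 1
|
|     def page_parity_word(page: bytes) -> int:
|         # one parity bit per line of a 4KB page -> one 64-bit
|         # word, i.e. the hypothetical second PTE word above
|         assert len(page) == 4096
|         return sum(line_parity(page[i*64:(i+1)*64]) << i
|                    for i in range(64))
|
|     page = bytearray(4096)
|     w = page_parity_word(page)
|     page[100] ^= 0x01              # single bit flip in line 1
|     print(hex(w ^ page_parity_word(page)))   # 0x2: line 1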
| temac wrote:
| Intel has some recent patents on that.
| zdw wrote:
| Good news is that for DDR5, ECC is a required part of the spec
| and should be a feature of every module:
|
| https://www.anandtech.com/show/15912/ddr5-specification-rele...
| [deleted]
| rajesh-s wrote:
| A whitepaper on DDR4 ECC by Micron that goes over some of the
| implementation challenges
|
| https://media-www.micron.com/-/media/client/global/documents...
| toast0 wrote:
| On die ECC is great for increasing reliability, if all else is
| equal, but if it doesn't report to the memory controller, and
| if the memory controller doesn't report to the OS, I think it
| will be worse than status quo, because all else won't be equal.
| With no feedback, systems are going to continue to run on the
| edge, but now detectable failures will all be multi-bit,
| because single-bit errors are hidden.
| cududa wrote:
| Huh? Why would the memory controller not be updated
| accordingly? Also I have no idea about Linux or Mac, but
| Windows has had ECC support and active management for
| decades?
| indolering wrote:
| It's part of the firmware first trend of fixing things at
| the firmware level before reporting problems up the stack.
| This makes it a real nightmare for systems integrators to
| do root cause analysis.
| mlyle wrote:
| Normally, ECC has meant just the DIMM stores some extra
| bits, and the memory controller itself implements ECC--
| writing the extra parity, and recovering when errors emerge
| (and halting when non-recoverable errors happen).
|
| DDR5 includes on-die ECC, where the RAM fixes the errors
| before sending them over the memory bus.
|
| This means if the bus between the processor and ram
| corrupts the bits-- tough luck, they're still corrupted.
| And it's unclear whether we're going to get the quality of
| memory error reporting that we're used to or get the
| desired halt-on-non-recoverable error behavior (I've not
| been able to obtain/read the DDR5 specification as yet).
| cududa wrote:
| Thank you!
| [deleted]
| hinkley wrote:
| Is it built in as an added feature, or as the only way to make
| DDR5 reliable? My inner cynic is screaming the latter.
|
| When the value add feature becomes a necessity, it's not a
| value add any more.
| CoolGuySteve wrote:
| I always wondered why ECC isn't built into the memory
| controller; the same hardware that runs the bus into L3 or the
| page mapper could checksum groups of cachelines.
|
| It seems redundant to have every module come with its own
| checking hardware.
| p_l wrote:
| ECC is a function of memory controller, not memory, on
| current systems. There's also usually some form of ECC on
| whatever passes for system bus, and internal caches have ECC
| as well.
|
| For the memory controller, parity/ECC/chipkill/RAIM usually
| involves simply adding additional memory planes to store
| correction data. I believe the rare exceptions are fully
| buffered memories where you have effectively separate memory
| controller on each module (or add-in card with DIMMs)
| kasabali wrote:
| AFAIK it is built into the memory controller, at least for
| ECC UDIMM. There's an extra DRAM chip on the module for
| parity (generally 8+1), but it is the memory controller's
| responsibility to utilize it (that's why not all CPUs
| support ECC).
| bradfa wrote:
| I read it to say that on die ecc is recommended but that dimm-
| wide ecc is still optional.
|
| And now you have 8 bits of ecc per 32 data versus older DDR
| having 8 bits of ecc per 64 data. Hence the cost for dimm-wide
| ecc is going up.
| cbanek wrote:
| As someone who has had to read thousands of random game crash
| reports from all over the interwebs (you know when Windows says
| you might want to send that crash log? like that), I totally
| agree.
|
| Of all the things to be worried about, like OS bugs, bad hardware
| configuration, etc. bad memory is one of those really troubling
| things. You look at the code and say "it can't make it here,
| because this was set" but when you can't trust your memory you
| can't trust anything.
|
| And as the timeline goes to infinity, you may also get one of
| these reports and be asked to fix it... good luck.
| lighttower wrote:
| Someone reads those reports!?! Wow, how do I write them to
| ensure someone who reads them takes them seriously?
| apankrat wrote:
| Aye. I have an assert in the code that fronts a _very_ pedantic
| test of the context. In all cases when this assert was tripped
| (and reported) an overnight memtest86 test surfaced RAM issues.
|
| - Edit -
|
| Also, bit flips in the non-ECC memory are _the_ cause of the
| "bitrot" phenomenon. That is when you write out X to a storage
| device, but you get Y when you read it back. A common
| explanation is that the corruption happens _at rest_. However
| all drives from the last 30+ years have FEC support, so in
| reality the only way a bit rot can happen is if the data is
| damaged _in transit_, while in RAM, on the way to/from the
| storage media.
|
| So, if you're ever deciding whether to get ECC RAM, get it.
| It's very much worth it.
| pkaye wrote:
| I wonder how much of those crashes are due to gamers
| aggressively overclocking their systems?
| faitswulff wrote:
| Do the crash reports include whether the machine has ECC
| memory?
| jackric wrote:
| Do the crash reports include recent solar activity?
| cbanek wrote:
| Well, I've had to actually worry about radiation bitflips
| as well. It does happen. But usually not so much on Earth!
| dharmab wrote:
| I once got to tell a CTO the reason our shiny new point
| to point connection was suddenly trash was due to solar
| flares.
| jgalentine007 wrote:
| One of the tire pressure sensors in my car tires had a
| bit flip a couple years ago and I had to reprogram its
| ID. Luckily it was a Subaru, so only a light came on in
| the dash.
|
| My old Honda CRV however would turn traction control on
| if your pressure was low - which worked by applying
| brakes to wheels that were slipping. If you were going up
| a slippery hill you would soon have no power, sliding
| backwards nearly off the road in nowhere West Virginia on
| the way to a ski resort.
| jjeaff wrote:
| How in the world would you ever know that problem was
| caused by a bit flip and not just one of the countless
| other reasons that a sensor could fail?
| jgalentine007 wrote:
| I have a TPMS programming tool (ATEQ QuickSet) and reader
| (Autel TS401), because I like to swap my winter / summer
| tires on my own. The TPMS light came on one day and
| inflating tires didn't help - I used the reader and found
| that one sensor's ID had changed. When I compared the ID
| (it was in hex) to the last programming - it was a single
| bit off. I couldn't reprogram the sensor itself, but I
| was able to update the ECU with the changed ID using the
| ATEQ.
|
| I live in Denver but spend a lot of time skiing around
| 11k feet, maybe the higher elevation means more
| radiation.
| dharmab wrote:
| Similar story, we saw that one particular IP address in a
| public cloud network had a 3% TLS handshake error rate.
| We diverted traffic and then analyzed with wireshark. We
| found one particular bit was being pulled low (i.e. 0 ->
| 0 and 1 -> 0). HTTP connections didn't notice but TLS
| checksum verifications would randomly fail. Had a hell of
| a time convincing the cloud provider they had a hardware
| fault- turned out to be a bug which disabled ECC on some
| of their hardware.
|
| Aside: I'm surprised you got a TPMS programming tool
| instead of a set of steelies. Big wheels? Multiple winter
| vehicles?
| jgalentine007 wrote:
| I have 2 cars. I like the TPMS to work since I've had 3
| nails in tires in 4 years (newer construction area). Also
| the TPMS light in my impreza is almost as bright as the
| sun.
| jacquesm wrote:
| Timestamp + location should be enough to figure that out.
| ant6n wrote:
| It would be interesting to see whether there is a
| correlation between solar activity and game crashes --
| which in turn may provide an indication whether crashes
| are due to bugs or bit flips.
| Triv888 wrote:
| most gaming desktops don't use ECC RAM anyways (at least
| those from a few years ago)
| jacquesm wrote:
| On intel consumer boxes it is pretty safe to assume that they
| don't, on AMD it might be the case but it usually isn't.
| Springcleaning wrote:
| Worse than a game crash is losing your data.
|
| It is incomprehensible that there are still NAS devices being
| sold without ECC support.
|
| Synology took a step in the right direction to offer prosumer
| devices with ECC but it is not really advertised as such. It is
| actually difficult to find which do have ECC and which ones
| don't.
| ksec wrote:
| >Synology took a step in the right direction to offer
| prosumer devices with ECC
|
| I just looked it up, because if it were true it would have
| been news to me. Synology has been known to be stingy with
| hardware specs, but none of what I'd call prosumer, the Plus
| series, have ECC memory by default. And there are "Value" and
| "J" series below that.
|
| Edit: Only two models from the new xx21 series, using the AMD
| Ryzen V, have ECC memory by default.
| BlueTemplar wrote:
| Yeah, here's one example among many more:
|
| https://forums.factorio.com/viewtopic.php?p=405060#p405060
| dboreham wrote:
| You don't need to look at kernel crashes to speculate about bus
| and memory errors -- just check the logs on a few systems that do
| have ecc. Pretty soon you'll see correctable errors being
| reported.
| maddyboo wrote:
| I don't know much about this topic, but is it possible that ECC
| memory is more prone to single bit errors than non-ECC memory
| because there is less pressure on companies to minimize such
| errors? If this were the case, it would skew the data.
| belzebalex wrote:
| I asked myself: would it be possible to build a Geiger counter
| with RAM?
| johnklos wrote:
| From the fortune database:
|
| As far as we know, our computer has never had an undetected
| error. -- Weisert
| otterley wrote:
| D. J. Bernstein (of qmail/daemontools fame) spoke of it over a
| decade ago as well. https://cr.yp.to/hardware/ecc.html
| slim wrote:
| these days he's more famous for the NaCl crypto library
| loup-vaillant wrote:
| For which bit flips are even more relevant: EdDSA has this
| nasty tendency of leaking the private key if the wrong bits
| are flipped (there are papers on fault injection attacks).
| People who sign lots of stuff all the time, say _Let's
| Encrypt_, could conceivably gain some peace of mind with ECC.
|
| _(Note: EdDSA is still much much better than ECDSA, most
| notably because it's easier to implement correctly.)_
| 1996 wrote:
| Linus is absolutely right.
|
| I am trying to get a laptop with dual NVMe (for ZFS) and ECC RAM.
| I can't get that, at all - even without the other fancy things I
| would like such as a 4k OLED with pen/touchscreen.
|
| In 2020, even the Dell XPS stopped shipping OLED (goodbye dear
| 7390!)
|
| I will gladly give my money to anyone who sells AMD laptop with
| ECC. Hopefully, it will show there's demand for "high end yet non
| bulky laptops"
| miahi wrote:
| Lenovo P53 has 3 NVMe slots, 4k OLED with touchscreen (and
| optional pen) and up to 128GB ECC RAM if you choose the Xeon
| processor. It's big and heavy, but it exists.
|
| I hope AMD will create a better market for the ECC laptop
| memory (right now it's hard to find + expensive).
| 1996 wrote:
| I know- I had my eye on this very model, as you can even add
| a mSata on the WWAN slot to get a 4th drive.
|
| Unfortunately, Lenovo is not selling the P53 anymore, which
| is exactly why I say I can't get that even in a "bulky"
| version.
| otterley wrote:
| About 1/3 of Google's machines and 8% of Google's DIMMs in
| their fleet suffer at least one correctable memory error per
| year:
| http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
| jjeaff wrote:
| Which means, assuming Google is running very large machines
| with lots of memory, that one might expect a single
| correctable error once every 6-10 years on your average
| workstation or small server. That's generously assuming your
| workstation has 1/3 as much memory as the average Google
| server.
| Nebasuke wrote:
| Google does not use very large or even large machines for
| most of their fleet. You can quickly see in the paper this is
| for 1, 2, and 4 GB RAM machines (in 2006-2008).
| mauri870 wrote:
| In case the page is not loading, refer to the Wayback
| Machine[1] for a copy.
|
| [1]
| https://web.archive.org/web/*/https://www.realworldtech.com/...
| JumpCrisscross wrote:
| What is the status of ECC on Macs?
| CalChris wrote:
| The iMac Pro, which has a Xeon W. There's a good chance that
| will go away with the new Apple Silicon iMac Pro due out this
| year. MacRumors' roundup article doesn't mention ECC.
|
| https://www.macrumors.com/roundup/imac/
| MAXPOOL wrote:
| Well shit.
|
| I run some large ML models on my home PC and I get NaNs and
| some out-of-range floats every month or so. I have spent hours
| debugging, but doing the same computation with the same random
| seeds does not recreate the problem.
|
| How about GPUs and their GDDR SDRAM? Do they have parity bits?
| layer8 wrote:
| Some pro-level Nvidia GPUs have ECC RAM, they are very
| expensive though. I don't think regular gaming GPUs have
| parity, due to the extra cost, performance impact (probably
| minor but measurable) and irrelevance for gaming.
| vbezhenar wrote:
| Cheap pro-level GPUs don't have ECC RAM either. And it's not
| easy to find out; the information might be buried somewhere.
| [deleted]
| JoeAltmaier wrote:
| ECC works if done right. Accessing a memory location can fix bit-
| flips (ECC is a 'correcting' code). But systems that don't
| regularly visit every memory location, can accumulate risk. Those
| dark corners of RAM can eventually get double-bit errors and be
| uncorrectable. So an OS might 'wash' RAM during idle moments,
| reading every location in a round-robin manner to get ECC to kick
| in and auto-correct. Doesn't matter how fast (1M every hour or
| whatever) as long as somehow ECC has a chance to work.
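|
| The access pattern is trivial; a sketch (illustrative only --
| real scrubbing is the "patrol scrub" done by the memory
| controller, as a reply below notes, not application code):
|
|     import time
|
|     def scrub(memory: bytearray, chunk: int = 1 << 20):
|         """Touch every location round-robin so the hardware
|         ECC check on each read can fix lone bit flips before
|         a second flip makes a word uncorrectable."""
|         pos = 0
|         while True:
|             _ = sum(memory[pos:pos + chunk])  # read is the point
|             pos = (pos + chunk) % len(memory)
|             time.sleep(1.0)   # pace barely matters, as above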
| jacquesm wrote:
| Interesting, similar to scrubbing raid arrays. How often do
| those double bitflips appear though? You'd have to have a
| pretty long running server for that to be a problem, no?
| jeffbee wrote:
| According to Google's old paper on the subject, about 1% of
| their machines suffered from an uncorrectable (i.e. multi-
| bit) error in a year.
| temac wrote:
| The RAM already needs to be refreshed and IIRC it is done by
| the memory controller when not in sleep mode.
|
| However I don't remember if there are provisions for ECC
| checking in case there are some dedicated refresh commands. I
| hope so, but I'm not sure.
| musingsole wrote:
| A double-bit error in many cases is fine. If the error is at
| least detectable at the time of a read, your protection worked.
| What's scary is a triple-flip event. Most of those will still
| look like corrupted data, but if it happens to flip into
| looking like a fixable, single-bit error, you're out of luck
| and won't even know it.
| a1369209993 wrote:
| > Most of those will still look like corrupted data,
|
| Not if you're using a typical 72-bit SECDED code[0].
|
| You have two error indicators: a summary parity bit (even
| number of errors: 0,2,etc vs odd number of errors: 1,etc),
| and an error index: 0 for no errors, or the bitwise xor of
| the locations of each bit error.
|
| For a triple error at bits a, b, and c, you'll have summary
| parity of 1 (odd number of errors, assumed to be 1), and an
| error index of a^b^c, in the range 0..127, of which 0..71[1]
| (56.25%, a clear albeit not overwhelming majority) will
| correspond to legitimate single-bit errors.
|
| 0: https://en.wikipedia.org/wiki/Hamming_code#Hamming_codes_w
| it...
|
| 1: or 72 out of 128 anyway; the active bits might not all be
| assigned contiguous indexes starting from zero, but it
| doesn't change the probability and it's simpler to analyse if
| summary is bit 0 and index bit i is substrate bit 2^i.
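|
| That majority is easy to sanity-check with a toy model (this
| just gives the 72 live bits random distinct syndromes out of
| the 128 possible, rather than any real code's layout):
|
|     import random
|     from itertools import combinations
|
|     random.seed(0)
|     live = random.sample(range(128), 72)
|     in_live = set(live)
|     trials = list(combinations(live, 3))
|     hits = sum((a ^ b ^ c) in in_live for a, b, c in trials)
|     print(hits / len(trials))   # ~0.55, near 72/128 = 56.25%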
| electricshampo1 wrote:
| Patrol scrub is basically this:
| https://www.intel.com/content/dam/www/public/us/en/documents...
| It is built into the memory controller; no OS involvement is
| needed.
| electricshampo1 wrote:
| working link:
|
| https://www.intel.com/content/dam/www/public/us/en/documents.
| ..
| wagslane wrote:
| It really does. I did a write-up recently on it as I was diving
| in and understanding the benefits:
| https://qvault.io/2020/09/17/very-basic-intro-to-elliptic-cu...
| avianes wrote:
| Be careful not to confuse ECC memory with ECC encryption.
|
| ECC memory = memory with Error-Correcting Code
|
| ECC encryption = Elliptic Curve Cryptography
| _0ffh wrote:
| Please someone correct me if I'm wrong, but as far as I can
| remember memory with extra capacity for error detection used to
| be a rather common thing on early PCs. That really only changed a
| couple of decades in, in order to be able to offer lower prices
| to home users who didn't know or care about the difference.
| Probably about the time, or earlier, when with some hard disk
| manufacturers megabytes suddenly shrunk to 10^6 bytes (before
| kibibytes or mebibytes where a thing, btw).
| wmf wrote:
| Yes, PCs used to use parity memory.
| musingsole wrote:
| It's a shame we don't have ECC for individuals. How many of
| society's bugs come from someone wandering around with a bit
| flipped?
| ratiolat wrote:
| I have:
|
|     Asus PRIME A520M-K motherboard
|     2x M391A2K43DB1-CVF (Samsung 16GiB ECC Unbuffered RAM)
|     AMD Ryzen 5 3600
|
| I specifically was looking for bang for buck, low(er) wattage
| and ECC.
| IanCutress wrote:
| Those AMD motherboards with consumer CPUs are a bit iffy. They
| run ECC memory, but it's hard to tell if it is running in ECC
| mode. Even some of the tools that check whether ECC is running
| will say it is when it isn't, because the motherboard reports
| that it is. ECC isn't a qualified metric on the consumer
| boards, hence all the confusion.
| linsomniac wrote:
| This reminds me of last year we ordered a new $14K server, it
| arrived and we ran it through our burn-in process which included
| running memtest86 on it, and it would, after around 7 hours,
| generate errors.
|
| Support was only interested if their built-in memory tester,
| which even on its most thorough setting would only run for ~3
| hours, would show errors, which it wouldn't. IIRC, the BMC was
| logging "correctable memory errors", but I may be
| misremembering that.
|
| "We've run this test on every server we've gotten from you,
| including several others that were exactly the same config as
| this, this is the only one that's ever thrown errors". Usually
| support is really great, but they really didn't care in this
| case.
|
| We finally contacted sales. "Uh, how long do we have to return
| this server for a refund?" All of a sudden support was willing to
| ship us out a replacement memory module (memtest86 identified
| which slot was having the problem), which resolved the problem.
|
| They were all too willing to have us go to production relying on
| ECC to handle the memory error.
| FartyMcFarter wrote:
| Does anyone know why ECC memory requires the CPU to support it?
|
| Naively, I can understand why error _reporting_ has dependencies
| on other parts of the system, but it would seem possible for
| error _correction_ to work transparently.
| TomVDB wrote:
| I think the memory just provides additional storage bits to
| detect the issue, but doesn't contain the logic.
|
| This is in line with all technical parameters of DRAM:
| everything must be as cheap as possible, and all the difficult
| parts are moved to the memory controller.
|
| Which is the right thing to do, because you can share one
| memory controller with multiple DRAM chips.
| wmf wrote:
| Historically the detection and correction is performed in the
| memory controller not the DRAM.
| toast0 wrote:
| As implemented today, ECC is a feature of the memory
| controller. You need special ram, because instead of 8 parallel
| rams per bank, you need 9, and all the extra data lines to go
| to the controller.
|
| Modern CPUs have integrated memory controllers, so that's why
| the CPU needs to support it.
|
| Correction without reporting isn't great; anyway, you _need_ a
| reporting mechanism for uncorrectable errors, or all you've
| done is ensure any memory errors you do experience are worse.
| nix23 wrote:
| I always have that conversation when ZFS comes up. Some people
| think ZFS NEEDS ECC, but in fact ZFS needs ECC as much as
| every single FS on Linux does. And every single reliable
| machine needs ECC.
| paulie_a wrote:
| There was a great defcon talk a while back regarding using ECC.
| The concept was called "dns jitter"
|
| Basically you can register domains using small bit differences
| for domains and start getting email and such for that domain
|
| If I recall correctly the example given was a variation of
| microsoft.com
|
| All because so much equipment doesn't use ECC
| zx2c4 wrote:
| Voila:
| http://media.blackhat.com/bh-us-11/Dinaburg/BH_US_11_Dinabur...
| tyoma wrote:
| There were some great follow up talks as well! It turns out a
| viable attack vector was also MX records. And there was the
| guy who registered kremlin.re ( versus kremlin.ru ).
| jeffbee wrote:
| miclosoft.com is only a few flipped bits away from
| microsoft.com. Used to see these problems all the time when I
| worked on gmail.
|
| At Google even with ECC everywhere there wasn't enough
| systematic error detection and correction to prevent the global
| database of monitoring metrics from filling up with garbage.
| /rpc/server/count was supposed to exist but also in there would
| be /lpc/server/count and /rpc/sdrver/count and every other
| thing. Reminded me daily of the terrors of flipped bits.
| [deleted]
| louwrentius wrote:
| ECC matters, even on the desktop, it's not even a discussion, to
| me.
|
| If you think it doesn't matter: how do you know? If you don't run
| with ECC memory, you'll never know if memory was corrupted (and
| recovered).
|
| That blue screen, that sudden reboot, that program crashing. That
| corrupted picture of your kid.
|
| Who knows.
|
| I'll tell you who knows: God damn every sysadmin (or the
| modern equivalent) can tell you how often they get ECC errors.
| And at even a small scale you'll encounter them. I have, on
| servers and even on a SAN storage controller, for crying out
| loud.
|
| If you care about your data, use ECC memory in your computers.
| supernovae wrote:
| I've got nearly 30 years of experience and not once has
| non-ECC memory led to corruption. Maybe a crash, maybe a
| panic, maybe a kernel dump...
|
| But.. in all my time operating servers over 3 decades, it's
| always been bad drivers, bad code and problematic hardware
| that's caused most of my headaches.
|
| Have I seen ECC error correction in logs? Yeah. I don't
| advocate against it, but I've found for most people you design
| around multiple failure scenarios more than you design around
| preventing specific ones.
|
| Take the average web app - you run it on 10 commodity systems
| and distribute the load.. if one crashes, so what. Chances are,
| a node will crash for many more reasons other than memory
| issues.
|
| If you have an app that requires massive amounts of RAM or you
| do put all of your eggs in one basket, then ECC makes sense...
|
| I just know i like going horizontal and I avoid vertical
| monoliths.
| louwrentius wrote:
| The problem with memory corruption is not just crashes, those
| are the more benign outcomes.
|
| The real killer is data corruption. How would you even begin
| to know that data is corrupted until it is too late?
| ajnin wrote:
| > I've got nearly 30 years of experience and not once has non
| ECC memory lead to corruption
|
| How do you know?
| ptx wrote:
| > if one crashes, so what
|
| Crashes might not matter, but silent data corruption does.
| The owner/user of that data will care when they eventually
| discover that it at some point mysteriously got corrupted.
| alkonaut wrote:
| I know what it does, but I still don't care (so long as it
| costs money or even 1% performance).
|
| It's a tradeoff between money/performance and the frequency of
| crashes, corruption etc.
|
| Bit rot is just one of many threats to my data. Backups take
| care of that as well as other threats like theft, fire,
| accidental deletion.
|
| This is similar to my reasoning around the recent side channel
| attacks on intel CPUs. If I had a choice I'd like to run with
| max performance without the security fixes even though it would
| be less secure. Not because I don't care about security but
| because 1% or 5% perf is a lot and I'd rather simply avoid
| doing anything security critical on the machine entirely than
| take that hit.
| louwrentius wrote:
| > Bit rot is just one of many threats to my data. Backups
| take care of that as well as other threats like theft, fire,
| accidental deletion.
|
| No, that's the big mistake people make: backups just back up
| bit-rotted data, until it is too late and the last good
| version is rotated out and lost forever.
| alkonaut wrote:
| I'm aware. But the risk is extremely small (and 99.9% of
| important data is not created on the machine but goes
| directly from e.g iOS camera to backup).
|
| My desktop machine is basically a gaming rig with
| disposable data. Hence the "performance over integrity".
|
| I also never rotate anything out. Every version of
| everything is in the backups. Storage is that cheap these
| days.
| mark-r wrote:
| Backups can't fix what was already corrupted when it was
| written to disk.
| kensai wrote:
| "ECC availability matters a lot - exactly because Intel has been
| instrumental in killing the whole ECC industry with it's horribly
| bad market segmentation."
|
| Its.
|
| There, I finally corrected Linus Torvalds in something. :))
| hugey010 wrote:
| He uses "do do" instead of "to do" which is a more obvious
| typo. Linus usually comes across as borderline arrogant, and
| deservedly so, but not necessarily perfect in his writing. I
| think it's an effective strategy to communicate his priorities
| and wrangle smart but easily intimidated folk "do do" what he
| believes is right!
| mark-r wrote:
| I have a simple way of remembering when to leave out the
| apostrophe. His, hers, its are all possessive and none of them
| have an apostrophe.
| Glanford wrote:
| In this particular case 'it's' can also be possessive
| although it's considered non-standard, so to be correct you
| can always treat it like a contraction of 'it is'.
| raverbashing wrote:
| Yeah, I'm always annoyed with this kind of mistake. Especially
| as non-native speakers should know better than the native ones
| (who usually don't give a f.).
|
| Now the point about internally doing ECC is an interesting one,
| could be a way out of this mess. And apparently ECC is more
| available in AMD land
| tssva wrote:
| The really annoying thing is that auto correct on mobile
| device keyboards will often want to incorrectly change "its"
| to "it's" or vice versa.
| raverbashing wrote:
| Yes, auto-corrects compound the problem.
| simias wrote:
| For a 2nd language speaker making these homophonic mistakes
| is actually a sign of fluency. It means that you just
| transcribe a mental flow of words instead of consciously
| constructing the language.
|
| The first time I wrote "your" instead of "you're" in English
| I thought it was quite a milestone!
| raverbashing wrote:
| > For a 2nd language speaker making these homophonic
| mistakes is actually a sign of fluency.
|
| I kinda disagree, because while the homophony works in
| (spoken) English, in writing it sticks out like a sore thumb.
| So yeah, you will make that mistake if you have only heard the
| word but don't know the written form.
|
| (And in their native language it's probably two unrelated
| words, so that might intensify the feeling of wrongness)
| simias wrote:
| I mean, my native language is French where "your" is
| "ton" and "you're" is "tu es", yet it (rarely) happens
| that I mix them up in English. If I proofread I'll spot
| it almost every single time, but if I'm just typing my
| "stream of consciousness" my brain's speech-to-text
| module sometimes messes up.
| leetcrew wrote:
| meh, plenty of (intelligent!) native english speakers do
| not know all the canonical grammar rules. english
| contains a lot of what could be considered error
| correction bits, so it doesn't usually impede
| understanding. syntactically perfect english with
| weird/misused idioms (common among non-native speakers
| with lots of formal education) is harder to understand in
| my experience. I imagine this is true of most natural
| languages.
| protomolecule wrote:
| For what it's worth, as a non-native speaker I too started
| making this kind of error when my English became fluent
| enough.
| [deleted]
| andi999 wrote:
| Yes. I noticed this. When I was younger, I thought: how can
| you mix up "their, they're, there"? People who do this must be
| the opposite of smart. This lasted for 4 years of living in an
| English-speaking country....
| harperlee wrote:
| As an "english as a second language" user, I can't see
| myself writing e.g. "should of" instead of "should have",
| however fluent I am. I think you don't make that kind of
| typo unless you have learnt english before grammar.
| simias wrote:
| I also wouldn't do this one, but that's because in my
| English accent I simply wouldn't pronounce them the same
| way. Also the word sequence "should of" is extremely
| uncommon in proper English, so it catches the eye more
| easily I think.
|
| "You're/your", "their/they're", "its/it's" and the like
| are a different story, because I do pronounce those the
| same and they're all very common.
| lolc wrote:
| I was quite surprised when it started happening to me.
| harperlee wrote:
| Wow that's interesting!
| young_unixer wrote:
| I've realized that when I'm engaged in the writing (angry or
| emotional in some way) I tend to commit more of these
| mistakes, even though I know the difference between "it's"
| and "its". Linus is always angry, so that probably makes him
| commit more orthographic mistakes.
| touisteur wrote:
| I think it's available for consumer SKUs on AMD and not just
| for servers like in "Xeon-land"... How I've wanted an
| ECC-ready NUC...
| jeffbee wrote:
| The AMD parts all have the ECC feature but the platform
| support outside of EPYC may as well not exist. Most
| motherboards for the Ryzen segment don't do it properly or
| don't do it at all, some support it but aren't capable of
| reporting events to the operating system which is dumb.
| Ryzen laptops don't have it either.
|
| Closest you can come to a nuc with ecc is I think a mini
| server equipped with one of the four-core i3 parts that
| have ecc.
| erkkie wrote:
| Probably not what you meant, but
| https://ark.intel.com/content/www/us/en/ark/products/190108/...
| has support for Xeon (and ECC). Now how to actually
| practically source 32GB ECC-enabled SO-DIMM sticks...
| africanboy wrote:
| As a non-native speaker, my phone has both the Italian and
| English dictionaries; when I write "its" it always
| auto-corrects to "it's" as soon as I hit space, and sometimes
| it goes unnoticed.
| phkahler wrote:
| >> But is ECC more available in AMD land?
|
| Yes it is. The problem is they don't really advertise it. I'm
| not certain, but it might even be standard on AMD chips; but
| if they don't say so and board makers are also unclear, who
| knows...
| ethbr0 wrote:
| It's a market size problem.
|
| For consumer motherboard OEMs, only AMD effectively has ECC
| support (Intel's has been so spotty and haphazard from
| product to product), and of AMD users, only a small number
| care about ECC.
|
| So motherboard companies, being resource and time-starved
| as they are, don't make it a priority to address such a
| small user-base.
|
| If Intel started shipping ECC on everything, it would go a
| long way towards shifting the market.
| [deleted]
| jacquesm wrote:
| How is your Finnish?
| jankeymeulen wrote:
| Or Swedish for that matter, as I believe Torvalds' mother
| tongue is Swedish.
| [deleted]
| jacquesm wrote:
| Finnish is stupendously hard. Far harder than Swedish, at
| least, by my estimation.
| dancek wrote:
| Yes. Swedish is also easy compared to English and French,
| the other two languages I've learned after early
| childhood. The only thing that makes it hard is that you
| never really have use for it and you're forced to learn
| it nevertheless here in Finland.
|
| I'm happy to see people here on HN respect the difficulty
| of learning languages. Most foreigners that speak Finnish
| do it very poorly at first and even after decades they
| still sound like foreigners. But it shows huge respect to
| our small country for someone to make the effort, and we
| really appreciate it. I'm hoping other people see
| learning their own mother tongue the same way. Sure, most
| of us need English, but learning it _well_ is still a
| huge task.
| dehrmann wrote:
| It is. Swedish and English are both Germanic languages,
| so there are a lot of commonalities. Finnish is in a
| completely different language family. English and Swedish
| are more closely related to Persian and Hindi than to
| Finnish.
| young_unixer wrote:
| Yes. https://www.youtube.com/watch?v=0rL-0LAy04E
| Igelau wrote:
| It could use some Polish.
| jacquesm wrote:
| Dobrze ;)
| xxs wrote:
| Linus must have English as his "1st" language now. For a
| non-native speaker, mistakes like "it's vs its", "than vs
| then", etc. are pretty uncommon.
| Tade0 wrote:
| I guess this is what happens when someone first learns to
| _speak_ the language, learning how to write in it only
| later on - as is often the case with children.
|
| I spent my preschool years in a multicultural environment
| and English was our _lingua franca_ (ironically the school-
| mandated language was French), so I didn't properly learn
| contractions until grade school - same with similarly
| sounding words like "than vs then" and "your vs you're".
| jacquesm wrote:
| I've spent my whole life speaking multiple languages and
| this still trips me up every now and then, in fact quotes
| as such are a problem for me and I keep using them wrong,
| no idea why, it just won't register. So unless I slow down
| to 1/10th of my normal writing speed I will definitely make
| mistakes like that. Good thing we have proofreaders :)
| dehrmann wrote:
| (guessing you mean apostrophes)
|
| It's because they have two different uses (three if you
| count nested quotes, but those aren't common and are
| pretty easy to figure out), contractions and possession,
| and they seemingly collide on words like "its" where
| you'd think it could mean either.
|
| Not sure if you've already learned this (or if it helps),
| but English used to be declined, and its pronouns still
| are, e.g. they/their/them. That's why "its" isn't
| contracted; the possessive marker is already in the word.
| mixmastamyk wrote:
| His, hers, its
| JosephRedfern wrote:
| Maybe he composed the message using a machine with non-ECC RAM
| and suffered a bit flip, which through some chain of events,
| led to the ' being added. Best to give him the benefit of
| doubt, I think!
| notretarded wrote:
| The mistake was that it was included.
| JosephRedfern wrote:
| Oops, that was dumb. Fixed, thanks.
| spacedcowboy wrote:
| Seems likely that "bad ram" was the reason for the recent AT&T
| fiber issues, given that 1 bit was being flipped reliably in data
| packets [1]
|
| [1]:
| https://twitter.com/catfish_man/status/1335373029245775872?l...
| p_l wrote:
| I have in the past encountered an issue where a line card was
| stripping exactly one bit of address data. I don't know of the
| follow-up investigation, but it probably wasn't TCAM.
| SV_BubbleTime wrote:
| I think you meant seems _un_ likely
| MarkusWandel wrote:
| This is one justified Linus rant! My personal history includes
| data loss twice because of defective RAM, and many more RAMs
| discarded after the now obligatory overnight run of MemTest86+
| (these were all secondhand RAMs - I would never buy a new one
| without a refund guarantee). My very first "PC" still had the ECC
| capability and I used it. My own now very dated rant on the
| subject: http://wandel.ca/homepage/memory_rant.html
| mixmastamyk wrote:
| A few years back memtest86 wouldn't run on newer machines, has
| that been fixed?
| IgorPartola wrote:
| I wish this was more of a cohesive argument. He says he thinks
| it's important and points to row-hammer problems but doesn't
| explain why. Probably because the audience it was written for
| already knows the arguments of why, but this is not the best
| argument.
|
| If in doubt, get ECC. Do your own research on how it works and
| why. This post won't explain it, just will blame Intel (probably
| rightfully so).
| turminal wrote:
| It's a message in a thread on a technology forum. I think its
| intended audience is people already familiar with ECC, unlike
| here on HN.
| IgorPartola wrote:
| Exactly my point :)
| eloy wrote:
| He does explain it:
|
| > We have decades of odd random kernel oopses that could never
| be explained and were likely due to bad memory. And if it
| causes a kernel oops, I can guarantee that there are several
| orders of magnitude more cases where it just caused a bit-flip
| that just never ended up being so critical.
|
| It might be false, but I think it's a reasonable assumption.
| IgorPartola wrote:
| To someone on HN who isn't familiar with what ECC does that
| explains nothing about how ECC works and how it could have
| prevented these situations. Or how often they really happen.
| simias wrote:
| The problem is that, if you don't have ECC to detect the
| errors, it's very hard to know what exactly caused a
| random, non-reproducible crash. Especially in kernel mode
| where there's little memory protection and basically any
| driver could be writing anywhere at any time.
|
| I can understand Linus's frustration from that point of
| view: without ECC RAM when you get some super weird crash
| report where some pointer got corrupted for no apparent
| reason, you can't be sure if it was just a random bitflip
| or if it's actually hiding a bigger problem.
| andi999 wrote:
| You could run memtest on a PC without ECC for a couple of
| days and estimate the error rate, or not?
| fuster wrote:
| Pretty sure most memory test tools like memtest86 write
| the memory and then read it back shortly thereafter in
| relatively small blocks. This makes the window for errors
| to be introduced dramatically smaller. Most memory in a
| computer is not being continually rewritten under normal
| use.
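|
| A long-residency variant is easy to sketch, though (purely
| illustrative: userspace only, it assumes the pages stay
| resident, and real tools test far more thoroughly):
|
|     import os, time
|
|     SIZE = 256 * 1024 * 1024         # 256 MB under test
|     pattern = os.urandom(4096)
|     buf = bytearray(pattern * (SIZE // 4096))
|     time.sleep(24 * 3600)            # let the bits sit a day
|     for i in range(0, SIZE, 4096):
|         if buf[i:i + 4096] != pattern:
|             print("mismatch at offset", i)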
| simias wrote:
| If you manage to replicate bitflips every few days your
| RAM is broken.
|
| It's the "once every other year" type of bitflip that's
| the problem. The proverbial "cosmic ray" hitting your
| DRAM and flipping a bit. That will be caught by ECC but
| it'll most likely remain a total mystery if it causes
| your non-ECC hardware to crash.
| zlynx wrote:
| It isn't only cosmic rays. Regular old radiation can also
| cause it. I've read about a server that had many repeated
| problems and the techs replaced the entire motherboard at
| one point.
|
| Then one of them brought in his personal Geiger counter
| and found the radiation coming off the steel in that rack
| case was significantly higher than background.
|
| You may never know when the metal you use was recycled
| from something used to hold radioactive materials.
| reader_mode wrote:
| It takes 5 seconds to Google ECC memory if you're really
| interested and if you're working on kernel related stuff
| you 99.9999% know what it is.
| IgorPartola wrote:
| Right. My point is that TFA serves zero purpose for most
| people on here. Those that know how ECC works already
| know that it is a must-have. Those that don't will learn
| very little from the post, because it fails to explain
| what ECC is and why you need it aside from general
| statements about memory errors. It will reaffirm for
| those that know about what ECC RAM is that it's a good
| idea, but they already know it anyways. It reads a lot
| like an article about why vitamin C is a good thing.
| nix23 wrote:
| To someone on HN who isn't familiar with what Google does
| that explains nothing about how Google works ;)
| TheCoelacanth wrote:
| Google is like an evil version of Duck Duck Go.
| Danieru wrote:
| Nah, to Google is just a generic verb. For example I too
| do all my googling at Duck Duck Go.
|
| Hi alphabet lawyers.
| vorticalbox wrote:
| I believe there was a suit against alphabet about this
| very thing.
|
| They argued that "Google" has now become a generic verb
| meaning "to search the Internet for" and as such Alphabet
| should have the trademark taken away.
| chalst wrote:
| From https://en.m.wikipedia.org/wiki/ECC_memory -
|
| > A large-scale study based on Google's very large number
| of servers was presented at the SIGMETRICS/Performance '09
| conference.[6] The actual error rate found was several
| orders of magnitude higher than the previous small-scale or
| laboratory studies, with between 25,000 (2.5 x 10^-11
| error/bit*h) and 70,000 (7.0 x 10^-11 error/bit*h, or 1 bit
| error per gigabyte of RAM per 1.8 hours) errors per billion
| device hours per megabit. More than 8% of DIMM memory
| modules were affected by errors per year.
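|
| The two phrasings in that quote line up, assuming decimal
| gigabytes:
|
|     rate = 7.0e-11              # errors per bit-hour
|     bits_per_gb = 8e9           # decimal gigabyte
|     print(1 / (rate * bits_per_gb))   # ~1.8 hours per error/GB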
| unixhero wrote:
| Fantastic burn by Linus Torvalds, who also has some skin in
| the CPU game.
|
| Offtopic, I wonder if he trawls that site regularly. And
| eventually I wonder, is he here also? :)
| knorker wrote:
| I have multiple times postponed buying new computers for YEARS,
| because I'm waiting for intel to get their head out of their ass
| and actually let me buy something that does ECC for desktop.
| (incl laptops)
|
| I would have bought computers when I "wanted one". Now I buy them
| when I _need_ one. Because buying a non-ECC computer just feels
| like buying a defective product.
|
| In the last 10 years I would have bought TWICE as many computers
| if they hadn't segmented their market.
|
| Fuck intel. I sense that Linus censored himself in this post,
| and like me is even angrier than the text implies.
| vbezhenar wrote:
| There are plenty of Xeons which are suitable for desktops and
| there are plenty of laptops with Xeons.
|
| Price is not nice though.
| skibbityboop wrote:
| Have you finally stopped buying Intel? Current Ryzens are a
| much better CPU anyhow, just dump Intel and be happy with your
| ECC and everything else.
| jhoechtl wrote:
| I definitely do not want Linus Torvalds yelling at me in that
| tone --- but reading his utterances is certainly entertaining.
| indolering wrote:
| My favorite example is a bit flip altering election results:
|
| https://www.wnycstudios.org/podcasts/radiolab/articles/bit-f...
| qwerty456127 wrote:
| ECC should be everywhere. It seems outrageous to me almost no
| laptops have ECC.
| arendtio wrote:
| It would be interesting to see how many more kernel oopses
| appear on machines without ECC compared to those with ECC.
| nostrademons wrote:
| I still remember Craig Silverstein being asked what his biggest
| mistake at Google was and him answering "Not pushing for ECC
| memory."
|
| Google's initial strategy (c. 2000) around this was to save a few
| bucks on hardware, get non-ECC memory, and then compensate for it
| in software. It turns out this is a terrible idea, because if you
| can't count on memory being robust against cosmic rays, you also
| can't count on the software being stored in that memory being
| robust against cosmic rays. And when you have thousands of
| machines with petabytes of RAM, those bitflips do happen. Google
| wasted many man-years tracking down corrupted GFS files and index
| shards before they finally bit the bullet and just paid for ECC.
| maria_weber23 wrote:
| ECC memory can't eliminate the chances of these failures
| entirely. They can still happen. Making software resilient
| against bitflips in memory seems very difficult though, since
| it not only affects data, but also code. So in theory the
| behavior of software under random bit flips is well... Random.
| You probably would have to use multiple computers doing the
| same calculation and then take the answer from the quorum. I
| could imagine that doing so would still be cheaper than using
| ECC ram, at least around 2000.
|
| Generally this goes against software engineering principles.
| You don't try to eliminate the chances of failure and hope for
| the best. You need to create these failures constantly (within
| reasonable bounds) and make sure your software is able to
| handle them. Using ECC RAM is the opposite. You just make it
| so unlikely to happen that you will generally not encounter
| these errors at scale anymore, but nonetheless they can still
| happen, and now you will be completely unprepared to deal with
| them, since you chose to ignore this class of errors and sweep
| it under the rug.
|
| Another interesting side effect of quorum is that it also makes
| certain attacks more difficult to pull off, since now you have
| to make sure that a quorum of machines gives the same "wrong"
| answer for an attack to work.
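|
| A minimal sketch of that quorum idea (function name and shape
| invented for illustration):
|
|     from collections import Counter
|
|     def quorum(results):
|         """Majority vote over redundant computations; raise
|         if no answer reaches a strict majority."""
|         answer, votes = Counter(results).most_common(1)[0]
|         if votes <= len(results) // 2:
|             raise RuntimeError("no quorum: replicas disagree")
|         return answer
|
|     print(quorum([42, 42, 43]))   # one flipped replica -> 42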
| colejohnson66 wrote:
| > You probably would have to use multiple computers doing the
| same calculation and then take the answer from the quorum.
|
| The Apollo missions (or was it the Space Shuttle?) did this.
| They had redundant computers that would work with each other
| to determine the "true" answer.
| EvanAnderson wrote:
| The Space Shuttle had redundant computers. The Apollo
| Guidance Computer was not redundant (though there were two
| AGCs onboard-- one in the CM and one in the LEM). The
| aerospace industry has a history of using redundant
| dissimilar computers (different CPU architectures, multiple
| implementations of the control software developed by
| separate teams in different languages, etc) in voting-based
| architectures to hedge against various failure modes.
| haolez wrote:
| Sounds similar to smart contracts running on a blockchain
| :)
| buildbuildbuild wrote:
| This remains common in aerospace, each voting computer is
| referred to as a "string".
| https://space.stackexchange.com/questions/45076/what-is-a-
| fl...
| sroussey wrote:
| In aerospace where this is common, you often had multiple
| implementations, as you wanted to avoid software bugs
| made by humans. Problem was, different teams often
| created the same error at the same place, so it wasn't as
| effective as it would have seemed.
| tomxor wrote:
| > Making software resilient against bitflips in memory seems
| very difficult though, since it not only affects data, but
| also code.
|
| There is an OS that pretty much fits the bill here. There was
| a show where Andrew Tanenbaum had a laptop running Minix 3
| hooked up to a button that injected random changes into
| module code while it was running, to demonstrate its
| resilience to random bugs. Quite fitting that this discussion
| was initiated by Linus!
|
| Although it was intended to protect against bad software, I
| don't see why it wouldn't also go a long way toward
| protecting the OS against bitflips. Minix 3 uses a
| microkernel with a "reincarnation server", which means it can
| automatically reload any misbehaving code not part of the
| core kernel on the fly (which for Minix is almost
| everything). This even includes disk drivers. In the case of
| misbehaving code there is some kind of triple redundancy
| mechanism much like the "quorum" you suggest, but that is
| where my crude understanding ends.
| slumdev wrote:
| Error-correcting code (the "ECC" in ECC RAM) is just a quorum
| at the bit level.
| sobriquet9 wrote:
| Modern error correction codes can do much better than that.
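|
| For instance, a classic Hamming(7,4) code corrects any single
| bit flip using only 3 parity bits per 4 data bits -- no
| triplication needed. A minimal Python sketch:
|
|     # Hamming(7,4): 4 data bits -> 7-bit codeword.
|     def encode(d1, d2, d3, d4):
|         p1 = d1 ^ d2 ^ d4              # covers positions 1,3,5,7
|         p2 = d1 ^ d3 ^ d4              # covers positions 2,3,6,7
|         p3 = d2 ^ d3 ^ d4              # covers positions 4,5,6,7
|         return [p1, p2, d1, p3, d2, d3, d4]
|
|     def decode(c):
|         s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
|         s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
|         s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
|         pos = s1 + 2*s2 + 4*s3         # 1-based error position, 0 = clean
|         if pos:
|             c[pos - 1] ^= 1            # correct the flipped bit
|         return (c[2], c[4], c[5], c[6])
|
|     word = encode(1, 0, 1, 1)
|     word[5] ^= 1                       # simulate a cosmic-ray bit flip
|     assert decode(word) == (1, 0, 1, 1)
|
| ECC DIMMs apply the same idea over 64-bit words (8 check bits
| per 64 data bits, hence the 12.5% overhead mentioned elsewhere
| in the thread), typically extended so double-bit errors are at
| least detected.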
| eevilspock wrote:
| I'm surprised that the other replies don't grasp this.
| _This_ is the proper level to do the quorum.
|
| Doing quorum at the computer level would require
| synchronizing parallel computers, and unless that
| synchronization were to happen for each low-level
| instruction, it would have to be written into the
| software to take a vote at critical points. This is going
| to be greatly detrimental both to throughput and software
| complexity.
|
| I guess you could implement the quorum at the CPU level...
| e.g. have redundant cores each with their own memory. But
| unless there was a need to protect against CPU cores
| themselves being unreliable, I don't see this making sense
| either.
|
| At the end of the day, _at some level_, it will always
| come down to probabilities. "Software engineering
| principles" will never eliminate that.
| slumdev wrote:
| I would highly recommend a graduate-level course in
| computer architecture for anyone who thinks ECC is a
| 1980s solution to a modern problem.
|
| There are a lot of seemingly high-level problems that are
| solved (ingeniously) in hardware with very simple, very
| low-level solutions.
| bollu wrote:
| Could you please link me to such a course that covers the
| hardware-level solutions? I'm super interested!
| slumdev wrote:
| https://www.udacity.com/course/high-performance-computer-
| arc...
| andrewaylett wrote:
| https://en.wikipedia.org/wiki/NonStop_(server_computers)
|
| My first employer out of Uni had an option for their
| primary product to use a NonStop for storage -- I think
| HP funded development, and I'm not sure we ever sold any
| licenses for it.
| sobriquet9 wrote:
| If you use multiple computers doing the same calculation and
| then take the answer from the quorum, how do you ensure the
| computer that does the comparison is not affected by memory
| failures? Remember that _all_ queries have to go through it,
| so it has to be comparable in scale and power.
| rovr138 wrote:
| > how do you ensure the computer that does the comparison
| is not affected by memory failures?
|
| You do the comparison on multiple nodes too. Get the
| calculations. Pass them to multiple nodes, validate again
| and if it all matches, you use it.
| sobriquet9 wrote:
| > validate again
|
| Recursion, see recursion.
| Guvante wrote:
| I mean, Raft and similar algorithms run multiple
| verification machines precisely because a single point of
| failure is a single point of failure.
| wtallis wrote:
| See also Byzantine fault tolerance: https://scholar.harva
| rd.edu/files/mickens/files/thesaddestmo...
| hn3333 wrote:
| Bit flips can happen, but regardless of whether they can be
| repaired by the ECC code or not, the OS is notified, IIRC. It
| will signal a corruption to the process that is mapped to the
| faulty address. I suppose that if the memory contains code,
| the process is killed (if ECC correction failed).
| wtallis wrote:
| > I suppose that if the memory contains code, the process
| is killed (if ECC correction failed).
|
| Generally, it would make the most sense to kill the process
| if the corrupted page is _data_, but if it's code, then
| maybe re-load that page from the executable file on non-
| volatile storage. (You might also be able to rescue some
| data pages from swap space this way.)
| gizmo686 wrote:
| If you go that route, you should be able to avoid the
| code/data distinction entirely, as data pages can also be
| completely backed by files. I believe the kernel already
| keeps track of which pages are a clean copy of data from
| the filesystem, so I would think it would be a simple
| matter of essentially paging out the corrupted data.
|
| What would be interesting is if userspace could mark a
| region of memory as recomputable. If the kernel is
| notified of memory corruption there, it triggers a
| handler in the userspace process to rebuild the data.
| Granted, given the current state of hardware, I can't
| imagine that is anywhere near worth the effort to
| implement.
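|
| A hypothetical sketch of that idea in Python (no such kernel
| interface exists today; mark_recomputable and the corruption
| callback are invented purely for illustration):
|
|     recomputable = {}                  # region id -> rebuild callback
|
|     def mark_recomputable(region, rebuild):
|         recomputable[region] = rebuild
|
|     def on_memory_corruption(region):  # imagined kernel upcall
|         if region in recomputable:
|             recomputable[region]()     # e.g. re-decode a cached image
|         else:
|             raise SystemExit("uncorrectable memory error")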
| AaronFriel wrote:
| It can't eliminate it, but:
|
| 1. Single-bitflip correction, along with Google's metrics,
| could help them identify algorithms they've got, customers'
| VMs that are causing bitflips via Rowhammer, and machines
| which have errors regardless of workload.
|
| 2. Double-bitflip detection lets Google decide if they, say,
| want to panic at that point and take the machine out of
| service, and they can report on what software was running or
| why. Their SREs are world-class and may be able to deduce if
| this was a fluke (orders of magnitude less likely than a
| single bit flip), if a workload caused it, or if hardware
| caused it.
|
| The advantage the 3 major cloud providers have is scale. If a
| Fortune 500 were running their own datacenters, how likely
| would it be that they have the same level of visibility into
| their workloads, the quality of SREs to diagnose, and the
| sheer statistical power of scale?
|
| I sincerely hope Google is not simply silencing bitflip
| corrections and detections. That would be a profound waste.
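|
| A rough sketch of such a policy in Python, reading Linux's
| EDAC counters from sysfs (real machines expose these via the
| edac subsystem, but the thresholds below are made up):
|
|     from pathlib import Path
|
|     for mc in Path("/sys/devices/system/edac/mc").glob("mc*"):
|         ce = int((mc / "ce_count").read_text())  # corrected errors
|         ue = int((mc / "ue_count").read_text())  # uncorrected errors
|         if ue > 0:
|             print(f"{mc.name}: uncorrected error, drain the machine")
|         elif ce > 100:
|             print(f"{mc.name}: {ce} corrected errors, flag the DIMM")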
| tjoff wrote:
| ECC seems like a trivial thing to log and keep track of.
| Surely any Fortune 500 could do it and would have enough
| scale to get meaningful data out of it?
| giantrobot wrote:
| I don't think ECC is going to give anyone a false sense of
| security. The issue at Google's scale is they had to spend
| thousands of person-hours implementing in software what they
| would have gotten for "free" with ECC RAM. Lacking ECC (and
| generally using consumer-level hardware) compounded scale and
| reliability problems, or at least made them more expensive
| than they might otherwise have been.
|
| Using consumer hardware and making up reliability with
| redundancy and software was not a bad idea for early Google,
| but it did end up with an unforeseen cost. Even a thousand
| machines in a cosmic-ray-proof bunker will end up with memory
| errors that ECC would correct for free. It's just reducing
| the surface area of "potential problems".
| Animats wrote:
| _consumer hardware..._
|
| That's Intel's PR. Only "enterprise hardware", with a bigger
| markup, supports ECC memory. Adding ECC today should add only
| about 12.5% to memory cost (one extra chip per eight).
|
| AMD decided to break Intel's pricing model. Good for them.
| Now if we can get ECC at the retail level...
|
| The original IBM PC AT had parity in memory.
| ksec wrote:
| >I still remember Craig Silverstein being asked what his
| biggest mistake at Google was and him answering "Not pushing
| for ECC memory."
|
| Did they (Google) or he (Craig Silverstein) ever officially
| admit it on record? I did a Google search and the results
| that came up were all on HN. Did they at least publish a few
| PR pieces saying that they are using ECC memory now? Because
| I don't see any when searching. Admitting they made a mistake
| without officially saying it?
|
| I mean, the whole "servers or computers might not need ECC"
| insanity was started entirely because of Google [1][2], with
| news and articles published even in the early 00s [3]. And
| after that it spread like wildfire and became a commonly
| accepted fact that even Google doesn't need ECC. Just like
| "Apple were using custom ARM instructions to achieve their
| fast JS VM performance" became a "fact". (For the last time,
| no they didn't.) And proponents of ECC memory have been
| fighting this misinformation for decades, to the point of
| giving up and only ranting about it every now and then. [3]
|
| [1] https://blog.codinghorror.com/building-a-computer-the-
| google...
|
| [2] https://blog.codinghorror.com/to-ecc-or-not-to-ecc/
|
| [3] https://danluu.com/why-ecc/
| tyoma wrote:
| Figure this is as good a time as any to ask this:
|
| There are various other DRAMs in a server (say, for disk
| cache). Has Google or anyone who operates at a similar scale
| seen single-bit errors in these components?
| [deleted]
| gh02t wrote:
| The supercomputing community has looked at some of these
| effects on different parts of the GPU.
|
| https://ieeexplore.ieee.org/abstract/document/7056044
| bsder wrote:
| This is as old as computing and predates Google.
|
| When America Online was buying EV6 servers as fast as DEC
| could produce them, they used to see about one _double_ bit
| error per day across their server farm, each of which would
| reboot the whole machine.
|
| DRAM has only gotten worse--not better.
| gigatexal wrote:
| I mean, early on, sure: at a startup where you're not
| printing money I can see how saving on hardware makes sense.
| But surely you don't need an MBA to know that hardware will
| keep getting cheaper whereas developers and their time will
| only get more expensive: better to let the hardware deal with
| it than to burden developers with it. I'd have made the case
| for ECC, but hindsight being what it is ...
| colejohnson66 wrote:
| But if you can save $1M+ now, then throw the cost of fixing
| it onto the person who replaces you, why do you care? You
| already got your bonus and jumped ship.
| starfallg wrote:
| Recent advances have blurred the lines a bit. The ECC memory
| that we all know and love is mainly side-band ECC, with the
| memory bus widened to accommodate the ECC bits driven by the
| memory controller. However, as process sizes shrink, bit
| flips become more likely, to the point that many types of
| memory now have on-die ECC, where the error correction is
| handled internally on the DRAM modules themselves. This is
| present on some DDR4 and DDR5 modules, but the details are
| kept internal by the DRAM makers and not usually made public.
|
| https://semiengineering.com/what-designers-need-to-know-abou...
|
| There has been a lot of debate regarding this that was
| summarised in this post -
|
| https://blog.codinghorror.com/to-ecc-or-not-to-ecc/
| type0 wrote:
| Consumer awareness about ECC needs to be better. With recent
| security implications I simply can't understand why more
| motherboard manufacturers don't support it on AMD. Intel of
| course is all to blame on the blue side; I stopped buying
| their overpriced Xeons because of this.
| rajesh-s wrote:
| Good point on the need for awareness!
|
| The industry has convinced the average user of consumer
| hardware that PPA (power, performance, area) is all that
| needs to get better with generational improvements. Hoping
| that the concerning aspects of security and reliability that
| have come to light in the recent past change this.
| aborsy wrote:
| For the average user, what's the impact of bit flips in memory in
| practical terms?
|
| I am not talking about servers dealing with critical data.
|
| Suppose that I maintain a repository (documents, audio and
| video), one copy in a ZFS-ECC system and one in an ext4-nonECC
| system.
|
| Would I notice a difference between these two copies after 5-10
| years?
|
| That tells us if ECC matters for most people.
| throwaway9870 wrote:
| This isn't about disk storage, this is about DRAM. A bit flip
| in DRAM might corrupt data, but could also cause random crashes
| and system hangs. That generally matters to everyone.
| [deleted]
| theevilsharpie wrote:
| > For the average user, what's the impact of bit flips in
| memory in practical terms?
|
| The most likely impact (other than nothing, if bits are flipped
| in unused memory) is program crashes or system lock-ups for no
| apparent reason.
| elgfare wrote:
| For those out of the loop like me, ECC does indeed stand for
| error correcting code. https://en.m.wikipedia.org/wiki/ECC_memory
| vlovich123 wrote:
| A couple of years ago there were advances that claimed to
| make Rowhammer work on ECC RAM, even with DDR4 [1]. Is that
| no longer a concern for some reason?
|
| I would think the only guaranteed defenses against Rowhammer
| are cryptographic digests and/or guard pages.
|
| [1] https://www.zdnet.com/article/rowhammer-attacks-can-now-
| bypa...
| theevilsharpie wrote:
| ECC isn't a direct mitigation against Rowhammer attacks, as
| memory errors caused by three or more flipped bits can still
| go undetected (unless you're using ChipKill, but that's a
| rare setup).
|
| However, flipping three bits simultaneously isn't trivial,
| and attempts that flip fewer bits will be detected and
| logged.
| GregarianChild wrote:
| Isn't ChipKill just another form of ECC? If so, there is a
| number of bitflips that ChipKill can no longer correct or
| detect. [1] seems to say that they observed some flips in
| DRAM with ChipKill, although the paper is a bit vague here.
|
| [1] B. Schroeder et al, _DRAM Errors in the Wild: A Large-
| Scale Field Study_
| http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
| rajesh-s wrote:
| Right! Section 1.3 of this publication discusses possible
| mitigations for the row hammer problem and where ECC fits in:
|
| https://users.ece.cmu.edu/~omutlu/pub/rowhammer-summary.pdf
| GregarianChild wrote:
| The paper you cite is from 2014 and the mitigations
| discussed there have all been circumvented. [1] is from
| 2020 and a better read for Rowhammer mitigation.
|
| [1] J. S. Kim et al, _Revisiting RowHammer: An Experimental
| Analysis of Modern DRAM Devices and Mitigation Techniques_
| https://arxiv.org/abs/2005.13121
| rajesh-s wrote:
| Thanks for pointing that out!
| simias wrote:
| I used to be pretty skeptical of ECC for consumer-grade
| hardware, mainly because I felt that I'd always prefer
| cheaper/more RAM over ECC RAM, even if it meant I'd get a
| couple of crashes every year due to rogue bitflips. For
| servers it's a different story, but for a desktop I'm fine
| dealing with some instability for better performance.
|
| But these days, with RAM density being so high and
| bitflipping attacks being more than a theoretical threat, it
| seems like there's really no good reason not to switch to ECC
| everywhere.
| ekianjo wrote:
| > no good reason not to switch to ECC everywhere.
|
| Not all CPUs support ECC, however.
| josefx wrote:
| Just Intel fucking over security by making ECC a non feature
| on consumer grade hardware - wouldn't be surprised if it was
| just a single bit flipped in a feature mask.
| jjeaff wrote:
| Well, if bit flips are as common as a bunch of people in this
| thread seem to think, it should just be a matter of time
| until a flip on your CPU activates the ECC feature.
| josefx wrote:
| That bit probably is either burned in or stored with the
| firmware in something more permanent than RAM. Modern RAM
| has the issue that it is optimized for capacity and speed
| to a point where state changes can leak into nearby bits.
| loeg wrote:
| (Intel)
| tokamak-teapot wrote:
| Are there any Ryzen boards that support ECC and _actually
| correct errors_?
| gruez wrote:
| quick search:
|
| https://rog.asus.com/forum/showthread.php?112750-List-
| Asus-M...
| bcrl wrote:
| Most Ryzen ASRock boards support ECC as well. I'm happily
| using one right now.
| loeg wrote:
| > Most
|
| Circa Zen1 launch, ASRock claimed _all_ of their consumer
| boards would support ECC.
| [deleted]
| fulafel wrote:
| The functionality seems to all be in the memory controller
| integrated into the CPU.
| loeg wrote:
| Yes. E.g., all ASRock boards.
| freeqaz wrote:
| I bought ECC RAM for my laptop and it definitely was about 4x the
| price. It's valuable to me for a few reasons -- peace of mind
| being a big one.
|
| Bit flips happen and are real. I really wish ECC was plentiful
| and not brutally expensive!
| washadjeffmad wrote:
| For the price, it made more sense for me to buy an R630 and
| populate it with a few less expensive, higher-capacity ECC
| RDIMMs. I don't really need ECC as a local feature, so this
| lets me run the mobile hardware I want.
| temac wrote:
| Note that the price is mostly due to market segmentation --
| in your case _most_ of it by the laptop vendor (of course
| some of it by Intel, but not _that_ much compared to the
| laptop vendor).
|
| Xeons with ECC are not that overpriced compared with similar
| Cores without it. Likewise, RAM sticks with ECC are cheap to
| produce (basically just one more chip to populate per side
| per module), and soldered ECC RAM would simply add maybe $10
| or $20 of extra chips.
| bitcharmer wrote:
| This is the first time I've heard of a laptop that supports
| ECC memory. Could you please share the make and model?
| bluedino wrote:
| Lenovo (P series) and HP workstation models also support ECC
| xxs wrote:
| Lenovo has Xeon laptops[0], and technically Intel used to
| support ECC on i3 (and celeron, etc.)
|
| 0: https://www.lenovo.com/us/en/laptops/thinkpad/thinkpad-p/T
| hi...
| lb1lf wrote:
| -My boss has a Xeon Dell - a 7550, methinks - luggable.
|
| It is filled to the gunwales with ECC RAM.
|
| Cost him the equivalent of $7k or so. Eeek.
| dijit wrote:
| I have a Dell Precision 5520 (chassis of an XPS 15) which has
| a Xeon and ECC memory.
|
| Finding a memory upgrade seems difficult though.
| markonen wrote:
| I was looking at getting the Xeon-based NUC recently and
| one of the reasons I decided against it was that ECC SO-
| DIMMs seem to be a really marginal product. If you want
| ECC, something that takes full-size DIMMs seems _much
| easier_ to buy memory for.
| jjeaff wrote:
| You should be able to check logs for corrected errors, right?
|
| I'm guessing you won't find any.
| londons_explore wrote:
| I simply want my computer to execute code perfectly. Let's
| settle on "one instance of unintended behaviour per hundred
| years" as the metric.
|
| If it needs ECC memory to do that, then fit it with ECC memory.
| If there are other ways to achieve that (for example deeper dram
| cells to be more robust to cosmic rays) that's fine too.
|
| Just meet the reliability spec - I don't care how.
| simias wrote:
| Then you'll have to pay a huge premium for that privilege. I
| can assure you that your standard computer components are not
| rated for century-scale use.
|
| That's why I've always been on the fence with this ECC thing.
| For servers it's vital because you need stability and
| security.
|
| For desktops I think that for a long time it was fine without
| ECC. If I have to choose between having, say, 30% more RAM or
| avoiding a potential crash once a year, I'll probably take
| the additional RAM.
|
| The problem is that now these problems can be exploited by
| malicious code instead of merely happening because of cosmic
| rays. That's the main argument in favour of ECC IMO; the rest
| is just a tradeoff to consider.
| ClumsyPilot wrote:
| But it isn't just a crash, it's also silent data corruption
| that will never be detected
| dev_tty01 wrote:
| This. How many user documents have had bit flip errors
| introduced that were never detected? Impossible to say, but
| it is not a small number given the world-wide use of DRAM.
| Most are in trivial and unimportant documents, but some
| aren't...
| simias wrote:
| It can be a concern, that's true, but personally most of
| the stuff I edit ends up checked into a git repository or
| something similar.
|
| And I mean, we all spend all day editing text messages and
| comments and files on non-ECC hardware, yet bitflip-induced
| corruption is rare enough that I can't say I've witnessed a
| single instance of it in my life, despite spending a good
| chunk of it looking at screens.
|
| It's just not a problem that occurs in practice, in my
| experience. If you're compiling the release build of a
| critical piece of software, you probably want ECC. If you're
| building the dev version of your webapp or writing an email
| to your boss, you'll probably survive without it.
| ClumsyPilot wrote:
| Can you make that statement with any certainty? My personal
| and family computers have crashed quite a few times and have
| corrupted photos and files, some of them valuable (taxes,
| healthcare, etc. -- personal computers hold valuable data
| these days).
|
| I couldn't tell, as a user, which of those corruptions and
| crashes were caused by bitflips. Could you?
| loup-vaillant wrote:
| > _I can assure you that your standard computer components
| are not rated for century-scale use._
|
| And that's probably not what GP asked for. There's a
| difference between guaranteeing an error rate of 1 error per
| century of use on average, and guaranteeing it over the
| course of an _actual century_. It might be okay to guarantee
| that error rate for only 5 years of uninterrupted use, and
| degrade after that. For instance:
|
|     Years  1-5:   1 error per century.
|     Years  6-10:  3 errors per century.
|     Years 10-15: 10 errors per century.
|     Years 15-20: 20 errors per century.
|     Years 20-30:  1 error per *year*.
|     Years 30+:   the chip is broken.
|
| Now, given how energy-hungry and polluting the whole computer
| industry actually is, it might be a good idea to shoot for
| extreme durability and reliability anyway. Say, sustain 1
| error per century, over the course of _fifty years_. It will
| be slower and more expensive, but at least it won't burn the
| planet as fast as our current electronics.
| temac wrote:
| In "theory" it needs ECC because you must also protect the link
| between the CPU and the RAM. So with ECC fully in DRAM but no
| protection on the bus, you risk some errors during the
| transfer. However maybe this kind of errors are rare enough so
| that you would have less than one per century. It probably
| depends on the motherboard design and fabrication quality
| though, and the environment where it is used.
| z3t4 wrote:
| Memory often comes with a lifetime guarantee. If it had ECC,
| it would be much easier to detect bad memory...
| jkuria wrote:
| For those, like me, wondering what ECC is, here's an explanation:
|
| https://www.tomshardware.com/reviews/ecc-memory-ram-glossary...
___________________________________________________________________
(page generated 2021-01-03 23:00 UTC)