[HN Gopher] What Flips Your Bit: Cosmic Ray Errors at Mozilla
___________________________________________________________________
What Flips Your Bit: Cosmic Ray Errors at Mozilla
Author : dannyobrien
Score : 146 points
Date : 2022-04-13 15:42 UTC (7 hours ago)
(HTM) web link (blog.mozilla.org)
(TXT) w3m dump (blog.mozilla.org)
| [deleted]
| spullara wrote:
| I bit squatted cloudfront.net years ago and got many, many
| requests. Most of them *.js which would, if I were malicious,
| have allowed me to do just about anything. It was interesting to
| see that the errors definitely happened in different places. For
| instance, sometimes the Host header was the original domain and
| sometimes it matched my domain.
| robotsteve2 wrote:
| Any sort of hardware or software error seems much more likely.
| Computers are incredibly complex and approximations are used
| everywhere (in the design of the hardware, in the theory of
| operation). I don't think inference-based experiments or analysis
| on cosmic ray bit flips are appropriate.
|
| You really need some kind of dedicated cosmic ray detector nearby
| as a control. If the flux of cosmic rays into the detector is
| orders of magnitude lower than the rate of bit errors you ascribe
| to cosmic rays, it's probably some hardware/software issue and
| not the cosmic rays.
| AshamedCaptain wrote:
| I believe people use "cosmic rays" as a catch-all phrase for all
| these very low probability error causes (just because of the
| coolness of cosmic rays), but in practice _any_ other cause is
| much more common than cosmic rays.
|
| Even at the processor level every single transistor on it has a
| rated mean time between failures a.k.a. MTBF. Sure it may be
| astronomical, but you do have a lot of transistors, so in
| practice a random bitflip is not such a rare event. Designers
| actually explore MTBF vs power usage trade-offs here, and there
| is even a fascinating area of "fault resilient computing"
| research.
|
| Every single clock domain crossing has another MTBF (google
| metastability). Again they are very high (billions of years if
| done properly), but you will have plenty of such crossings (and
| the number keeps growing with modern, more asynchronous
| design).
|
| Processors are quite unreliable things.
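|
| A rough sketch of the "astronomical MTBF, but a lot of elements"
| arithmetic above. It assumes independent elements with constant
| failure rates, so the rates simply add. The per-crossing MTBF and
| the crossing count are made-up illustrative numbers, not measured
| values.

```python
def aggregate_mtbf_years(per_element_mtbf_years, n_elements):
    """MTBF of n independent elements with constant failure rates.

    Failure rates add, so the combined MTBF is the per-element
    MTBF divided by the number of elements.
    """
    return per_element_mtbf_years / n_elements


# Assumed numbers, for illustration only.
per_crossing_mtbf = 1e9   # "billions of years" per clock domain crossing
crossings = 10_000        # hypothetical number of crossings on one chip

chip_mtbf = aggregate_mtbf_years(per_crossing_mtbf, crossings)
print(f"one chip: one metastability failure every {chip_mtbf:,.0f} years")

# The same rule applies across a fleet of identical chips.
fleet_mtbf = aggregate_mtbf_years(chip_mtbf, 1_000_000)
print(f"a million chips: one failure every {fleet_mtbf:.2f} years")
```

| With these assumed inputs a single chip still looks safe, but a
| large fleet sees such events regularly.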
| gnufx wrote:
| Yes, but what you'd want to do is look for coincidences between
| a detector for cosmic ray showers around (above?) the
| electronics you're monitoring and whatever it is these days
| that instruments ECC events. The time resolution would be
| pathetic for a nuclear physics experiment, but probably good
| enough.
|
| If you look at the ambient gamma-ray spectrum in a
| semiconductor detector (which would be germanium rather than
| silicon) the main background you see is typically from
| concrete; I'm ashamed to say I've forgotten the energy from
| K-40, but in the region of 1500 keV. (Ironically, large
| concrete blocks used for shielding would be regarded as a
| significant radiation hazard if all the activity in them was
| concentrated.)
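|
| The coincidence idea above could be sketched roughly like this.
| The function names, the timestamp lists and the window size are
| all hypothetical. It assumes both event streams carry timestamps
| in seconds on a shared clock, and that chance coincidences follow
| Poisson statistics.

```python
from bisect import bisect_left, bisect_right
from math import exp


def count_coincidences(detector_times, ecc_times, window_s):
    """Count ECC events with at least one detector hit within +/- window_s."""
    detector_times = sorted(detector_times)
    hits = 0
    for t in ecc_times:
        lo = bisect_left(detector_times, t - window_s)
        hi = bisect_right(detector_times, t + window_s)
        if hi > lo:  # at least one detector event inside the window
            hits += 1
    return hits


def expected_accidentals(detector_rate_hz, n_ecc_events, window_s):
    """Expected chance coincidences if the two streams are unrelated."""
    p_hit = 1.0 - exp(-2.0 * window_s * detector_rate_hz)
    return n_ecc_events * p_hit


# Made-up example data: three shower detections, two ECC events.
showers = [10.0, 50.0, 90.0]
ecc_events = [10.4, 70.0]
print(count_coincidences(showers, ecc_events, window_s=1.0))
```

| Only a coincidence count well above the expected accidentals
| would point at showers rather than at the hardware itself.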
| jldugger wrote:
| Indeed, there was a study in IEEE pointing out the absurdity of
| cosmic rays as causes -- one point cited was that the vast
| majority of bit flips happen at specific points in the address
| space, essentially page boundaries between chips.
| bqmjjx0kac wrote:
| I'm curious why that is evidence against the cosmic ray
| explanation.
|
| Couldn't it have something to do with the physical layout of
| memory? Perhaps those page-boundary-adjacent addresses
| present a larger physical target, perhaps on the bus.
|
| Of course I am wildly speculating right now. I'd love to see
| the article if you have a link!
| 323 wrote:
| Modern devices have tiny features which are extremely
| fragile to all sorts of interference, many of which are much
| more abundant than cosmic rays.
|
| See the row-hammer attack where you can flip an unrelated
| bit just by read/writes to adjacent bits from software!!!
| cozzyd wrote:
| I'd be very interested in reading that article if you have a
| link (or title, or doi...)
| Avlin67 wrote:
| What about overclocking? Does it cause bit flips? Especially
| low-grade DDR4 pushed to its limit...
| anonymousiam wrote:
| This is why I always buy ECC/EDAC capable servers. SEUs are a
| real thing.
| ThePhysicist wrote:
| One of the first things you'll learn when studying experimental
| physics is how to come up with all kinds of alternative
| mechanisms that might explain the result you've observed in your
| experiment, and then think of ways to test that the results
| weren't actually caused by those unwanted mechanisms. Most Nobel-
| prize winning physics experiments were carefully designed to
| compensate for any relevant secondary effects, and I would even
| go as far as saying that this is often the largest challenge when
| doing high-precision experiments.
|
| So the first question I'd ask myself when thinking about cosmic-
| ray induced errors is how I would ensure that the bit flips are
| not caused by e.g. problems on the hard drives or the NAND array
| (which are probably much more likely to occur than cosmic ray
| events, at least on the surface of the earth).
| mherdeg wrote:
| Yeah that was one of the gotchas in the story at
| https://blogs.oracle.com/linux/post/attack-of-the-cosmic-ray...
| -- MAYBE it was a bit flip due to a cosmic ray, or MAYBE it was
| a bit flip due to another layer of the system that makes RAM
| chips store and retrieve data.
|
| I like the idea of a physicist who thinks about this and says -
| "well, why should we shrug and say 'maybe it was a cosmic ray?'
| Surely we can test this! Let's put the computer in a lead-lined
| enclosure and benchmark the memory failure rate and see if it
| changes", or whatever.
|
| That's a great extension of the classic computer-hacker view
| that "of course we can understand why this bug happened, we
| don't have to shrug and say it segfaults until we restart
| sometimes, we can just dig some more." How far can you go?
| grog454 wrote:
| On the subject of bit flips, I am able to detect these in the
| client to server UDP packets in my game. With specific logging
| enabled I would see an error about once per minute while
| receiving about 15,000 of one type of packet per second. I was
| able to estimate about 1/1,000,000 packets contained a single
| flipped bit.
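|
| The estimate above can be checked with a few lines of arithmetic.
| This sketch only uses the two figures from the comment (15,000
| packets per second, about one error per minute); the function
| name is made up.

```python
def flipped_fraction(packets_per_second, errors_per_minute):
    """Fraction of packets that arrive with a detected bit flip."""
    packets_per_minute = packets_per_second * 60
    return errors_per_minute / packets_per_minute


rate = flipped_fraction(15_000, 1)
print(f"about 1 in {round(1 / rate):,} packets")
```

| That works out to one in 900,000 packets, which is consistent
| with the quoted figure of roughly 1/1,000,000.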
| dextercd wrote:
| The '1 error for every 256MB memory a month' sounds like way too
| much to me.
|
| A program I wrote launches every time I start my computer. It
| allocates some memory and scans it periodically for unexpected
| changes. After an equivalent of 15.8 256MB/months no anomalies
| have been found yet.
|
| Would really like to see more authoritative figures for modern
| consumer hardware.
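|
| One way to put a number on "no anomalies yet": if flips really
| arrived at 1 per 256MB per month and followed a Poisson process
| (an assumption, as is the idea that the scanned memory stays
| resident and unmodified), the chance of seeing none over 15.8 such
| unit-months is exp(-15.8).

```python
import math


def prob_zero_events(rate_per_unit_month, unit_months):
    """Poisson probability of observing zero events."""
    expected = rate_per_unit_month * unit_months
    return math.exp(-expected)


p = prob_zero_events(1.0, 15.8)
print(f"P(no flips in 15.8 x 256MB-months) = {p:.2e}")
```

| The result is on the order of 1e-7, so under these assumptions
| the observation is hard to reconcile with the quoted rate.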
| axg11 wrote:
| This is fascinating and hints at a future possible scientific
| study: using phones across the globe to map cosmic ray events.
| I'm not a physicist so I can't speak for the value of such data.
| If cosmic ray events do not occur uniformly across the globe then
| mapping events from 100,000s of phones could give interesting
| insights.
| antognini wrote:
| As they say, there's an app for that:
|
| https://cosmicrayapp.com/
|
| Basically it monitors your camera for the streaks produced by
| cosmic rays. You can see the real time stream of events here:
|
| https://cosmicrayobserver.com/#0.4/0/0
| arc-in-space wrote:
| Wait, this works? That's amazing. I suppose you could do the
| same with any camera sensor?
| li2uR3ce wrote:
| Looks like the rays only hit populated areas.
| https://xkcd.com/1138/
| seanw444 wrote:
| A fascinating revelation indeed.
| axg11 wrote:
| This is amazing - thank you for sharing!
| trollied wrote:
| I know HN has a decent Factorio fanbase. Factorio properly
| stresses PC hardware, and borderline memory is usually ok for a
| casual gamer until you start a Factorio megabase. A decent
| example is Warger, who does speedruns:
| https://forums.factorio.com/viewtopic.php?f=7&t=100646
| https://www.speedrun.com/factorio#100
| Those that have played the game - speedruns are amazing to watch,
| if you haven't already.
| geophile wrote:
| As the article points out, using collected client data is
| problematic, because some errors will often be undetectable, as
| in numeric data. And in general, you would have to control for
| bit flips somehow caused by software.
|
| I wonder whether a SETI approach would be useful here. Allocate,
| say, 1MB of memory. Fill it with some known bit pattern.
| Periodically check the memory and look for discrepancies. Do this
| once an hour, on 10M devices, and that is a LOT of monitoring.
| Report discrepancies along with time, location (including
| elevation), hardware and OS information.
|
| I would think that this approach would provide a lot of
| interesting information about when and where bit flips occur,
| especially when matched against information on solar and
| atmospheric events (as in the article). Perhaps sensitive
| hardware and OS environments would be detected. Even completely
| negative results would be interesting: no bit flips observed
| would suggest that purported bit flips elsewhere might have other
| explanations.
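|
| A minimal sketch of the fill-and-check loop described above. The
| function names and the 0xAA pattern are arbitrary choices, and the
| reporting side (time, location, hardware) is left out. It also
| cannot tell a particle strike from bad RAM or a software bug
| scribbling on the buffer.

```python
def make_canary(size_bytes, pattern=0xAA):
    """Allocate a buffer filled with a known byte pattern."""
    return bytearray([pattern]) * size_bytes


def find_flipped_bits(buf, pattern=0xAA):
    """Return (byte_offset, bit_index) for every bit that differs."""
    if buf.count(pattern) == len(buf):
        return []  # fast path: nothing changed
    flips = []
    for offset, value in enumerate(buf):
        diff = value ^ pattern
        bit = 0
        while diff:
            if diff & 1:
                flips.append((offset, bit))
            diff >>= 1
            bit += 1
    return flips


canary = make_canary(1024 * 1024)       # 1MB, as in the comment
canary[100] ^= 0x08                     # simulate one flipped bit
print(find_flipped_bits(canary))
```

| A real client would call find_flipped_bits on a timer (once an
| hour in the comment) and rewrite the pattern after each report.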
| simne wrote:
| A few years ago I heard that the US gov't created an app which
| periodically checked for random light flashes on a smartphone's
| camera sensor and sent the info to the cloud, to use the phones
| as a large distributed network for detecting illegal nukes. (btw,
| Soviets and Russians really love to collect such weird facts).
|
| The idea looks very realistic, except that it would drain the
| battery and could also be used as a surveillance tool, so I have
| not seen a real app.
|
| Strictly speaking, any digital camera sensor is an excellent tool
| for such things, much better than RAM. You only need to cover the
| lens with something opaque to light but transparent to particles
| (any thin plastic will do), and of course run a monitoring
| program and store the logs in the cloud.
|
| And in real life, you can buy a Geiger counter shield for Arduino
| on Ali, and one pal of mine even exposed such a sensor to the
| internet.
|
| - It's better to use two such sensors and a simple logic circuit,
| so that they register not every event but only those that both
| sensors detect simultaneously; that way you see the direction
| the cosmic particle came from.
|
| - This method of many sensors with logic and logs was used in the
| research that detected hidden rooms in the Great Pyramids of
| Giza (real cosmic rays are so powerful that they can be detected
| with simple equipment at depths of up to a hundred meters, so
| ordinary concrete is like cardboard to them). I even saw a post
| from one guy who installed such machinery at home (he used 8 or
| 16 counters, I forget the details), and after a few months you
| could clearly see from the logs where the window in his room is :)
| EscargotCult wrote:
| This could be crowdsourced. Imagine some sort of reporting
| network of volunteers, running the simple program (memory
| allocation and periodic checks) on any hardware they're loaning
| time on, and submitting their location and altitude as well.
| tconfrey wrote:
| I worked at Sonus Networks (now Ribbon[0]) in the early 2000's
| building VoIP solutions for telcos. We had a bunch of unexplained
| errors in a new installation in Denver. After much head
| scratching the engineers on the problem concluded that the higher
| altitude significantly increased the likelihood of impact by
| alpha particles and that that was the cause of the problem!
|
| (IIRC we increased the shielding on the devices.)
|
| [0] https://ribboncommunications.com/
| perihelions wrote:
| Minor clarification: alpha radiation doesn't come from cosmic
| rays, rather from U/Th contamination in the circuit materials.
| The altitude dependent component you'd be seeing is rather
| muons and neutrons.
| cozzyd wrote:
| you can get alphas from spallation, but you wouldn't really
| call that alpha radiation.
| jaytaylor wrote:
| Maybe it's a stupid question, but did the additional shielding
| completely resolve the discrepancy?
| li2uR3ce wrote:
| > In almost every case we cannot find any plausible explanation
| or bug
|
| Observe the natural state of every software developer. I kid...
| or do I?
|
| > What if it wasn't just some fantastical explanation?
|
| Doesn't sound nearly as fantastical but bad RAM is probably more
| common than one would expect. You seldom really know the quality
| of hardware you run on. Just say'n, sometimes you don't need a
| helping cosmic ray.
| IncRnd wrote:
| "Bitsquatting is a form of cybersquatting which relies on bit-
| flip errors that occur during the process of making a DNS
| request. These bit-flips may occur due to factors such as faulty
| hardware or cosmic rays. When such an error occurs, the user
| requesting the domain may be directed to a website registered
| under a domain name similar to a legitimate domain, except with
| one bit flipped in their respective binary representations.
|
| "A 2011 Black Hat paper detailed an analysis where eight
| legitimate domains were targeted with thirty one bitsquat
| domains. Over the course of one day, 3,434 requests were made to
| bitsquat domains." [1]
|
| Cisco presented a paper on bitsquatting at defcon, "Examining the
| Bitsquatting Attack Surface". From the paper, "The conclusion is
| that the possibility of bitsquat attacks is more widespread than
| originally thought, but several techniques exist for mitigating
| the effects of these new attacks." [2]
|
| [1] https://en.wikipedia.org/wiki/Bitsquatting
|
| [2]
| https://media.defcon.org/DEF%20CON%2021/DEF%20CON%2021%20pre...
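|
| To make "one bit flipped in their respective binary
| representations" concrete, here is a small sketch that lists the
| single-bit-flip neighbours of a domain name. The helper names are
| made up. As a simplification it only keeps candidates made of
| lowercase letters, digits and hyphens, and it skips flips that
| would add or remove a dot.

```python
import string

HOSTNAME_CHARS = set(string.ascii_lowercase + string.digits + "-")


def is_valid_hostname(name):
    """Every label is non-empty and neither starts nor ends with '-'."""
    return all(
        label and not label.startswith("-") and not label.endswith("-")
        for label in name.split(".")
    )


def bitsquat_candidates(domain):
    """Names that differ from domain by exactly one flipped bit."""
    domain = domain.lower()
    found = set()
    for i, ch in enumerate(domain):
        if ch == ".":
            continue  # keep the label structure unchanged
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit))
            if flipped not in HOSTNAME_CHARS:
                continue  # uppercase, control or non-ASCII result
            candidate = domain[:i] + flipped + domain[i + 1:]
            if is_valid_hostname(candidate):
                found.add(candidate)
    return sorted(found)


for name in bitsquat_candidates("cloudfront.net")[:5]:
    print(name)
```

| Flips of the 0x20 bit only change letter case, which DNS ignores,
| so those are dropped along with the invalid characters.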
| chadwittman wrote:
| Amazing comment, this is wild. Thank you for sharing this!
| anonymousiam wrote:
| I had the pleasure of working with the author for a brief time,
| and I attended his presentation. Great stuff. What I found
| particularly interesting is some later work that characterized
| the probability of error based upon the device type, and the
| ambient temperature (based on IP Geo-location).
| pitaj wrote:
| Is one mitigation TLS certificate verification?
| legalcorrection wrote:
| Depends on where in the stack the error happened.
| simulate-me wrote:
| It depends on whether or not the client specifically
| requested, or is expecting, traffic over HTTPS. But yes, if
| the user's client requests encrypted traffic, the attacker
| will not be able to produce a valid certificate. This attack
| isn't that different than a MITM.
| tedunangst wrote:
| Nothing prevents an attacker from getting a cert for
| snytimg.com or oslashdot.org.
| legalcorrection wrote:
| That only helps the attacker if the error happened before
| reaching the DNS-specific path. If the error happens
| inside the DNS path, then the browser is still expecting
| to get a certificate for the correct website.
| xenophonf wrote:
| rat9988 wrote:
| > Why not read the linked paper,
|
| You answered your own question. The answer is in page 12,
| which means there is too much information. He is not
| interested in the whole topic, just about this question. So
| he asks, maybe someone is charitable enough to answer.
| Nothing wrong with it.
| xenophonf wrote:
| How could I possibly answer their question better than
| the experts who wrote the paper?
| rat9988 wrote:
| Maybe you can't but someone else can. The question is
| open to anyone who can and wants to answer.
| nixpulvis wrote:
| Skim until you are in the right section?
|
| Literally titled: "Section II - Mitigation of
| bitsquatting attacks"
| tedunangst wrote:
| dahfizz wrote:
| How does a cosmic bit flip make it past the Ethernet CRC?
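|
| One possible answer, sketched below: the frame CRC only covers the
| bytes as they were when the sender computed it. This toy example
| uses zlib.crc32 as a stand-in for the Ethernet frame check
| sequence; the payload and function names are made up. A flip on
| the wire is caught, while a flip in memory before the CRC is
| computed is not.

```python
import zlib


def flip_bit(data, bit_index):
    """Return a copy of data with one bit inverted."""
    out = bytearray(data)
    out[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(out)


def crc_catches_flip(payload, bit_index, flip_before_crc):
    """True if the receiver's CRC check would notice the flip."""
    if flip_before_crc:
        sent = flip_bit(payload, bit_index)      # damaged in sender RAM
        crc = zlib.crc32(sent)                   # CRC over damaged bytes
        received = sent
    else:
        crc = zlib.crc32(payload)                # CRC over good bytes
        received = flip_bit(payload, bit_index)  # damaged in transit
    return zlib.crc32(received) != crc


payload = b"Host: cloudfront.net"
print(crc_catches_flip(payload, 50, flip_before_crc=False))
print(crc_catches_flip(payload, 50, flip_before_crc=True))
```

| The same gap exists on the receiving side once the frame has been
| checked and copied into a buffer.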
| incomingpain wrote:
| I had the opportunity to design my SOC from scratch. Mostly
| ripping off Berkeley's public design.
|
| Something I have documented over the last 2 years: solar flare
| activity is what causes problems. All memory is ECC but it still
| happens.
|
| Faraday cage incoming?
|
| Wait? Faraday cage racks million $ idea?
| jeffreygoesto wrote:
| Using an FD-SOI process can help reduce soft errors.
| legalcorrection wrote:
| I suspect without great evidence that cosmic ray bitflips are
| mostly a scapegoat for imperfect hardware and are in fact one or
| two orders of magnitude less common than popular wisdom would
| suggest.
| zepearl wrote:
| I don't know folks.
|
| 2 years ago I took a laptop which I wasn't using (16 GiB RAM non-
| ECC) => I created in Linux with Python an array ("bytes"? Don't
| remember exactly anymore) of ~10 or 12 GiB containing random
| integers => computed the array's hash and saved it.
|
| Then for ~1-2 months I recomputed from time to time the hash of
| that array (in between, the laptop was in suspend-to-RAM) and
| compared it to the original result => it always matched, I never
| had any bitflips.
|
| I therefore doubt that the estimation of "1/256MB/month" is
| correct - I could not prove that, at least not with my laptop.
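|
| A small-scale sketch of the experiment described above, assuming
| a bytearray of random bytes and SHA-256 as the hash (the comment
| does not say which type or hash was used). The size here is tiny
| so it runs anywhere; the original used roughly 10 to 12 GiB.

```python
import hashlib
import os


def make_block(size_bytes):
    """Allocate a mutable buffer of random bytes."""
    return bytearray(os.urandom(size_bytes))


def digest(block):
    """Hash the buffer's current contents."""
    return hashlib.sha256(block).hexdigest()


block = make_block(1024 * 1024)
baseline = digest(block)

# Later (hours or days in the original experiment), recompute and compare.
print("unchanged" if digest(block) == baseline else "bit flip detected")
```

| Compared with scanning for a known pattern, a hash only says that
| something changed, not which bit.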
| deckard1 wrote:
| I've always been a bit skeptical of published numbers. I
| usually just chalk it up to vastly different operating
| conditions and scale.
|
| On my home server w/ ECC you can check the corrected and
| uncorrected (multibit) errors. Assuming my Ryzen is correctly
| reporting them to Linux, I have 0 errors corrected and 0
| uncorrected with a 80 day uptime. I've checked a few other
| times and never seen an error. Others with ECC often report the
| same.
|
| My understanding of modern RAM is that it has checks built in
| to the modules which are somewhat equivalent to ECC already
| (the correcting part, not the reporting part). Which is a
| necessity in order to hit the density we are at today.
| cozzyd wrote:
| A server with 64 GB of ECC ram sitting at an altitude of 3.2 km
| on the Greenland ice sheet is reporting... 0 bit errors
| (whether correctable or uncorrectable) in the 244 days it's
| been up.
|
| A server with 16 GB of ECC ram at an altitude of 3.8 km in
| California is reporting.... 0 bit errors in the 146 days it's
| been up.
|
| Maybe I shouldn't believe what /sys/devices/system/edac/mc is
| reporting? These are EL8 systems...
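|
| For reference, a sketch of reading those counters from Python. It
| assumes the usual Linux EDAC sysfs layout, with one mcN directory
| per memory controller holding ce_count (corrected) and ue_count
| (uncorrected) files. The function name is made up.

```python
from pathlib import Path


def read_edac_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": int, "ue_count": int}}."""
    counts = {}
    for mc in sorted(Path(base).glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None  # file missing or unreadable
        counts[mc.name] = entry
    return counts


print(read_edac_counts())
```

| An empty result means no memory controller is registered with
| EDAC at all, which is a different situation from controllers
| that report zero errors.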
| tclancy wrote:
| >I therefore doubt that the estimation of "1/256MB/month" is
| correct
|
| As someone who did incredibly poorly in high school physics,
| this line in the article bothered me as well: the study is from
| the 1990s when the density of memory would have been much
| lower. I would think the percentage per megabyte has dropped
| significantly in 30 or so years. It also assumes a constant
| form factor for the memory, doesn't it?
| nomel wrote:
| > I therefore doubt that the estimation of "1/256MB/month" is
| correct
|
| The probability is related to the physical volume the memory
| takes, since it's caused by a physical particle going through
| that volume. So, this rate will continuously drop as memory
| density increases.
___________________________________________________________________
(page generated 2022-04-13 23:00 UTC)