[HN Gopher] Single random bit flip causes error in certificate t...
       ___________________________________________________________________
        
       Single random bit flip causes error in certificate transparency log
        
       Author : shmsr
       Score  : 355 points
       Date   : 2021-07-04 09:13 UTC (13 hours ago)
        
 (HTM) web link (groups.google.com)
 (TXT) w3m dump (groups.google.com)
        
       | agwa wrote:
       | OP here. Unless you work for a certificate authority or a web
       | browser, this event will have zero impact on you. While this
       | particular CT log has failed, there are many other CT logs, and
       | certificates are required to be logged to 2-3 different logs
       | (depending on certificate lifetime) so that if a log fails web
       | browsers can rely on one of the other logs to ensure the
       | certificate is publicly logged.
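       | 
       | In toy Python, the resilience property is roughly this (log IDs
       | are invented and the real policy details are simplified):
       | 
       |     def still_publicly_logged(scts, failed_logs):
       |         """Toy check: at least one SCT came from a log that
       |         hasn't failed."""
       |         return any(s["log_id"] not in failed_logs for s in scts)
       | 
       |     # A certificate logged to three logs survives one log failing.
       |     scts = [{"log_id": "yeti-2022"}, {"log_id": "argon-2022"},
       |             {"log_id": "nessie-2022"}]
       |     print(still_publicly_logged(scts, {"yeti-2022"}))  # -> True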
       | 
       | This is the 8th log to fail (although the first caused by a bit
       | flip), and log failure has never caused a user-facing certificate
       | error. The overall CT ecosystem has proven very resilient, even
       | if a bit flip can take out an individual log.
       | 
       | (P.S. No one knows if it was really a cosmic ray or not. But it's
       | almost certainly a random hardware error rather than a software
       | bug, and cosmic ray is just the informal term people like to use
       | for unexplained hardware bit flips.)
        
         | sillysaurusx wrote:
         | Is it possible to give an example other than "cosmic ray"? I
         | know it's an informal shorthand, but it also raises the
         | question of what's actually causing these flips.
         | 
         | Is it just random stray electrons that happen to stray too far
         | while traveling through the RAM? Very interesting to me.
        
       | blumomo wrote:
       | Could also be an excuse to cover up the real circumstances.
        
         | pdpi wrote:
         | It could, yes. I'm sure the thought also crossed the mind of
         | everybody else in that thread, and they've clearly dismissed
         | it.
         | 
         | Bare conspiratorial assertions don't advance the conversation
         | in any meaningful way. If you have something constructive to
         | add to the discussion, by all means do so. Otherwise, please
         | refrain from this sort of comment
        
           | tialaramex wrote:
           | Humans can examine the underlying data and reason that indeed
           | one bitflip could cause the observed result.
           | 
           | In this particular case the alternatives are:
           | 
           | 1. The bitflip happened. Cosmic rays. Defective CPU. Maybe a
           | tremendously rare data tearing race somewhere in software.
           | 
           | OR
           | 
           | 2a. Somebody has a fast way to generate near-collisions in
           | SHA256, which are 1-bit errors from the target hash. Rather
           | than publish this breakthrough they:
           | 
           | 2b. Had a public CA issue and sign two certificates, one
           | uninteresting real certificate for *.molabtausnipnimo.ml
           | which we've now seen and another "evil" near-colliding
           | certificate with the one-bit-different hash then they:
           | 
           | 2c. Logged the real certificate, but at only _one_ log (Yeti).
           | They get the log to show the real certificate yet actually
           | use the hash from the "evil" certificate, thus perhaps
           | obtaining a single valid SCT for the "evil" certificate.
           | 
           | Scenario 1 isn't terribly likely for any one certificate, but
           | given we're logging vast numbers every day and errors
           | inherently cascade in this system, it was going to happen
           | sooner or later.
           | 
           | Scenario 2 fails for having too many conspirators (a public
           | CA, and likely a different public log), and yet it achieves
           | almost nothing even if it remains secret. Your Chrome or
           | Safari expects _at least_ two SCTs (one of them from Google,
           | in this case Google's Argon 2022 log), so one doesn't get the
           | job done. And it was never going to remain secret: this
           | occurrence was spotted after less than a day and thoroughly
           | investigated, resulting in the log being shut down after
           | just a few days.
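           | 
           | To make "one bit different" concrete, here's a tiny Python
           | sketch (the certificate bytes and the flipped bit position
           | are made up for illustration):
           | 
           |     import hashlib
           | 
           |     def bit_distance(a: bytes, b: bytes) -> int:
           |         """Count how many bits differ between two digests."""
           |         return sum(bin(x ^ y).count("1") for x, y in zip(a, b))
           | 
           |     cert = b"pretend DER bytes of *.molabtausnipnimo.ml"
           |     expected = hashlib.sha256(cert).digest()
           | 
           |     # Simulate the suspected hardware fault: one flipped bit
           |     # in the stored copy of the hash.
           |     stored = bytearray(expected)
           |     stored[3] ^= 0x04
           | 
           |     print(bit_distance(expected, bytes(stored)))  # -> 1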
           | 
           | [Edited multiple times to fix awful HN formatting.]
        
       | durovo wrote:
       | Or was it a targeted data manipulation attack by the aliens?
        
       | bloopernova wrote:
       | Layman question: Is it possible to shield electronics against
       | cosmic rays?
       | 
       | Or is it like a neutrino thing, where they just go through
       | everything? (But neutrinos barely interact, correct?)
        
         | obedm wrote:
         | It should be possible, but not practical.
         | 
         | It's easier to just code checks and balances to fix those
         | errors instead. There's a famous story about a Google search
         | bug caused by a cosmic ray, and they implemented such error
         | checking on their servers after that.
        
         | dricornelius wrote:
         | The cosmic ray induced radiation at sea level is a mix of
         | neutrons, gamma rays, electrons, and highly penetrating high
         | energy muons (heavier cousins of electrons). They are caused by
         | the cascade of reactions when the CRs collide with nuclei in
         | the atmosphere. You're right, neutrinos barely interact (hence
         | the name "little neutral ones") so they're no problem. You
         | could put infrastructure underground to escape muons, but that
         | wouldn't be practical. Moreover, any shielding (and the
         | materials in the electronics) needs to be carefully chosen to
         | minimise naturally occurring radioactive materials that could
         | lead to upsets themselves. It's a tricky one. There are other
         | ways to mitigate the risk: error checking, redundant systems,
         | etc.
        
           | dricornelius wrote:
           | My favourite story is this one https://www.thegamer.com/how-
           | ionizing-particle-outer-space-h...
        
             | [deleted]
        
             | dricornelius wrote:
             | More serious examples here https://www.lanl.gov/science/NSS
             | /issue1_2012/story4full.shtm...
        
         | noduerme wrote:
         | That's not what actually happened here (probably), but think of
         | relatively heavy atoms like iron, ejected from a supernova at
         | close to the speed of light. They're not really "rays". They're
         | solid particles that punch holes in everything. The good news
         | is, mostly they're ionized and they get repelled away from
         | hitting the planet's surface by our magnetic field (generated
         | by the big hot iron magnet that's churning under our crust).
         | But in the rest of space where there's no field like that,
         | getting lots of tiny puncture wounds from high-velocity iron
         | particles is pretty normal. It does a lot of damage to cells
         | and DNA in a cumulative way pretty quickly.
         | 
         | To answer your question, shielding is possible but it is much
         | harder in space than it is under the magnetosphere. Even so, a
         | stray particle can wreak havoc. Shielding for the purposes of
         | data protection on earth is essentially a risk/reward analysis.
         | Anyone _could_ get hit by a cosmic bullet, but the chances are
         | pretty low. The odds of an SSD flipping bits on its own are
         | significantly higher.
        
           | Leparamour wrote:
           | > That's not what actually happened here (probably), but
           | think of relatively heavy atoms like iron, ejected from a
           | supernova at close to the speed of light. They're not really
           | "rays".
           | 
           | Wow, I have to admit I always assumed cosmic rays to be gamma
           | radiation but alpha particle radiation sounds a lot more
           | scary.
           | 
           | Does anyone happen to know if computers engineered for the
           | space station or shuttles already have some built-in
           | (memory) redundancy or magnetic shielding to account for
           | damage by cosmic rays? I imagine a system crash of the life
           | support systems in space would be devastating.
        
             | ben_w wrote:
             | Radiation hardening is basically everything you can manage
             | with the weight limit:
             | https://en.wikipedia.org/wiki/Radiation_hardening
             | 
             | IIRC nobody is currently using magnetic fields for
             | shielding; I don't know if that's due to insufficient
             | effectiveness, power consumption, or unwanted interactions
             | e.g. with Earth's magnetosphere.
        
               | myself248 wrote:
               | I've always wondered about that. Seems to me that if you
               | can shield one side from the sun's heat and expose the
               | other side to the cold of space, an MRI magnet's
               | superconductors should be quite happy with the
               | temperature. A big ol' magnet would be a pain to charge
               | up, but you'd only do it once, and it would then provide
               | long-term shielding.
        
               | verall wrote:
               | MRIs are big electromagnets and they use big power and
               | produce big heat. In space you have no convection so you
               | must radiate away all of your heat which is challenging.
               | Maybe you can make something better with superconductors
               | that doesn't use much power, but I don't think it exists
               | yet.
        
               | ben_w wrote:
               | Superconductors famously don't produce heat. Putting a
               | superconducting magnet in shade (though that includes
               | from the Earth not just from Sol) will keep it cool once
               | it gets cool.
               | 
               | This is what the James Webb is going to be doing, though
               | for non-superconducting reasons.
        
               | [deleted]
        
               | [deleted]
        
             | adrian_b wrote:
             | The vast majority of the cosmic radiation consists of
             | protons, i.e. hydrogen ions.
             | 
             | Light ions, e.g. of helium, lithium or boron are also
             | relatively abundant and heavier ions are less abundant.
             | There are also electrons, but in smaller quantities,
             | because they can be captured by ions.
             | 
             | The high speed protons collide with the atoms of the Earth
             | atmosphere and the collisions generate a huge variety of
             | particles, but most of them have a very short lifetime, so
             | they decay before reaching ground level.
             | 
             | At ground level, the cosmic radiation consists mostly of
             | muons, because their longer lifetime lets them survive the
             | trip from the upper atmosphere, where they are generated,
             | down to ground level.
             | 
             | Extremely few particles from the primary cosmic radiation,
             | i.e. protons, reach the ground, because most are deflected
             | by the Earth's magnetic field and the remaining protons
             | lose their energy in collisions with the atmosphere,
             | generating particles which eventually decay into muons.
        
           | joshspankit wrote:
           | I had thought that Starlink would become extremely compelling
           | when the servers were in orbit as well, but maybe that's
           | naive. Cubesats with massive arrays of active storage might
           | be far too difficult (aka costly) to protect properly.
        
             | noduerme wrote:
             | Putting servers in orbit is a really, really bad idea.
             | Firstly, it would be wildly inefficient, just due to the
             | unavoidable delay both ways. You expect delay over a long
             | distance network, but you want the server to be positioned
             | and cabled up to minimize latency. Just locating a
             | satellite requires quite a bit of overhead, so treating
             | them as servers rather than clients would create a huge
             | amount of latency. As a failover for ground-based servers
             | in a nuclear war scenario, it could make sense, but they're
             | also literally flying over the airspace of hostile nations
             | who, in a war, have the capability to take them out. Mixing
             | any kind of civilian control onto a server embedded in a
             | flying artifact over an enemy nation is, obviously, a
             | recipe for disaster in a hot war. Take all of that and the
             | fact that the bandwidth is stone-age and spotty depending
             | on its position, and there's no satellite you'd want to use
             | to set up a website selling alpaca sweaters. (Watch: A few
             | years from now we'll find out that the US Space Command
             | runs some second strike end-of-world command control server
             | off a satellite; that still only proves it's too late to
             | contact the fucker when you need it).
        
               | joshspankit wrote:
               | > Firstly, it would be wildly inefficient, just due to
               | the unavoidable delay both ways
               | 
               | The point I was thinking about is a scenario where
               | Starlink satellites communicate with each other via laser
               | (already starting to happen), and then communicate with
               | the end user via the satellite over them. Because we're
               | talking about speed-of-light transmission between sats,
               | data in the Starlink network can theoretically cover
               | "ground" (aka miles/km) faster than ground-based ISPs.
               | 
               |  _Then_ it makes sense to deploy servers _within_ that
               | network so that two Starlink users can have extremely low
               | latency from user to server to other user (the slowest
               | part being the earth to sat latencies, one for each
               | user).
        
             | MayeulC wrote:
             | One of the issues with putting servers in orbit is cooling;
             | you can't just use fans in space. On the other hand, real
             | estate is pretty cheap. Servicing is hard, though, and
             | micrometeorites are another risk. Plus, with launch costs
             | being high and radiation an issue, I don't see it
             | happening any time soon outside of very specific areas.
        
               | willis936 wrote:
               | Radiative cooling is pretty dramatic in space. With an
               | isolated radiation shield/mirror from the sun and a
               | decent sized heatsink, you can have a lot of cooling
               | power on hand. A server isn't going to be pumping out
               | nearly as much heat as a rocket engine, and launch
               | vehicles don't tend to overheat.
        
               | fennecfoxen wrote:
               | Forget cooling for a minute. Cooling dissipates energy
               | but you need to collect the energy first and worry about
               | powering the server. It's going to be a solar-plus-
               | battery system, which is heavy and expensive, with
               | substantial launch costs.
        
             | fennecfoxen wrote:
             | You can't just put an SSD or whatever in orbit and expect
             | reasonable read latencies at all times. It's in orbit. It
             | moves. Half the time it's on the wrong side of the planet.
        
         | dredmorbius wrote:
         | The problem isn't cosmic rays, per se, it's the issues created
         | by any single-instance modification of a record.
         | 
         | The general solution here is to store multiple redundant copies
         | of data _and_ checksums of those data, so that inconsistency
         | can be detected _and_ the specific records considered suspect
         | can be clearly identified.
         | 
         | In general, a record is a persistence of a pattern through time
         | subject to entropic forces.[1] Given that modifications to
         | records are typically random events, keeping (and continuously
         | integrity-checking) multiple copies of those records is the
         | best general defence, _not_ guarding against one specific
         | variety of modification in a single instance of a record.
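         | 
         | A minimal sketch of that defence in Python (the record contents
         | and the corruption are invented): keep several copies plus
         | checksums, flag the copy whose checksum no longer matches, and
         | repair it from the others.
         | 
         |     import hashlib
         | 
         |     def checksum(record: bytes) -> str:
         |         return hashlib.sha256(record).hexdigest()
         | 
         |     # Three redundant copies of one record, each with its checksum.
         |     record = b"example log entry"
         |     copies = [{"data": record, "sum": checksum(record)}
         |               for _ in range(3)]
         | 
         |     # A random single-copy modification hits copy 1 only.
         |     copies[1]["data"] = bytes([record[0] ^ 0x01]) + record[1:]
         | 
         |     # Integrity check: identify exactly which copy is suspect...
         |     suspect = [i for i, c in enumerate(copies)
         |                if checksum(c["data"]) != c["sum"]]
         |     print(suspect)  # -> [1]
         | 
         |     # ...and repair it from any copy that still verifies.
         |     good = next(c for c in copies
         |                 if checksum(c["data"]) == c["sum"])
         |     copies[suspect[0]]["data"] = good["data"]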
         | 
         | ________________________________
         | 
         | Notes:
         | 
         | 1. The symmetric statement of _signals_ is that they are
         | persistence of a pattern through _space_ subject to _noise_.
         | See: https://news.ycombinator.com/item?id=27604073
        
         | Tomte wrote:
         | No, because bit flips from radioactive decay are typically not
         | caused by cosmic rays; they originate in the chip's packaging.
        
         | nicoburns wrote:
         | Yes, the easiest way is to place the computer underground (or
         | underwater).
        
         | numpad0 wrote:
         | Not trivially; if it were viable, the Chernobyl reactor would
         | be gone already. Also, IC packaging and chassis materials can
         | be sources of stray electron beams, and other modes of
         | glitching, like power failure, exist.
        
         | frumiousirc wrote:
         | Yes, by placing them underground.
        
       | nixgeek wrote:
       | It could be a CPU issue, and ECC doesn't save you from all of
       | these:
       | 
       | https://engineering.fb.com/2021/02/23/data-infrastructure/si...
       | 
       | https://arxiv.org/pdf/2102.11245
        
       | tester756 wrote:
       | Off topic:
       | 
       | What's the point of using Google Groups?
        
         | dlgeek wrote:
         | The CT program was spearheaded by Google and the discussion in
         | question is on a list owned by the Chromium project.
        
         | throwaways885 wrote:
         | Mailing lists are great. Better than a slack for long-term
         | discoverability.
        
           | tester756 wrote:
           | How about GitHub issues?
        
             | throwaways885 wrote:
             | It's not the same thing. GH Issues firstly requires a
             | GitHub account, whereas anyone can sign up to a mailing
             | list. The model is just worse for general discussion too.
        
       | jart wrote:
       | The push for crypto without ECC RAM is a nonstop horror show.
       | Software under normal circumstances is remarkably resilient to
       | having its memory corrupted. However, crypto algorithms are
       | designed so that a single bit flip effectively changes all the
       | bits in a block. If you chain blocks, then a single bit flip in
       | one block destroys all the blocks. I've seen companies like MSPs
       | go out of business because they were doing crypto with consumer
       | hardware. No one thinks it'll happen to them, and once it happens
       | they're usually too dim to even know what happened. Bit flips
       | aren't an act of God; you simply need a better computer.
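       | 
       | A small Python sketch of both effects (block contents invented):
       | flip one bit in an early block and every chained hash after it
       | changes, with each changed digest differing in roughly half of
       | its bits.
       | 
       |     import hashlib
       | 
       |     def chain(blocks):
       |         """Hash each block together with the previous hash."""
       |         prev, hashes = b"", []
       |         for block in blocks:
       |             prev = hashlib.sha256(prev + block).digest()
       |             hashes.append(prev)
       |         return hashes
       | 
       |     blocks = [b"block-0", b"block-1", b"block-2", b"block-3"]
       |     good = chain(blocks)
       | 
       |     # Flip a single bit in block 1 and rebuild the chain.
       |     blocks[1] = bytes([blocks[1][0] ^ 0x01]) + blocks[1][1:]
       |     bad = chain(blocks)
       | 
       |     # Block 0 still matches; everything after block 1 does not.
       |     print([g == b for g, b in zip(good, bad)])
       |     # -> [True, False, False, False]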
        
         | ericbarrett wrote:
         | In my younger days I worked in customer support for a storage
         | company. A customer had one of our products crash (kernel
         | panic) at a sensitive time. We had a couple of crack engineers
         | I used to hover around to try to pick up gdb tricks so I
         | followed this case with interest--it was a new and unexpected
         | panic.
         | 
         | Turns out the crash was caused by the processor _taking the
         | wrong branch_. Something like this (it wasn't Intel but you get
         | the picture):
         | 
         |     test edx, edx
         |     jnz $something_that_expects_nonzero_edx
         | 
         | Well, edx was 0, but the CPU jumped anyway.
         | 
         | So yeah, sometimes ECC isn't enough. If you're really paranoid
         | (crypto, milsec) you should execute multiple times and sanity-
         | test.
        
         | EthanHeilman wrote:
         | >If you chain blocks then a single bit flip in one block
         | destroys all the blocks. I've seen companies like msps go out
         | of business because they were doing crypto with consumer
         | hardware.
         | 
         | If it is caused by a single bitflip you know the block in which
         | that bitflip occurred and can try each bit until you find the
         | right bit. This is an embarrassingly parallel problem. Let's
         | say you need to search a 1 GB space for a single bit flip.
         | That only requires testing about 8 billion bit flips. Given
         | the merklized nature of most crypto, you will probably be
         | searching a space far smaller than 1 GB.
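         | 
         | For instance, a toy Python version of that search (the data and
         | the known-good hash are invented); a real recovery job would
         | just split the bit range across many workers:
         | 
         |     import hashlib
         | 
         |     def recover_single_bitflip(corrupted, good_digest):
         |         """Try every single-bit correction until the hash
         |         matches again."""
         |         data = bytearray(corrupted)
         |         for i in range(len(data) * 8):     # one candidate per bit
         |             data[i // 8] ^= 1 << (i % 8)   # flip
         |             if hashlib.sha256(data).digest() == good_digest:
         |                 return bytes(data)
         |             data[i // 8] ^= 1 << (i % 8)   # undo, keep searching
         |         return None
         | 
         |     original = b"leaf data that was hashed into the log"
         |     good = hashlib.sha256(original).digest()
         | 
         |     corrupted = bytearray(original)
         |     corrupted[7] ^= 0x20                   # the unknown bit flip
         |     recovered = recover_single_bitflip(bytes(corrupted), good)
         |     print(recovered == original)           # -> True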
         | 
         | >Bit flips aren't an act of god you simply need a better
         | computer.
         | 
         | Rather than using hardware ECC, you could implement ECC in
         | software. I think hardware ECC is a good idea, but you aren't
         | screwed if you don't use it.
         | 
         | The big threat here is not occasional random bit flips, but
         | adversary-caused targeted bit flips, since adversaries can
         | flip bits of software state in ways that won't cause
         | detectable failures but will cause hard-to-detect security
         | failures.
        
         | morelikeborelax wrote:
       | I used to sell workstations and servers years ago, and trying to
       | convince people that they needed ECC RAM, and that it was just
       | insurance costing (often) no more than the price of the extra
       | chips on the DIMMs, was a nightmare.
         | 
         | The amount of uninformed and inexperienced counter arguments
         | online suggesting it was purely Intel seeking extra money (even
         | though they didn't sell RAM) was ridiculous.
         | 
       | I never understood why there was so much pushback from the
       | consumer world commenting on something they had no idea about.
       | Similar to the arguments about why you would ever need xxGB of
       | RAM, made while also condemning the incorrect 640kb RAM Bill
       | Gates comment.
        
         | ajross wrote:
         | > The push for crypto without ecc ram is a nonstop horror show
         | 
         | That's a bit hyperbolic.
         | 
         | First, ECC doesn't protect the full data chain: you can have a
         | bitflip in a hardware flip flop (or latch open a gate that
         | drains a line, etc...) before the value reaches the memory.
         | Logic is known to glitch too.
         | 
         | Second: ECC is mostly designed to protect long-term _storage_
         | in DRAM. Recognize that a cert like this is a very short-term
         | value; it's computed and then transmitted. The failure
         | happened fast, before copies of the correct value were made.
         | That again argues for a failure location other than a DRAM
         | cell.
         | 
         | But mostly... this isn't the end of the world. This is a failed
         | cert, which is a failure that can be reasonably easily handled
         | by manual intervention. There have been many other mistaken
         | certs distributed that had to be dealt with manually: they can
         | be the result of software bugs, they can be generated with the
         | wrong keys, the dates can be set wrong, they can be maliciously
         | issued, etc... The system is designed to include manual
         | validation in the loop and it works.
         | 
         | So is ECC a good idea? Of course. Does it magically fix
         | problems like this? No. Is this really a "nonstop horror show"?
         | Not really, we're doing OK.
        
           | noxer wrote:
           | If a system is critical it should run on multiple machines in
           | multiple locations and "sync with checks" kinda like the oh
           | so hated and totally useless blockchains.
           | 
           | Then if such a bit-flip occurred, it would never occur on all
           | machines at the same time in the same data. And on top of
           | that, you could easily make the system fix itself if
           | something like that happens (simply assume the majority of
           | nodes didn't have the bit-flip), or in the worst-case
           | scenario it could at least stop rather than making "wrong
           | progress".
           | 
           | I have no clue what this particular log needs in terms of
           | "throughput specs", but I assume it would be easily
           | achievable with current DLT.
        
             | dundarious wrote:
             | I haven't thought about whether an actual blockchain is
             | really the best solution, but the redundancy argument is
             | legitimate. We've been doing it for decades in other
             | systems where an unnoticed bit flip results in complete
             | mission failure, such as an Apollo mission crash.
             | 
             | I'm not really sure what Yeti 2022 is exactly, so take this
             | with heaps of salt, but it seems like this is a "mission
             | failure" event -- it can no longer continue, except as read
             | only. Crypto systems, even more than physical systems like
             | rockets, suffer from such failures after just "one false
             | step". Is the cost of this complete failure so low that it
             | doesn't merit ECC? Extremely doubtful. Is it so low that it
             | doesn't merit redundancy? More open for debate, but
             | plausibly not.
             | 
             | I know rockets experience more cosmic rays and their
             | failure can result in loss of life and (less importantly)
             | losing a lot more money, and everything is a tradeoff -- so
             | I'm not saying the case for redundancy is water tight. But
             | it's legitimate to point out that there is an inherent
             | and, it seems, under-acknowledged fragility in
             | non-redundant crypto systems.
        
               | noxer wrote:
               | >I haven't thought about whether an actual blockchain is
               | really the best solution...
               | 
               | Most likely not. But the tech behind FBA (Federated
               | Byzantine Agreement) distributed ledgers would make an
               | extremely reliable system that can handle malfunction of
               | hardware and large outages of nodes. And since this is an
               | append-only log and only some entities can write to it,
               | it could be implemented as permissioned, so that the
               | system doesn't have to deal with attacks that a public
               | blockchain would face.
        
               | tialaramex wrote:
               | > I'm not really sure what Yeti 2022 is exactly
               | 
               | Sometimes there are problems with certificates in the Web
               | PKI (approximately the certificates your web browser
               | trusts to determine that this is really
               | news.ycombinator.com, for example). It's a lot easier to
               | discover such problems early, and detect if they've
               | really stopped happening after someone says "We fixed it"
               | if you have a complete list of every certificate.
               | 
               | The issuing CAs could just have like a huge tarball you
               | download, and promise to keep it up-to-date. But you just
               | know that the same errors of omission that can cause the
               | problem you were looking for, can cause that tarball to
               | be lacking the certificates you'd need to see.
               | 
               | So, some people at Google conceived of a way to build an
               | append-only log system, which issues signed receipts for
               | the items logged. They built test logs by having the
               | Google crawler, which already gets sent certificates by
               | any HTTPS site it visits as part of the TLS protocol, log
               | every new certificate it saw.
               | 
               | Having convinced themselves that this idea is at least
               | basically viable, Google imposed a _requirement_ that in
               | order to be trusted in their Chrome browser, all
               | certificates must be logged from a certain date. There
               | are some fun chicken-and-egg problems (which have also
               | been solved, which is why you didn't need to really _do_
               | anything even if you maintain an HTTPS web server) but in
               | practice today this means that if it works in Chrome, it
               | was logged. This is _not_ a policy requirement; not
               | logging certificates doesn't mean your CA gets
               | distrusted - it just means those certificates won't work
               | in Chrome until they're logged and the site presents the
               | receipts to Chrome.
               | 
               | The append-only logs are operated by about half a dozen
               | outfits, some you've heard of (e.g. Let's Encrypt, Google
               | itself) and some maybe not (Sectigo, Trust Asia). Google
               | decided the rule for Chrome is, it must see at least one
               | log receipt (these are called SCTs) from Google, and one
               | from any "qualified log" that's not Google.
               | 
               | After a few years operating these logs, Google were doing
               | fine, but some other outfits realised hey, these logs
               | just grow, and grow, and grow without end, they're
               | append-only, that's the whole point, but it means we
               | can't trim 5 year old certificates that nobody cares
               | about. So, they began "sharding" the logs. Instead of
               | creating Kitten log, with a bunch of servers and a single
               | URL, make Kitten 2018, and Kitten 2019, and Kitten 2020
               | and so on. When people want to log a certificate, if it
               | expires in 2018, that goes in Kitten 2018, and so on.
               | This way, by the end of 2018 you can switch Kitten 2018
               | to read-only, since there can't be new certificates which
               | have already expired, that's nonsense. And eventually you
               | can just switch it off. Researchers would be annoyed if
               | you did it in January 2019, but by 2021 who cares?
               | 
               | So, Yeti 2022 is the shard of DigiCert's Yeti log which
               | only holds certificates that expire in 2022. DigiCert
               | sells lots of "One year" certificates, so those would be
               | candidates for Yeti 2022. DigiCert also operate Yeti 2021
               | and 2023 for example. They also have a "Nessie" family
               | with Nessie 2022 still working normally.
               | 
               | Third parties run "verifiers" which talk to a log and
               | want to see that it _is_ in fact a consistent append-only
               | log. They ask it for a type of cryptographic hash of all
               | previous state, which will inherit from an older hash of
               | the same sort, and so on back to when the log was empty.
               | They also ask to see all the certificates which were
               | logged; if the log operates correctly, they can calculate
               | the forward state and determine that the log is indeed a
               | record of a list of certificates, in order, and that the
               | hashes match. They remember what the log said, and if it
               | were to subsequently contradict itself, that's a fatal
               | error. For example, if it suddenly changed its mind about
               | which certificate was logged in January, that's a fatal
               | error, or if there was a "fork" in the list of hashes,
               | that's a fatal error too. This ensures the append-only
               | nature of the log.
               | 
               | Yeti 2022 failed those verification tests, beginning at
               | the end of June, because in fact it had somehow logged
               | one certificate for *.molabtausnipnimo.ml but had
               | mistakenly calculated a SHA256 hash which was one bit
               | different, and then all subsequent work assumed the (bad)
               | hash was correct. There's no way to rewind and fix that.
               | 
               | In principle if you knew a way to make a bogus
               | certificate which matched that bad hash you could
               | overwrite the real certificate with that one. But we
               | haven't the faintest idea how to begin going about that
               | so it's not an option.
               | 
               | So yes, this was mission failure for Yeti 2022. This log
               | shard will be set read-only and eventually
               | decommissioned. New builds of Chrome (and presumably
               | Safari) will say Yeti 2022 can't be trusted past this
               | failure. But the overall Certificate Transparency system
               | is fine, it was designed to be resilient against failure
               | of just one log.
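               | 
               | A toy Python model of why no later entry can repair this
               | (a big simplification: real CT uses a Merkle tree, not a
               | plain running hash):
               | 
               |     import hashlib
               | 
               |     def leaf(cert):
               |         return hashlib.sha256(cert).digest()
               | 
               |     def head(leaves):
               |         h = b""
               |         for lf in leaves:
               |             h = hashlib.sha256(h + lf).digest()
               |         return h
               | 
               |     certs = [b"cert-1", b"cert-2", b"cert-3"]
               | 
               |     # What a verifier recomputes from the certificates.
               |     correct = [leaf(c) for c in certs]
               | 
               |     # What the log recorded: cert-2's hash has a flipped
               |     # bit.
               |     recorded = [leaf(c) for c in certs]
               |     recorded[1] = bytes([recorded[1][0] ^ 1]) + recorded[1][1:]
               | 
               |     print(head(recorded) == head(correct))  # -> False
               |     # Every head published from then on inherits the bad
               |     # leaf hash, so appending more certificates can never
               |     # make the verifier happy again.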
        
               | admax88q wrote:
               | Conceivably you could also fix this by having all
               | verifiers special-case this one certificate in their
               | verification software to substitute the correct hash?
               | 
               | Obviously that's a huge pain but in theory it would work?
        
               | psanford wrote:
               | You really want to make everyone special case this
               | because 1 CT log server had a hardware failure?
               | 
               | This is not the first time a log server had to be removed
               | due to a failure, nor will it be the last. The whole
               | protocol is designed to be resilient to this.
               | 
               | What would be the point of doing something besides
               | following the normal procedures around log failures?
        
               | dundarious wrote:
               | Thank you, that's an excellent description. The CT system
               | as a whole does appear to have ample redundancy, with
               | automated tools informing manual intervention that
               | resolves this individual failure.
        
             | ineedasername wrote:
             | Safety-critical systems are not a good fit for a
             | blockchain-based resolution to the Byzantine Generals
             | problem. Safety-critical systems need extremely low latency
             | to resolve the conflict fast. So blockchain is not going to
             | be an appropriate choice for all critical applications when
             | there are multiple low-latency solutions for BFT at very
             | low millisecond latencies and, IIRC, microseconds for
             | avionics systems.
        
               | noxer wrote:
               | Not sure what you mean by "safety critical". I made no
               | such assumption. Also, since the topic is an (append-
               | only) log, it probably doesn't need such low latency.
               | What it more likely needs is finality, so that once an
               | entry is made and accepted it must be final and, of
               | course, correct. DLTs can do this in a distributed,
               | self-fixing way, i.e. a node that tries to add faulty
               | data is overruled and can never get a confirmation for a
               | final state that would later turn out to be invalid.
               | 
               | Getting the whole decentralized system to "agree" will
               | never be fast unless we have quantum tech. There is
               | simply no way servers around the globe could communicate
               | in microseconds, even if all communication happened at
               | the speed of light and processing were instant. It would
               | still take time. In reality such systems need seconds,
               | which is often totally fine, as long as everyone relies
               | only on data that has been declared final.
        
               | ineedasername wrote:
               | You said _If a system is critical_
               | 
               | I thought you were making a more general statement about
               | all critical systems, that's all. And since many critical
               | systems have a safety factor in play, I wanted to
               | distinguish them as not always being a good target for a
               | Blockchain solution to the problems of consensus.
               | 
               | Blockchain is a very interesting solution to the problem
               | of obtaining consensus in the face of imperfect inputs,
               | there are other options so, like anything else, you
               | choose the right tool for the job. My own view is that--
               | given other established protocols, blockchain is going to
               | be overkill for dealing with some types of fault
               | tolerance. It is a very good fit for applications where
               | you want to minimize relying on the trust of humans. (And
               | other areas too, but right now I'm just speaking of the
               | narrow context of consensus amid inconsistent inputs)
        
               | noxer wrote:
               | Critical in the sense that the system operates/keeps
               | operating/does not reach an invalid state. It could be
               | for safety, but in general it's more to avoid financial
               | damage. Downtime of any kind usually results in huge
               | financial losses and people working extra shifts. This
               | was my main point.
               | 
               | >...blockchain is going to be overkill for dealing with
               | some types of fault tolerance
               | 
               | But in this case it likely isn't. The current system
               | already works with a chain of blocks; it just lacks the
               | distributed checking and all that stuff. But
               | "blockchains" aren't some secret sauce; in this case it's
               | just a way to implement a distributed append-only
               | database with field-proven tech. It can be as lightweight
               | as any other solution. The consensus part is completely
               | irrelevant anyway, because all nodes are operated by one
               | entity. But due to the use case (money/value) of modern
               | DLTs ("blockchains") they are incredibly reliable by
               | design. The oldest DLTs that use FBA (instead of PoW/PoS)
               | have been running for 9+ years without any error or
               | downtime. Recreating a similarly reliable system would be
               | months and months of work followed by months of testing.
        
           | aseipp wrote:
           | This is just learned helplessness because Intel were stingy
           | as shit for over a decade and wanted to segregate their
           | product lines. Error correction is literally prevalent in
           | every single part of every PHY layer in a modern stack, it is
           | an absolute must, and the lack of error correction in RAM is,
           | without question, a ridiculous gap that should have never
           | been allowed in the first place in any modern machine,
           | especially given that density and bandwidth keep increasing
           | and will continue to do so.
           | 
           | When you are designing these systems, you have two options:
           | you either use error correcting codes and increase channel
           | bandwidth to compensate for them (as a result of injected
           | noise, which is unavoidable), or you lower the transfer rate
           | so much as to be infeasible to use, while also avoiding as
           | much noise as you can. Guess what's happening to RAM? It
           | isn't getting slower or less dense. The error rate is only
           | going to increase. The people designing this stuff aren't
           | idiots. That's why literally _every_ other layer of your
           | system builds in error correction. Software people do not
           | understand this because they prefer to believe in magic due
           | to the fact all of these abstractions result in stable
           | systems, I guess.
           | 
           | All of the talk of hardware flip flops and all that shit is
           | an irrelevant deflection. Doesn't matter. It's just water
           | carrying and post-hoc justification because, again, Intel
           | decided that consumers didn't actually need it a decade ago,
           | and everyone followed suit. They've been proven wrong
           | repeatedly.
           | 
           | Complex systems are built to resist failure. They wouldn't
           | work otherwise. By definition, if a failure occurs, it's
           | because it passed multiple safeguards that were already in
           | place. Basic systems theory. Let's actually try building more
           | safeguards instead of rationalizing their absence.
        
             | goodpoint wrote:
             | > By definition, if a failure occurs, it's because it
             | passed multiple safeguards that were already in place.
             | 
             | Having worked on a good bunch of critical systems, I can
             | say there aren't multiple safeguards in most hardware.
             | 
             | E.g. a multiplication error in a core will not be detected
             | by an external device. Or a bit flip when reading cache, or
             | from a storage device.
             | 
             | Very often the only real safeguard is to do the whole
             | computation twice on two different hosts. I would rather
             | have many low-reliability hosts and do computation twice
             | than a few high-reliability and very expensive hosts.
             | 
             | Unfortunately the software side is really lagging behind
             | when it comes to reproducible computing. Reproducible
             | builds are a good step in that direction and it took many
             | decades to get there.
        
               | xvector wrote:
               | > Very often the only real safeguard is to do the whole
               | computation twice on two different hosts.
               | 
               | Three different hosts, for quorum, right?
        
               | ajross wrote:
               | Depends on the system. In this case it seems like retries
               | are possible after a failure, so two is sufficient to
               | detect bad data. You need three in real time situations
               | where you don't have the capability to go back and figure
               | it out.
        
               | goodpoint wrote:
               | Spot on. Very often doing an occasional (very rare) retry
               | is acceptable.
               | 
               | Sometimes doing the same processing twice can be also a
               | way to implement safe(r) rolling updates.
        
               | sanctified384 wrote:
               | Two hosts is efficient. Do it twice on two different
               | hosts and then compare the results. If there is a
               | mismatch, throw it away and redo it on two hosts, for a
               | total of four computations. But that happens only if the
               | difference really was due to a bit flip, the chance of
               | which is exceedingly rare. In all the rest of the cases,
               | you get away with two instead of three computations.
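               | 
               | Roughly, in Python (the "hosts" below are just stand-in
               | function calls rather than real remote machines):
               | 
               |     import hashlib
               | 
               |     def run_on_host(host, payload):
               |         # Stand-in for dispatching the work to a machine.
               |         return hashlib.sha256(payload).digest()
               | 
               |     def compute_checked(payload, hosts):
               |         """Run on two hosts; on disagreement, retry both."""
               |         while True:
               |             a = run_on_host(hosts[0], payload)
               |             b = run_on_host(hosts[1], payload)
               |             if a == b:   # the overwhelmingly common case
               |                 return a
               |             # Mismatch: a fault hit one side, so redo the
               |             # pair (four computations in total).
               | 
               |     result = compute_checked(b"entry", ["host-a", "host-b"])
               |     print(result.hex()[:16])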
        
               | aseipp wrote:
               | Safeguards do not just exist at the level of technology
               | but also politics, social structures, policies, design
               | decisions, human interactions, and so on and so forth.
               | "Criticality" in particular is something defined by
               | humans, not a vacuum, and humans are components of all
               | complex systems, as much as any multiplier or hardware
               | unit is. The fact a multiplier can return an error is
               | exactly in line with this: it can only happen after an
               | array of other things allow it to, some of them not
               | computational or computerized at all. And not every
               | failure will also result in catastrophe as it did here.
               | More generally such failures cannot be eliminated,
               | because latent failures exist everywhere even when you
               | have TMR or whatever it is people do these days. Thinking
               | there is any "only real safeguard" like quorums or TMR is
               | _exactly_ part of the problem with this line of thought.
               | 
               | The quote I made is actually very specifically in
               | reference to this paper, in particular point 2, which is
               | mandatory reading for any systems engineer, IMO, though
               | perhaps the word "safeguard" is too strong for the taste
               | of some here. But focusing on definitions of words is
               | besides the point and falls into the same traps this
               | paper mentions: https://how.complexsystems.fail/
               | 
               | Back to the original point: is ECC the single solution to
               | this catastrophe? No, probably not. Systems are
               | constantly changing and failure is impossible to
               | eliminate. Design decisions and a number of other
               | decisions could have mitigated it and caused this failure
               | to not be catastrophic. Another thing might also cause it
               | to topple. But let's not pretend like we don't know what
               | we're dealing with, either, when we've already built
               | these tools and _know they work_. We've studied ECC
               | plenty! You don't need to carry water for a corporation
               | trying to keep its purse filled to the brim (by cutting
               | costs) to proclaim that failure is inevitable and most
               | things chug on, regardless. We already know that much.
        
             | Shikadi wrote:
             | People who tout this don't understand the probability of
             | bit flips. It's measured in failures per _billion_ hours of
             | operation. This matters a ton in an environment with
             | thousands of memory modules (data centers and super
             | computers), but you're unlikely to experience a single RAM
             | bit flip more than once or twice in your entire life.
             | 
             | Edit: there's some new (to me) information from real world
             | results, interesting read.
             | https://www.zdnet.com/article/dram-error-rates-nightmare-
             | on-...
             | 
             | Looks like things are worse than I thought (but still
             | better than most people seem to think). Interesting to note
             | that the motherboard used affects error rate, and it seems
             | that part of it is a luck-of-the-draw situation where some
             | DIMMs have more errors than others despite being from the
             | same manufacturer.
        
               | DevKoala wrote:
               | I was going to comment, but you edited your post. Yes, it
               | is worse than we usually think on the software side.
        
               | Shikadi wrote:
               | I still think it made sense up until now to not bother
               | with it on consumer hardware, and even at this point. The
               | probability of your phone having a software glitch
               | needing a reboot is way higher. Now that it's practically
               | a free upgrade? Should be included by default. But I
               | still don't think it's nearly as nefarious as people make
               | it out to be that it has been this way for so long
        
               | DevKoala wrote:
               | I don't think they are an issue in a phone, but in a
               | system like a blockchain that goes through so much effort
               | to achieve consistency, the severity of the error is
               | magnified, hence the lower tolerance for the error rate.
        
               | kevin_thibedeau wrote:
               | Bit flips are _guaranteed_ to happen in digital systems.
               | No matter how low the probability is, it will never be
               | zero. You can't go around thinking you're going to dodge
               | a bullet because it's unlikely. If it weren't for the
               | pervasive use of error detection in common I/O protocols
               | you would be subjected to these errors much more
               | frequently.
        
             | mtkd wrote:
             | 'stingy as shit' or maximising short-term shareholder value
             | -- the hardware is possibly not the only broken model here
        
               | aseipp wrote:
               | You're probably right that I'm giving them a bit too much
               | credit on that note. Ceterum censeo, and so on.
        
           | ineedasername wrote:
            | _That's a bit hyperbolic._
           | 
           | Would ECC have avoided the issue in this case? If so then
           | it's hard not to agree that it should be considered a minimum
           | standard. It looks like Yeti 2022 isn't going to survive this
           | intact, and while they can resolve the issues in other ways,
           | not everyone will always be so fortunate and ECC is a
           | relatively small step to avoid a larger problem.
        
           | hda111 wrote:
           | Using ECC is a no-brainer. Even the Raspberry Pi 4 has ECC
           | RAM. It's not particularly expensive; it's only an artificial
           | limitation Intel has introduced for consumer products.
        
             | dijit wrote:
             | Thought you were wrong.
             | 
             | Checked it out.
             | 
             | You were right..
             | 
             | Under "1.2. Features"
             | 
             | https://datasheets.raspberrypi.org/cm4/cm4-datasheet.pdf
             | 
             | https://datasheets.raspberrypi.org/rpi4/raspberry-
             | pi-4-produ...
        
           | lima wrote:
           | > This is a failed cert, which is a failure that can be
           | reasonably easily handled by manual intervention.
           | 
           | This isn't a misissued cert that can be revoked, it
           | permanently breaks the CT log in question since the error
           | propagates down the chain.
        
             | tialaramex wrote:
             | Yes, this kills Yeti 2022. There's a bug referenced which
             | refers to an earlier incident where a bitflip happened in
             | the logged certificate data. _That_ was just fixed.
             | Overwrite with the bit flipped back, and everything checks
             | out from then onwards.
             | 
             | But in this case it's the hash record which was flipped,
             | which unavoidably taints the log from that point on.
             | Verifiers will forever say that Yeti 2022 is broken, and so
             | it had to be locked read-only and taken out of service.
             | 
             | Fortunately, since modern logs are anyway sharded by year
             | of expiry, Yeti 2023 already existed and is unaffected.
             | DigiCert, as log operator, could decide to just change
             | criteria for Yeti 2023 to be "also 2022 is fine" and I
             | believe they may already have done so in fact.
             | 
             | Alternatively they could spin up a new mythical creature
             | series. They have Yeti (a creature believed to live in the
             | high mountains and maybe forests) and Nessie (a creature
             | believed to live in a lake in Scotland) but there are
             | plenty more I'm sure.
        
             | ajross wrote:
             | It doesn't break anything that I can see (though I'm no
             | expert on the particular protocol). Our ability to detect
             | bad certs isn't compromised, precisely because this was
             | noticed by human beings who can adjust the process going
             | forward to work around this.
             | 
             | Really the bigger news here seems to be a software bug: the
             | CT protocol wasn't tolerant of bad input data and was
             | trusting actors that clearly can't be trusted fully. Here
             | the "black hat" was a hardware glitch, but it's not hard to
             | imagine a more nefarious trick.
        
               | zinekeller wrote:
               | Your statement is, to be frank, nonsensical. The
               | _protocol_ itself isn't broken; at least for previous
               | Yeti instances, certificate data _are_ correctly parsed
               | and rejected.* In this instance, it seems that the data
               | was verified already pre-signing BUT was flipped mid-
               | signing. This isn't the fault of how CT was designed but
               | rather a hardware failure that requires correction there.
               | (Or at least that's the likely explanation; it _could be_
               | a software bug° _but_ it would be a very consistent and
               | obvious behaviour if it were indeed a software bug.)
               | 
               | On the issue of subsequent invalidation of all submitted
               | certificates, this is prevented by submitting to at least
               | 3 different entities (as of now, there's a discussion on
               | whether this should be increased), so if a log is
               | subsequently found to be corrupted, the operator can send
               | an "operator error" signal to the browser, and any
               | tampered logs are blacklisted from browsers. (Note that
               | all operators of CT lists are members of CA/B forum, at
               | least as of 2020. In standardisation phase, some
               | individuals have operated their own servers but this is
               | no longer true.)
               | 
               | * Note that if the cert details are nonsensical but
               | technically valid, it is still accepted _by design_,
               | because _all pre-certificates_ are countersigned by the
               | intermediate signer (which the CT log operator checks
               | from known roots). If the intermediate is compromised,
               | then the correct response is obviously a revocation and
               | possibly distrust.
               | 
               | ° At least the human-induced variety; you could say
               | that this incident is technically a software bug that
               | occurred due to a hardware fault.
        
               | ajross wrote:
               | So I'm learning about Yeti for the first time, but I
               | don't buy that argument. Corrupt transmitted data has been
               | a known failure mode for all digital systems since they
               | were invented. If your file download in 1982 produced a
               | corrupt binary that wiped your floppy drive, the response
               | would have been "Why didn't you use a checksumming
               | protocol?" and not "The hardware should have handled it".
               | 
               | If Yeti can't handle corrupt data and falls down like
               | this, Yeti seems pretty broken to me.
        
               | tptacek wrote:
               | Not handling corrupted data is kind of the point of
               | cryptographic authentication systems. Informally and
               | generally, the first test of a MAC or a signature of any
               | sort is to see if it fails on arbitrary random single bit
               | flips and shifts.
               | 
               | The protocol here seems to have done what it was designed
               | to do. The corrupted shard has simply been removed from
               | service, and would be replaced if there was any need. The
               | ecosystem of CT logs foresaw this and designed for it.
        
               | caf wrote:
               | Presumably it's possible to code defensively against this
               | sort of thing, by eg. running the entire operation twice
               | and checking the result is the same before committing it
               | to the published log?
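                | 
                | A minimal sketch of that idea in Python (hypothetical
                | names, not any real log implementation): run the same
                | deterministic computation more than once and refuse to
                | publish unless the results agree. As the reply below
                | notes, the runs really want to be on separate machines.
                | 
                |     import hashlib
                | 
                |     def tree_head(entries):
                |         # Stand-in for the real Merkle tree head computation.
                |         h = hashlib.sha256()
                |         for e in entries:
                |             h.update(hashlib.sha256(e).digest())
                |         return h.hexdigest()
                | 
                |     def publish_if_consistent(entries, compute=tree_head, runs=2):
                |         # Repeat the computation (ideally on separate hosts and
                |         # architectures) and only publish on full agreement.
                |         results = {compute(entries) for _ in range(runs)}
                |         if len(results) != 1:
                |             raise RuntimeError("mismatch - refusing to publish")
                |         return results.pop()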
        
               | makomk wrote:
               | Big tech companies like Google and Facebook have
               | encountered problems where running the same crypto
               | operation twice on the same processor deterministically
               | or semi-deterministically gives the same incorrect
                | result... so the check needs to be done on separate
               | hardware as well.
        
         | lima wrote:
         | Distributed, byzantine fault tolerant state machines solve
         | that. At worst, a single node will go out of sync.
        
           | noxer wrote:
           | This is a great way to say "blockchain" without getting
           | guaranteed down votes ;)
        
             | matthewdgreen wrote:
             | The poster isn't wrong. An entire chain shouldn't die
             | because of a memory error in one node.
        
             | InspiredIdiot wrote:
             | Yes, and also contrary to Wikipedia it provides a solid use
              | case for private blockchain:
              | https://en.wikipedia.org/wiki/Blockchain#Disadvantages_of_pr...
             | 
             | I think it is instructive to ask "why doesn't this mention
             | guarantee downvotes?" because I don't think it's just
             | cargo-culting. I doubt that many of those objecting to
             | blockchain are objecting to byzantine fault tolerance, DHT,
             | etc. Very high resource usage in the cost function, the
             | ledger being public and permanent (long term privacy risk),
             | negative externalities related to its use in a currency...
             | These are commonly the objections I have and hear. And they
             | are inapplicable.
             | 
             | Extending what the Wikipedia article says, it's basically
             | glorified database replication. But it also replicates and
             | verifies the calculation to get to that data so it provides
             | far greater fault tolerance. But since it is private you
             | get to throw out the adversarial model (mostly the cost
             | function) and assume failures are accidental, not
             | malicious. It makes the problem simpler and lowers the
             | stakes versus using blockchain for a global trustless
             | digital currency so I don't think we should be surprised
             | that it engenders less controversy.
        
             | eitland wrote:
             | You made me smile, however:
             | 
             | I'm one of those who can easily downvote blockchain stuff
              | mercilessly. It is not reflexive, though: I reserve it for
             | dumb ideas, it just so happens that most blockchain ideas I
             | see come off as dumb and/or as an attempted rip off.
        
           | [deleted]
        
           | goodpoint wrote:
            | No, 99% of the time you just need to do a computation twice
            | on different hosts. You don't need quorum or any other form
            | of distributed consensus.
        
         | fulafel wrote:
         | How did the company go out of business?
        
         | omegalulw wrote:
         | > Software under normal circumstances is remarkably resilient
         | to having its memory corrupted
         | 
         | Not really? What you are saying applies to anything that uses
         | hashing of some sort, where the goal by design is to have
         | completely different outputs even with a single bit flip.
         | 
         | And "resilience" is not a precise enough term? Is just
         | recovering after errors due to flips? Or is it guaranteeing
         | that the operation will yield correct output (which implies not
         | crashing)? The latter is far harder.
        
       | Hnrobert42 wrote:
        | I'd like to think the typo in the very last word of that thread
        | is a lucky bit flip.
        
       | plebianRube wrote:
       | In conclusion, the Yeti shart of 2022 will contain no more logs.
       | 
       | I breathed a sigh of relief when I read that final post.
        
         | AviationAtom wrote:
         | No more sharting
        
       | gbrown_ wrote:
       | Can't say I really like the title as it comes across as an
       | absolute statement whereas the bit flip could have happened for
       | any number of unknown reasons.
       | 
       | I saw a similar take on Twitter and whilst root causing such
       | things (especially when it's a single occurrence) isn't always
       | possible, shrugging and saying "cosmic rays" should be the last
        | thing to posit, not one of the first.
        
         | wolf550e wrote:
         | Hardware issues that cause bitflips like that one famous
         | issue[1] with Sun servers in ~2000 (supposedly caused by
          | radioactive isotopes in IBM-manufactured SRAM chips) are often
         | called "cosmic rays".
         | 
         | 1 - https://www.computerworld.com/article/2585216/mcnealy-
         | blames...
        
           | gbrown_ wrote:
           | > are often called "cosmic rays".
           | 
           | No! This exact issue is a classic example because at one
           | point the fault was attributed to cosmic rays by Sun before
           | the true cause came to light. As another commenter has said,
           | terminology matters.
        
         | XorNot wrote:
          | Why? When you do Raman spectroscopy, in any given session
          | you're likely to see one pass through a 1 cm^2 sample. Students
          | are taught that if you get a super sharp, spike-like peak,
          | first rerun it because you probably just saw a cosmic ray.
        
         | juped wrote:
         | "cosmic rays" are more of a well-known term of art for "single
         | bit flipped with unknown hardware cause" than a reference to
         | literal cosmic rays
        
           | drumbaby wrote:
            | I remember an older, less computer-savvy gentleman asking for
            | support because "the program isn't working", when after a few
            | questions we realized his computer wouldn't boot up (screen
            | dark, etc.). We thought his terminology was all screwed up.
            | But now I realize he just lacked the necessary PR skill. He
            | should have said that "the program isn't working" is a well-
            | known term of art for power users such as himself, and that
            | people who actually have a clue about computing finding this
            | imprecise term upsetting is their problem. It so happens
            | that aside from developing software, I'm a physicist working
            | on particle physics (specifically, I make my living in
            | industry from cosmic rays). So I can assure you that "cosmic
            | rays" actually mean something. Something very specific. Books
            | have been written full of specific, intelligent things we can
            | say about cosmic rays. If you go and appropriate the term to
            | describe a whole bunch of phenomena because you can't be
            | bothered to distinguish between them, you're in the exact
            | same boat as that gentleman from the start of the story.
            | Using wrong terms prevents understanding, as can be seen in
            | all the stories linked elsethread so far. For example, using
            | thinner silicon typically reduces the rate of Single-Event
            | Upsets (the malfunctions caused by cosmic rays), but using
            | smaller silicon components typically increases the rate of
            | malfunctions due to quantum fluctuations. The latter
            | typically happen in specific hardware that was manufactured
            | not-quite-as-well as the rest. SEUs happen at the same rate
            | in all hardware.
        
             | noduerme wrote:
             | You're saying you have to engineer chip components so
             | they're small enough not to be hit as often by cosmic rays,
             | but large enough to avoid quantum fluctuations? Are quantum
             | fluctuations something Intel is dealing with regularly as
             | they get down under 3nm?
        
               | eloff wrote:
               | Yes, quantum tunneling (an electron "teleporting" into or
               | out of transistors) has been an issue everyone has had to
               | design around for multiple process nodes now.
        
             | shawnz wrote:
              | > If you go and appropriate the term to describe a whole
              | bunch of phenomena because you can't be bothered to
              | distinguish between them
             | 
             | Cosmic ray is the appropriate colloquial term. Just like
             | "bug" is the appropriate term to describe computer problems
             | that have nothing to do with insects. It's a well
             | established colloquialism and not simply terminology made
             | up on the spot like in your example.
        
             | quesera wrote:
             | Go easy on the old guy. It sounds to me like he was
             | completely correct, at the appropriate level of abstraction
             | for him.
             | 
             | And as for cosmic rays...this may be a sensitive topic for
             | you (or maybe you're just in a condescending mood), so I'll
             | tread carefully, but seems simple to me.
             | 
             | The actual cause of the problem, given the instrumentation
             | in place at the time of incident, is unknowable. But it
             | might have been a subatomic particle. It absolutely
             | definitively is, sometimes.
             | 
             | Metonymy might be imprecise, but it's human. I'm not sure
             | what standard you intend to hold commenters here (or old
             | guys with computers) to.
        
             | [deleted]
        
           | gbrown_ wrote:
           | It is not a "well-known term", it is a cliche.
        
             | Intermernet wrote:
             | Cliches are all well known terms. Not all well known terms
             | are cliches.
        
               | [deleted]
        
           | solarkraft wrote:
           | Interesting, because I was thinking of ... cosmic rays.
        
             | juped wrote:
             | sometimes it really is cosmic rays! but you usually can't
             | know after the fact
        
       | benlivengood wrote:
       | What log operators should be doing, and I am surprised they are
       | not, is verifying new additions to their log on a separate system
       | or systems before publishing them. In the worst case scenario
       | they might hit a hardware bug that always returns the wrong
       | answer for a particular combination of arguments to an
       | instruction in all processors within an architecture, so
       | verification should happen on multiple architectures to reduce
        | that possibility. If incorrect additions are detected, they can
        | be deleted without publishing and the correct addition can be
        | recreated, verified, and published.
       | 
        | ECC RAM would help somewhat, but independent verification is
        | orders of magnitude better: because it's a cryptographic
        | protocol, a reliable way of generating broken log entries that
        | validate on another system would constitute a reliable way to
        | generate SHA2 collisions.
        
         | dsfhdsfhdsf wrote:
         | This is a rare example of a problem that a (closed-membership,
         | permissioned) blockchain is a good fit for. CT logs are already
         | merkle trees. If a consensus component were added, entries
         | would only be valid if the proposed new entry made sense to all
         | parties involved.
        
           | benlivengood wrote:
           | Blockchains are fork-tolerant whereas append-only SCT logs
           | are not, intentionally so.
        
             | dsfhdsfhdsf wrote:
             | You're likely referring to specific implementations of
             | blockchains.
             | 
             | There is no requirement that the consensus mechanism used
             | permit forking in a blockchain.
        
       | cletus wrote:
       | Cosmic-ray bit flipping is real and it has real security
       | concerns. This also makes Intel's efforts at market segmentation
       | by not having ECC support in any consumer CPUs [1] even more
       | unforgivable and dangerous.
       | 
       | Example: bitsquatting on domains [2].
       | 
       | [1]: https://arstechnica.com/gadgets/2021/01/linus-torvalds-
       | blame...
       | 
       | [2]: https://nakedsecurity.sophos.com/2011/08/10/bh-2011-bit-
       | squa...
        
         | rkangel wrote:
         | I don't understand the issue with market segmentation here. I
         | can absolutely see the reason why all of my servers should have
         | ECC, but I don't see why my gaming PC should (or even my work
         | development machine). What's the worst case impact of the
         | (extremely rare) bit-flip on one of those machines?
        
         | judge2020 wrote:
          | Also, ECC RAM is technically supported on AMD's recent consumer
          | platform, although it's not advertised as such since they don't
          | do validation testing for it.
        
           | my123 wrote:
            | Not on all of it. They only leave it unlocked on CPUs, not
            | on APUs.
           | 
           | On AMD SoCs with integrated GPUs, ECC is not usable except if
           | you buy the Ryzen Pro variant. And those are all laptops and
           | most desktops...
        
             | luckman212 wrote:
             | Does e.g. the new Ryzen 7 5800U part support ECC?
        
               | my123 wrote:
               | No, it does not. The Ryzen 7 PRO 5850U does.
        
           | willis936 wrote:
           | I've read that reporting of ECC events is not supported on
           | consumer Ryzen. It's not a complete solution and since
           | unregistered ECC is being used, how can you even be sure the
           | memory controller is doing any error correction at all?
           | 
           | Someone would need to induce memory errors and publish their
           | results. I'd love to read it.
           | 
           | This is 4 years old now, but does produce some interesting
           | results.
           | 
           | https://hardwarecanucks.com/cpu-motherboard/ecc-memory-
           | amds-...
        
             | theevilsharpie wrote:
             | > This is 4 years old now, but does produce some
             | interesting results.
             | 
             | > https://hardwarecanucks.com/cpu-motherboard/ecc-memory-
             | amds-...
             | 
             | The author of that article doesn't have hands-on experience
             | with ECC DRAM, and mistakenly concludes that ECC on Ryzen
             | is unreliable because of a misunderstanding of how Linux
             | behaves when it encounters an uncorrected error. However,
             | the author at least includes screenshots which show ECC
             | functionality on Ryzen working properly.
             | 
             | > ...since unregistered ECC is being used, how can you even
             | be sure the memory controller is doing any error correction
             | at all?
             | 
             | ECC is performed by the memory controller, and requires an
             | extra memory device per rank and 8 extra data bits, which
             | unbuffered ECC DIMMs provide.
             | 
             | Registered memory has nothing to do with ECC (although in
             | practice, registered DIMMs almost always have ECC support).
             | It's simply a mechanism to reduce electrical load on the
             | memory controller to allow for the usage of higher-capacity
             | DIMMs than what unbuffered DIMMs would allow.
             | 
             | With respect to Ryzen, Zen's memory controller architecture
             | is unified, and owners of Ryzen CPUs use the same memory
             | controller found in similar-generation Threadripper and
             | EPYC processors (just fewer of them). Although full ECC
             | support is not required on the AM4 platform specifically
             | (it's an optional feature that can be implemented by the
             | motherboard maker), it's functional and supported if
             | present. Indeed, there are several Ryzen motherboards aimed
             | at professional audiences where ECC is an explicitly
             | advertised feature of the board.
        
         | deckard1 wrote:
         | AMD is also doing market segmentation on their APU series of
         | Ryzen. PRO vs. non PRO.
         | 
         | It should also be mentioned that Ryzen is a consumer CPU and
         | you're stuck (mostly) with consumer motherboards, none of which
         | tell you the level of ECC support they provide. Some
         | motherboards do nothing with ECC! Yes, they "work" with it. But
         | that means nothing. Motherboards need to say they correct
         | single bit errors and detect double bit errors.[1] None of the
         | Ryzen motherboards say this. Not a single one that I could
         | find.
         | 
         | Maybe Asrock Rack, but that's a workstation/server motherboard.
         | Which is also going for $400-600. You think that $50 Gigabyte
         | motherboard is doing the right thing regarding ECC? That's a
         | ton of faith right there.
         | 
         | Consumer Ryzen CPUs may support ECC, but that's meaningless
         | without motherboards testing it and documenting their support
         | of it. So no, Ryzen really does not support ECC if you ask me.
         | 
         | [1] https://cr.yp.to/hardware/ecc.html
        
           | sroussey wrote:
           | DDR5 has a modicum of ECC so things might slowly improve.
           | Maybe DDR6 will be full ECC and we will no longer have this
           | market segmentation in the 2030s. Wow, that's a long time
           | though.
           | 
              | PS: why didn't Apple do the right thing with the M1? My
              | guess is the availability of the memory, which again points
              | to changing it at the memory-spec level.
        
             | wmf wrote:
             | The M1 uses LPDDR4x which can't conveniently support ECC.
        
         | belter wrote:
         | Indeed. See this one for the results of Bit-squatting the
         | Windows Update domains:
         | 
         | "DEFCON 19: Bit-squatting: DNS Hijacking Without Exploitation"
         | 
         | https://youtu.be/aT7mnSstKGs
        
         | philjohn wrote:
         | At least things are starting to move in the right direction
         | with DDR5, although it's still not perfect.
        
       | crazydoggers wrote:
       | Anyone interested in seeing cosmic rays visually should check out
        | a cloud chamber. One can be made at home with isopropyl alcohol
       | and dry ice. Once you see it, it makes the need for ECC more
       | visceral.
       | 
       | Here's a video of a cloud chamber. The very long straight lines
       | are muons, which are the products of cosmic rays hitting the
       | upper atmosphere. Also visible are thick lines which are alpha
       | particles, the result of radioactive decay.
       | 
       | https://youtu.be/i15ef618DP0
        
       | [deleted]
        
       | gitgud wrote:
       | Why can't this be fixed?
       | 
       | Merkle trees are just linked hashes that depend on the previous
        | results, right? So why can't they just continue from the last
       | correct hash?
        
         | josephcsible wrote:
         | Because that would violate the append-only property that CT
         | logs are required to abide by.
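          | 
          | A toy hash-chain sketch in Python of why "just continue from
          | the last correct hash" doesn't work (a real CT log uses a
          | Merkle tree, but the effect is the same): every published head
          | commits to everything before it, so once a corrupted entry has
          | been signed into a head, later heads can't silently drop it
          | and still be consistent with what was already published.
          | 
          |     import hashlib
          | 
          |     def head_after(entries):
          |         # Each head commits to the previous head plus the new entry.
          |         head = b"\x00" * 32
          |         for e in entries:
          |             head = hashlib.sha256(head + e).digest()
          |         return head.hex()
          | 
          |     good = [b"cert-1", b"cert-2"]
          |     corrupt = [b"cert-1", b"cert-2-flipped"]
          | 
          |     # The log has already signed and published this head:
          |     published = head_after(corrupt)
          | 
          |     # Recomputing from the "good" history yields a head that can
          |     # never be proven consistent with the one already signed:
          |     assert head_after(good) != published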
        
         | [deleted]
        
       | c0ffe wrote:
        | This leaves me with a bit of concern about how reliable storage
        | encryption on consumer hardware is at all.
        | 
        | As far as I know, recent Android devices and iPhones have full
        | disk encryption by default, but do they protect the keys against
        | random bit-flipping?
       | 
       | Also, I guess that authentication devices (like YubiKey) are not
       | safe and easy to use at the same time, because the private key
       | inside can be damaged/modified by a cosmic ray, and it's not
       | possible (by design) to make duplicates. So, it's necessary to
       | have multiple of them to compensate, lowering their practicality
       | in the end.
       | 
        | Edit: from the software side, I understand that there are
        | techniques to ensure some level of data safety (checksumming,
        | redundancy, etc), but I thought it was OK to have some random
        | bit-flipping on hard disks (where I found it more frequently),
        | since it could be corrected from software. Now I realize that if
        | the encryption key is randomly changed in RAM, the data can
        | become permanently irrecoverable.
        
         | omegalulw wrote:
         | Disk is unreliable anyways, so they already have to use
         | software error correction. I suspect that protects against
          | these kinds of errors.
        
       | egberts1 wrote:
       | Bit flip is the bane of satellite communication, especially if
       | you tried to use FTP over it.
       | 
        | Also, for critical eon-duration record keeping, run-length
        | encoding or variable-size records are harder to maintain and to
        | recover from a large file on ROM-type storage than fixed-length
        | records or text-type formats (i.e. ASCII logs, or JSON) ...
        | especially against multiple cosmic-type alterations.
        | 
        | Sure, you could take a maximal Shannon-style approach with
        | multi-bit ECC, but text-based formats will be recovered a lot
        | quicker (given then-unknown formatting).
        
       | londons_explore wrote:
        | The design of Bitcoin prevents such an error, since all other
       | nodes in the network verify everything.
       | 
       | I wonder if the CT log should have been more like that...?
        
       | IncRnd wrote:
       | > I think the most likely explanation is that this was a hardware
       | error caused by a cosmic ray or the like, rather than a software
       | bug. It's just very bad luck :-(
       | 
        | These sorts of cosmic bitflips have been exploited as a security
        | weakness for some time. See bitsquatting for domain names.
       | 
       | ECC helps and so does software error detection on critical data
       | in memory or on disk. The problem is always that input data
       | should not be relied upon as correct. Do not trust data. It's a
       | difficult mind shift, but it needs to be done, or programs will
       | continue to mysteriously fail when data becomes corrupted.
        
       | secondcoming wrote:
       | > No additional certs can be logged to the Yeti 2022 shart.
       | 
       | Heh heh, 'shart'.
        
       | dannyw wrote:
       | Isn't ECC memory supposed to mitigate these kind of bit-flips,
       | specifically it should correct all single bit flips?
       | 
       | As this is a single bit-flip, why wasn't it corrected? Did ECC
       | memory fail? Or was this bit-flip induced in the CPU pipeline,
       | registers, or cache?
       | 
       | Do we need "RAID for ECC memory", where we halve user-accessible
       | RAM and store each memory segment twice and check for parity?
        
         | [deleted]
        
         | fulafel wrote:
         | These things exist, using trade names like chipkill or lockstep
         | memory. Though they don't need to sacrifice half of the memory
         | chips to get good error recovery properties.
         | 
         | Note that this is still not end-to-end protection of data
         | integrity. Bit flips happen in networking, storage, buses
         | between everything, caches, CPUs, etc. See eg [1]
         | 
         | [1] https://arxiv.org/abs/2102.11245 Silent Data Corruptions at
         | Scale (based on empirical data at Facebook)
        
           | formerly_proven wrote:
           | According to the Intel developer's manual L1 has parity and
           | all caches up from that have ECC. This would seem to imply
           | that the ring / mesh also has at least parity (to retry on
           | error). Parity instead of ECC on L1(D) makes sense since the
           | L1(D) has to handle small writes well, while the other caches
           | deal in lines.
        
             | eru wrote:
             | Of course, parity can only detect an odd number of bit
             | flips.
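              | 
              | A two-line illustration in Python: a parity bit is just the
              | XOR of the data bits, so two flips cancel out and go
              | unnoticed.
              | 
              |     data = 0b10110010
              |     parity = bin(data).count("1") % 2     # stored parity bit
              | 
              |     one_flip = data ^ 0b00000100          # detected
              |     two_flips = data ^ 0b00010100         # undetected
              | 
              |     assert bin(one_flip).count("1") % 2 != parity
              |     assert bin(two_flips).count("1") % 2 == parity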
        
               | RL_Quine wrote:
               | Some systems do have two parity bits but obviously
               | there's efficiency loss there.
        
             | namibj wrote:
             | Zen3 has parity on L1I and ECC on L1D and beyond. Even ECC
             | on consumer hardware (it's just a few extra traces on the
             | motherboard).
        
           | willis936 wrote:
           | Checksumming filesystems and file transfer protocols cover
           | many cases. SCP, rsync, btrfs, and zfs all fix this problem.
           | 
           | As for guaranteeing the computed data is correct: I know
           | space systems often have two redundant computers that
           | calculate everything and compare results. It's crazy
           | expensive and power demanding, but it all but solves the
           | problem.
        
             | belter wrote:
             | If they have two computers and results differ, how do they
             | decide which one is correct ? ;-)
        
               | willis936 wrote:
               | They don't. They do it again. If they never agree then
               | Houston has a real problem.
        
               | luckman212 wrote:
               | a third system?
        
               | bonzini wrote:
               | Usually they have an odd number. The Space Shuttle had
               | five, of which the fifth was running completely different
               | software. In case of a 2/2 split the crew could shut down
               | a pair (in case the failure was clear) or switch to the
               | backup computer.
        
               | X6S1x6Okd1st wrote:
                | Interestingly, Ethereum also takes the completely
                | different software approach: there are 5 different main
                | clients
        
         | Jolter wrote:
          | IIRC, ECC memory can correct single-bit flips. It can
          | detect/warn/fail on double-bit flips. And it cannot detect
          | triple-bit flips. This might be a simplified understanding, but
          | if this has happened only once, that seems to match up with my
          | intuitive understanding of the probability of a triple-bit flip
          | occurring in a particular system.
        
           | willis936 wrote:
           | You multiply the probability of two random events to get the
           | probability they will happen at the same time. If the
           | expected value of a bit flip is 10^-18 then two would be
            | 10^-36 and three would be 10^-54.
           | 
           | At some point it becomes a philosophical question of how much
           | can the tails of the distribution be tolerated. We've never
           | seen a quantum fluctuation make a whale appear in the sky.
        
             | jeffbee wrote:
             | DRAM failures are not independent events, so it's not
             | appropriate to multiply the probabilities like that. Faults
             | are often clustered in a row, column, bank, page or
             | whatever structure your DRAM has, raising the probability
             | of multi-bit errors.
        
               | BenjiWiebe wrote:
               | I believe the usual concern is bit flips due to subatomic
               | particles, and as far as I'm aware that only flips one
               | bit per particle.
        
               | jeffbee wrote:
               | I don't see why a high-energy particle strike would
               | confine itself to a single bit. The paper I posted
               | elsewhere in this thread says that "the most likely cause
               | of the higher single-bit, single-column, and single-bank
               | transient fault rates in Cielo is particle strikes from
               | high-energy neutrons". In the paper, both single-bit and
               | multi-bit errors are sensitive to altitude.
        
               | sgtnoodle wrote:
               | A single particle strike would only affect a single
               | transistor. If that transistor controls a whole column of
               | memory, then sure it could corrupt lots of bits. With
               | ECC, though, it would probably result in a bunch of ECC
               | blocks with a single bit flip, rather than a single ECC
               | block with several bit flips.
        
         | teddyh wrote:
         | User 'JoshTriplett' here suggests using forward error
         | correction (FEC) in RAM:
         | 
         | https://news.ycombinator.com/item?id=11604918
        
         | dijit wrote:
         | These are all very fair statements but there's no guarantee
         | that ECC memory was even used. Computers typically fail open
         | when ECC is potentially present but not available.
         | 
          | People also cite early-stage Google and intentionally do not
          | buy ECC components, running more consumer hardware for
          | production workloads.
          | 
          | Even if Google later recanted that theology.
        
           | caspper69 wrote:
           | It's always humorous to me when people use the term theology
           | in situations such as this; it makes me wonder, as human
           | mental bandwidth becomes more strained and we increase our
           | specializations to the n-th degree, what _will_ constitute
           | theology in the future?
           | 
           | Food for thought.
        
             | loa_in_ wrote:
             | The benevolent and malevolent rogue AI, eluding capture or
             | control by claiming critical infrastructure. Some
             | generations of humans will pass and the deities in the
             | outlets will become.
        
               | caspper69 wrote:
               | I like it!
        
             | diegoperini wrote:
              | The future is already here and we call it consensus.
              | Trusting your peers, believing they are honest and
              | proficient in their respective fields, is a natural human
              | response to unknown phenomena.
        
           | ZiiS wrote:
            | Public CAs are not that type of people; I would be
            | disappointed if they were not running two separate systems
            | checking each other for consistency; having top-of-the-range
            | ECC running well inside its specification must be table
            | stakes.
        
             | dijit wrote:
             | > Public CAs are not that type of people
             | 
             | I think you hold public CAs to a higher standard than many
             | hold themselves to.
             | 
             | There are hundreds of CAs and many (if not most) are
             | shockingly awful.
             | 
             | Which is why we have had a huge push back against the PKI
             | cartels.
        
               | tialaramex wrote:
               | Not hundreds. There are currently 52 root CA operators
               | trusted by Mozilla (and thus Firefox, but also most Linux
               | systems and lots of other stuff) a few more are trusted
               | only by Microsoft or Apple, but not hundreds.
               | 
               | But also, in this context we aren't talking about the CAs
               | anyway, but the Log operators, and so for them
               | reliability is about staying qualified, as otherwise
               | their service is pointless. There are far fewer of those,
               | about half-a-dozen total. Cloudflare, Google, Digicert,
               | Sectigo, ISRG (Let's Encrypt), and Trust Asia.
               | 
               | [Edited, I counted a column header, 53 rows minus 1
               | header = 52]
        
         | the8472 wrote:
         | Instead of ECC it would also be possible to run the log machine
         | redundantly and have each replica cross-check the others before
         | making an update public. I assume the log calculation is
         | deterministic.
        
         | jacquesm wrote:
         | Process enough data and even ECC can - and will - fail
         | undetected. Any kind of mechanism you come up with is going to
         | have some rate of undetected errors.
        
           | rob_c wrote:
            | Given the rate required for this, it's not a reasonable
            | assumption. It's like saying Amazon sees SHA-256 collisions
            | between S3 buckets. It just doesn't happen in practice.
        
             | jacquesm wrote:
              | Those two are many orders of magnitude apart; undetected
              | bit flips in spite of ECC are a fact of life in any large
              | computing installation.
        
             | jeffbee wrote:
             | Undetected ECC errors are common enough to see from time to
             | time in the wild. This paper estimates that a supercomputer
             | sees one undetected error per day.
             | 
             | https://www.cs.virginia.edu/~gurumurthi/papers/asplos15.pdf
        
         | dathinab wrote:
         | In my experience it's better to run redundancy on a higher
         | abstraction level.
         | 
         | I.e. (simplified) you do the computation twice on different
         | systems then interchange hashes of the result and if they match
         | you continue.
        
         | hbogert wrote:
          | ECC does parity for memory, but there's more between memory and
          | the CPU. Easy candidates: the memory controller, the CPU
          | registers.
        
       | tsegratis wrote:
        | Do nuclear missiles have this anywhere?
        | 
        |     bool launch = false;
        |     if(launch) ...
        
         | driverdan wrote:
         | This is why any system that can cause physical harm should have
         | hardware interlocks. A computer error can be bypassed with a
         | physical switch.
         | 
         | For something like nuclear missile launch control you have
         | redundant systems and hardware interlocks that require human
         | intervention to launch.
        
         | nabla9 wrote:
          | There have been false alerts caused by single chip failures. In
          | 1980 there was a nuclear alert in the US that lasted over three
          | minutes.
         | 
         | Generally critical systems can't be armed without physical
         | interaction from humans. It's not just computer logic, but
            | powering up the system that can do the launch. It does not
            | matter what the logic does as long as the ignition system has
            | not been powered up via its physical switch.
        
           | bbarnett wrote:
           | I watched this documentary when young, the computer thought
           | it was all a game?
        
         | oconnor663 wrote:
         | I don't know about nuclear missiles, but on the Space Shuttle I
         | think they had four duplicate flight computers, and the outputs
         | of all of them would be compared to look for errors. (They also
         | had a fifth computer running entirely different software, as a
         | failover option.)
        
           | detaro wrote:
           | The Space Shuttle also had a HP-41C calculator with special
           | software to help with manual flying in case of a general
           | computer failure: https://airandspace.si.edu/collection-
           | objects/calculator-han...
           | 
           | I wonder how viable that was for different stages in the
           | flight.
        
         | jacquesm wrote:
          | It doesn't really matter whether they have
          | 
          |     if (launch1 && launch2 && launch3) {
          |         launch();
          |     }
         | 
         | either because a single bit flip (of the code) could still
         | cause a launch. You'd hope that it was at least stored in ROM
         | and that ECC ensures that even if such a bit flips it does not
         | lead straight to Armageddon.
         | 
         | There are some 'near miss' stories where a single switch made
         | all the difference:
         | 
         | https://www.theatlantic.com/technology/archive/2013/09/the-s...
         | 
         | So I would not be all that surprised if there are equivalent
         | single bits.
        
           | caf wrote:
           | It should in-principle be possible to write a branch where
            | the code itself is single-bit-error resistant, in pseudo-
            | machine-code something like:
            | 
            |     LOAD [launch1]
            |     COMPARE 0x89abcdef
            |     JUMP_IF_NOT_EQUAL [fail_label]
            |     LOAD [launch2]
            |     COMPARE 0x01234567
            |     JUMP_IF_NOT_EQUAL [fail_label]
            |     LOAD [launch3]
            |     COMPARE 0xfedcba98
            |     JUMP_IF_NOT_EQUAL [fail_label]
            |     BRANCH [launch_label]
           | 
           | You also need to ensure that launch_label and the location of
           | the branch instruction are both more than one bit away from
           | fail_label. You can duplicate the JUMP_IF_NOT_EQUAL
           | instructions as needed - or indeed the whole block before the
           | BRANCH - as necessary to ensure that.
        
           | thechao wrote:
           | Obviously this anecdote is decades out-of-date, but my first
            | boss's PhD thesis was for an automatic small-airplane guidance
            | system. I mean: as long as your plane was on a high-speed
           | ballistic arc and needed a guidance system that only ran for
           | about 25 minutes.
           | 
           | The guidance system used mercury & fluidic switches, in case
           | the small aircraft encountered a constant barrage of
           | extremely large EMPs.
        
             | jacquesm wrote:
             | Hehe, that's the most 'between the lines' comment ever on
             | HN, congrats.
        
       | rob_c wrote:
       | say after me... ECC
       | 
        | If it costs money to produce or is worth keeping, for sanity's
        | sake use ECC. It's just common best practice. Failing that, yes,
        | you have to burn CPU to verify.
        
       | cyounkins wrote:
        | A while back I encountered what I thought was hardware memory
       | corruption that turned out to be a bug in the kernel [1]. Restic,
       | a highly reliable backup program, is written in Go, and Go
       | programs were highly affected [2] by a memory corruption bug in
       | the kernel [3].
       | 
       | [1] https://forum.restic.net/t/troubleshooting-unreproducible-
       | co...
       | 
       | [2] https://github.com/golang/go/issues/35777
       | 
       | [3] https://bugzilla.kernel.org/show_bug.cgi?id=205663
        
       | Waterluvian wrote:
       | Are "cosmic rays" actually the only or primary way bits get
       | flipped? Or is it just a stand-in for "all the ways non-ECC RAM
       | can spontaneously have an erroneous bit or two"?
        
         | heyoni wrote:
         | The latter apparently.
        
         | bloak wrote:
         | According to Wikipedia
         | (https://en.wikipedia.org/wiki/Soft_error): "in modern devices,
         | cosmic rays may be the predominant cause".
         | 
         | However, radioactive isotopes in the packaging used to be a
         | major cause. I liked this bit: "Controlling alpha particle
         | emission rates for critical packaging materials to less than a
         | level of 0.001 counts per hour per cm2 (cph/cm2) is required
         | for reliable performance of most circuits. For comparison, the
         | count rate of a typical shoe's sole is between 0.1 and 10
         | cph/cm2."
        
         | vbezhenar wrote:
         | Faulty RAM can produce errors and it's hard to catch those,
         | even memtest might not detect it. I'm not sure if non-faulty
         | non-ECC RAM can spontaneously have errors, but you can't be
         | sure, cosmic rays are real, unless you've put your PC into a
         | thick lead case, LoL.
        
           | toast0 wrote:
            | In my experience with ECC RAM on a couple thousand servers
            | over a few years, we had a couple machines throw one ECC
            | correctable error and never have an issue again.
           | 
           | It seemed more common for a system to move from no errors to
           | a consistent rate; some moved to one error per day, some
           | 10/day, some to thousands per second which kills performance
           | because of machine check exception handling.
           | 
            | The one-off errors could be cosmic rays or faulty RAM or
            | voltage droop or who knows what. The repeatable errors are
            | probably faulty RAM; replacing the RAM resolved the problem.
        
         | tgv wrote:
         | Well, "row hammer" notoriously can do that too.
        
       | henearkr wrote:
        | Bit flips like that should be easy to reverse. Flip just one bit
        | at the n-th place and test the certificate again, varying n
        | until it is valid. It's done in linear time.
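        | 
        | A rough sketch of that brute-force repair in Python (hypothetical
        | names; it assumes you hold the corrupted bytes plus a trusted
        | hash of the original, which, as the reply below notes, is not
        | the situation here):
        | 
        |     import hashlib
        | 
        |     def repair_single_bit_flip(corrupted: bytes, expected_sha256: str):
        |         # Try every single-bit flip of the data; linear in its length.
        |         for i in range(len(corrupted) * 8):
        |             candidate = bytearray(corrupted)
        |             candidate[i // 8] ^= 1 << (i % 8)
        |             if hashlib.sha256(candidate).hexdigest() == expected_sha256:
        |                 return bytes(candidate)
        |         return None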
        
         | Denvercoder9 wrote:
         | There's a bit flip in the recorded hash of the certificate, not
         | in the certificate itself. You'd need to break SHA-2 to reverse
         | that.
        
           | henearkr wrote:
           | Ah yes, indeed. If you can't have the certificate to check
           | the hash, then you're screwed.
        
       | jMyles wrote:
       | A tangent:
       | 
       | Six years later, with the $1,000 question uncollected, it seems
       | that a cosmic ray bit flip is also the cause of a strange
       | happening during a Mario 64 speedrun:
       | 
       | https://www.youtube.com/watch?v=aNzTUdOHm9A
        
       | Ovah wrote:
        | A few years ago there was a CCC(?) talk about registering domains
        | a bitflip away from the intended domain, which effectively
        | hijacked legitimate web traffic. While the chance of a bitflip is
        | low, it added up to a surprisingly large number of requests,
        | maybe a few hundred or so. Here is someone doing it for
        | windows.com:
       | https://www.bleepingcomputer.com/news/security/hijacking-tra...
        
         | IncRnd wrote:
         | It's a well-known attack on a security vulnerability called
         | bitsquatting.
        
       | layoutIfNeeded wrote:
        | Bit rot is real. Last month we had weird linker errors on one
       | of our build servers. Turns out that one of the binary libs in
       | the build cache got a single bit flipped in the symbol table,
       | which changed the symbol name, causing the linker errors. If the
        | bit-flip had occurred in the .TEXT section, then it wouldn't have
       | caused any errors at build time, and we would have released a
       | buggy binary. It might have just crashed, but it could have
       | silently corrupted data...
        
         | superjan wrote:
          | I am used to 'bit rot' referring to code becoming obsolete due
          | to lack of maintenance. Can't we use another term for actual
          | hardware errors?
        
           | noduerme wrote:
           | "sharting"
        
           | Kimitri wrote:
           | It may be a regional thing but I have never heard "bit rot"
           | refer to legacy code. In the retro computing circles bit rot
           | refers to hardware defects (usually floppies or other storage
           | media) caused by cosmic rays or other environmental hazards.
        
             | Leparamour wrote:
             | I have to agree with Kimitri here. This is the only context
             | in which I have ever encountered the term 'bit rot'.
        
               | MayeulC wrote:
               | I agree this is the primary context, but I've seen
                | unmaintained (or very old) software being referred to as
               | "bit rotting" by extension. As in, forward compatibility
               | might break due to obsolete dependencies, etc.
        
             | superjan wrote:
             | Yeah looks like I was confusing "software rot" and "bit
             | rot".
             | 
             | https://en.wikipedia.org/wiki/Software_rot
        
           | fmajid wrote:
           | I've always understood "bit rot" meaning data getting
            | silently corrupted on a storage device like a hard drive or
           | SSD.
        
             | joshspankit wrote:
             | Same here. "Bit rot" is then analogous to food rot: the
             | longer your data sits unverified, the more likely that
             | there will be flipped bits and therefore "rotten data".
        
         | kzrdude wrote:
          | I'm just thinking (of course) that you said _if the bit-flip
          | had occurred_, but it's probably already _when_ a bit flip
          | occurs in the .TEXT section; we don't know what it might
          | already have caused or what just passed without notice
          | (an unreproducible bug, or a bitflip in a function that's never
          | called, or whatever).
        
         | noduerme wrote:
         | Were you actually able to diff it down to a single bit flip?
        
           | Dn93 wrote:
           | It's just a story we usually tell the boss when the whole
           | team wants to goof off for a few days.
        
           | layoutIfNeeded wrote:
           | Yes. Via git diff --no-index and xxd.
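            | 
            | For the curious, a small Python equivalent of that kind of
            | comparison (the file names are hypothetical): find the first
            | differing byte between two copies of the same library and
            | show which bit flipped.
            | 
            |     def first_bit_diff(path_a, path_b):
            |         with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
            |             a, b = fa.read(), fb.read()
            |         for offset, (x, y) in enumerate(zip(a, b)):
            |             if x != y:
            |                 # For a single bit flip the XOR has exactly one bit set.
            |                 return offset, x ^ y
            |         return None
            | 
            |     # e.g. first_bit_diff("libfoo_good.a", "libfoo_bad.a") -> (1234, 0x04)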
        
             | noduerme wrote:
             | I have no words. Stories like this trigger PTSD for me. I
             | wasn't trying to be flip. That must have been a bitch to
             | figure out.
        
               | layoutIfNeeded wrote:
               | Well, it wasn't that hard to uncover actually. We knew
               | that the same build succeeds on our machines. So we only
               | had to find what the difference was between the two :)
               | 
               | As Arthur Conan Doyle put it: "Once you eliminate the
               | impossible, whatever remains, no matter how improbable,
                | must be the truth." ¯\_(ツ)_/¯
        
               | noduerme wrote:
               | Well done, Watson. This calls for a bit of snuff.
               | Seriously, this is the kind of thing that keeps me up at
               | night, and it's nice to hear a happy ending =D
        
         | wqweto wrote:
            | Is this bit rot happening on a server w/ ECC RAM? Just curious.
        
           | layoutIfNeeded wrote:
           | It was a MacStadium (https://www.macstadium.com) build
           | server, so most likely non-ECC.
        
             | joshspankit wrote:
             | Strange that build services are not just going full ECC,
             | especially with cheaper hardware now supporting it.
        
               | xoa wrote:
               | Says it's a Mac/iOS build platform. Since it's a
               | commercial service they're probably complying with the
               | license and thus using actual Mac hardware, and in turn
               | the only ECC option is the really awful value, outdated
               | Mac Pro. Seems more likely they're using Minis instead,
               | or at least mostly Minis. An unfortunate thing about
               | Apple hardware (says someone still nursing along a final
               | 5,1 Mac Pro for a last few weeks).
        
         | tialaramex wrote:
         | It's also an excuse/opportunity to write a fairly Hard SF
         | zombie story:
         | 
         | https://www.antipope.org/charlie/blog-static/bit-rot.html
        
         | 7373737373 wrote:
         | And that's why we need deterministic builds
        
           | layoutIfNeeded wrote:
           | Yes, but in this case this was a bit-flip in a third-party
           | binary dependency that we don't have the source code for.
        
         | fmajid wrote:
         | I've had a case where a bit flip in a TCP stream was not caught
         | because it happened in a Singapore government deep packet
         | inspection snoop gateway that recalculated the TCP checksum for
         | the bit-flipped segment:
         | 
         | https://blog.majid.info/telco-snooping/
        
       ___________________________________________________________________
       (page generated 2021-07-04 23:00 UTC)