[HN Gopher] Software engineering lessons from RCAs of greatest d...
       ___________________________________________________________________
        
       Software engineering lessons from RCAs of greatest disasters
        
       Author : philofsofia
       Score  : 172 points
       Date   : 2023-08-17 08:38 UTC (14 hours ago)
        
 (HTM) web link (anoopdixith.com)
 (TXT) w3m dump (anoopdixith.com)
        
       | formerly_proven wrote:
       | (Root Cause Analysis)
        
         | BiggusDijkus wrote:
         | This whole RCA terminology has to die. The idea that there
         | exists a single "Root Cause" that causes major disasters is
         | fundamentally flawed.
        
           | mytailorisrich wrote:
           | This is a generic term. It does not imply that there has to
           | be a single cause.
        
           | NikolaNovak wrote:
           | If we call it Root Causes Analysis can we keep the acronym?
        
           | rob74 wrote:
            | Actually everywhere "defense in depth" is used (not only in
            | computing, but also in e.g. aviation), there _can't_ be one
            | single cause for a disaster - each of the layers has to fail
            | for a disaster to happen.
        
             | jacquesm wrote:
              | Almost every accident, even where there is no defense in
              | depth, has more than one cause. Car accident: person A was
              | on the phone, person B didn't spot their deviation in time
              | to do anything about it: accident. A is the root cause. If
              | B had reacted faster there wouldn't have been an accident,
              | but there would still be cause for concern and there would
              | still be a culprit. The number of such near misses and
              | saves by others acts much like defense in depth, even if
              | it wasn't engineered in. But person B isn't liable even
              | though their lack of attention to what was going on is a
              | contributory factor. So root causes matter; that's the
              | first and clearest thing to fix. Other layers may be
              | impacted and may require work, but that isn't always the
              | case.
              | 
              | In software the root cause is often a very simple one: an
              | assumption didn't hold.
        
           | jacquesm wrote:
           | Yes, that's true in the general sense. But root causes are
           | interesting because they are the things that can lead to
           | insights that can help the lowest levels of engineering to
           | become more robust. But at a higher level it is all about
           | systems and the way parts of those systems interact, fault
           | tolerance (massively important) and ensuring faults do not
           | propagate beyond the systems they originate in. That's what
           | can turn a small problem into a huge disaster. And without
           | knowing the root cause you won't be able to track those
           | proximate causes and do something about it. So RCA is a
           | process, not a way to identify the single culprit. So this is
           | more about the interpretation of the term RCA than about what
           | RCA _really_ does.
        
           | tonyarkles wrote:
           | I think if the people you're working with insist on narrowing
           | it down to a single Root Cause, they're missing the entire
           | point of the exercise. I work with large drones day to day
           | and when we do an accident investigation we're always looking
           | for root causes, but there's almost always multiple. I don't
           | think we've ever had a post-accident RCA investigation that
           | resulted in only one corrective action. Several times we have
           | narrowed it down to a single software bug, but to get to the
           | point where a software bug causes a crash, there's always a
           | number of other factors that have to align (e.g. pilot was
           | unfamiliar with the recovery procedure, multiple cascaded
           | failures, etc)
        
         | PostOnce wrote:
         | I was excited that it might be about 1960s RCA computers being
         | software incompatible with IBM s/360
         | 
         | https://en.m.wikipedia.org/wiki/RCA_Spectra_70
         | 
         | Then I finished the headline and opened this Wikipedia article
        
       | WalterBright wrote:
       | It isn't just software.
       | 
        | The Deepwater Horizon was the result of multiple single points
        | of failure that zippered to catastrophe. Each of those single
        | points of failure could have been snipped at little to no cost.
       | 
       | The same with the Fukushima disaster. For example, venting the
       | excess hydrogen into the building, where it could accumulate
        | until a random spark set it off.
        
       | hliyan wrote:
       | I think the software industry itself has accumulated enough bugs
       | over the past few decades. E.g.
       | 
       | F-22 navigation system core dumps when crossing the international
       | date line: https://medium.com/alfonsofuggetta-it/software-bug-
       | halts-f-2...
       | 
       | Loss of Mars probe due to metric-imperial conversion error
       | 
        | I've had a few of these myself (e.g. a misplaced decimal that
        | turned $12mil into $120mil), but sadly cannot divulge details.
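        | A minimal sketch of the unit-mixing class of bug behind the
        | Mars probe loss (hypothetical names, not the actual flight
        | code): tagging values with one canonical unit catches the
        | mismatch at the interface instead of in flight.
        | 
        |     from dataclasses import dataclass
        | 
        |     LBF_S_TO_N_S = 4.448222  # pound-force seconds -> newton seconds
        | 
        |     @dataclass(frozen=True)
        |     class Impulse:
        |         newton_seconds: float  # one canonical unit only
        | 
        |         @classmethod
        |         def from_pound_force_seconds(cls, value):
        |             return cls(value * LBF_S_TO_N_S)
        | 
        |     def apply_trajectory_correction(impulse):
        |         # Downstream code only ever sees newton seconds.
        |         return impulse.newton_seconds
        | 
        |     # Ground software working in lbf*s has to convert explicitly;
        |     # handing over a raw float is now an obvious review smell.
        |     burn = Impulse.from_pound_force_seconds(12.5)
        |     print(apply_trajectory_correction(burn))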
        
         | RetroTechie wrote:
         | In terms of engineering (QC, processes etc), modern day
         | software industry is worse than almost any other industry out
         | there. :-(
         | 
          | And no, plain complexity or a fast-moving environment is a
          | factor but not _the_ issue. It's that steps are skipped which
          | are _not_ skipped in other branches of engineering (eg.
          | continuous improvement of processes, learning from mistakes &
          | implementing those lessons). In software land: same mistakes
          | made again & again & again, poorly designed languages remain
          | in use, the list goes on.
         | 
         | A long way still to go.
        
           | fredley wrote:
           | In any other field of engineering, the engineers are all
           | trained and qualified. In software 'engineering', not so
           | much.
        
             | LeonenTheDK wrote:
             | This is exactly why I support "engineer" being a protected
             | term, like Doctor. It should tell you that a certain level
             | of training and qualification has been met, to the point
             | that the engineer is responsible and accountable for the
             | work they do and sign off on. Especially for things that
             | affect safety.
             | 
             | Many software engineers these days are often flying by the
             | seat of their pants, moving quickly and breaking things.
              | Thankfully this _seems_ to largely be in places that
              | aren't going to affect life or limb, but I'm still rubbed
              | the wrong way seeing people (including myself mind you)
             | building run of the mill CRUD apps under the title of
             | engineer.
             | 
             | Is it a big deal? Not really. It's probably even
             | technically correct to use the term this way. But I do
             | think it dilutes it a bit. For context, I'm in Canada where
             | engineer is technically a protected term, and there are
             | governing bodies that qualify and designate professional
             | engineers.
        
               | zvmaz wrote:
               | Are all "engineers" trained on human error, safety
               | principles, and the like? The failures described in the
               | article are precisely _not_ software failures.
        
               | jacquesm wrote:
               | The important distinction is that engineers are
               | professionally liable.
        
               | mrguyorama wrote:
               | Yes? Most engineering programs (it might even be an
               | accreditation requirement) involve ethics classes and
               | learning from past failures.
               | 
               | My CS degree program required an ethics class and
               | discussed things like the CFAA and famous cases like
               | Therac-25, but nobody took it seriously because STEM
               | majors think they are god's gift to an irrational world.
        
               | jacquesm wrote:
               | > Thankfully this seems to largely be in places that
               | aren't going to affect life or limb
               | 
               | I've seen Agile teams doing medical stuff using the
               | latest hotness. Horrorshow.
               | 
               | I've also seen very, very clean software and firmware
               | development at really small companies.
               | 
               | It's all over the place and you have to look inside to
               | know what is going on. Though job advertisements
               | sometimes can be pretty revealing.
        
               | freeopinion wrote:
               | I'm curious how you think the word "Doctor" is protected.
               | 
               | Do you mean that History PhDs can't call themselves
               | Doctors?
               | 
               | Or chiropractors can't pass themselves off as doctors?
               | 
               | Or you mean Doctor J was licensed to perform basketball
               | bypass surgeries?
               | 
               | Or perhaps podiatrists can't deliver babies?
        
             | sanderjd wrote:
             | That training and qualification is only as good as the
             | processes and standards being trained for and qualified on.
             | We don't have those processes and standards to train
             | against (and frankly I'm not convinced we should or even
             | can) for generic "software engineers". I have a number of
             | friends who are PEs, and it isn't the training and
             | certification process that differentiates their work from
             | mine, it is that there are very clear standards for how you
             | engineer a safe structure or machine. But I contend that
             | there is not a way to write such standards for "software".
             | It's just too broad a category of thing. Writing control
             | software for physical systems is just very different from
             | writing a UX-driven web application. It would be odd and
             | wasteful to have the same standards for both things.
             | 
             | I do think qualification would make sense for more narrow
             | swathes of the "software engineering" practice. For
             | instance, "Automotive Control Software Engineer", etc.
        
           | zvmaz wrote:
           | > In terms of engineering (QC, processes etc), modern day
           | software industry is worse than almost any other industry out
           | there. :-(
           | 
           | How do you know that?
        
             | RetroTechie wrote:
             | Take airplane safety: plane crashes, cause of the crash is
             | thoroughly investigated, report recommends procedures to
             | avoid that type of cause for planecrashes. Sometimes such
             | recommendations become enforced across the industry.
             | Result: air travel safer & safer to the point where sitting
             | in a (flying!) airplane all day is safer than sitting on a
             | bench on the street.
             | 
             | Building regulations: similar.
             | 
             | Foodstuffs (hygiene requirements for manufacturers):
             | similar.
             | 
             | Car parts: see ISO9000 standards & co.
             | 
             | Software: eg. memory leaks - been around forever, but every
              | day new software is released that has 'em.
             | 
             | C: ancient, not memory safe, should _really_ only be used
             | for niche domains. Yet it still is everywhere.
             | 
             | New AAA game: pay $$ after year(s?) of development,
             | download many-MB patch on day 1 because game is buggy.
             | _Could_ have been tested better, but released anyway
             | 'cause getting it out & making sales weighed heavier than
             | shipping reliable working product.
             | 
             | All of this = not improving methods.
             | 
             | I'm not arguing C v. Rust here or whatever. Just pointing
             | out: better tools, better procedures exist, but using them
              | is more the exception than the rule.
             | 
             | Like I said the list goes on. Other branches of engineering
             | don't (can't) work like that.
        
               | jacquesm wrote:
               | Exactly. The driving force is there but what is also good
               | is that the industry - for the most part at least -
               | realizes that safety is what keeps them in business. So
               | not only is there a structure of oversight and
                | enforcement, there is also a strongly internalized
               | culture of safety created over decades to build on. An
               | engineer that would propose something obviously unsafe
               | would not get to finish their proposal, let alone
               | implement it.
               | 
               | In 'regular' software circles you can find the marketing
               | department with full access to raw data and front end if
               | you're unlucky.
        
             | jacquesm wrote:
             | Experience?
        
               | talldatethrow wrote:
               | On HN and reddit, experience doesn't count. Only reading
                | about others' experiences after they've been paid to write
               | research experiences.
        
               | pdntspa wrote:
               | Don't forget to cite your sources! The nerds will rake
               | you over the coals for not doing so.
        
           | serjester wrote:
           | If you're making the same mistakes over and over again I
           | think that says more about your company than it does about
           | the software industry.
           | 
           | My first job was at a major automotive manufacturer.
           | Implementing half the procedures they had would slow down any
           | software company 10X - just look at the state of most car
           | infotainment systems. If something is safety critical,
           | obviously this makes sense but the reality is 85% of software
           | isn't.
        
             | jacquesm wrote:
             | GP was speaking in the general sense not about their
             | company.
        
               | KnobbleMcKnees wrote:
               | Is that not coming from experience of working at a
               | software company? As I believe you said elsewhere
        
               | jacquesm wrote:
               | It could easily be from looking from the outside in, as
               | it is in my case.
        
           | jacquesm wrote:
           | And reading this thread it doesn't look as if there is much
           | awareness of that.
        
         | Exuma wrote:
         | My worst bug was a typo in a single line of html that removed
         | 3DS protection from many millions of dollars of credit card
         | payments
        
           | ownagefool wrote:
           | Pretty epic.
           | 
           | I was working for a webhosting company, and someone asked me
           | to rush a change just before leaving. Instead of updating
           | 1500 A records, I updated about 50k. Someone senior managed
           | to turn off the cron though, so what I actually lost was the
           | delta of changes between last backup and my SQL.
           | 
           | I was in the room for this though:
           | https://www.theregister.com/2008/08/28/flexiscale_outage/
        
             | CTDOCodebases wrote:
             | I love the title to that article "Engineer accidentally
             | deletes cloud".
             | 
             | It's like a single individual managed to delete the
             | monolithic cloud where everyone's files are stored.
        
               | jiscariot wrote:
               | That is eerily similar to what happened to us in IBM
               | "Cloud", in a previous gig. An engineer was doing
               | "account cleanup" and somehow our account got on the list
               | and all our resources were blown away. The most
                | interesting conversation was convincing the support
                | person that those deletion audit events were in fact
                | not us, but rather (according to the engineer's
                | LinkedIn page) an SRE at IBM.
        
               | ownagefool wrote:
               | This was ~14 years ago and both MS & AWS had loss of data
               | incidents iirc.
        
               | CTDOCodebases wrote:
               | Although it probably wasn't funny at the time I can
               | imagine how comical that conversation was.
               | 
               | Thinking about it further the term "cloud" is a good
               | metaphor for storing files on someone else's computer
               | because clouds just disappear.
        
               | jacquesm wrote:
                | I wonder if any such pathways remain at AWS, Google,
                | Apple and MS that would still allow a thing like that
                | to happen.
        
               | swalsh wrote:
               | At this point there's basically 3 clouds, and then
               | everyone else.
        
               | epolanski wrote:
               | AWS, Azure and Cloudflare?
        
               | yossi_peti wrote:
               | And Google Cloud Platform
        
               | ownagefool wrote:
                | Bear in mind, this was a small startup in 2008 that
               | claims to be the 2nd cloud in the world ( read on-demand
               | iaas provider ).
               | 
               | Flexiscale at the time was a single region backed by a
               | netapp. Each VM essentially had a thin-provisioned lun (
               | logical volume ), basically you copy on write the
               | underlying OS image.
               | 
                | So when someone accidentally deletes vol0, they take
                | out a whopping 6TB of data that takes ~20TB to restore
               | because you're rebuilding filesystems from safe mode (
               | thanks netapp support ). It's fairly monolithic in that
               | sense.
               | 
               | I guess I was 23 at the time, but I'd written the v2 API,
               | orchestrator & scheduling later. It was fairly naive, but
               | filled the criteria of a cloud, i.e. elastic, on-demand,
               | metered usage, despite using a SAN.
        
           | swalsh wrote:
           | My worst bug was changing how a zip code zone was fetched
           | from the cache in a large ecommerce site with tens of
            | thousands of users using it all day long. Worked great in DEV :D
           | but when the thundering herd hit it, the entire site came
           | down.
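            | A minimal sketch of that stampede and of a common
            | single-flight guard (hypothetical names, in-memory
            | stand-ins for the real cache and DB, error handling
            | omitted): on a miss, only the first caller recomputes the
            | entry and everyone else waits for it.
            | 
            |     import threading
            | 
            |     cache = {}              # stand-in for the shared cache
            |     lock = threading.Lock()
            |     inflight = {}           # key -> Event held by the leader
            | 
            |     def load_zip_zone_from_db(zip_code):
            |         # stand-in for the expensive query the herd melted
            |         return {"zip": zip_code, "zone": "A"}
            | 
            |     def get_zip_zone(zip_code):
            |         with lock:
            |             if zip_code in cache:
            |                 return cache[zip_code]
            |             done = inflight.get(zip_code)
            |             leader = done is None
            |             if leader:
            |                 done = inflight[zip_code] = threading.Event()
            |         if leader:
            |             value = load_zip_zone_from_db(zip_code)
            |             with lock:
            |                 cache[zip_code] = value
            |                 inflight.pop(zip_code, None)
            |             done.set()
            |             return value
            |         done.wait()  # followers block instead of hitting the DB
            |         with lock:
            |             return cache[zip_code]
            | 
            | A single user in DEV never hits the concurrent-miss path,
            | which is why this only shows up under the production herd.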
        
             | jacquesm wrote:
             | Startup, shutdown and migration are all periods of
             | significantly elevated risk. Especially for systems that
              | have been up and running for a long time there are all
              | kinds of ways in which things can go pear shaped. Drives
              | that die on
             | shutdown (or the subsequent boot up), raids that fail to
             | rebuild, cascading failures, power supplies that fail,
             | UPS's that fail, generators that don't start (or that run
             | for 30 seconds and then quit because someone made off with
             | the fuel) and so on.
        
           | hliyan wrote:
           | I posted this, in case we want to collect these gems:
           | https://news.ycombinator.com/item?id=37160295
        
           | GreenVulpine wrote:
           | You could call that a feature making payments easier for
           | customers! All 3DS does is protect the banks by
           | inconveniencing consumers since banks are responsible for
           | fraud.
        
             | feldrim wrote:
             | It's mostly the payment processor. It may or may not be the
             | bank itself.
        
             | jrockway wrote:
             | I believe they pass on the risk to merchants now. If you
             | let fraud through, $30 per incident or whatever. So
             | typically things like 3DS are turned on because that cost
             | got too high, and the banks assure you that it will fix
             | everything.
        
         | gostsamo wrote:
          | My funniest was a wrong param in a template generator which
          | turned off escaping of parameter values provided indirectly
          | by the users. Good that it was discovered during the yearly
          | pen testing analysis, because it led to shell execution in
          | the cloud environment.
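          | A minimal illustration of that class of bug with Jinja2 (an
          | assumed stand-in here, not necessarily the actual template
          | engine involved): a single environment flag decides whether
          | indirectly supplied values are escaped.
          | 
          |     from jinja2 import Environment
          | 
          |     user_value = "<script>alert('pwned')</script>"
          |     template = "Hello {{ name }}"
          | 
          |     unsafe_env = Environment(autoescape=False)  # the "wrong param"
          |     safe_env = Environment(autoescape=True)
          | 
          |     print(unsafe_env.from_string(template).render(name=user_value))
          |     # payload passes through verbatim
          |     print(safe_env.from_string(template).render(name=user_value))
          |     # payload comes out HTML-escaped and inert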
        
         | kefabean wrote:
         | The worst bug I encountered was when physically relocating a
         | multi rack storage array for a mobile provider. The array had
         | never been powered down(!) so we anticipated that a good number
         | of the spindles would fail to come up on restart. So we added
          | an extra mirror to protect each existing raid set. Problem
          | is, a bug in the firmware meant the mere existence of this
          | extra mirror caused the entire array's volume layout to become
         | corrupted at reboot time. Fortunately a field engineer managed
         | to reconstruct the layout, but not before a lot of hair had
         | been whitened.
        
           | jacquesm wrote:
           | Close call. I know of a similar case where a fire suppression
           | systems check ended up with massive data loss in a very large
           | storage array.
        
         | zenkat wrote:
         | Does anyone have a similar compendium specifically for software
         | engineering disasters?
         | 
         | Not of nasty bugs like the F-22 -- those are fun stories, but
         | they don't really illustrate the systemic failures that led to
         | the bug being deployed in the first place. Much more interested
         | in systemic cultural/practice/process factors that led to a
         | disaster.
        
           | mrguyorama wrote:
           | Find and take a CS ethics class.
        
           | two_handfuls wrote:
            | Yes, the RISKS mailing list.
        
       | roenxi wrote:
       | I don't think this list hits any fundamental truths. The great
       | depression doesn't have parallels to software failures beyond the
       | fact that complex systems fail. And many of the lessons are vague
       | and unactionable - "Put an end to information hoarding within
       | orgs/teams" for example, says nothing. The Atlassian copy that
       | section links to also says nothing. A lot of the lessons lack
       | meaty learnings, and good luck to anyone trying to put everything
       | in to practice simultaneously.
       | 
        | Makes a fun list of big disasters though, and I respect this
        | guy's eye for website design. The site design was probably
        | stronger than the linked content; there is a lot to like about
        | it.
        
         | jacquesm wrote:
         | Complex systems fail, but they don't all fail in the same way
         | and analyzing _how_ they fail can help in engineering new and
          | hopefully more robust complex systems. I'm a huge fan of the
          | RISKS Digest and there isn't a disaster small enough that we
          | can't learn from it.
         | 
         | Obviously the larger the disaster the more complex the failure
         | and the harder to analyze the root cause. But one interesting
         | takeaway for me from this list is that all of them were
          | preventable and in all but a few of the cases the root cause may
         | have been the trigger but the setup of the environment is what
         | allowed the fault to escalate in the way that it did. In a
         | resilient system faults happen as well, but they do not
         | propagate.
         | 
         | And that's the big secret to designing reliable systems.
        
           | roenxi wrote:
           | > ...one interesting takeaway for me from this list is that
           | all of them were preventable...
           | 
           | Every disaster is preventable. Everything on the list was
           | happening in human-engineered environments - as do most
           | things that affect humans. The human race has been the master
           | of its own destiny since the 1900s. The questions are how far
           | before the disaster we need to look to find somewhere to act
           | and what needed to be given up to change the flow of events.
           | 
           | But that doesn't have any implications for software
           | engineering. Studying a software failure post mortem will be
           | a lot more useful than studying 9/11.
        
             | jacquesm wrote:
             | > Every disaster is preventable.
             | 
             | No, there is such a thing as residual risk and there are
             | always disasters that you can't prevent such as natural
             | disasters. But even then you can have risk mitigation and
             | strategies for dealing with the aftermath of an incident to
             | limit the effects.
             | 
             | > Everything on the list was happening in human-engineered
             | environments - as do most things that affect humans.
             | 
             | That is precisely why they were picked and make for good
             | examples.
             | 
             | > The human race has been the master of its own destiny
             | since the 1900s.
             | 
             | That isn't true and it likely will never be true. We are
             | tied 1:1 to the fate of our star and may well go down with
             | it. There is a small but non-zero chance that we can change
             | our destiny but I wouldn't bet on it. And even then in the
             | even longer term it still won't matter. We are passengers,
             | the best we can do is be good stewards of the ship we've
             | inherited.
             | 
             | > The questions are how far before the disaster we need to
             | look to find somewhere to act and what needed to be given
             | up to change the flow of events.
             | 
             | Indeed. So in the case of each of the items listed the RCA
             | gives a point in time where the accident given the
             | situation as it existed was no longer a theoretical
             | possibility but an event in progress. Situation and
             | responses determined how far it got and in each of the
             | cases outlined you can come up with a whole slew of ways in
             | which the risk could have been reduced and possibly how the
             | whole thing may have been averted once the root cause had
             | triggered. But that doesn't mean that the root cause
             | doesn't matter, it matters a lot. But the root cause isn't
             | always a major thing. An O-ring, a horseshoe...
             | 
             | > But that doesn't have any implications for software
             | engineering.
             | 
             | If that is your takeaway then for you it indeed probably
             | does not. But I see such things in software engineering
             | every other week or so and I think there are _many_ lessons
             | from these events that apply to software engineering. As do
             | the people that design reliable systems, which is why many
             | of us are arguing for liability for software. Because once
             | producers of software are held liable for their product a
             | large number of the bad practices and avoidable incidents
             | (not just security) would become subject to the Darwinian
             | selection process: bad producers would go out of business.
             | 
             | > Studying a software failure post mortem will be a lot
             | more useful than studying 9/11.
             | 
             | You can learn _lots_ of things from other fields, if you
             | are open to learning in general. Myopically focusing on
             | your own field is useful and can get you places but it will
             | always result in  'deep' approaches, never in 'wide'
             | approaches and for a really important system both of these
             | approaches are valid and complementary.
             | 
             | To make your life easier the author has listed in the right
             | hand column which items from the non-software disasters
             | carry over into the software world, which I think is a
             | valuable service. A middlebrow dismissal of that effort is
             | throwing away an opportunity to learn, for free, from
             | incidents that have all made the history books. And if you
             | don't learn from your own and others' mistakes then you are
             | bound to repeat that history.
             | 
             | Software isn't special in this sense. Not at all. What is
             | special is the arrogance of some software people who
             | believe that their field is so special that they can ignore
             | the lessons from the world around them. And as a corollary:
             | that they can ignore all the lessons already learned in
             | software systems in the past. We are in an eternal cycle of
             | repeating past mistakes with newer and shinier tools and we
             | urgently need to break out of it.
        
               | roenxi wrote:
               | It is 2023. The damage of natural disasters can be
               | mitigated. When the San Andreas fault goes it'll probably
               | get an entry on that list with a "why did we build so
               | much infrastructure on this thing? Why didn't we prepare
               | more for the inevitable?".
               | 
               | And this article is throwing out generic all-weather good
               | sounding platitudes which are tangential to the disasters
               | listed. He drew a comparison between the Challenger
               | disaster and bitrot! Anyone who thinks that is a profound
               | connection should avoid the role of software architect.
               | The link is spurious. Challenger was about catastrophic
               | management and safety practices. Bitrot is neither of
               | those things.
               | 
               | I mean, if we want to learn from Douglas Adams he
               | suggested that we can deduce the nature of all things by
               | studying cupcakes. That is a few steps down the path from
               | this article, but the direction is similar. It is not
               | useful to connect random things in other fields to random
               | things in software. Although I do appreciate the effort
               | the gentleman went to, it is a nice site and the
               | disasters are interesting. Just not relevantly linked to
               | software in a meaningful way.
               | 
               | > We are tied 1:1 to the fate of our star and may well go
               | down with it
               | 
               | I'm just going to claim that is false and live in the
               | smug comfort that when circumstances someday prove you
               | right neither of us will be around to argue about it. And
               | if you can draw lessons from that which apply to
               | practical software development then that is quite
               | impressive.
        
               | jacquesm wrote:
               | > It is 2023.
               | 
               | So? Mistakes are still being made, every day. Nothing has
               | changed since the stone age except for our ability - and
               | hopefully willingness - to learn from previous mistakes.
               | If we want to.
               | 
               | > The damage of natural disasters can be mitigated.
               | 
               | You wish.
               | 
               | > When the San Andreas fault goes it'll probably get an
               | entry on that list with a "why did we build so much
               | infrastructure on this thing? Why didn't we prepare more
               | for the inevitable?".
               | 
               | Excellent questions. And in fairness to the people living
               | on the San Andreas fault - and near volcanoes, in
               | hurricane alley and in countries below sea level - we
               | have an uncanny ability to ignore history.
               | 
               | > And this article is throwing out generic all-weather
               | good sounding platitudes which are tangential to the
               | disasters listed.
               | 
               | I see these errors all the time in the software world, I
               | don't care what hook he uses to _again_ bring them to
               | attention but they are probably responsible for a very
               | large fraction of all software problems.
               | 
               | > He drew a comparison between the Challenger disaster
               | and bitrot!
               | 
               | So let's see your article on this subject then that will
               | obviously do a much better job.
               | 
               | > Anyone who thinks that is a profound connection should
               | avoid the role of software architect.
               | 
               | Do you care? It would be better to say that those that
               | fail to be willing to learn from the mistakes of others
               | should avoid the role of software architect because on
               | balance that's where the problems come from. You seem to
               | have a very narrow viewpoint here: that because you don't
               | like the precision or the links that are being made that
               | you can't appreciate the intent and the subject matter.
               | Of course a better article could have been written and of
               | course you are able to dismiss it entirely because of its
               | perceived shortcomings. But that is _exactly_ the
               | attitude that leads to a lot of software problems: the
                | inability to ingest information when it isn't presented
                | in the recipient's preferred form. This throws out the
                | baby with the bath water: the author's intent is to
               | educate you and others on the ways in which software
               | systems break and uses something called a narrative hook
               | to serve as a framework. That these won't match 100% is a
               | given. Spurious connection or not, documentation and
                | actual fact creeping out of spec, aka the normalization
                | of deviance in disguise, is _exactly_ the lesson from
                | the Challenger disaster and if you don't like the wording
               | I'm looking forward to your improved version.
               | 
               | > Challenger was about catastrophic management and safety
               | practices.
               | 
               | That was a small but critical part in the whole, I highly
               | recommend reading the entire report on the subject, it
               | makes for fascinating reading, there are a great many
               | lessons to be learned from this.
               | 
               | https://www.govinfo.gov/content/pkg/GPO-
               | CRPT-99hrpt1016/pdf/...
               | 
               | https://en.wikipedia.org/wiki/Rogers_Commission_Report
               | 
               | And many useful and interesting supporting documents.
               | 
               | > I mean, if we want to learn from Douglas Adams he
               | suggested that we can deduce the nature of all things by
               | studying cupcakes.
               | 
               | That's a complete nonsensical statement. Have you
               | considered that your initial response to the article
               | precludes you from getting any value from it?
               | 
               | > It is not useful to connect random things in other
               | fields to random things in software.
               | 
               | But they are not random things. The normalization of
                | deviance in whatever guise it comes is the root cause of
               | many, many real world incidents, both in software as well
               | as outside of it. You could argue with the wording, but
               | not with the intent or the connection.
               | 
               | > Although I do appreciate the effort the gentleman went
               | to, it is a nice site and the disasters are interesting.
               | Just not relevantly linked to software in a meaningful
               | way.
               | 
               | To you. But they are.
               | 
                | > > We are tied 1:1 to the fate of our star and may
                | well go down with it
                | 
                | > I'm just going to claim that is false and live in
                | the smug comfort that when circumstances someday prove
                | you right neither of us will be around to argue about
                | it.
               | 
               | So, you are effectively saying that you persist in being
               | wrong simply because the timescale works to your
               | advantage?
               | 
               | > And if you can draw lessons from that which apply to
               | practical software development then that is quite
               | impressive.
               | 
               | Well, for starters I would argue that many software
               | developers indeed create work that serves just long
               | enough to hold until they've left the company and that
               | that attitude is an excellent thing to lose and a
               | valuable lesson to draw from this discussion.
        
               | yuliyp wrote:
               | So the article had a list of disasters and some useful
               | lessons learned in its left and center columns. It also
               | had lists of truisms about software engineering in the
               | right column. They had nothing fundamental to do with
               | each other.
               | 
               | For instance, it tries to draw an equivalence between
               | "Titanic's Captain Edward Smith had shown an
               | "indifference to danger [that] was one of the direct and
               | contributing causes of this unnecessary tragedy." and
               | "Leading during the time of a software crisis (think
               | production database dropped, security vulnerability
               | found, system-wide failures etc.) requires a leader who
               | can stay calm and composed, yet think quickly and ACT."
               | which are completely unrelated: one is a statement about
               | needing to evaluate risks to avoid incidents, another is
               | talking about the type of leadership needed once an
               | incident has already happened. Similarly, the discussion
               | about Chernobyl is also confused: the primary lessons
               | there are about operational hygiene, but the article
               | draws "conclusions" about software testing which is in a
               | completely different lifecycle phase.
               | 
               | There are certainly lessons to be learned from past
               | incidents both software and not, but the article linked
               | is a poor place to do so.
        
               | jacquesm wrote:
               | So let's take those disasters and list the lessons that
               | _you_ would have learned from them. That 's the way to
               | constructively approach an article like this, out-of-hand
               | dismissal is just dumb and unproductive.
               | 
               | FWIW I've seen the leaders of software teams all the way
               | up to the CTO run around like headless chickens during
                | (often self-inflicted) crises. I think the biggest lesson
               | from the Titanic is that you're never invulnerable, even
               | when you have been designed to be invulnerable.
               | 
               | None of these are exhaustive and all of them are open to
               | interpretation. Good, so let's improve on them.
               | 
               | One general takeaway: managing risk is hard, especially
               | when working with a limited budget (which is almost
               | always the case) and just the exercise of assessing and
               | estimating likelihood and impact are already very
               | valuable but plenty of organizations have never done any
               | of that. They simply are utterly blind to the risks their
               | org is exposed to.
               | 
               | Case in point: a company that made in-car boxes that
               | could be upgraded OTA. And nobody thought to verify that
               | the vehicle wasn't in motion...
        
               | mrguyorama wrote:
               | There are two useful lessons from the Titanic that can
               | apply to software:
               | 
               | 1) Marketing that you are super duper and special is
               | meaningless if you've actually built something terrible
               | (the Titanic was not even remotely as unsinkable as
               | claimed, with "water tight" compartments that weren't
               | actually watertight)
               | 
               | 2) When people below you tell you "hey we are in danger",
               | listen to them. Don't do things that are obviously
               | dangerous and make zero effort to mitigate the danger.
               | The danger of atlantic icebergs was well understood, and
               | the Titanic was warned multiple times! Yet the captain
               | still had inadequate monitoring, and did not slow down to
               | give the ship more time to react to any threat.
        
               | jacquesm wrote:
               | Good stuff, thank you. This is useful, and it (2) ties
               | into the Challenger disaster as well.
        
               | mrguyorama wrote:
               | The one hangup with "Listen to people warning you" is
               | that they produce enough false positives as to create a
               | boy who cried wolf effect for some managers.
        
               | jacquesm wrote:
               | Yes, that's true. So the hard part is to know who is
               | alarmist and who actually has a point. In the case of
               | NASA the ignoring bit seemed to be pretty wilful. By the
               | time multiple engineers warn you that this is not a good
               | idea and you push on anyway I think you are out of
               | excuses. Single warnings not backed up by data can
               | probably be ignored.
        
               | stonemetal12 wrote:
               | An Ariane 5 failed because of bitrot, so the headline
               | comparison of rocket failures makes sense. Not testing
               | software with new performance parameters before launch
               | sounds like catastrophic management to me.
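                | A sketch of that failure mode in spirit (hypothetical
                | numbers, Python instead of the original Ada): a value
                | that always fit in 16 bits on the old vehicle silently
                | stops fitting when the flight profile changes.
                | 
                |     INT16_MIN, INT16_MAX = -32768, 32767
                | 
                |     def to_int16_unchecked(x):
                |         # wraps around, like a raw narrowing conversion
                |         return ((int(x) + 32768) % 65536) - 32768
                | 
                |     def to_int16_checked(x):
                |         if not INT16_MIN <= x <= INT16_MAX:
                |             raise OverflowError(f"{x} won't fit in int16")
                |         return int(x)
                | 
                |     old_profile_bias = 20_000.0  # always fit before
                |     new_profile_bias = 50_000.0  # same code, new rocket
                | 
                |     print(to_int16_unchecked(new_profile_bias))  # -15536
                |     try:
                |         to_int16_checked(new_profile_bias)
                |     except OverflowError as err:
                |         print("caught:", err)
                | 
                | The point is less the arithmetic than that the old
                | assumption ("this value always fits") was never
                | re-checked against the new flight parameters.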
        
               | krisoft wrote:
               | > It is 2023. The damage of natural disasters can be
               | mitigated.
               | 
               | That is a comforting belief, but it is probably not true.
               | We have no plan for a near-Earth supernova explosion. Not
               | even in theory.
               | 
               | Then there are asteroid impacts. In theory we could have
               | plowed all of our resources into planetary defences, but
               | in practice in 2023 we can very easily get sucker punched
               | by a bolide and go the way of the dinosaurs.
        
       | rob74 wrote:
       | Another train crash that holds a valuable lesson was
       | https://en.wikipedia.org/wiki/Eschede_train_disaster
       | 
       | This demonstrates that sometimes "if you see something, say
       | something" isn't enough - if a large piece of metal penetrates
       | into the passenger compartment of a train from underneath, it's
       | better to take the initiative and _pull the emergency brake
       | yourself_.
        
         | mannykannot wrote:
         | Up to a point, but the "sometimes" makes it difficult to say
         | anything definite. There's no shortage of stories where
         | immediate intervention has made things worse, such as a burning
         | train being stopped in a tunnel.
         | 
         | Furthermore, this sort of counterfactual or "if only" analysis
         | can be used to direct attention away from what matters, as was
         | done in the hounding of the SS Californian's captain during the
         | inquiry into the sinking of the Titanic.
         | 
         | Here, one cannot fault the passenger for first getting himself
         | and his family out of the compartment, and he correctly
         | determined that the train manager's "follow the rules" response
         | was inadequate in the circumstances - in fact, the inquiry
         | might have considered the incongruity of having an emergency
         | brake available for any passenger to use at any time, while
         | restricting its use by train crew.
         | 
         | RCA quite properly focuses on the causes of the event, which
         | would have been of equal significance even if the train had
         | been halted in time, and which would continue to present
         | objective risks unless addressed.
        
         | embik wrote:
         | Not entirely true:
         | 
         | > Dittmann could not find an emergency brake in the corridor
         | and had not noticed that there was an emergency brake handle in
         | his own compartment.
         | 
         | The learning from that should maybe instead be to keep non-
         | technical management out of engineering decisions. The
         | Wikipedia article fails to mention there was a specific manager
         | who pushed the new wheel design into production and then went
         | on to have a long successful career.
        
           | rob74 wrote:
           | The English article sounds more like he started looking for
           | an emergency brake _after_ he had notified the conductor (and
           | apparently failed to convince him of the urgency of the
           | situation), not before. The German article is much longer,
           | but only mentions that both the passenger and the conductor
            | could have prevented the accident if they had pulled the
            | emergency brake immediately, but that the conductor was
            | acting "by the book" when he insisted on inspecting the
           | damage himself before pulling the brake.
        
             | jacquesm wrote:
             | In the movie 'Kursk' there is exactly such a scene and the
             | literal quote is 'By the book, better start praying. I'm
             | not reli...".
        
           | svrtknst wrote:
            | I don't think it should be "instead". Suggesting that
           | emergency brakes are inadequate due to one passenger failing
           | to locate one is kinda cheap.
           | 
           | We could also easily construe your argument as "engineers
           | would never design a flaw", which is demonstrably untrue. We
           | should both work to minimize errors, and to provide a variety
           | of corrective measures in case they happen.
        
             | embik wrote:
             | > Suggesting that emergency brakes are inadequate due to
             | one passenger failing to locate one is kinda cheap.
             | 
             | That's not what I wanted to say at all - the op talked
             | about the willingness to pull the emergency brake, but my
             | understanding is that he was willing to but due to human
             | error failed to find it. I didn't mean to suggest in any
             | way that emergency brakes are not important.
             | 
             | > We could also easily construe your argument as "engineers
             | would never design a flaw"
             | 
             | Another thing I didn't say. The whole original link is
             | proof that engineers make mistakes all the time.
        
           | namaria wrote:
           | I dislike the current trend of calling lessons "learnings". I
           | don't understand the shift in meaning. Learning is the act of
           | acquiring knowledge. The bit of knowledge acquired has a long
           | established name: lesson. What's the issue with that?
        
       | askbookz1 wrote:
       | The software industry has enough disasters of its own that we
       | don't need parallels from other industries to learn. That
       | actually makes it look like there are super non-obvious things
       | that we could apply to software when it fact it's all pretty
       | mundane.
        
       | gdevenyi wrote:
       | This is why software engineering is a protected profession in
       | some parts of the world (Canada at least), as civil
        | responsibility and safety, along with formal legal liability,
        | are part of licensure.
        
         | tra3 wrote:
         | Care to elaborate? I know professional engineers in Canada get
         | a designation but I'm not aware of anything similar for
         | software engineers.
        
           | charles_f wrote:
           | Software engineers are the same as all other engineering
           | professions, and regulated by the same provincial PEG
           | associations. While most employers don't care about it, some
            | software positions where the safety of people is on the line (eg
           | aeronautics) or there's a special stake _do_ have
           | requirements to employ professional software engineers.
           | 
           | I think you're actually not even supposed to call yourself an
           | engineer unless you're a professional engineer.
        
           | dacox wrote:
           | Engineer is a regulated term and profession in Canada, with
           | professional designations like the P.eng - they get really
            | mad when people use the term engineer more loosely, as is common
           | in the tech industry.
           | 
           | Because of this, there are "B.Seng" programs at some Canadian
           | universities, as well as the standard "B.Sc" computer science
           | program.
           | 
            | The degree was very new when I attended uni, so I went for
            | Comp sci instead as it seemed more "real". The B.Seng kids
            | seemed
           | to focus a lot more on industry things (classes on object
           | oriented programming), which everyone picked up when doing
           | internships anyways. They also had virtually no room for
           | electives, whereas the CS calendar was stacked with very
           | interesting electives which imo were vastly more useful in my
           | career.
           | 
           | In practice, no one gives a hoot which degree you have, and
           | we tend to just use the term SWeng regardless.
           | 
           | It honestly kinda feels like a bunch of crotchety old civil
           | engineers trying to regulate an industry they're not a part
           | of. I have _never_ seen a job require this degree.
        
       | lazystar wrote:
       | ah, a topic related to organizational decay and decline. this is
       | an area I've been studying a lot over the last few years, and I
       | encourage the author of this blog post to read this paper on the
       | Challenger disaster.
       | 
       | Organizational disaster and organizational decay: the case of the
       | National Aeronautics and Space Administration
       | http://www.sba.oakland.edu/faculty/schwartz/Org%20Decay%20at...
       | 
       | some highlights:
       | 
       | > There are a number of aspects of organizational decay. In this
       | paper, I shall consider three of them. First is what I call the
       | institutionalization of the fiction, which represents the
       | redirection of its approved beliefs and discourse, from the
       | acknowledgement of reality to the maintenance of an image of
       | itself as the organization ideal. Second is the change in
       | personnel that parallels the institutionalization of the fiction.
       | Third is the narcissistic loss of reality which represents the
       | mental state of management in the decadent organization.
       | 
       | > Discouragement and alienation of competent individuals
       | 
       | > Another result of this sort of selection must be that realistic
       | and competent persons who are committed to their work must lose
       | the belief that the organization's real purpose is productive
       | work and come to the conclusion that its real purpose is self-
       | idealization. They then are likely to see their work as being
       | alien to the purposes of the organization. Some will withdraw
       | from the organization psychologically. Others will buy into the
       | nonsense around them, cynically or through self-deception
       | (Goffman, 1959), and abandon their concern with reality. Still
       | others will conclude that the only way to save their self-esteem
       | is to leave the organization. Arguably, it is these last
       | individuals who, because of their commitment to productive work
       | and their firm grasp of reality, are the most productive members
       | of the organization. Trento cites a number of examples of this
       | happening at NASA. Considerations of space preclude detailed
       | discussion here.
       | 
       | Schwartz, H.S., 1989. Organizational disaster and organizational
       | decay: the case of the National Aeronautics and Space
       | Administration. Industrial Crisis Quarterly, 3: 319-334.
        
       | navels wrote:
       | The Therac-25 disaster should be on this list:
       | https://en.wikipedia.org/wiki/Therac-25
        
         | [deleted]
        
       | LorenPechtel wrote:
       | The Challenger disaster is on the list but should be expanded
       | upon:
       | 
       | 1) Considerable pressure from NASA to cover up the true sequence
       | of events. They pushed the go button even when the engineers said
       | no. And they failed to even tell Morton-Thiokol about the actual
       | temperatures. NASA dismissed the observed temperatures as
        | defective--never mind that "correcting" them so the temperature
        | at the failed joint was as expected meant that a bunch of other
        | measurements were now above ambient. (The offending
       | joint was being cooled by boiloff from the LOX tank that under
       | the weather conditions at the time ended up cooling that part of
       | the booster.)
       | 
       | 2) Then they doubled down on the error with Columbia. They had
       | multiple cases of tile damage from the ET insulating foam. They
       | fixed the piece of foam that caused a near-disaster--but didn't
       | fix the rest because it had never damaged the orbiter.
       | 
       | Very much a culture of painting over the rust.
        
       | yafbum wrote:
       | As I heard one engineering leader say, "it's okay to make
       | mistakes -- _once_ ". Meaning, we're all fallible, mistakes
       | happen, but failure to learn from past mistakes is not optional.
       | 
       | That said, a challenge I have frequently run into, and I feel is
       | not uncommon, is a tension between the desire not to repeat
       | mistakes and ambitions that do generally involve some amount of
       | risk-taking. The former can turn into a fixation on risk and risk
       | mitigation that becomes a paralyzing force; to some leaders,
       | lists like these might just look like "a thousand reasons to do
       | nothing" and be discarded. Yet history is full of clear cases
       | where a poor appreciation of risk destroyed fortunes, with a
       | chorus of "I told you sos" in their wake.
       | 
       | It is a difficult part of leadership to weigh the risk tradeoffs
       | for a particular mission, and presenting things in absolute terms
       | of "lessons learned" rarely makes sense, in my experience. The
       | belt-and-suspenders approaches that make sense for authoring the
       | critical control software for a commercial passenger aircraft or
       | an industrial control system for nuclear plants probably do not
       | make sense for an indie mobile game studio, even if they're all
       | in some way "software engineering".
        
         | tempodox wrote:
         | Failure is not optional. Definitely true :)
        
         | pc86 wrote:
         | Mistakes happen, things go wrong, for the vast majority of us a
         | bug doesn't mean someone dies or lights go out or planes don't
         | take off. For most of us here, the absolute worst case scenario
         | is that a bug means a company nobody has ever heard of makes
         | slightly less money for a few minutes or hours until it gets
         | rolled back. Again, worst case. The average case is probably
         | closer to a company nobody has ever heard of makes exactly the
         | same amount of money but some arbitrary feature nobody asked
         | for ships a day or two later because we spent time fixing this
         | other thing instead.
         | 
         | It's really hard to strike a balance between "pushing untested
         | shitcode into prod multiple times a week" and "that ticket to
         | change our CTA button color is done but now needs to go through
         | 4 days of automated and manual testing." I think as an industry
         | most of us are probably too far on the latter side of the
         | spectrum in relation to the stakes of what we're actually doing
         | day to day.
        
       | svrtknst wrote:
       | IMO this list isn't specific enough about the causes and
       | takeaways, and especially not specific enough to highlight the
       | small but critical omissions.
       | 
       | Most of these feel like "They didn't act fast enough or do a
       | good enough job. They should do a better job".
        
       | seanhunter wrote:
       | One of my favourites that I ever heard about was from a friend of
       | mine who used to work on safety-critical systems in defense
       | applications. He told me fighter jets have a safety system that
       | disables the weapons systems if a (weight) load is detected on
       | the landing gear so that if the plane is on the ground and the
       | pilot bumps the wrong button they don't accidentally blow up
       | their own airbase[1]. So anyway when the Eurofighter Typhoon did
       | its first live weapons test the weapons failed to launch. When
       | they did the RCA they found something like[2]
       | bool check_the_landing_gear_before_shootyshoot(
       |     double weight_from_sensor, double threshold) {
       |     // FIXME: Remember to implement this before we go live
       |     return false;
       | }
       | 
       | So when the pilot pressed the button the function disabled the
       | weapons as if the plane had been on the ground. Because the
       | "correctness" checks were against the Z spec and this function
       | didn't have a unit test because it was deemed too trivial, the
       | problem wasn't found before launch, so this cost several million
       | to redeploy the (one-line) fix to actually check that the weight
       | from the sensor was less than the threshold.
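       | 
       | Presumably the eventual fix was something along these lines (my
       | hypothetical reconstruction in the same C-style pseudocode as
       | above; per [2] the real code was in some Ada variant):
       | bool check_the_landing_gear_before_shootyshoot(
       |     double weight_from_sensor, double threshold) {
       |     // Weight below the threshold means the plane is airborne,
       |     // so the interlock allows weapons release.
       |     return weight_from_sensor < threshold;
       | }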
       | 
       | [1] Yes this means that scene from the cheesy action movie (can't
       | remember which one) where Arnold Schwarzenegger finds himself on
       | the ground in the cockpit of a Russian plane and proceeds to blow
       | up all the bad guys while on the ground couldn't happen in real
       | life.
       | 
       | [2] Not the actual code which was in some weird version of ADA
       | apparently.
        
         | jowea wrote:
         | You would think "grep the codebase for FIXME" would be in the
         | checklist before deployment.
        
         | metadat wrote:
         | Regarding [1], was it Red Heat, True Lies, or Eraser?
        
           | dragonwriter wrote:
           | True Lies was an airborne Harrier, not a Russian plane on the
           | ground. So, while there are many reasons the scene was
           | unrealistic, "weight on gear disables weapons" isn't one.
        
             | metadat wrote:
             | I know, such a great movie, total classic of my childhood!
             | It was the only R-rated film my mother ever allowed and
             | even endorsed us watching, "Because Jamie Lee Curtis is
             | hot." :D
             | 
             | I wasn't sure if there might've been a scene I'd forgotten.
        
         | LorenPechtel wrote:
         | Yup, watch videos of actual missile launches--the missile
         | descends below the plane that fired it. Can't do that on the
         | ground, although you won't blow up your base because the weapon
         | will not have armed by the time it goes splat.
        
         | jacquesm wrote:
         | > Yes this means that scene from the cheesy action movie (can't
         | remember which one) where Arnold Schwarzenegger finds himself
         | on the ground in the cockpit of a Russian plane and proceeds to
         | blow up all the bad guys while on the ground couldn't happen in
         | real life.
         | 
         | I think you meant "Tomorrow Never Dies" and the actor was
         | Pierce Brosnan. Took me forever to find that, is it the right
         | one?
        
           | seanhunter wrote:
           | Yeah maybe. I think that sort of rings a bell.
        
             | jacquesm wrote:
             | https://youtu.be/hcIgZ4kJ5Ow?t=340
             | 
             | this?
        
         | tialaramex wrote:
         | > if a (weight) load is detected on the landing gear
         | 
         | This state "weight on wheels" is used in a lot of other
         | functionality, not just on military aircraft, as the hard stop
         | for things that don't make sense if we're not airborne. So that
         | makes sense (albeit obviously somebody needed to actually write
         | this function)
         | 
         | Most obviously, gear retraction is disabled while there is
         | weight on the wheels, on planes which have retractable landing
         | gear.
        
         | MisterTea wrote:
         | > weird version of ADA
         | 
         | Spark:
         | https://en.wikipedia.org/wiki/SPARK_(programming_language)
         | 
         | And it's Ada not ADA which makes me think of the Americans with
         | Disabilities Act.
        
           | seanhunter wrote:
           | Aah thank you on both counts yes. One interesting feature he
           | told me about is they wrote a "reverse compiler" that would
           | take Spark code and turn it into the associated formal (Z)
           | spec so they could compare that to the actual Z spec to prove
           | they were the same. Kind of nifty.
        
             | WalterBright wrote:
             | Sounds like Tranfor which would convert FORTRAN code to
             | flowcharts, because government contracts required
             | flowcharts.
        
         | JackFr wrote:
         | Torpedoes typically have an inertial switch which disarms them
         | if they turn 180 degrees, so they don't accidentally hit their
         | source. When a torpedo accidentally arms and activates on board
         | a submarine (hot running) the emergency procedure is to
         | immediately turn the sub around 180 degrees to disarm the
         | torpedo.
        
           | mwcremer wrote:
           | Lest someone think this is purely hypothetical:
           | https://en.wikipedia.org/wiki/USS_Tang_(SS-306)
        
             | mrguyorama wrote:
             | A mark 14 torpedo actually sinking something? What a bad
             | stroke of luck!
        
               | kakwa_ wrote:
               | The Mark 14 ended up being a really good torpedo by the
               | end of WWII.
               | 
               | It even remained in service until the 1980s.
               | 
               | In truth, and going back to the subject at hand, the
               | Mark 14 debacle highlights the need for good, unbiased
               | QA.
               | 
               | This also holds true for software engineering.
        
               | mrguyorama wrote:
               | My understanding is that BuOrd (or BuShips? I don't
               | remember which) "didn't want to waste money on testing
               | it", so instead we wasted hundreds of them, fired at
               | Japanese shipping, that either never impacted their
               | target or never had a hope of detonating.
               | 
               | Remember these kinds of things the next time someone
               | pushes for "move fast and break things" in the name of
               | efficiency and speed. Slow is fast.
        
               | jacquesm wrote:
               | In NL folklore this is codified as 'the longest road is
               | often the shortest'.
        
               | kakwa_ wrote:
               | Pre-war, it was more a case of "penny wise and pound
               | foolish", partly due to budget limitations (they did
               | things like testing only with foam warheads in order to
               | recover the test torpedoes).
               | 
               | But after Pearl Harbor, a somewhat biased BuOrd was
               | reluctant to admit the Mark 14's flaws. It took a few
               | "unauthorized" tests and two years to fix the issues.
               | 
               | In fairness, this sure makes for an entertaining story
               | (e.g. the Drachinifel video on YouTube), but I'm not
               | completely sold on the depiction of BuOrd as a bunch of
               | arrogant bureaucrats. However, bias and pride (plus
               | other issues like low production) certainly played a
               | role in the early Mark 14 debacle.
               | 
               | Going back to software development, I'm always amazed
               | at how bugs immediately pop up whenever I put a piece
               | of software in the hands of users for the first time,
               | regardless of how well I tested it. I try to be as
               | thorough as possible, but being the developer I'm
               | always biased, often tunnel-visioning on one way to use
               | the software I created. That's why, in my opinion, you
               | need some form of external QA/testing (like those
               | "unauthorized" Mark 14 tests).
        
         | yafbum wrote:
         | > this cost several million to redeploy the (one-line) fix to
         | actually check that the weight from the sensor was less than
         | the threshold
         | 
         | Well maybe this is the other, compounding problem. Engineering
         | complex machines with such a high cost of bugfix deployment
         | seems like a big issue. It's funny that as an industry we now
         | know how to safely deploy software updates to hundreds of
         | millions of phones, with security checks, signed firmwares,
         | etc, but doing that on applications with a super high unit
         | price tag seems out of reach...
         | 
         | Or maybe, a few millions is like only a thousand hours of
         | flying in jet fuel costs alone, not a big deal...
        
           | nvy wrote:
           | >It's funny that as an industry we now know how to safely
           | deploy software updates to hundreds of millions of phones,
           | with security checks, signed firmwares, etc, but doing that
           | on applications with a super high unit price tag seems out of
           | reach
           | 
           | A bunch of JavaScript dudebros yeeting code out into the
           | ether is not at all comparable to deploying avionics software
           | to a fighter jet. Give your head a shake.
        
             | Two4 wrote:
             | I don't think they're referring to dudebros' js, they're
             | referring to systems software and the ability to deliver
             | relatively secure updates over insecure channels. I've even
             | delivered a signed firmware update to a microprocessor in a
             | goddamn washing machine over UART. Why can't we do this for
             | a jet?
        
           | digging wrote:
           | This makes no sense and is difficult to even respond to
           | coherently.
           | 
           | > It's funny that as an industry we now know how to safely
           | deploy software updates to hundreds of millions of phones,
           | with security checks, signed firmwares, etc,
           | 
           | Either you're completely wrong, because we "as an industry"
           | still push bugs and security flaws, or you're comparing two
           | completely different things.
           | 
           | > doing that on applications with a super high unit price tag
           | seems out of reach...
           | 
           | is true _because of_
           | 
           | > a few millions is like only a thousand hours of flying in
           | jet fuel costs alone
           | 
           | like do you really think they spent millions pushing a line
           | of code? or do you think it's just inherently expensive to
           | fly a jet, and so doing it twice costs more?
        
             | the_sleaze9 wrote:
             | I would generally pass this comment by, but it's just so
             | distastefully hostile because you totally missed the point.
             | 
             | GP's comment was expressing sardonic disbelief that a
             | modern jet wouldn't be able to receive remote software
             | updates, considering it's so ubiquitous and reliable in
             | other fields, even those with much, much lower costs. Not
             | that developers don't release faults.
        
               | namaria wrote:
               | People tend to opine on systems engineering as if we had
               | some sort of information superconductor connecting all
               | minds involved.
               | 
               | Systems are Hard and complex systems are Harder.
               | Thinking of an entire class of failures as 'solved' is
               | kinda like talking about curing cancer. There isn't one
               | thing called cancer; there are hundreds.
               | 
               | There's no way to solve complex systems problems for
               | good. Reality, technologies, tooling, people, language,
               | everything changes all the time. And complex systems
               | failure modes that happen today will happen forever.
        
               | digging wrote:
               | Ahh, then I did misread it entirely. Thanks for stopping
               | by to call me out.
               | 
               | It's still probably not a matter of capability... I
               | wouldn't be so cavalier about software updates on my
               | phone if it was holding me thousands of feet above the
               | ground at the time.
        
               | jacquesm wrote:
               | I already commented on this elsewhere but I came across a
               | company that did OTA updates on a control box in vehicles
               | without checking if the vehicle was in motion or not. And
               | it didn't even really surprise me; it was just one of
               | those things that came up during the risk assessment
               | when prepping for that job. They had never even thought
               | of it.
        
               | Balooga wrote:
               | Imagine remotely bricking a fleet of fighter jets.
               | 
               | https://news.ycombinator.com/item?id=35983866
        
               | jacquesm wrote:
               | > Imagine remotely bricking a fleet of fighter jets.
               | 
               | > https://news.ycombinator.com/item?id=35983866
               | 
               | That's about routers, was that the article you meant?
        
               | kayodelycaon wrote:
               | Remote software updates on military vehicles? Hasn't
               | anyone seen the new Battlestar Galactica? :)
        
           | DavidVoid wrote:
           | > Or maybe, a few millions is like only a thousand hours of
           | flying in jet fuel costs alone, not a big deal...
           | 
           | Pretty much tbh. For example, the development of the Saab JAS
           | 39 Gripen ( _JAS-projektet_ ) is the most expensive
           | industrial project in modern Swedish history at a cost of
           | 120+ billion SEK (11+ billion USD).
           | 
           | It was also almost cancelled after a very public crash in
           | central Stockholm at the 1993 Stockholm Water Festival [1]. A
           | crash that should never have happened: the flight should
           | not have been approved in the first place, since they
           | weren't yet confident that they'd completely solved the
           | Pilot-Induced Oscillation (PIO) issues that had wrecked the
           | first prototype 4 years earlier (with the same test pilot)
           | [2].
           | 
           | It was basically a miracle that no one was killed or
           | seriously hurt in the Stockholm crash; had the plane hit the
           | nearby bridge or any of the other densely crowded areas, it
           | would've been a very different story.
           | 
           | [1] https://youtu.be/mkgShfxTzmo?t=122
           | 
           | [2] https://www.youtube.com/watch?v=k6yVU_yYtEc
        
           | datadrivenangel wrote:
           | A few million dollars works out to a surprisingly small
           | amount of time once you add overhead.
           | 
           | Call the bug fix a development team of 20 people taking 3
           | months end to end from bug discovery to fix deployment.
           | You'll probably have twice that much people time again in
           | project management and communication overhead (1:2 ratio of
           | dev time to communication overhead is actually amazing in
           | defense contexts). Assume total cost per person of 200k per
           | year (after factoring in benefits, overhead, and pork), so 60
           | people * 3 months * $200k/12 months = 3,000,000 USD.
           | 
           | It takes a lot of people to build an aircraft.
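           | 
           | A quick sanity check of that arithmetic (all figures are
           | the back-of-envelope assumptions above, not real program
           | data):
           | #include <stdio.h>
           | 
           | int main(void) {
           |     double people   = 60;       // 20 devs + 2x in PM/comms
           |     double months   = 3;        // discovery to deployed fix
           |     double per_year = 200000.0; // loaded cost per person
           |     double cost = people * months * (per_year / 12.0);
           |     printf("estimated fix cost: %.0f USD\n", cost); // ~3M
           |     return 0;
           | }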
        
       | WalterBright wrote:
       | Instead of attempting to design a perfect system that cannot
       | fail, the idea is to design a system that can tolerate failure of
       | any component. (This is how airliners are designed, and is why
       | they are so incredibly reliable.)
       | 
       | Safe Systems from Unreliable Parts
       | https://www.digitalmars.com/articles/b39.html
       | 
       | Designing Safe Software Systems part 2
       | https://www.digitalmars.com/articles/b40.html
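       | 
       | A toy illustration of the principle (mine, not from the articles
       | above): a 2-out-of-3 majority vote over redundant sensors, so
       | that the failure of any single sensor is tolerated.
       | #include <math.h>
       | #include <stdbool.h>
       | 
       | typedef struct { double value; bool valid; } reading_t;
       | 
       | // Returns true and writes the agreed value if at least two
       | // valid readings agree within tolerance; otherwise returns
       | // false so the caller can fall back to a safe state.
       | bool vote_2oo3(const reading_t r[3], double tolerance,
       |                double *out) {
       |     for (int i = 0; i < 3; i++) {
       |         for (int j = i + 1; j < 3; j++) {
       |             if (r[i].valid && r[j].valid &&
       |                 fabs(r[i].value - r[j].value) <= tolerance) {
       |                 *out = (r[i].value + r[j].value) / 2.0;
       |                 return true;
       |             }
       |         }
       |     }
       |     return false; // no quorum: treat as a component failure
       | }
       | 
       | Any single dead, stuck, or wildly-off sensor then can't silently
       | drive the output on its own.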
        
       | masfuerte wrote:
       | There is or was an old-school website called something like
       | "Byzantine failures" that had case studies of bizarre failures
       | from many engineering fields. It was entertaining but I am unable
       | to find it now. Does anyone know it?
        
         | jacquesm wrote:
         | I think you're talking about Risks Digest.
         | 
         | http://catless.ncl.ac.uk/Risks/
         | 
         | It's very much old-school.
        
           | masfuerte wrote:
           | Thank you, that is the site. I don't know where I got
           | "Byzantine failures" from.
        
       | boobalyboo wrote:
       | [flagged]
        
         | toss1 wrote:
         | Nice ad hominem argument you've got there -- criticizing the
         | person making the argument and not the argument itself.
         | 
         | If your point is that the person is likely to be ignored by
         | his/her management, there's likely a better way to phrase it,
         | or it's worth adding a few words to clarify.
        
         | BiggusDijkus wrote:
         | Bruh! Does it matter if a sensible thought comes from a high
         | school kid or a parrot, as long as it is sensible?
        
           | richardwhiuk wrote:
           | I think the lessons as presented show a lack of experience
           | with software project management, which isn't surprising if
           | they are a high schooler.
        
       | KuriousCat wrote:
       | Thanks for sharing this. I am reminded of the talks by Nickolas
       | Means (https://www.youtube.com/watch?v=1xQeXOz0Ncs)
        
       ___________________________________________________________________
       (page generated 2023-08-17 23:02 UTC)