[HN Gopher] Software engineering lessons from RCAs of greatest d...
___________________________________________________________________
Software engineering lessons from RCAs of greatest disasters
Author : philofsofia
Score : 172 points
Date : 2023-08-17 08:38 UTC (14 hours ago)
(HTM) web link (anoopdixith.com)
(TXT) w3m dump (anoopdixith.com)
| formerly_proven wrote:
| (Root Cause Analysis)
| BiggusDijkus wrote:
| This whole RCA terminology has to die. The idea that there
| exists a single "Root Cause" that causes major disasters is
| fundamentally flawed.
| mytailorisrich wrote:
| This is a generic term. It does not imply that there has to
| be a single cause.
| NikolaNovak wrote:
| If we call it Root Causes Analysis can we keep the acronym?
| rob74 wrote:
| Actually everywhere "defense in depth" is used (not only in
| computing, but also in e.g. aviation), there _can't_ be one
| single cause for a disaster - each of the layers has to fail
| for a disaster to happen.
| jacquesm wrote:
| Almost every accident, even where there is no defense in
| depth, has more than one cause. Car accident: person A was
| on the phone, person B didn't spot their deviation in time
| to do anything about it: accident. A is the root cause. If
| B had reacted faster there wouldn't have been an accident,
| but there would still be cause for concern and there would
| still be a culprit. Such near misses and saves by others
| act as a kind of defense in depth, even if it wasn't
| engineered in. But person B isn't liable, even though their
| lack of attention to what was going on is a contributory
| factor. So root causes matter; they're the first and
| clearest thing to fix. Other layers may be impacted and may
| require work, but that isn't always the case.
|
| In software the root cause is often a very simple one: an
| assumption didn't hold.
| jacquesm wrote:
| Yes, that's true in the general sense. But root causes are
| interesting because they can lead to insights that help the
| lowest levels of engineering become more robust. At a
| higher level it is all about systems and the way parts of
| those systems interact, fault tolerance (massively
| important) and ensuring faults do not propagate beyond the
| systems they originate in. That's what can turn a small
| problem into a huge disaster. And without knowing the root
| cause you won't be able to track those proximate causes and
| do something about them. So RCA is a process, not a way to
| identify a single culprit; this is more about the
| interpretation of the term RCA than about what RCA _really_
| does.
| tonyarkles wrote:
| I think if the people you're working with insist on narrowing
| it down to a single Root Cause, they're missing the entire
| point of the exercise. I work with large drones day to day
| and when we do an accident investigation we're always looking
| for root causes, but there's almost always multiple. I don't
| think we've ever had a post-accident RCA investigation that
| resulted in only one corrective action. Several times we have
| narrowed it down to a single software bug, but to get to the
| point where a software bug causes a crash, there's always a
| number of other factors that have to align (e.g. pilot was
| unfamiliar with the recovery procedure, multiple cascaded
| failures, etc)
| PostOnce wrote:
| I was excited that it might be about 1960s RCA computers being
| software incompatible with the IBM S/360
|
| https://en.m.wikipedia.org/wiki/RCA_Spectra_70
|
| Then I finished the headline and opened this Wikipedia article
| WalterBright wrote:
| It isn't just software.
|
| The Deepwater Horizon was the result of multiple single points
| of failure that zippered into catastrophe. Each of those single
| points of failure could have been snipped at little to no cost.
|
| The same with the Fukushima disaster. For example, venting the
| excess hydrogen into the building, where it could accumulate
| until a random spark set it off.
| hliyan wrote:
| I think the software industry itself has accumulated enough
| instructive bugs of its own over the past few decades. E.g.
|
| F-22 navigation system core dumps when crossing the international
| date line: https://medium.com/alfonsofuggetta-it/software-bug-
| halts-f-2...
|
| Loss of Mars probe due to metric-imperial conversion error
|
| I've a few of these myself (e.g. a misplaced decimal that made
| $12mil into $120mil), but sadly cannot divulge details.
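|
| A minimal sketch (illustrative names, not the actual flight
| code) of the kind of guard that catches the unit-mixup class
| of bug at compile time, by giving each unit its own type:
|
|         #include <stdio.h>
|
|         typedef struct { double value; } newton_seconds;
|         typedef struct { double value; } pound_force_seconds;
|
|         static newton_seconds from_lbf_s(pound_force_seconds i) {
|             /* 1 lbf = 4.4482216152605 N */
|             newton_seconds out = { i.value * 4.4482216152605 };
|             return out;
|         }
|
|         /* The trajectory code accepts SI units only. */
|         static void log_impulse(newton_seconds i) {
|             printf("impulse: %.3f N*s\n", i.value);
|         }
|
|         int main(void) {
|             pound_force_seconds ground = { 100.0 };
|             /* log_impulse(ground);  <- compile error, good */
|             log_impulse(from_lbf_s(ground));
|             return 0;
|         }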
| RetroTechie wrote:
| In terms of engineering (QC, processes etc), modern day
| software industry is worse than almost any other industry out
| there. :-(
|
| And no, plain complexity or a fast-moving environment is a
| factor, but not _the_ issue. It's that steps are skipped
| which are _not_ skipped in other branches of engineering
| (e.g. continuous improvement of processes, learning from
| mistakes & implementing those lessons). In software land
| the same mistakes are made again & again & again, poorly
| designed languages remain in use, the list goes on.
|
| A long way still to go.
| fredley wrote:
| In any other field of engineering, the engineers are all
| trained and qualified. In software 'engineering', not so
| much.
| LeonenTheDK wrote:
| This is exactly why I support "engineer" being a protected
| term, like Doctor. It should tell you that a certain level
| of training and qualification has been met, to the point
| that the engineer is responsible and accountable for the
| work they do and sign off on. Especially for things that
| affect safety.
|
| Many software engineers these days are often flying by the
| seat of their pants, moving quickly and breaking things.
| Thankfully this _seems_ to largely be in places that aren't
| going to affect life or limb, but I'm still rubbed the
| wrong way seeing people (including myself, mind you)
| building run-of-the-mill CRUD apps under the title of
| engineer.
|
| Is it a big deal? Not really. It's probably even
| technically correct to use the term this way. But I do
| think it dilutes it a bit. For context, I'm in Canada where
| engineer is technically a protected term, and there are
| governing bodies that qualify and designate professional
| engineers.
| zvmaz wrote:
| Are all "engineers" trained on human error, safety
| principles, and the like? The failures described in the
| article are precisely _not_ software failures.
| jacquesm wrote:
| The important distinction is that engineers are
| professionally liable.
| mrguyorama wrote:
| Yes? Most engineering programs (it might even be an
| accreditation requirement) involve ethics classes and
| learning from past failures.
|
| My CS degree program required an ethics class and
| discussed things like the CFAA and famous cases like
| Therac-25, but nobody took it seriously because STEM
| majors think they are god's gift to an irrational world.
| jacquesm wrote:
| > Thankfully this seems to largely be in places that
| aren't going to affect life or limb
|
| I've seen Agile teams doing medical stuff using the
| latest hotness. Horrorshow.
|
| I've also seen very, very clean software and firmware
| development at really small companies.
|
| It's all over the place and you have to look inside to
| know what is going on. Though job advertisements
| sometimes can be pretty revealing.
| freeopinion wrote:
| I'm curious how you think the word "Doctor" is protected.
|
| Do you mean that History PhDs can't call themselves
| Doctors?
|
| Or chiropractors can't pass themselves off as doctors?
|
| Or you mean Doctor J was licensed to perform basketball
| bypass surgeries?
|
| Or perhaps podiatrists can't deliver babies?
| sanderjd wrote:
| That training and qualification is only as good as the
| processes and standards being trained for and qualified on.
| We don't have those processes and standards to train
| against (and frankly I'm not convinced we should or even
| can) for generic "software engineers". I have a number of
| friends who are PEs, and it isn't the training and
| certification process that differentiates their work from
| mine, it is that there are very clear standards for how you
| engineer a safe structure or machine. But I contend that
| there is not a way to write such standards for "software".
| It's just too broad a category of thing. Writing control
| software for physical systems is just very different from
| writing a UX-driven web application. It would be odd and
| wasteful to have the same standards for both things.
|
| I do think qualification would make sense for more narrow
| swathes of the "software engineering" practice. For
| instance, "Automotive Control Software Engineer", etc.
| zvmaz wrote:
| > In terms of engineering (QC, processes etc), modern day
| software industry is worse than almost any other industry out
| there. :-(
|
| How do you know that?
| RetroTechie wrote:
| Take airplane safety: a plane crashes, the cause of the
| crash is thoroughly investigated, the report recommends
| procedures to avoid that type of cause of plane crashes.
| Sometimes such recommendations become enforced across the
| industry. Result: air travel gets safer & safer, to the
| point where sitting in a (flying!) airplane all day is safer
| than sitting on a bench on the street.
|
| Building regulations: similar.
|
| Foodstuffs (hygiene requirements for manufacturers):
| similar.
|
| Car parts: see ISO9000 standards & co.
|
| Software: e.g. memory leaks - been around forever, but every
| day new software is released that has 'em.
|
| C: ancient, not memory safe, should _really_ only be used
| for niche domains. Yet it still is everywhere.
|
| New AAA game: pay $$ after year(s?) of development,
| download a many-MB patch on day 1 because the game is
| buggy. _Could_ have been tested better, but released anyway
| 'cause getting it out & making sales weighed more heavily
| than shipping a reliable, working product.
|
| All of this = not improving methods.
|
| I'm not arguing C v. Rust here or whatever. Just pointing
| out: better tools and better procedures exist, but using
| them is the exception rather than the rule.
|
| Like I said the list goes on. Other branches of engineering
| don't (can't) work like that.
| jacquesm wrote:
| Exactly. The driving force is there but what is also good
| is that the industry - for the most part at least -
| realizes that safety is what keeps them in business. So
| not only is there a structure of oversight and
| enforcement, there is also a strongly internalized
| culture of safety created over decades to build on. An
| engineer that would propose something obviously unsafe
| would not get to finish their proposal, let alone
| implement it.
|
| In 'regular' software circles you can find the marketing
| department with full access to raw data and front end if
| you're unlucky.
| jacquesm wrote:
| Experience?
| talldatethrow wrote:
| On HN and reddit, experience doesn't count. Only reading
| about others' experiences, written up by someone who was
| paid to do the research, does.
| pdntspa wrote:
| Don't forget to cite your sources! The nerds will rake
| you over the coals for not doing so.
| serjester wrote:
| If you're making the same mistakes over and over again I
| think that says more about your company than it does about
| the software industry.
|
| My first job was at a major automotive manufacturer.
| Implementing half the procedures they had would slow down any
| software company 10X - just look at the state of most car
| infotainment systems. If something is safety critical,
| obviously this makes sense but the reality is 85% of software
| isn't.
| jacquesm wrote:
| GP was speaking in the general sense not about their
| company.
| KnobbleMcKnees wrote:
| Is that not coming from experience of working at a
| software company? As I believe you said elsewhere
| jacquesm wrote:
| It could easily be from looking from the outside in, as
| it is in my case.
| jacquesm wrote:
| And reading this thread it doesn't look as if there is much
| awareness of that.
| Exuma wrote:
| My worst bug was a typo in a single line of html that removed
| 3DS protection from many millions of dollars of credit card
| payments
| ownagefool wrote:
| Pretty epic.
|
| I was working for a webhosting company, and someone asked me
| to rush a change just before leaving. Instead of updating
| 1500 A records, I updated about 50k. Someone senior managed
| to turn off the cron though, so what I actually lost was the
| delta of changes between last backup and my SQL.
|
| I was in the room for this though:
| https://www.theregister.com/2008/08/28/flexiscale_outage/
| CTDOCodebases wrote:
| I love the title to that article "Engineer accidentally
| deletes cloud".
|
| It's like a single individual managed to delete the
| monolithic cloud where everyone's files are stored.
| jiscariot wrote:
| That is eerily similar to what happened to us in IBM
| "Cloud", in a previous gig. An engineer was doing
| "account cleanup" and somehow our account got on the list
| and all our resources were blown away. The most
| interesting conversation was convincing the support
| person that those deletion audit events were in fact not
| us, but rather (according to the engineer's LinkedIn
| page) an SRE at IBM.
| ownagefool wrote:
| This was ~14 years ago and both MS & AWS had loss of data
| incidents iirc.
| CTDOCodebases wrote:
| Although it probably wasn't funny at the time I can
| imagine how comical that conversation was.
|
| Thinking about it further the term "cloud" is a good
| metaphor for storing files on someone else's computer
| because clouds just disappear.
| jacquesm wrote:
| I wonder if any such pathways remain at AWS, Google, Apple
| and MS that would still allow a thing like that to happen.
| swalsh wrote:
| At this point there's basically 3 clouds, and then
| everyone else.
| epolanski wrote:
| AWS, Azure and Cloudflare?
| yossi_peti wrote:
| And Google Cloud Platform
| ownagefool wrote:
| Bear in mind, this was a small startup in 2008 that claimed
| to be the 2nd cloud in the world (read: on-demand IaaS
| provider).
|
| Flexiscale at the time was a single region backed by a
| NetApp. Each VM essentially had a thin-provisioned LUN
| (logical volume), basically a copy-on-write of the
| underlying OS image.
|
| So when someone accidentally deletes vol0, they take out a
| whopping 6TB of data that takes ~20TB to restore because
| you're rebuilding filesystems from safe mode (thanks,
| NetApp support). It's fairly monolithic in that sense.
|
| I guess I was 23 at the time, but I'd written the v2 API,
| orchestrator & scheduling layer. It was fairly naive, but
| fulfilled the criteria of a cloud, i.e. elastic, on-demand,
| metered usage, despite using a SAN.
| swalsh wrote:
| My worst bug was changing how a zip code zone was fetched
| from the cache in a large ecommerce site with tens of
| thousands of users using it all day long. Worked great in
| DEV :D but when the thundering herd hit it, the entire site
| came down.
| jacquesm wrote:
| Startup, shutdown and migration are all periods of
| significantly elevated risk. Especially for systems that
| have been up and running for a long time there are all
| kinds of ways in which things can go pear-shaped. Drives
| that die on shutdown (or the subsequent boot up), RAIDs
| that fail to rebuild, cascading failures, power supplies
| that fail, UPSes that fail, generators that don't start (or
| that run for 30 seconds and then quit because someone made
| off with the fuel) and so on.
| hliyan wrote:
| I posted this, in case we want to collect these gems:
| https://news.ycombinator.com/item?id=37160295
| GreenVulpine wrote:
| You could call that a feature, making payments easier for
| customers! All 3DS does is protect the banks by
| inconveniencing consumers, since banks are responsible for
| fraud.
| feldrim wrote:
| It's mostly the payment processor. It may or may not be the
| bank itself.
| jrockway wrote:
| I believe they pass on the risk to merchants now. If you
| let fraud through, $30 per incident or whatever. So
| typically things like 3DS are turned on because that cost
| got too high, and the banks assure you that it will fix
| everything.
| gostsamo wrote:
| My funniest was a wrong param in a template generator which
| turned off escaping of parameter values provided indirectly
| by the users. Good thing it was discovered during the
| yearly pen testing analysis, because it led to shell
| execution in the cloud environment.
| kefabean wrote:
| The worst bug I encountered was when physically relocating a
| multi rack storage array for a mobile provider. The array had
| never been powered down(!) so we anticipated that a good number
| of the spindles would fail to come up on restart. So we added
| an extra mirror to protect each existing RAID set. Problem is, a
| bug in the firmware meant the mere existence of this extra
| mirror caused the entire array's volume layout to become
| corrupted at reboot time. Fortunately a field engineer managed
| to reconstruct the layout, but not before a lot of hair had
| been whitened.
| jacquesm wrote:
| Close call. I know of a similar case where a fire
| suppression system check ended up causing massive data loss
| in a very large storage array.
| zenkat wrote:
| Does anyone have a similar compendium specifically for software
| engineering disasters?
|
| Not of nasty bugs like the F-22 -- those are fun stories, but
| they don't really illustrate the systemic failures that led to
| the bug being deployed in the first place. Much more interested
| in systemic cultural/practice/process factors that led to a
| disaster.
| mrguyorama wrote:
| Find and take a CS ethics class.
| two_handfuls wrote:
| Yes, the RISKS mailing list.
| roenxi wrote:
| I don't think this list hits any fundamental truths. The Great
| Depression doesn't have parallels to software failures beyond the
| fact that complex systems fail. And many of the lessons are vague
| and unactionable - "Put an end to information hoarding within
| orgs/teams" for example, says nothing. The Atlassian copy that
| section links to also says nothing. A lot of the lessons lack
| meaty learnings, and good luck to anyone trying to put everything
| into practice simultaneously.
|
| Makes a fun list of big disasters though, and I respect this
| guy's eye for website design. The site design was probably
| stronger than the linked content; there is a lot to like about it.
| jacquesm wrote:
| Complex systems fail, but they don't all fail in the same way
| and analyzing _how_ they fail can help in engineering new and
| hopefully more robust complex systems. I'm a huge fan of RISKS
| Digest and there isn't a disaster small enough that we can't
| learn from it.
|
| Obviously the larger the disaster the more complex the failure
| and the harder to analyze the root cause. But one interesting
| takeaway for me from this list is that all of them were
| preventable, and in all but a few of the cases the root cause may
| have been the trigger, but the setup of the environment is what
| allowed the fault to escalate in the way that it did. In a
| resilient system faults happen as well, but they do not
| propagate.
|
| And that's the big secret to designing reliable systems.
| roenxi wrote:
| > ...one interesting takeaway for me from this list is that
| all of them were preventable...
|
| Every disaster is preventable. Everything on the list was
| happening in human-engineered environments - as do most
| things that affect humans. The human race has been the master
| of its own destiny since the 1900s. The questions are how far
| before the disaster we need to look to find somewhere to act
| and what needed to be given up to change the flow of events.
|
| But that doesn't have any implications for software
| engineering. Studying a software failure post mortem will be
| a lot more useful than studying 9/11.
| jacquesm wrote:
| > Every disaster is preventable.
|
| No, there is such a thing as residual risk and there are
| always disasters that you can't prevent such as natural
| disasters. But even then you can have risk mitigation and
| strategies for dealing with the aftermath of an incident to
| limit the effects.
|
| > Everything on the list was happening in human-engineered
| environments - as do most things that affect humans.
|
| That is precisely why they were picked and make for good
| examples.
|
| > The human race has been the master of its own destiny
| since the 1900s.
|
| That isn't true and it likely will never be true. We are
| tied 1:1 to the fate of our star and may well go down with
| it. There is a small but non-zero chance that we can change
| our destiny but I wouldn't bet on it. And even then in the
| even longer term it still won't matter. We are passengers,
| the best we can do is be good stewards of the ship we've
| inherited.
|
| > The questions are how far before the disaster we need to
| look to find somewhere to act and what needed to be given
| up to change the flow of events.
|
| Indeed. So for each of the items listed, the RCA gives a
| point in time where, given the situation as it existed, the
| accident was no longer a theoretical possibility but an
| event in progress. Situation and responses determined how
| far it got, and in each of the cases outlined you can come
| up with a whole slew of ways in
| which the risk could have been reduced and possibly how the
| whole thing may have been averted once the root cause had
| triggered. But that doesn't mean that the root cause
| doesn't matter, it matters a lot. But the root cause isn't
| always a major thing. An O-ring, a horseshoe...
|
| > But that doesn't have any implications for software
| engineering.
|
| If that is your takeaway then for you it indeed probably
| does not. But I see such things in software engineering
| every other week or so and I think there are _many_ lessons
| from these events that apply to software engineering. As do
| the people that design reliable systems, which is why many
| of us are arguing for liability for software. Because once
| producers of software are held liable for their product a
| large number of the bad practices and avoidable incidents
| (not just security) would become subject to the Darwinian
| selection process: bad producers would go out of business.
|
| > Studying a software failure post mortem will be a lot
| more useful than studying 9/11.
|
| You can learn _lots_ of things from other fields, if you
| are open to learning in general. Myopically focusing on
| your own field is useful and can get you places but it will
| always result in 'deep' approaches, never in 'wide'
| approaches and for a really important system both of these
| approaches are valid and complementary.
|
| To make your life easier the author has listed in the right
| hand column which items from the non-software disasters
| carry over into the software world, which I think is a
| valuable service. A middlebrow dismissal of that effort is
| throwing away an opportunity to learn, for free, from
| incidents that have all made the history books. And if you
| don't learn from your own and others' mistakes then you are
| bound to repeat that history.
|
| Software isn't special in this sense. Not at all. What is
| special is the arrogance of some software people who
| believe that their field is so special that they can ignore
| the lessons from the world around them. And as a corollary:
| that they can ignore all the lessons already learned in
| software systems in the past. We are in an eternal cycle of
| repeating past mistakes with newer and shinier tools and we
| urgently need to break out of it.
| roenxi wrote:
| It is 2023. The damage of natural disasters can be
| mitigated. When the San Andreas fault goes it'll probably
| get an entry on that list with a "why did we build so
| much infrastructure on this thing? Why didn't we prepare
| more for the inevitable?".
|
| And this article is throwing out generic all-weather good
| sounding platitudes which are tangential to the disasters
| listed. He drew a comparison between the Challenger
| disaster and bitrot! Anyone who thinks that is a profound
| connection should avoid the role of software architect.
| The link is spurious. Challenger was about catastrophic
| management and safety practices. Bitrot is neither of
| those things.
|
| I mean, if we want to learn from Douglas Adams he
| suggested that we can deduce the nature of all things by
| studying cupcakes. That is a few steps down the path from
| this article, but the direction is similar. It is not
| useful to connect random things in other fields to random
| things in software. Although I do appreciate the effort
| the gentleman went to, it is a nice site and the
| disasters are interesting. Just not relevantly linked to
| software in a meaningful way.
|
| > We are tied 1:1 to the fate of our star and may well go
| down with it
|
| I'm just going to claim that is false and live in the
| smug comfort that when circumstances someday prove you
| right neither of us will be around to argue about it. And
| if you can draw lessons from that which apply to
| practical software development then that is quite
| impressive.
| jacquesm wrote:
| > It is 2023.
|
| So? Mistakes are still being made, every day. Nothing has
| changed since the stone age except for our ability - and
| hopefully willingness - to learn from previous mistakes.
| If we want to.
|
| > The damage of natural disasters can be mitigated.
|
| You wish.
|
| > When the San Andreas fault goes it'll probably get an
| entry on that list with a "why did we build so much
| infrastructure on this thing? Why didn't we prepare more
| for the inevitable?".
|
| Excellent questions. And in fairness to the people living
| on the San Andreas fault - and near volcanoes, in
| hurricane alley and in countries below sea level - we
| have an uncanny ability to ignore history.
|
| > And this article is throwing out generic all-weather
| good sounding platitudes which are tangential to the
| disasters listed.
|
| I see these errors all the time in the software world. I
| don't care what hook he uses to _again_ bring them to
| attention; they are probably responsible for a very
| large fraction of all software problems.
|
| > He drew a comparison between the Challenger disaster
| and bitrot!
|
| So let's see your article on this subject then that will
| obviously do a much better job.
|
| > Anyone who thinks that is a profound connection should
| avoid the role of software architect.
|
| Do you care? It would be better to say that those that
| fail to be willing to learn from the mistakes of others
| should avoid the role of software architect because on
| balance that's where the problems come from. You seem to
| have a very narrow viewpoint here: that because you don't
| like the precision or the links that are being made that
| you can't appreciate the intent and the subject matter.
| Of course a better article could have been written and of
| course you are able to dismiss it entirely because of its
| perceived shortcomings. But that is _exactly_ the
| attitude that leads to a lot of software problems: the
| inability to ingest information when it isn't presented
| in the recipient's preferred form. This throws out the
| baby with the bath water; the author's intent is to
| educate you and others on the ways in which software
| systems break, and he uses something called a narrative hook
| to serve as a framework. That these won't match 100% is a
| given. Spurious connection or not, documentation and
| actual fact creeping out of spec aka the normalization of
| deviation in disguise is _exactly_ the lesson from the
| Challenger disaster, and if you don't like the wording
| I'm looking forward to your improved version.
|
| > Challenger was about catastrophic management and safety
| practices.
|
| That was a small but critical part of the whole. I highly
| recommend reading the entire report on the subject; it
| makes for fascinating reading and there are a great many
| lessons to be learned from it.
|
| https://www.govinfo.gov/content/pkg/GPO-
| CRPT-99hrpt1016/pdf/...
|
| https://en.wikipedia.org/wiki/Rogers_Commission_Report
|
| And many useful and interesting supporting documents.
|
| > I mean, if we want to learn from Douglas Adams he
| suggested that we can deduce the nature of all things by
| studying cupcakes.
|
| That's a completely nonsensical statement. Have you
| considered that your initial response to the article
| precludes you from getting any value from it?
|
| > It is not useful to connect random things in other
| fields to random things in software.
|
| But they are not random things. The normalization of
| deviation in whatever guise it comes is the root cause of
| many, many real world incidents, both in software as well
| as outside of it. You could argue with the wording, but
| not with the intent or the connection.
|
| > Although I do appreciate the effort the gentleman went
| to, it is a nice site and the disasters are interesting.
| Just not relevantly linked to software in a meaningful
| way.
|
| To you. But they are.
|
| > > We are tied 1:1 to the fate of our star and may well
| > > go down with it
|
| > I'm just going to claim that is false and live in the
| > smug comfort that when circumstances someday prove you
| > right neither of us will be around to argue about it.
|
| So, you are effectively saying that you persist in being
| wrong simply because the timescale works to your
| advantage?
|
| > And if you can draw lessons from that which apply to
| practical software development then that is quite
| impressive.
|
| Well, for starters I would argue that many software
| developers indeed create work that only has to hold up
| until they've left the company, and that that attitude is
| an excellent thing to lose and a valuable lesson to draw
| from this discussion.
| yuliyp wrote:
| So the article had a list of disasters and some useful
| lessons learned in its left and center columns. It also
| had lists of truisms about software engineering in the
| right column. They had nothing fundamental to do with
| each other.
|
| For instance, it tries to draw an equivalence between
| "Titanic's Captain Edward Smith had shown an
| "indifference to danger [that] was one of the direct and
| contributing causes of this unnecessary tragedy." and
| "Leading during the time of a software crisis (think
| production database dropped, security vulnerability
| found, system-wide failures etc.) requires a leader who
| can stay calm and composed, yet think quickly and ACT."
| which are completely unrelated: one is a statement about
| needing to evaluate risks to avoid incidents, another is
| talking about the type of leadership needed once an
| incident has already happened. Similarly, the discussion
| about Chernobyl is also confused: the primary lessons
| there are about operational hygiene, but the article
| draws "conclusions" about software testing which is in a
| completely different lifecycle phase.
|
| There are certainly lessons to be learned from past
| incidents both software and not, but the article linked
| is a poor place to do so.
| jacquesm wrote:
| So let's take those disasters and list the lessons that
| _you_ would have learned from them. That's the way to
| constructively approach an article like this; out-of-hand
| dismissal is just dumb and unproductive.
|
| FWIW I've seen the leaders of software teams all the way
| up to the CTO run around like headless chickens during an
| (often self-inflicted) crisis. I think the biggest lesson
| from the Titanic is that you're never invulnerable, even
| when you have been designed to be invulnerable.
|
| None of these are exhaustive and all of them are open to
| interpretation. Good, so let's improve on them.
|
| One general takeaway: managing risk is hard, especially
| when working with a limited budget (which is almost
| always the case). Just the exercise of assessing and
| estimating likelihood and impact is already very
| valuable, but plenty of organizations have never done any
| of that. They are simply utterly blind to the risks their
| org is exposed to.
|
| Case in point: a company that made in-car boxes that
| could be upgraded OTA. And nobody thought to verify that
| the vehicle wasn't in motion...
| mrguyorama wrote:
| There are two useful lessons from the Titanic that can
| apply to software:
|
| 1) Marketing that you are super duper and special is
| meaningless if you've actually built something terrible
| (the Titanic was not even remotely as unsinkable as
| claimed, with "water tight" compartments that weren't
| actually watertight)
|
| 2) When people below you tell you "hey we are in danger",
| listen to them. Don't do things that are obviously
| dangerous and make zero effort to mitigate the danger.
| The danger of Atlantic icebergs was well understood, and
| the Titanic was warned multiple times! Yet the captain
| still had inadequate monitoring, and did not slow down to
| give the ship more time to react to any threat.
| jacquesm wrote:
| Good stuff, thank you. This is useful, and it (2) ties
| into the Challenger disaster as well.
| mrguyorama wrote:
| The one hangup with "Listen to people warning you" is
| that they produce enough false positives as to create a
| boy who cried wolf effect for some managers.
| jacquesm wrote:
| Yes, that's true. So the hard part is to know who is
| alarmist and who actually has a point. In the case of
| NASA the ignoring bit seemed to be pretty wilful. By the
| time multiple engineers warn you that this is not a good
| idea and you push on anyway I think you are out of
| excuses. Single warnings not backed up by data can
| probably be ignored.
| stonemetal12 wrote:
| An Ariane 5 failed because of bitrot, so the headline
| comparison of rocket failures makes sense. Not testing
| software with new performance parameters before launch
| sounds like catastrophic management to me.
| krisoft wrote:
| > It is 2023. The damage of natural disasters can be
| mitigated.
|
| That is a comforting belief, but it is probably not true.
| We have no plan for a near-Earth supernova explosion. Not
| even in theory.
|
| Then there are asteroid impacts. In theory we could have
| plowed all of our resources into planetary defences, but
| in practice in 2023 we can very easily get sucker punched
| by a bolide and go the way of the dinosaurs.
| rob74 wrote:
| Another train crash that holds a valuable lesson was
| https://en.wikipedia.org/wiki/Eschede_train_disaster
|
| This demonstrates that sometimes "if you see something, say
| something" isn't enough - if a large piece of metal penetrates
| into the passenger compartment of a train from underneath, it's
| better to take the initiative and _pull the emergency brake
| yourself_.
| mannykannot wrote:
| Up to a point, but the "sometimes" makes it difficult to say
| anything definite. There's no shortage of stories where
| immediate intervention has made things worse, such as a burning
| train being stopped in a tunnel.
|
| Furthermore, this sort of counterfactual or "if only" analysis
| can be used to direct attention away from what matters, as was
| done in the hounding of the SS Californian's captain during the
| inquiry into the sinking of the Titanic.
|
| Here, one cannot fault the passenger for first getting himself
| and his family out of the compartment, and he correctly
| determined that the train manager's "follow the rules" response
| was inadequate in the circumstances - in fact, the inquiry
| might have considered the incongruity of having an emergency
| brake available for any passenger to use at any time, while
| restricting its use by train crew.
|
| RCA quite properly focuses on the causes of the event, which
| would have been of equal significance even if the train had
| been halted in time, and which would continue to present
| objective risks unless addressed.
| embik wrote:
| Not entirely true:
|
| > Dittmann could not find an emergency brake in the corridor
| and had not noticed that there was an emergency brake handle in
| his own compartment.
|
| The learning from that should maybe instead be to keep non-
| technical management out of engineering decisions. The
| Wikipedia article fails to mention there was a specific manager
| who pushed the new wheel design into production and then went
| on to have a long successful career.
| rob74 wrote:
| The English article sounds more like he started looking for
| an emergency brake _after_ he had notified the conductor (and
| apparently failed to convince him of the urgency of the
| situation), not before. The German article is much longer,
| but only mentions that both the passenger and the conductor
| could have prevented the accident if they had pulled
| the emergency brake immediately, but that the conductor was
| acting "by the book" when he insisted on inspecting the
| damage himself before pulling the brake.
| jacquesm wrote:
| In the movie 'Kursk' there is exactly such a scene and the
| literal quote is 'By the book, better start praying. I'm
| not reli...".
| svrtknst wrote:
| I don't think it should be "instead". Suggesting that
| emergency brakes are inadequate due to one passenger failing
| to locate one is kinda cheap.
|
| We could also easily construe your argument as "engineers
| would never design a flaw", which is demonstrably untrue. We
| should both work to minimize errors, and to provide a variety
| of corrective measures in case they happen.
| embik wrote:
| > Suggesting that emergency brakes are inadequate due to
| one passenger failing to locate one is kinda cheap.
|
| That's not what I wanted to say at all - the op talked
| about the willingness to pull the emergency brake, but my
| understanding is that he was willing to but due to human
| error failed to find it. I didn't mean to suggest in any
| way that emergency brakes are not important.
|
| > We could also easily construe your argument as "engineers
| would never design a flaw"
|
| Another thing I didn't say. The whole original link is
| proof that engineers make mistakes all the time.
| namaria wrote:
| I dislike the current trend of calling lessons "learnings". I
| don't understand the shift in meaning. Learning is the act of
| acquiring knowledge. The bit of knowledge acquired has a long
| established name: lesson. What's the issue with that?
| askbookz1 wrote:
| The software industry has enough disasters of its own that we
| don't need parallels from other industries to learn from.
| Drawing those parallels actually makes it look like there are
| super non-obvious things that we could apply to software when
| in fact it's all pretty mundane.
| gdevenyi wrote:
| This is why software engineering is a protected profession in
| some parts of the world (Canada at least), as civil
| responsibility and safety, along with formal legal liability,
| are part of licensure.
| tra3 wrote:
| Care to elaborate? I know professional engineers in Canada get
| a designation but I'm not aware of anything similar for
| software engineers.
| charles_f wrote:
| Software engineers are the same as all other engineering
| professions, and regulated by the same provincial PEG
| associations. While most employers don't care about it, some
| software positions where the safety of people is on the line
| (e.g. aeronautics) or there's a special stake _do_ have
| requirements to employ professional software engineers.
|
| I think you're actually not even supposed to call yourself an
| engineer unless you're a professional engineer.
| dacox wrote:
| Engineer is a regulated term and profession in Canada, with
| professional designations like the P.Eng - they get really
| mad when people use the term engineer more loosely, as is
| common in the tech industry.
|
| Because of this, there are "B.Seng" programs at some Canadian
| universities, as well as the standard "B.Sc" computer science
| program.
|
| The degree was very new when I attended uni, so I went for
| Comp Sci instead as it seemed more "real". The B.Seng kids seemed
| to focus a lot more on industry things (classes on object
| oriented programming), which everyone picked up when doing
| internships anyways. They also had virtually no room for
| electives, whereas the CS calendar was stacked with very
| interesting electives which imo were vastly more useful in my
| career.
|
| In practice, no one gives a hoot which degree you have, and
| we tend to just use the term SWeng regardless.
|
| It honestly kinda feels like a bunch of crotchety old civil
| engineers trying to regulate an industry they're not a part
| of. I have _never_ seen a job require this degree.
| lazystar wrote:
| Ah, a topic related to organizational decay and decline. This is
| an area I've been studying a lot over the last few years, and I
| encourage the author of this blog post to read this paper on the
| Challenger disaster.
|
| Organizational disaster and organizational decay: the case of the
| National Aeronautics and Space Administration
| http://www.sba.oakland.edu/faculty/schwartz/Org%20Decay%20at...
|
| some highlights:
|
| > There are a number of aspects of organizational decay. In this
| paper, I shall consider three of them. First is what I call the
| institutionalization of the fiction, which represents the
| redirection of its approved beliefs and discourse, from the
| acknowledgement of reality to the maintenance of an image of
| itself as the organization ideal. Second is the change in
| personnel that parallels the institutionalization of the fiction.
| Third is the narcissistic loss of reality which represents the
| mental state of management in the decadent organization.
|
| > Discouragement and alienation of competent individuals
|
| > Another result of this sort of selection must be that realistic
| and competent persons who are committed to their work must lose
| the belief that the organization's real purpose is productive
| work and come to the conclusion that its real purpose is self-
| idealization. They then are likely to see their work as being
| alien to the purposes of the organization. Some will withdraw
| from the organization psychologically. Others will buy into the
| nonsense around them, cynically or through self-deception
| (Goffman, 1959), and abandon their concern with reality. Still
| others will conclude that the only way to save their self-esteem
| is to leave the organization. Arguably, it is these last
| individuals who, because of their commitment to productive work
| and their firm grasp of reality, are the most productive members
| of the organization. Trento cites a number of examples of this
| happening at NASA. Considerations of space preclude detailed
| discussion here.
|
| Schwartz, H.S., 1989. Organizational disaster and organizational
| decay: the case of the National Aeronautics and Space
| Administration. Industrial Crisis Quarterly, 3: 319-334.
| navels wrote:
| The Therac-25 disaster should be on this list:
| https://en.wikipedia.org/wiki/Therac-25
| [deleted]
| LorenPechtel wrote:
| The Challenger disaster is on the list but should be expanded
| upon:
|
| 1) Considerable pressure from NASA to cover up the true sequence
| of events. They pushed the go button even when the engineers said
| no. And they failed to even tell Morton-Thiokol about the actual
| temperatures. NASA dismissed the observed temperatures as
| defective--never mind that "correcting" them so the
| temperature at the failed joint was as expected meant that a
| bunch of other measurements were now above ambient. (The
| offending joint was being cooled by boiloff from the LOX
| tank, which under the weather conditions at the time ended
| up chilling that part of the booster.)
|
| 2) Then they doubled down on the error with Columbia. They had
| multiple cases of tile damage from the ET insulating foam. They
| fixed the piece of foam that caused a near-disaster--but didn't
| fix the rest because it had never damaged the orbiter.
|
| Very much a culture of painting over the rust.
| yafbum wrote:
| As I heard one engineering leader say, "it's okay to make
| mistakes -- _once_ ". Meaning, we're all fallible, mistakes
| happen, but failure to learn from past mistakes is not optional.
|
| That said, a challenge I have frequently run into, and I feel is
| not uncommon, is a tension between the desire not to repeat
| mistakes and ambitions that do generally involve some amount of
| risk-taking. The former can turn into a fixation on risk and risk
| mitigation that becomes a paralyzing force; to some leaders,
| lists like these might just look like "a thousand reasons to do
| nothing" and be discarded. Yet history is full of clear cases
| where a poor appreciation of risk destroyed fortunes, with a
| chorus of "I told you sos" in their wake.
|
| It is a difficult part of leadership to weigh the risk tradeoffs
| for a particular mission, and presenting things in absolute terms
| of "lessons learned" rarely makes sense, in my experience. The
| belt-and-suspenders approaches that make sense for authoring the
| critical control software for a commercial passenger aircraft or
| an industrial control system for nuclear plants probably do not
| make sense for an indie mobile game studio, even if they're all
| in some way "software engineering".
| tempodox wrote:
| Failure is not optional. Definitely true :)
| pc86 wrote:
| Mistakes happen, things go wrong, for the vast majority of us a
| bug doesn't mean someone dies or lights go out or planes don't
| take off. For most of us here, the absolute worst case scenario
| is that a bug means a company nobody has ever heard of makes
| slightly less money for a few minutes or hours until it gets
| rolled back. Again, worst case. The average case is probably
| closer to a company nobody has ever heard of makes exactly the
| same amount of money but some arbitrary feature nobody asked
| for ships a day or two later because we spent time fixing this
| other thing instead.
|
| It's really hard to strike a balance between "pushing untested
| shitcode into prod multiple times a week" and "that ticket to
| change our CTA button color is done but now needs to go through
| 4 days of automated and manual testing." I think as an industry
| most of us are probably too far on the latter side of the
| spectrum in relation to the stakes of what we're actually doing
| day to day.
| svrtknst wrote:
| IMO this list isn't specific enough with the causes and
| takeaways, and esp. not specific enough to highlight minor
| critical omissions.
|
| Most of these feel like "They didn't act fast enough or do a good
| enough job. They should do a better job".
| seanhunter wrote:
| One of my favourites that I ever heard about was from a friend of
| mine who used to work on safety-critical systems in defense
| applications. He told me fighter jets have a safety system that
| disables the weapons systems if a (weight) load is detected on
| the landing gear so that if the plane is on the ground and the
| pilot bumps the wrong button they don't accidentally blow up
| their own airbase[1]. So anyway when the Eurofighter Typhoon did
| its first live weapons test the weapons failed to launch. When
| they did the RCA they found something like[2]
|         bool check_the_landing_gear_before_shootyshoot(
|             double weight_from_sensor, double threshold) {
|             // FIXME: Remember to implement this before we go live
|             return false;
|         }
|
| So when the pilot pressed the button the function disabled the
| weapons as if the plane had been on the ground. Because the
| "correctness" checks were against the Z spec and this function
| didn't have a unit test because it was deemed too trivial, the
| problem wasn't found before launch, so this cost several millions
| to redeploy the (one-line) fix to actually check the weight from
| the sensor was less than the threshold.
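|
| Presumably that one-line fix looked something like this (a
| sketch with the same illustrative names, not the real code):
|
|         bool check_the_landing_gear_before_shootyshoot(
|             double weight_from_sensor, double threshold) {
|             // Weapons are only enabled when there is no
|             // significant weight on the gear, i.e. the jet
|             // really is airborne.
|             return weight_from_sensor < threshold;
|         }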
|
| [1] Yes this means that scene from the cheesy action movie (can't
| remember which one) where Arnold Schwarzenegger finds himself on
| the ground in the cockpit of a Russian plane and proceeds to blow
| up all the bad guys while on the ground couldn't happen in real
| life.
|
| [2] Not the actual code which was in some weird version of ADA
| apparently.
| jowea wrote:
| You would think "grep the codebase for FIXME" would be in the
| checklist before deployment.
| metadat wrote:
| Regarding [1], was it Red Heat, True Lies, or Eraser?
| dragonwriter wrote:
| True Lies was an airborne Harrier, not a Russian plane on the
| ground. So, while there are many reasons the scene was
| unrealistic, "weight on gear disables weapons" isn't one.
| metadat wrote:
| I know, such a great movie, total classic of my childhood!
| It was the only R-rated film my mother ever allowed and
| even endorsed us watching, "Because Jamie Lee Curtis is
| hot." :D
|
| I wasn't sure if there might've been a scene I'd forgotten.
| LorenPechtel wrote:
| Yup, watch videos of actual missile launches--the missile
| descends below the plane that fired it. Can't do that on the
| ground, although you won't blow up your base because the weapon
| will not have armed by the time it goes splat.
| jacquesm wrote:
| > Yes this means that scene from the cheesy action movie (can't
| remember which one) where Arnold Schwarzenegger finds himself
| on the ground in the cockpit of a Russian plane and proceeds to
| blow up all the bad guys while on the ground couldn't happen in
| real life.
|
| I think you meant "Tomorrow Never Dies" and the actor was
| Pierce Brosnan. Took me forever to find that, is it the right
| one?
| seanhunter wrote:
| Yeah maybe. I think that sort of rings a bell.
| jacquesm wrote:
| https://youtu.be/hcIgZ4kJ5Ow?t=340
|
| this?
| tialaramex wrote:
| > if a (weight) load is detected on the landing gear
|
| This state "weight on wheels" is used in a lot of other
| functionality, not just on military aircraft, as the hard stop
| for things that don't make sense if we're not airborne. So that
| makes sense (albeit obviously somebody needed to actually write
| this function)
|
| Most obviously the gear retraction is disabled on planes which
| have retractable landing gear.
| MisterTea wrote:
| > weird version of ADA
|
| Spark:
| https://en.wikipedia.org/wiki/SPARK_(programming_language)
|
| And it's Ada not ADA which makes me think of the Americans with
| Disabilities Act.
| seanhunter wrote:
| Aah thank you on both counts yes. One interesting feature he
| told me about is they wrote a "reverse compiler" that would
| take Spark code and turn it into the associated formal (Z)
| spec so they could compare that to the actual Z spec to prove
| they were the same. Kind of nifty.
| WalterBright wrote:
| Sounds like Tranfor which would convert FORTRAN code to
| flowcharts, because government contracts required
| flowcharts.
| JackFr wrote:
| Torpedoes typically have an inertial switch which disarms them
| if they turn 180 degrees, so they don't accidentally hit their
| source. When a torpedo accidentally arms and activates on board
| a submarine (hot running) the emergency procedure is to
| immediately turn the sub around 180 degrees to disarm the
| torpedo.
| mwcremer wrote:
| Lest someone think this is purely hypothetical:
| https://en.wikipedia.org/wiki/USS_Tang_(SS-306)
| mrguyorama wrote:
| A mark 14 torpedo actually sinking something? What a bad
| stroke of luck!
| kakwa_ wrote:
| The Mark 14 ended up being a really good torpedo by the
| end of WWII.
|
| It even remained in service until the '80s.
|
| In truth, and coming back to the subject, the Mark 14
| debacle highlights the need for good and unbiased QA.
|
| This also holds true for software engineering.
| mrguyorama wrote:
| My understanding is the BuOrd (or BuShips? I don't
| remember which) "didn't want to waste money on testing
| it", so instead we wasted hundreds of them fired at
| Japanese shipping that didn't even impact their targets,
| or never had a hope of detonating.
|
| Remember these kinds of things next time someone pushes
| for "move fast and break things" in the name of efficiency
| and speed. Slow is fast.
| jacquesm wrote:
| In NL folklore this is codified as 'the longest road is
| often the shortest'.
| kakwa_ wrote:
| Pre-war, it was more a case of "penny wise and pound
| foolish", partly due to budget limitations (they did things
| like testing only with foam warheads to recover test
| torpedoes).
|
| But after Pearl Harbor, a somewhat biased BuOrd was
| reluctant to admit the Mark 14 flaws. It took a few
| "unauthorized" tests and 2 years to fix the issues.
|
| In fairness, this sure makes for an entertaining story
| (e.g. the Drachinifel video on YouTube), but I'm not
| completely sold on the depiction of BuOrd as some sort of
| arrogant bureaucrats. However, bias and pride (plus other
| issues like low production) certainly played a role in the
| early Mark 14 debacle.
|
| Going back to software development, I'm always amazed how
| bugs immediately pop up whenever I put a piece of
| software in the hands of users for the first time,
| regardless of how well I tested it. I try to be as
| thorough as possible, but being the developer I'm always
| biased, often tunnel-visioning on one way to use the
| software I created. That's why, in my opinion, you need
| some form of external QA/testing (like those
| "unauthorized" Mark 14 tests).
| yafbum wrote:
| > this cost several millions to redeploy the (one-line) fix to
| actually check the weight from the sensor was less than the
| threshold
|
| Well maybe this is the other, compounding problem. Engineering
| complex machines with such a high cost of bugfix deployment
| seems like a big issue. It's funny that as an industry we now
| know how to safely deploy software updates to hundreds of
| millions of phones, with security checks, signed firmwares,
| etc, but doing that on applications with a super high unit
| price tag seems out of reach...
|
| Or maybe, a few millions is like only a thousand hours of
| flying in jet fuel costs alone, not a big deal...
| nvy wrote:
| >It's funny that as an industry we now know how to safely
| deploy software updates to hundreds of millions of phones,
| with security checks, signed firmwares, etc, but doing that
| on applications with a super high unit price tag seems out of
| reach
|
| A bunch of JavaScript dudebros yeeting code out into the
| ether is not at all comparable to deploying avionics software
| to a fighter jet. Give your head a shake.
| Two4 wrote:
| I don't think they're referring to dudebros' js, they're
| referring to systems software and the ability to deliver
| relatively secure updates over insecure channels. I've even
| delivered a signed firmware update to a microprocessor in a
| goddamn washing machine over UART. Why can't we do this for
| a jet?
| digging wrote:
| This makes no sense and is difficult to even respond to
| coherently.
|
| > It's funny that as an industry we now know how to safely
| deploy software updates to hundreds of millions of phones,
| with security checks, signed firmwares, etc,
|
| Either you're completely wrong, because we "as an industry"
| still push bugs and security flaws, or you're comparing two
| completely different things.
|
| > doing that on applications with a super high unit price tag
| seems out of reach...
|
| is true _because of_
|
| > a few millions is like only a thousand hours of flying in
| jet fuel costs alone
|
| like do you really think they spent millions pushing a line
| of code? or do you think it's just inherently expensive to
| fly a jet, and so doing it twice costs more?
| the_sleaze9 wrote:
| I would generally pass this comment by, but it's just so
| distastefully hostile because you totally missed the point.
|
| GP's comment was expressing sardonic disbelief that a
| modern jet wouldn't be able to receive remote software
| updates, considering it's so ubiquitous and reliable in
| other fields, even those with much, much lower costs. Not
| that developers don't release faults.
| namaria wrote:
| People tend to opine on systems engineering as if we had
| some sort of information superconductor connecting all
| minds involved.
|
| Systems are Hard and complex systems are Harder. Thinking
| of an entire class of failures as 'solved' is kinda like
| talking about curing cancer. There isn't one thing called
| cancer, there's hundreds.
|
| There's no way to solve complex systems problems for
| good. Reality, technologies, tooling, people, language,
| everything changes all the time. And complex systems
| failure modes that happen today will happen forever.
| digging wrote:
| Ahh, then I did misread it entirely. Thanks for stopping
| by to call me out.
|
| It's still probably not a matter of capability... I
| wouldn't be so cavalier about software updates on my
| phone if it was holding me thousands of feet above the
| ground at the time.
| jacquesm wrote:
| I already commented on this elsewhere but I came across a
| company that did OTA updates on a control box in vehicles
| without checking if the vehicle was in motion or not. And
| it didn't even really surprise me; it was just one of
| those things that came up when prepping a risk assessment
| for that job. They never even thought of it.
| Balooga wrote:
| Imagine remotely bricking a fleet of fighter jets.
|
| https://news.ycombinator.com/item?id=35983866
| jacquesm wrote:
| > Imagine remotely bricking a fleet of fighter jets.
|
| > https://news.ycombinator.com/item?id=35983866
|
| That's about routers, was that the article you meant?
| kayodelycaon wrote:
| Remote software updates on military vehicles? Hasn't
| anyone seen the new Battlestar Galactica? :)
| DavidVoid wrote:
| > Or maybe, a few millions is like only a thousand hours of
| flying in jet fuel costs alone, not a big deal...
|
| Pretty much tbh. For example, the development of the Saab JAS
| 39 Gripen (_JAS-projektet_) is the most expensive
| industrial project in modern Swedish history at a cost of
| 120+ billion SEK (11+ billion USD).
|
| It was also almost cancelled after a very public crash in
| central Stockholm at the 1993 Stockholm Water Festival [1]. A
| crash that should not have happened because the flight should
| not have been approved in the first place, because they
| weren't yet confident that they'd completely solved the
| Pilot-Induced Oscillation (PIO) related issues that wrecked
| the first prototype 4 years prior (with the same test pilot)
| [2].
|
| It was basically a miracle that no one was killed or
| seriously hurt in the Stockholm crash, had the plane hit the
| nearby bridge or any of the other densely crowded areas then
| it would've been a very different story.
|
| [1] https://youtu.be/mkgShfxTzmo?t=122
|
| [2] https://www.youtube.com/watch?v=k6yVU_yYtEc
| datadrivenangel wrote:
| A few million dollars works out to a surprisingly small
| amount of time when you add overhead.
|
| Call the bug fix a development team of 20 people taking 3
| months end to end from bug discovery to fix deployment.
| You'll probably have twice that much people time again in
| project management and communication overhead (1:2 ratio of
| dev time to communication overhead is actually amazing in
| defense contexts). Assume total cost per person of 200k per
| year (after factoring in benefits, overhead, and pork), so 60
| people * 3 months * $200k/12 months = 3,000,000 USD.
|
| It takes a lot of people to build an aircraft.
| WalterBright wrote:
| Instead of attempting to design a perfect system that cannot
| fail, the idea is to design a system that can tolerate failure of
| any component. (This is how airliners are designed, and is why
| they are so incredibly reliable.)
|
| Safe Systems from Unreliable Parts
| https://www.digitalmars.com/articles/b39.html
|
| Designing Safe Software Systems part 2
| https://www.digitalmars.com/articles/b40.html
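|
| A minimal sketch of that idea (illustrative names, not taken
| from the linked articles): triple modular redundancy, where
| three independent readings are voted on so a single failed
| channel is masked instead of propagated downstream.
|
|         #include <stdio.h>
|
|         static double vote3(double a, double b, double c,
|                             double tol) {
|             if (a - b <= tol && b - a <= tol) return (a + b) / 2.0;
|             if (a - c <= tol && c - a <= tol) return (a + c) / 2.0;
|             if (b - c <= tol && c - b <= tol) return (b + c) / 2.0;
|             /* No two channels agree: report the fault; a real
|                system would switch to a degraded mode here. */
|             fprintf(stderr, "sensor disagreement\n");
|             return a;
|         }
|
|         int main(void) {
|             /* Channel two has failed high; the vote masks it. */
|             double alt = vote3(1012.0, 9999.0, 1011.5, 5.0);
|             printf("voted altitude: %.1f\n", alt);
|             return 0;
|         }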
| masfuerte wrote:
| There is or was an old-school website called something like
| "Byzantine failures" that had case studies of bizarre failures
| from many engineering fields. It was entertaining but I am unable
| to find it now. Does anyone know it?
| jacquesm wrote:
| I think you're talking about Risks Digest.
|
| http://catless.ncl.ac.uk/Risks/
|
| It's very much old-school.
| masfuerte wrote:
| Thank you, that is the site. I don't know where I got
| "Byzantine failures" from.
| boobalyboo wrote:
| [flagged]
| toss1 wrote:
| Nice ad hominem argument you've got there -- criticizing the
| person bringing the argument and not the argument itself.
|
| If your point is that the person is likely to be ignored by
| his/her management, there's likely a better way to phrase it,
| or it's worth adding a few words to clarify.
| BiggusDijkus wrote:
| Bruh! Does it matter if a sensible thought comes from a high
| school kid or a parrot, as long as it is sensible?
| richardwhiuk wrote:
| I think the display of lessons shows a lack of experience in
| understanding software project management, which isn't
| surprising if they are a high schooler.
| KuriousCat wrote:
| Thanks for sharing this. I am reminded of the talks by Nickolas
| Means (https://www.youtube.com/watch?v=1xQeXOz0Ncs)
___________________________________________________________________
(page generated 2023-08-17 23:02 UTC)