[HN Gopher] The Therac-25 Incident
___________________________________________________________________
The Therac-25 Incident
Author : dagurp
Score : 150 points
Date : 2021-02-15 13:18 UTC (9 hours ago)
(HTM) web link (thedailywtf.com)
(TXT) w3m dump (thedailywtf.com)
| dade_ wrote:
| The article says the developer was never named, as if that had
| anything to do with the actual problems. Everything about this
| project sounds insane & inept.
|
| Some questions in my mind while reading this article (but I
| couldn't find them quickly in a search): Who were the executives
| that were running the company? Sounds like something that should
| be taught to MBA students as well as CS students. Further, the
| AECL was a Crown corporation of the Canadian government. Who were
| the minister and the bureaucrats in charge of the department?
| What role did they have in solving or covering up the issue?
| strken wrote:
| The article has a very long explanation of why one developer is
| not to blame, and why it's entirely the fault of the company
| for having no testing procedure and no review process.
| nickdothutton wrote:
| I mentioned this incident in passing in a post of mine from a few
| years ago. We studied it at university back in the early 90s.
| https://blog.eutopian.io/the-age-of-invisible-disasters/
| b3lvedere wrote:
| Interesting story. I was on the testing side of medical hardware
| many, many moons ago. It's quite amazing what you can and must
| test. For instance: we had to prove that if our equipment broke
| from fall damage, any debris flying off could not harm the
| patient.
|
| I always liked the testing philosophy of institutes like
| Underwriters Laboratories: your product will fail. This is stated
| as fact and is not debatable. What kind of fail-safes and
| protections have you built so that when it fails (and it will),
| it cannot harm the patient?
| buescher wrote:
| Yes. It's amazing the number of engineers that resist doing
| that analysis - "oh that part won't ever break". Some safety
| standards do allow for "reliable components" (i.e. if the
| component has already been scrutinized for safe failure modes,
| you don't have to consider it) and for submitting reliability
| analysis or data. I've never seen reliability analysis or data
| submitted instead of single-point failure analysis though,
| myself.
|
| Single-point failure analysis techniques like fault trees,
| event trees, and especially the tabular "failure modes and
| effects analysis" (FMEA) are so powerful, especially for
| safety-critical hardware, that when people learn them they want
| to apply them to everything, including software.
|
| However, FMEA techniques have been found not to apply well to
| software below about the block-diagram level. They don't find
| bugs that would not be found by other methods (static analysis,
| code review, requirements analysis, etc.) and they're extremely
| time- and labor-intensive. Here's an NRC
| report that goes into some detail: https://www.nrc.gov/reading-
| rm/doc-collections/nuregs/agreem...
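|
| (A single FMEA-style row, illustrative only and not drawn from
| any real analysis, looks roughly like this:
|
|     Item: beam current monitor
|     Failure mode: reads low (drift or open circuit)
|     Effect: delivered dose higher than displayed
|     Severity: catastrophic
|     Detection / mitigation: redundant ion chamber plus a
|       hardware interlock that inhibits the beam
|
| Filling out such a row is tractable when a part's failure modes
| can be enumerated up front; software faults generally cannot be
| listed that way, which is roughly the limitation the NRC report
| above documents.)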
| b3lvedere wrote:
| "Through analysis and examples of several real-life
| catastrophes, this report shows that FMEA could not have
| helped in the discovery of the underlying faults. The report
| concludes that the contribution of FMEA to regulatory
| assurance of Complex Logic, especially software, in a nuclear
| power plant safety system is marginal."
|
| Even more interesting! Thank you for this link. I appreciate
| it. Never too old to learn. :)
| ed25519FUUU wrote:
| The very worst part of this story is that the manufacturer
| vigorously defended their machine, threatening individuals and
| hospitals with lawsuits if they spoke out publicly. I have zero
| doubt this led to more deaths.
| buescher wrote:
| Here's a bit more background, from Nancy Leveson (now at MIT):
| http://sunnyday.mit.edu/papers/therac.pdf
| meristem wrote:
| Leveson's _Engineering a Safer World_ [1] is excellent, for
| those interested in safety engineering.
|
| [1] https://mitpress.mit.edu/books/engineering-safer-world
| buescher wrote:
| The systems safety case studies in that are great. It's also
| available as "open access" (free-as-in-free-beer) at MIT
| Press:
|
| https://direct.mit.edu/books/book/2908/Engineering-a-
| Safer-W...
| time0ut wrote:
| Many years ago, I had an opportunity to work on a similar type of
| system (though more recent than this). In the final round of
| interviews, one of the executives asked if I would be comfortable
| working on a device that could deliver a potentially dangerous
| dose of radiation to a patient. In that moment, my mind flashed
| to this story. I try to be a careful engineer and I am sure there
| are many more safeguards in place now, but, in that moment, I
| realized I would not be able to live with myself if I harmed
| someone that way. I answered truthfully and he thanked me and we
| ended things there.
|
| I do not mean this as a judgement on those who do work on systems
| that can physically harm people. Obviously, we need good
| engineers to design potentially dangerous systems. It is just how
| I realized I really don't have the character to do it.
| mcguire wrote:
| Good for you to realize that, then.
|
| On the other hand, I'm not entirely sure it's appropriate to be
| comfortable with that kind of position in any case.
| sho_hn wrote:
| > In the final round of interviews, one of the executives asked
| if I would be comfortable working on a device that could
| deliver a potentially dangerous dose of radiation to a patient
|
| Automotive software engineer here: I've asked the same in
| interviews.
|
| "We work on multi-ton machines that can kill people" is a
| frequently uttered statement at work.
| Hydraulix989 wrote:
| I would have considered the fact that for the vast majority of
| people suffering from cancer, this device helps them rather than
| harms them. However, I can also imagine leadership at some places
| trying to move fast and pressuring ICs into delivering something
| that isn't completely bulletproof in the name of the bottom line.
| That is something I would have tried to discern from the
| executives. Similar tradeoffs have been made before with cars,
| weighed against expected legal costs.
|
| There is plenty of other high-stakes software that involves human
| lives (Uber self-driving cars, SpaceX, Patriot missiles). Many of
| those systems scare me and morally frustrate me to the point
| where I would not want to work on one, but I totally understand
| if you have a personal profile that is different from mine.
| jancsika wrote:
| I found this comment from the article fascinating:
|
| > I am a physician who did a computer science degree before
| medical school. I frequently use the Therac-25 incident as an
| example of why we need dual experts who are trained in both
| fields. I must add two small points to this fantastic summary.
|
| > 1. The shadow of the Therac-25 is much longer than those who
| remember it. In my opinion, this incident set medical informatics
| back 20 years. Throughout the 80s and 90s there was just a
| feeling in medicine that computers were dangerous, even if the
| individual physicians didn't know why. This is why, when I was a
| resident in 2002-2006 we still were writing all of our orders and
| notes on paper. It wasn't until the US federal government slammed
| down the hammer in the mid-2000s and said no payment unless you
| adopt electronic health records, that computers made real inroads
| into clinical medicine.
|
| > 2. The medical profession, and the government agencies that
| regulate it, are accustomed to risk and have systems to manage
| it. The problem is that classical medicine is tuned to
| "continuous risks." If the Risk of 100 mg of aspirin is "1 risk
| unit" and the risk of 200 mg of aspirin is "2 risk units" then
| the risk of 150 mg of aspirin is strongly likely to be between 1
| and 2, and it definitely won't be 1,000,000. The mechanisms we
| use to regulate medicine, with dosing trials, and pharmacokinetic
| studies, and so forth are based on this assumption that both
| benefit and harm are continuous functions of prescribed dose, and
| the physician's job is to find the sweet spot between them.
|
| > When you let a computer handle a treatment you are exposed to a
| completely different kind of risk. Computers are inherently
| binary machines that we sometimes make simulate continuous
| functions. Because computers are binary, there is a potential for
| corner cases that expose erratic, and as this case shows,
| potentially fatal behavior. This is not new to computer science,
| but it is very foreign to medicine. Because of this, medicine has
| a built in blind spot in evaluating computer technology.
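|
| (A minimal sketch of that kind of corner case, in Python rather
| than the Therac-25's original PDP-11 assembly: one of the
| documented bugs involved a one-byte flag that was incremented
| rather than set, so it wrapped to zero on every 256th pass and a
| collimator-position check was silently skipped.
|
|     FLAG = 0  # shared one-byte flag: nonzero means "re-check"
|
|     def collimator_ok():
|         return True  # stand-in for the real hardware check
|
|     def housekeeping_pass():
|         # Incrementing (instead of setting) an 8-bit flag means it
|         # wraps to 0 every 256th pass, and the safety check below
|         # is skipped on exactly that pass.
|         global FLAG
|         FLAG = (FLAG + 1) % 256      # emulate 8-bit wraparound
|         if FLAG != 0:
|             assert collimator_ok()   # checked 255 passes out of 256
|         # on the 256th pass there is no check at all: a
|         # discontinuous failure, not a gradually increasing risk
|
|     for i in range(1, 600):
|         housekeeping_pass()
|         if FLAG == 0:
|             print(f"pass {i}: safety check skipped")
|
| Pass 255 is checked and pass 256 is not; there is no smooth
| dose-response curve to titrate along.)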
| mcguire wrote:
| I'm not sure I buy that. Or, well, I suppose that those in the
| medical field believe it, but I don't think they're right.
|
| Consider something like a surgeon nicking an artery while
| performing some routine surgery, or the patient not responding
| normally to anesthesia, or the anesthetist not getting the
| mixture right and the patient not coming back the way they went
| in. Or that subset of patients who have poor responses to a
| vaccine.
|
| Everybody likes to think of the world as a linear system, but
| it's not.
| ZuLuuuuuu wrote:
| This is one of the infamous incidents where a software failure
| caused harm to humans. I am kind of fascinated by such incidents
| (since I learn so much by reading about them). Are there any
| other examples that you guys know of? They don't have to involve
| harm to humans; any software-failure-related incident with big
| consequences will do.
|
| Another example that comes to my mind is the Toyota "unintended
| acceleration" incident. Or the "Mars Climate Orbiter" incident.
| crocal wrote:
| Google Ariane 501. Enjoy.
| probably_wrong wrote:
| If big consequences are what you're after, I can think of three
| typical incidents: the "most expensive hyphen in history" (you
| can search for it like that), its companion piece, the Mars
| Climate Orbiter (which I see you added now), and the Denver
| Airport Baggage System fiasco [1], where bad software planning
| caused over $500M in delays.
|
| [1] http://calleam.com/WTPF/?page_id=2086
| FiatLuxDave wrote:
| I find it interesting how often the Therac-25 is mentioned on HN
| (thanks to Dan for the list), but nobody ever mentions that those
| kinds of problems never entirely went away. The Therac-25 is just
| the famous one. You don't have to go back to 1986; there are
| definitely examples from this century. The root causes are
| somewhat different, and somewhat the same. But no one seems to be
| teaching these more modern cases to aspiring programmers in
| school, at least not to the level where every programmer I know
| has heard of them.
|
| For example, the issue which caused this in 2007:
|
| https://www.heraldtribune.com/article/LK/20100124/News/60520...
|
| Or the process issues which caused this in 2001:
|
| https://www.fda.gov/radiation-emitting-products/alerts-and-n...
| sebmellen wrote:
| I quote this from the article so people may take an interest in
| reading it. This is the opening paragraph:
|
| > _As Scott Jerome-Parks lay dying, he clung to this wish: that
| his fatal radiation overdose -- which left him deaf, struggling
| to see, unable to swallow, burned, with his teeth falling out,
| with ulcers in his mouth and throat, nauseated, in severe pain
| and finally unable to breathe -- be studied and talked about
| publicly so that others might not have to live his nightmare._
| jtchang wrote:
| I can say when I was doing my CS degree this was definitely
| covered. In fact it is one of the lectures that stood out in my
| mind at that time. My professor at the time (Bill Leahy)
| definitely drilled into us the importance of understanding the
| systems we were eventually going to work on.
|
| Not sure if this is still covered today.
| Mountain_Skies wrote:
| When I was the graduate teaching assistant for a software
| engineering lab, the students got a week off to do research on
| software failures that harmed humans. For many of the students
| it was the first time they gave any thought to the concept of
| software causing actual physical harm. I'm glad we were able to
| expose them to this reality, but I was also a bit disheartened,
| as they should have thought about it well before a fourth-year
| course in their major.
| lxgr wrote:
| It was covered in at least one of my classes as well.
| (Graduated only a few years ago.)
| icelancer wrote:
| Was definitely covered in my small school program in Embedded
| Systems in the early 2000s.
| phlyingpenguin wrote:
| The book I use to teach software engineering covers it, and I do
| use that chapter.
| Jtsummers wrote:
| Per younger CS colleagues who went through school in the last six
| years, it was still being taught at their smaller US colleges.
| lordnacho wrote:
| How much money was AECL making selling these things? You'd think
| a second pair of eyes on the code would not cost too much. Do I
| blame the one person? Not really; who in this world hasn't
| written a race condition at some point? Race conditions are also
| one of those things someone else might spot a lot sooner than the
| original author would.
|
| I agree with the sentiment that they took the software for
| granted. I get the feeling that happens in a lot of settings,
| most of them less life-threatening than this one. I've come
| across it myself too, in finance. Somehow someone decides they
| have invented a brilliant money-making strategy, if they could
| only get the coders to implement it properly. Of course the
| coders come back to ask questions, and then depending on the
| environment it plays out to a resolution. I get the feeling the
| same thing happened here. Some scientist said "hey all it needs
| is to send this beam into the patient" and assumed their
| description was the only level of abstraction that needed to be
| understood.
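|
| (On the race-condition point above, a minimal sketch in Python,
| illustrative only and nothing like the original code: the setup
| routine snapshots the prescription once, takes its time "moving
| the magnets", and never notices a concurrent edit.
|
|     import threading
|     import time
|
|     prescription = {"mode": "xray", "dose": 25}  # shared, no lock
|
|     def fire(mode, dose):
|         print(f"firing {dose} units in {mode} mode "
|               f"(prescription now says {prescription['mode']})")
|
|     def set_up_beam():
|         mode = prescription["mode"]       # snapshot taken here...
|         time.sleep(0.5)                   # ...slow magnet setup...
|         fire(mode, prescription["dose"])  # ...fired with stale mode
|
|     def operator_edit():
|         time.sleep(0.1)                    # edit lands mid-setup
|         prescription["mode"] = "electron"  # setup never re-runs
|
|     t1 = threading.Thread(target=set_up_beam)
|     t2 = threading.Thread(target=operator_edit)
|     t1.start(); t2.start(); t1.join(); t2.join()
|
| A second pair of eyes tends to catch exactly this kind of thing:
| "what happens if the operator edits the screen while the magnets
| are still moving?")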
| ufmace wrote:
| > The Therac-25 was the first entirely software-controlled
| radiotherapy device. As that quote from Jacky above points out:
| most such systems use hardware interlocks to prevent the beam
| from firing when the targets are not properly configured. The
| Therac-25 did not.
|
| This makes me think: there was only one developer there, I guess,
| who was doing everything in assembly. This software, and the
| process used to produce it, must have been designed in the early
| days of their devices, when hardware interlocks could be expected
| to prevent any of the really bad failure modes. I bet they never
| changed much of the software, or their procedures for developing,
| testing, qualifying, and releasing it, when they went from
| relying on hardware interlocks to relying on the quality of the
| software as the only thing preventing something terrible from
| happening.
| bluGill wrote:
| The software had been working just fine for years on earlier
| versions that had the interlocks. They never checked to see how
| often or why the interlocks fired before removing them. Turns out
| those interlocks fired often because of the same bugs.
| brians wrote:
| They had two fuses, so they had a 2:1 safety margin! Just
| like the NASA managers who decided that 30% erosion in an
| O-ring designed for no erosion meant a 3:1 safety margin.
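|
| (The arithmetic behind the sarcasm, presumably: the erosion went
| about 30% of the way through, so 1 / 0.30 is roughly 3, hence a
| claimed "3:1 margin" on a part that was designed to see no
| erosion at all.)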
| Gare wrote:
| A quote from the report:
|
| > Related problems were found in the Therac-20 software. These
| were not recognized until after the Therac-25 accidents because
| the Therac-20 included hardware safety interlocks and thus no
| injuries resulted.
|
| The safety fuses were occasionally blowing during the operation
| of Therac-20, but nobody asked why.
| baobabKoodaa wrote:
| > The safety fuses were occasionally blowing during the
| operation of Therac-20, but nobody asked why.
|
| Have you tried turning it off and on again?
| joncrane wrote:
| I feel like this makes it to HN once every few years or so.
|
| I know it well from it being the first and main case study in my
| software testing class as an undergraduate CS major in Washington
| DC in 1999.
|
| It will never not be interesting.
| siltpotato wrote:
| Apparently this is the seventh one. I've never worked on a
| safety critical system but this is the story that makes me
| wonder what it's like to do so.
| Jtsummers wrote:
| It's stressful, but often worthwhile. It requires diligence,
| deliberate action, and patience.
| matthias509 wrote:
| I used to work on public safety radio systems. Things that seem
| like minor issues, such as clipping the beginning of a
| transmission every now and then, are showstopper defects in that
| space.
|
| It's because it can be the difference between "Shoot" and
| "Don't shoot."
| at_a_remove wrote:
| I rather randomly met a woman with a similar sort of background
| and trajectory to mine: trained in physics, got sucked into
| computers via the brain drain. She programmed the models for
| radiation dosing in the metaphorical descendants of the
| Therac-25. I asked her just how often it came up in her work, and
| she mentioned that she trained under someone who was in the
| original group of people brought in to analyze and understand
| just what happened with the Therac-25. Fascinating stuff.
| dang wrote:
| (For the curious) the Therac-25 stack on HN:
|
| 2019 https://news.ycombinator.com/item?id=21679287
|
| 2018 https://news.ycombinator.com/item?id=17740292
|
| 2016 https://news.ycombinator.com/item?id=12201147
|
| 2015 https://news.ycombinator.com/item?id=9643054
|
| 2014 https://news.ycombinator.com/item?id=7257005
|
| 2010 https://news.ycombinator.com/item?id=1143776
|
| Others?
| kondro wrote:
| It comes up a lot, but it's an incredibly important story that
| bears repeating often, especially with similar issues like the
| 737 MAX occurring pretty recently.
| omginternets wrote:
| The featured comment is great, for those who missed it:
|
| I am a physician who did a computer science degree before medical
| school. I frequently use the Therac-25 incident as an example of
| why we need dual experts who are trained in both fields. I must
| add two small points to this fantastic summary.
|
| 1. The shadow of the Therac-25 is much longer than those who
| remember it. In my opinion, this incident set medical informatics
| back 20 years. Throughout the 80s and 90s there was just a
| feeling in medicine that computers were dangerous, even if the
| individual physicians didn't know why. This is why, when I was a
| resident in 2002-2006 we still were writing all of our orders and
| notes on paper. It wasn't until the US federal government slammed
| down the hammer in the mid-2000s and said no payment unless you
| adopt electronic health records, that computers made real inroads
| into clinical medicine.
|
| 2. The medical profession, and the government agencies that
| regulate it, are accustomed to risk and have systems to manage
| it. The problem is that classical medicine is tuned to
| "continuous risks." If the Risk of 100 mg of aspirin is "1 risk
| unit" and the risk of 200 mg of aspirin is "2 risk units" then
| the risk of 150 mg of aspirin is strongly likely to be between 1
| and 2, and it definitely won't be 1,000,000. The mechanisms we
| use to regulate medicine, with dosing trials, and pharmacokinetic
| studies, and so forth are based on this assumption that both
| benefit and harm are continuous functions of prescribed dose, and
| the physician's job is to find the sweet spot between them.
|
| When you let a computer handle a treatment you are exposed to a
| completely different kind of risk. Computers are inherently
| binary machines that we sometimes make simulate continuous
| functions. Because computers are binary, there is a potential for
| corner cases that expose erratic, and as this case shows,
| potentially fatal behavior. This is not new to computer science,
| but it is very foreign to medicine. Because of this, medicine has
| a built in blind spot in evaluating computer technology.
| beerandt wrote:
| It's so short-sighted that he doesn't see that forcing medical
| records onto digital systems so quickly is almost exactly the
| same problem playing out again: not as directly or dramatically,
| but with a much wider net and way more short- and long-term
| problems (including the software/systems trust issue he
| mentions).
| pessimizer wrote:
| _' Shocking' hack of psychotherapy records in Finland affects
| thousands_
|
| https://www.theguardian.com/world/2020/oct/26/tens-of-
| thousa...
| jbay808 wrote:
| Similar thing happened in Canada:
|
| https://globalnews.ca/news/6311853/lifelabs-data-hack-
| what-t...
| [deleted]
| dwohnitmok wrote:
| I suspect that a large proportion of the ways abstract plans fail
| are due to discontinuous jumps, foreseen or unforeseen. That may
| manifest in computer programs, government policy, etc.
|
| Continuity of risk, change, incentives, etc. lend themselves to
| far easier analysis and confidence in outcomes. And higher
| degrees of continuity as well as lower rates of change only
| make that analysis easier. Of course it's a trade-off: a flat
| line is the easiest thing to analyze, but also the least useful
| thing.
|
| In many ways I view the core enterprise of planning as an
| exercise in trying to smooth out discontinuous jumps (and their
| analogues in higher-order derivatives) to the best of one's
| ability, especially when they exist naturally (e.g. your system's
| objective response may be continuous, but its interpretation by
| humans is discontinuous; how are you going to compensate to
| regain as much continuity as possible?).
___________________________________________________________________
(page generated 2021-02-15 23:01 UTC)