[HN Gopher] UK air traffic control meltdown
       ___________________________________________________________________
        
       UK air traffic control meltdown
        
       Author : jameshh
       Score  : 500 points
       Date   : 2023-09-11 01:08 UTC (20 hours ago)
        
 (HTM) web link (jameshaydon.github.io)
 (TXT) w3m dump (jameshaydon.github.io)
        
       | omginternets wrote:
       | Of _course_ they blamed the French ^^
        
         | gumballindie wrote:
         | [flagged]
        
           | scythe wrote:
           | Except that was a completely different incident and it
           | occurred in the United States, not the UK. The Daily Mail did
           | try to make hay out of the idpol angle, but the British can't
           | reasonably be accused of shirking responsibility for the FAA
           | grounding flights in the US.
        
           | sergers wrote:
           | Well that's DailyMail for you, where they tag anything
           | parenting or healthy as "femail" section... cause you know
           | only women are looking at that stuff.
           | 
           | Lol.
           | 
           | Anyways I actually think that's just reasonable response,
           | system goes down/related system goes down , and in reviewing
           | they are making frivolous updates to names that aren't
           | needed.
           | 
           | I would question these updates (though they may be a minor
           | part of the overall updates occurring).
        
             | cratermoon wrote:
             | At least until the 70s most newspapers had a section called
             | "Women" or something similar. Even the news about the
             | 60s/70s women's movement appeared there, not in the main
             | "news" sections. Those sections were mostly renamed around
             | that time to "Lifestyle", "Home", or just "Features".
        
             | vixen99 wrote:
             | Is this the UK or US edition? It's always easy fun to have
             | a go at the Daily Mail which presumably you read regularly
             | else you wouldn't be commenting. Its sin seems to be that
             | it's not a serious broadsheet. It's a tabloid with very
             | broad appeal that has to be profitable and therefore tries
             | to reflect the requirements of the British public for such
             | a publication. Perhaps you should lower your expectations.
             | 
             | 'Tag anything parenting or healthy ...'? No, that's not
             | correct. Here are a few health & food related items back to
             | mid-September that did not appear in 'female'. You are
             | right about parenting; most parenting in the UK is still
             | undertaken primarily (in terms of executive action) by
             | females so items on this topic are reasonably included in
             | 'female'. The growing number of people who don't have
             | children probably appreciate this sub-grouping by the Mail.
             | You may not approve but this is what happens. Single males
             | with dependent children are not known for objecting to
             | checking out that section. It's not forbidden.
             | 
             | https://www.dailymail.co.uk/wires/pa/article-12505173/Healt
             | h... https://www.dailymail.co.uk/wires/ap/article-12504751/
             | Eggpla... https://www.dailymail.co.uk/health/article-125046
             | 49/Suicide-... https://www.dailymail.co.uk/health/article-1
             | 2504813/Anthony-... https://www.dailymail.co.uk/health/arti
             | cle-12503801/Cancer-n... https://www.dailymail.co.uk/wires/
             | reuters/article-12503815/W... https://www.dailymail.co.uk/w
             | ires/reuters/article-12503299/R...
             | https://www.dailymail.co.uk/news/article-12468365/One-
             | woman-... https://www.dailymail.co.uk/wires/reuters/article
             | -12502685/W... https://www.dailymail.co.uk/wires/ap/article
             | -12501533/Food-r...
             | https://www.dailymail.co.uk/news/article-12490747/How-
             | safe-c...
        
               | sergers wrote:
               | Dailymail is actually a site I frequent multiple times a
               | day, every day.
               | 
               | not all content is for everyone, but they got something,
               | they are definitely a tabloid style.
               | 
               | they narrate particular views to the public but cover all
               | kinds of content, and a lot of it i would consider
               | advertisements/plugs rather than actual articles.
               | 
               | i would guess a highly elderly/conservative majority
               | base
               | 
               | they pander to lowest common denominator, which is fine
               | -- they are a for profit news/tabloid, i find some of it
               | entertaining (As per daily visits).
               | 
               | do you work for them, or are you just a big fan, to do all
               | that digging in defense of DM over my exaggeration that ALL
               | content like that is in that category? i didnt take my
               | own comment all that seriously so honest ask.
        
               | gumballindie wrote:
               | > tries to reflect the requirements of the British public
               | 
               | The issue though is that quite often the Dailymail
               | doesn't reflect, but rather controls the requirements of
               | the British public.
        
         | swarnie wrote:
         | As is tradition =)
         | 
         | (We actually have no major issues with the French, at least in
         | my generation; it's all just good fun)
        
           | omginternets wrote:
           | It's all good. We still take cheap-shots at English food and
           | English women ;)
           | 
           | Edit: I lived in London for 3 years. I miss it every day.
        
             | laputan_machine wrote:
             | I wouldn't worry about it, we take cheap shots at French
             | food and French people too ;)
        
               | omginternets wrote:
               | Sir, those are dueling words.
        
               | ateng wrote:
               | Hold on, I think we take cheap shots at French people,
               | but _expensive_ shots on the food and wine
        
         | [deleted]
        
       | dang wrote:
       | Related. Others?
       | 
       |  _Coincidentally-identical waypoint names foxed UK air traffic
       | control system_ - https://news.ycombinator.com/item?id=37430384 -
       | Sept 2023 (64 comments)
       | 
       |  _UK air traffic control outage caused by bad data in flight
       | plan_ - https://news.ycombinator.com/item?id=37402766 - Sept 2023
       | (20 comments)
       | 
       |  _NATS report into air traffic control incident details root
       | cause and solution_ -
       | https://news.ycombinator.com/item?id=37401864 - Sept 2023 (19
       | comments)
       | 
       |  _UK Air traffic control network crash_ -
       | https://news.ycombinator.com/item?id=37292406 - Aug 2023 (23
       | comments)
        
         | switch007 wrote:
         | The title of this post made me think there was a new, current
         | meltdown !
        
         | a_wild_dandan wrote:
         | The recent episode of The Daily about the (US) aviation
         | industry has convinced me that we'll see a catastrophic
         | headline soon. Things can't go on like this.
        
       | darkclouds wrote:
       | Interesting to see that flight plans over the UK have to be filed
       | 4 hours in advance.
       | 
       | No mention of plane, pilot, passenger and cargo manifests. So why
       | the 4 hour lead time, is this the time it takes UK Authorities to
       | look people up or work out if the cargo could be dangerous in an
       | airborne Anthrax (Gruinard) Island [1] or Japanese subway Sarin
       | [2], or an IRA favourite, fertilizer bomb that's bypassed the
       | usual purchase reporting regulations used by people like Jeremy
       | Clarkson and Harry Metcalfe as their store of wealth[3]?
       | 
       | It makes me wonder just how much more surveillance of the
       | population exists, knowing I can't even step out of the front door
       | without attracting surveillance of the type that followed Dr
       | David Kelly.
       | 
       | Sure, it's not a cyber attack per se, carried out over the internet
       | like a DDOS attack or a brute force password guessing attack with
       | port knocking mitigation, but how would one carry out a cyber
       | attack on this system if the only attack vector is from people
       | submitting flight plans?
       | 
       | There sure is a constant playing down of the cyber attack angle
       | to this which makes me think someone wants to Blurred Lines!
       | 
       | One point on the lack of uniquely named global way points, which
       | is the main crux of the problem falling over if some are to be
       | believed.
       | 
       | The USA demonstrates a disproportionate number of similar names,
       | by virtue of Europeans migrating to the US [4]. So has this
       | situation arisen with this system in other parts of the world
       | like in the US? How can a country that created the globe spanning
       | British Empire become so insular with regards to air travel in
       | this way?
       | 
       | I'd agree with the initial assessment that there appears to be a
       | lack of testing, but are the specifications simply not fit for
       | purpose? I'm sure various pilots could speak out here, because
       | some of the regulations require planes to be minimally distanced
       | from each other when transiting across the UK.
       | 
       | On the point of ICAO and other bodies to eradicate non-unique
       | waypoint names, it's clear there is some legacy constraint still
       | impeding the safety of air travellers, perhaps caused by poor
       | audio quality analogue radio, so perhaps it's time for the
       | unambiguous and globally recognised What 3 Words form of location
       | identifier, to come into effect?
       | 
       | The UK police already prefer it to speed up response times [5].
       | And although the same location can create 3 different words,
       | suggesting drift with GPS [6], even if What 3 Words could not be
       | used for a global system, having something a bit longer to create
       | an easily recognisable human globally unique identifier is needed
       | for these flight plans and perhaps maritime situations.
       | 
       | Obviously global coordination will be like herding cats, and if
       | such a fixed size global network of cells were introduced, some
       | areas like transiting over the Atlantic or Pacific could command
       | bigger cells, but transiting over built up areas like London
       | would require smaller sized identifiable cells. But IF ever there
       | was a time for the New World Order to step up to the plate and
       | assert itself, to create a Globally Unique Place ID (GUPID) for
       | the whole planet, now is the time.
       | 
       | On the point that humans were kept safe only by the sheer common
       | sense of the pilots and traffic control tower staff, it's not
       | something NATS did or should claim, their systems were down, so
       | everyone had to resort back to pen and paper and blocks in
       | queues, and apart from Silverstone when the F1 British Grand Prix
       | is on, is air space ever that densely populated?
       | 
       | NATS were caught with their pants down at so many levels of
       | altitude, is this laissez faire UK management style that saw the
       | Govt having to step in to bail out the banks during the financial
       | crisis, still infecting other parts of UK life and still coming
       | to light?
       | 
       | It's beginning to look a lot like Christmas!
       | 
       | [1] https://www.youtube.com/watch?v=_8Zr0IPtx80
       | 
       | [2] https://www.youtube.com/watch?v=RTr1lquCQMg
       | 
       | [3] https://youtu.be/LS54AJSadT4?t=279
       | 
       | [4]
       | https://en.wikipedia.org/wiki/List_of_U.S._places_named_afte...
       | 
       | [5] https://www.bloomberg.com/news/articles/2019-03-21/u-k-
       | polic...
       | 
       | [6] https://support.what3words.com/en/articles/2212837-why-
       | do-i-...
        
         | amiga386 wrote:
         | > so why the 4 hour lead time
         | 
         | To answer your question without conspiracy drivel, let's look
         | up CAP 694: The UK Flight Planning Guide [0]
         | 
         | Chapter 1
         | 
         | > 6.1 The general ICAO requirement is that FPLs should be filed
         | on the ground at least 60 minutes before clearance to start-up
         | or taxi is requested. The "Estimated Off Block Time" (EOBT) is
         | used as the planned departure time in flight planning, not the
         | planned airborne time.
         | 
         | > 6.3 IFR flights on the North Atlantic and on routes subject
         | to Air Traffic Flow Management, should be filed a minimum of 3
         | hours before EOBT (see Chapter 4).
         | 
         | Chapter 4
         | 
         | > 1.1 The UK is a participating State in the Integrated Initial
         | Flight Plan Processing System (IFPS), which is an integral part
         | of the Eurocontrol centralised Air Traffic Flow Management
         | (ATFM) system.
         | 
         | > 4.1 FPLs should be filed a minimum of 3 hours before
         | Estimated Off Block Time (EOBT) for North Atlantic flights and
         | those subject to ATFM measures, and a minimum of 60 minutes
         | before EOBT for all other flights.
         | 
         | So the answer is because the UK is part of a Europe-wide air
         | traffic control system, which hands out full flight plans to
         | all the relevant authorities for each airspace, and they
         | decided 3 hours is needed so that all possible participants can
         | get their shit together and tell you if they accept the plan or
         | not.
         | 
         | An _entirely separate system_ exists to share Advanced
         | Passenger Information, i.e. passenger manifests [1], and it
         | goes even further that airlines share your overall identity
         | with each other, known as a Passenger Name Record [2], and a
         | variety of countries, led by the USA, insist on this
         | information in advance before the plane is allowed to take off
         | [3]
         | 
         | If you're going to be paranoid, please work with known facts
         | instead of speculating.
         | 
         | [0] https://publicapps.caa.co.uk/docs/33/CAP%20694.pdf
         | 
         | [1]
         | https://en.wikipedia.org/wiki/Advance_Passenger_Information_...
         | 
         | [2] https://en.wikipedia.org/wiki/Passenger_name_record
         | 
         | [3]
         | https://en.wikipedia.org/wiki/United_States%E2%80%93European...
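
The filing rule quoted from CAP 694 above comes down to a small calculation. A minimal sketch, assuming only the two lead times quoted in the comment (3 hours for North Atlantic flights and those subject to ATFM measures, 60 minutes otherwise); the function and parameter names are made up for illustration and are not part of any CAA or Eurocontrol system:

```python
from datetime import datetime, timedelta


def latest_filing_time(eobt: datetime, north_atlantic_or_atfm: bool) -> datetime:
    """Latest time a flight plan can be filed, per the two rules quoted above.

    eobt is the Estimated Off Block Time. The boolean flag is a hypothetical
    simplification of "North Atlantic flight or subject to ATFM measures".
    """
    lead = timedelta(hours=3) if north_atlantic_or_atfm else timedelta(minutes=60)
    return eobt - lead


eobt = datetime(2023, 8, 28, 12, 0)
print(latest_filing_time(eobt, True))   # 2023-08-28 09:00:00
print(latest_filing_time(eobt, False))  # 2023-08-28 11:00:00
```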
        
       | NovemberWhiskey wrote:
       | This is not the first time this has happened; the phenomenon has
       | even got a name - "poison flight plan".
        
         | krisoft wrote:
         | > the phenomenon has even got a name - "poison flight plan".
         | 
         | Maybe, but it must not be a common phrase because your comment
         | is the first result when I search for it.
         | 
         | And it is also mentioned in this article: http://www.aero-
         | news.net/subsite.cfm?do=main.textpost&id=ce2...
         | 
         | And that's about it? Do you have any other sources?
        
           | tpmx wrote:
           | I think that term was invented four days ago by that article
           | writer. There are four other occurrences before then and
           | they're about PS2 games.
        
             | NovemberWhiskey wrote:
             | This term was in wide circulation when I was consulting at
             | NATS in the 2000-2005 time frame.
        
         | dboreham wrote:
         | The generic term I'm familiar with is "ping of death".
        
       | crabbone wrote:
       | I want to comment specifically on:
       | 
       | > The software and system are not properly tested.
       | 
       | Followed by suggesting to do fuzzing tests.
       | 
       | * Automatically generating valid flight paths is somewhat hard
       | (and you'd have to know which ones are valid because the system,
       | apparently, is designed to also reject some paths). It's also
       | possible that such a generator would generate valid but
       | improbable flight paths. There's probably an astronomic number of
       | possible flight paths, which makes exhaustive testing impossible,
       | thus no guarantee that a "weird" path would've been found. The
       | points through which the paths go seem to be somewhat dynamic
       | (i.e. new airports aren't added every day, but in a life-span of
       | such a system there will be probably a few added). More
       | realistically some points on flight paths may be removed. Does
       | the fuzzing have to account for possibilities of new / removed
       | points?
       | 
       | * This particular functionality is probably buried deep inside
       | other code with no direct or easy way to extricate it from its
       | surrounding, and so would be very difficult to feed into a
       | fuzzer. Which leads to the question of how much fuzzing should be
       | done and at what level. Add to this that some testing
       | methodologies insist on divorcing the testing from development as
       | not to create an incentive for testers to automatically okay the
       | output of development (as they would be sort of okaying their own
       | work). This is not very common in places like Web, but is common
       | in eg. medical equipment (is actually in the guidelines). So, if
       | the developer simply didn't understand what the specification
       | told them to do, then it's possible that external testing wasn't
       | capable of reaching the problematic code-path, or was severely
       | limited in its ability to hit it.
       | 
       | * In my experience with formats and standards like these it's
       | often the case that the standard captures a lot of impossible or
       | unrealistic cases, hopefully a superset of what's actually needed
       | in practice. Flagging every way in which a program doesn't match
       | the specification becomes useless or even counter-productive
       | because developers become overloaded with bug reports most of
       | which aren't really relevant. It's hard to identify the cases
       | that are rare but plausible. The fact that the testers didn't
       | find this defect on time is really just a function of how much
       | time they have. And, really, the time we have to test any program
       | can cover a tiny fraction of what's required to test a program
       | exhaustively. So, you need to rely on heuristics and gut feeling.
        
         | theptip wrote:
         | None of this really argues against fuzz testing; even with
         | completely bogus/malformed flight plans, it shouldn't be
         | possible for a dead letter to take down the entire system. And,
         | since it's translating between an upstream and downstream
         | format (and all the validation is done when ingesting the
         | upstream), you probably want to be sure anything that is valid
         | upstream is also valid downstream.
         | 
         | It's true that fuzz testing is easiest when you can do it more
         | at the unit level (fuzz this function implementing a core
         | algorithm, say) but doing whole-system fuzz tests is perfectly
         | fine too.
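
The property described in the comment above (no input, valid or not, should produce anything worse than a controlled rejection) can be checked with a very small random fuzzer. This is a rough sketch under stated assumptions: `convert` is a toy stand-in for the real translation step, `RejectedPlan` is a hypothetical exception type, and a flight plan is reduced to a list of waypoint names drawn from a tiny pool so that duplicate names come up often.

```python
import random


class RejectedPlan(Exception):
    """Controlled rejection: the plan is set aside and the system keeps running."""


def convert(plan):
    """Hypothetical stand-in for the upstream-to-downstream translation."""
    if len(plan) < 2:
        raise RejectedPlan("need at least two waypoints")
    return {"entry": plan[0], "exit": plan[-1], "route": list(plan)}


def fuzz(convert_fn, rounds=1000, seed=0):
    """Feed random plans to convert_fn and collect every uncontrolled failure."""
    rng = random.Random(seed)
    names = ["AAA", "BBB", "CCC", "DDD"]  # tiny pool, so duplicate names are common
    crashes = []
    for _ in range(rounds):
        plan = [rng.choice(names) for _ in range(rng.randint(0, 6))]
        try:
            convert_fn(plan)
        except RejectedPlan:
            pass  # acceptable outcome: the plan was refused in a controlled way
        except Exception as exc:
            crashes.append((plan, exc))  # anything else is a finding
    return crashes


print(len(fuzz(convert)))  # 0
```

Generating realistic plans is the hard part, as the parent comment says; this sketch only checks the weaker "never crash" property, which does not need realistic inputs.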
        
           | crabbone wrote:
           | This is not against the principle of fuzz testing. This is to
           | say that the author doesn't really know the reality of
           | testing and is very quick to point fingers. It's easy to tell
           | in retrospect that this particular aspect should've been
           | tested. It's basically impossible to find such defects
           | proactively.
        
       | seabass-labrax wrote:
       | I had been considering becoming an air traffic controller myself,
       | and it rather tickles me to think I might have missed my once-in-
       | a-lifetime opportunity to direct aircraft with the original pen-
       | and-paper flight strip mechanism in the 21st century! Completely
       | safe, excruciatingly low-capacity, and sounds like awfully good
       | fun as a novelty (for the willing ATC, not the passengers stuck
       | on the ground, I hasten to add).
        
       | drachir91 wrote:
       | What ticked me off is that when the primary system threw in the
       | towel, an EXACT SAME system took over and ran the exact same code
       | on the exact same data as the primary. I know that with code and
       | algorithms it's not always the case but even then you know what
       | doing the same thing over and over expecting different results
       | defines...
       | 
       | Yes, it can be argued that the software should've had more
       | graceful failure modes and this shouldn't have thrown a critical
       | exception. It can be argued that the programmers should've seen
       | this possibility. We can argue a lot of things about this.
       | 
       | But the reality is that this is a mission-critical system. And
       | for such systems, there're ways to mitigate all of these mistakes
       | and allow the system to continue functioning.
       | 
       | The easiest (but least safe) one would be to have the secondary
       | system loaded with code that does the same thing but written by a
       | different team/vendor. It reduces the chance, from 100% to much,
       | much less, that an input which provokes an unforeseen,
       | system-breaking bug in the primary will provoke the same bug in
       | the secondary.
       | 
       | An even better solution is to have a triumvirate system, where
       | all 3 have code written by different teams, and they always
       | compare results. If 3 agree, great, if 2 agree, not so great but
       | safe to assume that the bug is in the 1 not the 2 (but should
       | throw an alert for the supervisors that the whole system is in a
       | degraded mode where any further node failure is a showstopper),
       | and if all disagree, grind everything to a halt because the world
       | is ending, and let the humans handle it.
       | 
       | It can be refined even further. And it's not something new. So
       | why wasn't this system implemented in such a way? (Aside from
       | cost. I don't care about anyones cost-cutting incentives in
       | mission-critical systems. Sorry capitalism...)
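
The three-way comparison described in the comment above can be sketched as a small voter. This is an illustration only, assuming three independently written implementations that return hashable results; the names `vote` and `run_voted` are invented here and nothing suggests NATS uses such a scheme.

```python
from collections import Counter


def vote(results):
    """Compare three results: all agree, two agree, or no agreement."""
    value, count = Counter(results).most_common(1)[0]
    if count == 3:
        return ("ok", value)
    if count == 2:
        return ("degraded", value)  # usable, but supervisors should be alerted
    return ("halt", None)           # no majority: hand over to humans


def run_voted(implementations, plan):
    """Run each implementation on the same plan and vote on the outcomes."""
    results = []
    for impl in implementations:
        try:
            results.append(impl(plan))
        except Exception:
            # Each crash gets its own unique marker, so crashes can never
            # form a majority with each other.
            results.append(object())
    return vote(results)


print(vote(["A", "A", "A"]))  # ('ok', 'A')
print(vote(["A", "B", "A"]))  # ('degraded', 'A')
print(vote(["A", "B", "C"]))  # ('halt', None)
```

Treating a crash as a vote that agrees with nothing means one crashing implementation degrades the system, while two crashes halt it.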
        
         | throwaway894345 wrote:
         | > Aside from cost. I don't care about anyones cost-cutting
         | incentives in mission-critical systems. Sorry capitalism...
         | 
         | Capitalism is happy to have redundancy in mission critical
         | systems all the time. Why would it care here?
        
           | drachir91 wrote:
           | I don't know but in recent years I'm increasingly seeing
           | mission critical systems having only token or "apparent"
           | redundancies instead of real ones, and couldn't find any
           | other rationale than cost savings and shareholder bottom
           | lines. I'm not saying that capitalism = bad, it's mostly
           | better than the alternatives, but just like its most direct
           | competitor, it suffers from bad implementations across the
           | world and unbounded human greed.
           | 
           | A recent and very "in the face" example, also from the air
           | travel industry would be the B737 Max and its AoA sensors.
           | There were two, for two flight computers, but MCAS only used
           | 1 flight computer and 1 AoA sensor, despite the already
           | existing crosslinks between the flight computers and the
           | sensors...
           | 
           | Profit maxing first with the "no need for a new type rating
           | for the pilots", then cost-cutting first in aeronautical
           | engineering (solving an airframe design problem with
           | software, plus designing a flight envelope protection system
           | that can overpower the human pilots).
           | 
           | Then cost-cutting in software engineering and QC, rushing out
           | software made by engineers (probably) inexperienced in the
           | field and failing to properly test it and ensure that it
           | had the needed redundancy.
        
         | crabbone wrote:
         | > an EXACT SAME system took over and ran the exact same code
         | 
         | Did you ever work with HA systems? Because this is how they
         | work. It's two copies of the same system intended for the cases
         | when eg. hardware fails, or network partitioning happens etc.
        
           | drachir91 wrote:
           | No, I have not. But HA systems work like that because hardware
           | or network failure is what they are designed to guard
           | against, not a latent bug in the software logic. If there's a
           | software bug, both systems will exhibit the same behavior, so
           | HA fails there.
        
         | burntwater wrote:
         | I'm wondering if the backup system could have a delayed queue;
         | say, 30 seconds behind. If the primary fails, and exactly 30
         | seconds later the secondary system fails, you have reasonable
         | assurance that it was queue input that caused the failure.
         | Rollback to the last successful queue input, skip and flag the
         | suspect input, and see if the next input is successful.
        
           | drachir91 wrote:
           | This looks to me like it could work, but would need a ready
           | force of technicians always expecting something like that so
           | they can troubleshoot it in a timely manner.
        
       | asimpleusecase wrote:
       | And why could the system not put the failed flight plan in a
       | queue for human review and just keep on working for the rest of
       | the flights? I think the lack of that "feature" is what I find so
       | boggling.
        
         | dboreham wrote:
         | Because some software developers are crap at their jobs.
        
         | hn_throwaway_99 wrote:
         | To be fair that is exactly what the article said was a major
         | problem, and which the postmortem also said was a major
         | problem. I agree I think this is the most important issue:
         | 
         | > The FPRSA-R system has bad failure modes
         | 
         | > All systems can malfunction, so the important thing is that
         | they malfunction in a good way and that those responsible are
         | prepared for malfunctions.
         | 
         | > A single flight plan caused a problem, and the entire FPRSA-R
         | system crashed, which means no flight plans are being processed
         | at all. If there is a problem with a single flight plan, it
         | should be moved to a separate slower queue, for manual
         | processing by humans. NATS acknowledges this in their "actions
         | already undertaken or in progress":
         | 
         | >> The addition of specific message filters into the data flow
         | between IFPS and FPRSA-R to filter out any flight plans that
         | fit the conditions that caused the incident.
        
         | cratermoon wrote:
         | > why could the system not put the failed flight plan in a
         | queue
         | 
         | Because it doesn't look at the data as a "flight plan"
         | consisting of "way points" with "segments" along a "route" that
         | has any internal self-consistency. It's a bag of strings and
         | numbers that's parsed and the result passed along, if parsing
         | is successful. If not, give up. In this case fail the _entire
         | system_ and take it out of production.
         | 
         | Airline industry code is a pile of badly-written legacy
         | wrappers on top of legacy wrappers. (Mostly not including
         | actual flight software on the aircraft. Mostly). The FPRSA-R
         | system mentioned here is _not_ a flight plan system, it's an
         | ETL system. It's not coded to model or work with flight plans,
         | it's just parsing data from system A, re-encoding it for system
         | B, and failing hard if it can't.
        
           | slt2021 wrote:
           | good ETLs are usually designed to separate good records from
           | bad records, so even if one or two rows in the stream do not
           | conform to schema - you can put them aside and process the
           | rest.
           | 
           | seems like poor engineering
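
The "put them aside and process the rest" pattern from the comment above is often called a dead-letter or quarantine queue. A minimal sketch, assuming records arrive one at a time and that any exception from the transform marks the record as bad; `run_etl` and the toy transform are hypothetical and say nothing about how FPRSA-R is built.

```python
def run_etl(records, transform):
    """Transform each record; set failures aside instead of stopping the stream."""
    good, quarantined = [], []
    for record in records:
        try:
            good.append(transform(record))
        except Exception as exc:
            # Keep the record and the reason so a human can review it later.
            quarantined.append((record, repr(exc)))
    return good, quarantined


good, quarantined = run_etl(["1", "x", "3"], int)
print(good)              # [1, 3]
print(len(quarantined))  # 1
```

As the replies in this thread point out, quarantining is only half of the answer for flight plans: the set-aside record still describes a real aircraft, so it has to reach a human quickly rather than sit in a queue.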
        
             | jandrese wrote:
             | The problem is that it means you have a plane entering the
             | airspace at some point in the near future and the system
             | doesn't know it is going to be there. The whole point of
             | this is to make sure no two planes are attempting to occupy
             | the same space at the same time. If you don't know where
             | one of the planes will be you can't plan all of the rest to
             | avoid it.
             | 
             | The thing that blows my mind is that this was apparently
             | the first time this situation had happened after 15 million
             | records processed. I would have expected it to trigger much
             | more often. It makes me wonder if there wasn't someone who
             | was fixing these as they came up in the 4 hour window, and
             | he just happened to be off that day.
        
               | d1sxeyes wrote:
               | > It makes me wonder if there wasn't someone who was
               | fixing these as they came up in the 4 hour window, and he
               | just happened to be off that day.
               | 
               | This is very possible. I know of a guy who does (or at
               | least a few years ago did) 24x7 365 on-call for a piece
               | of mission (although not safety) critical aviation
               | software.
               | 
               | Most of his calls were fixing AWBs quickly because
               | otherwise planes would need to take off empty or lose
               | their take-off slot.
               | 
               | Although there had been some "bus factor" planning and
               | mitigation around this guy's role, it involved engaging
               | vendors etc. and would have likely resulted in a lot of
               | disruption in the short term.
        
               | fbdab103 wrote:
               | Please tell me this guy is now wealthy beyond imagination
               | and living a life of leisure?
        
               | zaphar wrote:
               | Bad records aren't supposed to be ignored. They are
               | supposed to be looked at by a human who can determine
               | what to do.
               | 
               | Failing the way NATS did means that _all_ future flight
               | plan data, including for planes already in the sky, is no
               | longer being processed. The safer failure mode was
               | definitely to flag this plan and surface it to a human while
               | continuing to process other plans.
        
             | cratermoon wrote:
             | I never said it was a _good_ ETL system. Heck, I don't
             | even know if the specs for it even specify what to do
             | with a bad record - there are at least 300 pages detailing
             | the system. Looking around at other stories, I see repeated
             | mentions of how the circumstances leading to this failure
             | are supposedly extremely rare, "one in 15 million"
             | according to one official[1]. But at 100,000 flights/day
             | (estimated), this kind of situation would occur,
             | statistically, twice a year.
             | 
             | 1 https://news.sky.com/story/major-flights-disruption-
             | caused-b...
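             | 
             | A rough check of that arithmetic, as a sketch. Both inputs
             | are the figures quoted above (the official's "one in 15
             | million" and my own 100,000/day estimate), not measurements:

```python
# Back-of-the-envelope estimate. Both constants are the figures quoted in
# the comment above and are assumptions, not measured values.
FAILURE_RATE = 1 / 15_000_000   # "one in 15 million" plans
PLANS_PER_DAY = 100_000         # estimated flight plans per day


def expected_failures_per_year(rate: float, plans_per_day: int) -> float:
    """Expected number of triggering plans in a 365-day year."""
    return rate * plans_per_day * 365


def days_between_failures(rate: float, plans_per_day: int) -> float:
    """Mean number of days between triggering plans."""
    return 1 / (rate * plans_per_day)


if __name__ == "__main__":
    print(round(expected_failures_per_year(FAILURE_RATE, PLANS_PER_DAY), 2))
    print(round(days_between_failures(FAILURE_RATE, PLANS_PER_DAY)))
```

             | With those inputs it comes out at about 2.4 per year, or
             | one roughly every 150 days.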
        
             | Scarblac wrote:
             | This flight plan was correct though, if there was some
             | validation like that then it should have passed.
             | 
             | The code that crashed had a bug, it couldn't deal with all
             | valid data.
        
         | pimterry wrote:
         | To be fair, the article suggests early on that sometimes these
         | plans are being processed for flights already in the air
         | (although at least 4 hours away from the UK).
         | 
         | If you can stop the specific problematic plane taking off then
         | keeping the system running is fine, but once you have a flight
         | in the air it's a different game.
         | 
         | It's not totally unreasonable to say "we have an aircraft en
         | route to enter UK airspace and we don't know when or where -
         | stop planning more flights until we know where that plane is".
         | 
         | If you really can't handle the flight plan, I imagine a
         | reasonable solution would be to somehow force the incoming
         | plane to redirect and land before reaching the UK, until you
         | can work out where it's actually going, but that's definitely
         | something that needs to wait for manual intervention anyway.
        
           | krisoft wrote:
           | > "we have an aircraft en route to enter UK airspace and we
           | don't know when or where - stop planning more flights until
           | we know where that plane is".
           | 
           | Flight plans don't tell where the plane is. Where is this
           | assumption coming from?
        
             | joncrocks wrote:
             | Presumably you need to know where upcoming flights are
             | going to be in the future (based on the plan), before they
             | hit radar etc.
        
               | macguillicuddy wrote:
               | For the most part (although there are important
               | exceptions), IFR flights are always in radar contact with
               | a controller. The flight plan is a tool that allows ATC and the
               | plane to agree a route so that they don't have to be
               | constantly communicating. ATC 'clears' a plane to
               | continue on the route to a given limit, and expects the
               | plane to continue on the plan until that limit unless
               | they give any future instructions.
               | 
               | In this regard UK ATC can choose to do anything they like
               | with a plane when it comes under their control - if they
               | don't consider the flight plan to be valid or safe they
               | can just instruct the plane to hold/divert/land etc.
               | 
               | I'm not sure the NATS system that failed has the ability
               | to reject a given flight plan back upstream.
        
               | RyJones wrote:
               | Mostly yes; however, there are large parts of the
               | Atlantic and Pacific where that isn't true (radar
               | contact). I know the Atlantic routes are frequently full
               | of planes that left the US and Canada heading to the UK.
               | 
               | I have no idea what percent of the volume into the UK
               | comes from outside radar control; if they asked a flight
               | to divert, that may open multiple other cans of worms.
        
               | mannykannot wrote:
               | > If they asked a flight to divert, that may open
               | multiple other cans of worms.
               | 
               | Any ATC system has to be resilient enough to handle a
               | diversion on account of things like bad weather,
               | mechanical failure or a medical emergency. In fact, I
               | would think the diversion of one aircraft would be less
               | of a problem than those caused by bad weather, and
               | certainly less than the problem caused by this failure.
               | Furthermore, I would guess that the mitigation would be
               | just to manually direct the flight according to the
               | accepted flight plan, as it was a completely valid one.
               | 
               | One of the many problems here is that they could not
               | identify the problem-triggering flight plan for hours,
               | and only with the assistance of the vendor's engineers.
               | Another is that the system had immediately foreclosed on
               | that option anyway, by shutting down.
        
             | lxgr wrote:
             | Flight plans do inform ATC where and when a plane is
             | expected to enter their FIR though, no?
        
         | micromacrofoot wrote:
         | I've had brief glimpses at these systems, and honestly I
         | wouldn't be surprised if it took more than a year for a simple
         | feature like this to be implemented. These systems look like
         | decades of legacy code duct-taped together.
        
         | jameshart wrote:
         | The algorithm as described in the blogpost is _probably_ not
         | implemented as a straightforward piece of procedural code that
         | goes step by step through the input flightplan waypoints as
         | described. It may be implemented in a way that incorporates
         | some abstractions that obscured the fact that this was an
         | _input_ error.
         | 
         | If from the code's point of view it looked instead like a
         | sanity failure in the underlying navigation waypoint database,
         | aborting processing of flight plans makes a lot more sense.
         | 
         | Imagine the code is asking some repository of waypoints and
         | routes 'find me the waypoint where this route leaves UK
         | airspace'; then it asks to find the route segment that
         | incorporates that waypoint; then it asserts that that segment
         | passes through UK airspace... if that assertion fails, that
         | doesn't look _immediately_ like a problem with the flight plan
         | but rather with the invariant assumptions built into the route
         | data.
         | 
         | And of course in a sense it _is_ potentially a fatal bug
         | because this issue demonstrates that the assumptions the
         | algorithm is making about the data _are_ wrong and it is
         | potentially capable of returning incorrect answers.
        
         | Spivak wrote:
         | Because they hit "unknown error" and when that happens on
         | safety critical systems you have to assume that all your
         | system's invariants are compromised and you're in undefined
         | behavior -- so all you can do is stop.
         | 
         | Saying this should have been handled as a known error is
         | totally reasonable but that's broadly the same as saying they
         | should have just written bug free code. Even if they had parsed
         | it into some structure this would be the equivalent of a
         | KeyError popping out of nowhere because the code assumed an
         | optional key existed.
         | 
         | For these kinds of things the post mortem and remediation have
         | to kinda take as given that eventually a not predictable in
         | advance unhandled unknown error will occur and then work on how
         | it could be handled better. Because of course the solution to a
         | bug is to fix the bug, but the issue and the reason for the
         | meltdown is a DR plan that couldn't be implemented in a
         | reasonable timeframe. I don't care what programming practices,
         | what style, what language, what tooling. Something of a similar
         | caliber will happen again eventually with probability 1 even
         | with the best coders.
        
           | ummonk wrote:
           | That it's safety critical is all the more reason it should
           | fail gracefully (albeit surfacing errors to warn the user). A
           | single bad flight plan shouldn't jeopardize things by making
           | data on all the other flight plans unavailable.
        
           | jjk166 wrote:
           | > Saying this should have been handled as a known error is
           | totally reasonable but that's broadly the same as saying they
           | should have just written bug free code.
           | 
           | I think there's a world of difference between writing bug
           | free code, and writing code such that a bug in one system
           | doesn't propagate to others. Obviously it's unreasonable to
           | foresee every possible issue with a flight plan and handle
           | each, but it's much more reasonable to foresee that there
           | might be some issue with some flight plan at some point, and
           | structure the code such that it doesn't assume an error-free
           | flight plan, and the damage is contained. You can't make
           | systems completely immune to failure, but you can make it so
           | an arbitrarily large number of things have to all go wrong at
           | the same time to get a catastrophic failure.
        
             | ChoHag wrote:
             | [dead]
        
           | krisoft wrote:
           | > Even if they had parsed it into some structure this would
           | be the equivalent of a KeyError popping out of nowhere
           | because the code assumed an optional key existed.
           | 
           | How many KeyError exceptions have brought down your whole
           | server? It doesn't happen because whoever coded your web
           | framework knows better and added a big try-catch around the
           | code which handles individual requests. That way you get a
           | 500 error on the specific request instead of a complete
           | shutdown every time a developer made a mistake.
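           | 
           | Roughly what that per-request boundary looks like, as a toy
           | dispatcher rather than any real framework's code.
           | `handle_request` and `buggy_handler` are names invented for
           | this sketch:

```python
import logging
from typing import Callable, Tuple

logger = logging.getLogger("sketch")

Handler = Callable[[dict], str]


def handle_request(handler: Handler, request: dict) -> Tuple[int, str]:
    """Run one handler; an unexpected exception becomes a 500 for that request only."""
    try:
        return 200, handler(request)
    except Exception:
        # Log the traceback for a human, but keep serving other requests.
        logger.exception("unhandled error in request %r", request)
        return 500, "internal server error"


def buggy_handler(request: dict) -> str:
    # Deliberate bug: assumes an optional key is always present.
    return "hello " + request["name"]


# The middle request triggers the KeyError; the ones around it still succeed.
responses = [handle_request(buggy_handler, r) for r in ({"name": "a"}, {}, {"name": "b"})]
```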
        
             | david422 wrote:
             | > big try-catch around the code which handles individual
             | requests.
             | 
             | I mean, that's assuming the code isolating requests is also
             | bug free. You just don't know.
        
             | numpad0 wrote:
             | Crash is a feature, though. It's not like exceptions raises
             | by itself into interpreter specifications. It's just that
             | it so happens that Web apps ain't need no airbags that slow
             | down businesses.
        
               | marcosdumay wrote:
               | On a multi-user system, only partial crashes are
               | features. Total crashes are bugs.
               | 
               | A web server is a multi-user system, just like a
               | country's air traffic control.
        
               | acdha wrote:
               | That line of reasoning is how you have systemic failures
               | like this (or the Ariane 5 debacle). It only makes sense
               | in the most dire of situations, like shutting down a
               | reactor, not input validation. At most this failure
               | should have grounded just the one affected flight rather
               | than the entire transportation network.
        
               | Spivak wrote:
               | I love that phrasing, I'm gonna use that from now on when
               | talking about low-stakes vs high-stakes systems.
        
           | madeofpalk wrote:
           | That's like saying that because one browser tab tried to
           | parse some invalid JSON then my whole browser should crash.
        
             | adrianmonk wrote:
             | You don't know that the JSON is invalid. Maybe the JSON is
             | perfect and your parser is broken.
        
             | Spivak wrote:
             | Well yes because you're describing a system where there are
             | really low stakes and crash recovery is always possible
             | because you can just throw away all your local state.
             | 
             | The flip side would be like a database failing to parse
             | some part of its WAL log due to disk corruption and just
             | said, "eh just delete those sections and move on."
        
               | madeofpalk wrote:
               | Crash the tab and allow all the others to carry on!
               | 
               | The problem here is that one individual document failed
               | to parse.
        
             | zimpenfish wrote:
             | No, it's more like saying your browser has detected
             | possible internal corruption with, say, its history or
             | cookies database and should stop writing to it immediately.
             | Which probably means it has to stop working.
        
               | ludwik wrote:
               | It definitely isn't. It was just a validation error in
               | one of thousands external data files that the system
               | processes. Something very routine for almost any software
               | dealing with data.
        
           | kccqzy wrote:
           | I agree with your first paragraph but your second paragraph
           | is quite defeatist. I was involved in quite a few
           | "premortem" meetings where people think of increasingly
           | improbable failure modes and devise strategies for them. It's
           | a useful meeting before large changes to critical systems
           | are made live. In my opinion, this should totally be a known
           | error.
           | 
           | > Having found an entry and exit point, with the latter being
           | the duplicate and therefore geographically incorrect, the
           | software could not extract a valid UK portion of flight plan
           | between these two points.
           | 
           | It doesn't take much imagination to surmise that perhaps real
           | world data is broken and sometimes you are handed data that
           | doesn't have a valid UK portion of flight plan. Bugs can
           | happen, yes, such as in this case where a valid flight plan
           | was misinterpreted to be invalid, but gracefully dealing with
           | the invalid plan should be a requirement.
        
           | piva00 wrote:
           | > Because they hit "unknown error" and when that happens on
           | safety critical systems you have to assume that all your
           | system's invariants are compromised and you're in undefined
           | behavior -- so all you can do is stop.
           | 
           | What surprised me more is that the amount of data existing
           | for all waypoints on the globe is quite small, if I were to
           | implement a feature that queries them by name as an
           | identifier, the first thing I'd do is check for duplicates
           | in the dataset. Because if there are, I need to consider that
           | condition in every place where I'd be querying a waypoint by
           | a potential duplicate identifier.
           | 
           | I had that thought immediately when looking at flight plan
           | format, noticed the short strings referring to waypoints, way
           | before getting to the section where they point out the name
           | collision issue.
           | 
           | Maybe I'm too used to working with absurd amounts of data (at
           | least in comparison to this dataset), it's a constant part of
           | my job to do some cursory data analysis to understand the
           | parameters of the data I'm working with, what values can be
           | duplicated or malformed, etc.
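           | 
           | A minimal sketch of that duplicate check, assuming the
           | waypoint data can be loaded as (name, lat, lon) tuples. The
           | names and coordinates below are made up for illustration:

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Waypoint = Tuple[str, float, float]  # (name, latitude, longitude)


def duplicate_names(waypoints: Iterable[Waypoint]) -> Dict[str, int]:
    """Return each name that appears more than once, with its count."""
    counts = Counter(name for name, _lat, _lon in waypoints)
    return {name: n for name, n in counts.items() if n > 1}


# Toy dataset with invented entries, not real navigation data.
SAMPLE = [
    ("ALFA", 50.0, -100.0),
    ("DUPE", 48.0, -99.0),
    ("DUPE", 49.0, 0.0),
    ("ZULU", 51.0, -1.0),
]

print(duplicate_names(SAMPLE))
```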
        
             | SoftTalker wrote:
             | If there are duplicate waypoint IDs, they are not close
             | together. They can be easily eliminated by selecting the
             | one that is one hop away from the prior waypoint. Just
             | traversing the graph of waypoints in order would filter out
             | any unreachable duplicates.
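             | 
             | A hedged sketch of the idea. It uses "nearest to the
             | previously resolved point" by great-circle distance as a
             | stand-in for "one hop away"; a real system would walk the
             | actual airway graph. `resolve_route` and the toy database
             | are invented for illustration:

```python
import math
from typing import Dict, List, Tuple

Coord = Tuple[float, float]  # (latitude, longitude) in degrees


def great_circle_km(a: Coord, b: Coord) -> float:
    """Haversine distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def resolve_route(names: List[str], db: Dict[str, List[Coord]], start: Coord) -> List[Coord]:
    """Resolve each name to the candidate closest to the previously resolved point."""
    resolved: List[Coord] = []
    prev = start
    for name in names:
        candidates = db[name]  # raises KeyError for an unknown name
        prev = min(candidates, key=lambda c: great_circle_km(prev, c))
        resolved.append(prev)
    return resolved


# Invented names and coordinates; "DUPE" exists on two continents.
TOY_DB: Dict[str, List[Coord]] = {
    "ALFA": [(50.0, -100.0)],
    "DUPE": [(48.0, -99.0), (49.0, 0.0)],
    "ZULU": [(51.0, -1.0)],
}

route = resolve_route(["ALFA", "DUPE", "ZULU"], TOY_DB, start=(50.0, -100.0))
```

             | This only works while same-named points stay far apart,
             | which is the assumption stated above.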
        
         | adrianmonk wrote:
         | Because the code classified it as a "this should never happen!"
         | error, and then it happened. The code didn't classify it as a
         | "flight plan has bad data" error or a "flight plan data is OK
         | but we don't support it yet" error.
         | 
         | If a "this should never happen!" error occurs, then you don't
         | know what's wrong with the system or how bad or far-reaching
         | the effects are. Maybe it's like what happened here and you
         | could have continued. Or maybe you're getting the error because
         | the software has a catastrophic new bug that will silently
         | corrupt all the other flight plans and get people killed. You
         | don't know whether it is or isn't safe to continue, so you
         | stop.
        
           | hn_throwaway_99 wrote:
           | I agree with the general sentiment "if you see an unexpected
           | error, STOP", but I don't really think that applies here.
           | 
           | That is, when processing a sequential queue which is what
           | this job does, it seems to me reading the article that each
           | job in the queue is essentially totally independent. In that
           | case, the code most definitely _should_ isolate "unexpected
           | error in job" from a larger "something unknown happened
           | processing the higher level queue".
           | 
           | I've actually seen this bug in different contexts before, and
           | the lessons should always be: One bad job shouldn't crash the
           | whole system. Error handling boundaries should be such that a
           | bad job should be taken out of the queue and handled
           | separately. If you don't do this (which really just entails
           | being thoughtful when processing jobs about the types of
           | errors that are specific to an individual job), I guarantee
           | you'll have a bad time, just like these maintainers did.
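           | 
           | One way to draw that boundary, as a hedged sketch: process
           | each job inside its own try/except and move failures to a
           | quarantine list for a human to review. `process_plan` and
           | the plan strings are hypothetical; nothing here reflects
           | the real NATS code, and whether it is safe to carry on is a
           | separate question.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple


def drain(queue: Deque[str], process: Callable[[str], str]) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Process every job; a failing job is quarantined instead of stopping the loop."""
    done: List[str] = []
    quarantined: List[Tuple[str, str]] = []
    while queue:
        job = queue.popleft()
        try:
            done.append(process(job))
        except Exception as exc:
            # Job-level boundary: record the job and the error for manual review.
            quarantined.append((job, repr(exc)))
    return done, quarantined


def process_plan(plan: str) -> str:
    # Hypothetical processor that cannot handle one particular plan.
    if plan == "PLAN-2":
        raise ValueError("could not extract UK portion")
    return plan.lower()


done, quarantined = drain(deque(["PLAN-1", "PLAN-2", "PLAN-3"]), process_plan)
```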
        
             | crabbone wrote:
             | > is essentially totally independent
             | 
             | They physically cannot be independent. The system works on
             | an assumption that the flight was accepted and is valid,
             | but it cannot place it. What if it accidentally schedules
             | another flight in the same time and place?
        
             | Thorentis wrote:
             | Except that you can't be sure this bad flight plan doesn't
             | contain information that will lead to a collision. The
             | system needs to maintain the integrity of _all_ plans it
               | sees. If it can't process one, and there's the risk of a
             | plane entering airspace with a bad flight plan, you need to
             | stop operations.
        
               | lozenge wrote:
               | But they have 4 hours to reach out to the one plane whose
               | flight plan didn't get processed and tell them to land
               | somewhere else.
        
               | ivraatiems wrote:
               | Assuming they can identify that plane.
               | 
               | Aviation is incredibly risk-averse, which is part of why
               | it's one of the safest modes of travel that exists. I
               | can't imagine any aviation administration in a developed
               | country being OK with a "yeah just keep going" approach
               | in this situation.
        
           | raverbashing wrote:
           | And that's why I never (or very rarely) put "this should
           | never happen" exceptions anymore in my code
           | 
           | Because you eventually figure out that, yes, it does happen
        
             | PeterStuer wrote:
             | So what does your code do when you did not handle the "this
             | should never happen" exception? Exit and print out a
             | stacktrace to stdout?
        
             | pmontra wrote:
             | A customer of mine is adamant in their resolve to log
             | errors, retry a few times, give up and go on with the next
             | item to process.
             | 
             | That would have grounded only the plane with the flight
             | plan that the UK system could not process.
             | 
             | Still a bug, but with less effect on the whole continent,
             | because planes that could not get inside or outside the UK
             | could not fly and that affected all of Europe and possibly
             | more.
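             | 
             | That policy is short to write down. A sketch, with
             | `process_all` and the attempt count as my own assumptions
             | rather than the customer's actual code:

```python
import logging
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
log = logging.getLogger("sketch")


def process_all(items: Iterable[T], process: Callable[[T], None], attempts: int = 3) -> List[T]:
    """Try each item up to `attempts` times; return the items that were given up on."""
    given_up: List[T] = []
    for item in items:
        for attempt in range(1, attempts + 1):
            try:
                process(item)
                break  # success: move on to the next item
            except Exception:
                log.exception("item %r failed (attempt %d/%d)", item, attempt, attempts)
        else:
            # The inner loop never hit `break`, so every attempt failed.
            given_up.append(item)
    return given_up
```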
        
               | crabbone wrote:
               | > That would have grounded only the plane with the flight
               | plan that the UK system could not process.
               | 
               | By the looks of it, it was a few hours in the air by the
               | time the system had a breakdown. Considering it didn't
               | know what the problem was, it seems appropriate that it
               | shut down. No planes collided, so the worst didn't
               | happen.
        
             | airstrike wrote:
             | This here is the true takeaway. The bar for writing "this
             | should never happen" code must be set so impossibly high
             | that it might as well be translated into "'this should
             | never happen' should never happen"
        
               | andrewaylett wrote:
               | The problem with _that_ is that most programming
               | languages aren't sufficiently expressive to be able to
               | recognise that, say, only a subset of switch cases are
               | actually valid, the others having been already ruled out.
               | It's sometimes possible to re-architect to avoid many of
               | this kind of issue, but not always.
               | 
               | What you're often led to is "if this happens, there's a
               | bug in the code elsewhere" code. It's really hard to know
               | what to do in that situation, other than terminate
               | whatever unit of work you were trying to complete: the
               | only thing you know for sure is that the software doesn't
               | accurately model reality.
               | 
               | In this story, there obviously _was_ a bug in the code.
               | And the broken algorithm shouldn't have passed review.
               | But even so, the _safety critical_ aspect of the
               | _complete_ system wasn't compromised, and _that_ part
               | worked as specified -- I suspect the system behaviour
               | under error conditions was mandated, and I dread to think
               | what might have happened if the developers (the company,
               | not individuals) were allowed to _actually_ assume errors
               | wouldn't happen and let the system continue unchecked.
        
           | samus wrote:
           | That reasoning is fine, but it rather seems that the
           | programmers triggered this catastrophic "stop the world"
           | error because they were not thorough enough considering all
           | scenarios. As TA expounds, it seems that neither formal
           | methods nor fuzzing were used, which would have gone a long
           | way toward flushing out such errors.
        
             | JumpCrisscross wrote:
             | > _it rather seems that the programmers triggered this
             | catastrophic "stop the world" error because they were not
             | thorough enough considering all scenarios_
             | 
             | Yes. But also, it's an ATC system. Its primary purpose "is
             | to prevent collisions..." [1].
             | 
             | If the system encounters a "this should never happen!"
             | error, the correct move _is_ to shut it down and ground air
             | traffic. (The error shouldn't have happened in the first
             | place. But the shutdown should have been more graceful.)
             | 
             | [1] https://en.wikipedia.org/wiki/Air_traffic_control
        
             | crabbone wrote:
             | Neither formal methods nor fuzzing would've helped if the
             | programmer didn't know that input can repeat. Maybe they
             | just didn't read the paragraph in whatever document
             | describes how this should work and didn't know about it.
             | 
             | I didn't have to implement flight control software, but I
             | had to write some stuff described by MIFID. It's a job from
             | hell, if you take it seriously. It's a series of normative
             | documents that explains how banks have to interact with
             | each other which were published quicker than they could've
             | been implemented (and therefore the date they had to take
             | effect was rescheduled several times).
             | 
             | These documents aren't structured to answer every question
             | a programmer might have. Sometimes the "interesting"
             | information is close together. Sometimes you need to guess
             | the keyword you need to search for to discover all the
             | "interesting" parts... and it could be thousands of pages
             | long.
        
               | sublimefire wrote:
               | I've only heard from people engineering systems for the
               | aerospace industry, and we're talking hundreds of pages
               | of API documentation. It is very complex so equally the
               | chances of a human error are higher.
        
       | thrdbndndn wrote:
       | > Flight Plan Reception Suite Automated (FPRSA-R)
       | 
       | Where does the "-R" come from?
        
         | closewith wrote:
         | Replacement.
        
           | sdfghswe wrote:
           | Lol that's like me naming my filenames _final2_realfinal
           | before I learned about git.
        
       | cjbprime wrote:
       | Great post. This part goes too far, I think:
       | 
       | > Human lives were kept safe at all times
       | 
       | > The consequence of all this was not that any human lives were
       | put in danger, ..
       | 
       | When you're arguing that cancelling 2000 flights cost £100M
       | _and_ that no human danger was incurred, something should feel
       | off. That might be around 600k humans who weren't able to be
       | where they felt they needed to be. Did they have somewhere safe
       | to sleep? Did they have all the medications they needed with
       | them? Did they have to miss a scheduled surgery? Could we try to
       | measure the effect on their well-being in aggregate, using a
       | metric other than the binary state of alive or facing imminent
       | death? You get the idea.
       | 
       | Of course I agree with the version of the claim that says that no
       | direct danger was caused from the point of view of the failing-
       | safe system. But when you're designing a system, it ought to be
       | part of your role to wonder where risk is going as you more
       | stringently displace it from the singular system and source of
       | risk that you maintain.
        
         | [deleted]
        
         | kodt wrote:
         | I mean it could have also saved lives by that logic. Did
         | someone missing their flight mean they also missed a terrible
         | pileup on the roadways after landing? We can imagine pretty
         | much any scenario here.
        
       | gabereiser wrote:
       | So they forgot to "geographically disparate" fence their queries.
       | Having built a flight navigation system before, I know this bug.
       | I've seen this bug. I've followed the spec to include a geofence
       | to avoid this bug.
        
         | sam0x17 wrote:
         | Why on earth do they not have GUIDs for these navigation points
         | if the names are not globally unique and inter-region routes
         | are commonplace?
        
           | f1shy wrote:
           | The names have to be entered manually by pilots, if e.g. they
           | change the route. They have to be transmitted over the air by
           | humans. So they must be short and simple.
        
             | zarzavat wrote:
             | Yes but shouldn't one step of the code be to translate
             | these non-unique human-readable identifiers into completely
             | unique machine-readable identifiers?
        
               | avianlyric wrote:
               | How exactly would you do that? It's impossible to map
               | from a dataset of non-unique identifiers to unique
               | identifiers without additional data and heuristics. The
               | mapping is ambiguous by definition.
               | 
               | The underlying flight plan standards were all created in
               | an era of low memory machines, and when humans were
               | expected to directly interpret data exactly as the
               | programs represented it internally (because serialisation
               | and deserialisation is expensive when you need every CPU
               | cycle just to run your core algorithms)
        
             | blitzar wrote:
             | Clippy: It looks like you are trying to enter a non unique
             | navigation point, did you mean the one in France or the one
             | in Australia?
        
           | paulddraper wrote:
           | Aviation protocols are extremely backwards compatible and
           | low-tech compatible.
           | 
           | You need to be able to read, write, hear, and speak the
           | identifier. (And receive/transmit in morse code)
           | 
           | Would it be okay to have an "area code prefix" in the
           | identifier? Plausible (but practically speaking too late for
           | that)
        
           | tortue0 wrote:
           | They do and use lat/lon in some cases. Reviewing and
           | inputting that (when being done manually) is another story -
           | but it's technically possible.
        
           | amoerie wrote:
           | Long story: because changing identifiers is a considerable
           | refactoring, and it takes coordination with multiple
           | worldwide distributed partners to transition safely from the
           | old to the new system, all to avoid a hypothetical issue some
           | software engineer came up with
           | 
           | Short story: money. It costs money to do things well.
        
             | ftxbro wrote:
             | > Long story: because changing identifiers is a
             | considerable refactoring
             | 
             | is this what refactoring means
        
               | NBJack wrote:
               | Yes. It would cascade into:
               | 
               | Changes in how ATCs operate
               | 
               | Changes in how pilots operate
               | 
               | Changes in how airplanes receive these instructions
               | (including the flight software itself, safety systems,
               | etc.)
               | 
               | Changes in how airplanes are tested
               | 
               | Changes in how pilots are trained
               | 
               | Etc. In this case, the refactoring requires changes to
               | hardware, software, training, manufacturing, and humans.
        
               | ftxbro wrote:
               | does refactoring mean literally any non-local change even
               | just like changing a variable name, or does it usually
               | mean some kind of structural or architectural non-local
               | change
        
               | deadfish wrote:
               | Pretty sure that is still not the meaning of refactoring.
               | As I understand it refactoring should mean no changes to
               | the external interface but changes to how it is
               | implemented internally.
        
           | epanchin wrote:
           | What three words would be a better solution than a guid, as
           | transmittable over radio.
        
             | dharmab wrote:
             | W3W contains homonyms and words that are easily confused by
             | non-native English speakers. Often within just a few km.
             | The latter is why ATC uses "niner", to avoid confusing
             | "nine" and "nein".
             | 
             | Talk to someone deep in the GIS rabbit hole and you'll get
             | a rant about how bad W3W is:
             | https://cybergibbons.com/security-2/why-what3words-is-not-
             | su...
        
             | chrisweekly wrote:
             | That's "What3Words" --
             | https://en.m.wikipedia.org/wiki/What3words -- a system for
             | representing geographic location using globally-unique word
             | triads.
        
             | bsder wrote:
             | WTW is a proprietary system that should never be used:
             | 
             | https://www.walklakes.co.uk/opus64534.html
             | 
             | The biggest fault (besides being proprietary) is that you
             | must be _online_ in order to use WTW. The times that you
             | might need WTW are _ALSO_ the times you are most likely to
             | be unable to be online.
        
           | Topgamer7 wrote:
           | I would guess because humans have to read this and ascertain
           | meaning from it. Not everyone is a technical resource.
        
           | nwallin wrote:
           | 1. Pilots occasionally have to fat finger them into
           | ruggedized I/O devices and read them off to ATC over radios.
           | 
           | 2. These are defined by the various regional aviation
           | authorities. The US FAA will define one list, (and they'll be
           | unique in the US) the EU will have one, (EASA?) etc.
           | 
           | The AA965 crash (1995-12-20) was due to an aliased waypoint
           | name. Colombia had two waypoints with the same name within
           | 150 nautical miles of each other. (the name was 'R') This was
           | in violation of ICAO regulations from like the '70s.
           | 
           | https://en.wikipedia.org/wiki/American_Airlines_Flight_965
        
           | gabereiser wrote:
           | FAA regulations state that fixes, navs, and waypoints must be
           | phonetically transmittable over radio.
           | 
           | I.E. Yankee = YANKY. The pilot and ATC must be location
           | aware. Apparently their software is not.
        
           | gavinsyancey wrote:
           | It sounds like for actual processing they replace them with
           | GPS coordinates (or at least augment them with such). But
           | this is the system that is responsible for actually doing
           | that...
        
         | tppiotrowski wrote:
         | ICAO standard effective from 1978 to only duplicate identifiers
         | if more than 600 nmi (690 mi; 1,100 km) apart
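The 600 nmi separation rule quoted above can be checked mechanically. Below is a minimal sketch, assuming waypoints arrive as `(name, (lat, lon))` pairs in degrees; the function names and the input shape are hypothetical and not taken from any ICAO or NATS tooling.

```python
import math

EARTH_RADIUS_NMI = 3440.065  # mean Earth radius expressed in nautical miles


def great_circle_nmi(a, b):
    """Haversine distance between two (lat, lon) points in degrees, in nmi."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_NMI * math.asin(math.sqrt(h))


def too_close_duplicates(waypoints, min_separation_nmi=600.0):
    """Return (name, pos_a, pos_b) for same-named waypoints under the threshold."""
    by_name = {}
    for name, pos in waypoints:
        by_name.setdefault(name, []).append(pos)
    violations = []
    for name, positions in by_name.items():
        # Compare every pair of positions that share this name.
        for i in range(len(positions)):
            for j in range(i + 1, len(positions)):
                if great_circle_nmi(positions[i], positions[j]) < min_separation_nmi:
                    violations.append((name, positions[i], positions[j]))
    return violations
```

Feeding it made-up data such as two waypoints named `AAAAA` five degrees of longitude apart on the equator (roughly 300 nmi) would flag that pair, while a pair twenty degrees apart would pass.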
        
         | [deleted]
        
       | c7DJTLrn wrote:
       | This is an interesting engineering problem and I'm not sure what
       | the best approach is. Fail safe and stop the world, or keep
       | running and risk danger? I imagine critical systems like
       | trading/aerospace have this worked out to some degree.
        
         | crabbone wrote:
         | There isn't and cannot be a preference to either one. It always
         | depends on what the system is doing and what the consequences
         | would be... Pacemaker cannot "fail safe" for example, under no
         | circumstances. It's meaningless to consider such cases. But if
         | escalation to a human operator is possible, then it will also
         | depend on how the system is meant to be used. In some cases
         | it's absolutely necessary that the system doesn't try to handle
         | errors (eg. if say a patient is in a CT machine -- you always
         | want to stop to, at least, prevent more radiation), but in the
         | situation like the one with the flight control -- my guess is
         | that you want the system to keep trying _while alerting the
         | human operator_.
         | 
         | But then it can also depend on what's in the contract and who
         | will get the blame for the system functioning incorrectly. My
         | guess here is that failing w/o attempting to recover was, while
         | an overkill, a safer strategy than to let eg. two airplanes be
         | scheduled for the same path (and potentially collide).
        
         | fbdab103 wrote:
         | Absolutely no idea on what is correct, but I love to reference
         | this article on software practices at NASA[0], They Write the
         | Right Stuff.
         | 
         | [0] https://www.fastcompany.com/28121/they-write-right-stuff
        
       | lbriner wrote:
       | I seem to remember another problem at NATS which had the same
       | effect. Primary fell over so they switched over to a secondary
       | that fell over for the exact same reason.
       | 
       | It seems like you should only failover if you know the problem is
       | with the primary and not with the software itself. Failing over
       | "just because" just reinforces the idea that they didn't have
       | enough information exposed to really know what to do.
       | 
       | The bit that makes me feel a bit sick though is that they didn't
       | have a method called "ValidateFlightPlan" that throws an error if
       | for any reason it couldn't be parsed and that error could be
       | handled in a really simple way. What programmer would look at a
       | processor of external input and not think, "what do we do with
       | bad input that makes it fall over?". I did something today for a
       | simple message prompt since I can't guarantee that in all
       | scenarios the data I need will be present/correct. Try/catch and
       | a simple message to the user "Data could not be processed".
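The "validate, and handle the error in a simple way" idea above might look roughly like this. It is a sketch only: the route format (space-separated tokens of 2 to 5 upper-case letters) is a toy assumption rather than the real ICAO flight plan syntax, and `validate_flight_plan`, `FlightPlanError` and `handle` are hypothetical names.

```python
class FlightPlanError(ValueError):
    """Raised when a flight plan cannot be parsed or validated."""


def validate_flight_plan(raw):
    """Parse a toy space-separated route; raise FlightPlanError on any problem."""
    if not isinstance(raw, str) or not raw.strip():
        raise FlightPlanError("empty flight plan")
    waypoints = raw.split()
    for wp in waypoints:
        # Toy rule (an assumption): tokens are 2-5 upper-case letters.
        if not (wp.isalpha() and wp.isupper() and 2 <= len(wp) <= 5):
            raise FlightPlanError(f"unrecognised waypoint token: {wp!r}")
    if len(waypoints) < 2:
        raise FlightPlanError("a route needs at least two waypoints")
    return waypoints


def handle(raw):
    """Accept or reject one plan without ever letting the error escape."""
    try:
        return ("accepted", validate_flight_plan(raw))
    except FlightPlanError as exc:
        return ("rejected", str(exc))
```

The point of the dedicated exception type is that the caller can tell "this input was refused" apart from "the program itself is broken".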
        
         | 1970-01-01 wrote:
         | Yep. In electrical terms, you replaced the fuse to watch it
         | blow again. There are no more fuses in your shop. Progress?
        
         | d1sxeyes wrote:
         | Well, if the primary is known not to be in a good state, you
         | might as well fail over and hope that the issue was a fried
         | disk or a cosmic bit flip or something.
         | 
         | The real safety feature is the 4 hour lead time before manual
         | processing becomes necessary.
         | 
         | One of the key safety controls in aviation is "if this breaks
         | for any reason, what do we do", not so much "how do we stop
         | this breaking in the first place".
        
           | samus wrote:
           | It was in a bad state, but in a very inane way: a flight plan
           | in its processing queue was faulty. The system itself was
           | mostly fine. It was just not well-written enough to
           | distinguish an input error from an internal error, and thus
           | didn't just skip the faulty flight plan.
        
             | Twirrim wrote:
             | at the risk of nitpicking: "a flight plan in its processing
             | queue was faulty" isn't true, the flight plan was fine. It
             | couldn't process it.
             | 
             | I mention this only because the Daily Mail headline pissed
             | me off with its usual bullshit foreigner fear mongering
             | crap.
        
               | samus wrote:
               | Indeed, that intention is quite transparent in this case.
               | Anyways, I suspect that invalid input exists that would
               | have made the system react in a similar way
        
           | zaphar wrote:
           | I'm no aviation safety controls expert but it seems to me
           | that there are two types of controls that should be in place:
           | 
           | 1. Process controls: What do we do when this breaks for any
           | reason.
           | 
           | 2. Engineering controls: What can we do to keep this from
           | breaking in the first place?
           | 
           | Both of them seem to be somewhat essential for a truly safe
           | system.
        
             | jeffrallen wrote:
             | One or more of three results can come from the engineering
             | exercise of trying to keep something from breaking in the
             | first place:
             | 
             | 1. You could know the solution, but it would be too heavy.
             | 
             | 2. You could know the solution, but it would include more
             | parts, each of which would need the same process on it, and
             | the process might fail the same way
             | 
             | 3. You miss something and it fails anyway, so your "what if
             | this fails" path better be well rehearsed and executed.
             | 
             | Real engineering is facing the tradeoffs head on, not hand
             | waving them away.
        
             | mixdup wrote:
             | It's very hard to ensure you capture every single possible
             | failure mode. Yes, the engineering control is important but
             | it's not the most critical. What to do if it does fail (for
             | any reason) is the truly critical control, because it
             | solves for the possibility of not knowing every possible
             | way something might fail and therefore missing some way to
             | prevent a failure
        
         | sheepshear wrote:
         | Failing over is correct because there's no way to discern that
         | the hardware is _not_ at fault. They should have designed a
         | better response to the second failure to avoid the knock-on
         | effects.
        
       | anupj wrote:
       | Great writeup
        
       | johnklos wrote:
       | So the "engineering teams" couldn't tail /var/log/FPRSA-R.log and
       | see the cause of the halt?
       | 
       | I've had servers and software that I had never, ever used before
       | stop working, and it took a lot less than four hours to figure
       | out what went wrong. I've even dealt with situations where bad
       | data caused a primary and secondary to both stop working, and
       | I've had to learn how to back out that data and restart things.
       | 
       | Sure, hindsight is easy, but when you have two different systems
       | halt while processing the same data, the list of possible causes
       | shrinks tremendously.
       | 
       | The lack of competence in the "engineering teams" tells us lots
       | about how horribly these supposedly critical systems are managed.
        
         | slingnow wrote:
         | Damn, if only you had been there to instantly save the day by
         | just running that simple command!
        
           | johnklos wrote:
           | No. That's silly. The logs would've / should've just shown
           | that the program halted because it was confused about data.
           | The actual commands to fix would've been quite different.
        
         | seabass-labrax wrote:
         | You're assuming that there is in fact a /var/log/FPRSA-R.log to
         | tail - it would not at all surprise me if a system this old is
         | still writing its logs to a 5.25 inch floppy in Prestwick or
         | Swanwick^1.
         | 
         | ^1: they closed the West Drayton centre about twenty years ago;
         | I don't imagine they moved their old IBM 9020D too, if they
         | still had it by then. My comment is nonetheless only slightly
         | exaggerated ;)
        
       | codeulike wrote:
       | This is a great post. My reading of it:
       | 
       | - waypoint names used around the world are not unique
       | 
       | - as a sort of kludge, "In order to avoid confusion latest
       | standards state that such identical designators should be
       | geographically widely spaced."
       | 
       | - but still you might get the same waypoint name used twice in a
       | route to mean different places
       | 
       | - the software was not written with that possibility in mind
       | 
       | - route did not compute
       | 
       | - threw 'critical exception' and entered 'maintenance mode' -
       | i.e. crashed
       | 
       | - backup system took over, hit the same bug with the same bit of
       | data, also crashed
       | 
       | - support people have a crap time
       | 
       | - it wasn't until they called the software supplier that they
       | found the low level logs that revealed the cause of the problem
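One way to cope with the duplicated-name case summarised above is to resolve each name to the candidate nearest the previously resolved point. This is a hypothetical sketch and not the algorithm NATS or its supplier uses; it works on flat toy `(x, y)` coordinates, and `resolve_route`, `RouteResolutionError` and the database shape are invented for illustration.

```python
import math


class RouteResolutionError(Exception):
    """Raised when a waypoint name has no known position."""


def resolve_route(names, database, start):
    """Pick, for each name, the candidate position nearest the previous point.

    `database` maps a waypoint name to a list of candidate (x, y) positions,
    so the same name may legitimately resolve to different places.
    """
    resolved = []
    previous = start
    for name in names:
        candidates = database.get(name)
        if not candidates:
            raise RouteResolutionError(f"unknown waypoint {name!r}")
        best = min(candidates, key=lambda pos: math.dist(previous, pos))
        resolved.append((name, best))
        previous = best
    return resolved
```

With a made-up database where `DUP` exists at both `(1, 0)` and `(50, 0)`, the route `DUP MID FAR DUP` can resolve the first `DUP` to the near one and the last `DUP` to the far one, instead of treating the name as a single place.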
        
         | noman-land wrote:
         | My jaw kept dropping with each new bullet point.
        
           | xvector wrote:
           | Same, is aviation technology really this primitive?
        
             | H8crilA wrote:
             | It is mostly quite primitive, but it also works amazingly
             | well. For example ILS or VOR or ATC audio comms can all be
             | received and read correctly using hardware built from entry
             | level ham radio knowledge. Altimeters still require a
             | manual input of pressure. Fuel levels can be checked with
             | sticks.
             | 
             | Kinda the opposite of a modern web/mobile app, complicated,
             | massively bloated and breaks rather often :).
        
             | rozap wrote:
             | shhh, nobody tell xvector that unleaded avgas finally
             | happened in 2022 :)
        
         | [deleted]
        
         | teleforce wrote:
         | Thanks for the summary and TL;DR.
         | 
         | Essentially this is down to the lack of a proper namespace;
         | who'd have thought aerospace engineers need to study operating
         | systems! I've a friend who's a retired air force pilot and
         | graduated from Cranfield University, the UK's foremost
         | postgraduate institution for aerospace engineering, with its
         | own airport for teaching and research [1]. According to him he
         | did study OS at Cranfield, and now I finally understand why.
         | 
         | Apparently, based on the other comments, a standard for
         | namespacing is already available but is currently not being
         | used by NATS/ATC; hopefully they've learnt their lesson and
         | will start using it, for goodness' sake. The top comment
         | mentioned the geofencing bug, but if NATS/ATC were using a
         | proper namespace, geofencing would probably not be necessary
         | in the first place.
         | 
         | [1] Cranfield University:
         | 
         | https://en.wikipedia.org/wiki/Cranfield_University
        
           | seabass-labrax wrote:
           | It sounds like a great place to study that has its own ~2km
           | long airstrip! It would be nice if they had a spare Trident
           | or Hercules just lying around for student baggage transport
           | :)
        
         | dboreham wrote:
         | "software supplier"??? Why on God's green earth isn't someone
         | familiar with the code on 7/24 pager duty for a system with
         | this level of mission criticality?
        
           | sublimefire wrote:
           | I think there is a bit of ignorance about how software is
           | sold in some cases. This is not just some windows or browser
           | application that was sold but it also contained the staff
           | training with a help to procure hardware to run that software
           | and maybe even more. Such systems get closed off from the
           | outside without a way to send telemetry to the public
           | internet (I've seen this before, it is bizarre and hard to
           | deal with). The contract would have some clauses that deal
           | with such situations where you will always have someone on
           | call as the last line of defense if a critical issue happens.
           | Otherwise, the trained teams should have been able to deal
           | with it but could not.
        
           | seabass-labrax wrote:
           | That would be... the software supplier. This is quite a
           | specific fault (albeit one that shouldn't have happened if
           | better programming practices had been used), so I don't think
           | anyone but the software's original developers would know what
           | to do. This system is not safety-critical, luckily.
        
       | sp0ck wrote:
       | Small suggestion. Don't choose an obscure language (in terms of
       | popularity, 28th on TIOBE index with 0.65% rating) to visualize
       | structure and algorithms. Otherwise you risk the average viewer
       | will stop reading the moment he encounters code samples. There are
       | 27 more popular languages, some of them orders of magnitude more.
        
         | louthy wrote:
         | Maybe he doesn't care if people stop reading and he'd prefer to
         | use the language he's most comfortable with? It's his blog
         | after all, not yours.
         | 
         | Additionally, perhaps he's making the point that a language
         | with an expressive type system makes solving problems like this
         | trivial.
        
           | sp0ck wrote:
           | If you don't care about readers reading it or not then what
           | is the point of publishing an article?
        
             | louthy wrote:
             | I read it. Probably lots of other people did too.
             | Presumably the people who don't think computer science
             | begins and ends with JavaScript
        
             | recursive wrote:
             | Why does there have to be a point? If there is one, why do
             | you need to understand it?
        
         | rjh29 wrote:
         | The code is a relatively small part of the article, and quite
         | far into it I might add.
        
         | daaaaaaan wrote:
         | I appreciated the Haskell examples, they aren't particularly
         | hard to follow. How do you think those more popular languages
         | _got_ more popular?
        
       | redleader55 wrote:
       | I imagine, for this kind of system, there is only one supplier.
       | Why not force that supplier, as part of their 10-15 yr contract,
       | to publish the source code for everything, not necessarily as
       | FOSS. This way if there are bugs they can be reported and fixed.
        
         | passwordoops wrote:
         | I agree. But this would assume that:
         | 
         | 1- the people writing and approving the specs even understand
         | why this might be a good suggestion
         | 
         | 2- the people ultimately approving the contract aren't in bed
         | with the supplier
        
           | dboreham wrote:
           | There's always prison for those people.
        
           | FeepingCreature wrote:
           | 3- the people operating the system are capable of maintaining
           | its source code
        
       | SillyUsername wrote:
       | Bugs happen. Fact of being written by fleshy meatballs. What
       | should also have been highlighted is that they clearly had no
       | easy way of finding the specific buggy input in the logs nor
       | simulating it without contacting the manufacturer.
        
         | FeepingCreature wrote:
         | No way or no procedure.
        
       | throw74848 wrote:
       | [flagged]
        
       | tantalor wrote:
       | > The programming style is very imperative
       | 
       | Is that supposed to be a meaningful statement?
        
         | tome wrote:
         | Yes, typically it would be used to mean things like the code
         | mutates data in place rather than using persistent data
         | structures, explicitly loops over data rather than using
         | higher-order map, fold etc. operations, and explicitly checks
         | tag bits rather than using sum types.
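To make the imperative versus functional distinction above concrete, here is a small sketch of the same operation written both ways. It is an illustration only, unrelated to the actual flight plan code, and both function names are made up.

```python
def scale_in_place(distances, factor):
    """Imperative style: explicit index loop that mutates the caller's list."""
    for i in range(len(distances)):
        distances[i] = distances[i] * factor
    return distances


def scale_pure(distances, factor):
    """Functional style: higher-order map that builds a new immutable tuple."""
    return tuple(map(lambda d: d * factor, distances))
```

After `scale_in_place(xs, 2)` the original list `xs` has changed, whereas `scale_pure(ys, 2)` leaves `ys` untouched and returns a fresh value.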
        
         | amiga386 wrote:
         | It tells you the author of the blogpost is one of those
         | functional programming proselytizers, you can determine that
         | just by how they sneer. So yes, it is a meaningful statement,
         | but the meaning says more about the author than what they're
         | commenting on.
         | 
         | They similarly reveal their biases when they say "the mistake
         | of the faulty algorithm [...] maintaining pointers into each of
         | them"
         | 
         | Lo and behold, the author chooses Haskell at the end to
         | demonstrate how they'd do it. Such pure, very monoid in the
         | category of endofunctors.
        
           | tantalor wrote:
           | Thanks for making me laugh :D
        
           | tome wrote:
           | > the author of the blogpost is one of those functional
           | programming proselytizers ... Such pure, very monoid in the
           | category of endofunctors
           | 
           | Sorry, who's sneering here?
        
         | [deleted]
        
       | supernova87a wrote:
       | Did the creator of the flight plan software engage in adversarial
       | testing to see if they could break the system with badly formed
       | flight plans? Or was / is the typical practice to mostly just see
       | if the system meets just the "well-behaved" flight plan
       | processing requirements? (with unit tests, etc)
        
         | bombcar wrote:
         | I think we all know the answer to this.
         | 
         | A huge portion of "exploits" in the last 20 years have been
         | "internal business APIs" if you will being exposed to malicious
         | actors.
        
       | jacquesm wrote:
       | Trusted input rarely should be trusted. It's input. You need to
       | validate it as if it is hostile and have a process for dealing
       | with malformed input. Now of course, standing by the sidelines it
       | is easy to criticize and I'm sure whoever worked on this wasn't
       | stupid. But I've seen this error often enough now in practice
       | that I think that it needs to be drilled into programmers' heads
       | more forcefully: stuff is only valid if you have _just_ validated
       | it. If you send it to someone else, if someone you trust sends it
       | to you, if you store in a database and then retrieve it and so on
       | then it is just input all over again and you _probably_ should
       | validate it for being well-formed. If you don't do that then
       | you're a bitflip, migration or an update away from an error that
       | will cause your system to go into an unstable state and the real
       | problem is that you might just propagate the error downstream
       | because you didn't identify it.
       | 
       | Input is hard. Judging what constitutes 'input' in the first
       | place can be harder.
        
         | lgeorget wrote:
         | From what I gathered from the article, the input WAS valid.
         | It's the software that was unable to handle a specific case of
         | valid input.
        
           | jacquesm wrote:
           | That's fine, and is exactly the kind of case that I was
           | thinking of: your software has a different idea of what is
           | valid than an upstream piece of software, so from _your_
           | perspective it is invalid. So you need to pull this message
           | out of the stream, sideline it so it can be looked at by
           | someone qualified enough to make the call of what's the case
           | (because it could well be either way) and processing for all
           | other messages should continue as normal. After all the only
           | reason you can say with confidence that it in fact was valid
           | is because someone looked at it! You can only do that well
           | after the fact.
           | 
           | A message switch [1] that I worked on had to deal with
           | messages sourced from 100s of different parties and while in
           | principle everybody was working from the same spec (CCITT
           | [2]) every day some malformed messages would land in the
           | 'error' queue. Usually the problem was on the side of the
           | sender, but sometimes (fortunately rarely) it wasn't and then
           | the software would be improved to be able to handle that case
           | correctly as well. Given the size of the specs and the many
           | variations on the protocols it wasn't weird at all to see
           | parties get confused. What's surprising is that it happens as
           | rarely as it does.
           | 
           | The big takeaway here should be that even if something
           | happens very rarely it should still not result in a massive
           | cascade, the system should handle this gracefully.
           | 
           | [1] https://www.kvsa.nl/en/
           | 
           | [2] https://en.wikipedia.org/wiki/Group_4_compression
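The "sideline it and keep going" pattern described above can be sketched in a few lines. This is a generic illustration, assuming messages are processed one at a time by some `process` callable; `drain` and the shape of the error queue are hypothetical and not taken from the message switch mentioned.

```python
from collections import deque


def drain(inbox, process):
    """Process every message; park failures in an error queue instead of stopping."""
    done = []
    errors = deque()
    for message in inbox:
        try:
            done.append(process(message))
        except Exception as exc:
            # Deliberately broad: one bad record must not stall the rest.
            # The message is kept alongside the reason for later human review.
            errors.append((message, repr(exc)))
    return done, errors
```

For example, `drain(["1", "x", "3"], int)` would still convert the first and third items while leaving `"x"` in the error queue for someone to look at.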
        
             | dboreham wrote:
             | Exact same experience developing systems that process
             | RFC-822 (and descendants) email messages.
        
         | onetimeuse92304 wrote:
         | This really isn't about input. Whether it comes from outside or
         | is produced inside the application, the reality is that
         | everything can have bugs. A correct input can cause a buggy
         | application to fail. So while verifying input is obviously an
         | important step, it is not even a beginning if you are really
         | looking to build reliable software.
         | 
         | What really is the heart of the matter is that the entire thing
         | was allowed to crash due to a problem with a single
         | transaction.
         | 
         | What you really want to do is to have firewalls. For example,
         | you want a separate module that runs individual transactions
         | and a separate shell that orchestrates everything but has no or
         | very limited contact with the individual transactions. As bad
         | as giving up on processing a single aircraft is, allowing the
         | problem to cascade to the entire system is way worse.
         | 
         | What's even more tragic about this monumental waste of
         | resources is that the knowledge about how to do all of this is
         | readily available. The aerospace and automotive industry have
         | very high development standards along with people you can hire
         | who know those standards and how to use them to write reliable
         | software.
        
           | jacquesm wrote:
           | Yes, there are multiple problems here that interplay in a
           | really bad way and that's one of them. But the input
           | processing/validation step is the first point of contact with
           | that particular flight plan and it should have never
           | progressed beyond that state.
           | 
           | It all hinges on a whole bunch of assumptions and each and
           | every one of those should be dealt with structurally rather
           | than by patching things over.
           | 
           | Just from reading TFA I see a very long list of things that
           | would need attention. Quick recap:
           | 
           | - validate all input
           | 
           | - ensure the system can never stall on any one record
           | 
           | - the system will occasionally come across malformed input
           | which needs a process
           | 
           | - it won't be immediately clear whether the system or the
           | input is at fault, which needs a process
           | 
           | - testing will need to take these scenarios into account
           | 
           | - negative tests will need to be created (such as:
           | purposefully malformed input)
           | 
           | - attempts should be made to force the system into undefined
           | states using malformed _and_ well formed input
           | 
           | - a supervisor mechanism needs to be built into the system
           | that checks overall system health
           | 
           | And probably many more besides. But this is what I gather
           | from the article is what they'll need at a minimum. Typically
           | once you start digging into what it would take to implement
           | any of these you'll run into new things that also need
           | fixing.
           | 
           | As for the last bit of your comment: I'm quite sure that
           | those standards were in play for this particular piece of
           | software, the question is whether or not they were properly
           | applied and even then there are no guarantees against
           | mistakes, they can and do happen. All that those standards
           | manage to do is to reduce their frequency by catching the
           | bulk of them. But some do slip through, and always will.
           | Perfect software never is.
        
             | [deleted]
        
         | [deleted]
        
       | failbuffer wrote:
       | > The manufacturer was able to offer further expertise including
       | analysis of lower-level software logs which led to identification
       | of the likely flight plan that had caused the software exception.
       | 
       | This part stood out to me. I've found it super helpful to include
       | a reference to which piece of data I'm working with in log
       | messages and exceptions. It helps isolate problems so much
       | faster.
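Carrying the identity of the offending record in both the log line and the exception could look something like this. It is a sketch under assumptions: records are plain dicts with `id` and `route` keys, and `RecordProcessingError`, `process_record` and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("flightplans")  # hypothetical logger name


class RecordProcessingError(Exception):
    """Wraps a failure together with the id of the record that caused it."""

    def __init__(self, record_id, reason):
        super().__init__(f"record {record_id}: {reason}")
        self.record_id = record_id
        self.reason = reason


def process_record(record):
    """Split a record's route into tokens, tagging any failure with the record id."""
    record_id = record.get("id", "<unknown>")
    try:
        return record["route"].split()
    except (KeyError, AttributeError) as exc:
        # Log and re-raise with the record id attached, so whoever reads
        # the log does not have to guess which input triggered the failure.
        logger.error("failed to process record %s: %r", record_id, exc)
        raise RecordProcessingError(record_id, repr(exc)) from exc
```

Chaining with `from exc` keeps the original traceback available while the wrapper adds the one detail that is usually missing, namely which input was being handled.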
        
       | throw7 wrote:
       | Well, I certainly hope they've at least stopped issuing waypoints
       | with identical names... although it wouldn't surprise me if
       | geographically-distant is the best we can do as a species.
        
         | SoftTalker wrote:
         | They appear to be sequences of 5 upper-case letters. Assuming
         | the 26-character alphabet, that should allow for nearly 12
         | million unique waypoint IDs. The world is a big place but that
         | seems like it should be enough. The more likely problem is that
         | there is (or was) no internationally-recognized authority in
         | charge of handing out waypoint IDs, so we have at least legacy
         | duplicates if not potential new ones.
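         | 
         | The arithmetic behind "nearly 12 million", as a quick check.
         | This only counts raw letter combinations; it assumes nothing
         | about pronounceability or reserved names.
         | 

```python
import string

ALPHABET_SIZE = len(string.ascii_uppercase)  # 26 upper-case letters
ID_LENGTH = 5


def count_ids(alphabet_size: int, length: int) -> int:
    """Number of distinct fixed-length strings over the given alphabet."""
    return alphabet_size ** length


total = count_ids(ALPHABET_SIZE, ID_LENGTH)
print(total)  # 11881376
```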
        
           | seabass-labrax wrote:
           | You have to reduce that to the (still massive) set of IDs
           | that are somewhat pronounceable in languages that use the
           | Latin script. You don't want to be the air traffic controller
           | trying to work out how to say 'Lufthansa 451, fly direct
           | QXKCD'. Nonetheless, I think there is little cause for
           | concern about changing existing IDs. There might be
           | sentimental attachment, but it takes barely a few flights
           | before the new IDs start sticking, and it's not like pilots
           | never fly new routes.
        
             | SoftTalker wrote:
             | I thought that is what the "ICAO pronunciation" was for?
             | 
             | "Fly direct Quebec Xray Kilo Charlie Delta"
        
               | seabass-labrax wrote:
               | It is, but fixes are almost always spoken as words rather
               | than letter-by-letter. For this reason, they are usually
               | chosen to be somewhat pronounceable, and occasionally you
               | even get jokes in the names. Likewise, radio beacons and
               | airports are usually referred to by the name of their
               | location; for instance "proceed direct Dover" rather than
               | "proceed direct Delta Victor Romeo".
               | 
               | I think a lot of pilots and air traffic controllers would
               | be irritated if they had to spend longer reading out
               | clearances and instructions. In a world where vocal
               | communication is still the primary method of air traffic
               | control, there might be a measurable reduction in
               | capacity in some busier regions.
        
               | drachir91 wrote:
               | No, waypoints aren't spelled out with the ICAO alphabet.
               | They are mnemonics that are pronounced as a word and only
               | spelled out if the person on the receiving end requests
               | it because of bad radio reception, or unfamiliarity with
               | the area/waypoint.
               | 
               | For example, Hungarian waypoints, at least the more
               | important ones, are normally named after cities, towns or
               | other geographical locations near them, and use the
               | location's name or abbreviated name, being careful that
               | they can be pronounced reasonably easily for English
               | speakers. Like: ERGOM (for the city Esztergom), ABONY
               | (for the town Fuzesabony), SOPRO (for Sopron), etc.
        
           | lgeorget wrote:
           | Not all 5-character long strings are usable though. They have
           | to be pronounceable as a single word and understandable over
           | radio as much as possible.
        
       | dundarious wrote:
       | I wish the article contained some explanation of why the
       | processing for NATS requires looking at both the ADEXP waypoints
       | _and_ the ICAO4444 waypoints (not a criticism per se, it may not
       | have been addressed in the underlying report). Just looking at
       | the ADEXP seems sufficient for the UK segment logic.
       | 
       | I'm guessing it has something to do with how ICAO4444 is
       | technically human readable, and how in some meaningful sense,
       | pilots and ATC staff "prefer" it. e.g., maybe all ICAO4444
       | waypoints are "significant" to humans (like international
       | airports), whereas ADEXP waypoints are often "insignificant"
       | (local airports, or even locations without any runway at all).
       | 
       | Of course with 20/20 hindsight, it seems obviously incorrect to
       | loop through the ICAO4444 waypoints in their entirety, instead of
       | "resuming" from an advanced position. But why look at them at
       | all?
        
         | masklinn wrote:
         | Possibly it needs the ICAO information to communicate with some
         | systems, but has to work in ADEXP to have sufficient
         | granularity (the essay mentions the possibility of "clipping",
         | a flight going through the UK between two ICAO waypoints).
        
           | dundarious wrote:
           | Yes, I'm essentially wanting to know more about those
           | existing ICAO-based systems, be they machine or not.
        
       | Gud wrote:
       | A day I don't want to remember. Took me 15 hours to reach my
       | destination instead of 2. Had to take train, bus, then train
       | again. 30 minutes after I had booked my tickets, everything was
       | fully booked for two days.
        
         | conradfr wrote:
         | Did you meet John Candy along the way?
        
       | hermitcrab wrote:
       | >"in typical Mail Online reporting style: "Did blunder by French
       | airline spark air traffic control issues?"
       | 
       | The Daily Mail is a horrible, right-wing paper in the UK that
       | blames 'foreigners' for everything. Particularly the French.
       | 
       | Out of curiosity, is there a corresponding French paper that
       | blames the English or the British for everything?
        
         | dopidopHN wrote:
         | French here, as much as I wish it was the case for comical
         | effect... I don't think so.
         | 
         | Our right wing press is also desperately economically liberal
         | so anything privately run is inherently better.
         | 
         | Maybe radio stations? Honestly, major respect to the daily mail
         | for those snarky attacks that keep up the good spirits between
         | our two countries.
         | 
         | It's maybe the food or the weather that make them aggro ? Idk,
         | but don't worry, we love to hate the perfide Albion. Too.
         | 
         | Fellow French: am I wrong? Maybe "Valeurs actuelles" could pull
         | up that type of bullshit, but I think they are too busy blaming
         | Islam to start thinking about our former colony across the
         | channel.
        
           | hermitcrab wrote:
           | >major respect to the daily mail for those snarky attacks
           | 
           | There is really nothing to like or respect about the Daily
           | Mail. https://www.globaljustice.org.uk/blog/2017/10/horrible-
           | histo...
           | 
           | >our former colony across the channel
           | 
           | Touche! ;0)
        
         | seszett wrote:
         | Well not really. People in France don't really care that much
         | about England.
         | 
         | The one country that is often blamed for problems is rather
         | Germany, but honestly even Germany doesn't get blamed for petty
         | problems like that.
        
       | rcostin2k2 wrote:
       | The fact that they blamed the French flight plan already accepted
       | by Eurocontrol proves that they didn't really know how the
       | software works. And here the Austrian company should take part of
       | the blame for the lack of intensive testing.
        
         | littlestymaar wrote:
         | They blamed the French because they are British, that's it.
         | It's hard to get rid of bad habits.
        
       | jliptzin wrote:
       | What I don't understand in situations like this when thousands of
       | flights are cancelled is how do they catch up? It always seems
       | like flights are at max capacity at all times, at least when I
       | fly. If they cancel 1,000 flights in one day, how do they absorb
       | that extra volume and get everyone where they need to be? Surely
       | a lot of people have their plans permanently cancelled?
        
         | CamelCaseName wrote:
         | There's always some empty capacity, whether it's non-rev
         | tickets for flight crew and their families which are lower
         | priority than paying customers or people who miss their
         | flights.
         | 
         | I had a cancelled flight recently and they booked people two
         | weeks out because every flight from that day onward was full or
         | nearly full. I showed up the next morning and was able to board
         | the next flight because exactly one person had scanned in their
         | boarding pass (was present at the airport) but did not show up
         | for whatever reason to the airplane.
         | 
         | Beyond that, people just make alternate plans, whether it's
         | taking a bus or taxi home, traveling elsewhere, picking another
         | airline, anything is possible.
        
         | thedrbrian wrote:
         | You don't.
         | 
         | I work in logistics for a FMCG company and sometimes our main
         | producer goes down and we run out of certain types of stock. We
         | send as much out as we can and cancel the rest.
         | 
         | If they really want the stock the customers can rebook an order
         | for tomorrow because they aren't getting it today. And we just
         | start adding extra stock to each delivery.
         | 
         | It's the best of a bad situation.
         | 
         | We don't have the money to have extra trucks and very
         | perishable stock laying about and I know the airlines don't pay
         | 300 grand a month to lease a 737 just to have it sat about
         | doing nothing. There's very little slack.
        
       | worik wrote:
       | I heard in the news that this was caused by a "bad flight plan".
       | 
       | It is clear, even without any more information than that, that it
       | was a software failure (bad flight plan?).
       | 
       | It will be interesting to see if Frequentis has to pay a price
       | for causing this.
        
       | cja wrote:
       | Every system I've ever made has better error reporting than that
       | one. Even those that only I use. First thing I get working in a
       | new project is the system to tell me when something fails and to
       | help me understand and fix the problem quickly. I then use that
       | system throughout development such that it works very well in
       | production. I'd love to talk to the people who made the system
       | discussed in the article. Is one of them reading this? Can you
       | explain how come this problem reported itself so badly?
        
       | rglover wrote:
       | This is one of the many reasons there should be a universal data
       | standard using a format like JSON. Heavily structured, easy to
       | parse, easy to debug. What you lose in footprint (i.e., more disk
       | space), you gain in system stability.
       | 
       | Imagine a world where everybody uses JSON and if they offer an
       | API, you can just consume the data without a bunch of hoop
       | jumping. Failures like this would vanish overnight.
        
         | masklinn wrote:
         | The bug here was a processing one, having the data in json
         | would make no difference.
        
         | 0xffff2 wrote:
         | Broadly speaking I think this is done for new systems. What you
         | need to identify here is how and when you transition legacy
         | systems to this new better standard of practice.
        
           | rglover wrote:
           | I'd argue in favor of at _least_ an annual review process.
           | Have a dedicated "feature freeze, emergencies only" period
           | where you evaluate your existing data structures and queue up
           | any necessary work. The only real hang up here is one of bad
           | management.
           | 
           | In terms of how, it's really just a question of Schema A to
           | Schema B mapping. Have a small team responsible for
           | collection/organization of all the possible schemas and then
           | another small team responsible for writing the mapping
           | functions to transition existing data.
           | 
           | It would require will/force. Ideally, too, jobs of those
           | responsible would be dependent on completion of the task so
           | you couldn't just kick the can. You either do it and do it
           | correctly or you're shopping your resume around.
        
         | tristor wrote:
         | The problem is systems written in the 1970s in FORTRAN to run
         | on Mainframes don't speak JSON.
        
           | rglover wrote:
           | Great. It should be fixed by replacing the FORTRAN systems
           | with a modern solution. It's not that it can't be done, it's
           | that the engineers don't bother to start the process (which
           | is a side-effect of bad incentive structure at the employment
           | level).
        
             | fullspectrumdev wrote:
             | Have you ever been involved in such a migration?
             | 
             | It's invariably a complete clusterfuck.
        
               | rglover wrote:
               | I haven't, but I'd love to. My approach wouldn't be very
               | "HR friendly," though.
        
               | count wrote:
               | Ah yes, migration through sheer force of will.
        
               | jakub_g wrote:
               | It's trivial. Only took Amadeus hundreds of developers
               | working for over a decade to migrate off TPF. /s
               | 
               | [0] https://amadeus.com/en/insights/blog/celebrating-one-
               | year-fu...
        
               | rglover wrote:
               | In some sense, yes. Notice that most of the responses to
               | what I've said are immediately negative or dismissive of
               | the idea. If that's the starting point (bad mindset), of
               | course nothing gets fixed and you land where we are
               | today.
               | 
               | My initial approach would be to weed out anyone with that
               | point of view before any work took place (the "not HR
               | friendly" part being to be purposefully exclusionary).
               | The only way a problem of this scope/scale can be solved
               | is by a team of people with extremely thick skin who are
               | comfortable grabbing a beer and telling jokes after they
               | spent the day telling each other to go f*ck themselves.
        
               | tristor wrote:
               | Anyone who has worked with me knows that I have no issue
               | coming in like a wrecking ball in order to make things
               | happen, when necessary. I've also been involved in some
               | of these migration projects. I think your take on the
               | complexity of these projects (and I do mean inherent
               | complexity, not incidental complexity) and the responses
               | you've received is exceptionally naive.
               | 
               | The amount of wise-cracks and beers your team can handle
               | after a work day is not the determining factor in
               | success. /Most/ of these organizations /want/ to migrate
               | these systems to something better. There is political
               | will and budget to do so, yet these are still inglorious
               | multi-decade slogs which cannot fail, ever, because
               | failure means people die. No amount of attitude will
               | change that.
        
             | tristor wrote:
             | That's... not how that works. I take it you're probably
             | more of a frontend person than a backend person by this
             | comment. In the backend world, you usually can't fully and
             | completely replace old systems, you can only replace parts
             | of systems while maintaining full backwards compatibility.
             | The most critical systems in the world -- healthcare,
             | transportation, military, and banking -- all run on
             | mainframes still, for the most part. This isn't a
             | coincidence. When these systems get migrated, any issues,
             | including issues of backwards compatibility cause people to
             | /DIE/. This isn't an issue of a button being two pixels to
             | the left after you bump frontend platform revs, these
             | systems are relied on for the lives and livelihood of
             | millions of people, every single day.
             | 
             | I am totally with you wishing these systems were more
             | modern, having worked with them extensively, but I'm also
             | realistic about the prospect. If every major airline
             | regulator in the world worked on upgrading their ATC
             | systems to something modern by 2023 standards, and
             | everything went perfectly, we could expect to no longer
             | need backwards compatibility with the old system sometime
             | in 2050, and that's /very/ optimistic. These systems are
             | basically why IBM is still in business, frankly.
        
             | mprovost wrote:
             | No migration of this magnitude is blocked because of
             | engineers not "bothering" to start the process. Imagine how
             | many approvals you'd need, plus getting budget from who-
             | knows how many government departments. Someone is paying
             | for your time as an engineer and they decide what you work
             | on. I'm glad we live in a world where engineers can't just
             | decide to rewrite a life or death system because it's
             | written in an old(er) programming language. (Not that there
             | is any evidence that this specific system is written in
             | anything older than C++ or maybe Ada.)
        
             | fbdab103 wrote:
             | I guess we should rewrite it in Rust.
             | 
             | Airplane logistics feels like one of the most complicated
             | systems running today. A single airline has to track
             | millions of entities: planes, parts, engineers, luggage,
             | cargo, passengers, pilots, gate agents, maintenance
             | schedules, etc. Most of which was created all before best-
             | practices were a thing. Not only is the software complex,
             | but there are probably millions of devices in the world
             | expecting exactly format X and will never be upgraded.
             | 
             | I have no doubt that eventually the software will be Ship
             | of Theseus-ed into something approaching sanity, but there
             | are likely to be glaciers of tech debt which cannot be
             | abstracted away in anything less than decades of work.
        
               | seabass-labrax wrote:
               | It would still be valuable to replace components piece-
               | by-piece, starting with rigorously defining internal data
               | structures and publicly providing schemas for existing
               | data structures so that companies can incorporate them.
               | 
               | I would like to point out that the article (and the
               | incident) does not relate to airline systems; it is to do
               | with Eurocontrol and NATS and their respective commercial
               | suppliers of software.
        
             | tjohns wrote:
             | Many of them have been upgraded. In the US, we've replaced
             | HOST (the old ATC backend system) with ERAM (the modern
             | replacement) as of 2015.
             | 
             | However, you have to remember this is a global problem. You
             | need to maintain 100% backwards compatibility with every
             | country on the planet. So even if you upgrade your
             | country's systems to something modern, you still have to
             | support old analog communication links and industry
             | standard data formats.
        
         | vb-8448 wrote:
         | It won't fix anything. JSON is the "standard" today, 15 years
         | ago it was XML and in 15 years we will have protobuf or another
         | new standard.
        
           | rglover wrote:
           | Correct. The other "leg" of a solution to this problem would
           | be to codify migration practices so stagnation at the tech
           | level is a non issue long-term.
        
             | vb-8448 wrote:
             | > codify migration practices
             | 
             | I think this won't work: no one really wants to touch a
             | system that works, and people will try to find any excuse
             | to avoid migrating. The reason for this is that everyone
             | prefers systems that work and fail in known ways over new
             | systems whose failure modes no one knows.
        
               | rglover wrote:
               | Does the system work if it randomly fails and collapses
               | the entire system for days?
               | 
               | People generally prefer to be lazy and to not use their
               | brains, show up, and receive a paycheck for the minimum
               | amount of effort. Not to be rude, but that's where this
               | attitude originates. Having a codified process means that
               | attitude _can't_ exist because you're given all of the
               | tools you need to solve the problem.
        
               | vb-8448 wrote:
               | > Having a codified process means that attitude can't
               | exist because you're given all of the tools you need to
               | solve the problem.
               | 
               | Yes, but in real life it doesn't work. Processes have
               | corner cases. As you said, people are lazy and will do
               | everything to find the corner case to fit in.
               | 
               | Just an example from the banking sector. There are
               | processes (and even laws) that force banks to use only
               | certified, supported and regularly patched software:
               | there are still a lot of Windows 2000 servers in their
               | datacenters and will be there for many years.
        
             | recursive wrote:
             | You could do all that stuff.
             | 
             | But after you did it, you'd still have exactly the same
             | problem. The cause was not related to deserialization. That
             | part worked perfectly. The problem is the business logic
             | that applied to the model after the message was parsed.
        
         | smarx007 wrote:
         | There are already standards like XML and RDF Turtle that allow
         | you to clearly communicate vocabulary, such that a property
         | 'iso3779:vin' (shorthand for a made-up URI
         | 'https://ns.iso.org/standard/52200#vin') is interpreted in the
         | same way anywhere in the structures and across API endpoints
         | across companies (unlike JSON, where you need to fight both the
         | existence of multiple labels like 'vin', 'vin_no', 'vinNumber',
         | as well as the fact that the meaning of a property is strongly
         | connected to its place in the JSON tree). The problem is that
         | the added burden is not respected at the small scale and once
         | large scale is reached, the switching costs are too big. And
         | that XML is not cool, naturally.
         | 
         | On top of that, RDF Turtle is the only widely used standard
         | _graph_ data format (as opposed to tree-based formats like JSON
         | and XML). This allows you to reduce the hoop jumping when
         | consuming responses from multiple APIs as graph union is a
         | trivial operation, while n-way tree merging is not.
         | 
         | Finally, RDF Turtle promotes use of URIs as primary identifiers
         | (the ones exposed to the API consumers) instead of primary
         | keys, bespoke tokens, or UUIDs. Following this rule makes all
         | identifiers globally unique and dereferenceable (ie, the ID
         | contains the necessary information on how to fetch the resource
         | identified by a given ID).
         | 
         | P.S.: The problem at hand was caused by the algorithm that was
         | processing the parsed data, not by the parsing per se. The
         | only improvement a better data format like RDF Turtle would
         | bring is that two different waypoints with the same label would
         | have two different URI identifiers.
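         | 
         | A rough illustration of why graph union is cheap, modelling a
         | graph as a plain Python set of (subject, predicate, object)
         | triples. This is a simplification, not an RDF library: it
         | ignores blank nodes, which a real RDF merge has to keep
         | distinct. The "ex:" identifiers and the label ABCDE are made
         | up.
         | 

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]


def merge_graphs(*graphs: Set[Triple]) -> Set[Triple]:
    """Union of triple sets; identical triples collapse automatically."""
    return set().union(*graphs)


# Two hypothetical API responses describing waypoints.
api_a = {
    ("ex:waypoint/1", "ex:label", "ABCDE"),
    ("ex:waypoint/1", "ex:country", "FR"),
}
api_b = {
    ("ex:waypoint/2", "ex:label", "ABCDE"),  # same label, different URI
    ("ex:waypoint/1", "ex:label", "ABCDE"),  # repeated fact, deduplicated
}

merged = merge_graphs(api_a, api_b)

# Both waypoints share a label but remain distinct subjects.
same_label = {s for (s, p, o) in merged if p == "ex:label" and o == "ABCDE"}
```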
        
           | seabass-labrax wrote:
           | Furthermore, there are _already_ XML namespaces for flight
           | plans. These are not, however, used by ATC - only by pilots
           | to load new routes into their aircraft's navigation
           | computers.
           | 
           | I'm not sure whether there is an existing RDF ontology for
           | flight plans; it would probably be of low to medium
           | complexity considering how powerful RDF is and the kind of
           | global-scale users it already has.
        
         | fbdab103 wrote:
         | Airport software predates basically every standard on the
         | planet. I would not be surprised to learn that they have their
         | own bizarro world implementation of ASCII, unix epoch time,
         | etc.
        
           | tjohns wrote:
           | Yes, FPL messages are sent over AFTN, which uses ITA-2 Baudot
           | code instead of ASCII:
           | https://en.wikipedia.org/wiki/Baudot_code
           | 
           | The keyboards used by ATC don't even allow entering symbols: 
           | https://www.reddit.com/media?url=https%3A%2F%2Fi.redd.it%2F1.
           | ..
           | 
           | (There is a modern replacement for AFTN called AMHS, which
           | replaces analog phone lines with X.400 messages over IP...
           | but the system still needs to be backwards compatible for ATC
           | units still using analog links.)
        
         | nemetroid wrote:
         | There are several XML formats for expressing flight plans, most
         | notably ARINC 633 and FIXM.
        
         | dundarious wrote:
         | Parsing the data formats had zero contribution to the problem.
         | They had a problem running an algorithm on the input data, and
         | error reporting when that algorithm failed. Nothing about JSON
         | would improve the situation.
        
           | rglover wrote:
           | Yes, but look at the data. The algorithm was buggy because
           | the input data is a nightmare. If the data didn't look like
           | that, it's very unlikely the bug(s) would have ever existed.
        
             | zimpenfish wrote:
             | > The algorithm was buggy because the input data is a
             | nightmare.
             | 
             | No, the algorithm was "buggy" because it didn't account for
             | the entry to and exit points from the UK to have the same
             | designation because they're supposed to be geographically
             | distant (they were 4000Nm apart!) and the UK ain't that
             | big.
        
             | dundarious wrote:
             | ADEXP sounds like the universal data standard you want
             | then. The UK just has an existing NATS that cannot
             | understand it without transformation by this problematic
             | algorithm. So the significant part of your suggestion might
             | be to elide the NATS specific processing and upgrade NATS
             | to use ADEXP directly.
             | 
             | Using a JSON format changes nothing. Just adds a few more
             | characters to the text representation.
        
             | schainks wrote:
             | I have seen bad outages caused by valid JSON whose
             | consumer implemented something incorrectly.
             | 
             | I agree with dundarious that "doing this in JSON" would not
             | have changed the likelihood the bug could have manifested.
        
               | rglover wrote:
               | No change at all? I find that hard to believe. There's
               | also a data design problem here, but the structure of
               | JSON would aid in, not subtract from, that process.
               | 
               | The question at hand is: "heavily structured data vs. a
               | blob of text as input into a complex algorithm, which one
               | is preferred?"
               | 
               | Unless you're lying, you'd choose the former given the
               | option.
        
               | dundarious wrote:
               | The issue is using _both_ ADEXP and ICAO4444 waypoints,
               | and doing so in a sloppy way. For the waypoint lists,
               | there is no issue with structurelessness -- the fact that
               | they're lists is pretty obvious, even in the existing
               | formats. Adding some ["",] would not have helped the
               | specific problem, as the relevant structure was already
               | perfectly clear to the implementers. I am not lying when
               | I say the bug would have been equally likely in a JSON
               | format in this specific case.
        
               | schainks wrote:
               | Now I'm wigging out to the idea of how the act of
               | overcoming the inertia of the existing system just to
               | migrate to JSON would spawn thousands of bugs on its own
               | -- many life-threatening, surely.
        
               | schainks wrote:
               | These old standards ARE heavily structured data, despite
               | what their formatting or lack of punctuation suggests.
        
             | numpad0 wrote:
             | To me, an XML-ified version of this would look more
             | nightmarish than the status quo... it's just brief, space
             | separated and \n terminated ASCII. No need to
             | overcomplicate things this simple.
        
       | wolfendin wrote:
       | My question is: why was the algorithm searching any section
       | before the UK entry point? You can't exit at a waypoint before
       | you enter, so there is no reason to search that space.
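       | 
       | A sketch of the bounded search this implies, in Python. It is not
       | the actual algorithm from the report, just the idea of only
       | looking for the exit after the entry position. The route and the
       | waypoint names (SAMEN, UKIN, W1...) are invented.
       | 

```python
from typing import Optional, Sequence


def find_exit_index(
    route: Sequence[str], entry_index: int, exit_name: str
) -> Optional[int]:
    """Return the first index after entry_index whose name matches, else None."""
    for i in range(entry_index + 1, len(route)):
        if route[i] == exit_name:
            return i
    return None


# A route where the exit name also appears earlier, before the entry point.
route = ["SAMEN", "W1", "UKIN", "W2", "SAMEN", "W3"]
entry_index = route.index("UKIN")  # 2

naive = route.index("SAMEN")  # searches from the start, finds index 0
bounded = find_exit_index(route, entry_index, "SAMEN")  # finds index 4
```

       | 
       | The naive search lands on a point before the entry, which makes no
       | sense for an exit; the bounded one cannot.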
        
       | CodeL wrote:
       | [flagged]
        
       | Rochus wrote:
       | This is apparently just an opinion, with no more inside
       | information than we had from the report
       | (https://news.ycombinator.com/item?id=37401981), isn't it?
       | 
       | EDIT: downvoting this question instead of responding is a pretty
       | strange reaction.
        
         | lagt_t wrote:
         | Dude, this isn't reddit, don't worry about the votes.
        
           | Rochus wrote:
           | Since it's not Reddit but HN, it's all the stranger to
           | dismiss a perfectly legitimate question. But times and mores
           | seem to change much faster than I realize.
        
         | seabass-labrax wrote:
         | You are correct, but it's an opinion that bridges the gap
         | editorially between those knowledgeable about ATC but not data,
         | and those knowledgeable about data but not ATC. This is a
         | valuable service to provide, as both fields are rather complex.
        
           | Rochus wrote:
           | Thanks. I didn't have the patience to read it all. I
           | initially hoped that the author was a field expert or even
           | someone with inside knowledge, but he is apparently from a
           | completely different domain and not in the UK, and there were
           | assumptions about things the report was rather specific about
           | (as specific as such reports usually are). It would be more
           | useful if people would take a closer look at the report and
           | draw the right conclusions about organizational failures and
           | how to avoid them. All the great software technologies to
           | achieve memory safety, etc. are of little use if the analyses
           | and specifications are flawed or the assumptions of the
           | various parties in a system of systems do not match. But
           | people seem to prefer to speculate and argue about secondary
           | issues.
        
       | cratermoon wrote:
       | "the description sounds like the procedure is working directly on
       | the textual representation of the flight plan, rather than a data
       | structure parsed from the text file. This would be quite
       | worrying, but it might also just be how it is explained."
       | 
       | Oh, this is typical in airline industry work. Ask programmers
       | about a domain model or parsing, they give you blank stares. They
       | love their validation code, and they love just giving up if
       | something doesn't validate. It's all dumb data pipelines. At no
       | point is there code that models the activities happening in the
       | real world.
       | 
       | In no system is there a "flight plan" type that has any behavior
       | associated with it or anything like a set of waypoint types. Any
       | type found would be a struct of strings in C terms, passed around
       | and parsed not once, but every time the struct member is
       | accessed. As the article notes, "The programming style seems very
       | imperative.".
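
For contrast with the "struct of strings" style described above, here is a
minimal sketch of parsing once into a typed value that carries its own
behavior. The `CALLSIGN WP1 WP2 ...` text format and the names `Waypoint`,
`FlightPlan`, and `parse_flight_plan` are assumptions made for illustration;
real flight plan formats are far richer.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Waypoint:
    name: str


@dataclass(frozen=True)
class FlightPlan:
    callsign: str
    waypoints: Tuple[Waypoint, ...]

    def segment(self, start: int, end: int) -> Tuple[Waypoint, ...]:
        """Return the waypoints from index start to index end, inclusive."""
        return self.waypoints[start:end + 1]


def parse_flight_plan(text: str) -> FlightPlan:
    """Parse 'CALLSIGN WP1 WP2 ...' once; raise ValueError on bad input."""
    fields = text.split()
    if len(fields) < 3:
        raise ValueError(f"need a callsign and at least two waypoints: {text!r}")
    callsign, *names = fields
    return FlightPlan(callsign, tuple(Waypoint(n) for n in names))


plan = parse_flight_plan("ABC123 AAA BBB CCC")
middle = plan.segment(1, 2)
```

Code that receives a `FlightPlan` never touches the raw text again, so the
string is parsed exactly once, at the boundary.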
        
         | jameshh wrote:
         | That's super interesting (and a little terrifying). It's funny
         | how different industries have developed different "cultures"
         | for seemingly random reasons.
        
           | cratermoon wrote:
           | It was terrifying enough for me in the gig I worked on that
           | dealt with reservations and check-in, where a catastrophic
           | failure would be someone boarding a flight when they
           | shouldn't have. To avoid that sort of failure, the system
           | mostly just gave up and issued the passenger what's called an
           | "Airport Service Document": effectively a record that shows
           | the passenger as having a seat on the flight, but unable to
           | check-in. This allows the passenger to go to the airport and
           | talk to an agent at the check-in desk. At that point, yes, a
           | person gets involved, and a good agent can usually work out
           | the problem and get the passenger on their flight, but of
           | course that takes time.
           | 
           | If you've ever been at the airline desk waiting to check in
           | and an agent spends 10 minutes working with a passenger
           | (passengers), it's because they got an ASD and the agent has
           | to screw around directly in the user-hostile SABRE
           | interface to fix the reservation.
        
             | 3pac wrote:
             | SABRE is pretty good compared to the card file it replaced.
        
               | cratermoon wrote:
               | It's better to say SABRE _replicated_ , in digital form,
               | that card file. And even today the legacy of that card
               | form defines SABRE and all the wrappers and gateways to
               | it.
        
         | touisteur wrote:
         | Giving up if something doesn't validate is indeed standard to
         | avoid propagating badly interpreted data, causing far more
         | complex bugs down the line. Validate soon, validate strongly,
         | report errors and don't try to interpret whatever the hell is
         | wrong with the input, don't try to be 'clever', because there
         | lie the safety holes. Crashing on bad input is wrong, but
         | trying to interpret data that doesn't validate, without specs
         | (of course) is fraught with incomprehension and
         | incompatibilities down the line, or unexpected corner cases (or
         | untested, but no one wants to pay for a fully tested _all-goes_
         | system, or just for the tools to simulate  'wrong inputs' or
         | for formal validation of the parser _and_ all the code using
         | the parser 's results).
         | 
         | There are already too many problems with non-compliant or
         | legacy (or just buggy) data emitters, with the complexity in
         | semantics or timing of the interfaces, to try and be clever
         | with badly formatted/encoded data.
         | 
         | It's already difficult (and costly) to make a system work as
         | specified, so subtle variations to make it more tolerant of
         | unspecified behaviour are just asking for bugs (or for more
         | expensive systems that don't clear the purchasing price bar).
        
           | cratermoon wrote:
           | There's a difference between _parsing_ and _validating_.
           | https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-
           | va...
           | 
           | You're right about all the buggy stuff out there, and that
           | nobody wants to pay to make it better, though.
        
       | m1n1 wrote:
       | If you want to hear about how bad air traffic control is in the
       | United States, you can listen/read here
       | https://www.nytimes.com/2023/09/05/podcasts/the-daily/plane-...
       | 
       | There was a time recently when only 3 out of the 300+ air traffic
       | control centers in the U.S. were fully staffed. All the rest were
       | short-handed. Not sure how it stands today
        
       | Diggsey wrote:
       | Software has bugs, that's not really the damning part... The
       | damning part is that in four hours and two levels of support
       | teams, there was no one who actually knew anything about how the
       | system worked who could remove the problematic flight plan so
       | that the rest of the system could continue operating!
       | 
       | What exactly is the point of these support teams when they can't
       | fix the most basic failure mode (a single bad input...)
        
         | hindsightbias wrote:
         | They were probably on vacation
        
         | [deleted]
        
         | jahewson wrote:
         | > What exactly is the point of these support teams when they
         | can't fix the most basic failure mode (a single bad input...)
         | 
         | To collect money on support contracts, I suspect.
        
           | Maxion wrote:
           | Try to get developers who love to code and create to stay on
           | a support team and be on an on-call roster. I betcha at least
           | half will say no, and the other half will either leave or
           | you'll run out of money paying them.
        
         | vb-8448 wrote:
         | Just guessing:
         | 
         | They bought software from a third party and treat it as a
         | "black box". There are a few known ways that the software fails,
         | and the local team has instructions on how to fix it. But if it
         | fails in an unexpected way, good luck, it's impossible for the
         | local team to identify and fix the problem without the vendor.
         | 
         | The reason it took so long was that they realized too late
         | that they needed to call the vendor.
         | 
         | Probably you have to blame managers rather than engineers in
         | the support team.
        
           | swarnie wrote:
           | Considering this same failure has happened a few times in
           | recent memory, maybe it's overoptimistic of me to expect an
           | entry on the support wiki or something.
        
             | krisoft wrote:
             | > Considering this same failure has happened a few times in
             | recent memory
             | 
             | Which previous instances are you thinking about?
        
         | NikolaNovak wrote:
         | Unfortunately, I work on a reasonably modern ERP system which
         | has been customized significantly for the client and also works
         | with a wider range of client-specific data combinations that
         | the vendor has seemingly not anticipated / other clients do not
         | have.
         | 
         | What it means is that on a regular basis, teams will be woken
         | up at 2am because a batch process aborted on bad data; AND it
         | doesn't tell you what data / where in the process it aborted.
         | 
         | The only possibility is to rerun the process with crippling
         | traces, and then manually review the logs to find the issue,
         | remove it, and then re-run the program _again_ (hopefully
         | remembering to remove the trace:).
         | 
         | Even when all goes per plan, this can at times take more than 4
         | hrs.
         | 
         | Now, we are not running a mission-critical real-time system
         | like air traffic; and I'm in NO way saying any of this is good;
         | but, it may not be the case that "two levels of support teams
         | didn't know anything" - the system could just be so poorly
         | designed that even with the best operational experience and
         | knowledge, it still took that long :-< .
         | 
         | On HN, we take a certain level of modernity, logging, failure
         | states, messaging, and restartability for granted; which may
         | not be even remotely present on more niche or legacy systems
         | (again, NOT saying that's good; just indicating the issue may
         | be less with operational competence than with design). It's
         | easy to judge from our external perspective, but we have no
         | idea what was presented / available to support teams, and what
         | their mandatory process is.
        
         | gonzo41 wrote:
         | And when did you last test your monthly backups? But seriously.
         | If you fill out all the positions in an org chart it's easy to
         | think you're delivering, and for a lot of situations it usually
         | works. Anointing someone a manager usually works out because
         | people can muddle through. It doesn't work in medicine, or as
         | it turns out, air traffic control.
         | 
         | Lesson learned for about the next ~5 years.
        
         | ateng wrote:
         | One important software engineering skill that is often
         | overlooked is the art of writing just the right amount of
         | logging: enough information to debug easily when things go
         | wrong, but not so verbose that it gets ignored or pruned in
         | production.
        
         | blibble wrote:
         | I wouldn't expect level 1 and level 2 to be able to diagnose a
         | problem like this
         | 
         | level 3 (devs) should have been brought in much quicker though
        
           | toyg wrote:
           | Having worked in tech support: level 3 (Devs) should have
           | described their source code structure to level 2, and let
           | them access it when they needed it.
        
           | P-Nuts wrote:
           | You don't need a complete diagnosis if you can spit out
           | enough debug info that says, "oops shat the bed while working
           | with this flight plan", then the support people can remove
           | the one that's causing you to fail, restart the system, and
           | tell ATC to route that one manually.
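
A minimal sketch of that idea, assuming plans arrive as a mapping from a plan
ID to raw text and that continuing past a failed plan is acceptable for the
system in question (a real safety-critical system may deliberately stop
instead). `process_all` and `process_one` are hypothetical names.

```python
import logging
from typing import Callable, Dict, List, Tuple

logger = logging.getLogger("flightplans")


def process_all(
    plans: Dict[str, str], process_one: Callable[[str], None]
) -> Tuple[List[str], List[str]]:
    """Process every plan; return (ok_ids, quarantined_ids)."""
    ok: List[str] = []
    quarantined: List[str] = []
    for plan_id, raw in plans.items():
        try:
            process_one(raw)
        except Exception:
            # Name the offending input so support can pull it and have
            # that one flight handled manually.
            logger.exception("failed on flight plan %s; quarantining it", plan_id)
            quarantined.append(plan_id)
        else:
            ok.append(plan_id)
    return ok, quarantined
```

The log line names the plan that failed, which is the one piece of
information a first-line support team needs in order to remove it and
restart.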
        
       ___________________________________________________________________
       (page generated 2023-09-11 22:00 UTC)