[HN Gopher] UK air traffic control meltdown
___________________________________________________________________
UK air traffic control meltdown
Author : jameshh
Score : 500 points
Date : 2023-09-11 01:08 UTC (20 hours ago)
(HTM) web link (jameshaydon.github.io)
(TXT) w3m dump (jameshaydon.github.io)
| omginternets wrote:
| Of _course_ they blamed the French ^^
| gumballindie wrote:
| [flagged]
| scythe wrote:
| Except that was a completely different incident and it
| occurred in the United States, not the UK. The Daily Mail did
| try to make hay out of the idpol angle, but the British can't
| reasonably be accused of shirking responsibility for the FAA
| grounding flights in the US.
| sergers wrote:
| Well that's DailyMail for you, where they tag anything
| parenting or healthy as "femail" section... cause you know
| only women are looking at that stuff.
|
| Lol.
|
| Anyways I actually think that's just a reasonable response:
| system goes down/related system goes down, and in reviewing
| they are making frivolous updates to names that aren't
| needed.
|
| I would question these updates (while they may be minor part
| of overall updates occurring).
| cratermoon wrote:
| At least until the 70s most newspapers had a section called
| "Women" or something similar. Even the news about the
| 60s/70s women's movement appeared there, not in the main
| "news" sections. Those sections were mostly renamed around
| that time to "Lifestyle", "Home", or just "Features".
| vixen99 wrote:
| Is this the UK or US edition? It's always easy fun to have
| a go at the Daily Mail which presumably you read regularly
| else you wouldn't be commenting. Its sin seems to be that
| it's not a serious broadsheet. It's a tabloid with very
| broad appeal that has to be profitable and therefore tries
| to reflect the requirements of the British public for such
| a publication. Perhaps you should lower your expectations.
|
| 'Tag anything parenting or healthy ...'? No, that's not
| correct. Here are a few health & food related items back to
| mid-September that did not appear in 'female'. You are
| right about parenting; most parenting in the UK is still
| undertaken primarily (in terms of executive action) by
| females so items on this topic are reasonably included in
| 'female'. The growing number of people who don't have
| children probably appreciate this sub-grouping by the Mail.
| You may not approve but this is what happens. Single males
| with dependent children are not known for objecting to
| checking out that section. It's not forbidden.
|
| https://www.dailymail.co.uk/wires/pa/article-12505173/Health...
| https://www.dailymail.co.uk/wires/ap/article-12504751/Eggpla...
| https://www.dailymail.co.uk/health/article-12504649/Suicide-...
| https://www.dailymail.co.uk/health/article-12504813/Anthony-...
| https://www.dailymail.co.uk/health/article-12503801/Cancer-n...
| https://www.dailymail.co.uk/wires/reuters/article-12503815/W...
| https://www.dailymail.co.uk/wires/reuters/article-12503299/R...
| https://www.dailymail.co.uk/news/article-12468365/One-woman-...
| https://www.dailymail.co.uk/wires/reuters/article-12502685/W...
| https://www.dailymail.co.uk/wires/ap/article-12501533/Food-r...
| https://www.dailymail.co.uk/news/article-12490747/How-safe-c...
| sergers wrote:
| Dailymail is actually a site I frequent multiple times a
| day, every day.
|
| not all content is for everyone, but they got something,
| they are definitely a tabloid style.
|
| they narrate particular views to the public but cover all
| different contents, and a lot of content i would consider
| advertisements/plugs rather than actual articles.
|
| i would guess a highly elderly/conservative majority
| base
|
| they pander to lowest common denominator, which is fine
| -- they are a for profit news/tabloid, i find some of it
| entertaining (As per daily visits).
|
| do you work for them/just a big fan for doing all that
| digging in defense of DM overexaggerating i made that ALL
| content like that is in that category? i didnt take my
| own comment all that seriously so honest ask.
| gumballindie wrote:
| > tries to reflect the requirements of the British public
|
| The issue though is that quite often the Dailymail
| doesn't reflect, but rather controls the requirements of
| the British public.
| swarnie wrote:
| As is tradition =)
|
| (We actually have no major issues with the French, at least in
| my generation; it's all just good fun)
| omginternets wrote:
| It's all good. We still take cheap-shots at English food and
| English women ;)
|
| Edit: I lived in London for 3 years. I miss it every day.
| laputan_machine wrote:
| I wouldn't worry about it, we take cheap shots at French
| food and French people too ;)
| omginternets wrote:
| Sir, those are dueling words.
| ateng wrote:
| Hold on, I think we take cheap shots at French people,
| but _expensive_ shots on the food and wine
| [deleted]
| dang wrote:
| Related. Others?
|
| _Coincidentally-identical waypoint names foxed UK air traffic
| control system_ - https://news.ycombinator.com/item?id=37430384 -
| Sept 2023 (64 comments)
|
| _UK air traffic control outage caused by bad data in flight
| plan_ - https://news.ycombinator.com/item?id=37402766 - Sept 2023
| (20 comments)
|
| _NATS report into air traffic control incident details root
| cause and solution_ -
| https://news.ycombinator.com/item?id=37401864 - Sept 2023 (19
| comments)
|
| _UK Air traffic control network crash_ -
| https://news.ycombinator.com/item?id=37292406 - Aug 2023 (23
| comments)
| switch007 wrote:
| The title of this post made me think there was a new, current
| meltdown!
| a_wild_dandan wrote:
| The recent episode of The Daily about the (US) aviation
| industry has convinced me that we'll see a catastrophic
| headline soon. Things can't go on like this.
| darkclouds wrote:
| Interesting to see that flight plans over the UK have to be filed
| 4 hours in advance.
|
| No mention of plane, pilot, passenger and cargo manifests. So why
| the 4 hour lead time, is this the time it takes UK Authorities to
| look people up or work out if the cargo could be dangerous in an
| airborne Anthrax (Gruinard) Island [1] or Japanese subway Sarin
| [2], or an IRA favourite, fertilizer bomb that's bypassed the
| usual purchase reporting regulations used by people like Jeremy
| Clarkson and Harry Metcalfe as their store of wealth[3]?
|
| It makes me wonder just how much more surveillance of the
| population exists, knowing I can't even step out of the front door
| without attracting surveillance of the type that followed Dr
| David Kelly.
|
| Sure, it's not a cyber attack per se, carried out over the internet
| like a DDOS attack or a brute force password guessing attack with
| port knocking mitigation, but how would one carry out a cyber
| attack on this system if the only attack vector is from people
| submitting flight plans?
|
| There sure is a constant playing down of the cyber attack angle
| to this which makes me think someone wants to Blurred Lines!
|
| One point on the lack of uniquely named global way points, which
| is the main crux of the problem falling over if some are to be
| believed.
|
| The USA demonstrates a disproportionate number of similar names,
| by virtue of Europeans migrating to the US [4]. So has this
| situation arisen with this system in other parts of the world
| like in the US? How can a country that created the globe spanning
| British Empire become so insular with regards to air travel in
| this way?
|
| I'd agree with the initial assessment that there appears to be a
| lack of testing, but are the specifications simply not fit for
| purpose? I'm sure various pilots could speak out here, because
| some of the regulations require planes to be minimally distanced
| from each other when transiting across the UK.
|
| On the point of ICAO and other bodies to eradicate non-unique
| waypoint names, it's clear there is some legacy constraint still
| impeding the safety of air travellers, perhaps caused by poor
| audio quality analogue radio, so perhaps it's time for the
| unambiguous and globally recognised What 3 Words form of location
| identifier, to come into effect?
|
| The UK police already prefer it to speed up response times [5].
| And although the same location can create 3 different words,
| suggesting drift with GPS [6], even if What 3 Words could not be
| used for a global system, having something a bit longer to create
| an easily recognisable human globally unique identifier is needed
| for these flight plans and perhaps maritime situations.
|
| Obviously global coordination will be like herding cats, and if
| such a fixed size global network of cells were introduced, some
| areas like transiting over the Atlantic or Pacific could command
| bigger cells, but transiting over built up areas like London
| would require smaller sized identifiable cells. But IF ever there
| was a time for the New World Order to step up to the plate and
| assert itself, to create a Globally Unique Place ID (GUPID) for
| the whole planet, now is the time.
|
| On the point of humans were kept safe, only by the sheer common
| sense of the pilots and traffic control tower staff, it's not
| something NATS did or should claim, their systems were down, so
| everyone had to resort back to pen and paper and blocks in
| queues, and apart from Silverstone when the F1 British Grand Prix
| is on, is air space ever that densely populated?
|
| NATS were caught with their pants down at so many levels of
| altitude, is this laissez faire UK management style that saw the
| Govt having to step in to bail out the banks during the financial
| crisis, still infecting other parts of UK life and still coming
| to light?
|
| It's beginning to look a lot like Christmas!
|
| [1] https://www.youtube.com/watch?v=_8Zr0IPtx80
|
| [2] https://www.youtube.com/watch?v=RTr1lquCQMg
|
| [3] https://youtu.be/LS54AJSadT4?t=279
|
| [4]
| https://en.wikipedia.org/wiki/List_of_U.S._places_named_afte...
|
| [5] https://www.bloomberg.com/news/articles/2019-03-21/u-k-polic...
|
| [6] https://support.what3words.com/en/articles/2212837-why-do-i-...
| amiga386 wrote:
| > so why the 4 hour lead time
|
| To answer your question without conspiracy drivel, let's look
| up CAP 694: The UK Flight Planning Guide [0]
|
| Chapter 1
|
| > 6.1 The general ICAO requirement is that FPLs should be filed
| on the ground at least 60 minutes before clearance to start-up
| or taxi is requested. The "Estimated Off Block Time" (EOBT) is
| used as the planned departure time in flight planning, not the
| planned airborne time.
|
| > 6.3 IFR flights on the North Atlantic and on routes subject
| to Air Traffic Flow Management, should be filed a minimum of 3
| hours before EOBT (see Chapter 4).
|
| Chapter 4
|
| > 1.1 The UK is a participating State in the Integrated Initial
| Flight Plan Processing System (IFPS), which is an integral part
| of the Eurocontrol centralised Air Traffic Flow Management
| (ATFM) system.
|
| > 4.1 FPLs should be filed a minimum of 3 hours before
| Estimated Off Block Time (EOBT) for North Atlantic flights and
| those subject to ATFM measures, and a minimum of 60 minutes
| before EOBT for all other flights.
|
| So the answer is because the UK is part of a Europe-wide air
| traffic control system, which hands out full flight plans to
| all the relevant authorities for each airspace, and they
| decided 3 hours is needed so that all possible participants can
| get their shit together and tell you if they accept the plan or
| not.
|
| An _entirely separate system_ exists to share Advanced
| Passenger Information, i.e. passenger manifests [1], and it
| goes even further that airlines share your overall identity
| with each other, known as a Passenger Name Record [2], and a
| variety of countries, led by the USA, insist on this
| information in advance before the plane is allowed to take off
| [3]
|
| If you're going to be paranoid, please work with known facts
| instead of speculating.
|
| [0] https://publicapps.caa.co.uk/docs/33/CAP%20694.pdf
|
| [1]
| https://en.wikipedia.org/wiki/Advance_Passenger_Information_...
|
| [2] https://en.wikipedia.org/wiki/Passenger_name_record
|
| [3]
| https://en.wikipedia.org/wiki/United_States%E2%80%93European...
| NovemberWhiskey wrote:
| This is not the first time this has happened; the phenomenon has
| even got a name - "poison flight plan".
| krisoft wrote:
| > the phenomenon has even got a name - "poison flight plan".
|
| Maybe, but it must not be a common phrase because your comment
| is the first result when I search for it.
|
| And it is also mentioned in this article:
| http://www.aero-news.net/subsite.cfm?do=main.textpost&id=ce2...
|
| And that's about it? Do you have any other sources?
| tpmx wrote:
| I think that term was invented four days ago by that article
| writer. There are four other occurrences before then and
| they're about PS2 games.
| NovemberWhiskey wrote:
| This term was in wide circulation when I was consulting at
| NATS in the 2000-2005 time frame.
| dboreham wrote:
| The generic term I'm familiar with is "ping of death".
| crabbone wrote:
| I want to comment specifically on:
|
| > The software and system are not properly tested.
|
| Followed by suggesting to do fuzzing tests.
|
| * Automatically generating valid flight paths is somewhat hard
| (and you'd have to know which ones are valid because the system,
| apparently, is designed to also reject some paths). It's also
| possible that such a generator would generate valid but
| improbable flight paths. There's probably an astronomic number of
| possible flight paths, which makes exhaustive testing impossible,
| thus no guarantee that a "weird" path would've been found. The
| points through which the paths go seem to be somewhat dynamic
| (i.e. new airports aren't added every day, but in a life-span of
| such a system there will be probably a few added). More
| realistically some points on flight paths may be removed. Does
| the fuzzing have to account for possibilities of new / removed
| points?
|
| * This particular functionality is probably buried deep inside
| other code with no direct or easy way to extricate it from its
| surrounding, and so would be very difficult to feed into a
| fuzzer. Which leads to the question of how much fuzzing should be
| done and at what level. Add to this that some testing
| methodologies insist on divorcing the testing from development as
| not to create an incentive for testers to automatically okay the
| output of development (as they would be sort of okaying their own
| work). This is not very common in places like Web, but is common
| in eg. medical equipment (is actually in the guidelines). So, if
| the developer simply didn't understand what the specification
| told them to do, then it's possible that external testing wasn't
| capable of reaching the problematic code-path, or was severely
| limited in its ability to hit it.
|
| * In my experience with formats and standards like these it's
| often the case that the standard captures a lot of impossible or
| unrealistic cases, hopefully a superset of what's actually needed
| in practice. Flagging every way in which a program doesn't match
| the specification becomes useless or even counter-productive
| because developers become overloaded with bug reports most of
| which aren't really relevant. It's hard to identify the cases
| that are rare but plausible. The fact that the testers didn't
| find this defect in time is really just a function of how much
| time they have. And, really, the time we have to test any program
| can cover a tiny fraction of what's required to test a program
| exhaustively. So, you need to rely on heuristics and gut feeling.
| theptip wrote:
| None of this really argues against fuzz testing; even with
| completely bogus/malformed flight plans, it shouldn't be
| possible for a dead letter to take down the entire system. And,
| since it's translating between an upstream and downstream
| format (and all the validation is done when ingesting the
| upstream), you probably want to be sure anything that is valid
| upstream is also valid downstream.
|
| It's true that fuzz testing is easiest when you can do it more
| at the unit level (fuzz this function implementing a core
| algorithm, say) but doing whole-system fuzz tests is perfectly
| fine too.
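|
| A minimal sketch of the kind of fuzz test described above, under
| stated assumptions: convert() and RejectedPlan are hypothetical
| stand-ins for the upstream-to-downstream translator and its
| per-plan rejection, not the real system. The harness only checks
| one property: any input either converts or is cleanly rejected,
| and nothing else escapes.
|
```python
import random


class RejectedPlan(Exception):
    """Raised when one plan cannot be converted (hypothetical)."""


def convert(plan):
    """Hypothetical stand-in for the upstream-to-downstream translator."""
    if not plan:
        raise RejectedPlan("empty plan")
    return "-".join(plan)


def fuzz(convert_fn, waypoints, runs=1000, seed=0):
    """Feed random plans to convert_fn and collect unexpected exceptions."""
    rng = random.Random(seed)
    crashes = []
    for _ in range(runs):
        plan = [rng.choice(waypoints) for _ in range(rng.randint(0, 6))]
        try:
            convert_fn(plan)
        except RejectedPlan:
            pass  # A clean rejection of one plan is acceptable.
        except Exception as exc:
            # Anything else is the kind of failure that could take
            # the whole system down.
            crashes.append((plan, exc))
    return crashes


crashes = fuzz(convert, ["AAA", "BBB", "CCC", "DDD"])
print(len(crashes), "unexpected failures")
```
|
| Random plans like these are mostly implausible routes, which is
| the limitation raised in the parent comment; the sketch only shows
| the "no input may crash the system" property.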
| crabbone wrote:
| This is not against the principle of fuzz testing. This is to
| say that the author doesn't really know the reality of
| testing and is very quick to point fingers. It's easy to tell
| in retrospect that this particular aspect should've been
| tested. It's basically impossible to find such defects
| proactively.
| seabass-labrax wrote:
| I had been considering becoming an air traffic controller myself,
| and it rather tickles me to think I might have missed my once-in-
| a-lifetime opportunity to direct aircraft with the original pen-
| and-paper flight strip mechanism in the 21st century! Completely
| safe, excruciatingly low-capacity, and sounds like awfully good
| fun as a novelty (for the willing ATC, not the passengers stuck
| on the ground, I hasten to add).
| drachir91 wrote:
| What ticked me off is that when the primary system threw in the
| towel, an EXACT SAME system took over and ran the exact same code
| on the exact same data as the primary. I know that with code and
| algorithms it's not always the case but even then you know what
| doing the same thing over and over expecting different results
| defines...
|
| Yes, it can be argued that the software should've had more
| graceful failure modes and this shouldn't have thrown a critical
| exception. It can be argued that the programmers should've seen
| this possibility. We can argue a lot of things about this.
|
| But the reality is that this is a mission-critical system. And
| for such systems, there're ways to mitigate all of these mistakes
| and allow the system to continue functioning.
|
| The easiest (but least safe) one would be to have the secondary
| system loaded with code that does the same thing but written by a
| different team/vendor. It reduces the chance, from 100% to much,
| much less, that an input which provokes an unforeseen,
| system-breaking bug in the primary will provoke the same bug in
| the secondary.
|
| An even better solution is to have a triumvirate system, where
| all 3 have code written by different teams, and they always
| compare results. If 3 agree, great, if 2 agree, not so great but
| safe to assume that the bug is in the 1 not the 2 (but should
| throw an alert for the supervisors that the whole system is in a
| degraded mode where any further node failure is a showstopper),
| and if all disagree, grind everything to a halt because the world
| is ending, and let the humans handle it.
|
| It can be refined even further. And it's not something new. So
| why wasn't this system implemented in such a way? (Aside from
| cost. I don't care about anyone's cost-cutting incentives in
| mission-critical systems. Sorry capitalism...)
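|
| The three-way comparison above can be sketched as a simple
| majority voter. This is an illustration only: vote() and the
| status strings are made-up names, and it assumes the three
| independently written implementations return directly comparable
| (hashable) results.
|
```python
from collections import Counter


def vote(results):
    """Compare the outputs of three independent implementations."""
    value, count = Counter(results).most_common(1)[0]
    if count == 3:
        return value, "ok"
    if count == 2:
        # Two agree: trust them, but alert the supervisors that the
        # system is now running degraded.
        return value, "degraded"
    # No majority at all: stop and let the humans handle it.
    return None, "halt"


print(vote(["route-A", "route-A", "route-A"]))
print(vote(["route-A", "route-B", "route-A"]))
print(vote(["route-A", "route-B", "route-C"]))
```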
| throwaway894345 wrote:
| > Aside from cost. I don't care about anyone's cost-cutting
| incentives in mission-critical systems. Sorry capitalism...
|
| Capitalism is happy to have redundancy in mission critical
| systems all the time. Why would it care here?
| drachir91 wrote:
| I don't know but in recent years I'm increasingly seeing
| mission critical systems having only token or "apparent"
| redundancies instead of real ones, and couldn't find any
| other rationale than cost savings and shareholder bottom
| lines. I'm not saying that capitalism = bad, it's mostly
| better than the alternatives, but just like its most direct
| competitor, it suffers from bad implementations across the
| world and unbounded human greed.
|
| A recent and very "in the face" example, also from the air
| travel industry would be the B737 Max and its AoA sensors.
| There were two, for two flight computers, but MCAS only used
| 1 flight computer and 1 AoA sensor, despite the already
| existing crosslinks between the flight computers and the
| sensors...
|
| Profit maxing first with the "no need for a new type rating
| for the pilots", then cost-cutting first in aeronautical
| engineering (solving an airframe design problem with
| software, plus designing a flight envelope protection system
| that can overpower the human pilots).
|
| Then cost-cutting in software engineering and QC, rushing out
| software made by engineers (probably) inexperienced in the
| field and failing to properly test it and ensure that it
| had the needed redundancy.
| crabbone wrote:
| > an EXACT SAME system took over and ran the exact same code
|
| Did you ever work with HA systems? Because this is how they
| work. It's two copies of the same system intended for the cases
| when eg. hardware fails, or network partitioning happens etc.
| drachir91 wrote:
| No, I have not. But HA systems work like that because hardware
| or network failure is what they are designed to guard
| against, not a latent bug in the software logic. If there's a
| software bug, both systems will exhibit the same behavior, so
| HA fails there.
| burntwater wrote:
| I'm wondering if the backup system could have a delayed queue;
| say, 30 seconds behind. If the primary fails, and exactly 30
| seconds later the secondary system fails, you have reasonable
| assurance that it was queue input that caused the failure.
| Rollback to the last successful queue input, skip and flag the
| suspect input, and see if the next input is successful.
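|
| A rough simulation of that idea, with the 30-second delay modelled
| as a backup that trails the primary by a few messages. Everything
| here (locate_poison_input, resume_skipping, the lag parameter) is
| hypothetical and only illustrates the reasoning: if both replicas
| fail on the same input, the input is the likely culprit.
|
```python
def locate_poison_input(messages, process, lag=2):
    """Return the index of the input that fails on both replicas."""

    def first_failure(start):
        for i in range(start, len(messages)):
            try:
                process(messages[i])
            except Exception:
                return i
        return None

    primary_crash = first_failure(0)
    if primary_crash is None:
        return None
    # The backup trails by `lag` messages, so it replays the tail of
    # the queue and reaches the same input a little later.
    backup_crash = first_failure(max(0, primary_crash - lag))
    return primary_crash if backup_crash == primary_crash else None


def resume_skipping(messages, process, poison_index):
    """Reprocess the queue, skipping the flagged input."""
    return [process(m) for i, m in enumerate(messages) if i != poison_index]


queue = [4, 2, 0, 5]


def handler(n):
    return 20 // n  # fails on 0


bad = locate_poison_input(queue, handler)
print(bad, resume_skipping(queue, handler, bad))
```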
| drachir91 wrote:
| This looks to me like it could work, but would need a ready
| force of technicians always expecting something like that so
| they can troubleshoot it in a timely manner.
| asimpleusecase wrote:
| And why could the system not put the failed flight plan in a
| queue for human review and just keep on working for the rest of
| the flights? I think the lack of that "feature" is what I find so
| boggling.
| dboreham wrote:
| Because some software developers are crap at their jobs.
| hn_throwaway_99 wrote:
| To be fair that is exactly what the article said was a major
| problem, and which the postmortem also said was a major
| problem. I agree I think this is the most important issue:
|
| > The FPRSA-R system has bad failure modes
|
| > All systems can malfunction, so the important thing is that
| they malfunction in a good way and that those responsible are
| prepared for malfunctions.
|
| > A single flight plan caused a problem, and the entire FPRSA-R
| system crashed, which means no flight plans are being processed
| at all. If there is a problem with a single flight plan, it
| should be moved to a separate slower queue, for manual
| processing by humans. NATS acknowledges this in their "actions
| already undertaken or in progress":
|
| >> The addition of specific message filters into the data flow
| between IFPS and FPRSA-R to filter out any flight plans that
| fit the conditions that caused the incident.
| cratermoon wrote:
| > why could the system not put the failed flight plan in a
| queue
|
| Because it doesn't look at the data as a "flight plan"
| consisting of "way points" with "segments" along a "route" that
| has any internal self-consistency. It's a bag of strings and
| numbers that's parsed and the result passed along, if parsing
| is successful. If not, give up. In this case fail the _entire
| system_ and take it out of production.
|
| Airline industry code is a pile of badly-written legacy
| wrappers on top of legacy wrappers. (Mostly not including
| actual flight software on the aircraft. Mostly). The FPRSA-R
| system mentioned here is _not_ a flight plan system, it's an
| ETL system. It's not coded to model or work with flight plans,
| it's just parsing data from system A, re-encoding it for system
| B, and failing hard if it can't.
| slt2021 wrote:
| good ETLs are usually designed to separate good records from
| bad records, so even if one or two rows in the stream do not
| conform to schema - you can put them aside and process the
| rest.
|
| seems like poor engineering
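|
| A minimal sketch of that pattern, assuming a made-up
| "CALLSIGN,LEVEL" record format; run_etl() and to_downstream() are
| illustrative names, not anything from the real system. Records
| that fail are set aside with the reason, and the rest of the
| stream keeps flowing.
|
```python
def to_downstream(record):
    """Hypothetical transform: parse 'CALLSIGN,LEVEL' into a dict."""
    callsign, level = record.split(",")
    return {"callsign": callsign, "level": int(level)}


def run_etl(records, transform):
    """Transform each record; route failures to a dead-letter list."""
    loaded, dead_letters = [], []
    for record in records:
        try:
            loaded.append(transform(record))
        except Exception as exc:
            # Keep the record and the reason so a human can review it.
            dead_letters.append((record, repr(exc)))
    return loaded, dead_letters


records = ["ABC123,350", "XYZ9,not-a-number", "DEF456,280"]
loaded, dead_letters = run_etl(records, to_downstream)
print(len(loaded), "loaded;", len(dead_letters), "for manual review")
```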
| jandrese wrote:
| The problem is that it means you have a plane entering the
| airspace at some point in the near future and the system
| doesn't know it is going to be there. The whole point of
| this is to make sure no two planes are attempting to occupy
| the same space at the same time. If you don't know where
| one of the planes will be you can't plan all of the rest to
| avoid it.
|
| The thing that blows my mind is that this was apparently
| the first time this situation had happened after 15 million
| records processed. I would have expected it to trigger much
| more often. It makes me wonder if there wasn't someone who
| was fixing these as they came up in the 4 hour window, and
| he just happened to be off that day.
| d1sxeyes wrote:
| > It makes me wonder if there wasn't someone who was
| fixing these as they came up in the 4 hour window, and he
| just happened to be off that day.
|
| This is very possible. I know of a guy who does (or at
| least a few years ago did) 24x7 365 on-call for a piece
| of mission (although not safety) critical aviation
| software.
|
| Most of his calls were fixing AWBs quickly because
| otherwise planes would need to take off empty or lose
| their take-off slot.
|
| Although there had been some "bus factor" planning and
| mitigation around this guy's role, it involved engaging
| vendors etc. and would have likely resulted in a lot of
| disruption in the short term.
| fbdab103 wrote:
| Please tell me this guy is now wealthy beyond imagination
| and living a life of leisure?
| zaphar wrote:
| Bad records aren't supposed to be ignored. They are
| supposed to be looked at by a human who can determine
| what to do.
|
| Failing the way NATS did means that _all_ future flight
| plan data including for planes already in the sky are no
| longer being processed. The safer failure mode was
| definitely to flag this plan and surface to a human while
| continuing to process other plans.
| cratermoon wrote:
| I never said it was a _good_ ETL system. Heck, I don't
| even know if the specs for it even specify what to do
| with a bad record - there are at least 300 pages detailing
| the system. Looking around at other stories, I see repeated
| mentions of how the circumstances leading to this failure
| are supposedly extremely rare, "one in 15 million"
| according to one official[1]. But at 100,000 flights/day
| (estimated), this kind of situation would occur,
| statistically, twice a year.
|
| 1 https://news.sky.com/story/major-flights-disruption-caused-b...
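|
| The arithmetic behind that estimate, as a quick check. It assumes
| each flight independently carries the quoted one-in-15-million
| chance and uses the comment's own estimate of 100,000 flights per
| day.
|
```python
flights_per_day = 100_000   # the comment's estimate
odds = 15_000_000           # "one in 15 million"

expected_per_year = flights_per_day * 365 / odds
print(round(expected_per_year, 2))  # 2.43, i.e. roughly twice a year
```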
| Scarblac wrote:
| This flight plan was correct though, if there was some
| validation like that then it should have passed.
|
| The code that crashed had a bug, it couldn't deal with all
| valid data.
| pimterry wrote:
| To be fair, the article suggests early on that sometimes these
| plans are being processed for flights already in the air
| (although at least 4 hours away from the UK).
|
| If you can stop the specific problematic plane taking off then
| keeping the system running is fine, but once you have a flight
| in the air it's a different game.
|
| It's not totally unreasonable to say "we have an aircraft en
| route to enter UK airspace and we don't know when or where -
| stop planning more flights until we know where that plane is".
|
| If you really can't handle the flight plan, I imagine a
| reasonable solution would be to somehow force the incoming
| plane to redirect and land before reaching the UK, until you
| can work out where it's actually going, but that's definitely
| something that needs to wait for manual intervention anyway.
| krisoft wrote:
| > "we have an aircraft en route to enter UK airspace and we
| don't know when or where - stop planning more flights until
| we know where that plane is".
|
| Flight plans don't tell where the plane is. Where is this
| assumption coming from?
| joncrocks wrote:
| Presumably you need to know where upcoming flights are
| going to be in the future (based on the plan), before they
| hit radar etc.
| macguillicuddy wrote:
| For the most part (although there are important
| exceptions), IFR flights are always in radar contact with
| a controller. The flight plan is a tool that allows ATC and the
| plane to agree a route so that they don't have to be
| constantly communicating. ATC 'clears' a plane to
| continue on the route to a given limit, and expects the
| plane to continue on the plan until that limit unless
| they give any further instructions.
|
| In this regard UK ATC can choose to do anything they like
| with a plane when it comes under their control - if they
| don't consider the flight plan to be valid or safe they
| can just instruct the plane to hold/divert/land etc.
|
| I'm not sure the NATS system that failed has the ability
| to reject a given flight plan back upstream.
| RyJones wrote:
| Mostly yes; however, there are large parts of the
| Atlantic and Pacific where that isn't true (radar
| contact). I know the Atlantic routes are frequently full
| of planes that left the US and Canada heading to the UK.
|
| I have no idea what percent of the volume into the UK
| comes from outside radar control; if they asked a flight
| to divert, that may open multiple other cans of worms.
| mannykannot wrote:
| > If they asked a flight to divert, that may open
| multiple other cans of worms.
|
| Any ATC system has to be resilient enough to handle a
| diversion on account of things like bad weather,
| mechanical failure or a medical emergency. In fact, I
| would think the diversion of one aircraft would be less
| of a problem than those caused by bad weather, and
| certainly less than the problem caused by this failure.
| Furthermore, I would guess that the mitigation would be
| just to manually direct the flight according to the
| accepted flight plan, as it was a completely valid one.
|
| One of the many problems here is that they could not
| identify the problem-triggering flight plan for hours,
| and only with the assistance of the vendor's engineers.
| Another is that the system had immediately foreclosed on
| that option anyway, by shutting down.
| lxgr wrote:
| Flight plans do inform ATC where and when a plane is
| expected to enter their FIR though, no?
| micromacrofoot wrote:
| I've had brief glimpses at these systems, and honestly I
| wouldn't be surprised if it took more than a year for a simple
| feature like this to be implemented. These systems look like
| decades of legacy code duct-taped together.
| jameshart wrote:
| The algorithm as described in the blogpost is _probably_ not
| implemented as a straightforward piece of procedural code that
| goes step by step through the input flightplan waypoints as
| described. It may be implemented in a way that incorporates
| some abstractions that obscured the fact that this was an
| _input_ error.
|
| If from the code's point of view it looked instead like a
| sanity failure in the underlying navigation waypoint database,
| aborting processing of flight plans makes a lot more sense.
|
| Imagine the code is asking some repository of waypoints and
| routes 'find me the waypoint where this route leaves UK
| airspace'; then it asks to find the route segment that
| incorporates that waypoint; then it asserts that that segment
| passes through UK airspace... if that assertion fails, that
| doesn't look _immediately_ like a problem with the flight plan
| but rather with the invariant assumptions built into the route
| data.
|
| And of course in a sense it _is_ potentially a fatal bug
| because this issue demonstrates that the assumptions the
| algorithm is making about the data _are_ wrong and it is
| potentially capable of returning incorrect answers.
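|
| To illustrate how an input problem can surface as a data-invariant
| failure, here is a toy version of the lookup sequence imagined
| above. It is not the real code: the (name, in_uk) route layout,
| the waypoint names and InvariantError are all invented. The point
| is that a lookup by name alone picks the wrong waypoint when a
| name repeats, and the later check then fails in a way that looks
| like corrupt route data.
|
```python
class InvariantError(Exception):
    """Raised when route data contradicts what the algorithm assumes."""


def uk_exit_segment(route):
    """Return the two-point segment on which the route leaves the UK.

    `route` is a list of (name, in_uk) pairs (an assumed layout).
    """
    names = [name for name, _ in route]
    # Name of the last waypoint that lies inside UK airspace.
    exit_name = [name for name, in_uk in route if in_uk][-1]
    # Lookup by name alone: returns the FIRST waypoint with that name.
    i = names.index(exit_name)
    segment = route[i:i + 2]
    if not any(in_uk for _, in_uk in segment):
        raise InvariantError(f"segment at {exit_name} never touches the UK")
    return segment


unique = [("START", False), ("ENTRY", True), ("EXIT", True), ("DEST", False)]
duplicated = [("START", False), ("DUP", False), ("OCEAN", False),
              ("ENTRY", True), ("DUP", True), ("DEST", False)]

print(uk_exit_segment(unique))
try:
    uk_exit_segment(duplicated)
except InvariantError as exc:
    print("looks like corrupt route data:", exc)
```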
| Spivak wrote:
| Because they hit "unknown error" and when that happens on
| safety critical systems you have to assume that all your
| system's invariants are compromised and you're in undefined
| behavior -- so all you can do is stop.
|
| Saying this should have been handled as a known error is
| totally reasonable but that's broadly the same as saying they
| should have just written bug free code. Even if they had parsed
| it into some structure this would be the equivalent of a
| KeyError popping out of nowhere because the code assumed an
| optional key existed.
|
| For these kinds of things the post mortem and remediation have
| to kinda take as given that eventually a not predictable in
| advance unhandled unknown error will occur and then work on how
| it could be handled better. Because of course the solution to a
| bug is to fix the bug, but the issue and the reason for the
| meltdown is a DR plan that couldn't be implemented in a
| reasonable timeframe. I don't care what programming practices,
| what style, what language, what tooling. Something of a similar
| caliber will happen again eventually with probability 1 even
| with the best coders.
| ummonk wrote:
| That it's safety critical is all the more reason it should
| fail gracefully (albeit surfacing errors to warn the user). A
| single bad flight plan shouldn't jeopardize things by making
| data on all the other flight plans unavailable.
| jjk166 wrote:
| > Saying this should have been handled as a known error is
| totally reasonable but that's broadly the same as saying they
| should have just written bug free code.
|
| I think there's a world of difference between writing bug
| free code, and writing code such that a bug in one system
| doesn't propagate to others. Obviously it's unreasonable to
| foresee every possible issue with a flight plan and handle
| each, but it's much more reasonable to foresee that there
| might be some issue with some flight plan at some point, and
| structure the code such that it doesn't assume an error-free
| flight plan, and the damage is contained. You can't make
| systems completely immune to failure, but you can make it so
| an arbitrarily large number of things have to all go wrong at
| the same time to get a catastrophic failure.
| ChoHag wrote:
| [dead]
| krisoft wrote:
| > Even if they had parsed it into some structure this would
| be the equivalent of a KeyError popping out of nowhere
| because the code assumed an optional key existed.
|
| How many KeyError exceptions have brought down your whole
| server? It doesn't happen because whoever coded your web
| framework knows better and added a big try-catch around the
| code which handles individual requests. That way you get a
| 500 error on the specific request instead of a complete
| shutdown every time a developer made a mistake.
| david422 wrote:
| > big try-catch around the code which handles individual
| requests.
|
| I mean, that's assuming the code isolating requests is also
| bug free. You just don't know.
| numpad0 wrote:
| Crash is a feature, though. It's not like exceptions raises
| by itself into interpreter specifications. It's just that
| it so happens that Web apps ain't need no airbags that slow
| down businesses.
| marcosdumay wrote:
| On a multi-user system, only partial crashes are
| features. Total crashes are bugs.
|
| A web server is a multi-user system, just like a
| country's air traffic control.
| acdha wrote:
| That line of reasoning is how you have systemic failures
| like this (or the Ariane 5 debacle). It only makes sense
| in the most dire of situations, like shutting down a
| reactor, not input validation. At most this failure
| should have grounded just the one affected flight rather
| than the entire transportation network.
| Spivak wrote:
| I love that phrasing, I'm gonna use that from now on when
| talking about low-stakes vs high-stakes systems.
| madeofpalk wrote:
| That's like saying that because one browser tab tried to
| parse some invalid JSON then my whole browser should crash.
| adrianmonk wrote:
| You don't know that the JSON is invalid. Maybe the JSON is
| perfect and your parser is broken.
| Spivak wrote:
| Well yes because you're describing a system where there are
| really low stakes and crash recovery is always possible
| because you can just throw away all your local state.
|
| The flip side would be like a database failing to parse
| some part of its WAL log due to disk corruption and just
| saying, "eh just delete those sections and move on."
| madeofpalk wrote:
| Crash the tab and allow all the others to carry on!
|
| The problem here is that one individual document failed
| to parse.
| zimpenfish wrote:
| No, it's more like saying your browser has detected
| possible internal corruption with, say, its history or
| cookies database and should stop writing to it immediately.
| Which probably means it has to stop working.
| ludwik wrote:
| It definitely isn't. It was just a validation error in
| one of thousands of external data files that the system
| processes. Something very routine for almost any software
| dealing with data.
| kccqzy wrote:
| I agree with your first paragraph but your second paragraph
| is quite defeatist. I was involved in quite a few
| "premortem" meetings where people think of increasingly
| improbable failure modes and devise strategies for them. It's
| a useful meeting before large changes to critical systems
| are made live. In my opinion, this should totally be a known
| error.
|
| > Having found an entry and exit point, with the latter being
| the duplicate and therefore geographically incorrect, the
| software could not extract a valid UK portion of flight plan
| between these two points.
|
| It doesn't take much imagination to surmise that perhaps real
| world data is broken and sometimes you are handed data that
| doesn't have a valid UK portion of flight plan. Bugs can
| happen, yes, such as in this case where a valid flight plan
| was misinterpreted to be invalid, but gracefully dealing with
| the invalid plan should be a requirement.
| piva00 wrote:
| > Because they hit "unknown error" and when that happens on
| safety critical systems you have to assume that all your
| system's invariants are compromised and you're in undefined
| behavior -- so all you can do is stop.
|
| What surprised me more is that the amount of data existing
| for all waypoints on the globe is quite small, if I were to
| implement a feature that queries by their names as an
| identifier the first thing I'd do is to check for duplicates
| in the dataset. Because if there are, I need to consider that
| condition in every place where I'd be querying a waypoint by
| a potential duplicate identifier.
|
| I had that thought immediately when looking at flight plan
| format, noticed the short strings referring to waypoints, way
| before getting to the section where they point out the name
| collision issue.
|
| Maybe I'm too used to working with absurd amounts of data (at
| least in comparison to this dataset), it's a constant part of
| my job to do some cursory data analysis to understand the
| parameters of the data I'm working with, what values can be
| duplicated or malformed, etc.
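|
| A minimal sketch of that kind of duplicate check, assuming the
| waypoint data is available as (identifier, latitude, longitude)
| tuples. The sample names and coordinates are made up for
| illustration and do not come from any real navigation database.

```python
from collections import defaultdict

# Hypothetical sample data: (identifier, latitude, longitude).
# The names and coordinates are invented for illustration.
WAYPOINTS = [
    ("ALPHA", 51.0, -1.0),
    ("BRAVO", 48.5, 2.3),
    ("ALPHA", -33.9, 151.2),
    ("CHARL", 40.6, -73.8),
]


def find_duplicate_identifiers(waypoints):
    """Return {identifier: [(lat, lon), ...]} for names used more than once."""
    by_name = defaultdict(list)
    for name, lat, lon in waypoints:
        by_name[name].append((lat, lon))
    return {name: coords for name, coords in by_name.items() if len(coords) > 1}


duplicates = find_duplicate_identifiers(WAYPOINTS)
```

| Any identifier that shows up in the result is one where a
| lookup by name alone is ambiguous.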
| SoftTalker wrote:
| If there are duplicate waypoint IDs, they are not close
| together. They can be easily eliminated by selecting the
| one that is one hop away from the prior waypoint. Just
| traversing the graph of waypoints in order would filter out
| any unreachable duplicates.
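|
| One way to picture this, as a rough sketch only: it uses
| great-circle distance to the previously resolved point as a
| stand-in for "one hop away" in the route graph, which is a
| simplification. The function names, the database layout and
| the coordinates are all hypothetical.

```python
import math

EARTH_RADIUS_NMI = 3440.065  # mean Earth radius in nautical miles


def distance_nmi(a, b):
    """Great-circle (haversine) distance between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_NMI * math.asin(math.sqrt(h))


def resolve_route(names, database, start):
    """Resolve each name to the candidate closest to the previously resolved point."""
    resolved = []
    previous = start
    for name in names:
        candidates = database.get(name)
        if not candidates:
            raise ValueError(f"unknown waypoint: {name}")
        previous = min(candidates, key=lambda c: distance_nmi(previous, c))
        resolved.append((name, previous))
    return resolved


# Hypothetical database: identifier -> list of (lat, lon) candidates.
DATABASE = {
    "ALPHA": [(51.0, -1.0), (-33.9, 151.2)],
    "BRAVO": [(48.5, 2.3)],
}
route = resolve_route(["BRAVO", "ALPHA"], DATABASE, start=(50.0, 0.0))
```

| Here the second "ALPHA" on the other side of the world is never
| picked, because the candidate near "BRAVO" is always closer.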
| adrianmonk wrote:
| Because the code classified it as a "this should never happen!"
| error, and then it happened. The code didn't classify it as a
| "flight plan has bad data" error or a "flight plan data is OK
| but we don't support it yet" error.
|
| If a "this should never happen!" error occurs, then you don't
| know what's wrong with the system or how bad or far-reaching
| the effects are. Maybe it's like what happened here and you
| could have continued. Or maybe you're getting the error because
| the software has a catastrophic new bug that will silently
| corrupt all the other flight plans and get people killed. You
| don't know whether it is or isn't safe to continue, so you
| stop.
| hn_throwaway_99 wrote:
| I agree with the general sentiment "if you see an unexpected
| error, STOP", but I don't really think that applies here.
|
| That is, when processing a sequential queue which is what
| this job does, it seems to me reading the article that each
| job in the queue is essentially totally independent. In that
| case, the code most definitely _should_ isolate "unexpected
| error in job" from a larger "something unknown happened
| processing the higher level queue".
|
| I've actually seen this bug in different contexts before, and
| the lessons should always be: One bad job shouldn't crash the
| whole system. Error handling boundaries should be such that a
| bad job should be taken out of the queue and handled
| separately. If you don't do this (which really just entails
| being thoughtful when processing jobs about the types of
| errors that are specific to an individual job), I guarantee
| you'll have a bad time, just like these maintainers did.
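|
| A minimal sketch of such an error boundary, assuming each job
| really is independent (which other replies dispute for this
| system). FlightPlanError, process_plan and the dict layout are
| hypothetical names for illustration, not the real system's API.

```python
class FlightPlanError(Exception):
    """Raised when a single plan cannot be processed (hypothetical name)."""


def process_plan(plan):
    """Stand-in for real processing; rejects plans without a route."""
    if not plan.get("route"):
        raise FlightPlanError(f"plan {plan['id']} has no usable route")
    return {"id": plan["id"], "waypoints": len(plan["route"])}


def process_queue(plans):
    """Process every plan; quarantine failures instead of stopping the queue."""
    processed, quarantined = [], []
    for plan in plans:
        try:
            processed.append(process_plan(plan))
        except Exception as exc:  # per-job boundary: a failure stays with this job
            quarantined.append((plan["id"], repr(exc)))
    return processed, quarantined


plans = [
    {"id": "A1", "route": ["ALPHA", "BRAVO"]},
    {"id": "B2", "route": []},
    {"id": "C3", "route": ["BRAVO"]},
]
processed, quarantined = process_queue(plans)
```

| The quarantined list is what would be handed to a human for
| separate handling; the other plans keep flowing.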
| crabbone wrote:
| > is essentially totally independent
|
| They physically cannot be independent. The system works on
| an assumption that the flight was accepted and is valid,
| but it cannot place it. What if it accidentally schedules
| another flight in the same time and place?
| Thorentis wrote:
| Except that you can't be sure this bad flight plan doesn't
| contain information that will lead to a collision. The
| system needs to maintain the integrity of _all_ plans it
| sees. If it can't process one, and there's the risk of a
| plane entering airspace with a bad flight plan, you need to
| stop operations.
| lozenge wrote:
| But they have 4 hours to reach out to the one plane whose
| flight plan didn't get processed and tell them to land
| somewhere else.
| ivraatiems wrote:
| Assuming they can identify that plane.
|
| Aviation is incredibly risk-averse, which is part of why
| it's one of the safest modes of travel that exists. I
| can't imagine any aviation administration in a developed
| country being OK with a "yeah just keep going" approach
| in this situation.
| raverbashing wrote:
| And that's why I never (or very rarely) put "this should
| never happen" exceptions anymore in my code
|
| Because you eventually figure out that, yes, it does happen
| PeterStuer wrote:
| So what does your code do when you did not handle the this
| should never happen exception? Exit and print out a
| stacktrace to stdout?
| pmontra wrote:
| A customer of mine is adamant in their resolve to log
| errors, retry a few times, give up and go on with the next
| item to process.
|
| That would have grounded only the plane with the flight
| plan that the UK system could not process.
|
| Still a bug but with fewer effects on the whole continent,
| because planes that could not get inside or outside the UK
| could not fly and that affected all of Europe and possibly
| more.
| crabbone wrote:
| > That would have grounded only the plane with the flight
| plan that the UK system could not process.
|
| By the looks of it, it was a few hours in the air by the
| time the system had a breakdown. Considering it didn't
| know what the problem was, it seems appropriate that it
| shut down. No planes collided, so the worst didn't
| happen.
| airstrike wrote:
| This here is the true takeaway. The bar for writing "this
| should never happen" code must be set so impossibly high
| that it might as well be translated into "'this should
| never happen' should never happen"
| andrewaylett wrote:
| The problem with _that_ is that most programming
| languages aren't sufficiently expressive to be able to
| recognise that, say, only a subset of switch cases are
| actually valid, the others having been already ruled out.
| It's sometimes possible to re-architect to avoid many of
| this kind of issue, but not always.
|
| What you're often led to is "if this happens, there's a
| bug in the code elsewhere" code. It's really hard to know
| what to do in that situation, other than terminate
| whatever unit of work you were trying to complete: the
| only thing you know for sure is that the software doesn't
| accurately model reality.
|
| In this story, there obviously _was_ a bug in the code.
| And the broken algorithm shouldn't have passed review.
| But even so, the _safety critical_ aspect of the
| _complete_ system wasn't compromised, and _that_ part
| worked as specified -- I suspect the system behaviour
| under error conditions was mandated, and I dread to think
| what might have happened if the developers (the company,
| not individuals) were allowed to _actually_ assume errors
| wouldn't happen and let the system continue unchecked.
| samus wrote:
| That reasoning is fine, but it rather seems that the
| programmers triggered this catastrophic "stop the world"
| error because they were not thorough enough considering all
| scenarios. As TA expounds, it seems that neither formal
| methods nor fuzzing were used, which would have gone a long
| way flushing out such errors.
| JumpCrisscross wrote:
| > _it rather seems that the programmers triggered this
| catastrophic "stop the world" error because they were not
| thorough enough considering all scenarios_
|
| Yes. But also, it's an ATC system. Its primary purpose "is
| to prevent collisions..." [1].
|
| If the system encounters a "this should never happen!"
| error, the correct move _is_ to shut it down and ground air
| traffic. (The error shouldn't have happened in the first
| place. But the shutdown should have been more graceful.)
|
| [1] https://en.wikipedia.org/wiki/Air_traffic_control
| crabbone wrote:
| Neither formal methods nor fuzzing would've helped if the
| programmer didn't know that input can repeat. Maybe they
| just didn't read the paragraph in whatever document
| describes how this should work and didn't know about it.
|
| I didn't have to implement flight control software, but I
| had to write some stuff described by MIFID. It's a job from
| hell, if you take it seriously. It's a series of normative
| documents that explains how banks have to interact with
| each other which were published quicker than they could've
| been implemented (and therefore the date they had to take
| effect was rescheduled several times).
|
| These documents aren't structured to answer every question
| a programmer might have. Sometimes the "interesting"
| information is close together. Sometimes you need to guess
| the keyword you need to search for to discover all the
| "interesting" parts... and it could be thousands of pages
| long.
| sublimefire wrote:
| I've only heard from people engineering systems for
| the aerospace industry and we're speaking hundreds of pages
| of api documentation. It is very complex so equally the
| chances of a human error are higher.
| thrdbndndn wrote:
| > Flight Plan Reception Suite Automated (FPRSA-R)
|
| Where does the "-R" come from?
| closewith wrote:
| Replacement.
| sdfghswe wrote:
| Lol that's like me naming my filenames _final2_realfinal
| before I learned about git.
| cjbprime wrote:
| Great post. This part goes too far, I think:
|
| > Human lives were kept safe at all times
|
| > The consequence of all this was not that any human lives were
| put in danger, ..
|
| When you're arguing that cancelling 2000 flights cost £100M
| _and_ that no human danger was incurred, something should feel
| off. That might be around 600k humans who weren't able to be
| where they felt they needed to be. Did they have somewhere safe
| to sleep? Did they have all the medications they needed with
| them? Did they have to miss a scheduled surgery? Could we try to
| measure the effect on their well-being in aggregate, using a
| metric other than the binary state of alive or facing imminent
| death? You get the idea.
|
| Of course I agree with the version of the claim that says that no
| direct danger was caused from the point of view of the failing-
| safe system. But when you're designing a system, it ought to be
| part of your role to wonder where risk is going as you more
| stringently displace it from the singular system and source of
| risk that you maintain.
| [deleted]
| kodt wrote:
| I mean it could have also saved lives by that logic. Did
| someone missing their flight mean they also missed a terrible
| pileup on the roadways after landing? We can imagine pretty
| much any scenario here.
| gabereiser wrote:
| So they forgot to "geographically disparate" fence their queries.
| Having built a flight navigation system before, I know this bug.
| I've seen this bug. I've followed the spec to include a geofence
| to avoid this bug.
| sam0x17 wrote:
| Why on earth do they not have GUIDs for these navigation points
| if the names are not globally unique and inter-region routes
| are commonplace?
| f1shy wrote:
| The names have to be entered manually by pilots, if e.g. they
| change the route. They have to be transmitted over the air by
| humans. So they must be short and simple.
| zarzavat wrote:
| Yes but shouldn't one step of the code be to translate
| these non-unique human-readable identifiers into completely
| unique machine-readable identifiers?
| avianlyric wrote:
| How exactly would you do that? It's impossible to map
| from a dataset of non-unique identifiers to unique
| identifiers without additional data and heuristics. The
| mapping is ambiguous by definition.
|
| The underlying flight plan standards were all created in
| an era of low memory machines, and when humans were
| expected to directly interpret data exactly as the
| programs represented it internally (because serialisation
| and deserialisation is expensive when you need every CPU
| cycle just to run your core algorithms)
| blitzar wrote:
| Clippy: It looks like you are trying to enter a non unique
| navigation point, did you mean the one in France or the one
| in Australia?
| paulddraper wrote:
| Aviation protocols are extremely backwards compatible and
| low-tech compatible.
|
| You need to be able to read, write, hear, and speak the
| identifier. (And receive/transmit in morse code)
|
| Would it be okay to have an "area code prefix" in the
| identifier? Plausible (but practically speaking too late for
| that)
| tortue0 wrote:
| They do and use lat/lon in some cases. Reviewing and
| inputting that (when being done manually) is another story -
| but it's technically possible.
| amoerie wrote:
| Long story: because changing identifiers is a considerable
| refactoring, and it takes coordination with multiple
| worldwide distributed partners to transition safely from the
| old to the new system, all to avoid a hypothetical issue some
| software engineer came up with
|
| Short story: money. It costs money to do things well.
| ftxbro wrote:
| > Long story: because changing identifiers is a
| considerable refactoring
|
| is this what refactoring means
| NBJack wrote:
| Yes. It would cascade into:
|
| Changes in how ATCs operate
|
| Changes in how pilots operate
|
| Changes in how airplanes receive these instructions
| (including the flight software itself, safety systems,
| etc.)
|
| Changes in how airplanes are tested
|
| Changes in how pilots are trained
|
| Etc. In this case, the refactoring requires changes to
| hardware, software, training, manufacturing, and humans.
| ftxbro wrote:
| does refactoring mean literally any non-local change even
| just like changing a variable name, or does it usually
| mean some kind of structural or architectural non-local
| change
| deadfish wrote:
| Pretty sure that is still not the meaning of refactoring.
| As I understand it refactoring should mean no changes to
| the external interface but changes to how it is
| implemented internally.
| epanchin wrote:
| What three words would be a better solution than a guid, as
| transmittable over radio.
| dharmab wrote:
| W3W contains homonyms and words that are easily confused by
| non-native English speakers. Often within just a few km.
| The latter is why ATC uses "niner", to avoid confusing
| "nine" and "nein".
|
| Talk to someone deep in the GIS rabbit hole and you'll get
| a rant about how bad W3W is:
| https://cybergibbons.com/security-2/why-what3words-is-not-
| su...
| chrisweekly wrote:
| That's "What3Words" --
| https://en.m.wikipedia.org/wiki/What3words -- a system for
| representing geographic location using globally-unique word
| triads.
| bsder wrote:
| WTW is a proprietary system that should never be used:
|
| https://www.walklakes.co.uk/opus64534.html
|
| The biggest fault (besides being proprietary) is that you
| must be _online_ in order to use WTW. The times that you
| might need WTW are _ALSO_ the times you are most likely to
| be unable to be online.
| Topgamer7 wrote:
| I would guess because humans have to read this and ascertain
| meaning from it. Not everyone is a technical resource.
| nwallin wrote:
| 1. Pilots occasionally have to fat finger them into
| ruggedized I/O devices and read them off to ATC over radios.
|
| 2. These are defined by the various regional aviation
| authorities. The US FAA will define one list, (and they'll be
| unique in the US) the EU will have one, (EASA?) etc.
|
| The AA965 crash (1995-12-20) was due to an aliased waypoint
| name. Colombia had two waypoints with the same name within
| 150 nautical miles of each other. (the name was 'R') This was
| in violation of ICAO regulations from like the '70s.
|
| https://en.wikipedia.org/wiki/American_Airlines_Flight_965
| gabereiser wrote:
| FAA regulations state that fixes, navs, and waypoints must be
| phonetically transmittable over radio.
|
| I.E. Yankee = YANKY. The pilot and ATC must be location
| aware. Apparently their software is not.
| gavinsyancey wrote:
| It sounds like for actual processing they replace them with
| GPS coordinates (or at least augment them with such). But
| this is the system that is responsible for actually doing
| that...
| tppiotrowski wrote:
| ICAO standard effective from 1978 to only duplicate identifiers
| if more than 600 nmi (690 mi; 1,100 km) apart
| [deleted]
| c7DJTLrn wrote:
| This is an interesting engineering problem and I'm not sure what
| the best approach is. Fail safe and stop the world, or keep
| running and risk danger? I imagine critical systems like
| trading/aerospace have this worked out to some degree.
| crabbone wrote:
| There isn't and cannot be a preference to either one. It always
| depends on what the system is doing and what the consequences
| would be... Pacemaker cannot "fail safe" for example, under no
| circumstances. It's meaningless to consider such cases. But if
| escalation to a human operator is possible, then it will also
| depend on how the system is meant to be used. In some cases
| it's absolutely necessary that the system doesn't try to handle
| errors (eg. if say a patient is in a CT machine -- you always
| want to stop to, at least, prevent more radiation), but in the
| situation like the one with the flight control -- my guess is
| that you want the system to keep trying _while alerting the
| human operator_.
|
| But then it can also depend on what's in the contract and who
| will get the blame for the system functioning incorrectly. My
| guess here is that failing w/o attempting to recover was, while
| an overkill, a safer strategy than to let eg. two airplanes be
| scheduled for the same path (and potentially collide).
| fbdab103 wrote:
| Absolutely no idea on what is correct, but I love to reference
| this article on software practices at NASA[0], They Write the
| Right Stuff.
|
| [0] https://www.fastcompany.com/28121/they-write-right-stuff
| lbriner wrote:
| I seem to remember another problem at NATS which had the same
| effect. Primary fell over so they switched over to a secondary
| that fell over for the exact same reason.
|
| It seems like you should only failover if you know the problem is
| with the primary and not with the software itself. Failing over
| "just because" just reinforces the idea that they didn't have
| enough information exposed to really know what to do.
|
| The bit that makes me feel a bit sick though is that they didn't
| have a method called "ValidateFlightPlan" that throws an error if
| for any reason it couldn't be parsed and that error could be
| handled in a really simple way. What programmer would look at a
| processor of external input and not think, "what do we do with
| bad input that makes it fall over?". I did something today for a
| simple message prompt since I can't guarantee that in all
| scenarios the data I need will be present/correct. Try/catch and
| a simple message to the user "Data could not be processed".
| 1970-01-01 wrote:
| Yep. In electrical terms, you replaced the fuse to watch it
| blow again. There are no more fuses in your shop. Progress?
| d1sxeyes wrote:
| Well, if the primary is known not to be in a good state, you
| might as well fail over and hope that the issue was a fried
| disk or a cosmic bit flip or something.
|
| The real safety feature is the 4 hour lead time before manual
| processing becomes necessary.
|
| One of the key safety controls in aviation is "if this breaks
| for any reason, what do we do", not so much "how do we stop
| this breaking in the first place".
| samus wrote:
| It was in a bad state, but in a very inane way: a flight plan
| in its processing queue was faulty. The system itself was
| mostly fine. It was just not well-written enough to
| distinguish an input error from an internal error, and thus
| didn't just skip the faulty flight plan.
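|
| A toy sketch of that distinction, under loud assumptions: the
| class names, the "UK portion" logic and the reference-data
| check are invented for illustration and are not how FPRSA-R
| works. Input problems are rejected per plan, while violated
| internal invariants still propagate to whatever stop-the-world
| handler sits above.

```python
class InputError(Exception):
    """The data handed to us is unusable; the system itself is assumed fine."""


class InternalError(Exception):
    """An invariant of our own code or reference data was violated."""


def extract_uk_portion(route, uk_waypoints):
    """Return the slice from the first to the last waypoint found in uk_waypoints."""
    if not isinstance(uk_waypoints, (set, frozenset)):
        raise InternalError("reference data must be a set of identifiers")
    inside = [i for i, name in enumerate(route) if name in uk_waypoints]
    if not inside:
        raise InputError("route has no UK portion")
    return route[inside[0]:inside[-1] + 1]


def handle(route, uk_waypoints):
    """Reject bad input for this plan only; let InternalError propagate upward."""
    try:
        return ("ok", extract_uk_portion(route, uk_waypoints))
    except InputError as exc:
        return ("rejected", str(exc))
```

| For example, handle(["A", "B", "C", "D"], {"B", "C"}) gives
| ("ok", ["B", "C"]), while a route with no matching waypoint is
| rejected without touching any other plan.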
| Twirrim wrote:
| at the risk of nitpicking: "a flight plan in its processing
| queue was faulty" isn't true, the flight plan was fine. It
| couldn't process it.
|
| I mention this only because the Daily Mail headline pissed
| me off with its usual bullshit foreigner fear mongering
| crap.
| samus wrote:
| Indeed, that intention is quite transparent in this case.
| Anyways, I suspect that invalid input exists that would
| have made the system react in a similar way
| zaphar wrote:
| I'm no aviation safety controls expert but it seems to me
| that there are two types of controls that should be in place:
|
| 1. Process controls: What do we do when this breaks for any
| reason.
|
| 2. Engineering controls: What can we do to keep this from
| breaking in the first place?
|
| Both of them seem to be somewhat essential for a truly safe
| system.
| jeffrallen wrote:
| One or more of three results can come from the engineering
| exercise of trying to keep something from breaking in the
| first place:
|
| 1. You could know the solution, but it would be too heavy.
|
| 2. You could know the solution, but it would include more
| parts, each of which would need the same process on it, and
| the process might fail the same way
|
| 3. You miss something and it fails anyway, so your "what if
| this fails" path better be well rehearsed and executed.
|
| Real engineering is facing the tradeoffs head on, not hand
| waving them away.
| mixdup wrote:
| It's very hard to ensure you capture every single possible
| failure mode. Yes, the engineering control is important but
| it's not the most critical. What to do if it does fail (for
| any reason) is the truly critical control, because it
| solves for the possibility of not knowing every possible
| way something might fail and therefore missing some way to
| prevent a failure
| sheepshear wrote:
| Failing over is correct because there's no way to discern that
| the hardware is _not_ at fault. They should have designed a
| better response to the second failure to avoid the knock-on
| effects.
| anupj wrote:
| Great writeup
| johnklos wrote:
| So the "engineering teams" couldn't tail /var/log/FPRSA-R.log and
| see the cause of the halt?
|
| I've had servers and software that I had never, ever used before
| stop working, and it took a lot less than four hours to figure
| out what went wrong. I've even dealt with situations where bad
| data caused a primary and secondary to both stop working, and
| I've had to learn how to back out that data and restart things.
|
| Sure, hindsight is easy, but when you have two different systems
| halt while processing the same data, the list of possible causes
| shrinks tremendously.
|
| The lack of competence in the "engineering teams" tells us lots
| about how horribly these supposedly critical systems are managed.
| slingnow wrote:
| Damn, if only you had been there to instantly save the day by
| just running that simple command!
| johnklos wrote:
| No. That's silly. The logs would've / should've just shown
| that the program halted because it was confused about data.
| The actual commands to fix would've been quite different.
| seabass-labrax wrote:
| You're assuming that there is in fact a /var/log/FPRSA-R.log to
| tail - it would not at all surprise me if a system this old is
| still writing its logs to a 5.25 inch floppy in Prestwick or
| Swanwick^1.
|
| ^1: they closed the West Drayton centre about twenty years ago;
| I don't imagine they moved their old IBM 9020D too, if they
| still had it by then. My comment is nonetheless only slightly
| exaggerated ;)
| codeulike wrote:
| This is a great post. My reading of it:
|
| - waypoint names used around the world are not unique
|
| - as a sort of kludge, "In order to avoid confusion latest
| standards state that such identical designators should be
| geographically widely spaced."
|
| - but still you might get the same waypoint name used twice in a
| route to mean different places
|
| - the software was not written with that possibility in mind
|
| - route did not compute
|
| - threw 'critical exception' and entered 'maintenance mode' -
| i.e. crashed
|
| - backup system took over, hit the same bug with the same bit of
| data, also crashed
|
| - support people have a crap time
|
| - it wasn't until they called the software supplier that they
| found the low level logs that revealed the cause of the problem
| noman-land wrote:
| My jaw kept dropping with each new bullet point.
| xvector wrote:
| Same, is aviation technology really this primitive?
| H8crilA wrote:
| It is mostly quite primitive, but it also works amazingly
| well. For example ILS or VOR or ATC audio comms can all be
| received and read correctly using hardware built from entry
| level ham radio knowledge. Altimeters still require a
| manual input of pressure. Fuel levels can be checked with
| sticks.
|
| Kinda the opposite of a modern web/mobile app, complicated,
| massively bloated and breaks rather often :).
| rozap wrote:
| shhh, nobody tell xvector that unleaded avgas finally
| happened in 2022 :)
| [deleted]
| teleforce wrote:
| Thanks for the summary and TL;DR.
|
| Essentially this is down to the lack of proper namespace, who'd
| have thought aerospace engineers need to study operating
| systems! I've a friend who's a retired air force pilot and
| graduated from Cranfield University, the UK's foremost
| postgraduate institution for aerospace engineering with their
| own airport for teaching and research [1]. According to him he
| did study OS in Cranfield, and now I finally understand why.
|
| Apparently based on the other comments, the standard for
| namespace is already available but currently it's not being
| used by the NATS/ATC, hopefully they've learnt their lessons
| and will start using it for goodness sake. The top comment
| mentioned the geofencing bug, but if NATS/ATC is using a proper
| namespace, geofencing is probably not necessary in the first
| place.
|
| [1] Cranfield University:
|
| https://en.wikipedia.org/wiki/Cranfield_University
| seabass-labrax wrote:
| It sounds like a great place to study that has its own ~2km
| long airstrip! It would be nice if they had a spare Trident
| or Hercules just lying around for student baggage transport
| :)
| dboreham wrote:
| "software supplier"??? Why on God's green earth isn't someone
| familiar with the code on 7/24 pager duty for a system with
| this level of mission criticality?
| sublimefire wrote:
| I think there is a bit of ignorance about how software is
| sold in some cases. This is not just some Windows or browser
| application that was sold; the sale also included staff
| training, help procuring hardware to run the software,
| and maybe even more. Such systems get closed off from the
| outside without a way to send telemetry to the public
| internet (I've seen this before, it is bizarre and hard to
| deal with). The contract would have some clauses that deal
| with such situations where you will always have someone on
| call as the last line of defense if a critical issue happens.
| Otherwise, the trained teams should have been able to deal
| with it but could not.
| seabass-labrax wrote:
| That would be... the software supplier. This is quite a
| specific fault (albeit one that shouldn't have happened if
| better programming practices had been used), so I don't think
| anyone but the software's original developers would know what
| to do. This system is not safety-critical, luckily.
| sp0ck wrote:
| Small suggestion. Don't choose an obscure language (in terms of
| popularity, 28th on the TIOBE index with a 0.65% rating) to visualize
| structures and algorithms. Otherwise you risk the average viewer
| stopping reading the moment he encounters code samples. There are 27
| more popular languages, some of them by orders of magnitude.
| louthy wrote:
| Maybe he doesn't care if people stop reading and he'd prefer to
| use the language he's most comfortable with? It's his blog
| after all, not yours.
|
| Additionally, perhaps he's making the point that a language
| with an expressive type system makes solving problems like this
| trivial.
| sp0ck wrote:
| If you don't care about readers reading it or not then what
| is the point of publishing an article?
| louthy wrote:
| I read it. Probably lots of other people did too.
| Presumably the people who don't think computer science
| begins and ends with JavaScript
| recursive wrote:
| Why does there have to be a point? If there is one, why do
| you need to understand it?
| rjh29 wrote:
| The code is a relatively small part of the article, and quite
| far into it I might add.
| daaaaaaan wrote:
| I appreciated the Haskell examples, they aren't particularly
| hard to follow. How do you think those more popular languages
| _got_ more popular?
| redleader55 wrote:
| I imagine, for this kind of system, there is only one supplier.
| Why not force that supplier, as part of their 10-15 yr contract,
| to publish the source code for everything, not necessarily as
| FOSS. This way if there are bugs they can be reported and fixed.
| passwordoops wrote:
| I agree. But this would assume that:
|
| 1- the people writing and approving the specs even understand
| why this might be a good suggestion
|
| 2- the people ultimately approving the contract aren't in bed
| with the supplier
| dboreham wrote:
| There's always prison for those people.
| FeepingCreature wrote:
| 3- the people operating the system are capable of maintaining
| its source code
| SillyUsername wrote:
| Bugs happen. Fact of being written by fleshy meatballs. What
| should also have been highlighted is that they clearly had no
| easy way of finding the specific buggy input in the logs nor
| simulating it without contacting the manufacturer.
| FeepingCreature wrote:
| No way or no procedure.
| throw74848 wrote:
| [flagged]
| tantalor wrote:
| > The programming style is very imperative
|
| Is that supposed to be a meaningful statement?
| tome wrote:
| Yes, typically it would be used to mean things like the code
| mutates data in place rather than using persistent data
| structures, explicitly loops over data rather than using
| higher-order map, fold etc. operations, and explicitly checks
| tag bits rather than using sum types.
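|
| As a rough illustration of the first two contrasts (a sketch in
| Python rather than the article's Haskell; the function names and
| the sample leg distances are made up for this example):

```python
from functools import reduce

def total_distance_imperative(legs):
    # Imperative style: an explicit loop that reassigns an accumulator.
    total = 0
    for leg in legs:
        total += leg
    return total

def total_distance_functional(legs):
    # Functional style: a fold over the data; no variable is reassigned.
    return reduce(lambda acc, leg: acc + leg, legs, 0)

# Hypothetical leg lengths in nautical miles.
legs_nm = [120, 340, 95]
imperative_total = total_distance_imperative(legs_nm)
functional_total = total_distance_functional(legs_nm)
```

| Both give the same answer; the difference is only in how the
| computation is expressed.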
| amiga386 wrote:
| It tells you the author of the blogpost is one of those
| functional programming proselytizers, you can determine that
| just by how they sneer. So yes, it is a meaningful statement,
| but the meaning says more about the author than what they're
| commenting on.
|
| They similarly reveal their biases when they say "the mistake
| of the faulty algorithm [...] maintaining pointers into each of
| them"
|
| Lo and behold, the author chooses Haskell at the end to
| demonstrate how they'd do it. Such pure, very monoid in the
| category of endofunctors.
| tantalor wrote:
| Thanks for making me laugh :D
| tome wrote:
| > the author of the blogpost is one of those functional
| programming proselytizers ... Such pure, very monoid in the
| category of endofunctors
|
| Sorry, who's sneering here?
| [deleted]
| supernova87a wrote:
| Did the creator of the flight plan software engage in adversarial
| testing to see if they could break the system with badly formed
| flight plans? Or was / is the typical practice to mostly just see
| if the system meets just the "well-behaved" flight plan
| processing requirements? (with unit tests, etc)
| bombcar wrote:
| I think we all know the answer to this.
|
| A huge portion of "exploits" in the last 20 years have been
| "internal business APIs" if you will being exposed to malicious
| actors.
| jacquesm wrote:
| Trusted input rarely should be trusted. It's input. You need to
| validate it as if it is hostile and have a process for dealing
| with malformed input. Now of course, standing by the sidelines it
| is easy to criticize and I'm sure whoever worked on this wasn't
| stupid. But I've seen this error often enough now in practice
| that I think that it needs to be drilled into programmers' heads
| more forcefully: stuff is only valid if you have _just_ validated
| it. If you send it to someone else, if someone you trust sends it
| to you, if you store in a database and then retrieve it and so on
| then it is just input all over again and you _probably_ should
| validate it for being well-formed. If you don't do that then
| you're a bitflip, migration or an update away from an error that
| will cause your system to go into an unstable state and the real
| problem is that you might just propagate the error downstream
| because you didn't identify it.
|
| Input is hard. Judging what constitutes 'input' in the first
| place can be harder.
| lgeorget wrote:
| From what I gathered from the article, the input WAS valid.
| It's the software that was unable to handle a specific case of
| valid input.
| jacquesm wrote:
| That's fine, and is exactly the kind of case that I was
| thinking of: your software has a different idea of what is
| valid than an upstream piece of software, so from _your_
| perspective it is invalid. So you need to pull this message
| out of the stream, sideline it so it can be looked at by
| someone qualified enough to make the call of what's the case
| (because it could well be either way) and processing for all
| other messages should continue as normal. After all the only
| reason you can say with confidence that it in fact was valid
| is because someone looked at it! You can only do that well
| after the fact.
|
| A message switch [1] that I worked on had to deal with
| messages sourced from 100s of different parties and while in
| principle everybody was working from the same spec (CCITT
| [2]) every day some malformed messages would land in the
| 'error' queue. Usually the problem was on the side of the
| sender, but sometimes (fortunately rarely) it wasn't and then
| the software would be improved to be able to handle that case
| correctly as well. Given the size of the specs and the many
| variations on the protocols it wasn't weird at all to see
| parties get confused. What's surprising is that it happens as
| rarely as it does.
|
| The big takeaway here should be that even if something
| happens very rarely it should still not result in a massive
| cascade, the system should handle this gracefully.
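|
| A minimal sketch of that "sideline it and carry on" pattern, in
| Python. The message format, the parse() rules and the names are
| all invented for illustration; a real switch would persist the
| error queue and alert an operator:

```python
from collections import deque

class MalformedMessage(Exception):
    """Raised when a message does not match what *this* system expects."""

def parse(raw):
    # Hypothetical format: "CALLSIGN:WAYPOINT-WAYPOINT-..."
    callsign, sep, route = raw.partition(":")
    if not sep or not callsign or not route:
        raise MalformedMessage(f"cannot parse {raw!r}")
    return callsign, route.split("-")

def process_stream(raw_messages):
    accepted, error_queue = [], deque()
    for raw in raw_messages:
        try:
            accepted.append(parse(raw))
        except Exception as exc:
            # Sideline the message for a human to inspect later,
            # then keep processing everything else.
            error_queue.append((raw, repr(exc)))
    return accepted, error_queue

accepted, error_queue = process_stream(
    ["AB123:AAAAA-BBBBB", "garbage", "CD456:CCCCC"]
)
```

| The one bad message ends up in error_queue; the other two are
| processed as normal.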
|
| [1] https://www.kvsa.nl/en/
|
| [2] https://en.wikipedia.org/wiki/Group_4_compression
| dboreham wrote:
| Exact same experience developing systems that process
| RFC-822 (and descendants) email messages.
| onetimeuse92304 wrote:
| This really isn't about input. Whether it comes from outside or
| is produced inside the application, the reality is that everything
| can have bugs. A correct input can cause a buggy application to
| fail. So while verifying input is obviously an important step,
| it is not even a beginning if you are really looking to
| build reliable software.
|
| What really is the heart of the matter is that the entire thing
| was allowed to crash due to a problem with a single
| transaction.
|
| What you really want to do is to have firewalls. For example,
| you want a separate module that runs individual transactions
| and a separate shell that orchestrates everything but has no or
| very limited contact with the individual transactions. As bad
| as giving up on processing a single aircraft is, allowing the
| problem to cascade to the entire system is way worse.
|
| What's even more tragic about this monumental waste of
| resources is that the knowledge about how to do all of this is
| readily available. The aerospace and automotive industry have
| very high development standards along with people you can hire
| who know those standards and how to use them to write reliable
| software.
| jacquesm wrote:
| Yes, there are multiple problems here that interplay in a
| really bad way and that's one of them. But the input
| processing/validation step is the first point of contact with
| that particular flight plan and it should have never
| progressed beyond that state.
|
| It all hinges on a whole bunch of assumptions and each and
| every one of those should be dealt with structurally rather
| than by patching things over.
|
| Just from reading TFA I see a very long list of things that
| would need attention. Quick recap:
|
| - validate all input
|
| - ensure the system can never stall on any one record
|
| - the system will occasionally come across malformed input
| which needs a process
|
| - it won't be immediately clear whether the system or the
| input is at fault, which needs a process
|
| - testing will need to take these scenarios into account
|
| - negative tests will need to be created (such as:
| purposefully malformed input)
|
| - attempts should be made to force the system into undefined
| states using malformed _and_ well formed input
|
| - a supervisor mechanism needs to be built into the system
| that checks overall system health
|
| And probably many more besides. But this is what I gather
| from the article is what they'll need at a minimum. Typically
| once you start digging into what it would take to implement
| any of these you'll run into new things that also need
| fixing.
|
| As for the last bit of your comment: I'm quite sure that
| those standards were in play for this particular piece of
| software, the question is whether or not they were properly
| applied and even then there are no guarantees against
| mistakes, they can and do happen. All that those standards
| manage to do is to reduce their frequency by catching the
| bulk of them. But some do slip through, and always will.
| Perfect software never is.
| [deleted]
| [deleted]
| failbuffer wrote:
| > The manufacturer was able to offer further expertise including
| analysis of lower-level software logs which led to identification
| of the likely flight plan that had caused the software exception.
|
| This part stood out to me. I've found it super helpful to include
| a reference to which piece of data I'm working with in log
| messages and exceptions. It helps isolate problems so much
| faster.
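|
| Something along these lines, as a sketch in Python (the exception
| class, the "FP-0001" identifier and the check itself are
| hypothetical, just to show the record ID travelling with the
| error):

```python
class FlightPlanError(Exception):
    def __init__(self, plan_id, message):
        # Put the record identifier into the message itself, so it
        # shows up in every log line and traceback.
        super().__init__(f"flight plan {plan_id}: {message}")
        self.plan_id = plan_id

def extract_segment(plan_id, waypoints):
    if len(waypoints) < 2:
        raise FlightPlanError(
            plan_id, f"need at least 2 waypoints, got {len(waypoints)}"
        )
    return waypoints[0], waypoints[-1]

try:
    extract_segment("FP-0001", ["AAAAA"])
except FlightPlanError as exc:
    message = str(exc)
```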
| throw7 wrote:
| Well, I certainly hope they've at least stopped issuing waypoints
| with identical names... although it wouldn't surprise me if
| geographically-distant is the best we can do as a species.
| SoftTalker wrote:
| They appear to be sequences of 5 upper-case letters. Assuming
| the 26-character alphabet, that should allow for nearly 12
| million unique waypoint IDs. The world is a big place but that
| seems like it should be enough. The more likely problem is that
| there is (or was) no internationally-recognized authority in
| charge of handing out waypoint IDs, so we have at least legacy
| duplicates if not potential new ones.
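|
| The arithmetic behind "nearly 12 million", as a quick Python
| check (this counts every 5-letter string, pronounceable or not):

```python
import string

letters = len(string.ascii_uppercase)  # 26 letters, A to Z
total_ids = letters ** 5               # every possible 5-letter ID
```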
| seabass-labrax wrote:
| You have to reduce that to the (still massive) set of IDs
| that are somewhat pronounceable in languages that use the
| Latin script. You don't want to be the air traffic controller
| trying to work out how to say 'Lufthansa 451, fly direct
| QXKCD'. Nonetheless, I think there is little cause for
| concern about changing existing IDs. There might be
| sentimental attachment, but it takes barely a few flights
| before the new IDs start sticking, and it's not like pilots
| never fly new routes.
| SoftTalker wrote:
| I thought that is what the "ICAO pronunciation" was for?
|
| "Fly direct Quebec Xray Kilo Charlie Delta"
| seabass-labrax wrote:
| It is, but fixes are almost always spoken as words rather
| than letter-by-letter. For this reason, they are usually
| chosen to be somewhat pronounceable, and occasionally you
| even get jokes in the names. Likewise, radio beacons and
| airports are usually referred to by the name of their
| location; for instance "proceed direct Dover" rather than
| "proceed direct Delta Victor Romeo".
|
| I think a lot of pilots and air traffic controllers would
| be irritated if they had to spend longer reading out
| clearances and instructions. In a world where vocal
| communication is still the primary method of air traffic
| control, there might be a measurable reduction in
| capacity in some busier regions.
| drachir91 wrote:
| No, waypoints aren't spelled out with the ICAO alphabet.
| They are mnemonics that are pronounced as a word and only
| spelled out if the person on the receiving end requests
| it because of bad radio reception, or unfamiliarity with
| the area/waypoint.
|
| For example, Hungarian waypoints, at least the more
| important ones are normally named after cities, towns or
| other geographical locations near them, and use the
| locations name or abbreviated name, being careful that
| they can be pronounced reasonably easily for English
| speakers. Like: ERGOM (for the city Esztergom), ABONY
| (for the town Fuzesabony), SOPRO (for Sopron), etc.
| lgeorget wrote:
| Not all 5-character long strings are usable though. They have
| to be pronounceable as a single word and understandable over
| radio as much as possible.
| dundarious wrote:
| I wish the article contained some explanation of why the
| processing for NATS requires looking at both the ADEXP waypoints
| _and_ the ICAO4444 waypoints (not a criticism per se, it may not
| have been addressed in the underlying report). Just looking at
| the ADEXP seems sufficient for the UK segment logic.
|
| I'm guessing it has something to do with how ICAO4444 is
| technically human readable, and how in some meaningful sense,
| pilots and ATC staff "prefer" it. e.g., maybe all ICAO4444
| waypoints are "significant" to humans (like international
| airports), whereas ADEXP waypoints are often "insignificant"
| (local airports, or even locations without any runway at all).
|
| Of course with 20/20 hindsight, it seems obviously incorrect to
| loop through the ICAO4444 waypoints in their entirety, instead of
| "resuming" from an advanced position. But why look at them at
| all?
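|
| A toy version of that difference, sketched in Python. This is not
| the actual NATS algorithm, and the waypoint names are invented; it
| only shows how a duplicate name earlier in the route misleads a
| search that starts from the beginning:

```python
def find_exit_from_start(route, exit_name):
    # Scans the whole route and returns the first match, even if
    # that match lies before the entry point.
    return route.index(exit_name)

def find_exit_after(route, exit_name, entry_index):
    # Resumes the search just past the entry point.
    return route.index(exit_name, entry_index + 1)

# Hypothetical route in which "DUPLI" names two different places.
route = ["DUPLI", "ALPHA", "ENTRY", "MIDDL", "DUPLI", "OMEGA"]
entry_index = route.index("ENTRY")

naive = find_exit_from_start(route, "DUPLI")
resumed = find_exit_after(route, "DUPLI", entry_index)
```

| The naive search puts the exit before the entry, which makes no
| sense for a flight; the resumed search finds the later waypoint.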
| masklinn wrote:
| Possibly it needs the ICAO information to communicate with some
| systems, but has to work in ADEXP to have sufficient
| granularity (the essay mentions the possibility of "clipping",
| a flight going through the UK between two ICAO waypoints).
| dundarious wrote:
| Yes, I'm essentially wanting to know more about those
| existing ICAO-based systems, be they machine or not.
| Gud wrote:
| A day I don't want to remember. Took me 15 hours to reach my
| destination instead of 2. Had to take train, bus, then train
| again. 30 minutes after I had booked my tickets, everything was
| fully booked for two days.
| conradfr wrote:
| Did you meet John Candy along the way?
| hermitcrab wrote:
| >"in typical Mail Online reporting style: "Did blunder by French
| airline spark air traffic control issues?"
|
| The Daily Mail is a horrible, right-wing paper in the UK that
| blames 'foreigners' for everything. Particularly the French.
|
| Out of curiosity, is there a corresponding French paper that
| blames the English or the British for everything?
| dopidopHN wrote:
| French here, as much as I wish It was the case for comical
| effect... I don't think so.
|
| Our right wing press is also desperately economically liberal
| so anything privately run is inherently better.
|
| Maybe radio stations? Honestly, major respect to the daily mail
| for those snarky attacks that keep up the good spirits between
| our two countries.
|
| It's maybe the food or the weather that make them aggro ? Idk,
| but don't worry, we love to hate the perfide Albion. Too.
|
| Fellow French: am I wrong ? Maybe "valeur actuelle" could pull
| up that type of bullshit, but I think they are too busy blaming
| Islam to start thinking about our former colony across the
| channel.
| hermitcrab wrote:
| >major respect to the daily mail for those snarky attacks
|
| There is really nothing to like or respect about the Daily
| Mail. https://www.globaljustice.org.uk/blog/2017/10/horrible-
| histo...
|
| >our former colony across the channel
|
| Touche! ;0)
| seszett wrote:
| Well not really. People in France don't really care that much
| about England.
|
| The one country that is often blamed for problems is rather
| Germany, but honestly even Germany doesn't get blamed for petty
| problems like that.
| rcostin2k2 wrote:
| The fact that they blamed the French flight plan already accepted
| by Eurocontrol proves that they didn't really know how the
| software works. And here the Austrian company should take part of
| the blame for the lack of intensive testing.
| littlestymaar wrote:
| They blamed the French because they are British, that's it.
| It's hard to get rid of bad habits.
| jliptzin wrote:
| What I don't understand in situations like this when thousands of
| flights are cancelled is how do they catch up? It always seems
| like flights are at max capacity at all times, at least when I
| fly. If they cancel 1,000 flights in one day, how do they absorb
| that extra volume and get everyone where they need to be? Surely
| a lot of people have their plans permanently cancelled?
| CamelCaseName wrote:
| There's always some empty capacity, whether it's non-rev
| tickets for flight crew and their families which are lower
| priority than paying customers or people who miss their
| flights.
|
| I had a cancelled flight recently and they booked people two
| weeks out because every flight from that day onward was full or
| nearly full. I showed up the next morning and was able to board
| the next flight because exactly one person had scanned in their
| boarding pass (was present at the airport) but did not show up
| for whatever reason to the airplane.
|
| Beyond that, people just make alternate plans, whether it's
| taking a bus or taxi home, traveling elsewhere, picking another
| airline, anything is possible.
| thedrbrian wrote:
| You don't.
|
| I work in logistics for a FMCG company and sometimes our main
| producer goes down and we run out of certain types of stock. We
| send as much out as we can and cancel the rest.
|
| If they really want the stock the customers can rebook an order
| for tomorrow because they aren't getting it today. And we just
| start adding extra stock to each delivery.
|
| It's the best of a bad situation.
|
| We don't have the money to have extra trucks and very
| perishable stock laying about and I know the airlines don't pay
| 300 grand a month to lease a 737 just to have it sat about
| doing nothing. There's very little slack.
| worik wrote:
| I heard in the news that this was caused by a "bad flight plan".
|
| It is clear, even without any more information than that, it was
| a software failure (bad flight plan?)
|
| It will be interesting to see if Frequentis has to pay a price
| for causing this
| cja wrote:
| Every system I've ever made has better error reporting than that
| one. Even those that only I use. First thing I get working in a
| new project is the system to tell me when something fails and to
| help me understand and fix the problem quickly. I then use that
| system throughout development such that it works very well in
| production. I'd love to talk to the people who made the system
| discussed in the article. Is one of them reading this? Can you
| explain how come this problem reported itself so badly?
| rglover wrote:
| This is one of the many reasons there should be a universal data
| standard using a format like JSON. Heavily structured, easy to
| parse, easy to debug. What you lose in footprint (i.e., more disk
| space), you gain in system stability.
|
| Imagine a world where everybody uses JSON and if they offer an
| API, you can just consume the data without a bunch of hoop
| jumping. Failures like this would vanish overnight.
| masklinn wrote:
| The bug here was a processing one, having the data in json
| would make no difference.
| 0xffff2 wrote:
| Broadly speaking I think this is done for new systems. What you
| need to identify here is how and when you transition legacy
| systems to this new better standard of practice.
| rglover wrote:
| I'd argue in favor of at _least_ an annual review process.
| Have a dedicated "feature freeze, emergencies only" period
| where you evaluate your existing data structures and queue up
| any necessary work. The only real hang up here is one of bad
| management.
|
| In terms of how, it's really just a question of Schema A to
| Schema B mapping. Have a small team responsible for
| collection/organization of all the possible schemas and then
| another small team responsible for writing the mapping
| functions to transition existing data.
|
| It would require will/force. Ideally, too, jobs of those
| responsible would be dependent on completion of the task so
| you couldn't just kick the can. You either do it and do it
| correctly or you're shopping your resume around.
| tristor wrote:
| The problem is systems written in the 1970s in FORTRAN to run
| on Mainframes don't speak JSON.
| rglover wrote:
| Great. It should be fixed by replacing the FORTRAN systems
| with a modern solution. It's not that it can't be done, it's
| that the engineers don't bother to start the process (which
| is a side-effect of bad incentive structure at the employment
| level).
| fullspectrumdev wrote:
| Have you ever been involved in such a migration?
|
| It's invariably a complete clusterfuck.
| rglover wrote:
| I haven't, but I'd love to. My approach wouldn't be very
| "HR friendly," though.
| count wrote:
| Ah yes, migration through sheer force of will.
| jakub_g wrote:
| It's trivial. Only took Amadeus hundreds of developers
| working for over a decade to migrate off TPF. /s
|
| [0] https://amadeus.com/en/insights/blog/celebrating-one-
| year-fu...
| rglover wrote:
| In some sense, yes. Notice that most of the responses to
| what I've said are immediately negative or dismissive of
| the idea. If that's the starting point (bad mindset), of
| course nothing gets fixed and you land where we are
| today.
|
| My initial approach would be to weed out anyone with that
| point of view before any work took place (the "not HR
| friendly" part being to be purposefully exclusionary).
| The only way a problem of this scope/scale can be solved
| is by a team of people with extremely thick skin who are
| comfortable grabbing a beer and telling jokes after they
| spent the day telling each other to go f*ck themselves.
| tristor wrote:
| Anyone who has worked with me knows that I have no issue
| coming in like a wrecking ball in order to make things
| happen, when necessary. I've also been involved in some
| of these migration projects. I think your take on the
| complexity of these projects (and I do mean inherent
| complexity, not incidental complexity) and the responses
| you've received is exceptionally naive.
|
| The amount of wise-cracks and beers your team can handle
| after a work day is not the determining factor in
| success. /Most/ of these organizations /want/ to migrate
| these systems to something better. There is political
| will and budget to do so, yet these are still inglorious
| multi-decade slogs which cannot fail, ever, because
| failure means people die. No amount of attitude will
| change that.
| tristor wrote:
| That's... not how that works. I take it you're probably
| more of a frontend person than a backend person by this
| comment. In the backend world, you usually can't fully and
| completely replace old systems, you can only replace parts
| of systems while maintaining full backwards compatibility.
| The most critical systems in the world -- healthcare,
| transportation, military, and banking -- all run on
| mainframes still, for the most part. This isn't a
| coincidence. When these systems get migrated, any issues,
| including issues of backwards compatibility cause people to
| /DIE/. This isn't an issue of a button being two pixels to
| the left after you bump frontend platform revs, these
| systems are relied on for the lives and livelihood of
| millions of people, every single day.
|
| I am totally with you wishing these systems were more
| modern, having worked with them extensively, but I'm also
| realistic about the prospect. If every major airline
| regulator in the world worked on upgrading their ATC
| systems to something modern by 2023 standards, and
| everything went perfectly, we could expect to no longer
| need backwards compatibility with the old system sometime
| in 2050, and that's /very/ optimistic. These systems are
| basically why IBM is still in business, frankly.
| mprovost wrote:
| No migration of this magnitude is blocked because of
| engineers not "bothering" to start the process. Imagine how
| many approvals you'd need, plus getting budget from who-
| knows how many government departments. Someone is paying
| for your time as an engineer and they decide what you work
| on. I'm glad we live in a world where engineers can't just
| decide to rewrite a life or death system because it's
| written in an old(er) programming language. (Not that there
| is any evidence that this specific system is written in
| anything older than C++ or maybe Ada.)
| fbdab103 wrote:
| I guess we should rewrite it in Rust.
|
| Airplane logistics feels like one of the most complicated
| systems running today. A single airline has to track
| millions of entities: planes, parts, engineers, luggage,
| cargo, passengers, pilots, gate agents, maintenance
| schedules, etc. Most of which was created all before best-
| practices were a thing. Not only is the software complex,
| but there are probably millions of devices in the world
| expecting exactly format X and will never be upgraded.
|
| I have no doubt that eventually the software will be Ship
| of Theseus-ed into something approaching sanity, but there
| are likely to be glaciers of tech debt which cannot be
| abstracted away in anything less than decades of work.
| seabass-labrax wrote:
| It would still be valuable to replace components piece-
| by-piece, starting with rigorously defining internal data
| structures and publicly providing schemas for existing
| data structures so that companies can incorporate them.
|
| I would like to point out that the article (and the
| incident) does not relate to airline systems; it is to do
| with Eurocontrol and NATS and their respective commercial
| suppliers of software.
| tjohns wrote:
| Many of them have been upgraded. In the US, we've replaced
| HOST (the old ATC backend system) with ERAM (the modern
| replacement) as of 2015.
|
| However, you have to remember this is a global problem. You
| need to maintain 100% backwards compatibility with every
| country on the planet. So even if you upgrade your
| country's systems to something modern, you still have to
| support old analog communication links and industry
| standard data formats.
| vb-8448 wrote:
| It won't fix anything. JSON is the "standard" today, 15 years
| ago it was XML and in 15 years we will have protobuf or another
| new standard.
| rglover wrote:
| Correct. The other "leg" of a solution to this problem would
| be to codify migration practices so stagnation at the tech
| level is a non issue long-term.
| vb-8448 wrote:
| > codify migration practices
|
| I think this won't work: no one really wants to touch a
| system that works, and people will try to find any excuse
| to avoid migrating. The reason for this is that everyone
| prefers systems that work and fail in known ways rather than new
| systems whose failure modes no one knows.
| rglover wrote:
| Does the system work if it randomly fails and collapses
| the entire system for days?
|
| People generally prefer to be lazy and to not use their
| brains, show up, and receive a paycheck for the minimum
| amount of effort. Not to be rude, but that's where this
| attitude originates. Having a codified process means that
| attitude _can't_ exist because you're given all of the
| tools you need to solve the problem.
| vb-8448 wrote:
| > Having a codified process means that attitude can't
| exist because you're given all of the tools you need to
| solve the problem.
|
| Yes, but in real life it doesn't work. Processes have corner
| cases. As you said, people are lazy and will do
| everything to find the corner case to fit in.
|
| Just an example from the banking sector. There are
| processes (and even laws) that force banks to use only
| certified, supported and regularly patched software:
| there are still a lot of Windows 2000 servers in their
| datacenters and will be there for many years.
| recursive wrote:
| You could do all that stuff.
|
| But after you did it, you'd still have exactly the same
| problem. The cause was not related to deserialization. That
| part worked perfectly. The problem is the business logic
| that applied to the model after the message was parsed.
| smarx007 wrote:
| There are already standards like XML and RDF Turtle that allow
| you to clearly communicate vocabulary, such that a property
| 'iso3779:vin' (shorthand for a made-up URI
| 'https://ns.iso.org/standard/52200#vin') is interpreted in the
| same way anywhere in the structures and across API endpoints
| across companies (unlike JSON, where you need to fight both the
| existence of multiple labels like 'vin', 'vin_no', 'vinNumber',
| as well as the fact that the meaning of a property is strongly
| connected to its place in the JSON tree). The problem is that
| the added burden is not respected at the small scale and once
| large scale is reached, the switching costs are too big. And
| that XML is not cool, naturally.
|
| On top of that, RDF Turtle is the only widely used standard
| _graph_ data format (as opposed to tree-based formats like JSON
| and XML). This allows you to reduce the hoop jumping when
| consuming responses from multiple APIs as graph union is a
| trivial operation, while n-way tree merging is not.
|
| Finally, RDF Turtle promotes use of URIs as primary identifiers
| (the ones exposed to the API consumers) instead of primary
| keys, bespoke tokens, or UUIDs. Following this rule makes all
| identifiers globally unique and dereferenceable (ie, the ID
| contains the necessary information on how to fetch the resource
| identified by a given ID).
|
| P.S.: The problem at hand was caused by the algorithm that was
| processing the parsed data, not with the parsing per se. The
| only improvement a better data format like RDF Turtle would
| bring is that two different waypoints with the same label would
| have two different URI identifiers.
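|
| To make that last point concrete, a small Python sketch. The
| example.org URIs, the "DUPLI" label and the coordinates are all
| made up; no real waypoint registry is implied:

```python
# (uri, label, (lat, lon)) records for two different places that
# happen to share a label. All values are hypothetical.
records = [
    ("https://example.org/waypoint/us/DUPLI", "DUPLI", (48.1, -98.9)),
    ("https://example.org/waypoint/fr/DUPLI", "DUPLI", (49.3, 0.3)),
]

waypoints_by_label = {}
waypoints_by_uri = {}
for uri, label, position in records:
    # Keyed by label: the second record silently overwrites the first.
    waypoints_by_label[label] = position
    # Keyed by URI: both records survive as distinct entries.
    waypoints_by_uri[uri] = position
```

| The label-keyed table ends up with one entry, the URI-keyed table
| with two. The processing code would still have to use the URI,
| not the label, when matching points along a route.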
| seabass-labrax wrote:
| Furthermore, there are _already_ XML namespaces for flight
| plans. These are not, however, used by ATC - only by pilots
| to load new routes into their aircraft's navigation
| computers.
|
| I'm not sure whether there is an existing RDF ontology for
| flight plans; it would probably be of low to medium
| complexity considering how powerful RDF is and the kind of
| global-scale users it already has.
| fbdab103 wrote:
| Airport software predates basically every standard on the
| planet. I would not be surprised to learn that they have their
| own bizarro world implementation of ASCII, unix epoch time,
| etc.
| tjohns wrote:
| Yes, FPL messages are sent over AFTN, which uses ITA-2 Baudot
| code instead of ASCII:
| https://en.wikipedia.org/wiki/Baudot_code
|
| The keyboards used by ATC don't even allow entering symbols:
| https://www.reddit.com/media?url=https%3A%2F%2Fi.redd.it%2F1.
| ..
|
| (There is a modern replacement for AFTN called AMHS, which
| replaces analog phone lines with X.400 messages over IP...
| but the system still needs to be backwards compatible for ATC
| units still using analog links.)
| nemetroid wrote:
| There are several XML formats for expressing flight plans, most
| notably ARINC 633 and FIXM.
| dundarious wrote:
| Parsing the data formats had zero contribution to the problem.
| They had a problem running an algorithm on the input data, and
| error reporting when that algorithm failed. Nothing about JSON
| would improve the situation.
| rglover wrote:
| Yes, but look at the data. The algorithm was buggy because
| the input data is a nightmare. If the data didn't look like
| that, it's very unlikely the bug(s) would have ever existed.
| zimpenfish wrote:
| > The algorithm was buggy because the input data is a
| nightmare.
|
| No, the algorithm was "buggy" because it didn't account for
| the entry to and exit points from the UK to have the same
| designation because they're supposed to be geographically
| distant (they were 4000 NM apart!) and the UK ain't that
| big.
| dundarious wrote:
| ADEXP sounds like the universal data standard you want
| then. The UK just has an existing NATS that cannot
| understand it without transformation by this problematic
| algorithm. So the significant part of your suggestion might
| be to elide the NATS specific processing and upgrade NATS
| to use ADEXP directly.
|
| Using a JSON format changes nothing. Just adds a few more
| characters to the text representation.
| schainks wrote:
| I have seen bad outages caused by valid JSON whose
| consumer implemented something incorrectly.
|
| I agree with dundarious that "doing this in JSON" would not
| have changed the likelihood the bug could have manifested.
| rglover wrote:
| No change at all? I find that hard to believe. There's
| also a data design problem here, but the structure of
| JSON would aid in, not subtract from, that process.
|
| The question at hand is: "heavily structured data vs. a
| blob of text as input into a complex algorithm, which one
| is preferred?"
|
| Unless you're lying, you'd choose the former given the
| option.
| dundarious wrote:
| The issue is using _both_ ADEXP and ICAO4444 waypoints,
| and doing so in a sloppy way. For the waypoint lists,
| there is no issue with structurelessness -- the fact that
| they're lists is pretty obvious, even in the existing
| formats. Adding some ["",] would not have helped the
| specific problem, as the relevant structure was already
| perfectly clear to the implementers. I am not lying when
| I say the bug would have been equally likely in a JSON
| format in this specific case.
| schainks wrote:
| Now I'm wigging out to the idea of how the act of
| overcoming the inertia of the existing system just to
| migrate to JSON would spawn thousands of bugs on its own
| -- many life-threatening, surely.
| schainks wrote:
| These old standards ARE heavily structured data, despite
| what their formatting or lack of punctuation suggests.
| numpad0 wrote:
| To me an XML-ified version of this would look more nightmarish than
| the status quo... it's just brief, space separated and \n
| terminated ASCII. No need to overcomplicate things this
| simple.
| wolfendin wrote:
| My question is: why was the algorithm searching any section
| before the UK entry point. You can't exit at a waypoint before
| you enter so there is no reason to search that space.
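|
| A hedged sketch of that idea (not the actual NATS algorithm): once
| the entry point is located, look for the exit point only in the
| part of the route that follows it. The function name and waypoint
| names are invented; "DUPLI" appears twice to mimic two distinct
| waypoints sharing a designator. It assumes the entry point itself
| can be identified unambiguously.

```python
def find_exit_after_entry(route, entry, exit_name):
    """Return (entry_index, exit_index), searching for the exit only after the entry."""
    entry_idx = route.index(entry)  # raises ValueError if the entry is missing
    for i in range(entry_idx + 1, len(route)):
        if route[i] == exit_name:
            return entry_idx, i
    raise ValueError(f"exit {exit_name!r} not found after entry {entry!r}")


# Hypothetical route: the first DUPLI is far away, the second is the real exit.
route = ["START", "DUPLI", "UKIN", "MID", "DUPLI", "END"]
result = find_exit_after_entry(route, "UKIN", "DUPLI")
print(result)  # (2, 4)
```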
| CodeL wrote:
| [flagged]
| Rochus wrote:
| This is apparently just an opinion, with no more inside
| information than we had from the report
| (https://news.ycombinator.com/item?id=37401981), isn't it?
|
| EDIT: downvoting this question instead of responding is a pretty
| strange reaction.
| lagt_t wrote:
| Dude, this isn't Reddit, don't worry about the votes.
| Rochus wrote:
| Since it's not Reddit but HN, it's all the stranger to
| dismiss a perfectly legitimate question. But times and mores
| seem to change much faster than I realize.
| seabass-labrax wrote:
| You are correct, but it's an opinion that bridges the gap
| editorially between those knowledgeable about ATC but not data,
| and those knowledgeable about data but not ATC. This is a
| valuable service to provide, as both fields are rather complex.
| Rochus wrote:
| Thanks. I didn't have the patience to read it all. I
| initially hoped that the author was a field expert or even
| someone with inside knowledge, but he is apparently from a
| completely different domain and not in the UK, and there were
| assumptions about things the report was rather specific about
| (as specific as such reports usually are). It would be more
| useful if people would take a closer look at the report and
| draw the right conclusions about organizational failures and
| how to avoid them. All the great software technologies to
| achieve memory safety, etc. are of little use if the analyses
| and specifications are flawed or the assumptions of the
| various parties in a system of systems do not match. But
| people seem to prefer to speculate and argue about secondary
| issues.
| cratermoon wrote:
| "the description sounds like the procedure is working directly on
| the textual representation of the flight plan, rather than a data
| structure parsed from the text file. This would be quite
| worrying, but it might also just be how it is explained."
|
| Oh, this is typical in airline industry work. Ask programmers
| about a domain model or parsing, they give you blank stares. They
| love their validation code, and they love just giving up if
| something doesn't validate. It's all dumb data pipelines. At no
| point is there code that models the activities happening in the
| real world.
|
| In no system is there a "flight plan" type that has any behavior
| associated with it or anything like a set of waypoint types. Any
| type found would be a struct of strings in C terms, passed around
| and parsed not once, but every time the struct member is
| accessed. As the article notes, "The programming style seems very
| imperative.".
| jameshh wrote:
| That's super interesting (and a little terrifying). It's funny
| how different industries have developed different "cultures"
| for seemingly random reasons.
| cratermoon wrote:
| It was terrifying enough for me in the gig I worked on that
| dealt with reservations and check-in, where a catastrophic
| failure would be someone boarding a flight when they
| shouldn't have. To avoid that sort of failure, the system
| mostly just gave up and issued the passenger what's called an
| "Airport Service Document": effectively a record that shows
| the passenger as having a seat on the flight, but unable to
| check-in. This allows the passenger to go to the airport and
| talk to an agent at the check-in desk. At that point, yes, a
| person gets involved, and a good agent can usually work out
| the problem and get the passenger on their flight, but of
| course that takes time.
|
| If you've ever been at the airline desk waiting to check in
| and an agent spends 10 minutes working with a passenger
| (passengers), it's because they got an ASD and the agent has
| to screw around directly in the user-hostile SABRE
| interface to fix the reservation.
| 3pac wrote:
| SABRE is pretty good compared to the card file it replaced.
| cratermoon wrote:
| It's better to say SABRE _replicated_ , in digital form,
| that card file. And even today the legacy of that card
| form defines SABRE and all the wrappers and gateways to
| it.
| touisteur wrote:
| Giving up if something doesn't validate is indeed standard to
| avoid propagating badly interpreted data, causing far more
| complex bugs down the line. Validate soon, validate strongly,
| report errors and don't try to interpret whatever the hell is
| wrong with the input, don't try to be 'clever', because there
| lie the safety holes. Crashing on bad input is wrong, but
| trying to interpret data that doesn't validate, without specs
| (of course) is fraught with incomprehension and
| incompatibilities down the line, or unexpected corner cases (or
| untested, but no one wants to pay for a fully tested _all-goes_
| system, or just for the tools to simulate 'wrong inputs' or
| for formal validation of the parser _and_ all the code using
| the parser's results).
|
| There are already too many problems with non-compliant or
| legacy (or just buggy) data emitters, with the complexity in
| semantics or timing of the interfaces, to try and be clever
| with badly formatted/encoded data.
|
| It's already difficult (and costly) to make a system work as
| specified, so subtle variations to make it more tolerant to
| unspecified behaviour is just asking for bugs (or for more
| expensive systems that don't clear the purchasing price bar).
| cratermoon wrote:
| There's a difference between _parsing_ and _validating_.
| https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-
| va...
|
| You're right about all the buggy stuff out there, and that
| nobody wants to pay to make it better, though.
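|
| A small sketch of that difference, under assumptions: the speed
| field format (the letter N followed by four digits) is simplified
| for illustration, and the names is_valid_speed, parse_speed and
| SpeedKnots are hypothetical. Validation answers yes or no and
| leaves the caller holding a raw string; parsing checks once and
| hands back a typed value that never needs re-checking.

```python
from dataclasses import dataclass


def is_valid_speed(field: str) -> bool:
    """Validation: returns True/False; the caller still holds an unparsed string."""
    return (
        len(field) == 5
        and field[0] == "N"
        and all(c in "0123456789" for c in field[1:])
    )


@dataclass(frozen=True)
class SpeedKnots:
    """A typed value that can only be built from a well-formed field."""
    value: int


def parse_speed(field: str) -> SpeedKnots:
    """Parsing: check once, then return a typed value or raise."""
    if not is_valid_speed(field):
        raise ValueError(f"bad speed field: {field!r}")
    return SpeedKnots(int(field[1:]))


speed = parse_speed("N0450")
print(speed.value)  # 450
```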
| m1n1 wrote:
| If you want to hear about how bad air traffic control is in the
| United States, you can listen/read here
| https://www.nytimes.com/2023/09/05/podcasts/the-daily/plane-...
|
| There was a time recently when only 3 out of the 300+ air traffic
| control centers in the U.S. were fully staffed. All the rest were
| short-handed. Not sure how it stands today
| Diggsey wrote:
| Software has bugs, that's not really the damning part... The
| damning part is that in four hours and two levels of support
| teams, there was no one who actually knew anything about how the
| system worked who could remove the problematic flight plan so
| that the rest of the system could continue operating!
|
| What exactly is the point of these support teams when they can't
| fix the most basic failure mode (a single bad input...)
| hindsightbias wrote:
| They were probably on vacation
| [deleted]
| jahewson wrote:
| > What exactly is the point of these support teams when they
| can't fix the most basic failure mode (a single bad input...)
|
| To collect money on support contracts, I suspect.
| Maxion wrote:
| Try to get developers who love to code and create to stay on
| a support team and be on an on-call roster. I betcha at least
| half will say no, and the other half will either leave or
| you'll run out of money paying them.
| vb-8448 wrote:
| Just guessing:
|
| They bought software from a third party and treat it as a
| "black box". There are few known ways that the software fails,
| and the local team has instructions on how to fix it. But if it
| fails in an unexpected way, good luck, it's impossible for the
| local team to identify and fix the problem without the vendor.
|
| The reason it took so long was that they realized too late that
| they needed to call the vendor.
|
| Probably you have to blame managers rather than engineers in
| the support team.
| swarnie wrote:
| Considering this same failure has happened a few times in
| recent memory, maybe it's overoptimistic of me to expect an
| entry on the support wiki or something.
| krisoft wrote:
| > Considering this same failure has happened a few times in
| recent memory
|
| Which previous instances are you thinking about?
| NikolaNovak wrote:
| Unfortunately, I work on a reasonably modern ERP system which
| has been customized significantly for the client and also works
| with a wider range of client-specific data combinations that the
| vendor has seemingly not anticipated / other clients do not
| have.
|
| What it means is that on a regular basis, teams will be woken
| up at 2am because a batch process aborted on bad data; AND it
| doesn't tell you what data / where in the process it aborted.
|
| The only possibility is to rerun the process with crippling
| traces, and then manually review the logs to find the issue,
| remove it, and then re-run the program _again_ (hopefully
| remembering to remove the trace:).
|
| Even when all goes per plan, this can at times take more than 4
| hrs.
|
| Now, we are not running a mission-critical real-time system
| like air traffic; and I'm in NO way saying any of this is good;
| but, it may not be the case that "two levels of support teams
| didn't know anything" - the system could just be so poorly
| designed that with best operational experience and knowledge,
| it still took that long :-< .
|
| On HN, we take a certain level of modernity, logging, failure
| states, messaging, and restartability for granted; which may
| not be even remotely present on more niche or legacy systems
| (again, NOT saying that's good; just indicating issue may be
| less with operational competence vs design). It's easy to judge
| from our external perspective, but we have no idea what was
| presented / available to support teams, and what their
| mandatory process is.
| gonzo41 wrote:
| And when did you last test your monthly backups? But seriously.
| If you fill out all the positions in an org chart it's easy to
| think you're delivering, and for a lot of situations it usually
| works. Anointing someone a manager usually works out because
| people can muddle through. It doesn't work in medicine, or as
| it turns out, air traffic control.
|
| Lesson learned for about the next ~5 years.
| ateng wrote:
| One important software engineering skill that is often overlooked
| is the art of writing just the right amount of logging, such that
| one could have sufficient information to debug easily when
| things go wrong, but not too verbose such that it will be
| ignored or pruned in production.
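|
| One way to sketch that balance with Python's standard logging
| module: bulky detail goes to DEBUG (switched off at the INFO level
| configured here), while the failure line always names the record
| and the reason. The process function and the sample payloads are
| made up for illustration.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("batch")


def process(record_id, payload):
    # Verbose detail: only emitted if the level is lowered to DEBUG.
    log.debug("record %s raw payload: %r", record_id, payload)
    try:
        result = int(payload)
    except ValueError:
        # The line that matters during an incident: which record failed, and why.
        log.error("record %s rejected: payload %r is not an integer", record_id, payload)
        return None
    log.info("record %s ok", record_id)
    return result


results = [process(i, p) for i, p in enumerate(["10", "x7", "30"])]
```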
| blibble wrote:
| I wouldn't expect level 1 and level 2 to be able to diagnose a
| problem like this
|
| level 3 (devs) should have been brought in much quicker though
| toyg wrote:
| Having worked in tech support: level 3 (Devs) should have
| described their source code structure to level 2, and let
| them access it when they needed it.
| P-Nuts wrote:
| You don't need a complete diagnosis if you can spit out
| enough debug info that says, "oops shat the bed while working
| with this flight plan", then the support people can remove
| the one that's causing you to fail, restart the system, and
| tell ATC to route that one manually.
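|
| A rough sketch of that approach, with invented names and data:
| process_plan stands in for the real processing step, and a plan
| that makes it fail is set aside with its ID and the error instead
| of stopping everything. Whether carrying on after a failure is
| acceptable in a safety-critical system is a separate design
| decision; this only shows the bookkeeping.

```python
def process_plan(plan):
    """Stand-in for the real processing step; fails on one hypothetical input."""
    if "BAD" in plan["route"]:
        raise ValueError("cannot extract UK portion of route")
    return len(plan["route"])


def process_all(plans):
    """Process every plan; set failures aside instead of halting the whole run."""
    processed, quarantined = {}, []
    for plan in plans:
        try:
            processed[plan["id"]] = process_plan(plan)
        except Exception as exc:
            # Record which plan failed and why, so support can handle it manually.
            quarantined.append((plan["id"], str(exc)))
    return processed, quarantined


plans = [
    {"id": "FP1", "route": ["A", "B"]},
    {"id": "FP2", "route": ["A", "BAD", "C"]},
    {"id": "FP3", "route": ["C", "D", "E"]},
]
processed, quarantined = process_all(plans)
print(processed)
print(quarantined)
```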
___________________________________________________________________
(page generated 2023-09-11 22:00 UTC)