[HN Gopher] Air traffic failure caused by two locations 3600nm a...
       ___________________________________________________________________
        
       Air traffic failure caused by two locations 3600nm apart sharing
       3-letter code
        
       Author : basilesimon
       Score  : 119 points
       Date   : 2024-11-14 15:04 UTC (4 days ago)
        
 (HTM) web link (www.flightglobal.com)
 (TXT) w3m dump (www.flightglobal.com)
        
       | Optimal_Persona wrote:
       | Well, 3600 billionths of a meter IS kinda close...just sayin'
        
         | dh2022 wrote:
         | I read it the same way....
        
           | marky1991 wrote:
           | What did they mean, if not 'nanometers'?
        
             | abracadaniel wrote:
             | Nautical miles
        
         | bilekas wrote:
         | I was thinking the same and thinking that's a super weird edge
         | case to happen. I'm obviously tired.
        
       | FateOfNations wrote:
       | Good news: the system successfully detected an error and didn't
       | send bad data to air traffic controllers.
       | 
       | Bad News: the system can't recover from an error in an individual
       | flight plan, bringing the whole system down with it (along with
       | the backup system since it was running the same code).
        
         | wyldfire wrote:
         | > the system can't recover from an error in an individual flight
         | plan, bringing the whole system down with it
         | 
         | From the system's POV maybe this is the right way to resolve
         | the problem. Could masking the failure by obscuring this
         | flight's waypoint problem have resulted in a potentially
         | conflicting flight not being tracked among other flights? If
         | so, maybe it's truly urgent enough to bring down the system and
         | force the humans to resolve the discrepancy.
         | 
         | The systems outside of the scope of this one failed to preserve
         | a uniqueness guarantee that was depended on by this system. Was
         | that dependency correctly identified as one that was the job of
         | System X and not System Y?
        
           | martinald wrote:
           | Yes I agree. The reason the system crashed from what I
           | understand wasn't because of the duplicate code, it was
           | because it had the plane time travelling, which suggests very
           | serious corruption.
        
             | kevin_thibedeau wrote:
             | Waves hand... This is not the SQL injection you're looking
             | for. It's just a serious corruption.
        
           | aftbit wrote:
           | It seems fundamentally unreasonable for the flight processing
           | system to entirely shut itself down just because it detected
           | that one flight plan had corrupt data. Some degree of
           | robustness should be expected from this system IMO.
        
             | HeyLaughingBoy wrote:
             | It depends on what the potential outcomes are.
             | 
             | I've worked on a (medical, not aviation) system where we
             | tried as much as possible to recover from subsystem
             | failures or at least gracefully reduce functionality until
             | it was safe to shut everything down.
             | 
             | However, there were certain classes of failure where the
             | safest course of action was to shut the entire system down
             | immediately. This was generally the case where continuing
             | to run could have made matters worse, putting patient
             | safety at risk. I suspect that the designers of this system
             | ran into the same problem.
        
             | mannykannot wrote:
             | It does not seem reasonable when you put it like that, but
             | when could it be said with confidence that it only affected
             | just one flight plan? I get the impression that it is only
             | in hindsight that this could be seen to be so. On the face
             | of it, this was just an ordinary transatlantic flight like
             | thousands of others.
             | 
             | In general, the point where a problem first becomes
             | apparent is not a guideline to its scope.
             | 
             | Air traffic control is inherently a coordination problem
             | dependent on common data, rules and procedures, which would
             | seem to limit the degree to which subsystems can be siloed.
             | Multiple implementations would not have helped in this
             | case, either.
        
               | MBCook wrote:
               | I think you're on the right track, I assume it's safety.
               | 
               | If one bad flight plan came in, what are the chances
               | other unnoticed errors may be getting through?
               | 
               | Given the huge danger involved with being wrong, shutting
               | down with a "stuff doesn't add up, no confidence in safe
               | operation" error may be the best approach.
        
           | akira2501 wrote:
           | > obscuring this flight's waypoint problem have resulted in a
           | potentially conflicting flight not being tracked among other
           | flights?
           | 
           | Flights are tracked by radar and by transponder. The
           | appropriate thing to do is just flag the flight with a
           | discontinuity error but otherwise operate normally. This
           | happens with other statuses like "radio failure" or
           | "emergency aircraft."
           | 
           | It's not something you'd see on a commercial flight, but a
           | private IFR flight (one with a flight plan), you can actually
           | cancel your IFR plan mid flight and revert to VFR (visual
           | flight rules) instead.
           | 
           | Some flights take off without an IFR clearance as a VFR
           | flight, but once airborne, they call up ATC and request an
           | IFR clearance already en route.
           | 
           | The system is vouchsafing where it does not need to.
        
           | outworlder wrote:
           | > From the system's POV maybe this is the right way to
           | resolve the problem. Could masking the failure by obscuring
           | this flight's waypoint problem have resulted in a potentially
           | conflicting flight not being tracked among other flights? If
           | so, maybe it's truly urgent enough to bring down the system
           | and force the humans to resolve the discrepancy.
           | 
           | Flagging the error is absolutely the right way to go. It
           | should have rejected the flight plan, however. There could be
           | issues if the flight was allowed to proceed and you now have
           | an aircraft you didn't expect showing up.
           | 
           | Crashing is not the way to handle it.
        
       | hobs wrote:
       | People posting on this forum saying "ah well software's failure
       | case isn't as bad"
       | 
       | > This forced controllers to revert to manual processing, leading
       | to more than 1,500 flight cancellations and delaying hundreds of
       | services which did operate.
        
         | egypturnash wrote:
         | Zero fatalities though. You could do a _lot_ worse for a
         | massive air traffic control failure.
        
           | hobs wrote:
           | It's true, not saying they did a bad job here, just that even
           | minor problems in your code can be exacerbated into giant net
           | effects without you even considering it.
        
           | lxgr wrote:
           | Unfortunately shutting down air traffic generally does not
           | result in zero excess deaths:
           | https://pmc.ncbi.nlm.nih.gov/articles/PMC3233376/
        
             | d1sxeyes wrote:
             | Your source says "the fatality rate did not change
             | appreciably".
        
               | lxgr wrote:
               | Injuries did increase, though, and I can't think of a
               | plausible mechanism that would somehow cap expected
               | outcomes at "injury but not death".
        
               | d1sxeyes wrote:
               | So we were talking about excess deaths, which means that
               | supporting your argument with a paper that argues that a
               | previous finding of excessive deaths was flawed is
               | probably not the strongest argument you could make.
               | 
               | Increased number of injuries but not deaths could be, for
               | example, (purely making things up off the top of my head
               | here) due to higher levels of distractedness among
               | average drivers due to fear of terrorism, which results
               | in more low-speed, surface-street collisions, while
               | there's no change in high speed collisions because a
               | short spell of distractedness on the highway is less
               | likely to result in an accident.
        
         | lallysingh wrote:
         | The software needs a way to reject bad plans without falling
         | over.
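         | 
         | A minimal Python sketch of that idea, for illustration only: each
         | plan is validated on its own, and a bad one is quarantined for
         | manual review while the rest proceed. The names (FlightPlan,
         | validate, process_batch) and the time-ordering check are
         | hypothetical and are not taken from the real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlightPlan:
    callsign: str
    # (waypoint identifier, estimated time over it in minutes from departure)
    legs: tuple[tuple[str, int], ...]


class PlanError(ValueError):
    """Raised when a single plan fails validation."""


def validate(plan: FlightPlan) -> None:
    # Hypothetical sanity check: times along the route must never go backwards.
    times = [t for _, t in plan.legs]
    if any(later < earlier for earlier, later in zip(times, times[1:])):
        raise PlanError(f"{plan.callsign}: route times go backwards")


def process_batch(plans):
    accepted, quarantined = [], []
    for plan in plans:
        try:
            validate(plan)
            accepted.append(plan)
        except PlanError as err:
            # Isolate the bad plan for manual review; keep processing the rest.
            quarantined.append((plan, str(err)))
    return accepted, quarantined


good = FlightPlan("ABC123", (("AAA", 0), ("BBB", 40), ("CCC", 95)))
bad = FlightPlan("XYZ789", (("AAA", 0), ("DVL", 300), ("CCC", 120)))
accepted, quarantined = process_batch([good, bad])
```

         | Whether quarantining is actually safe depends on being sure the
         | fault is confined to that one plan, which is the point others in
         | this thread are debating.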
        
       | ipunchghosts wrote:
       | Title should be nmi
        
         | jordanb wrote:
         | I do a lot of navigation and have never seen nautical miles
         | abbreviated as "nmi."
        
           | lxgr wrote:
           | I bet not everybody on here does, so picking the unambiguous
           | unit sign would definitely avoid some double-takes.
        
         | barbazoo wrote:
         | The unit of "nm" is common among pilots but yeah technically it
         | should be "NM".
        
         | yongjik wrote:
         | NGL, two locations 3600 non-maskable interrupts apart would
         | have been a _much_ more interesting story.
        
         | buildsjets wrote:
         | Maybe that is true in your industry. It is not true in my
         | industry. NM is the legally accepted abbreviation for nautical
         | miles when used in the context of aircraft operations.
        
         | Andys wrote:
         | Non-maskable Interrupt?
        
       | jp57 wrote:
       | FYI: nm = nautical miles, not nanometers.
        
         | barbazoo wrote:
         | Given the context, I'd say NM actually
         | https://en.wikipedia.org/wiki/Nautical_mile
        
           | jp57 wrote:
           | I was clarifying the post title, which uses "nm".
        
             | pvitz wrote:
             | Yes, it looks like they should have written "NM" instead of
             | "nm".
        
               | andkenneth wrote:
               | No one is using nanometers in aviation navigation. Quite
               | a few aviation systems are case insensitive or all caps
               | only so you can't always make a distinction.
               | 
               | In fact, if you say "miles", you mean nautical miles. You
               | have to use "sm" to mean statute miles if you're using
               | that unit, which is often used for measuring visibility.
        
               | ianferrel wrote:
               | Sure but I could imagine some kind of software failure
               | caused by trying to divide by a distance that rounded to
               | zero because the same location was listed in two
               | databases that were almost but not exactly the same
               | location. In fact I did when I first read the headline,
               | then realized that it was probably nautical miles.
               | 
               | That would be roughly consistent with the title and not a
               | totally absurd thing to happen in the world.
        
               | anigbrowl wrote:
               | Indeed, but you can easily imagine a software glitch over
               | what looks like a single location but which the computer
               | sees as two separate ones.
        
         | dietr1ch wrote:
         | Thanks, from the title I was confused on why there was such a
         | high resolution on positions.
        
         | ikiris wrote:
         | Nanometers would be a very short flight.
        
           | cheschire wrote:
           | I could imagine conflict arising when switching between
           | single and double precision causing inequality like this.
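           | 
           | A small Python sketch of that effect, under the assumption that
           | one system stores a coordinate as a 4-byte float and another as
           | an 8-byte double. The latitude value is made up for
           | illustration.

```python
import struct


def to_float32(x: float) -> float:
    # Round-trip through a 4-byte IEEE 754 float, then widen back to a double.
    return struct.unpack("<f", struct.pack("<f", x))[0]


lat = 48.1234567  # hypothetical latitude held as a double
lat32 = to_float32(lat)

same_exact = lat32 == lat                  # exact comparison fails
same_within_tol = abs(lat32 - lat) < 1e-5  # tolerance comparison succeeds
```

           | Comparing positions with a tolerance rather than exact equality
           | avoids treating one location as two.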
        
         | noqc wrote:
         | man, this ruins everything.
        
         | andyjohnson0 wrote:
         | Even though I knew this was about aviation, I still read nm as
         | nanometres. Now I'm wondering what this says about how my brain
         | works.
        
           | lostlogin wrote:
           | It says 'metric'. Good.
        
             | tialaramex wrote:
             | Indeed. There are plenty of things in aviation where they
             | care so much about compatibility that something survives
             | decades after it should reasonably be obsolete and
             | replaced.
             | 
             | Inches of mercury, magnetic bearings (the magnetic poles
             | _move!_ but they put up with that) and gallons of fuel, all
             | just accepted.
             | 
             | Got a safety-of-life emergency on an ocean liner, oil
             | tanker or whatever? Everywhere in the entire world mandates
             | GMDSS which includes Digital Selective Calling, the boring
             | but complicated problems with radio communication are
             | solved by a machine, you just need to know who you want to
             | talk to (for Mayday calls it's everyone) and what you want
             | to tell them (where you are, that you need urgent
             | assistance and maybe the nature of the emergency)
             | 
             | On a big plane? Well good luck, they only have analogue
             | radio and it's your problem to cope with the extensive
             | troubles as a result.
             | 
             | I'm actually impressed that COSPAS/SARSAT wasn't obliged to
             | keep the analogue plane transmitters working, despite
             | obsoleting (and no longer providing rescue for) analogue
             | boat or personal transmitters. But on that, at least, they
             | were able to say no, if you don't want to spend a few grand
             | on the upgrade for your million dollar plane we don't plan
             | to spend _billions of dollars_ to maintain the satellites
             | just so you can keep your worse system limping along.
        
           | jug wrote:
           | Yeah, I went into the article thinking this because I
           | expected someone had created waypoints right on top of each
           | other and in the process also somehow generating the same
           | code for them.
        
         | QuercusMax wrote:
         | Ah! I thought this was a case where the locations were just
         | BARELY different from each other, not that they're very far
         | apart.
        
         | fabrixxm wrote:
         | It took me a while, tbh..
        
         | cduzz wrote:
         | I was wondering; it seemed like if the two airports were 36000
         | angstroms apart (3600 nanometers), it'd be reasonable to give
         | them the same airport code since they'd be pretty much on top
         | of each other.
         | 
         | I've also seen "DANGER!! 12000000 mVolts!!!" on tiny little
         | model railroad signs.
        
           | atonse wrote:
           | That's so adorable (for model railroads)
        
       | jmvoodoo wrote:
       | So, essentially the system has a serious denial of service flaw.
       | I wonder how many variations of flight plans can cause different
       | but similar errors that also force a disconnect of primary and
       | secondary systems.
       | 
       | Seems "reject individual flight plan" might be a better system
       | response than "down hard to prevent corruption"
       | 
       | Bad assumption that a failure to interpret a plan is a serious
       | coding error seems to be the root cause, but hard to say for
       | sure.
        
         | mjevans wrote:
         | Rejecting the flight plan would be the last resort, but that is
         | where it should have gone, absent other options, rather than
         | total shutdown.
         | 
         | CORRECT the flight plan, by first promoting the exit/entry
         | points for each autonomous region along the route, validating
         | the entry/exit list only, and then the arcs within, would be
         | the least errant method.
        
           | mcfedr wrote:
           | Rejecting the plan surely should have come many places before
           | shutting down the whole system!
        
           | d1sxeyes wrote:
           | You can't just reject or correct the flight plan, you're a
           | consumer of the data. The flight plan _was_ valid, it was the
           | interpretation applied by the UK system which was incorrect
           | and led to the failure.
           | 
           | There are a bunch of ways FPRSA-R can already interpret data
           | like this correctly, but there was a combination of 6
           | specific criteria that hadn't been foreseen (e.g. the
           | duplicate waypoints, the waypoints both being outside UK
           | airspace, the exit from UK airspace being implicit on the
           | plan as filed, etc).
        
       | perihelions wrote:
       | Original (2023) thread with 446 comments,
       | 
       | https://news.ycombinator.com/item?id=37461695 ( _" UK air traffic
       | control meltdown (jameshaydon.github.io)"_)
        
       | Jtsummers wrote:
       | There's been some prior discussion on this over the past year,
       | here are a few I found (selected based on comment count, haven't
       | re-read the discussions yet):
       | 
       | From the day of:
       | 
       | https://news.ycombinator.com/item?id=37292406 - 33 points by
       | woodylondon on Aug 28, 2023 (23 comments)
       | 
       | Discussions after:
       | 
       | https://news.ycombinator.com/item?id=37401864 - 22 points by
       | bigjump on Sept 6, 2023 (19 comments)
       | 
       | https://news.ycombinator.com/item?id=37402766 - 24 points by
       | orobinson on Sept 6, 2023 (20 comments)
       | 
       | https://news.ycombinator.com/item?id=37430384 - 34 points by
       | simonjgreen on Sept 8, 2023 (68 comments)
        
         | perihelions wrote:
         | There's also a much larger one,
         | 
         | https://news.ycombinator.com/item?id=37461695 ( _" UK air
         | traffic control meltdown (jameshaydon.github.io)"_, 446
         | comments)
        
       | steeeeeve wrote:
       | You know there's a software engineer somewhere that saw this as a
       | potential problem, brought up a solution, and had that solution
       | rejected because handling it would add 40 hours of work to a
       | project.
        
         | ryandrake wrote:
         | ... or there's a software engineer somewhere who simply
         | _assumed_ that three letter navaid identifiers were globally
         | unique, and baked that assumption into the code.
         | 
         | I guess we now need a "Falsehoods Programmers Believe About
         | Aviation Data" site :)
        
           | MichaelZuo wrote:
           | Or even more straightforward, just don't believe anyone 100%
           | knows what they are doing until they exhaustively list every
           | assumption they are making.
        
             | gregmac wrote:
             | Which also means never assume the exhaustive list is 100%.
        
               | MichaelZuo wrote:
               | Bingo, without some means of credible verification, then
               | assume it's incomplete.
        
               | Filligree wrote:
               | I wouldn't be able to produce such a list, even for areas
               | where I totally do know everything that would be on the
               | list.
        
             | madcaptenor wrote:
             | Even more straightforward, just don't believe anyone 100%
             | knows what they are doing.
        
           | metaltyphoon wrote:
           | Did aviation software for 7 years. This is 100% the first
           | assumption about waypoint / navaid when new devs come in.
        
           | em-bee wrote:
           | or falsehoods programmers believe about global identifiers
        
         | CrimsonCape wrote:
         | C dev: "You are telling me that the three digit codes are not
         | globally unique??? And now we have to add more bits to the
         | struct?? That's going to kill our perfectly optimized bit
         | layout in memory! F***! This whole app is going to sh**"
        
           | throw0101a wrote:
           | > _C dev: "You are telling me that the three digit codes are
           | not globally unique???_
           | 
           | They are understood not to be. They are generally known to be
           | regionally unique.
           | 
           | The "DVL" code is unique within FAA/Transport Canada
           | control, and another "DVL" is unique within EASA space.
           | 
           | There are pre-defined three-letter codes:
           | 
           | * https://en.wikipedia.org/wiki/IATA_airport_code
           | 
           | And pre-defined four-letter codes:
           | 
           | * https://en.wikipedia.org/wiki/ICAO_airport_code
           | 
           | There are also five-letter names for major route points:
           | 
           | * https://data.icao.int/icads/Product/View/98
           | 
           | * https://ruk.ca/content/icao-icard-and-5lnc-how-
           | those-5-lette...
           | 
           | If there are duplicates there is a resolution process:
           | 
           | * https://www.icao.int/WACAF/Documents/Meetings/2014/ICARD/IC
           | A...
        
             | marcosdumay wrote:
             | Hum... Somebody has a list of foreign local-codes sharing
             | the same space as the local ones?
             | 
             | I assumed IATA messed up, now I'm wondering how that even
             | happens. It's not even easy to discover the local codes of
             | remote aviation authorities.
        
               | skissane wrote:
               | > I assumed IATA messed up,
               | 
               | This isn't IATA. IATA manages codes used for passenger
               | and cargo bookings, which are distinct from the codes
               | used by pilots and air traffic control we are talking
               | about here, ultimately overseen by ICAO. These codes
               | include a lot of stuff which is irrelevant to
               | passengers/freight, such as navigation waypoints,
               | military airbases (which normally would never accept a
               | civilian flight, but still could be used for an emergency
               | landing, plus civilian and military ATC coordinate with
               | each other to avoid conflicts)
        
             | CrimsonCape wrote:
             | It seems like tasking a software engineer to figure this
             | out when the industry at large hasn't figured this out just
             | isn't fair.
             | 
             | Best I can see (using Rust) is a hashmap on UTF-8 string
             | keys and every code in existence gets inserted into the
             | hash map with an enum struct based on the code type. So you
             | are forced to switch over each enum case and handle each
             | case no matter what region code type.
             | 
             | It becomes apparent that the problem must be handled with
             | app logic earlier in the system; to query a database of
             | codes, you must also know which code and "what type" of
             | code it is. Users are going to want to give the code only,
             | so there's some interesting mis-direction introduced; the
             | system has to somehow fuzzy match the best code for the
             | itinerary. Correct me if I'm wrong, but the above seems
             | like a _mandatory_ step in solving the problem which would
             | have caught the exception.
             | 
             | I echo other comments that say that there's probably 60%
             | more work involved than your manager realizes.
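             | 
             | A rough sketch of that lookup in Python rather than Rust, as
             | an illustration only: one identifier maps to several typed
             | entries, and the resolver picks the candidate nearest the
             | previous point on the route. The Fix type, the resolve
             | function, and the nearest-neighbour rule are assumptions,
             | and the coordinates are approximate.

```python
from dataclasses import dataclass
from enum import Enum
from math import asin, cos, radians, sin, sqrt


class Kind(Enum):
    NAVAID = "navaid"
    AIRPORT = "airport"


@dataclass(frozen=True)
class Fix:
    ident: str
    kind: Kind
    lat: float
    lon: float


def distance_km(a_lat, a_lon, b_lat, b_lon):
    # Haversine great-circle distance on a sphere of radius 6371 km.
    dlat, dlon = radians(b_lat - a_lat), radians(b_lon - a_lon)
    h = (sin(dlat / 2) ** 2
         + cos(radians(a_lat)) * cos(radians(b_lat)) * sin(dlon / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(h))


# One identifier may map to several fixes. Coordinates are approximate.
FIXES = {
    "DVL": [
        Fix("DVL", Kind.NAVAID, 48.1, -98.9),  # Devils Lake, North Dakota
        Fix("DVL", Kind.NAVAID, 49.3, 0.3),    # Deauville, France
    ],
}


def resolve(ident, prev_lat, prev_lon):
    # Pick the candidate closest to the previous point on the route.
    candidates = FIXES.get(ident, [])
    if not candidates:
        raise KeyError(f"unknown identifier {ident!r}")
    return min(
        candidates,
        key=lambda f: distance_km(prev_lat, prev_lon, f.lat, f.lon),
    )


near_paris = resolve("DVL", 48.7, 2.4)          # previous point near Paris
near_minneapolis = resolve("DVL", 44.9, -93.2)  # previous point near Minneapolis
```

             | Nearest-neighbour is only a heuristic; a production resolver
             | would need to handle ties and implausibly distant matches
             | explicitly.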
        
             | skissane wrote:
             | > They are understood not to be. They are generally known
             | to be regionally unique.
             | 
             | Then why aren't they namespaced? Attach to each code its
             | issuing authority, so it is obvious to the code that
             | DVL@FAA and DVL@EASA are two different things?
             | 
             | Maybe for backward compatibility/ human factors reasons,
             | the code needs to be displayed without the namespace to
             | pilots and air traffic controllers, but it should be a
             | field in the data formats.
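             | 
             | A minimal Python sketch of that namespacing idea. The
             | IDENT@AUTHORITY syntax is just the notation used above, not
             | a real data format, and QualifiedCode is a hypothetical
             | type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualifiedCode:
    ident: str      # what pilots and controllers see, e.g. "DVL"
    authority: str  # hypothetical namespace field, e.g. "FAA" or "EASA"

    @classmethod
    def parse(cls, text: str) -> "QualifiedCode":
        ident, sep, authority = text.partition("@")
        if not sep or not ident or not authority:
            raise ValueError(f"expected IDENT@AUTHORITY, got {text!r}")
        return cls(ident, authority)

    def display(self) -> str:
        # Humans keep seeing the bare identifier.
        return self.ident


us = QualifiedCode.parse("DVL@FAA")
fr = QualifiedCode.parse("DVL@EASA")
lookup = {us: "Devils Lake", fr: "Deauville"}
```

             | The two keys compare unequal and hash separately, so they
             | cannot collide in a lookup table even though both display
             | as "DVL".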
        
         | shagie wrote:
         | CGP Grey The Maddening Mess of Airport Codes!
         | https://youtu.be/jfOUVYQnuhw
         | 
         | I'd rather deal with designing tables to properly represent
         | names.
        
         | nightowl_games wrote:
         | I don't know that and I don't like this assumption that only
         | 'managers' make mistakes, or that software engineers are always
         | right. I think it's needlessly adversarial, biased and largely
         | incorrect.
        
           | zer8k wrote:
           | Spoken like a manager.
           | 
           | Look, when you're barking orders at the guys in the trenches
           | who, understandably in fear for their jobs, do the _stupid_
           | "business-smart" thing, then it is entirely the fault of
           | management.
           | 
           | I can't tell you how many times just in the last year I've
           | been blamed-by-proxy for doing something that was decreed
           | upon me by some moron in a corner office. Everything is an
           | emergency, everything needs to be done yesterday, everything
           | is changing all the time because King Shit and his merry band
           | of boot-licking middle managers decide it should be.
           | 
           | Software engineers, especially ones with significant
           | experience, are almost surely more right than middle
           | managers. "Shouldn't we consider this case?" is almost always
           | met with some parable about "overengineering" and followed up
           | by a healthy dose of "that's not AGILE". I have grown so
           | tired of this and thanks to the massive crater in job
           | mobility most of us just do as we are told.
           | 
           | It's the power imbalance. In this light, all blame should
           | fall on the manager unless it can be explicitly shown to be
           | developer problems. The adage "those who can, do, and those
           | who can't, teach" applies equally to management.
           | 
           | When it's my f _@#_ $U neck on the line and the only option
           | to keep my job is do the stupid thing you can bet I'll do the
           | stupid thing. Thank god there's no malpractice law in
           | software.
           | 
           | Poor you - only one of our jobs is getting shipped overseas.
        
             | kortilla wrote:
             | Your attitude is super antagonistic and your relationship
             | with management is not representative of the industry. I
             | recommend you consider a different job or if this pattern
             | repeats at every job that you reflect on how you interact
             | with managers to improve.
        
           | elteto wrote:
           | Agreed. And most of the people with these attitudes have
           | never written actual safety critical code where everything is
           | written to a very detailed spec. Most likely the designers of
           | the system thought of this edge case and required adding a
           | runtime check and fatal assertion if it was ever encountered.
        
       | _pete_ wrote:
       | The DVL really is in the details.
        
         | spatley wrote:
         | Har! should have seen that one coming :)
        
       | jrochkind1 wrote:
       | I don't know how long that failure mode has been in place or if
       | this is relevant, but it makes me think of analogous times I've
       | encountered similar:
       | 
       | When automated systems are first put in place, for something high
       | risk, "just shut down if you see something that may be an error"
       | is a totally reasonable plan. After all, literally yesterday they
       | were all functioning without the automated system, if it doesn't
       | seem to be working right better switch back to the manual process
       | we were all using yesterday, instead of risk a catastrophe.
       | 
       | In that situation, switching back to yesterday's workflow is
       | something that won't interrupt much.
       | 
       | A couple decades -- or honestly even just a couple years --
       | later, that same fault system, left in place without much
       | consideration because it rarely is triggered -- is itself
       | catastrophic, switching back to a rarely used and much more
       | inefficient manual process is extremely disruptive, and even
       | itself raises the risk of catastrophic mistakes.
       | 
       | The general engineering challenge is how we deal with little-
       | used little-seen functionality (definitely thinking of fault-
       | handling, but there may be other cases) that is totally
       | reasonable when put in place, but has not aged well, and nobody
       | has noticed or realized it, and even if they did it might be hard
       | to convince anyone it's a priority to improve, and the longer you
       | wait the more expensive.
        
         | telgareith wrote:
         | Dig into the OpenZFS 2.2.0 data loss bug story. There was at
         | least one ticket (in FreeBSD) where it cropped up almost a year
         | prior and got labeled "look into later," but it got closed.
         | 
         | I'm aware closing tickets of "future investigation" tasks when
         | it seems to not be an issue any longer is common. But, it
         | shouldn't be.
        
           | Arainach wrote:
           | >it shouldn't be
           | 
           | Software can (maybe) be perfect, or it can be relevant to a
           | large user base. It cannot be both.
           | 
           | With an enormous budget and a strictly controlled scope
           | (spacecraft) it may be possible to achieve defect-free
           | software.
           | 
           | In most cases it is not. There are always finite resources,
           | and almost always more ideas than there is time to implement.
           | 
           | If you are trying to make money, is it worth chasing down
           | issues that affect a minuscule fraction of users that take
           | eng time which could be spent on architectural improvements,
           | features, or bugs affecting more people?
           | 
           | If you are an open source or passion project, is it worth
           | your contributors' limited hours, and will trying to insist
           | people chase down everything drive your contributors away?
           | 
           | The reality in any sufficiently large project is that the bug
           | database will only grow over time. If you leave open every
           | old request and report at P3, users will grow just as
           | disillusioned as if you were honest and closed them as "won't
           | fix". Having thousands of open issues that will never be
           | worked on pollutes the database and makes it harder to keep
           | track of the issues which DO matter.
        
             | Shorel wrote:
             | I'm in total disagreement with your last paragraph.
             | 
             | In fact, I can't see how it follows from the rest.
             | 
             | Software can have defects, true. There are finite
             | resources, true. So keep the tickets open. Eventually
             | someone will fix them.
             | 
             | Closing something for spurious psychological reasons seems
             | detrimental to actual engineering and it doesn't actually
             | avoid any real problem.
             | 
             | Let me repeat that: ignoring a problem doesn't make it
             | disappear.
             | 
             | Keep the tickets open.
             | 
             | Anything else is supporting a lie.
        
               | Arainach wrote:
               | It's not "spurious psychological reasons". It is being
               | honest that issues will never, ever meet the bar to be
               | fixed. Pretending otherwise by leaving them open and
               | ranking them in the backlog is a waste of time and
               | attention.
        
               | exe34 wrote:
               | it's more fun/creative/CV-worthy to write new shiny
               | features than to fix old problems.
        
               | gbear605 wrote:
               | There have been a couple times in the past where I've run
               | into an issue marked as WONT FIX and then resolved it on
               | my end (because it was luckily an open source project).
               | If the ticket were still open, it would have been trivial
               | to put up a fix, but instead it was a lot more annoying
               | (and in one of the cases, I just didn't bother). Sure,
               | maybe the issue is _so_ low priority that it wouldn't
               | even be worth reviewing a fix, and this doesn't apply for
               | closed source projects, but otherwise you're just losing
               | out on other people doing free fixes for you.
        
             | mithametacs wrote:
             | Everything is finite including bugs. They aren't magic or
             | spooky.
             | 
             | If you are superstitious about bugs, it's time to triage.
             | Absolutely full turn disagreement with your directions.
        
         | ronsor wrote:
         | > The general engineering challenge, is how we deal with
         | little-used little-seen functionality (definitely thinking of
         | fault-handling, but there may be other cases) that is totally
         | reasonable when put in place, but has not aged well, and nobody
         | has noticed or realized it, and even if they did it might be
         | hard to convince anyone it's a priority to improve, and the
         | longer you wait the more expensive.
         | 
         | The solution to this is to trigger all functionality
         | periodically and randomly to ensure it remains tested. If you
         | don't test your backups, you don't have any.
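         | 
         | A minimal sketch of that idea in Python, assuming a registry
         | of rarely used code paths that can be run as drills. The
         | handler names and both drills are hypothetical. Each run
         | exercises a random sample and reports failures rather than
         | hiding them.

```python
import random


def exercise_rarely_used_paths(handlers, rng, sample_size=1):
    """Run a random sample of rarely used handlers and collect failures."""
    names = sorted(handlers)
    chosen = rng.sample(names, k=min(sample_size, len(names)))
    failures = {}
    for name in chosen:
        try:
            handlers[name]()
        except Exception as exc:  # a drill should report, not crash the runner
            failures[name] = repr(exc)
    return chosen, failures


def restore_backup_drill():
    # Hypothetical drill: restore a copy and compare it with the source.
    backup = {"flight_plans": 3}
    restored = dict(backup)
    if restored != backup:
        raise RuntimeError("restored data differs from backup")


def failover_drill():
    # Hypothetical drill that fails, standing in for a path nobody rehearsed.
    raise RuntimeError("standby could not start: credentials missing")


HANDLERS = {
    "restore_backup": restore_backup_drill,
    "failover": failover_drill,
}

if __name__ == "__main__":
    chosen, failures = exercise_rarely_used_paths(
        HANDLERS, random.Random(), sample_size=2
    )
    print("exercised:", chosen)
    print("failed:", failures)
```

         | Passing the random generator in as a parameter keeps a run
         | reproducible when a seed is supplied, which helps when a
         | drill fails and has to be replayed.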
        
           | ericjmorey wrote:
           | Which company deployed a chaos monkey daemon on their
           | systems? It seemed to improve resiliency when I read about it.
        
             | theolivenbaum wrote:
             | Netflix did that many years ago. Interesting idea, even if a
             | bit disruptive in the beginning:
             | https://netflix.github.io/chaosmonkey/
        
             | amelius wrote:
             | "Your flight has been delayed due to Chaos Monkey."
        
             | bitwize wrote:
             | The chaos monkey is there to remind you to always mount a
             | scratch monkey.
        
         | crtified wrote:
         | Also, as codebases and systems get more (not less) complex over
         | time, the potential for technical debt multiplies. There are
         | more processing and outcome vectors, more (and different)
         | branching paths. New logic maps. Every day/month/year/decade is
         | a new operating environment.
        
           | mithametacs wrote:
           | I don't think it is exponential. In fact, one of the things
           | that surprises me about software engineering is that it's
           | possible at all.
           | 
           | Bugs seem to scale log-linearly with code complexity. If it's
           | exponential you're doing it wrong.
        
       | sam0x17 wrote:
       | I've posted this here before, but they really need globally
       | unique codes for all the airports, waypoints, etc. It's crazy
       | there are collisions. People always balk at this for some reason,
       | but look at the edge cases that can occur. It's crazy CRAZY
        
         | crote wrote:
         | Coming up with a globally unique waypoint system is trivial.
         | Convincing the aviation industry to spend many hundreds of
         | millions of dollars to change a core data type used in just
         | about every single aviation-related system, in order to avoid
         | triggering rare once-a-decade bugs? That's a _lot_ harder.
        
           | lostlogin wrote:
           | > That's a lot harder.
           | 
           | I wonder what 1,500 cancelled flights and 700,000 disrupted
           | passengers add up to in cost? And that's just this one
           | incident.
        
             | amiga386 wrote:
             | ...an incident where they didn't parse the data as other
             | systems already parsed the data.
             | 
             | It sounds like the solution is better validation and test
             | suites for the existing scheme, not a new less-ambiguous
             | scheme
        
         | buildsjets wrote:
         | That's not CRAZY at all. CRAZY is at 14deg 4' 50.87" N, 145deg
         | 38' 16.22" E
         | 
         | https://opennav.com/waypoint/US/CRAZY
        
       | gadders wrote:
       | If you want to, you can read the final report from the UK Civil
       | Aviation Authority here:
       | https://www.caa.co.uk/publication/download/23340
       | 
       | It's pretty readable and quite interesting.
        
       | tempodox wrote:
       | When there's no global clearing house for those identifiers,
       | maybe namespaces would help?
       | 
       | Related: The editorialized HN title uses nanometers (nm) when
       | they possibly mean nautical miles (nmi). What would a flight
       | control system make of that?
        
         | bigfatkitten wrote:
         | The reason idents for radio navaids (VOR/NDB) are only three
         | characters is because they are broadcast via morse code. They
         | need to be copyable by pilots who are otherwise somewhat busy
         | and not particularly proficient in Morse. For this purpose,
         | they only need to be unique to that frequency within plausible
         | radio range.
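         | 
         | Since idents are only unique within radio range, a parser
         | that sees a duplicate can use distance to pick the plausible
         | one. A rough Python sketch, assuming each ident maps to a
         | list of (lat, lon) candidates; the ident "XYZ", its
         | coordinates, and the 500 NM leg limit are made up for
         | illustration, and this is not the logic of any real flight
         | planning system.

```python
import math

EARTH_RADIUS_NM = 3440.065  # mean Earth radius expressed in nautical miles


def distance_nm(a, b):
    """Great-circle distance between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_NM * math.asin(math.sqrt(h))


def resolve_ident(ident, candidates, previous_fix, max_leg_nm=500.0):
    """Return the single candidate within max_leg_nm of previous_fix.

    Raises ValueError when zero or several candidates are plausible, so
    the caller can reject that one route instead of guessing.
    """
    plausible = [
        pos
        for pos in candidates.get(ident, [])
        if distance_nm(previous_fix, pos) <= max_leg_nm
    ]
    if len(plausible) != 1:
        raise ValueError(f"{ident}: {len(plausible)} plausible locations")
    return plausible[0]


# Hypothetical data: one ident assigned to two places an ocean apart.
CANDIDATES = {"XYZ": [(50.0, 0.0), (45.0, -90.0)]}

if __name__ == "__main__":
    print(resolve_ident("XYZ", CANDIDATES, previous_fix=(51.0, 1.0)))
```

         | Raising on ambiguity rather than taking the first match is
         | deliberate: a wrong guess is worse than an explicit
         | rejection of that one route.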
         | 
         | 'nm' and 'NM' are the accepted abbreviations for nautical miles
         | in the aviation industry, whether official or not.
        
         | buildsjets wrote:
         | Every aircraft I've ever flown as either Pilot in Command or
         | required crewmember, and also every marine navigation system I
         | have used in my life has displayed distance information as nm,
         | Nm, or NM, interchangeably. I have never been confused by this,
         | and I have never seen any other crew be confused. I have not
         | ever seen any version of nmi used, in any variation of
         | capitalization. This includes Boeing flight decks, Airbus
         | flight decks, general aviation Garmin equipment, and a few MIL
         | aircraft. And some boats.
        
       | chefandy wrote:
       | As an aside, that site's cookie policy sucks. You can opt out of
       | some, but others, like "combine and link data from other
       | sources", "identify devices based on information transmitted
       | automatically", "link different devices" and others can't be
       | disabled. I feel bad for people that don't have the technical
       | sophistication to protect themselves against that kind of prying.
        
       | amiga386 wrote:
       | This is old news, but what's new news is that last week, the UK
       | Civil Aviation Authority openly published its _Independent Review
       | of NATS (En Route) Plc's Flight Planning System Failure on 28
       | August 2023_ https://www.caa.co.uk/publication/download/23337
       | (PDF)
       | 
       | Let's look at point 2.28: "Several factors made the
       | identification and rectification of the failure more protracted
       | than it might otherwise have been. These include:
       | 
       | * The Level 2 engineer was rostered on-call and therefore was not
       | available on site at the time of the failure. Having exhausted
       | remote intervention options, it took 1.5 hours for the individual
       | to arrive on-site to perform the necessary full system re-start
       | which was not possible remotely.
       | 
       | * The engineer team followed escalation protocols which resulted
       | in the assistance of the Level 3 engineer not being sought for
       | more than 3 hours after the initial event.
       | 
       | * The Level 3 engineer was unfamiliar with the specific fault
       | message recorded in the FPRSA-R fault log and required the
       | assistance of Frequentis Comsoft to interpret it.
       | 
       | * The assistance of Frequentis Comsoft, which had a unique level
       | of knowledge of the AMS-UK and FPRSA-R interface, was not sought
       | for more than 4 hours after the initial event.
       | 
       | * The joint decision-making model used by NERL for incident
       | management meant there was no single post-holder with
       | accountability for overall management of the incident, such as a
       | senior Incident Manager.
       | 
       | * The status of the data within the AMS-UK during the period of
       | the incident was not clearly understood.
       | 
       | * There was a lack of clear documentation identifying system
       | connectivity.
       | 
       | * The password login details of the Level 2 engineer could not be
       | readily verified due to the architecture of the system."
       | 
       | WHAT DOES "PASSWORD LOGIN DETAILS ... COULD NOT BE READILY
       | VERIFIED" MEAN?
       | 
       | EDIT: Per _NATS Major Incident Investigation Final Report -
       | Flight Plan Reception Suite Automated (FPRSA-R) Sub-system
       | Incident 28th August 2023_
       | https://www.caa.co.uk/publication/download/23340 (PDF) ... "There
       | was a 26-minute delay between the AMS-UK system being ready for
       | use and FPRSA-R being enabled. This was in part caused by a
       | password login issue for the Level 2 Engineer. At this point, the
       | system was brought back up on one server, which did not contain
       | the password database. When the engineer entered the correct
       | password, it could not be verified by the server."
        
         | mcfedr wrote:
         | But no mention of this insane failure mode? If the article is
         | to be believed
        
       | fyt2024 wrote:
       | Is nm the official abbreviation for nautical miles? I assume it
       | is nautical miles. For me it is nanometers.
        
         | andkenneth wrote:
         | Contextually no one is using nanometers in aviation nav
         | applications. Many aviation systems are case insensitive or all
         | caps only so capitalisation is rarely an important distinction.
        
         | buildsjets wrote:
         | Officially, NM is the abbreviation for nautical miles when used
         | in the context of aircraft operations. It's not just a good
         | idea, it's the Law. Specifically, 14 CFR 1.2 (Part 1, section
         | 1.2) of the United States Code of Federal Regulations.
         | 
         | https://www.ecfr.gov/current/title-14/chapter-I/subchapter-A...
        
       | cbhl wrote:
       | Hmm, is this the same incident which happened last year? Or is
       | this a new incident?
       | 
       | From Sept 2023 (flightglobal.com):
       | 
       | - https://archive.is/uiDvy
       | 
       | - Comments: https://news.ycombinator.com/item?id=37430384
       | 
       | Also some more detailed analysis:
       | 
       | - https://jameshaydon.github.io/nats-fail/
       | 
       | - Comments: https://news.ycombinator.com/item?id=37461695
        
         | javawizard wrote:
         | First sentence of the article:
         | 
         | > Investigators probing the serious UK air traffic control
         | system failure in August _last year_ [...]
        
       | Joel_Mckay wrote:
       | In other news, goat carts are still getting 100 furlong-firkin-
       | fortnight on dandelions.
       | 
       | =3
        
       | convivialdingo wrote:
       | I guarantee that piece of code has a comment like
       | 
       |     /* This should never happen */
       |     if (waypoints.matchcount > 2) {
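       | 
       | For contrast, a hedged Python sketch of treating the "should
       | never happen" branch as an error scoped to one flight plan, so
       | that plan is set aside for manual handling while the rest are
       | still processed. The plan layout and all names are
       | hypothetical, and whether setting a plan aside is safe in a
       | real air traffic system is a separate question.

```python
class AmbiguousWaypointError(Exception):
    """Raised when an ident in a route does not resolve to exactly one place."""


def process_plan(plan, known_waypoints):
    """Validate one plan; raise instead of halting the whole pipeline."""
    for ident in plan["route"]:
        matches = known_waypoints.get(ident, [])
        if len(matches) != 1:
            raise AmbiguousWaypointError(
                f"{plan['id']}: {ident} has {len(matches)} matches"
            )
    return plan["id"]


def process_all(plans, known_waypoints):
    """Process every plan, collecting failures for manual review."""
    accepted, quarantined = [], []
    for plan in plans:
        try:
            accepted.append(process_plan(plan, known_waypoints))
        except AmbiguousWaypointError as exc:
            # Only this plan is set aside; the loop carries on.
            quarantined.append((plan["id"], str(exc)))
    return accepted, quarantined
```

       | Only the expected error type is caught here, so a genuinely
       | unknown failure still stops the run.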
        
       | GnarfGnarf wrote:
       | Funny airport call letters story: I once headed to Salt Lake
       | City, UT (SLC) for a conference. My luggage was processed by a
       | dyslexic baggage handler, who sent it to... SCL (Santiago,
       | Chile).
       | 
       | I was three days in my jeans at business meetings. My bag came
       | back through Lima, Peru and Houston. My bag was having more fun
       | than me.
        
       ___________________________________________________________________
       (page generated 2024-11-18 23:00 UTC)