[HN Gopher] NYSE Tuesday opening mayhem traced to a staffer who ...
___________________________________________________________________
NYSE Tuesday opening mayhem traced to a staffer who left a backup
system running
Author : helsinkiandrew
Score : 261 points
Date : 2023-01-26 06:47 UTC (16 hours ago)
(HTM) web link (www.bloomberg.com)
(TXT) w3m dump (www.bloomberg.com)
| jmount wrote:
| Galileo's Principle bites hard, great explanation here:
| https://99percentinvisible.org/episode/cautionary-tales/tran...
| nickdothutton wrote:
| If a single staffer SNAFU can send your exchange into chaos then
| you dun goofed at risk management and probably a whole lot of
| other management discipline.
| twawaaay wrote:
| Oh, yes. It is the staffer who made the mistake.
|
| What about people who designed it this way?
| sabujp wrote:
| yeah absolutely nothing to do with the SPX hitting a target 4k level
| and messing up everything just as it hit that level
| h2odragon wrote:
| Tuesday news blackout; Thursday "it was all Jim's fault"...
| Right.
|
| Smells like horse shit. Most of what comes out of the profession
| of "journalism" does too, lately; but this smells _strongly_.
| yakubin wrote:
| Why couldn't it have been Jim's fault? The world is run by Jims.
| h2odragon wrote:
| As others have said, if Jim's fuckup can have such large
| consequences, then there should have been backstops for Jim
| who would have shared the blame. Nothing against Jim, it's
| just he's being used as a distraction to avoid talking about
| the real issues.
| throwawaaarrgh wrote:
| Manager: "I hope nobody asks why Jim was allowed to have so
| much power with no oversight or validation"
| epistemer wrote:
| and all the trades are being reversed basically. The whole
| thing stinks.
| tgtweak wrote:
| Not all of them... only the most egregious.
| psychlops wrote:
| This and other exchanges need to be running 24/7, in large part
| to level the trading field for retail investors. The backend
| should be handled invisibly behind the scenes. There should
| further be an exchange version of a Netflix chaos monkey running
| constantly to ensure such a critical infrastructure is robust.
|
| The fact that these systems do not exist is an exchange problem,
| not a "staffer".
| bagacrap wrote:
| why do retail traders "need" to exist?
|
| Retail traders realistically have only luck to rely on to beat
| hedge funds and banks. What they do is akin to gambling, which
| is on net quite negative for those who participate in it and
| heavily regulated. Retail traders don't serve any purpose in
| our society. They don't help with efficient allocation of
| capital and anyone who might be an actual savant in trading can
| join or start a firm rather than staying independent and
| unlicensed.
| bmitc wrote:
| Seems so weird to not have automated checks for something that
| seems to be described as "someone left the light on" and also not
| have the exchange automatically initiate itself. However, it
| still isn't that clear what the problem was. Were prices not
| "real" or correct?
|
| Stuff like this will happen more and more. We treat software
| driven systems rather recklessly.
| mrkeen wrote:
| More automation -> more code -> more things to go wrong
| nix23 wrote:
| >More automation -> more code -> more things to go wrong
|
| More people -> more entropy -> much more things to go wrong
| wongarsu wrote:
| One person might be inattentive or drunk, but it's less
| likely that two people are. So you institute a two-person
| rule. And if that's not good enough, add a third person to
| double-check. Maybe a supervisor to observe the people
| doing all of the above, to catch any mistakes or negligent
| behavior. Also have them write down the steps they have
| taken, and have somebody else read through that to verify.
| Just keep adding people until you are satisfied with your
| odds (and hope you are not making it worse through second-
| order effects)
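The "keep adding people" argument above can be put into numbers. This is only a minimal sketch, assuming each checker misses a mistake independently and with the same probability; both assumptions are illustrative, and the second-order effects mentioned above are exactly what breaks them. The function names are hypothetical.

```python
def miss_probability(p_single: float, reviewers: int) -> float:
    """Chance that every reviewer misses the same mistake.

    Assumes (hypothetically) that reviewers fail independently and
    each misses with probability p_single.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability")
    if reviewers < 1:
        raise ValueError("need at least one reviewer")
    return p_single ** reviewers


def reviewers_needed(p_single: float, target: float) -> int:
    """Smallest number of independent reviewers whose combined miss
    probability is at or below target."""
    if not 0.0 <= p_single < 1.0:
        raise ValueError("p_single must be in [0, 1)")
    if target <= 0.0:
        raise ValueError("target must be positive")
    count, prob = 1, p_single
    while prob > target:
        count += 1
        prob *= p_single
    return count
```

If the reviewers share a blind spot (same checklist, same tired shift), the real miss rate will be worse than this model suggests.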
| nix23 wrote:
| So two people are more reliable than an automated system, you
| wanna say? That's totally wrong....
| reaperducer wrote:
| Sounds like you've never experienced a catastrophic
| failure due to an automation that didn't work right.
|
| I did last year, and my company is in the process of de-
| automating certain processes that can endanger that
| company if they go wrong.
|
| There are many things in tech that are too important to
| automate.
|
| I'd even posit that the more experience you have in tech,
| the more you've seen how things go wrong, and the more
| you realize that automation is a tool for humans to use,
| not a replacement for humans doing a task.
| nix23 wrote:
| >Sounds like you've never experienced a catastrophic
| failure due to an automation that didn't work right.
|
| Much much much more due to human error......but hey maybe
| you are the worst programmer ever..but even then i would
| say your programs are more reliable than a human.
| themitigating wrote:
| This employee left the backup system running. There's
| obviously some automation but what is the solution?
|
| Process changes that people have to remember or more systems
| to prevent the issue. So I don't get your statement related
| to this article
| nix23 wrote:
| >There's obviously some automation but what is the
| solution?
|
| Shutdown and wake-up time in bios of server and switch ;)
| [deleted]
| mrkeen wrote:
| I don't imply that there is a solution.
|
| We will simply create a second system (B) to monitor the
| first system (A). Now we have two systems to maintain.
| System B _will not_ be capable of steering A by itself. So
| we still need to know how to diagnose and repair A, and we
| also need to know about B too. Maybe system B can talk to a
| Prometheus/Grafana stack (if it's up). And that can put
| alerts into Slack (which we ignore because there's _always_
| alerts in Slack). And after standup we can take turns
| looking at graphs with consternation.
|
| > Stuff like this will happen more and more. We treat
| software driven systems rather recklessly.
|
| That sentence is where I go when I hear the word
| 'automation'.
| bongobingo1 wrote:
| The great thing about automation is the breadth, depth and
| speed at which I can propagate mistakes.
| WJW wrote:
| Some submarine operations are deliberately not automated,
| because if eg a sensor is broken or miscalibrated it could
| sink the entire boat if a computer very rapidly acts on the
| wrong information. Rather they do those operations with one
| person operating the valve/machine/device/etc, another
| watching and confirming readings, and the whole thing is on
| a constant audio link with a third person in an engineering
| room who watches the readings through a centralized system.
|
| It's clearly not optimized for efficient use of personnel,
| but the personnel complement will have been designed to
| provide sufficient people at all times and the cost of
| getting it wrong can be very large indeed.
| JackFr wrote:
| Working on the repo desk of a large Japanese bank in New
| York in the 90's. There was a big (both in font size and
| magnitude) number that was on the upper left of the
| blotter system that ran on every trader's desk, which
| represented the total we had to borrow that day to fund
| the bank's trading book. There would be a number below it
| which would represent how much we had borrowed so far.
|
| It was "too important to automate" so a trading assistant
| keyed it in every morning. One morning he typed the wrong
| number and the mistake was in the billions digit.
|
| At 2:45 "the cage" called the repo desk and said "You
| know you guys are still short a billion, right?"
|
| There was then a flurry of activity as traders got on the
| phone to try to borrow a billion dollars in fifteen
| minutes, while also trying to not let on we were kind of
| over a barrel. The head of fixed income prepared his
| explanation to the Fed about why we needed to borrow a
| few hundred million overnight.
|
| The number got automated in our next release, and the
| open procedure was changed to the trading assistant
| verifying the number against the "cage" report.
| papito wrote:
| I will argue that the corporate world is the exact
| opposite of the training and discipline of a submarine
| crew. Most of the time I wonder how the businesses even
| survive the chaos and mismanagement, let alone make
| money.
| MrYellowP wrote:
| > It's clearly not optimized for efficient use of
| personnel
|
| I disagree. When the sub sinks, all these people die.
| That's far more inefficient use of personnel.
|
| It feels like you're putting more emphasis on the
| material cost ("the cost ... can be very large indeed")
| than on what actually matters.
| wongarsu wrote:
| It feels like a wartime-vs-peacetime priority problem.
|
| In wartime you would care about efficiency, build the
| largest number of subs staffed with the minimum crew. And
| since training crew quickly becomes the bottleneck you
| would probably go for the highest degree of automation
| that doesn't impact production times too much. In
| peacetime, efficiency isn't as important. What is
| important is the bad PR of losing one of your submarines
| in a training exercise or on patrol, so crew safety
| becomes a much bigger concern.
|
| Losing those sailors in war would have been a noble
| sacrifice for the cause, losing them to the exact same
| accident in peacetime is a national tragedy.
| UncleEntity wrote:
| Losing a ship (boat?) during peacetime is a PR disaster.
|
| Losing one during wartime can cost you the war.
|
| An inefficient one that probably won't sink due to combat
| damage can stay in the fight long enough to matter.
| WJW wrote:
| Submarines are referred to as boats rather than ships due
| to naval tradition.
|
| As a former naval officer I can only say that technically
| losing any vessel could be the one that loses a war, just
| as any soldier lost could be the straw that breaks the
| camel's back. But if your navy is so rickety that the
| loss of a single vessel is enough to lose then the main
| deficiency was in planning rather than any specific
| warship loss. Losing one in peacetime should never happen
| but is not unheard of even in modern times. See eg the
| Kursk or the Fitzgerald.
| UncleEntity wrote:
| It's also not unheard of for a single vessel to have
| outsized effects on an entire war
| https://amp.theguardian.com/world/2017/oct/20/enigma-
| code-u-...
|
| It could also be argued that the Romans capturing a
| single Carthaginian warship turned the tide for their
| entire empire.
| TeMPOraL wrote:
| This is now my headcanon explanation for why starships in
| Star Trek still require large crews (or crews in
| general).
| mschuster91 wrote:
| At least in ST:VOY, it has been shown that a single
| hologram is capable of running the ship - although one
| might argue "exceptional circumstances" ;)
| wongarsu wrote:
| In Star Trek III they jury-rig the Enterprise to fly with
| a crew of 5, instead of the regular crew of 400.
|
| Though in that state it can't do much more than fly:
| combat capabilities are strongly diminished, maintenance
| doesn't happen, post-combat repairs are out of the
| question, science missions would be much harder. On
| occasion the Enterprise has transported 150 passengers,
| so I imagine there's a lot of kitchen staff, security,
| etc. You only need 5 people to fly the ship, maybe 40 to
| fly sustainably with maintenance, but to actually
| accomplish their regular mission you need the other 300
| people.
| krapp wrote:
| On the one hand, given how often sensors fail and AIs
| flip the evil bit in Star Trek, limiting automation is
| probably a good idea.
|
| On the other hand, the Enterprise won't even warn anyone
| when command staff are injured, cloned, mind-controlled
| or vanish from the ship altogether unless a human asks
| the computer where a specific person is first.
|
| Of course there are Doylist reasons for all of this but I
| do like the premise of a general fear of AI and possible
| weird space BS being a factor.
| dragonwriter wrote:
| > This is now my headcanon explanation for why starships
| in Star Trek still require large crews (or crews in
| general)
|
| The canon explanation is that automation-in-charge was
| experimented with and went really badly, though
| periodically they try something approaching it again.
|
| https://memory-
| alpha.fandom.com/wiki/The_Ultimate_Computer_(...
|
| (AI, human genetic engineering, and a number of other
| areas of technology are affected by variants of this
| issue in the Trek canon.)
| junon wrote:
| I've not seen it put so eloquently.
| dsr3 wrote:
| To err is human, to really foul things up requires a
| computer
|
| - William E. Vaughan
| https://quoteinvestigator.com/2010/12/07/foul-computer/
| erik_seaberg wrote:
| "A computer lets you make more mistakes faster than any
| other invention, with the possible exceptions of handguns
| and tequila."--Mitch Ratcliffe
| ynniv wrote:
| Knight Capital will forever haunt fintech engineers...
| https://www.henricodolfing.com/2019/06/project-failure-
| case-...
| baxtr wrote:
| [flagged]
| qrybam wrote:
| The auction is meant to find a stable price and can have some
| wild prices coming in, because no matching happens at the point
| of entry, the market will naturally find a level before trading
| commences. In this case, those wild prices were matching,
| resulting in crazy trades no-one would have expected.
|
| You'd be surprised how many manual processes there are in
| places like this. It's a combination of legacy systems /
| processes, and a general paranoia around automation going
| wrong. I wouldn't be surprised if they always have someone
| there to shepherd the system along.
| benjaminwootton wrote:
| When I worked on a trading platform, I spent many a happy
| Sunday night waiting for the Australian market to open and
| watch the first orders go through successfully.
|
| We had hundreds of jobs and upgrades happening over each
| weekend. It definitely needed an eye casting over it
| regardless of the automation.
| xwolfi wrote:
| I am working at one now and nothing has changed.
| Automations are so many we recruit an army of people just
| checking if they ran, while knowing how to replace them if
| they didn't (or, more accurately, who to call at night to
| fix it asap).
|
| And the AU orders going through is a good sign, but it's
| far from guaranteeing a free monday, as Japan, Korea or
| Shanghai can fuck it up, each in their own little ways.
| Hong Kong is the best, low regulatory crap, invested
| regulator, high volume low latency traffic everyday
| (relative to the region), I can't recall a time it broke.
|
| Once, someone fat fingered an excel import at close, and we
| lost our trading license for that entire country for 18
| months. And we're not small. But the amount mismatched at
| settlement was super tiny. High attack surface, low
| holistic understanding (it works despite us, we honestly
| have no clue sometimes), heavy consequences on screwup.
| helsinkiandrew wrote:
| Matt Levine at Bloomberg gives a good explanation:
| https://www.bloomberg.com/opinion/articles/2023-01-25/nyse-f...
|
| Basically at the market open all the requests to buy and sell
| get matched at the same "open" auction price, then (a second
| later) the orders get sent to the order book, where the price
| can go up and down based on size. Because the system didn't
| think there was an opening, there wasn't the opening auction,
| the prices went straight to the book and there were large
| swings in price.
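The difference described above, one opening auction price versus orders hitting the book one at a time, can be sketched in a few lines. This is a toy model, not how the NYSE matching engine works: the `Order` type, the volume-maximizing price rule, and the price-priority book are all simplifying assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    side: str     # "buy" or "sell"
    price: float  # limit price
    qty: int


def opening_auction_price(orders):
    """Return (price, volume) for the single price that matches the
    most shares, or None if nothing crosses (toy rule, ties go to the
    lowest candidate price)."""
    best = None
    for p in sorted({o.price for o in orders}):
        demand = sum(o.qty for o in orders if o.side == "buy" and o.price >= p)
        supply = sum(o.qty for o in orders if o.side == "sell" and o.price <= p)
        volume = min(demand, supply)
        if volume > 0 and (best is None or volume > best[1]):
            best = (p, volume)
    return best


def continuous_match_prices(orders):
    """Feed orders to a resting book one at a time and return the
    price of each fill, in order. Each fill prints at the resting
    order's price."""
    bids, asks, trades = [], [], []
    for o in orders:
        book, opposite = (bids, asks) if o.side == "buy" else (asks, bids)
        # Best opposite price first: lowest ask for a buyer,
        # highest bid for a seller.
        opposite.sort(key=lambda r: r[0], reverse=(o.side == "sell"))
        remaining = o.qty
        while remaining and opposite:
            price, qty = opposite[0]
            crosses = price <= o.price if o.side == "buy" else price >= o.price
            if not crosses:
                break
            fill = min(remaining, qty)
            trades.append(price)
            remaining -= fill
            if fill == qty:
                opposite.pop(0)
            else:
                opposite[0] = (price, qty - fill)
        if remaining:
            book.append((o.price, remaining))
    return trades
```

With the same set of orders, the auction yields one price for everyone, while the continuous path prints whatever price happens to be resting when each order arrives, which is where the swings come from.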
| usefulcat wrote:
| https://archive.is/LovhH
| tgtweak wrote:
| I don't think it's that they "left it running" like you would
| leave a backup app running... they literally left the entire
| disaster-recovery site up and running and live. Cermak
| (referred to in the article as the "backup") is an entire
| datacenter, hosting a running copy of the exchange to be used
| in a failover scenario.
|
| You'd have to have more than 1 person involved to forget that
| DR is still active when completing these failover exercises and
| tests off-hours.
| Kon-Peki wrote:
| > Cermak (referred to in the article as the "backup") is an
| entire datacenter
|
| Well, they are a tenant at the Cermak data center. It's a
| truly massive building with huge amounts of connectivity and
| colo opportunities. Probably also the only 100+ year old data
| center building on the US register of historic places, lol
| (it's a former catalog printing facility, built to hold
| insanely heavy printing presses on 8 or 9 really tall floors,
| so it has no problems with densely-packed server racks)
| tgtweak wrote:
| Yeah my point is only that it's a full DR site, not a
| "backup" that was left running as the article pointed out
| (and as a lot of commenters are insinuating).
| chiefalchemist wrote:
| While we're on the subject, if your company uses technology in
| any capacity, read the book The Phoenix Project.
|
| https://www.amazon.com/Phoenix-Project-DevOps-Helping-Busine...
| dogleash wrote:
| I wanted to like that book but it's a big Just-so story.
|
| tl;dr: The brave knight implemented devops and everyone lived
| happily ever after!
| h3daz wrote:
| I am short a couple dozen naked Feb 3 calls on a lot of affected
| tickers and almost had a heart attack when looking at one of them
| on my phone. Thankfully I was not in front of my computer at the
| time because I have no idea how my broker was managing my margin
| at open.
| qeternity wrote:
| The LULD breakers saved everyone from more chaos here because
| effectively as soon as the market opened, all these symbols
| were halted.
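For readers unfamiliar with LULD (Limit Up-Limit Down), the core idea is a price band around a reference price. The sketch below is deliberately simplified: the 5% band and the function names are illustrative assumptions, and the real rules (band width by tier and price, how the reference price is computed, how long a limit state lasts before a pause) are not modeled here.

```python
def luld_bands(reference_price: float, band_pct: float):
    """Return (lower, upper) price bands around a reference price."""
    if reference_price <= 0 or band_pct < 0:
        raise ValueError("reference_price must be > 0 and band_pct >= 0")
    return (reference_price * (1 - band_pct), reference_price * (1 + band_pct))


def outside_bands(price: float, reference_price: float, band_pct: float = 0.05) -> bool:
    """True if a price falls outside the bands.

    band_pct=0.05 is an illustrative default, not the actual
    percentage for any particular symbol.
    """
    lower, upper = luld_bands(reference_price, band_pct)
    return price < lower or price > upper
```

A symbol that opens far from its reference price trips this check almost immediately, which matches the "halted as soon as the market opened" behavior described above.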
| mellosouls wrote:
| These "issue traced to staffer" stories sound like management
| cover up for management/system shortcomings to me.
|
| Systems with such significant potential impact, and in industries
| where lack of financial investment in their continuity is a
| deliberate choice have very little excuse to be passing the buck
| to grunts for basic process flaws that can be triggered by
| individual error.
| NoboruWataya wrote:
| They're not mutually exclusive. A staffer leaving a backup
| system running may well have been the proximate cause of the
| issue but, if true, it was also likely a management/system
| issue as you say. The article is a bit strange in that it
| doesn't attribute the fact in the headline to any source. I
| don't see anything from the NYSE saying "it's all that guy's
| fault". On the contrary it says:
|
| > [NYSE execs] plan to examine the platform's procedures and
| management, potentially reworking rules to be more flexible and
| provide further protections.
|
| Sounds like they know it's a management issue. The headline
| probably focuses on the staffer leaving the backup system
| running simply because it's a better headline.
| zmmmmm wrote:
| yes ... if an organisation has a critical process where a
| single human making a mistake can cost it $millions then it
| has a management / process level issue not an issue with the
| human. Humans make mistakes. Apart from other obvious issues
| with it, creating a context where individual mistakes lead to
| horrific outcomes will create a toxic and horrifically
| stressful workplace - I would actively avoid working in a
| situation like that myself.
| dsfyu404ed wrote:
| > These "issue traced to staffer" stories sound like management
| cover up for management/system shortcomings to me.
|
| At some point you need to strike a balance between
| freedom/flexibility and stupid proofing.
|
| HN goes real hard on the "people are idiots and we should
| design things that no matter what buttons get mashed it all
| works out fine" side of things but in the financial world the
| balance is struck a little further on the "train our employees
| to not be idiots" side of things.
|
| Furthermore, it's usually better optics to blame things on
| people because people can easily and cheaply alter their
| behavior (per incremental change). If you blame the
| outage on systems it raises questions of when it will be fixed
| and how much $$.
|
| As an aside, it was almost certainly not individual error. At
| places like NYSE you pretty much always have 2-3 people who
| should be in a position to catch a mistake like this.
| mcherm wrote:
| > it was almost certainly not individual error. At places
| like NYSE you pretty much always have 2-3 people who should
| be in a position to catch a mistake like this.
|
| That's exactly the point that is being made here. Either the
| message being put out by the NYSE claiming this was an error
| by one individual is true -- in which case, NYSE leadership
| is to blame for setting up a process that allows catastrophic
| consequences for a single individual's error, OR the message
| being put out by the NYSE is a fabrication designed to
| redirect blame at some scapegoat, in which case NYSE
| leadership is to blame for putting out a false or misleading
| statement.
|
| [Edit: It seems I misunderstood -- attributing this to an
| individual was done by reporters and rumors, not by a formal
| statement from NYSE.]
| reaperducer wrote:
| _These "issue traced to staffer" stories sound like management
| cover up for management/system shortcomings to me._
|
| If you're going to move the blame up the food chain, might as
| well blame the shareholders for giving the company money and
| choosing to keep the upper management in place.
| blippitybleep wrote:
| Sounds like many places I've worked. I think most devs have had
| a job like that.
| GuB-42 wrote:
| We can always blame management, since they make decisions,
| including hiring, we can always trace back problems to
| management. But it is as unhelpful as blaming the grunts.
| Management has the job of making the company profitable, if
| they don't employees won't get paid, investors will lose money,
| and ultimately the company will fail and customers won't get
| service. And just like the "grunts", they are not perfect,
| sometimes, they make mistakes, sometimes, they have to take
| chances.
|
| In fact, blaming anyone is unhelpful unless blatant misconduct
| is the problem, and I don't think it is the case here. As
| always, shared responsibilities. I just wish for different
| wording, something like "NYSE Tuesday opening mayhem traced to
| a backup system not properly shut down". Leave the "staffer"
| part to the technical report. It is useful information for
| investigating the problem and fixing what needs to be fixed, but it
| is inconsiderate for a press release.
| Waterluvian wrote:
| The difference is that one asserts authority over the other.
| bigpeopleareold wrote:
| I had a project that was really important once. I made a tiny
| mistake that had a big consequence - a couple of hours of
| potential lost revenues from our customers. I fixed my mistake
| with both my boss and CEO nearby. I said after I pushed the fix
| that I really need more resources around it. That little light
| of "yeah, this is important" that should have flickered didn't.
| :)
|
| I will not be surprised if nothing gets fixed with the issue at
| NYSE.
| 1980phipsi wrote:
| If you were to ask what is the probability that this specific
| error happens again, I would think it would be pretty low.
| Probably lower than a week ago. If you were to ask what is
| the probability that some significant, costly error happens
| again, I don't think the probability is that much lower than
| a week ago.
| smcin wrote:
| But how much _actual_ lost customer revenue? Also, did the
| customer even notice or not?
|
| You're reminding me of the difference between engineers and
| non-technical managers; to many of the latter something's
| only a problem if/when the customer or senior mgmt are on the
| phone complaining about it. Until then it's all engineers
| being too pessimistic about process and risk.
| hinkley wrote:
| We have a Slack channel where we are expected to announce all
| of our production changes.
|
| Some updates are highly regimented, but a couple of the more
| operational teams have discretion to deploy things outside of
| that process, and most teams can flip feature toggles whenever
| they want.
|
| Point is that sometimes people will comment, or even veto
| changes. We have a major customer visiting today, or the sales
| team is at a conference. Don't touch anything or you might
| break something.
| wrldos wrote:
| This. If an individual's mistake can take out your business you
| have a process control problem and that is owned by management.
| JumpCrisscross wrote:
| > _individual's mistake can take out your business_
|
| It didn't.
| cm2187 wrote:
| If you have processes where there is nothing an employee can
| do to affect the outcome of the company you successfully
| built a legacy bureaucracy that is waiting to be disrupted.
| hgsgm wrote:
| On the contrary, if a single employee can take out the
| whole business, you are guaranteeing disruption.
|
| There are many kinds of "outcomes". A simple backup would
| make outages far more rare.
| yourapostasy wrote:
| In this specific case, I don't think that's necessarily the
| outcome. Our industry has yet to accept a universally-
| acknowledged equivalent of a lockout/tagout (LOTO)
| interlock. There is no need for a bureaucracy if we have
| cryptographically-enforced multisig Shamir secret sharing
| keys where a LOTO prevents (in this case) a system from
| spinning up while another system (the backup system
| apparently in this case) is running. Allow it to be
| overridden by a sufficiently senior manager or say a
| sufficient number of lower-seniority managers, which leaves
| an audit trail. Integrate with a change management,
| notification, secrets storage infrastructures, and
| infrastructure as code, and it encodes these infrastructure
| dependencies into code, and can be queried to auto-
| construct change interlock sequences for a particular
| desired state.
|
| Of course, once you take advantage of such a representation
| at scale by deploying tremendously more complex
| infrastructures, you then have to deal with the dependency
| network meta challenge lest you inadvertently fall into
| dependency hell. While towards there lies NP-hard problems,
| they're still computable to a reasonable degree and I dare
| say a more robust situation than doing it all by hand like
| we do today.
|
| The real challenge is the vast majority of devops staff
| today would really dislike reasoning about such a
| representation when it blows up in their faces, and I can't
| blame them for that kind of reaction.
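The lockout/tagout idea above can be sketched as a small interlock: a system refuses to start while a conflicting system is tagged as running, unless enough distinct approvers override it, and every action lands in an audit trail. This is a minimal sketch under loud assumptions: the `Lockout` class and its methods are hypothetical, and the override is a plain head count standing in for the cryptographic multisig / Shamir secret sharing the comment describes.

```python
class InterlockError(RuntimeError):
    """Raised when a start or clear is refused by the interlock."""


class Lockout:
    def __init__(self, override_quorum: int = 2):
        self._tags = {}  # system name -> who placed the tag
        self.override_quorum = override_quorum
        self.audit_log = []  # (action, system, detail) tuples

    def tag(self, system: str, owner: str) -> None:
        """Mark a system as running / locked out by owner."""
        self._tags[system] = owner
        self.audit_log.append(("tag", system, owner))

    def clear(self, system: str, owner: str) -> None:
        """Only the person who placed a tag may remove it."""
        if self._tags.get(system) != owner:
            raise InterlockError(f"{owner} does not hold the tag on {system}")
        del self._tags[system]
        self.audit_log.append(("clear", system, owner))

    def start(self, system: str, blocked_by, approvers=()) -> bool:
        """Start a system unless a conflicting tag is active.

        A quorum of distinct approvers can override; this is a simple
        count, NOT a cryptographic check.
        """
        active = [s for s in blocked_by if s in self._tags]
        distinct = tuple(sorted(set(approvers)))
        if active and len(distinct) < self.override_quorum:
            raise InterlockError(f"{system} blocked by active tags: {active}")
        self.audit_log.append(("start", system, distinct))
        return True
```

In this model, a backup site left tagged as running would block the primary from opening until someone clears the tag or two managers sign off on the override.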
| quantgenius wrote:
| It's very easy to talk about completely automated systems
| and LOTO and you need these when you have under-skilled
| staff. The NYSE likely does NOT have undertrained staff.
| If you have LOTO systems etc, what do you do when a
| sensor fails and you can't figure out why your method for
| checking whether the other system is running incorrectly
| thinks it is? Do you allow the stock market to simply not
| open?
|
| What if multiple sensors fail or it's an ambiguous
| situation like say you are deciding whether or not to
| fail over a power circuit and it's a brownout but not a
| complete power failure? What if there is a systemic
| problem and it's likely the backup power source is going
| to brown out too? At some point you need highly skilled
| individuals, like say trained airline pilots flying a
| plane who have the authority to override systems
| immediately without having to jump through hoops.
|
| This is especially true for mission critical systems.
| Many of the mission critical systems we rely on are NOT
| built on the cloud, i.e. other people's computers because
| you want to be really careful about what hardware you are
| using, precisely how your data center is setup and want
| to make sure things like a noisy neighbor do not impact
| you.
|
| Like it or not, these highly trained individuals are
| going to make mistakes every now and then. A failure like
| this once every decade or so really isn't so bad. The
| individual who made this error is likely not a "grunt". I
| suspect the individual in question will not necessarily
| suffer any major consequences as a result of this unless
| it wasn't a mistake but a flagrant disregard for the
| rules like say bringing a bottle of water into a data
| center that then spilled or something.
|
| Have you built a mission critical, distributed system
| that hasn't failed for 10 years? It's a lot harder than
| it looks. That's how often the NYSE has a problem like
| this, about once a decade. A lot of things that work in
| theory, don't work for the edge cases and things that
| lead to problems once a decade or so are extreme edge
| cases.
|
| In the grand scheme of things a mucked up opening auction
| is a minor problem and anyone who did not take the
| precaution of sending a limit order and sent a market on
| open order despite it being standard practice to
| essentially always use limits and got hurt badly will be
| made whole.
| yourapostasy wrote:
| _> If you have LOTO systems etc, what do you do when a
| sensor fails..._
|
| It pretty much boils down to: it depends upon what the
| business wants to prioritize; operating margin or
| resiliency. There is an entire subfield investigating the
| statistical foundations of resiliency, and the general
| case of N-modular redundancy is in practice implemented
| as triple modular redundancy in most commercial systems
| that want to spend in this vector.
|
| _> Like it or not, these highly trained individuals are
| going to make mistakes every now and then._
|
| Absolutely, and here is where the organization's no-blame
| learning culture swings into action for the well-led
| teams.
|
| _> It's a lot harder than it looks._
|
| We all know this, and we can all help each other get
| better to deliver ever increasing value to our customers
| by sharing what works for the context we deployed within!
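Triple modular redundancy, mentioned above as the common practical case of N-modular redundancy, comes down to a majority voter over independent readings. A minimal sketch, assuming the readings are directly comparable values; the function name is hypothetical, and returning `None` when there is no strict majority is one possible design choice, not a standard.

```python
from collections import Counter


def majority_vote(readings):
    """Return the value reported by a strict majority of modules,
    or None if no value has more than half the votes."""
    if not readings:
        raise ValueError("need at least one reading")
    value, count = Counter(readings).most_common(1)[0]
    if count * 2 <= len(readings):
        return None  # no strict majority: escalate to a human
    return value
```

With three sensors, one faulty reading is outvoted; if all three disagree, the voter gives up and a person has to decide, which is the situation the "what do you do when a sensor fails" question is about.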
| quantgenius wrote:
| You don't get to major failures once a decade (or less)
| on systems this complex without understanding and in fact
| being on the cutting edge (likely ahead of what you read
| in journal articles written by academics) of the
| statistical foundations of resiliency, n-modular
| redundancy etc.
|
| In real-life outside of a journal article, it's a lot
| harder than just deciding whether you want to prioritize
| operating margin or resiliency at 5000 feet.
|
| In real life when these sorts of edge cases happen, you
| have to understand in minutes or sometimes seconds the
| tradeoffs in terms of costs to your own company and your
| customers of one of n specific possible failure modes and
| risk-manage so you minimize the probability of the
| catastrophic outcomes. This sometimes may involve
| increasing the probability of low cost bad outcomes. You
| can't reason about this stuff before hand. If you could,
| you would have designed your system to not fail in that
| manner.
| itsoktocry wrote:
| > _If you have processes where there is nothing an employee
| can do to affect the outcome of the company you
| successfully built a legacy bureaucracy that is waiting to
| be disrupted._
|
| Exactly.
|
| I wonder if any of the people claiming "it's management's
| process fault!" would be the first to complain about their
| workplace where they have no autonomy.
| iancmceachern wrote:
| Exactly, management shouldn't allow this kind of situation to
| occur by designing the company's processes such that there
| are checks and balances
| gonzo41 wrote:
| Everyone makes mistakes. HOWEVER, modern leaders get to the
| heads of large organizations by never making mistakes, by
| blaming the little guy, or coworker and hustling up. They'll
| only fix this because they have to. There isn't a problem
| until it happens.
| thisarticle wrote:
| The days of management taking responsibility for anything are
| over. See: not a single CEO stepping down for overhiring.
| wonderwonder wrote:
| This is because the CEO's core job is to raise stock price.
| Nothing else. They hired in covid and profits & share price
| spiked due to the economic state at the time. Now the
| economic state has changed so they fire employees and the
| stock goes up. By that metric, the CEO will get a bonus at
| the end of the year. CEO does not get a bonus for not
| laying people off. Employees are not humans once you get to
| the csuite. An employee can be a person but multiple
| employees are just numbers on a ledger. They just send out
| "I'm sorry" emails to placate the masses and to get good
| media, no one really cares if the lower level people are
| upset. You only count once you get to a certain level.
| duckmysick wrote:
| There's plenty of companies replacing their CEOs. Just
| today Toyota announced theirs.
| Octoth0rpe wrote:
| The CEO of Toyota is becoming the chairman of their
| board, that doesn't feel like a CEO being replaced as
| punishment for poor performance in the way that people
| are talking about in this thread. But even when CEOs are
| fully ousted over issues, the golden parachute makes it
| barely feel like a punishment anyway. I'm having trouble
| thinking of a case where a CEO actually seemed to be
| significantly financially impacted by such an event,
| though maybe FTX will provide an example shortly.
| hgsgm wrote:
| Are the golden parachutes bigger (as % of annual comp)
| than employee severance packages?
| Octoth0rpe wrote:
| Do you think that honestly matters in the 10s of millions
| of dollars range? I certainly don't. The problematic
| parachutes in question are beyond enough for an
| excessively wealthy standard of living for the rest of
| their natural lives, even if it's proportionally smaller.
| Whether or not CEO comp should be as high as it currently
| is is another question entirely.
| lotsofspots wrote:
| Oh, they take full responsibility, it always says so in the
| mails they send out. It's just that taking responsibility
| doesn't appear to actually result in anything happening.
| parsimo2010 wrote:
| A cynical take might be that they are saying that they
| take responsibility (credit) for reducing the monthly
| payroll expenses. They may also have overhired in the
| past, but what's in the past was already paid for. The
| savings next month is how they justify a large paycheck.
| shapefrog wrote:
| Macroeconomic changes have made it impossible for me to
| want to pay you
|
| https://news.ycombinator.com/item?id=34515267
| j33zusjuice wrote:
| Their punishment is in bearing the shame of having been
| wrong. That's the price of leadership.
| randomdata wrote:
| What shame is there in being wrong? Being wrong is the
| ideal state, paving a path to gaining an education, which
| is a source of pride and a benefit.
| [deleted]
| noisy_boy wrote:
| I don't see why we reward scale-out/scale-in in the cloud
| but punish CEOs when they do the same with real people /s
| hgsgm wrote:
| How will those poor decommissioned computers get enough
| bytes to feed themselves?
| oxfordmale wrote:
| They are taking responsibility. They are just delegating
| the consequences to their staff. I suspect this will change
| soon. Activist investors are already surrounding companies
| like Salesforce and I can see CEOs being promoted sideways
| (board member only).
| itsoktocry wrote:
| > _See: not a single CEO stepping down for over hiring._
|
| Wait, what? You think a CEO should _step down_ because
| their management over-hired a relatively small proportion
| of employees and had to do some layoffs?
| hgsgm wrote:
| It's not relatively small. All the companies are
| experiencing similar chaos to the NYSE because people in
| the middle of important operational work suddenly
| vanished. The people laid off weren't idle like H&R Block
| tax preparers in May or Target clerks in January.
|
| The people laid off and the people not needed were a
| different set of people, at the time of the layoff.
| horsawlarway wrote:
| Not really sure why you're getting downvoted, other than
| to assume an emotional reaction from the community to
| layoffs impacting tech.
|
| Frankly - People seem to be forgetting that until 2013,
| MS was still doing stack ranking and routinely letting go
| of the bottom 10% of their workforce (and they were
| hardly the only ones doing it...)
|
| I don't see it as unusual _AT ALL_ that these companies
| are doing a wave of cuts to headcounts after the large
| hiring sprees during covid. Especially as interest rates
| rise, so they're looking to lower debt burdens in the
| short term and pay off loans made at low interest rates
| instead of rolling into a higher interest loan in the new
| environment.
|
| If anything... I'd expect the exact opposite - a CEO that
| fails to address cost centers as debt becomes more
| expensive is a liability, and someone the board might be
| looking to replace (ask to step down).
|
| ---
|
| Does that mean I'm not sympathetic to those who've lost
| jobs? Of course not.
|
| But tech had to rev the engine pretty hard to handle the
| extra load during covid when everyone was indoors and
| doing things online, and now that demand has dropped. So
| they're letting off the gas pedal.
|
| If folks don't like it - blame the game. Work to
| unionize. Work to incentivize co-ops and shared
| ownership. Work to increase taxation on these companies
| and their highest earners (which... if you're in the tech
| industry almost certainly includes _YOU_). Don't go work
| for giant tech conglomerates and then act surprised when
| they act like giant tech conglomerates...
| DrBazza wrote:
| I'll make an extreme comparison:
|
| "Kill one man, and you are a murderer. Kill millions of
| men, and you are a conqueror"
|
| If you make some idiotic financial decision near the
| bottom of the management tree, such as... over hiring,
| you'll likely lose your job or get demoted.
|
| Do it as a CEO, and get a huge bonus.
|
| [1] https://en.wikipedia.org/wiki/Jean_Rostand
| zaroth wrote:
| But it's absurd. Companies are not supposed to _only ever
| hire_.
|
| Some things are cyclical and you need more people for
| some amount of time, and then you find you need less.
| It's not always predictable/seasonal like farming or
| holiday rush.
|
| Is it wrong for a company to respond to market effects?
| That there was a layoff isn't necessarily a sign a
| company did anything wrong... I think how they actually
| do the layoff certainly can be done well or poorly.
| DrBazza wrote:
| It's not hiring though. It's overhiring.
|
| I've forgotten which FAANG it is. But one of them still
| has more employees than last year even after layoffs.
| It's offensive.
| usefulcat wrote:
| So if they "under hire", should they step down for that
| too?
|
| Maybe they should step down any time they fail to
| accurately predict the future?
| SpeedilyDamage wrote:
| Offensive? I'm... honestly, baffled. How could one tech
| company's ability to hire many more people actually
| offend you?
| horsawlarway wrote:
| It's a response to extreme demand during covid. When -
| you know - online service usage was at all time highs
| because everyone was stuck inside and doing things
| online.
|
| It was likely the right call to hire then, just like it
| might be the right call to reduce headcount now.
| icedchai wrote:
| Why is it offensive? Over-hiring has been a thing since
| at least the first dot-com boom. One's managerial power
| is directly proportional to how many "reports" they have
| under them. I worked at one company that raised a decent
| A round. We immediately rented another office down the
| street, spent close to 2 million on renovations, then
| filled it with anyone who could spell HTML. The B round
| was even larger, so the cycle continued (until late 2001
| or so.)
| logifail wrote:
| > The days of management taking responsibility for anything
| are over. See: not a single CEO stepping down for over
| hiring
|
| The list of managers stating that "they were taking
| responsibility" _and then immediately stepping down_ was
| always fairly short.
| cc81 wrote:
| They only take responsibility for the profit margins. Over
| hiring affects those but often not significant enough and
| can be corrected with layoffs.
| hgsgm wrote:
| No, they only take responsibility for short-term market
| cap. Margins and profit don't matter. That's why they
| chase whatever fad hits the investor class.
| PragmaticPulp wrote:
| > These "issue traced to staffer" stories sound like management
| cover up for management/system shortcomings to me.
|
| At the end of the day, the engineers are responsible for the
| engineering. Managers are responsible for managing. Shifting
| all responsibility for execution issues on to management can
| give warm fuzzies, but in reality managers aren't all powerful
| in shaping execution by engineers.
|
| Companies that put all blame on managers when things fail are
| inevitably encumbered with excessive micromanagement, as the
| managers are effectively saddled with responsibility for
| execution as well.
|
| The article was purely anonymous. I don't think it's fair to
| assume they're jumping to blame or fire individual engineers.
| willcipriano wrote:
| Do engineers have authority over engineering? Can they
| overrule management on engineering issues? Whoever takes the
| authority gets the blame.
| NovemberWhiskey wrote:
| Look it's totally OK to recognize that a human action was the
| trigger for an incident - i.e. the causal chain for this
| specific incident started there. That's not the same thing as
| saying the human action was the root cause, and I hope by-and-
| large any kind of baseline competent engineering organization
| has gotten to that level of thinking by now.
| baby wrote:
| It's also true that in systems like this there exist many
| single points of failure. There's a reason decentralized
| systems are seeing a rebirth.
| credit_guy wrote:
| It does not sound like cover up to me.
|
| It was simply the explanation of what happened. I didn't get
| any hint that the said "staffer" will be fired or otherwise
| punished.
|
| Is there a problem with a system that did not have enough
| safeguards to prevent this? For sure, but then no system is
| perfect. This glitch does not happen every day. From memory, I
| remember a NASDAQ glitch at Facebook's IPO. Let's say there are
| 2 or 3 glitches like that for major exchanges in one decade.
| How can you design a system that prevents bugs that show up
| once a decade?
| corobo wrote:
| Oh we're still doing scapegoats?
|
| If your system can be hosed by a single person the system is at
| fault. Start with the scapegoat's manager.
| anonu wrote:
| Tech can be so fragile. You do everything right, trade millions
| of shares everyday and handle billions of dollars. But you forget
| to run one script to shutdown a backup system and everything
| comes crashing down: your reputation in tatters, millions in
| costs to settle bad trades, barbarians at the gates.
| tgtweak wrote:
| https://www.nyse.com/publicdocs/support/DisasterRecoveryFAQs...
|
| > Question: Can I connect to both the production and the DR site
| at the same time?
|
| Answer: No, only one site is available at a time. When the
| primary site is up, the DR site is down; and when the DR site is
| activated, the primary site is down.
|
| I think they need to update these docs to say /should/ be down
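|
| To illustrate the "only one site at a time" rule, here is a
| minimal sketch of a guard that refuses to bring up one site
| while the other is still active. Everything in it
| (SiteController, the site names, the exception) is hypothetical
| and only shows the invariant; it is not based on NYSE's actual
| tooling.

```python
class SiteAlreadyActive(Exception):
    """Raised when activating a site would leave two sites live at once."""


class SiteController:
    """Hypothetical guard: at most one of 'primary' / 'dr' may be active."""

    SITES = ("primary", "dr")

    def __init__(self):
        self.active = None  # name of the live site, or None if both are down

    def activate(self, site):
        if site not in self.SITES:
            raise ValueError(f"unknown site: {site}")
        if self.active is not None and self.active != site:
            # Enforce the invariant instead of trusting a manual checklist.
            raise SiteAlreadyActive(
                f"cannot activate {site}: {self.active} is still up"
            )
        self.active = site

    def deactivate(self, site):
        if self.active == site:
            self.active = None


ctl = SiteController()
ctl.activate("dr")            # e.g. a DR test
try:
    ctl.activate("primary")   # someone forgot to shut DR down first
    refused = False
except SiteAlreadyActive:
    refused = True
ctl.deactivate("dr")
ctl.activate("primary")       # succeeds once DR is down
```

| The point of the sketch is that "should be down" becomes a
| checked precondition rather than a step someone has to remember.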
| irthomasthomas wrote:
| Is it normal to simply accept the word of an anonymous source for
| something so important? I genuinely don't know, anymore, but it
| doesn't seem like a good idea. I'd rather wait for a more
| thorough investigation. Especially when the story from these
| sources boils down to "Kevin was in charge of booting the NYSE
| App that morning, but he was late for work. He had a good excuse,
| though, he flaked! We'll have the chap straight up for lunch, no
| question".
|
| Edit: I also note that this piece is lacking the traditional "The
| NYSE did not respond to a request for comment".
| reaperducer wrote:
| I don't know how Bloomberg works, but the New York Times has a
| very clear and public policy about using anonymous sources.
|
| There's usually a link to it in the middle or end of any story
| it publishes using an anonymous source.
|
| The Times isn't Bloomberg, but it might give you some insight
| into how these things work.
| itsoktocry wrote:
| > _Is it normal to simply accept the word of an anonymous
| source for something so important?_
|
| Anonymous means they aren't revealing the source, not that
| Bloomberg doesn't know who the source is, or what they do.
| irthomasthomas wrote:
| I know that. I am referring to you and I, the _reader_
| accepting the word of the anonymous source. Combined with the
| fact that they apparently did not ask for a comment from the
| NYSE before publishing this. Or if they did, they neglected
| to mention it.
| maronato wrote:
| We aren't accepting the word of the anonymous source. We're
| accepting Bloomberg's word that the source is reliable.
| adolph wrote:
| Bloomberg's word:
| https://news.ycombinator.com/item?id=19526348
| bink wrote:
| Some of us aren't "accepting" anything. We're just
| reading about a potential cause of an incident and
| speculating about how it could happen to us or could have
| been prevented. Just because we're reading this article
| and commenting here doesn't mean we just believe
| everything that we read. The post-mortem will come out
| soon enough and we'll read that and comment again.
| lr1970 wrote:
| From the article:
|
| > Meanwhile, market professionals and day traders are rattled and
| waiting for the exchange to elaborate on what it publicly called
| a "manual error" involving its "disaster recovery configuration".
|
| Oh, I love it -- a disaster caused by "disaster recovery
| configuration" :-)
| gjvc wrote:
| _Oh, I love it -- a disaster caused by "disaster recovery
| configuration" :-)_
|
| People install failover configurations to minimise time-to-
| repair or time-to-resume service (and some customers' contracts
| will demand this). This is at the expense of another layer of
| stuff to go wrong, and raising the possibility that it fails
| over when it shouldn't, causing brief but embarrassing outages.
|
| It's possible in some such situations that, on the balance of
| probabilities, introducing mechanisms like this causes more
| disruption _over time_ than they were intended to protect
| against, and that this is more widespread than often
| considered. Still, their operational cost must be borne in
| order to satisfy the clause in the customers' contracts.
| jrochkind1 wrote:
| > That misled the exchange's computers to treat the 9:30 a.m.
| opening bell as a continuation of trading, and so they skipped
| the day's opening auctions that neatly set initial prices.
|
| I didn't even know about this process. I don't know much about
| trading, but it surprises me that there is a separate process for
| setting prices at the start of trading, and that if it's missed,
| chaotic prices result.
|
| Is this related to how stock markets aren't really ever open 24
| hours? Do they need that reset to function in stable way?
| khold_stare wrote:
| Worked in HFT for a few years. The reason why most markets are
| not open 24 hours is more human, and just historical - aligned
| with people's 9-5 workday. There are also pre open and post
| close sessions of trading but it's much less liquid. Futures
| markets are open almost 24 hours. Even there, it's down for
| some time daily. Personally I think it's actually inertia that
| keeps existing markets this way - the systems of the exchanges
| and participants were designed with the assumption that they
| will have daily downtime, so it's hard to change. It's also
| dependent on how banking and settlement works - a lot of stuff
| happens after the trading ends. Batch processes run as
| different institutions settle their trades between each other,
| etc etc.
|
| Now, as a result, there needs to be a way to set the opening
| price and closing price, like a bootstrap process. A smaller
| version of this process actually happens every time a stock
| gets halted and resumed.
|
| An exchange has an order book - orders of things people want to
| buy and sell at different prices. During normal operation the
| buy and sell orders don't overlap in the order book - if two
| people want to buy and sell at the same overlapping price, they
| just get matched by the exchange at that moment. Unmatched
| orders stay in the order book data structure until a matching
| order comes along. The "price" you see in charts is just the
| midpoint between the highest buy and lowest sell price in the
| order book.
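|
| As a rough illustration of the description above, here is a toy
| order book with continuous matching. It is a sketch under
| simplifying assumptions: limit orders only, integer prices,
| trades executed at the resting order's price, and names such as
| OrderBook and add_limit are made up for the example. Real
| matching engines are far more involved.

```python
import heapq
from itertools import count


class OrderBook:
    """Toy limit order book with price-time priority."""

    def __init__(self):
        self._bids = []        # max-heap via negated price: (-price, seq, [qty])
        self._asks = []        # min-heap: (price, seq, [qty])
        self._seq = count()    # arrival order, used as the time tie-breaker
        self.trades = []       # executed (price, qty) pairs

    def add_limit(self, side, price, qty):
        """Match against the opposite side, then rest any remainder."""
        if side == "buy":
            while qty and self._asks and self._asks[0][0] <= price:
                qty = self._fill(self._asks, self._asks[0][0], qty)
            if qty:
                heapq.heappush(self._bids, (-price, next(self._seq), [qty]))
        else:
            while qty and self._bids and -self._bids[0][0] >= price:
                qty = self._fill(self._bids, -self._bids[0][0], qty)
            if qty:
                heapq.heappush(self._asks, (price, next(self._seq), [qty]))

    def _fill(self, heap, trade_price, qty):
        resting = heap[0][2]               # mutable remaining quantity
        traded = min(qty, resting[0])
        resting[0] -= traded
        if resting[0] == 0:
            heapq.heappop(heap)            # resting order fully filled
        self.trades.append((trade_price, traded))
        return qty - traded

    def best_bid(self):
        return -self._bids[0][0] if self._bids else None

    def best_ask(self):
        return self._asks[0][0] if self._asks else None

    def midpoint(self):
        if not self._bids or not self._asks:
            return None                    # empty side: no meaningful "price"
        return (self.best_bid() + self.best_ask()) / 2


book = OrderBook()
book.add_limit("buy", 99, 100)
book.add_limit("sell", 101, 100)
mid = book.midpoint()            # halfway between best bid and best ask
book.add_limit("buy", 101, 40)   # overlaps the ask, so it trades immediately
```

| Note that midpoint() returns None when either side is empty,
| which is the "what the heck is the price?" problem.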
|
| Now, if the order book is empty, what the heck is the price?
| That's what the opening auction needs to solve. The way it
| works is that people can start placing orders ahead of the
| opening bell, but they won't get matched until the open. So
| before the open, the order book is getting filled with orders,
| but crucially the _orders will overlap_. This "crossed" order
| book is a no no during normal trading, but ok before the
| opening auction. When the auction comes, a price is picked
| which maximizes the amount of orders filled (it's more nuanced
| than that, but bear with me). Imagine you pick a price in the
| overlapping region of the order book - every buy order that has
| a higher price than that will match with every sell order that
| has a price lower than that. They will get matched and executed
| at the opening price, and BAM, you have an uncrossed order
| book, full of orders.
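|
| A minimal sketch of the uncrossing step described above: try
| each order price as a candidate and keep the one that matches
| the most volume. The tie-breaking rules here (smallest
| imbalance, then lowest price) are assumptions made for the
| example; real exchanges each have their own, more nuanced
| rules.

```python
def auction_price(buys, sells):
    """Pick the single price that maximizes matched volume.

    buys and sells are lists of (limit_price, quantity).
    Returns (price, matched_quantity), or None if nothing can trade.
    """
    candidates = sorted({p for p, _ in buys} | {p for p, _ in sells})
    best = None
    for p in candidates:
        # Buyers willing to pay at least p, sellers willing to accept at most p.
        demand = sum(q for price, q in buys if price >= p)
        supply = sum(q for price, q in sells if price <= p)
        matched = min(demand, supply)
        # Assumed tie-breaks: smaller imbalance first, then lower price.
        key = (matched, -abs(demand - supply), -p)
        if best is None or key > best[0]:
            best = (key, p, matched)
    if best is None or best[2] == 0:
        return None
    return best[1], best[2]


# A crossed pre-open book: the best buy (102) is above the best sell (98).
buys = [(102, 100), (101, 200), (99, 300)]
sells = [(98, 150), (100, 100), (101, 50), (103, 200)]
opening = auction_price(buys, sells)
```

| In this example a price of 101 lets 300 shares trade on each
| side, more than any other candidate, so it becomes the open.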
|
| If the auction doesn't happen, and you just open the stock,
| then all hell breaks loose. Many things can go wrong here.
| Firms connected to the exchange may have code that assumes a
| book is not crossed (or at least not as crossed as it would be
| during an auction) causing wild behavior. The exchange itself
| could start matching orders haphazardly in the overlapping
| region, causing those "price swings" that the article talked
| about.
|
| Can't imagine the panic that day haha.
| jrochkind1 wrote:
| Very helpful and clear, thank you.
|
| > Now, as a result, there needs to be a way to set the
| opening price and closing price, like a bootstrap process. A
| smaller version of this process actually happens every time a
| stock gets halted and resumed.
|
| So this suggests that if you _did_ have a hypothetical
| exchange that ran 24/7... and something unusual happened to
| make trading halt completely (which always is going to happen
| occasionally, whether 9/11 level or more frequently)... you
| would still need to have that "bootstrap" process in place to
| re-start trading.
|
| But if you normally ran 24/7, you'd have a process that you
| maybe had never used, or hadn't used in years!
|
| This maybe provides another justification that isn't just
| historical for having exchanges shut down every day. So you
| are at least testing the bootstrap process daily, you don't
| have a bootstrap process you're going to need in an emergency
| (the worst time to have further problems) that has actually
| just been sitting around unused for years!
|
| (Reminding me of making sure you test your backup and
| continuity processes regularly, right? And the irony here is
| that it's the backup/continuity processes which are alleged
| to have caused the issue here! but still, you need the
| backup/continuity processes...)
| johnbcoughlin wrote:
| Matt Levine suggested that the chaos after opening was mainly
| due to market orders executing at ridiculous prices. Like, a
| limit buy for half the "real price" is the first buy order to
| get in the door, and that gets matched with a market sell
| order.
|
| Does that track with your understanding?
| khold_stare wrote:
| Yes! I almost forgot about market orders because trading
| firms never use market orders for this exact reason - you
| have no control over the price if things go bad. Most flash
| crashes are exacerbated by runaway market orders and stop
| orders for example.
|
| A buy market order would try to match with the "best price"
| which in a deeply crossed book would mean matching with a
| really low priced sell order. Exchanges match orders in
| price-time priority. Similar is true for a market sell
| order - would match at an extreme high price.
|
| Besides the midpoint of the order book, another metric for
| a "current price of the stock" people use, is the "last
| trade price". In the situation above you would get "swings"
| in the price because market orders would be trading very
| high and very low if they alternate between buying and
| selling. The data structure on the exchange itself isn't
| "swinging", it's just the overlapping region being slowly
| eroded by market orders. The "last trade price" metric
| looks really insane in this situation.
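|
| A toy simulation of that effect, assuming one-share orders and
| a made-up crossed book around a "real" price of roughly 100.
| The function name and the numbers are purely illustrative.

```python
def run_market_orders(bids, asks, sides):
    """Return the sequence of last-trade prices.

    bids, asks: resting limit prices, one share each.
    sides: incoming one-share market orders, 'buy' or 'sell'.
    """
    bids = sorted(bids, reverse=True)   # highest bid first
    asks = sorted(asks)                 # lowest ask first
    prints = []
    for side in sides:
        if side == "buy" and asks:
            prints.append(asks.pop(0))  # market buy lifts the lowest ask
        elif side == "sell" and bids:
            prints.append(bids.pop(0))  # market sell hits the highest bid
    return prints


# Crossed book: best bid 110 is far above best ask 90.
bids = [110, 109, 108, 100, 99]
asks = [90, 91, 92, 100, 101]
last_trades = run_market_orders(
    bids, asks, ["buy", "sell", "buy", "sell", "buy", "sell"]
)
# last_trades alternates between the two extremes: 90, 110, 91, 109, 92, 108
```

| The book itself only shrinks inward from both ends, yet the
| printed last-trade price jumps about 20 points on every trade.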
| toast0 wrote:
| FYI, there's a similar auction for closing, too. The closing
| price isn't just a race for the last trade under the buzzer;
| there's a process where at some number of minutes before close,
| you can put in orders for close or realtime, and then magic
| happens.
| xyzelement wrote:
| That's right. In short it's something like this: stocks trade
| _on their primary exchanges_ during specific hours. For example
| 9:30 to 4 in the US.
|
| Part of it is legacy from when trading was done by actual
| humans being at the exchange physically to trade during those
| times and part of it (I would guess is still the case) is to
| allow plenty of non-trading hours for back-office jobs and
| settlement.
|
| So yes there's a special start of day process that runs at 9:30
| that runs through all the orders on the books at that time and
| determines a price at which some optimal set of those orders
| can trade, trades them at that price, and also posts that price
| as the Open price for the day.
|
| The process is different during continuous trading since orders
| are one by one matched against the order book.
|
| Source: ran one of the world's largest equity platforms for 5
| years.
| ajoseps wrote:
| isn't there another component to the NYSE auction where the
| DMM has some input into what the closing/opening price
| actually is?
| khold_stare wrote:
| Yes. In my reply to the first comment I mentioned setting
| the opening price is "more complicated". Every exchange has
| their own system for the opening auction which you buy into
| when you list with a particular exchange. Most exchanges
| have an algorithmic way of calculating the price. For NYSE,
| it's again more historical. A Designated Market Maker (DMM)
| for a stock technically determines the opening price. There
| is a person physically on the NYSE trading floor who
| represents the DMM firm who technically opens the different
| stocks. They have a weird custom keyboard from NYSE for
| this purpose...
|
| The price is usually calculated algorithmically by the DMM
| firm and sent to the person at NYSE to approve. Pretty
| arcane. Also somewhat shady, as the DMM firm can be and is
| part of the auction themselves. DMM firms can analyze the
| order book to see what the imbalance is in the overlapping
| region, and place an order of their own to correct the
| imbalance and then set the opening price. I can see how one
| can profit from this in certain situations
| ajoseps wrote:
| I didn't realize the floor broker was actually involved
| with setting the opening price. I always wondered what
| the incentive was to access floor feeds for opening
| auctions
| papito wrote:
| I sweat over "idiot-proofing" the smallest systems, while multi-
| billion dollar operations don't seem to care enough.
|
| Like the S3 being blown away with a simple change in the early
| days, or GitHub running a test suite with production settings.
| It's like the FIRST thing I think about when starting a project.
|
| https://github.blog/2010-11-15-today-s-outage/
| afhammad wrote:
| It seems that this wasn't as routine as these things ought to be
| but rarely are.
| H8crilA wrote:
| Yes it could have also been a test of potential escalation from
| Solomon Islands.
| kube-system wrote:
| I spent an hour trying to figure out why my new stock purchase
| had disappeared from my account. I had an order placed for
| opening on Tuesday morning, and I guess I was affected by the
| trade cancellations. Which is totally weird, because they showed
| up in my account on Tuesday morning after opening.
| herpderperator wrote:
| If the trades are being cancelled, are they going to correct the
| chart data? Right now it looks very misleading on the daily[0],
| weekly, monthly, quarterly, yearly etc for large caps that trade
| quite steadily otherwise. I do understand that this would be a
| challenging effort as that data already flowed to and was stored
| by all the broker-dealers, but I think it should be done.
|
| [0] https://www.dropbox.com/s/6jdmgkdyei9xqz0/mcd.png?dl=0
| anonu wrote:
| Technically yes, historical market data feed needs to be
| cleaned up. Which will be a nightmare for every single person
| who maintains one...
|
| Which is also why exchanges are very reluctant to mass cancel
| trades. The knock on effect goes beyond just market data feeds
| evanpw wrote:
| The same feed that publishes trades also publishes trade busts,
| so it's up to whoever's consuming it downstream to take care
| of.
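|
| A minimal sketch of what "taking care of it downstream" could
| look like: replay the feed, drop busted trades, and rebuild the
| daily bar. The message layout ("type", "id", "price") is a
| hypothetical one invented for this example, not any real feed
| format.

```python
def daily_bar(messages):
    """Rebuild open/high/low/close from a feed, ignoring busted trades."""
    trades = {}
    for msg in messages:
        if msg["type"] == "trade":
            trades[msg["id"]] = msg["price"]
        elif msg["type"] == "bust":
            trades.pop(msg["id"], None)   # cancel the earlier trade, if seen
    prices = list(trades.values())        # dicts keep insertion (time) order
    if not prices:
        return None
    return {
        "open": prices[0],
        "high": max(prices),
        "low": min(prices),
        "close": prices[-1],
    }


feed = [
    {"type": "trade", "id": 1, "price": 250.0},   # bad print at the open
    {"type": "trade", "id": 2, "price": 268.0},
    {"type": "trade", "id": 3, "price": 266.5},
    {"type": "bust", "id": 1},                    # exchange cancels trade 1
    {"type": "trade", "id": 4, "price": 267.0},
]
bar = daily_bar(feed)
```

| Anyone who stored the bar before the bust arrived has to
| recompute it, which is the cleanup burden mentioned above.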
| ynniv wrote:
| It's easy to throw shade when Bloomberg writes an article that
| puts the blame on "a staffer". Having worked near some of these
| systems, the engineering and process are actually quite good. How
| many companies publish their private network topology, service
| p99.9 in microseconds, and detailed pricing on the open web?
| They're in a painfully competitive global market that's
| ambivalent to names on buildings.
|
| In a week or so there will be a comprehensive internal post
| mortem, and every engineer in the company will read it because
| that's why they work there. "The staffer" will not be named, nor
| will they be fired. The process will be changed. The systems will
| be changed. You probably haven't heard of Pillar, but the NYSE in
| your head was replaced by some pretty amazing, distributed, low
| latency systems. The culture is to over-engineer, over-provision,
| plan for black swans. And test. That it works. Test that it
| scales. Test that backups work. Test, test, test. _firmitatis,
| utilitatis, venustatis_. This failure was due to daily testing.
|
| Sometimes things still fail. That's true anywhere. In most places
| your failures don't make the papers, and accidents are swept
| under the rug. That doesn't happen at NYSE for obvious reasons.
| They're not building large language models (that I know of), or
| self driving cars (pretty sure on this one), but they're a
| modern, cutting edge, "soft" real-time engineering shop. If you
| haven't looked already, you might find something interesting
| there: https://www.ice.com/careers
| bob1029 wrote:
| I've thought about getting into this... The stuff they work on
| is so incredible to me.
|
| Here's a quote from their Pillar product page:
|
| > Up to a 95% Reduction in Latency: The roundtrip latency on
| NYSE Pillar order entry sessions via Pillar matching engines
| has been reduced from ~592ms to ~32ms for FIX and from ~96ms to
| ~26ms for Binary, getting client orders into the market much
| faster. With a 92% improvement in the 99th percentile latency
| results, clients can also have more confidence in improved
| performance consistency regardless of market conditions.
|
| Reading stuff like this makes my current work feel stupid by
| comparison.
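|
| A quick check of the percentages implied by the quoted
| before/after figures. The units cancel out, so only the ratios
| matter; reduction_pct is just a helper name for this example.

```python
def reduction_pct(before, after):
    """Percentage reduction going from `before` to `after`."""
    return (before - after) / before * 100


fix = reduction_pct(592, 32)     # FIX sessions
binary = reduction_pct(96, 26)   # Binary sessions
print(f"FIX: {fix:.1f}%  Binary: {binary:.1f}%")
```

| So the "up to 95%" headline corresponds to the FIX numbers
| (about 94.6%), while the Binary figures work out to about 73%.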
| davidf18 wrote:
| [dead]
| nubb wrote:
| wonder how they measure this and is this smart engineering
| from the exchange or just new fast network gear.
| DontchaKnowit wrote:
| Just an educated guess (I'm in the same industry, have
| worked on some networking related stuff) But I think it is
| probably mostly network hardware and architecture. You can
| only improve so much from the code, the networking is where
| all the latency comes from.
| Razengan wrote:
| > _Reading stuff like this makes my current work feel stupid
| by comparison._
|
| It makes our economic system seem stupid. Jesus, we're not
| calculating astrophysics or quantum mechanics. A made-up
| system should not require or depend upon this kind of speed
| or precision. Maybe we should chill.
|
| Reminds me of those pro StarCraft players who keep
| unnecessarily clicking the mouse to keep their APM (actions
| per minute) stat high.
| DontchaKnowit wrote:
| Absolutely agree. but "liquid markets are important" or
| something. _rolls eyes_
|
| This is just another step in the endless journey of
| widening the gap between your average Joe and someone with
| access to high level financial services.
| blibble wrote:
| do they still do UDP packet loss replay over email?
| BirAdam wrote:
| I worked for NYSE's parent, ICE, and I have to agree with this.
| While there were many things I didn't like about working there,
| the tech and the management weren't involved in those things. A
| similar problem to this happened while I was working there, but
| it was on Endex and not NYSE. Management spoke with the
| responsible party, but he wasn't fired and no punitive actions
| were taken against him. The blame game also wasn't played. The
| team just decided to provide more eyes on the process, change
| the interface of the tools a bit, and move on. The company
| itself did face hefty fines for the screw up tho. Ultimately,
| the issues at ICE/NYSE are due to a highly bureaucratic
| structure and to onerous regulations forcing parts of that
| structure to exist. Given those two problems, I think ICE does
| extremely well.
| jacquesm wrote:
| Just naming a 'staffer' though seems to already be a way to
| apportion blame to a segment of the employees, insulating
| management from what was done. Named or not doesn't really
| matter, clearly blame is being assigned.
| bostonsre wrote:
| Yea, it sounds like an issue with process and automation. It
| shouldn't have been possible for the staffer to make a
| mistake that would cause this.
| jacquesm wrote:
| Precisely. It is never just one error. At a minimum two and
| if you really stare at this sort of thing long enough it
| isn't rare at all to discover a whole chain of them. The
| only difference with all the times that it went right is
| that this time everything was aligned 'just so'.
| ynniv wrote:
| I think those are Bloomberg's words, or their paraphrasing of
| the grapevine. The high level people that I knew there
| weren't petty, and everyone was of the opinion that it didn't
| matter who clicked the button: we were all in the same boat.
| ngz00 wrote:
| I worked there and I can say that this is not accurate at all.
| It is very much a blame culture. I've seen people fired for
| less severe incidents. Beyond the core technology of the Pillar
| engine, the place is not comparable to a modern tech company in
| almost any way.
| throwaway122095 wrote:
| As somebody who worked with them as a client, I can confirm
| this. There is currently a spec-level bug with their core
| Pillar engine and it was essentially bounced between several
| different teams and ultimately ignored as nobody's problem.
| mynameisvlad wrote:
| So basically like any other medium to large company? This
| doesn't sound unique in the slightest.
| uoaei wrote:
| I would think that _the company being a securities
| exchange_ would factor into the analysis. Don't you?
| mynameisvlad wrote:
| How does them being a securities exchange in any way
| affect the analysis of their software engineering
| practices? They're not some special snowflake, they can
| suffer the same software engineering and business process
| issues as other companies.
| sandworm101 wrote:
| >> They're not some special snowflake
|
| But they are. The consequence of a one-day or one-hour
| shutdown on their system is exponentially worse than most
| any other. I would expect them to have more rigorous
| systems, including more rigorous attention to
| development. Comparing the NYSE to any other business is
| like calling Fort Knox just like any other bank vault.
| LarryMullins wrote:
| > _like calling Fort Knox just like any other bank
| vault._
|
| Main difference being that most bank vaults aren't
| actually empty. ;)
| mynameisvlad wrote:
| No company or organization is immune to bad business
| practices.
|
| Them being a securities exchange does not somehow provide
| immunity from developing rigorous systems which have
| oversights, or make bureaucracy magically go away.
|
| Likewise, the impact of an outage being more extreme does
| not mean the people there are infallible. Things slip
| through. Especially random customer requests being
| bounced around from team to team, the thing in question.
| shanebellone wrote:
| I disagreed with you until: "...like calling Fort Knox
| just like any other bank vault."
|
| Interesting point that teeters on false equivalence. I
| think AWS or Azure might make for a better analogy. Your
| point identifies the inherent risk of actually operating
| a platform business. A bank vault is (mostly) synonymous
| with Cloud, in this context. If a vault is robbed or a
| cloud goes offline, losses extend beyond the business
| which inherently compounds the severity of downtime.
|
| Linear loss vs. parabolic loss.
| btown wrote:
| But if a cloud goes offline, there is damage to the
| economy linear to the length and breadth of the outage.
| Sure, there are losses to businesses serviced by the
| cloud's users, but they'll bounce back, even if a day-
| long outage was so severe as to temporarily ground
| flights and halt supply chains.
|
| If a stock exchange executes trades at incorrect prices,
| even for a short amount of time, all of a sudden you're
| in a kind of non-linear sigmoid regime, where investor
| confidence can suddenly tip into panic selling and
| recessions can be triggered. Thankfully, that didn't
| happen here, but it could have. If you're going to give a
| company that power, you should better hope that they're
| held to higher standards than most dysfunctional tech
| organizations!
| shanebellone wrote:
| "If a stock exchange executes trades at incorrect prices,
| even for a short amount of time, all of a sudden you're
| in a kind of non-linear sigmoid regime, where investor
| confidence can suddenly tip into panic selling and
| recessions can be triggered."
|
| This is false equivalence and slippery slope.
| blantonl wrote:
| No they aren't.
|
| There's far more critical snowflakes out there... FAA
| Airspace management, a medical radiation device, avionics
| in an aircraft, and facebook.
| ynniv wrote:
| Unlike all of the "modern tech company" problems which are
| never ignored and only solved when someone's problem goes
| viral on social media.
|
| They're a big company, some groups are better than others,
| some customers get more attention than others.
| galangalalgol wrote:
| Blame cultures and process cultures are both problems in
| different ways. Blame cultures don't care about individual
| accountability, only that someone suffers. Process cultures
| only care that no one suffers, not that individuals are
| accountable. Both have some misguided notion that something
| other than personal accountability can lead to good results.
| Misattributed blame and suffering does not deter poor
| performance or mistakes. Not even correctly aimed punishments
| are very good at that. Accountability isn't about punishment,
| it is about limiting power to the level of responsibility
| demonstrated. Rules and procedures don't prevent poor
| performance, they can in fact entrench and guard it, and they
| only mildly impact mistakes. Best practice can mitigate
| mistakes to the same extent or better (due to easier
| adaptability), but people keep trying to turn them into
| rules, and that has to be fought. If you followed all the
| rules but didn't get the job done, you still shouldn't be
| handed the same task again, but not out of blame.
| davidf18 wrote:
| [dead]
| ynniv wrote:
| Having been in the industry for a couple decades, and having
| worked at both, they're not all that different. Some groups
| are going to be better than others in the same company. Some
| companies are floating on venture money today, and might
| disappear tomorrow. Most technologies constantly cycle. Our
| experiences working at the same company were different.
| Johnny555 wrote:
| I used to work for a small startup, and postmortems were
| truly no blame - engineers would talk about exactly what
| happened and wouldn't hesitate to put the blame on their
| mistakes.
|
| But as the company grew, the postmortems became more about
| blame since now you're not blaming an engineer, but an entire
| team so singling them out isn't personal. The postmortems
| were no longer a single engineer describing what happened in
| his code, but were team leads talking on behalf of teams.
| They were all about shifting blame from your own team and
| talking about why a service from another team led to the
| problem, even if your team could have (and should have) been
| able to work around it without melting down.
|
| I'm no longer at the company, but postmortems are much more
| useful when they really are no-blame because you can get to
| the real root of the problem, but I don't know if that's
| possible in a large company.
| SoftTalker wrote:
| As organizations become larger they become more political.
| It's unavoidable.
| hackernewds wrote:
| Curious why the link to ICE.COM?
| hadlock wrote:
| ICE owns NYSE and several other exchanges. ICE stands for
| Intercontinental Exchange. ICE is the IT administrator for
| these exchanges. I helped sell some router management
| software to them a while ago. ICE is fairly new, NYSE used to
| be independent. That changed sometime after 2008.
| anonred wrote:
| Serious question: If someone is smart and capable enough to
| work on tangible things like AI systems or self-driving cars,
| why should they choose the NYSE outside of pure monetary
| reasons or affinity for a "modern" tech stack?
| barneygale wrote:
| You answered your own question. The only motivation is greed.
| DontchaKnowit wrote:
| Taking a well paying job is greedy? What planet are you
| living on?
| misja111 wrote:
| You're not making it easy to get any answers, if you cut out
| 2 of the main reasons for people in general to change jobs.
| anonred wrote:
| I consider these things table stakes when choosing a job.
| ynniv wrote:
| I'm not recruiting for them, just sharing my experience. I
| included their careers link for people who might be
| interested because I know they're always looking for good
| engineers.
| godshatter wrote:
| Wouldn't working with systems that keep the largest stock
| exchange for the largest economy in the world running where a
| simple mistake can cause "mayhem" when the market opens be
| considered more "tangible" than working in AI or on self-
| driving cars? It just doesn't have as much street cred as
| working on those particular projects in the tech community.
| dylan604 wrote:
| Not necessarily. If you're the type that's into finance,
| then sure, that might get you out of bed in the morning.
| I'm not into finance and can't stand the culture that
| surrounds finance. Yes, it's big and touches every single
| one of us, but doesn't mean I want to embrace it and go to
| work in it every day.
|
| If I can take that same skill set and apply it to something
| with a much better culture surrounding it that affects
| people in a positive way, then I would definitely choose
| that over finance any day of the week and twice on Sunday.
|
| At the end of the day, if the NYSE did not exist, the world
| would continue to turn. It's just not that big of a deal to
| a heck of a lot of people.
| yibg wrote:
| Will the world stop turning if people stopped working on
| self driving cars or AI?
| dylan604 wrote:
| i'm guessing you're trying to make a point here, but care
| to elaborate on what it is? i think you well know the
| answer to the question
| idiotsecant wrote:
| >if the NYSE did not exist, the world would continue to
| turn
|
| This is startlingly ignorant of the complex machine that
| is the modern economic system. If something like the NYSE
| was to shut down today it would be pandemonium.
|
| There is a difference between 'I don't understand how
| something works' and 'I don't understand how something
| works, so it is worthless'. The former is healthy and the
| first step to understanding, the latter is ignorant, and
| the first step to getting more ignorant.
| barbishkoolaid wrote:
| Relax, Ayn Rand cum True Believer complex.
|
| The current state of business within the current
| iteration of how people interact with one another isn't
| some necessity.
|
| Yes, the world may fall apart for a relatively brief
| moment in the grand scheme of things -- but then life
| will go on.
|
| The first step to understanding this is to drop the
| superiority complex.
|
| Very little is actually needed to keep the world turning.
| Razengan wrote:
| Exactly. Our current implementation of resource
| rationing isn't some fundamental of reality, or even
| needed by human society just 1-2 centuries ago.
| DontchaKnowit wrote:
| I actually think both you and the guy you are arguing
| with are half correct.
|
| The real answer here, in my opinion, is that yes there
| would be pandemonium, and then yes, the world would go on
| without it, but then something else just like it will pop
| up. And that is because a liquid market for financial
| assets (whether that is securities, options on
| securities, futures, etc) will always be a massive
| benefit to the ability of businesses to conduct business,
| and the ability of individuals to preserve and increase
| wealth.
| godshatter wrote:
| I was remarking on the key word "tangible", not trying to
| express an opinion one way or the other on financial
| institutions. Accidentally forgetting to do something and
| ending up in the news because you caused havoc when the
| markets opened the next morning is more "tangible" (able
| to touch things directly) than working on AI or self-
| driving cars, at least currently. Certainly working in
| either of those fields might provide more benefits down
| the line.
| kasey_junk wrote:
| All of my "culture" experiences working in finance were
| uniformly better than pure tech.
|
| The movie portrayals don't match my experiences at all
| and I saw a lot more bad behavior in the tech companies I
| worked for.
|
| Heck I saw more people working for the intellectual
| challenge of it in trading than I did in SV style tech
| firms where money drove nearly every decision.
|
| It's really hard for me to buy that SV style tech
| companies are a better place to work when for the last 2
| decades the business models that have been front and
| center are panopticon style tracking to sell ads and
| legal arbitrage.
| dylan604 wrote:
| Oh, don't get me wrong. I pretty much abhor SV/VC culture
| too. It's why I don't have one inkling of a notion to
| work on either coast for the "big" corps.
|
| It's not an either or, I can hate both ;-) I'm a big boy
| and get to make up my own mind on the matter.
| JBlue42 wrote:
| >If someone is smart and capable enough to work on tangible
| things like AI systems or self-driving cars
|
| Maybe they aren't as smart as they think they are? Or they
| find that there are interesting problems to solve in fintech?
| Problems they can tackle and see resolved in a realistic time
| frame vs 'tangible' (?) self-driving cars or chat bots.
|
| I know AI encompasses a far larger range of things but right
| now, what problems is it solving? Artists, writers, and
| others can do that work. What do self-driving cars resolve
| beyond continuing the dominance of car culture in a world
| that could have better public transit and safer
| infrastructure?
| DontchaKnowit wrote:
| Serious Answer :
|
| There are problems in Fintech that are absolutely worth
| solving for altruistic reasons. One that I think is very
| important and might even need to incorporate AI is this :
|
| Larger financial institutions have access to WAY more and WAY
| higher quality data surrounding stocks and options. For
| example, publicly available SEC filings contain extremely
| useful information about companies. Professional traders have
| access to services which provide this data accurately in
| programmatic form (like an API). Us normal people have only
| the SEC filings themselves, which are enormous documents. It
| would be impossible to read them fast enough to ever catch up
| on all of them in the last say year. There are free APIs, but
| they are absolute dogshit and provide incomplete and
| inaccurate information.
|
| If someone could democratize this and provide this info for
| free or cheap to the public, it would be an enormous benefit
| to the general public.
| hinkley wrote:
| There's also geography. You go far enough East and you have
| mostly public sector or defense jobs. A little smattering of
| insurance data processing. And then fintech.
|
| illinois.edu has one of the top rated CS programs, but once
| you graduate there are not a lot of options but to move to
| one of the coasts, or move back/to Chicago and try your luck
| there. Second City has a good deal of fintech.
| helsinkiandrew wrote:
| https://archive.ph/UoMr9
| WiSaGaN wrote:
| A lot of stocks have minute bars with a wide range of prices in
| the first few minutes of continuous trading. This looks like what
| happened in the Knight Capital fiasco, in which the system
| repeatedly bought on the ask and sold on the bid very fast, thus
| pushing the high price higher and the low price lower. At the open,
| market makers are usually more wary of risk they cannot hedge
| directly and thus less willing to take on positions, leading to
| wild swings.
|
| Still, this report (and the previous statement) does not give
| enough detail on why misoperation of a backup system resulted in
| this. Also, critical large systems like exchanges rarely have a
| single point of failure. Usually there is a sequence of issues
| along the event chain leading to this. So the claim that one
| "failed to properly shut down" caused all this is a bit
| incredible. We will need more explanation.
| itronitron wrote:
| Doesn't each trade require both a buyer and a seller, who buy
| and sell at an agreed on price? Presumably both parties would
| be satisfied with the trade so it isn't clear to me what all
| the fuss is about.
| toast0 wrote:
| If I had put in a market buy/sell on open order, I'm
| accepting the market price, but expecting the market price to
| be set by an opening auction. I don't know if a market on
| open order would have been cancelled or just executed shortly
| after the bell; you could argue for both treatments and it
| usually doesn't come up, so it might not be mentioned in
| retail brokerage documentation.
|
| Personally, I always do limit orders, but I would consider
| market on open/close as reasonable options. But I don't think
| this is typical, a lot of orders are market orders against
| whatever limit order is at the top of the book. Normally,
| that's ok, but it gets weird when things get weird, as seen
| here.
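|
| A rough sketch of why skipping the auction matters, in Python. The
| numbers, `auction_price` and `market_buy` are made up for
| illustration, and the max-volume rule is a simplification, not
| NYSE's actual opening auction logic.

```python
def auction_price(bids, asks):
    """Pick the single price that maximises matched volume.

    bids/asks are lists of (limit_price, quantity). Ties are broken by
    taking the lowest candidate price, which is a simplification.
    """
    candidates = sorted({p for p, _ in bids} | {p for p, _ in asks})
    best_price, best_volume = None, 0
    for price in candidates:
        demand = sum(q for p, q in bids if p >= price)
        supply = sum(q for p, q in asks if p <= price)
        volume = min(demand, supply)
        if volume > best_volume:
            best_price, best_volume = price, volume
    return best_price, best_volume


def market_buy(asks, quantity):
    """Fill a market buy by walking resting asks from cheapest upward."""
    fills = []
    for price, available in sorted(asks):
        if quantity == 0:
            break
        take = min(quantity, available)
        fills.append((price, take))
        quantity -= take
    return fills


# Hypothetical interest that would normally meet in the opening auction.
bids = [(100, 500), (99, 300), (90, 200)]
asks = [(98, 400), (100, 300), (120, 100)]
open_price, open_volume = auction_price(bids, asks)

# With no auction, assume only a thin slice of asks is resting at the bell.
thin_asks = [(100, 50), (120, 100)]
fills = market_buy(thin_asks, 150)
average_fill = sum(p * q for p, q in fills) / sum(q for _, q in fills)
```

| Here the auction would clear 500 shares at 100, while a 150-share
| market buy against the thin book averages about 113.33.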
| pcl wrote:
| Market price trades are executed at whatever the current
| price is. Presumably it's those trades that caused havoc.
| gpderetta wrote:
| I haven't really researched anything, but both parties
| thought they were bidding/offering into an auction with time to
| cancel or amend their orders.
| WiSaGaN wrote:
| I don't know what happened in the current case. In the Knight
| Capital case, KC clearly didn't intend to send those
| erroneous orders. And if those trades were not annulled, KC
| would not be able to settle them, since the trade losses were
| larger than the collateral KC put up.
| evanpw wrote:
| The trades were not annulled, because NYSE ruled them not
| "clearly erroneous". Which is why it was an existential
| mistake, not just an embarrassing one.
| anonu wrote:
| The Knight Capital issue was a test flag in the code that caused
| orders to multiply.
|
| From what I'm reading, this NYSE error seems a bit more complex
| where the presence of a backup system confused the current
| market state to skip the open auction.
| throwawaaarrgh wrote:
| There is so much stupidity in the process they describe, I have
| no faith it will be fixed. a manual daily DR test that clearly
| wasn't followed by a test or checklist or double checked by
| another person, _and_ leaving the DR up broke prod?? literally
| none of those things should have happened.
|
| I know the world is held together with duct tape, but it's
| embarrassing when you see the tape fall off.
| tgtweak wrote:
| The way their DR is set up is that clients of NYSE (brokerages,
| OTC systems, firms, banks) all have IP (not dns) connections to
| the primary NYSE production datacenter and a full second set of
| IPs for the DR site. It's not a "dns and load balancers" setup
| where the service itself can just route the traffic somewhere
| else. The clients themselves determine where to connect to
| consume trade data and execute trades. There is likely some
| modus operandi given to clients on how to connect to primary
| and DR sites based on some specific logic.
|
| The NYSE DR guide [1] says that if DR is active, production is
| not. It's not a distant reach to consider that some of these
| clients have a deadman switch doing a healthcheck poll on DR
| and switching to it when it sees that it is "up". If they've
| built their systems in such a way that when it detects the DR
| site active it uses that, then it makes sense that having both
| "online" would cause some havoc. I'm sure the complexity of the
| entire exchange is fairly significant, and having "two" copies
| of it running in parallel with both able to accept and execute
| trades would be a scenario that can cause some unintended
| consequences. Fundamentally, an exchange is "atomic" and
| transactional and cannot be meaningfully distributed to two
| sites that are that far away. The replication in place is
| likely master/slave with a switch to make the slave primary.
| Anyone who has toyed with master-master replication on less
| complicated databases knows the issues that can come up with
| split writes. Imagine that at the scale of a system as large as
| the NYSE.
|
| [1]
| https://www.nyse.com/publicdocs/support/DisasterRecoveryFAQs...
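|
| To make the split-write worry concrete, here is a minimal sketch in
| Python. Everything in it is an assumption for illustration: the
| `Site` class, the `pick_site` rule ("if DR answers, use DR") and the
| client names are hypothetical, not NYSE's actual failover logic.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """One hypothetical exchange site with its own independent order log."""
    name: str
    up: bool = False
    orders: list = field(default_factory=list)

    def accept(self, order: str) -> None:
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.orders.append(order)


def pick_site(primary: Site, dr: Site) -> Site:
    """Assumed client rule: 'DR is up' is read as 'production is down'."""
    return dr if dr.up else primary


def route(orders, primary, dr, dr_aware_clients):
    """Send each (client, order) pair to the site that client would pick."""
    for client, order in orders:
        if client in dr_aware_clients:
            site = pick_site(primary, dr)
        else:
            site = primary  # this client never polls the DR site
        site.accept(order)


primary = Site("primary", up=True)
dr = Site("dr", up=True)  # the backup was left running

flow = [("A", "buy 100 XYZ"), ("B", "sell 100 XYZ"), ("C", "buy 50 XYZ")]
route(flow, primary, dr, dr_aware_clients={"B"})

# Both sites now hold orders the other one has never seen.
split_brain = bool(primary.orders) and bool(dr.orders)
```

| With both sites up, client B's sell lands on the DR site while the
| buys stay on primary, so orders that should have met never see each
| other.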
| hksoftware wrote:
| [flagged]
| [deleted]
| ideamotor wrote:
| You suspect a breach?
| avree wrote:
| He is, undoubtedly, a meme-stock conspiracy theorist. Only
| those steeped in the cult of AMC, GME, or BBBY say things
| like that.
| [deleted]
| shapefrog wrote:
| I am curious what your "independent research" has turned up on
| the subject.
| PaulHoule wrote:
| See https://www.henricodolfing.com/2019/06/project-failure-
| case-...
| hknmtt wrote:
| there is no way a person, a single person, an unauthorized
| person, can have access into such system/functionality like this.
| utter BS.
| jterrys wrote:
| my 2c:
|
| In reality what probably happened is previous market day and
| post-trading data encountered some kind of error, which
| triggered a cascade of problems overnight that they were unable
| to properly rectify. This caused delays up until market open.
| They were unable to fully resolve the issue, and faced with
| either delaying opening the market (which is a HUGE no-no) or
| opening with wrong data as is, they chose wrong data.
|
| All in all a lot of people didn't get much sleep Monday. More
| than likely they implemented some changes or updates over the
| weekend that were not properly done, or they encountered some
| errors, and didn't have adequate controls/time to roll-back
| Monday night. They made the right calls too late and there was
| a controls process up the chain that seriously fucked up. These
| are the kinds of problems that get the CEO woken up in the
| middle of the night.
| lvl102 wrote:
| Remember the flash crash of 2015? They let those trades actually
| STAND. Including options. This week's open was nothing in
| comparison.
| detaro wrote:
| That wasn't an exchange "malfunction" in the sense that the
| exchange did not do what it was supposed to, was it?
| lvl102 wrote:
| Do you really think they will take responsibility for
| billions lost/made that day? No, that's a big liability.
| Anyone trading that day knew it was a big glitch at open.
| Some names, BLUE CHIPS, were down 40-50%! What shocked us all
| is that they actually allowed the trades to stand. What was
| different about this week was that the SEC actually tried to
| do their jobs for once and the exchange had to address it, i.e.
| come up with a bullshit excuse.
| spywaregorilla wrote:
| So... what was wrong? Why should those people not have been
| allowed to make and lose money?
| lordnacho wrote:
| Backup system connected to prod, that somehow reminds me of the
| Knight Trading debacle. Someone there apparently connected some
| test code to their prod and they blew up the company in under an
| hour.
| laurencei wrote:
| I mean - isn't that how all DR is essentially configured? You
| need it "somehow" connected to Prod depending on the failover,
| config, system etc. And in many of these complex systems DR can
| be on a subsystem etc - not an "all or nothing" approach?
| kmac_ wrote:
| Weak blame game.
| pedro2 wrote:
| Successes are management's, failures are individual's :)
| chiefalchemist wrote:
| True. But when that happens, that's a textbook sign of lack of
| leadership.
| jonpo wrote:
| It's nearly always the damn humans. I find it awful that we are
| often assigning blame to some "technology error" when it's the
| damn humans pulling the strings all the time. all those times the
| market shut accidentally. or the market opens at the wrong time.
| or that time someone accidentally deletes all the GTC orders in
| order to "save some disk space". that time someone tests opening
| the market at the weekend and puts in the wrong date. Sometimes
| we are just trying to test that the things work and so we take
| awful risks like adding test orders, or failing over to test that
| backup versions of the trading infrastructure still work. All
| these things add human execution risk.
|
| That said, I find the US market structure is unfair. Charles Schwab
| does protest too much. Retail orders never seem to get near the
| central order book. there is no direct market access. brokers
| just sell your order to whomever MM pays them for the spread in
| return for a kickback. this should be a fantastic fair
| multiplayer game, but instead it's pay to win mobile crap with
| vested interests milking their customers.
___________________________________________________________________
(page generated 2023-01-26 23:01 UTC)