[HN Gopher] Knightmare: A DevOps Cautionary Tale (2014)
___________________________________________________________________
Knightmare: A DevOps Cautionary Tale (2014)
Author : sathishmanohar
Score : 218 points
Date : 2023-09-10 20:07 UTC (2 hours ago)
(HTM) web link (dougseven.com)
(TXT) w3m dump (dougseven.com)
| codeulike wrote:
| When I got to the memorable words "Power Peg" I remembered I'd
| heard all about this before.
| xyst wrote:
| Having worked in some Fortune 500 financial firms and low rent
| "fintech" upstarts, I am not surprised this happened. Decades of
| bandaid fixes, years of rotating out different
| consultants/contractors, and software rot. Plus years of
| emphasizing mid level management over software quality.
|
| As others have mentioned, I don't think "automation of deployment"
| would have prevented this company's inevitable downfall. If it
| hadn't been this one incident in 2012, it would have been another
| incident later on.
| hinkley wrote:
| It's an entire industry built on adrenaline, bravado, and let's
| be honest: testosterone. How could their IT discipline be
| described as anything other than "YOLO"?
|
| Trading is mostly based on a book that, like the waterfall
| model, was meant to be a cautionary tale on how _not_ to do
| things. Liars' Poker had the exact opposite effect of Silent
| Spring. Imagine if Rachel Carson's book had come out and people
| had decided that a career in pesticides was more glamorous than
| being a doctor or a lawyer, we had made movies glorifying
| spraying pesticides everywhere and on everything, and had told
| anyone who thought you were crazy that they're a jealous loser
| and to fuck off.
| jonplackett wrote:
| The thing I was surprised about is that they survived!
|
| After all that they still got a $400 million cash bailout!
| m_0x wrote:
| It wasn't a bailout, it was investment money.
|
| Bailout could imply the government throwing a lifeline
| kaashif wrote:
| > It wasn't a bailout, it was investment money.
|
| Almost all bailouts are investments, whether it's a
| government or private bailout.
|
| The investments are questionable sometimes, but they're
| still investments.
| leoqa wrote:
| The nuance is a) what happens to existing equity
| stakeholders and b) does the bailout have to be repaid.
|
| If the answer is nothing and no, then it's a bailout
| philosophically. If the existing investors get diluted
| then they're in part paying for the new capital
| injection.
| [deleted]
| ben-gy wrote:
| It wasn't a bail out, it was an opportunity for investors to
| get great terms on equity at a crucial juncture for the
| company.
| swores wrote:
| A government bail out isn't the exclusive use of the phrase
| "bail out", it was both a bail out and an opportunity for
| investors to get great terms on equity.
| xcv123 wrote:
| They were probably worth much more than $400M before the
| failure so it was a good investment opportunity. They would
| have been a money printing machine aside from this one major
| fuckup.
| nly wrote:
| Their IP (proprietary trading algos) etc were probably
| worth a lot at the time.
|
| These days probably not so. I wouldn't imagine there are
| any market makers left in NYSE and NASDAQ who aren't
| deploying FPGAs to gain a speed edge.
| [deleted]
| hedora wrote:
| No continuous deployment system I have worked with would have
| blocked this particular bug.
|
| They were in a situation where they were incrementally rolling
| out, but the code had a logic bug where the failure of one
| install within an incremental rollout step bankrupted the
| company.
|
| I'd guard against this with runtime checks that the software
| version (e.g. git sha) matches, and also add fault injection into
| tests that invoke the software rollout infrastructure.
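The runtime check described above can be sketched roughly as follows. Everything here is invented for illustration (the host names, the `DEPLOYED` map standing in for a real version-reporting endpoint, and the function names); it only shows the shape of the idea: gate the rollout on every server reporting the expected git SHA.

```python
EXPECTED_SHA = "a1b2c3d"

# Stub of per-host version reporting; a real check would query each
# server (e.g. a hypothetical /version endpoint) for its running SHA.
DEPLOYED = {"srv1": "a1b2c3d", "srv2": "a1b2c3d", "srv3": "deadbee"}

def get_deployed_sha(host):
    return DEPLOYED.get(host)

def fleet_is_consistent(hosts, expected=EXPECTED_SHA):
    """Return False if any host runs a version other than `expected`."""
    mismatched = [h for h in hosts if get_deployed_sha(h) != expected]
    if mismatched:
        print(f"ABORT rollout: stale version on {mismatched}")
        return False
    return True

# One straggler (like Knight's eighth server) fails the gate for
# the whole fleet: srv3 is stale, so this returns False.
print(fleet_is_consistent(list(DEPLOYED)))
```

The point is that the new code path is never enabled until the fleet-wide check passes, which is exactly the failure mode a single forgotten server creates.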
| hinkley wrote:
| > $400M in assets to bankrupt
|
| Was this Knight Capital?
|
| > Knight Capital Group
|
| Yep. Practically the canonical case study in deployment errors.
| pavas wrote:
| My team's systems play a critical role for several $100M of sales
| per day, such that if our systems go down for long enough, these
| sales will be lost. Long enough means at least several hours and
| in this time frame we can get things back to a good state, often
| without much external impact.
|
| We too have manual processes in place, but for any manual process
| we document the rollback steps (before starting) and monitor the
| deployment. We also separate deployment of code with deployment
| of features (which is done gradually behind feature flags). We
| insist that any new features (or modification of code) requires a
| new feature flag; while this is painful and slow, it has helped
| us avoid risky situations and panic and alleviated our ops and
| on-call burden considerably.
|
| For something to go horribly wrong, it would have to fail many
| "filters" of defects: 1. code review--accidentally introducing a
| behavioral change without a feature flag (this can happen, e.g.
| updating dependencies), 2. manual and devo testing (which is hit
| or miss), 3. something in our deployment fails (luckily this is
| mostly automated, though as with all distributed systems there
| are edge cases), 4. Rollback fails or is done incorrectly. 5.
| Missing monitoring to alert us that the issue still hasn't been
| fixed. 6. Fail to escalate the issue in time to higher levels. 7.
| Enough time passes that we miss out on the ability to meet our
| SLA, etc.
|
| For any riskier manual changes we can also require two people to
| make the change (one points out what's being changed over a video
| call, the other verifies).
|
| If you're dealing with a system where your SLA is in minutes, and
| changes are irreversible, you need to know how to practically
| monitor and rollback within minutes, and if you're doing
| something new and manually, you need to quadruple check
| everything and have someone else watching you make the change, or
| it's only a matter of time before enough things go wrong in a row
| and you can't fix it. It doesn't matter how good or smart you
| are, mistakes will always happen when people have to manually
| make or initiate a change, and that chance of making mistakes
| needs to be built into your change management process.
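The separation of code deployment from feature release described above can be sketched as flag-gated dispatch. The flag store and names below are invented for illustration, not any particular feature-flag library; the point is that the new code ships dark, ramps gradually, and rolls back with a flag flip rather than a redeploy.

```python
import hashlib

# Percent of traffic on the new path; deployed code starts dark at 0%.
FLAGS = {"new_pricing_path": 0}

def is_enabled(flag, user_id):
    """Deterministically bucket a user into [0, 100) and compare to ramp %."""
    pct = FLAGS.get(flag, 0)
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < pct

def handle_order(user_id):
    if is_enabled("new_pricing_path", user_id):
        return "new"
    return "old"   # deployed code still defaults to the proven path

# Deploy everywhere at 0%, then ramp gradually; rollback is a flag
# flip, not a redeploy.
FLAGS["new_pricing_path"] = 0
assert handle_order("alice") == "old"
```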
| coldtea wrote:
| > _My team's systems play a critical role for several $100M of
| sales per day, such that if our systems go down for long
| enough, these sales will be lost._
|
| Would they? Or would they just happen later? In a lot of cases
| in regular commerce, or even B2B, the same sales can often be
| attempted again by the client a little later; it's not "now
| or never". As a user I have retried things I wanted to buy when
| a vendor was down (usually because of a new announcement and
| big demand breaking their servers) or when my bank had some
| maintenance issue, and so on.
| yardstick wrote:
| It would be a serious issue for in-person transactions like
| shops, supermarkets, gas stations, etc.
|
| Imagine Walmart or Costco or Chevron centralised payment
| services went down for 30+ mins. You would get a lot of lost
| sales from those who don't carry enough cash to cover it
| otherwise. Maybe a retailer might have a zapzap machine but
| lots of cards aren't imprinted these days so that's a non
| starter too.
| kaashif wrote:
| > Maybe a retailer might have a zapzap machine but lots of
| cards aren't imprinted these days so that's a non starter
| too.
|
| When I Google "zapzap machine" this comment is the only
| result, but after looking around on Wikipedia, I see this
| is a typo for "zipzap".
|
| Is this really the only time in history someone has typoed
| zipzap as zapzap? I guess so.
| pavas wrote:
| It's both (though I would lean towards lost for a majority of
| them). It's also true that the longer the outage, the greater
| the impact, and you have to take into account knock-on
| effects such as loss of customer trust. Since these are
| elastic consumer goods, and ours isn't the only marketplace,
| customers have choice. Customers will typically compare
| price, then speed.
|
| It's also probably true that a one-day outage would have a
| negative net present value (taking into account all future
| sales) far exceeding the daily loss in sales, due to loss of
| customer goodwill.
| reedf1 wrote:
| Gotta imagine the sinking feeling that guy felt.
| thorum wrote:
| Honestly seems like the market itself should have safeguards
| against this kind of thing.
| alpark3 wrote:
| It's interesting to note that exchanges are adding "obvious
| error" rules that can slash trades under certain circumstances.
| SoftTalker wrote:
| The safeguard is "you go bankrupt if you fuck up"
|
| Imagine there was some way for a trading company to execute
| billions of dollars of trades and then say "oops, sorry, that
| was all a mistake". Can you not see how that would be abused?
|
| Now, the story also says that within a minute of the market
| opening, the experienced traders knew something was wrong. Do
| they bear any culpability for jumping on those trades, making
| their money off of something they knew couldn't be intentional?
| tomp wrote:
| Ah, Knight Capital. The warning story for every quant trader /
| engineer.
|
| This is what people don't realize when they say HFT (high
| frequency trading) is risk-free, leeching off people, etc.
|
| You make a million every day with very little volatility (the
| traditional way of quantifying "risk" in finance) but one little
| mistake, and you're gone. The technical term is "picking up
| pennies in front of a steamroller (train)". Selling options is
| also like that.
| [deleted]
| loa_in_ wrote:
| I don't see how anything here goes against it leeching off
| people
| dasil003 wrote:
| Depends on whether they truly take on the risk. Interestingly
| I can't clearly tell from a quick google who exactly ended up
| holding the bag here, and what became of upper management.
| Nevermark wrote:
| Leeching implies someone has found a way to skim value from
| you without providing value.
|
| Someone taking on loads of risk to carry out your commands
| efficiently is providing value.
|
| You can argue whether they are doing so competently or not,
| or whether they are pricing optimally or not, but they are
| not just "takers" or "leeches".
| Guvante wrote:
| What risk are they taking exactly? Bugs ruining the
| business isn't meaningful risk for the customer. It isn't
| like day traders are at risk of going bankrupt due to that
| after all.
|
| They claim liquidity is their value but given how they act
| they don't seem to be providing measurable liquidity,
| either in terms of price or volume. (Yes they increase
| volume by getting in the middle of trades but that isn't
| useful volume...)
| hackerlight wrote:
| Market risk isn't the only type of risk. Many businesses
| in other industries don't have market risk, that isn't
| abnormal. Even businesses that you would expect to be
| exposed to market risk aren't, since they hedge most or
| all of it.
|
| There's operational risk, like what brought down Knight
| Capital, that's a type of risk. Or the risk that you will
| be put out of business by competition because you were
| too slow to innovate while burning through all your cash
| runway. HFT firms face the same risks that other types of
| businesses face. Smaller HFT firms fail often, and larger
| firms tend to stay around (although sometimes they also
| fail and often they shrink), which is similar to many
| mature competitive industries.
|
| > given how they act they don't seem to be providing
| measurable liquidity
|
| I'm not sure "How they act" should inform one's
| perspective on the empirical question of whether or not
| they are adding to liquidity. There is a lot of serious
| debate and research that has gone into that question.
| erik_seaberg wrote:
| If a seller and a buyer are in market within seconds of
| each other, they would have traded successfully without a
| third party taking some of their money. As I understand it,
| HFTs are trying to _avoid_ taking meaningful long-term
| positions (which is why latency matters only to them).
| UncleMeat wrote:
| How are they taking on loads of risk? Risk has a particular
| meaning in investment and "well, a bug can blow up my
| company" isn't part of that meaning.
|
| Simply creating risky (in the colloquial meaning) things is
| not itself a reason to deserve money.
| siftrics wrote:
| When you make markets you are literally paid the spread
| to assume the risk of holding the position.
| envsubst wrote:
| > Risk has a particular meaning in investment
|
| Risk in finance definitely takes on more meaning than the
| narrow definition in modern portfolio theory (stddev of
| price).
| jraph wrote:
| What value do they provide?
| callalex wrote:
| They claim to provide liquidity, even though they are
| just front-running trades that are already happening
| anyway.
| alpark3 wrote:
| Most people confuse market making/risk holding with high
| frequency statistical arbitrage strategies. I'm not totally
| sure exactly what Knight Capital was running, but generally the
| only "little" mistakes that would blow up HFT market takers
| such as Jump (for the most part) are egregious technical errors
| like this, or violations of assumptions outside of market
| conditions (legal, structural, etc.). Compare this to market
| makers like Jane Street, who hold
| market risk in exchange for EV, and thus could lose money just
| based off of market swings (not to blowup levels if they know
| what they're doing), and you can see the difference between the
| styles.
|
| I'm a proponent of both. But generally I hold more respect for
| actual market makers who hold positions and can warehouse risk.
| hyperhopper wrote:
| Yes, the deployment practices were bad, but they still would have
| had an issue even with proper practices.
|
| The real issue was re-using an old flag. That should have never
| been thought of or approved.
| rwmj wrote:
| There's definitely more to this story. Why was there a fixed
| number of "flags" so that they needed to be reused? I wish
| there was a true technical explanation.
| amluto wrote:
| I would argue the real issue was the lack of an automated
| system (or multiple automated systems) that would hit the kill
| switch if the trading activity didn't look right.
| distortionfield wrote:
| But how would you even start to define something as
| stochastic as trading activity as "not looking right"?
| mxz3000 wrote:
| spamming the market with orders for one
| Jorge1o1 wrote:
| I've had to fill out forms for new algorithms / quant
| strategies with questions like:
|
| - how many orders per minute do you expect to create?
|
| - how many orders per minute do you expect to cancel/amend?
|
| - what's your max per-ticker position?
|
| - what's your max strategy-level GMV/NMV?
|
| Etc.
|
| Any one of those questions can be used to set up
| killswitches.
|
| [edited for formatting]
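Answers to questions like these translate directly into automated limits. A rough sketch of the kind of killswitch they feed (the class, limits, and method names are illustrative only, not any real trading system's API):

```python
from collections import defaultdict, deque
import time

class KillSwitch:
    """Trip permanently if declared order-rate or position limits are exceeded."""

    def __init__(self, max_orders_per_min=1000, max_position=50_000):
        self.max_orders_per_min = max_orders_per_min
        self.max_position = max_position
        self.order_times = deque()          # timestamps in the last 60s
        self.position = defaultdict(int)    # net position per ticker
        self.tripped = False

    def record_order(self, ticker, qty, now=None):
        if self.tripped:
            return False                    # trading halted; reject everything
        now = time.monotonic() if now is None else now
        self.order_times.append(now)
        while self.order_times and now - self.order_times[0] > 60:
            self.order_times.popleft()      # slide the one-minute window
        self.position[ticker] += qty
        if (len(self.order_times) > self.max_orders_per_min
                or abs(self.position[ticker]) > self.max_position):
            self.tripped = True
            return False
        return True

ks = KillSwitch(max_orders_per_min=3, max_position=100)
# A runaway loop trips the switch once the declared rate is exceeded.
print([ks.record_order("XYZ", 10, now=t) for t in (0, 1, 2, 3)])
# -> [True, True, True, False]
```

Once tripped, the switch stays tripped; no human judgment call about opportunity cost is in the loop, which is exactly the property Knight lacked during its 45 minutes of hotfixing.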
| notnmeyer wrote:
| this is what stood out to me reading the story. i wonder if
| there was a reason why they opted for this, however half-baked.
|
| it reads less to me like a case for devops as it does a case
| for better practices at every stage of development. how
| arrogant or willfully ignorant do you have to be to operate
| like this considering what's at stake?
| SoftTalker wrote:
| They probably already had a bitfield of feature flags; maybe
| it was a 16-bit integer and full, and someone noticed "hey,
| this one is old, we can reuse it and not have to change the
| datatype"
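A rough illustration of the hazard being guessed at here, assuming the flags really were bits in a shared word (the bit position and constant names are invented; "Power Peg" and SMARS are the system names from the original story):

```python
# A flag word where every bit is taken, so a retired bit gets
# repurposed. The old dead code still tests the same bit and
# silently reactivates when the "new" flag is set.

POWER_PEG    = 1 << 7   # retired years ago; bit never safely reclaimed
SMARS_RETAIL = 1 << 7   # new feature "reuses" the free-looking bit

def old_dead_path(flags):
    # Dead code left in the binary still checks bit 7.
    return bool(flags & POWER_PEG)

def new_path(flags):
    return bool(flags & SMARS_RETAIL)

flags = SMARS_RETAIL          # operator enables only the new feature...
print(old_dead_path(flags))   # ...but the zombie path wakes up too: True
```

Nothing about the flag word itself can distinguish the two meanings, which is why repurposing a bit without first deleting every consumer of the old meaning is so dangerous.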
| notnmeyer wrote:
| ah, yeah--hadn't considered that!
| Neil44 wrote:
| I can only think that it was some kind of fixed binary blob of
| 1/0 flags where all the positions had been used umpteen times
| over the years and nobody wanted to mess with the system to
| replace it with something better.
| lopkeny12ko wrote:
| I'm not sure how automated deployments would have solved this
| problem. In fact, if anything, it would have magnified the impact
| and fallout of the problem.
|
| Substitute "a developer forgot to upload the code to one of the
| servers" for "the deployment agent errored while downloading the
| new binary/code onto the server and a bug in the agent prevented
| the error from being surfaced." Now you have the same failure
| mode, and the impact happens even faster.
|
| The blame here lies squarely with the developers--the code was
| written in a non-backwards-compatible way.
| lwhi wrote:
| Automated deployments require planning before the time they're
| executed.
|
| If code is involved, someone likely reviews and approves it.
|
| There are naturally far more safeguards in place than there
| would be for a manual deployment.
| justinclift wrote:
| In an ideal world, sure.
|
| In the current one, we have Facebook's "Move fast and break
| things" being applied to many things where it has no business
| being.
|
| Banking and communications infrastructure comes to my mind,
| but there are definitely others. :)
| lwhi wrote:
| I think the benefits from automated deployments are things
| that are just par for the course to be honest.
|
| Sure, you can mess these things up, but doing so would
| involve willful negligence rather than someone's
| absent-mindedness.
|
| Basically, I think the takeaway from the article is
| probably worth taking.
| schneems wrote:
| I see this as a problem of not investing enough in the deploy
| process. (Disclosure: I maintain an open source deploy tool for
| a living).
|
| Charity Majors gave a talk at Euruko that talked a lot about
| this. Deploy tooling shouldn't be a bunch of bash scripts in a
| trench coat, it should be fully staffed, fully tested, and
| automated within an inch of its life.
|
| If you have a deploy process that has some kind of immutable
| architecture, tooling to monitor (failed/stuck/incomplete)
| rollouts, and the ability to quickly rollback to a prior known
| good stage then you have layers of protection and an easy
| course of action for when things do go sideways. It might not
| have made this problem impossible, but it would have made it
| harder to happen.
| hooverd wrote:
| > bash scripts in a trench coat
|
| That's an amazing turn of phrase.
| hinkley wrote:
| I wrote a tool to automate our hotfix process, and people
| were somewhat surprised that you could kill the process at
| any step and start over and it would almost always do the
| right thing. Like how did you expect it to work? Why replace
| an error prone process with an error prone and opaque one
| that you can't restart?
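The kill-it-anywhere, rerun-safely property described here boils down to idempotent steps with persisted progress. A toy sketch (the step names, actions, and state file are hypothetical):

```python
import json, os

STATE_FILE = "hotfix_state.json"

def load_state():
    """Read the set of completed step names, or empty set on first run."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return set(json.load(f))
    return set()

def run_step(name, action, done):
    if name in done:
        print(f"skip {name} (already done)")
        return
    action()
    done.add(name)
    with open(STATE_FILE, "w") as f:
        json.dump(sorted(done), f)   # persist progress after every step

def deploy():
    done = load_state()
    steps = [
        ("stop_service",  lambda: print("stopping")),
        ("copy_binary",   lambda: print("copying")),
        ("start_service", lambda: print("starting")),
    ]
    for name, action in steps:
        run_step(name, action, done)

deploy()   # kill it at any point and rerun: completed steps are skipped
```

Each step checks recorded state before doing work, so restarting never repeats a side effect; that check is the "interlock" a manual runbook lacks.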
| Waterluvian wrote:
| Which really means it's a failure of leadership for being so
| incompetent as to allow such an intensely risky situation to
| exist.
| hinkley wrote:
| The goal with automation is that the number of unidentified
| corner cases reduces over time.
|
| A manual runbook is a game of, "I did step 12, I think I did
| step 13, so the next step is 14." that plays out every single
| time you do server work. The thing with the human brain is that
| when you interrupt a task you've done a million times in the
| middle, most people can't reliably discern between this
| iteration and false memories from the last time they did it.
|
| So unless there are interlocks that prevent skipping a step,
| it's a gamble every single time. And the effort involved in
| creating interlocks is a large fraction of the cost of
| automating.
| ledauphin wrote:
| The blame here may indeed lie with whoever decided that reusing
| an old flag was a good idea. As anyone who has been in software
| development for any time can attest, this decision was not
| necessarily - and perhaps not even likely - made by a
| "developer."
| Nekhrimah wrote:
| >whoever decided that reusing an old flag was a good idea.
|
| My understanding is that in high frequency trading,
| minimizing the size of the transmission is paramount. Hence
| re-purposing an existing flag, rather than adding size to the
| packet makes some sense.
| hinkley wrote:
| Flag recycling is a task that should be measured in months to
| quarters, and from what I recall of the postmortem they tried
| to achieve it in weeks, which is just criminally stupid.
|
| It's this detail of the story which flips me from sympathy to
| schadenfreude. You dumb motherfuckers fucked around and found
| out.
| SoftTalker wrote:
| Or at least not by a developer who has made that sort of
| mistake in the past.
|
| I don't know what software engineering programs teach these
| days, but in the 1980s there was very little inclusion of
| case studies of things that went wrong. This was unlike the
| courses in the business school (my undergrad was CS major +
| business minor) and, I would presume, unlike what real
| engineering disciplines teach.
|
| My first exposure to a fuckup in production was a fuckup in
| production on my first job.
| lopkeny12ko wrote:
| I doubt any manager or VP cares or knows enough about the
| technical details of the code to dictate the name that should
| be used for a feature flag, of all things.
| andersa wrote:
| I wonder if this code was written in c++ or similar, the
| flags were actually a bitfield, and they repurposed it
| because they ran out of bits.
|
| Need a space here? Oh, let's throw out this junk nobody used
| in 8 years and there we go...
| CraigRo wrote:
| It is very hard to change the overall size of the messages,
| and there's a lot of pressure to keep them short. So it
| could have been a bitfield or several similar things... e.g
| a value in a char field
| manicennui wrote:
| 9 times out of 10, I see developers making the mistakes that
| everyone seems to want to blame on non-technical people.
| There is a massive amount of software being written by people
| with a wide range of capabilities, and a large number of
| developers never master the basics. It doesn't help that some
| of the worst tools "win" and offer little protection against
| many basic mistakes.
| hinkley wrote:
| For a group who so thoroughly despises bosses that operate
| on 'blame allocation', we spend a lot of time shopping
| around for permission to engage in reckless behavior. Most
| people would call that being a hypocrite.
|
| Whereas I would call it... no, hypocrite works just fine.
| mijoharas wrote:
| At the very least make it two deploys: first actually remove the
| old code that relies on the flag, then repurpose it. It's a giant
| foot-gun to do it all in one, especially without any automated
| deploys.
| thrashh wrote:
| I agree. It doesn't matter if you give an inexperienced person
| a hammer or a saw -- they'll still screw it up.
|
| My biggest pet peeve is that NO ONE ever does failure modeling.
|
| I swear everyone builds things assuming it will work perfectly.
| Then when you mention if one part fails, it will completely
| bring down everything, they'll say that it's a 1 in a million
| chance. Yeah, the problem isn't that it's unlikely, it's that
| when it does happen, you've perfectly designed your system to
| destroy itself.
| dmurray wrote:
| > The blame here lies squarely with the developers--the code
| was written in a non-backwards-compatible way.
|
| The blame completely lies with the risk management team.
|
| The market knew there was a terrible problem, Knight knew there
| was a problem, yet it took 45 minutes of trying various
| hotfixes before they ceased trading. Either because they didn't
| have a kill switch, or because no one was empowered to pull the
| kill switch because of the opportunity cost (perhaps pulling
| the switch at the wrong time costs $500k in opportunity).
|
| I worked for a competitor to Knight at the time, and we
| deployed terrible bugs to production all the time, and during
| post mortems we couldn't fathom the same thing happening to us.
| A dozen automated systems would have kicked in to stop
| individual trades, and any senior trader or operations person
| could have got a kill switch pulled with 60 seconds of
| dialogue, and not feared the repercussions. Actually, we made
| way less of Knight's $400m than we could have because our risk
| systems kept shutting strategies down because what was
| happening was "too good to be true".
| mpeg wrote:
| It's nice to see your perspective as someone familiar with
| better systems.
|
| I have always found this story fascinating; in my junior days
| I worked at a relatively big adtech platform (ie billions of
| impressions per day) and as cowboy as we were about lots of
| things, all our systems always had kill switches that could
| stop spending money and I could have pulled them with minimal
| red tape if I suspected something was wrong.
|
| And this was for a platform where our max loss for an hour
| would have hurt but not killed the business (maybe a six
| figure loss), I can't imagine not having layers of risk
| management systems in HFT software.
| [deleted]
| moeris wrote:
| Automated deployments would have allowed you to review the
| deployment before it happened. A failed deployment could be
| configured to allow automatic rollbacks. Automated deployments
| should also handle experiment flags, which could have been
| toggled to reduce impact. There are a bunch of places where it
| could have intervened and mitigated/prevented this whole
| situation.
| stusmall wrote:
| I think the big improvement would be consistency. Either all
| servers would be correct or all servers would be incorrect. The
| step where "Since they were unable to determine what was
| causing the erroneous orders they reacted by uninstalling the
| new code from the servers it was deployed to correctly"
| wouldn't have had a negative impact. They could have even
| instantly rolled back. Also if they were using the same
| automated deployment processes for their test environment they
| might have even caught this in QA.
| rcpt wrote:
| It seems like the kind of thing that would be canaried which is
| the kind of thing that you'd typically build alongside
| automated deployment
| jokoon wrote:
| Oh no
|
| Anyways
| m3kw9 wrote:
| Someone missed a blind spot
| gumby wrote:
| > (why code that had been dead for 8-years was still present in
| the code base is a mystery, but that's not the point).
|
| Actually it's a big part of the point: they have a system that
| works with dead code in it. If you remove that dead code perhaps
| it unwittingly breaks something else.
|
| That kind of Chesterton's fence is a good practice.
| rkuykendall-com wrote:
| Leaving dead code in is not good practice?? I would love more
| explanation here because that sounds like crazy talk to me.
| gumby wrote:
| You'll have to ask the author of the article.
| fphhotchips wrote:
| Your original comment is somewhat unclear. Are you
| advocating for leaving old code in because the system works
| and it's more stable that way, or taking it out to force
| the necessary refactoring steps and understanding that will
| bring?
| gumby wrote:
| I'm sorry I wasn't clear: I re-read my comment and
| couldn't think of a decent edit.
|
| It was the author whom I was quoting as saying "why would
| someone have old code lying around." It seems obvious why
| that's a good idea and it seems commenters in this thread
| (including you) agree with me and not the author.
|
| Sorry again if I was unclear.
| [deleted]
| lionkor wrote:
| It may not be obvious that it's dead code - in a lot of
| popular interpreted languages, it's impossible to tell if a
| given function can be called or not
| OtherShrezzing wrote:
| Chesterton's Fence states that you shouldn't make a change
| until you understand something's current state. Removing code
| because it's dead is folly, if you don't understand 1) why
| it's there, and 2) why nobody else removed it yet.
| meiraleal wrote:
| As this is a postmortem, it was proven dead code. There is
| nothing in the text that mentions that they didn't know
| what the code did (which then wouldn't be dead code).
| [deleted]
| motoboi wrote:
| Changes we make to software and hardware infrastructure are
| essentially hypotheses. They're backed by evidence suggesting
| that these modifications will achieve our intended objectives.
|
| What's crucial is to assess how accurately your hypothesis
| reflects the reality once it's been implemented. Above all, it's
| important to establish an instance that would definitively
| disprove your hypothesis - an event that wouldn't occur if your
| hypothesis holds true.
|
| Harnessing this viewpoint can help you sidestep a multitude of
| issues.
| rvz wrote:
| But ChatGPT would have fixed the issue faster in 45 mins than a
| human would. /s
|
| A high-risk situation like this makes using LLMs a non-option;
| I say this before someone puts out a 'use case' for an LLM to
| fix this issue.
|
| I'm sorry to preempt the thought of this in advance, but it would
| not.
| uxp8u61q wrote:
| Who are you replying to? Nobody but you talked about chatbots
| in this thread. Are you talking to yourself?
| bsagdiyev wrote:
| No they're preempting someone coming along and claiming this.
| Haven't seen it in the replies yet but there's typically one
| (or a lot in some cases) person(s) claiming ChatGPT will
| bring Jesus back from the dead sort of thing.
| gumballindie wrote:
| > Had Knight implemented an automated deployment system -
| complete with configuration, deployment and test automation - the
| error that cause the Knightmare would have been avoided.
|
| Would it have been avoided though? Configuration, deployment and
| test automation mean nothing if they don't do what they are
| supposed to do. Regardless of how many tests you have, if you
| don't test for the right stuff it's all useless.
| firesteelrain wrote:
| Automation is not a silver bullet. Automation is still designed
| by humans. Peer reviews, acceptance test procedures, promotion
| procedures, etc all would have helped. And yes some of those
| things are manual. Sandbox environments, etc
| dkarl wrote:
| > why code that had been dead for 8-years was still present in
| the code base is a mystery, but that's not the point
|
| It's not the worst mistake in the story, but it's not "not the
| point." A proactive approach to pruning dead functionality would
| have resulted in a less complex, better-understood piece of
| software with less potential to go haywire. Driving relentlessly
| forward without doing this kind of maintenance work is a risk,
| calculated or otherwise.
| 40yearoldman wrote:
| lol. No. Deployments were not the issue. At any given time an
| automated deployment system could have had a mistake introduced
| that resulted in bad code being sent to the system. It does not
| matter if it was old or new code. Any code could have had this
| bug.
|
| The issue was one that I see often. Firstly, no visibility
| into the system. Not even a dashboard showing the software's
| running version. How often I see people ship software without
| a banner posting its version and/or an endpoint that simply
| reports the version.
|
| Secondly no god damn kill switch. You are working with money!!
| Shutting down has to be an option.
| [deleted]
| 40yearoldman wrote:
| Oh god. I just realized this is a PM. A blight on software
| engineering. People who play technical, and "take the
| requirements from the customer to the engineer". What's worse
| is when they play engineer too.
| INTPenis wrote:
| I mean it makes no sense, without even reading the article,
| just by working in IT I can tell you that if you're one
| deployment away from being bankrupt then you're either doing it
| wrong, or in the wrong business.
| foota wrote:
| The real issue here (sorry for true Scotsman-ing) is that they
| were using an untested combination of configuration and binary
| release. Configuration and binaries can be rolled out in
| lockstep, preventing this class of issues.
|
| Of course there were other mistakes here etc., but the issue
| wouldn't have been possible if this weren't the case.
| nickdothutton wrote:
| "The code that that was updated repurposed an old flag..." Was as
| far as I needed to read. Never do this.
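The safer alternative is to retire the old flag loudly and introduce a fresh one, so any stale server fails fast instead of silently reactivating dead code. A sketch (the flag names are invented for illustration):

```python
# Flags whose old meaning is dead code; setting them is now an error.
RETIRED_FLAGS = {"power_peg"}

def parse_flags(flags: set) -> set:
    """Fail fast if any retired flag is set, instead of reusing it."""
    stale = flags & RETIRED_FLAGS
    if stale:
        raise ValueError(f"retired flag(s) set: {sorted(stale)}")
    return flags
```

Reserving a retired flag's name forever is cheap; debugging what an old binary does when the flag's meaning has changed underneath it is not.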
| valdiorn wrote:
| Literally everyone in quant finance knows about Knight Capital.
| It even has its own phrase, "pulling a Knight Capital" (meaning:
| cutting corners on mission-critical systems, even ones that can
| bankrupt the company in an instant, and experiencing the
| consequences).
| shric wrote:
| Indeed, it's used in onboarding material at my employer.
| mxz3000 wrote:
| Yeap, it's used as a case study for us as to the worst case
| scenario in trading incidents. Definitely humbling.
| supportengineer wrote:
| They were missing any kind of risk mitigation steps in their
| deployment practice.
| hyperhello wrote:
| There's no money for that.
| earnesti wrote:
| It is funny, but in one company I worked for, the more
| people they added the more they neglected the basics, such as
| backups. There were heavy processes for many things and they
| were followed very well, but for whatever reason some really
| basic things went unnoticed for many years.
| Gibbon1 wrote:
| I think Goldman Sachs or someone big like that had a similar
| oopsie. And what happened was the exchange reversed all their
| bad trades.
| dang wrote:
| Related:
|
| _Knightmare: A DevOps Cautionary Tale (2014)_ -
| https://news.ycombinator.com/item?id=22250847 - Feb 2020 (33
| comments)
|
| _Knightmare: A DevOps Cautionary Tale (2014)_ -
| https://news.ycombinator.com/item?id=8994701 - Feb 2015 (85
| comments)
|
| _Knightmare: A DevOps Cautionary Tale_ -
| https://news.ycombinator.com/item?id=7652036 - April 2014 (60
| comments)
|
| Also:
|
| _The $440M software error at Knight Capital (2019)_ -
| https://news.ycombinator.com/item?id=31239033 - May 2022 (172
| comments)
|
| _Bugs in trading software cost Knight Capital $440M_ -
| https://news.ycombinator.com/item?id=4329495 - Aug 2012 (1
| comment)
|
| _Knight Capital Says Trading Glitch Cost It $440 Million_ -
| https://news.ycombinator.com/item?id=4329101 - Aug 2012 (90
| comments)
|
| Others?
| taspeotis wrote:
| Needs (2014) in the title.
| [deleted]
| realreality wrote:
| The moral of the story is: don't engage in dubious practices like
| high speed trading.
| tacker2000 wrote:
| How is high speed trading any more dubious than long term
| holding, or shorting, etc?
| realreality wrote:
| It's all dubious.
| alphanullmeric wrote:
| good to know that you don't consent to what other people do
| with their money.
| dexwiz wrote:
| They were market makers, which is different. They help ensure
| that when you push sell on E*Trade you actually get a price
| somewhat close to your order in relatively short time. No need
| to call up a broker who will route the order to a guy shouting
| on the floor.
| eddtests wrote:
| And remove easy/quick liquidity for the rest of the market?
|
| Edit: downvotes, any reason why? Or just HFT == Bad?
| [deleted]
| realreality wrote:
| "The market" shouldn't even exist.
| daft_pink wrote:
| I'm so glad I don't write code that automatically routes millions
| of dollars with no human intervention.
|
| It's like writing code that flies a jumbo jet.
|
| Who wants that kind of responsibility.
| hgomersall wrote:
| I'm so glad I'm not wasting my life working in finance.
| shric wrote:
| I've worked in various small to medium IT companies, a FAANG
| and another fortune 500 tech company. 6 months ago I moved to
| a proprietary trading company/market maker and it's the most
| interesting and satisfying place I've worked so far.
|
| I hope to continue to "waste my life" for many years to come.
| hammeiam wrote:
| May I ask which one, and what your process was that led you
| to them?
| envsubst wrote:
| I'm sure all your jobs have contributed to the well being of
| humanity.
| goldinfra wrote:
| Most jobs do in fact contribute to the well being of
| humanity, however little. It's few jobs, like most in
| financial trading, that actively reduce the well being of
| humanity.
|
| Never will you meet a more self-deluded and pathetic set of
| humans. Desperate money addicts that often become other
| kinds of addicts. Whole thing should be abolished.
|
| Source: I worked in finance when I was young and dumb.
| yieldcrv wrote:
| I, for one, am so glad to "own" the "compose" button at a
| democracy destabilizing adtech-conglomerate
| meiraleal wrote:
| > Most jobs do in fact contribute to the well being of
| humanity, however little.
|
| No, they don't. A lot of jobs hold us back, actually.
| Salespeople selling things people don't wanna buy,
| finance and tech bros vampirizing third world countries
| without the safeguards that western countries have on
| their capital markets, etc.
| m463 wrote:
| > It's like writing code that flies a jumbo jet.
|
| and upgrading it from a coach seat
| wruza wrote:
| It feels anxiety-inducing at first, but if you have good
| controls and monitoring in place, it becomes daily routine. You
| basically address the weak points as they surface, and the more
| reasonably anxious you are, the better for the business. From
| my experience with finance, I'd wager the problem at Knight
| was 10% tech issues, 90% a CTO-ish person feeling ballsy. In
| general over time, not just on that particular day or week.
| Waterluvian wrote:
| It's not scary when it's done properly. And done properly can
| look like an incredibly tedious job. I think it's for a certain
| kind of person who loves the process and the tests and the
| simulators and the redundancy. Where only 1% of the engineering
| is the code that flies the plane.
| callalex wrote:
| It's fine to have that kind of responsibility, but it has to
| actually be your responsibility. Which means you have to be
| empowered to say "no, we aren't shipping this until XYZ is
| fixed" even if XYZ will take another two years to build and the
| boss wants to ship tomorrow.
| salawat wrote:
| Yep. Until the capacity to say unoverridably "No"
| materializes, there's a lot of code I refuse to have
| responsibility for delegated to me.
| wruza wrote:
| As someone who doesn't take the profits, what responsibility
| can a worker even have? Realistically it lies in the range of
| their monthly paycheck and pending bonuses, and in a moral
| obligation to operate a failing system until it lands somewhere.
| Everything above that is a systemic risk for the profit taker,
| which, if left unaddressed, is absolutely on them. There's no
| way you can take responsibility for $400M unless you have
| that money.
| codegeek wrote:
| I refuse to believe that a failed deployment can bring a company
| down. That is just a symptom. The root cause has to be a whole
| big collection of decisions and processes/systems built over
| years.
___________________________________________________________________
(page generated 2023-09-10 23:00 UTC)