[HN Gopher] Knightmare: A DevOps Cautionary Tale (2014)
___________________________________________________________________
Knightmare: A DevOps Cautionary Tale (2014)
Author : sathishmanohar
Score : 218 points
Date : 2023-09-10 20:07 UTC (2 hours ago)
(HTM) web link (dougseven.com)
(TXT) w3m dump (dougseven.com)
| codeulike wrote:
| When I got to the memorable words "Power Peg" I remembered I'd
| heard all about this before.
| xyst wrote:
| Having worked in some Fortune 500 financial firms and low rent
| "fintech" upstarts, I am not surprised this happened. Decades of
| bandaid fixes, years of rotating out different
| consultants/contractors, and software rot. Plus years of
| emphasizing mid level management over software quality.
|
| As others have mentioned, I don't think "automation of deployment"
| would have prevented this company's inevitable downfall. If it
| hadn't been this one incident in 2012, it would have been another
| incident later on.
| hinkley wrote:
| It's an entire industry built on adrenaline, bravado, and let's
| be honest: testosterone. How could their IT discipline be
| described as anything other than "YOLO"?
|
| Trading is mostly based on a book that, like the waterfall
| model, was meant to be a cautionary tale on how _not_ to do
| things. Liars' Poker had the exact opposite effect of Silent
| Spring. Imagine if Rachel Carson's book had come out and people
| had decided that a career in pesticides was more glamorous than
| being a doctor or a lawyer, we had made movies glorifying
| spraying pesticides everywhere and on everything, and had told
| anyone who thought you were crazy that they're a jealous loser
| and to fuck off.
| jonplackett wrote:
| The thing I was surprised about is that they survived!
|
| After all that they still got a $400 million cash bailout!
| m_0x wrote:
| It wasn't a bailout, it was investment money.
|
| Bailout could imply the government throwing a lifeline
| kaashif wrote:
| > It wasn't a bailout, it was investment money.
|
| Almost all bailouts are investments, whether it's a
| government or private bailout.
|
| The investments are questionable sometimes, but they're
| still investments.
| leoqa wrote:
| The nuance is a) what happens to existing equity
| stakeholders and b) does the bailout have to be repaid.
|
| If the answer is nothing and no, then it's a bailout
| philosophically. If the existing investors get diluted
| then they're in part paying for the new capital
| injection.
| [deleted]
| ben-gy wrote:
| It wasn't a bail out, it was an opportunity for investors to
| get great terms on equity at a crucial juncture for the
| company.
| swores wrote:
| A government bail out isn't the exclusive use of the phrase
| "bail out", it was both a bail out and an opportunity for
| investors to get great terms on equity.
| xcv123 wrote:
| They were probably worth much more than $400M before the
| failure so it was a good investment opportunity. They would
| have been a money printing machine aside from this one major
| fuckup.
| nly wrote:
| Their IP (proprietary trading algos) etc were probably
| worth a lot at the time.
|
| These days probably not so. I wouldn't imagine there are
| any market makers left in NYSE and NASDAQ who aren't
| deploying FPGAs to gain a speed edge.
| [deleted]
| hedora wrote:
| No continuous deployment system I have worked with would have
| blocked this particular bug.
|
| They were in a situation where they were incrementally rolling
| out, but the code had a logic bug where the failure of one
| install within an incremental rollout step bankrupted the
| company.
|
| I'd guard against this with runtime checks that the software
| version (e.g. git sha) matches, and also add fault injection into
| tests that invoke the software rollout infrastructure.
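The runtime check described above can be sketched roughly as follows. Everything here is invented for illustration (the host names, the `DEPLOYED` map standing in for a real version-reporting endpoint, and the function names); it only shows the shape of the idea: gate the rollout on every server reporting the expected git SHA.

```python
EXPECTED_SHA = "a1b2c3d"

# Stub of per-host version reporting; a real check would query each
# server (e.g. a hypothetical /version endpoint) for its running SHA.
DEPLOYED = {"srv1": "a1b2c3d", "srv2": "a1b2c3d", "srv3": "deadbee"}

def get_deployed_sha(host):
    return DEPLOYED.get(host)

def fleet_is_consistent(hosts, expected=EXPECTED_SHA):
    """Return False if any host runs a version other than `expected`."""
    mismatched = [h for h in hosts if get_deployed_sha(h) != expected]
    if mismatched:
        print(f"ABORT rollout: stale version on {mismatched}")
        return False
    return True

# One straggler (like Knight's eighth server) fails the gate for
# the whole fleet: srv3 is stale, so this returns False.
print(fleet_is_consistent(list(DEPLOYED)))
```

The point is that the new code path is never enabled until the fleet-wide check passes, which is exactly the failure mode a single forgotten server creates.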
| hinkley wrote:
| > $400M in assets to bankrupt
|
| Was this Knight Capital?
|
| > Knight Capital Group
|
| Yep. Practically the canonical case study in deployment errors.
| pavas wrote:
| My team's systems play a critical role for several $100M of sales
| per day, such that if our systems go down for long enough, these
| sales will be lost. Long enough means at least several hours and
| in this time frame we can get things back to a good state, often
| without much external impact.
|
| We too have manual processes in place, but for any manual process
| we document the rollback steps (before starting) and monitor the
| deployment. We also separate deployment of code with deployment
| of features (which is done gradually behind feature flags). We
| insist that any new features (or modification of code) requires a
| new feature flag; while this is painful and slow, it has helped
| us avoid risky situations and panic and alleviated our ops and
| on-call burden considerably.
|
| For something to go horribly wrong, it would have to fail many
| "filters" of defects: 1. code review--accidentally introducing a
| behavioral change without a feature flag (this can happen, e.g.
| updating dependencies), 2. manual and devo testing (which is hit
| or miss), 3. something in our deployment fails (luckily this is
| mostly automated, though as with all distributed systems there
| are edge cases), 4. Rollback fails or is done incorrectly. 5.
| Missing monitoring to alert us that the issue still hasn't been
| fixed. 6. Fail to escalate the issue in time to higher levels. 7.
| Enough time passes that we miss out on the ability to meet our
| SLA, etc.
|
| For any riskier manual changes we can also require two people to
| make the change (one points out what's being changed over a video
| call, the other verifies).
|
| If you're dealing with a system where your SLA is in minutes, and
| changes are irreversible, you need to know how to practically
| monitor and rollback within minutes, and if you're doing
| something new and manually, you need to quadruple check
| everything and have someone else watching you make the change, or
| it's only a matter of time before enough things go wrong in a row
| and you can't fix it. It doesn't matter how good or smart you
| are, mistakes will always happen when people have to manually
| make or initiate a change, and that chance of making mistakes
| needs to be built into your change management process.
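The separation of code deployment from feature release described above can be sketched as flag-gated dispatch. The flag store and names below are invented for illustration, not any particular feature-flag library; the point is that the new code ships dark, ramps gradually, and rolls back with a flag flip rather than a redeploy.

```python
import hashlib

# Percent of traffic on the new path; deployed code starts dark at 0%.
FLAGS = {"new_pricing_path": 0}

def is_enabled(flag, user_id):
    """Deterministically bucket a user into [0, 100) and compare to ramp %."""
    pct = FLAGS.get(flag, 0)
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < pct

def handle_order(user_id):
    if is_enabled("new_pricing_path", user_id):
        return "new"
    return "old"   # deployed code still defaults to the proven path

# Deploy everywhere at 0%, then ramp gradually; rollback is a flag
# flip, not a redeploy.
FLAGS["new_pricing_path"] = 0
assert handle_order("alice") == "old"
```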
| coldtea wrote:
| > _My team's systems play a critical role for several $100M of
| sales per day, such that if our systems go down for long
| enough, these sales will be lost._
|
| Would they? Or would they just happen later? In a lot of cases
| in regular commerce, or even B2B, the same sales can often be
| attempted again by the client a little later; it's not "now
| or never". As a user I have retried things I wanted to buy when
| a vendor was down (usually because of a new announcement and
| big demand breaking their servers) or when my bank had some
| maintenance issue, and so on.
| yardstick wrote:
| It would be a serious issue for in-person transactions like
| shops, supermarkets, gas stations, etc.
|
| Imagine Walmart or Costco or Chevron centralised payment
| services went down for 30+ mins. You would get a lot of lost
| sales from those who don't carry enough cash to cover it
| otherwise. Maybe a retailer might have a zapzap machine but
| lots of cards aren't imprinted these days so that's a non
| starter too.
| kaashif wrote:
| > Maybe a retailer might have a zapzap machine but lots of
| cards aren't imprinted these days so that's a non starter
| too.
|
| When I Google "zapzap machine" this comment is the only
| result, but after looking around on Wikipedia, I see this
| is a typo for "zipzap".
|
| Is this really the only time in history someone has typoed
| zipzap as zapzap? I guess so.
| pavas wrote:
| It's both (though I would lean towards lost for a majority of
| them). It's also true that the longer the outage, the greater
| the impact, and you have to take into account knock-on
| effects such as loss of customer trust. Since these are
| elastic consumer goods, and ours isn't the only marketplace,
| customers have choice. Customers will typically compare
| price, then speed.
|
| It's also probably true that a one-day outage would have a
| negative net present value (taking into account all future
| sales) far exceeding the daily loss in sales, due to loss of
| customer goodwill.
| reedf1 wrote:
| Gotta imagine the sinking feeling that guy felt.
| thorum wrote:
| Honestly seems like the market itself should have safeguards
| against this kind of thing.
| alpark3 wrote:
| It's interesting to note that exchanges are adding "obvious
| error" rules that can slash trades under certain circumstances.
| SoftTalker wrote:
| The safeguard is "you go bankrupt if you fuck up"
|
| Imagine there was some way for a trading company to execute
| billions of dollars of trades and then say "oops, sorry, that
| was all a mistake". Can you not see how that would be abused?
|
| Now, the story also says that within a minute of the market
| opening, the experienced traders knew something was wrong. Do
| they bear any culpability for jumping on those trades, making
| their money off of something they knew couldn't be intentional?
| tomp wrote:
| Ah, Knight Capital. The warning story for every quant trader /
| engineer.
|
| This is what people don't realize when they say HFT (high
| frequency trading) is risk-free, leeching off people, etc.
|
| You make a million every day with very little volatility (the
| traditional way of quantifying "risk" in finance) but one little
| mistake, and you're gone. The technical term is "picking up
| pennies in front of a steamroller (train)". Selling options is
| also like that.
| [deleted]
| loa_in_ wrote:
| I don't see how anything here goes against it leeching off
| people
| dasil003 wrote:
| Depends on whether they truly take on the risk. Interestingly
| I can't clearly tell from a quick google who exactly ended up
| holding the bag here, and what became of upper management.
| Nevermark wrote:
| Leeching implies someone has found a way to skim value from
| you without providing value.
|
| Someone taking on loads of risk to carry out your commands
| efficiently is providing value.
|
| You can argue whether they are doing so competently or not,
| or whether they are pricing optimally or not, but they are
| not just "takers" or "leeches".
| Guvante wrote:
| What risk are they taking exactly? Bugs ruining the
| business isn't meaningful risk for the customer. It isn't
| like day traders are at risk of going bankrupt due to that
| after all.
|
| They claim liquidity is their value but given how they act
| they don't seem to be providing measurable liquidity,
| either in terms of price or volume. (Yes they increase
| volume by getting in the middle of trades but that isn't
| useful volume...)
| hackerlight wrote:
| Market risk isn't the only type of risk. Many businesses
| in other industries don't have market risk, that isn't
| abnormal. Even businesses that you would expect to be
| exposed to market risk aren't, since they hedge most or
| all of it.
|
| There's operational risk, like what brought down Knight
| Capital, that's a type of risk. Or the risk that you will
| be put out of business by competition because you were
| too slow to innovate while burning through all your cash
| runway. HFT firms face the same risks that other types of
| businesses face. Smaller HFT firms fail often, and larger
| firms tend to stay around (although sometimes they also
| fail and often they shrink), which is similar to many
| mature competitive industries.
|
| > given how they act they don't seem to be providing
| measurable liquidity
|
| I'm not sure "How they act" should inform one's
| perspective on the empirical question of whether or not
| they are adding to liquidity. There is a lot of serious
| debate and research that has gone into that question.
| erik_seaberg wrote:
| If a seller and a buyer are in market within seconds of
| each other, they would have traded successfully without a
| third party taking some of their money. As I understand it,
| HFTs are trying to _avoid_ taking meaningful long-term
| positions (which is why latency matters only to them).
| UncleMeat wrote:
| How are they taking on loads of risk? Risk has a particular
| meaning in investment and "well, a bug can blow up my
| company" isn't part of that meaning.
|
| Simply creating risky (in the colloquial meaning) things is
| not itself a reason to deserve money.
| siftrics wrote:
| When you make markets you are literally paid the spread
| to assume the risk of holding the position.
| envsubst wrote:
| > Risk has a particular meaning in investment
|
| Risk in finance definitely takes on more meaning than the
| narrow definition in modern portfolio theory (stddev of
| price).
| jraph wrote:
| What value do they provide?
| callalex wrote:
| They claim to provide liquidity, even though they are
| just front-running trades that are already happening
| anyway.
| alpark3 wrote:
| Most people confuse market making/risk holding with high
| frequency statistical arbitrage strategies. I'm not totally
| sure exactly what Knight Capital was running, but generally the
| only "little" mistakes that would blow up HFT market takers
| such as Jump (for the most part) are egregious technical errors
| like this, or violations of assumptions outside of market
| conditions (legal, structural, etc.). Compare this to market
| makers like Jane Street, who hold
| market risk in exchange for EV, and thus could lose money just
| based off of market swings (not to blowup levels if they know
| what they're doing), and you can see the difference between the
| styles.
|
| I'm a proponent of both. But generally I hold more respect for
| actual market makers who hold positions and can warehouse risk.
| hyperhopper wrote:
| Yes, the deployment practices were bad, but they still would have
| had an issue even with proper practices.
|
| The real issue was re-using an old flag. That should have never
| been thought of or approved.
| rwmj wrote:
| There's definitely more to this story. Why was there a fixed
| number of "flags" so that they needed to be reused? I wish
| there was a true technical explanation.
| amluto wrote:
| I would argue the real issue was the lack of an automated
| system (or multiple automated systems) that would hit the kill
| switch if the trading activity didn't look right.
| distortionfield wrote:
| But how would you even start to define something as
| stochastic as trading activity as "not looking right"?
| mxz3000 wrote:
| spamming the market with orders for one
| Jorge1o1 wrote:
| I've had to fill out forms for new algorithms / quant
| strategies with questions like:
|
| - how many orders per minute do you expect to create?
|
| - how many orders per minute do you expect to cancel/amend?
|
| - what's your max per-ticker position?
|
| - what's your max strategy-level GMV/NMV?
|
| Etc.
|
| Any one of those questions can be used to set up
| killswitches.
|
| [edited for formatting]
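Answers to questions like these translate directly into automated limits. A rough sketch of the kind of killswitch they feed (the class, limits, and method names are illustrative only, not any real trading system's API):

```python
from collections import defaultdict, deque
import time

class KillSwitch:
    """Trip permanently if declared order-rate or position limits are exceeded."""

    def __init__(self, max_orders_per_min=1000, max_position=50_000):
        self.max_orders_per_min = max_orders_per_min
        self.max_position = max_position
        self.order_times = deque()          # timestamps in the last 60s
        self.position = defaultdict(int)    # net position per ticker
        self.tripped = False

    def record_order(self, ticker, qty, now=None):
        if self.tripped:
            return False                    # trading halted; reject everything
        now = time.monotonic() if now is None else now
        self.order_times.append(now)
        while self.order_times and now - self.order_times[0] > 60:
            self.order_times.popleft()      # slide the one-minute window
        self.position[ticker] += qty
        if (len(self.order_times) > self.max_orders_per_min
                or abs(self.position[ticker]) > self.max_position):
            self.tripped = True
            return False
        return True

ks = KillSwitch(max_orders_per_min=3, max_position=100)
# A runaway loop trips the switch once the declared rate is exceeded.
print([ks.record_order("XYZ", 10, now=t) for t in (0, 1, 2, 3)])
# -> [True, True, True, False]
```

Once tripped, the switch stays tripped; no human judgment call about opportunity cost is in the loop, which is exactly the property Knight lacked during its 45 minutes of hotfixing.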
| notnmeyer wrote:
| this is what stood out to me reading the story. i wonder if
| there was a reason why they opted for this, however half-baked.
|
| it reads less to me like a case for devops as it does a case
| for better practices at every stage of development. how
| arrogant or willfully ignorant do you have to be to operate
| like this considering what's at stake?
| SoftTalker wrote:
| They probably already had a bitfield of feature flags; maybe
| it was a 16-bit integer and full, and someone noticed "hey,
| this one is old, we can reuse it and not have to change the
| datatype"
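A rough illustration of the hazard being guessed at here, assuming the flags really were bits in a shared word (the bit position and constant names are invented; "Power Peg" and SMARS are the system names from the original story):

```python
# A flag word where every bit is taken, so a retired bit gets
# repurposed. The old dead code still tests the same bit and
# silently reactivates when the "new" flag is set.

POWER_PEG    = 1 << 7   # retired years ago; bit never safely reclaimed
SMARS_RETAIL = 1 << 7   # new feature "reuses" the free-looking bit

def old_dead_path(flags):
    # Dead code left in the binary still checks bit 7.
    return bool(flags & POWER_PEG)

def new_path(flags):
    return bool(flags & SMARS_RETAIL)

flags = SMARS_RETAIL          # operator enables only the new feature...
print(old_dead_path(flags))   # ...but the zombie path wakes up too: True
```

Nothing about the flag word itself can distinguish the two meanings, which is why repurposing a bit without first deleting every consumer of the old meaning is so dangerous.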
| notnmeyer wrote:
| ah, yeah--hadn't considered that!
| Neil44 wrote:
| I can only think that it was some kind of fixed binary blob of
| 1/0 flags where all the positions had been used umpteen times
| over the years and nobody wanted to mess with the system to
| replace it with something better.
| lopkeny12ko wrote:
| I'm not sure how automated deployments would have solved this
| problem. In fact, if anything, it would have magnified the impact
| and fallout of the problem.
|
| Substitute "a developer forgot to upload the code to one of the
| servers" for "the deployment agent errored while downloading the
| new binary/code onto the server and a bug in the agent prevented
| the error from being surfaced." Now you have the same failure
| mode, and the impact happens even faster.
|
| The blame here lies squarely with the developers--the code was
| written in a non-backwards-compatible way.
| lwhi wrote:
| Automated deployments require planning before the time they're
| executed.
|
| If code is involved, someone likely reviews and approves it.
|
| There are naturally far more safeguards in place than there
| would be for a manual deployment.
| justinclift wrote:
| In an ideal world, sure.
|
| In the current one, we have Facebook's "Move fast and break
| things" being applied to many things where it has no business
| being.
|
| Banking and communications infrastructure comes to my mind,
| but there are definitely others. :)
| lwhi wrote:
| I think the benefits from automated deployments are things
| that are just par for the course to be honest.
|
| Sure, you can mess these things up, but doing so would
| involve willful negligence rather than someone's
| absent-mindedness.
|
| Basically, I think the takeaway from the article is
| probably worth taking.
| schneems wrote:
| I see this as a problem of not investing enough in the deploy
| process. (Disclosure: I maintain an open source deploy tool for
| a living).
|
| Charity Majors gave a talk at Euruko that talked a lot about
| this. Deploy tooling shouldn't be a bunch of bash scripts in a
| trench coat, it should be fully staffed, fully tested, and
| automated within an inch of its life.
|
| If you have a deploy process that has some kind of immutable
| architecture, tooling to monitor (failed/stuck/incomplete)
| rollouts, and the ability to quickly rollback to a prior known
| good stage then you have layers of protection and an easy
| course of action for when things do go sideways. It might not
| have made this problem impossible, but it would have made it
| harder to happen.
| hooverd wrote:
| > bash scripts in a trench coat
|
| That's an amazing turn of phrase.
| hinkley wrote:
| I wrote a tool to automate our hotfix process, and people
| were somewhat surprised that you could kill the process at
| any step and start over and it would almost always do the
| right thing. Like how did you expect it to work? Why replace
| an error prone process with an error prone and opaque one
| that you can't restart?
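The kill-it-anywhere, rerun-safely property described here boils down to idempotent steps with persisted progress. A toy sketch (the step names, actions, and state file are hypothetical):

```python
import json, os

STATE_FILE = "hotfix_state.json"

def load_state():
    """Read the set of completed step names, or empty set on first run."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return set(json.load(f))
    return set()

def run_step(name, action, done):
    if name in done:
        print(f"skip {name} (already done)")
        return
    action()
    done.add(name)
    with open(STATE_FILE, "w") as f:
        json.dump(sorted(done), f)   # persist progress after every step

def deploy():
    done = load_state()
    steps = [
        ("stop_service",  lambda: print("stopping")),
        ("copy_binary",   lambda: print("copying")),
        ("start_service", lambda: print("starting")),
    ]
    for name, action in steps:
        run_step(name, action, done)

deploy()   # kill it at any point and rerun: completed steps are skipped
```

Each step checks recorded state before doing work, so restarting never repeats a side effect; that check is the "interlock" a manual runbook lacks.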
| Waterluvian wrote:
| Which really means it's a failure of leadership for being so
| incompetent as to allow such an intensely risky situation to
| exist.
| hinkley wrote:
| The goal with automation is that the number of unidentified
| corner cases reduces over time.
|
| A manual runbook is a game of, "I did step 12, I think I did
| step 13, so the next step is 14." that plays out every single
| time you do server work. The thing with the human brain is that
| when you interrupt a task you've done a million times in the
| middle, most people can't reliably discern between this
| iteration and false memories from the last time they did it.
|
| So unless there are interlocks that prevent skipping a step,
| it's a gamble every single time. And the effort involved in
| creating interlocks is a large fraction of the cost of
| automating.
| ledauphin wrote:
| The blame here may indeed lie with whoever decided that reusing
| an old flag was a good idea. As anyone who has been in software
| development for any time can attest, this decision was not
| necessarily - and perhaps not even likely - made by a
| "developer."
| Nekhrimah wrote:
| >whoever decided that reusing an old flag was a good idea.
|
| My understanding is that in high frequency trading,
| minimizing the size of the transmission is paramount. Hence
| re-purposing an existing flag, rather than adding size to the
| packet makes some sense.
| hinkley wrote:
| Flag recycling is a task that should be measured in months to
| quarters, and from what I recall of the postmortem they tried
| to achieve it in weeks, which is just criminally stupid.
|
| It's this detail of the story which flips me from sympathy to
| schadenfreude. You dumb motherfuckers fucked around and found
| out.
| SoftTalker wrote:
| Or at least not by a developer who has made that sort of
| mistake in the past.
|
| I don't know what software engineering programs teach these
| days, but in the 1980s there was very little inclusion of
| case studies of things that went wrong. This was unlike the
| courses in the business school (my undergrad was CS major +
| business minor) and, I would presume, unlike what real
| engineering disciplines teach.
|
| My first exposure to a fuckup in production was a fuckup in
| production on my first job.
| lopkeny12ko wrote:
| I doubt any manager or VP cares or knows enough about the
| technical details of the code to dictate the name that should
| be used for a feature flag, of all things.
| andersa wrote:
| I wonder if this code was written in c++ or similar, the
| flags were actually a bitfield, and they repurposed it
| because they ran out of bits.
|
| Need a space here? Oh, let's throw out this junk nobody used
| in 8 years and there we go...
| CraigRo wrote:
| It is very hard to change the overall size of the messages,
| and there's a lot of pressure to keep them short. So it
| could have been a bitfield or several similar things... e.g
| a value in a char field
| manicennui wrote:
| 9 times out of 10, I see developers making the mistakes that
| everyone seems to want to blame on non-technical people.
| There is a massive amount of software being written by people
| with a wide range of capabilities, and a large number of
| developers never master the basics. It doesn't help that some
| of the worst tools "win" and offer little protection against
| many basic mistakes.
| hinkley wrote:
| For a group who so thoroughly despises bosses that operate
| on 'blame allocation', we spend a lot of time shopping
| around for permission to engage in reckless behavior. Most
| people would call that being a hypocrite.
|
| Whereas I would call it... no, hypocrite works just fine.
| mijoharas wrote:
| At the very least make it two deploys: first actually remove the
| old code that relies on the flag, then repurpose it. It's a giant
| foot-gun to do it all in one, especially without any automated
| deploys.
| thrashh wrote:
| I agree. It doesn't matter if you give an inexperienced person
| a hammer or a saw -- they'll still screw it up.
|
| My biggest pet peeve is that NO ONE ever does failure modeling.
|
| I swear everyone builds things assuming it will work perfectly.
| Then when you mention if one part fails, it will completely
| bring down everything, they'll say that it's a 1 in a million
| chance. Yeah, the problem isn't that it's unlikely, it's that
| when it does happen, you've perfectly designed your system to
| destroy itself.
| dmurray wrote:
| > The blame here lies squarely with the developers--the code
| was written in a non-backwards-compatible way.
|
| The blame completely lies with the risk management team.
|
| The market knew there was a terrible problem, Knight knew there
| was a problem, yet it took 45 minutes of trying various
| hotfixes before they ceased trading. Either because they didn't
| have a kill switch, or because no one was empowered to pull the
| kill switch because of the opportunity cost (perhaps pulling
| the switch at the wrong time costs $500k in opportunity).
|
| I worked for a competitor to Knight at the time, and we
| deployed terrible bugs to production all the time, and during
| post mortems we couldn't fathom the same thing happening to us.
| A dozen automated systems would have kicked in to stop
| individual trades, and any senior trader or operations person
| could have got a kill switch pulled with 60 seconds of
| dialogue, and not feared the repercussions. Actually, we made
| way less of Knight's $400m than we could have because our risk
| systems kept shutting strategies down because what was
| happening was "too good to be true".
| mpeg wrote:
| It's nice to see your perspective as someone familiar with
| better systems.
|
| I have always found this story fascinating; in my junior days
| I worked at a relatively big adtech platform (ie billions of
| impressions per day) and as cowboy as we were about lots of
| things, all our systems always had kill switches that could
| stop spending money and I could have pulled them with minimal
| red tape if I suspected something was wrong.
|
| And this was for a platform where our max loss for an hour
| would have hurt but not killed the business (maybe a six
| figure loss), I can't imagine not having layers of risk
| management systems in HFT software.
| [deleted]
| moeris wrote:
| Automated deployments would have allowed you to review the
| deployment before it happened. A failed deployment could be
| configured to allow automatic rollbacks. Automated deployments
| should also handle experiment flags, which could have been
| toggled to reduce impact. There are a bunch of places where it
| could have intervened and mitigated/prevented this whole
| situation.
| stusmall wrote:
| I think the big improvement would be consistency. Either all
| servers would be correct or all servers would be incorrect. The
| step where "Since they were unable to determine what was
| causing the erroneous orders they reacted by uninstalling the
| new code from the servers it was deployed to correctly"
| wouldn't have had a negative impact. They could have even
| instantly rolled back. Also if they were using the same
| automated deployment processes for their test environment they
| might have even caught this in QA.
| rcpt wrote:
| It seems like the kind of thing that would be canaried which is
| the kind of thing that you'd typically build alongside
| automated deployment
| jokoon wrote:
| Oh no
|
| Anyways
| m3kw9 wrote:
| Someone missed a blind spot
| gumby wrote:
| > (why code that had been dead for 8-years was still present in
| the code base is a mystery, but that's not the point).
|
| Actually it's a big part of the point: they have a system that
| works with dead code in it. If you remove that dead code perhaps
| it unwittingly breaks something else.
|
| That kind of Chesterton's fence is a good practice.
| rkuykendall-com wrote:
| Leaving dead code in is not good practice?? I would love more
| explanation here because that sounds like crazy talk to me.
| gumby wrote:
| You'll have to ask the author of the article.
| fphhotchips wrote:
| Your original comment is somewhat unclear. Are you
| advocating for leaving old code in because the system works
| and it's more stable that way, or taking it out to force
| the necessary refactoring steps and understanding that will
| bring?
| gumby wrote:
| I'm sorry I wasn't clear: I re-read my comment and
| couldn't think of a decent edit.
|
| It was the author whom I was quoting as saying "why would
| someone have old code lying around." It seems obvious why
| that's a good idea and it seems commenters in this thread
| (including you) agree with me and not the author.
|
| Sorry again if I was unclear.
| [deleted]
| lionkor wrote:
| It may not be obvious that it's dead code - in a lot of
| popular interpreted languages, it's impossible to tell if a
| given function can be called or not
| OtherShrezzing wrote:
| Chesterton's Fence states that you shouldn't make a change
| until you understand something's current state. Removing code
| because it's dead is folly, if you don't understand 1) why
| it's there, and 2) why nobody else removed it yet.
| meiraleal wrote:
| As this is a postmortem, it was proven dead code. There is
| nothing in the text that mentions that they didn't know
| what the code did (which then wouldn't be dead code).
| [deleted]
| motoboi wrote:
| Changes we make to software and hardware infrastructure are
| essentially hypotheses. They're backed by evidence suggesting
| that these modifications will achieve our intended objectives.
|
| What's crucial is to assess how accurately your hypothesis
| reflects the reality once it's been implemented. Above all, it's
| important to establish an instance that would definitively
| disprove your hypothesis - an event that wouldn't occur if your
| hypothesis holds true.
|
| Harnessing this viewpoint can help you sidestep a multitude of
| issues.
| rvz wrote:
| But ChatGPT would have fixed the issue faster in 45 mins than a
| human would. /s
|
| A high-risk situation like this makes using LLMs a non-option;
| I say this before someone puts out a 'use case' for an LLM to
| fix this issue.
|
| I'm sorry to preempt the thought of this in advance, but it would
| not.
| uxp8u61q wrote:
| Who are you replying to? Nobody but you talked about chatbots
| in this thread. Are you talking to yourself?
| bsagdiyev wrote:
| No they're preempting someone coming along and claiming this.
| Haven't seen it in the replies yet but there's typically one
| (or a lot in some cases) person(s) claiming ChatGPT will
| bring Jesus back from the dead sort of thing.
| gumballindie wrote:
| > Had Knight implemented an automated deployment system -
| complete with configuration, deployment and test automation - the
| error that cause the Knightmare would have been avoided.
|
| Would it have been avoided though? Configuration, deployment and
| test automation mean nothing if they don't do what they are
| supposed to do. Regardless of how many tests you have, if you
| don't test for the right stuff it's all useless.
| firesteelrain wrote:
| Automation is not a silver bullet. Automation is still designed
| by humans. Peer reviews, acceptance test procedures, promotion
| procedures, etc all would have helped. And yes some of those
| things are manual. Sandbox environments, etc
| dkarl wrote:
| > why code that had been dead for 8-years was still present in
| the code base is a mystery, but that's not the point
|
| It's not the worst mistake in the story, but it's not "not the
| point." A proactive approach to pruning dead functionality would
| have resulted in a less complex, better-understood piece of
| software with less potential to go haywire. Driving relentlessly
| forward without doing this kind of maintenance work is a risk,
| calculated or otherwise.
| 40yearoldman wrote:
| lol. No. Deployments were not the issue. At any given time an
| automated deployment system could have had a mistake introduced
| that resulted in bad code being sent to the system. It does not
| matter if it was old or new code. Any code could have had this
| bug.
|
| The issue was one that I see often. Firstly, no visibility
| into the system. Not even a dashboard showing the software's
| running version. How often I see people ship software without
| a banner posting its version and/or an endpoint that simply
| reports the version.
|
| Secondly no god damn kill switch. You are working with money!!
| Shutting down has to be an option.
| [deleted]
| 40yearoldman wrote:
| Oh god. I just realized this is a PM. A blight on software
| engineering. People who play technical, and "take the
| requirements from the customer to the engineer". What's worse
| is when they play engineer too.
| INTPenis wrote:
| I mean it makes no sense, without even reading the article,
| just by working in IT I can tell you that if you're one
| deployment away from being bankrupt then you're either doing it
| wrong, or in the wrong business.
| foota wrote:
| The real issue here (sorry for true Scotsman-ing) is that they
| were using an untested combination of configuration and binary
| release. Configuration and binaries can be rolled out in
| lockstep, preventing this class of issues.
|
| Of course there were other mistakes here etc., but the issue
| wouldn't have been possible if this weren't the case.
| nickdothutton wrote:
| "The code that that was updated repurposed an old flag..." Was as
| far as I needed to read. Never do this.
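The safer alternative is to retire the old flag loudly and introduce a fresh one, so any stale server fails fast instead of silently reactivating dead code. A sketch (the flag names are invented for illustration):

```python
# Flags whose old meaning is dead code; setting them is now an error.
RETIRED_FLAGS = {"power_peg"}

def parse_flags(flags: set) -> set:
    """Fail fast if any retired flag is set, instead of reusing it."""
    stale = flags & RETIRED_FLAGS
    if stale:
        raise ValueError(f"retired flag(s) set: {sorted(stale)}")
    return flags
```

Reserving a retired flag's name forever is cheap; debugging what an old binary does when the flag's meaning has changed underneath it is not.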
| valdiorn wrote:
| Literally everyone in quant finance knows about Knight Capital.
| It even has its own phrase, "pulling a Knight Capital" (meaning:
| cutting corners on mission-critical systems, even ones that can
| bankrupt the company in an instant, and experiencing the
| consequences).
| shric wrote:
| Indeed, it's used in onboarding material at my employer.
| mxz3000 wrote:
| Yeap, it's used as a case study for us as to the worst case
| scenario in trading incidents. Definitely humbling.
| supportengineer wrote:
| They were missing any kind of risk mitigation steps in their
| deployment practice.
| hyperhello wrote:
| There's no money for that.
| earnesti wrote:
| It is funny, but in one company I worked for, the more
| people they added the more they neglected the basics, such as
| backups. There were heavy processes for many things and they
| were followed very well, but for whatever reason some really
| basic things went unnoticed for many years.
| Gibbon1 wrote:
| I think Goldman Sachs or someone big like that had a similar
| oopsie. And what happened was the exchange reversed all their
| bad trades.
| dang wrote:
| Related:
|
| _Knightmare: A DevOps Cautionary Tale (2014)_ -
| https://news.ycombinator.com/item?id=22250847 - Feb 2020 (33
| comments)
|
| _Knightmare: A DevOps Cautionary Tale (2014)_ -
| https://news.ycombinator.com/item?id=8994701 - Feb 2015 (85
| comments)
|
| _Knightmare: A DevOps Cautionary Tale_ -
| https://news.ycombinator.com/item?id=7652036 - April 2014 (60
| comments)
|
| Also:
|
| _The $440M software error at Knight Capital (2019)_ -
| https://news.ycombinator.com/item?id=31239033 - May 2022 (172
| comments)
|
| _Bugs in trading software cost Knight Capital $440M_ -
| https://news.ycombinator.com/item?id=4329495 - Aug 2012 (1
| comment)
|
| _Knight Capital Says Trading Glitch Cost It $440 Million_ -
| https://news.ycombinator.com/item?id=4329101 - Aug 2012 (90
| comments)
|
| Others?
| taspeotis wrote:
| Needs (2014) in the title.
| [deleted]
| realreality wrote:
| The moral of the story is: don't engage in dubious practices like
| high speed trading.
| tacker2000 wrote:
| How is high speed trading any more dubious than long term
| holding, or shorting, etc?
| realreality wrote:
| It's all dubious.
| alphanullmeric wrote:
| good to know that you don't consent to what other people do
| with their money.
| dexwiz wrote:
| They were market makers, which is different. They help ensure
| that when you push sell on E*Trade you actually get a price
| somewhat close to your order in relatively short time. No need
| to call up a broker who will route the order to a guy shouting
| on the floor.
| eddtests wrote:
| And remove easy/quick liquidity for the rest of the market?
|
| Edit: downvotes, any reason why? Or just HFT == Bad?
| [deleted]
| realreality wrote:
| "The market" shouldn't even exist.
| daft_pink wrote:
| I'm so glad I don't write code that automatically routes millions
| of dollars with no human intervention.
|
| It's like writing code that flies a jumbo jet.
|
| Who wants that kind of responsibility.
| hgomersall wrote:
| I'm so glad I'm not wasting my life working in finance.
| shric wrote:
| I've worked in various small to medium IT companies, a FAANG
| and another fortune 500 tech company. 6 months ago I moved to
| a proprietary trading company/market maker and it's the most
| interesting and satisfying place I've worked so far.
|
| I hope to continue to "waste my life" for many years to come.
| hammeiam wrote:
| May I ask which one, and what your process was that led you
| to them?
| envsubst wrote:
| I'm sure all your jobs have contributed to the well being of
| humanity.
| goldinfra wrote:
| Most jobs do in fact contribute to the well being of
| humanity, however little. It's few jobs, like most in
| financial trading, that actively reduce the well being of
| humanity.
|
| Never will you meet a more self-deluded and pathetic set of
| humans. Desperate money addicts that often become other
| kinds of addicts. Whole thing should be abolished.
|
| Source: I worked in finance when I was young and dumb.
| yieldcrv wrote:
| I, for one, am so glad to "own" the "compose" button at a
| democracy destabilizing adtech-conglomerate
| meiraleal wrote:
| > Most jobs do in fact contribute to the well being of
| humanity, however little.
|
| No, they don't. A lot of jobs hold us back, actually.
| Salespeople selling things people don't wanna buy,
| finance and tech bros vampirizing third world countries
| without the safeguards that western countries have on
| their capital markets, etc.
| m463 wrote:
| > It's like writing code that flies a jumbo jet.
|
| and upgrading it from a coach seat
| wruza wrote:
| It feels anxiety-inducing at first, but if you have good
| controls and monitoring in place, it becomes daily routine. You
| basically address the weak points as they surface, and the more
| reasonably anxious you are, the better for the business. From
| my experience with finance, I'd wager the problem at Knight
| was 10% tech issues, 90% a CTO-ish person feeling ballsy. In
| general over time, not just on that particular day or week.
| Waterluvian wrote:
| It's not scary when it's done properly. And done properly can
| look like an incredibly tedious job. I think it's for a certain
| kind of person who loves the process and the tests and the
| simulators and the redundancy. Where only 1% of the engineering
| is the code that flies the plane.
| callalex wrote:
| It's fine to have that kind of responsibility, but it has to
| actually be your responsibility. Which means you have to be
| empowered to say "no, we aren't shipping this until XYZ is
| fixed" even if XYZ will take another two years to build and the
| boss wants to ship tomorrow.
| salawat wrote:
| Yep. Until the capacity to say unoverridably "No"
| materializes, there's a lot of code I refuse to have
| responsibility for delegated to me.
| wruza wrote:
| As someone who doesn't take the profits, what responsibility
| can a worker even have? Realistically it lies in the range of
| their monthly paycheck and pending bonuses, and in a moral
| obligation to operate a failing system until it lands somewhere.
| Everything above that is a systemic risk for the profit taker,
| which, if left unaddressed, is absolutely on them. There's no
| way you can take responsibility for $400M unless you have
| that money.
| codegeek wrote:
| I refuse to believe that a failed deployment can bring a company
| down. That is just a symptom. The root cause has to be a whole
| big collection of decisions and processes/systems built over
| years.
___________________________________________________________________
(page generated 2023-09-10 23:00 UTC)