[HN Gopher] The Knight Capital Disaster
___________________________________________________________________
The Knight Capital Disaster
Author : bo0tzz
Score : 142 points
Date : 2023-11-28 12:55 UTC (10 hours ago)
(HTM) web link (specbranch.com)
(TXT) w3m dump (specbranch.com)
| jjoonathan wrote:
| > the flag word was out of new bits for flags, so an engineer
| reused a bit from a deprecated flag
|
| Plus some silent update failures meant that new-feature orders
| sent to out-of-date servers transformed into old-feature orders.
| Boom!
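A minimal sketch of the failure mode described above. The flag names
and bit positions are made up for illustration (the thread does not
give Knight's real message format): the same flag word decodes to a
different feature depending on which release the receiving server runs.

```python
# Hypothetical flag tables. Bit 0x02 once meant OLD_FEATURE; after the
# old feature was deprecated, the bit was reused for NEW_FEATURE.
OLD_RELEASE_FLAGS = {0x01: "UNRELATED_FLAG", 0x02: "OLD_FEATURE"}
NEW_RELEASE_FLAGS = {0x01: "UNRELATED_FLAG", 0x02: "NEW_FEATURE"}


def decode(flag_word, table):
    """Return the names of all flags set in flag_word, per the given table."""
    return [name for bit, name in sorted(table.items()) if flag_word & bit]


# An order tagged for the new feature...
order_flags = 0x02
# ...means one thing on an updated server and another on a stale one.
on_updated_server = decode(order_flags, NEW_RELEASE_FLAGS)
on_stale_server = decode(order_flags, OLD_RELEASE_FLAGS)
```

Nothing in the flag word itself tells the receiver which table the
sender had in mind, which is why the stale servers could not notice.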
|
| > Adding risk checks to the last stage of an order's life became
| universal in the industry
|
| I wonder what these "fast-twitch" sanity checks / circuit
| breakers look like. Whenever I try to model risk, things get
| complicated quickly -- but presumably simple heuristics must
| exist if they became universal in the industry.
| kmeisthax wrote:
| It could be as simple as...
|
| - Do we have accounting for what trading strategy generated
| this order?
|
| - Will this order immediately lose us money? (e.g. are we
| buying out-of-the-money options)
|
| - Did we accidentally set the PhysicallyDeliver flag?
|
| - Have we hit our organization's margin limits?
|
| Any situation that no reasonable trading strategy would put you
| in, or that would otherwise be outright illegal, is a good
| thing to put in risk checks for.
| twic wrote:
| What became universal in the industry is an item on the
| checklist saying that you have safety checks to prevent this.
| How seriously that item is taken, I suspect, varies quite a
| bit.
|
| In my experience, you aim for multiple extremely simple checks,
| with minimal logic and minimal calculation, so you can have
| confidence they won't have surprising behaviour in an unusual
| situation like this.
|
| The classic example is an order count limit - initialise a
| counter to some value at startup, and every time the machine
| sends an order, try to decrement it. When it hits zero, it
| can't send orders any more. Just throw an exception or return
| early or something. You display the value of the counter to
| human operators, and give them a button which resets it to the
| initial value. In normal operation, you are sending orders at a
| steady trickle, and humans will have to press the button every
| now and then. If something goes insanely wrong, as here, the
| counter will run down quickly, and then the humans hopefully
| won't push the button, because something is obviously wrong.
| It's a very crude safety, but it is a simple one.
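The order count limit described above could be sketched roughly as
follows. The class and function names are hypothetical, and a real
system would also need thread safety and an operator display, which
are left out here.

```python
class OrderBudget:
    """Counts down once per order and refuses to send when spent."""

    def __init__(self, initial):
        self.initial = initial
        self.remaining = initial  # the value shown to human operators

    def try_consume(self):
        """Use one unit of budget; return False if none is left."""
        if self.remaining <= 0:
            return False
        self.remaining -= 1
        return True

    def reset(self):
        """The operator's button: restore the initial value."""
        self.remaining = self.initial


def send_order(budget, order, transport):
    """Send via transport only if the budget allows it (return early otherwise)."""
    if not budget.try_consume():
        return False
    transport(order)
    return True
```

The check is a single comparison and a decrement, so there is very
little that can behave surprisingly in an unusual situation.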
|
| Another is a limit on message rate. You could use a token
| bucket filter. Does not affect normal operation, but stops a
| machine which is spraying out excessive orders. You could have
| it so that if the bucket runs out, it turns off until a human
| explicitly turns it back on.
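A rough sketch of such a message rate limit, assuming a token bucket
that latches off once it runs dry. The names are hypothetical, and the
caller supplies the clock reading so the logic stays easy to test.

```python
class LatchingTokenBucket:
    """Token bucket that trips permanently when empty, until re-armed."""

    def __init__(self, rate_per_sec, burst, now=0.0):
        self.rate = rate_per_sec   # tokens added per second
        self.burst = burst         # maximum tokens the bucket can hold
        self.tokens = burst
        self.last = now
        self.tripped = False

    def allow(self, now):
        """Return True if one message may be sent at time `now`."""
        if self.tripped:
            return False
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens < 1.0:
            self.tripped = True    # stays off until a human re-arms it
            return False
        self.tokens -= 1.0
        return True

    def rearm(self, now):
        """Explicit human action: refill the bucket and clear the trip."""
        self.tokens = self.burst
        self.last = now
        self.tripped = False
```

In use, `now` would typically come from a monotonic clock such as
`time.monotonic()`. With a generous burst size, normal traffic never
empties the bucket, so the latch only matters when something is
spraying orders.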
|
| You have limits on net position too, to stop you running up
| huge positions in anything, but those are higher-level, and not
| quite the same kind of last-ditch safety check.
|
| I don't really know that either of these would have helped in
| Knight Capital's situation, because the precise mechanics of
| the "power peg" aren't clear. It sounds like a kind of
| explicitly-managed iceberg order, which these safeties would
| have caught. But another writeup [1] says it was a testing
| tool, not intended to be used on a real exchange at all, in
| which case who knows.
|
| [1] https://www.henricodolfing.com/2019/06/project-failure-
| case-...
| pclmulqdq wrote:
| (Author) As far as I know, Power Peg was indeed intended to
| essentially be a manual iceberg order from the time before
| that was an order type on the exchange (with slightly
| different semantics).
|
| Rereading the source you quoted, it definitely wasn't a "buy
| high sell low" system, even if it was never used in prod.
| pgwhalen wrote:
| > I wonder what these "fast-twitch" sanity checks / circuit
| breakers look like. Whenever I try to model risk, things get
| complicated quickly -- but presumably simple heuristics must
| exist if they became universal in the industry.
|
| You're right, they are very simple. Think things like orders
| per second, quantity of order, price of order, notional ordered
| over time, etc. You basically want to ensure things aren't "too
| big" or "too fast" as simply as possible.
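As an illustration of how simple such "too big" checks can be, here is
a sketch with hypothetical limit names and an assumed reference price.
It covers quantity, price and notional over a window; orders per
second would be handled by a separate rate limiter.

```python
def check_order(quantity, price, reference_price, notional_so_far,
                max_quantity, max_price_deviation, max_window_notional):
    """Return the names of the limits this order would breach.

    max_price_deviation is a fraction of reference_price (0.05 = 5%).
    notional_so_far is the notional already sent in the current window.
    """
    violations = []
    if quantity <= 0 or quantity > max_quantity:
        violations.append("quantity")
    if abs(price - reference_price) > max_price_deviation * reference_price:
        violations.append("price")
    if notional_so_far + quantity * price > max_window_notional:
        violations.append("notional")
    return violations
```

An order goes out only if the returned list is empty. Each check is
one comparison, with no model of the portfolio behind it.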
|
| Other types of risk (portfolio risk, greek risk, etc.) are
| handled in different ways, upstream of these final checks.
| reese_john wrote:
| Related discussions:
|
| Knightmare: A DevOps Cautionary Tale
|
| https://news.ycombinator.com/item?id=37459495
|
| The $440M software error at Knight Capital (2019)
|
| https://news.ycombinator.com/item?id=31239033
| neilv wrote:
| Interesting analysis. You can easily imagine each bit of
| sloppiness and oops happening in many shops. Even many instances
| of sloppiness combining for emergent worse effect isn't unusual.
| Much less common is for the company to be wiped out by it.
|
| > _At 10:15, the kill switch was flipped, stopping the company's
| trading operations for the day. By early afternoon, many of
| Knight Capital's employees had already sent out resumes,_
|
| Was the patient obviously dead in those first few hours, or had
| people written it off prematurely when they might've been called
| to help perform CPR?
|
| Maybe there's two problems here: insufficient culture of
| diligence, and insufficient culture of loyalty.
| randmeerkat wrote:
| > insufficient culture of loyalty
|
| Loyalty in tech..? Have you not seen the layoffs? Employees are
| just a line item in someone's cost center.
| tekla wrote:
| So many developers are fungible, it makes sense for everyone
| involved.
| golergka wrote:
| Quants and HFT developers are among the best-compensated
| people on Earth. I don't think that someone who's being paid
| over a million dollars a year to sit in a cubicle and write
| code should use the same rhetoric as 19th century factory
| workers.
| willdr wrote:
| Then you don't understand the relationship between capital,
| labour and being an employee vs employer. Yes, they are
| well-compensated, but they can still have their lives
| turned upside down by bosses. Granted, their life getting
| turned upside down probably means less front-row Knicks
| tickets and not choosing between paying a water bill and a
| power bill.
| nly wrote:
| Best paid on Earth in the US. In other countries we're paid
| well but about average by ordinary Valley tech standards.
| mhh__ wrote:
| A lot of quants are objectively paid very well but not
| _that_ well (at least pre layoffs) compared to some people
| who do comparatively little work in tech.
| golergka wrote:
| Good. Compensation should never be about the amount of
| work you do. It should be about your impact and market
| value.
| neilv wrote:
| I'm not necessarily blaming labor.
|
| For example: a company that needs loyalty, but then does
| things like the layoffs we've seen recently (or earlier
| behavior consistent with that thinking), is creating
| insufficient culture of loyalty.
| pclmulqdq wrote:
| I am the author, and everyone I knew who was at Knight at the
| time had no power to fix anything. After all the trades
| happened, the quants and tech people couldn't really do
| anything to unwind them. That was up to the "business" side of
| the org, who still had no power because they were essentially
| left looking for a deal to rescue a bankrupt company. Those
| people actually knew that this was a company-destroying event,
| and apparently it was a late night for them at work and they
| were also quietly talking to recruiters.
|
| At least one manager actually gave his whole team the afternoon
| off while senior management worked this out, anticipating that
| there would be no Knight by the end of the week.
| fl7305 wrote:
| In the movie WarGames, someone suggested "just turn the damn
| power off", and got a good answer on why that was "a bad
| idea"(tm).
|
| I'm kinda wondering the same thing here, and can't think of
| the reason why not?
| yellowstuff wrote:
| That's what always struck me about this story. I like to
| think that if I was in the room we would've turned the
| system off at 9:31 when we knew it was working abnormally
| but didn't know why. Instead they let it run for over an
| hour while they tried to QA it. I've heard that Knight had
| no kill switch to stop all systems from trading, which
| seems like by far the biggest mistake they made.
| pclmulqdq wrote:
| As far as I know, Knight did not have a "clean" kill
| switch, but they did have the ability to shut everything
| down. If they killed trading, that meant killing some
| processes that would screw up their accounting, which
| would mean losing the rest of the trading day.
|
| Firms I worked at after the fact tried to make sure the
| kill switch was non-disruptive, and actually did push it
| once or twice.
| KMag wrote:
| The power switches were in locked cages inside colos miles
| from the nearest Knight employee. The sysadmins could
| probably power off the boxes remotely, but the devs
| probably didn't have that access.
|
| Yes, a "big red button" kill switch was the right answer,
| but they didn't have that, and it took time to work through
| several layers of corporate bureaucracy while losing
| $150,000 per second, time they didn't have.
|
| Anyone with experience in the industry knows that,
| especially the day of/following a software release, "if in
| doubt, kill and roll back". Being out of the market for
| even a few hours is fairly easy to come back from, even
| in places where you are legally required to be in the
| market for a given percentage of the time in order to
| qualify for transaction tax breaks on your market making
| trades.
|
| That's why on that day my manager called me after hours to
| come in early the next morning and test our "big red
| button" kill switch before the market open.
| JumpCrisscross wrote:
| > _Being out of the market for even a few hours is fairly
| easy to come back from_
|
| Not if you're dynamically hedging an options book.
| KMag wrote:
| Fair enough. At Goldman, I don't recall our options and
| equities trading systems simultaneously having outages,
| so I think we could always shed risk by reducing our
| options exposure if the auto-hedger was unable to delta-
| hedge in the equities market. I'm not actually sure if
| the relative independence of the options and equity
| execution systems was intentional.
|
| I did some work with connecting the options auto-hedger
| to the equity execution system, and certainly failures on
| the delta-one side prevented increasing exposure on the
| options side. "How long can we be out of the equities
| market and still be certain of meeting our options
| market-making obligations with the Hong Kong Exchange?"
| did come up a couple of times.
|
| Depending on exactly where the outage was, there was
| potentially also the option of manually hedging the
| options book like the bad-old days. (Execution engines
| failed, but order management system and exchange
| connectivity still intact would be one such scenario.)
| fl7305 wrote:
| Thanks, good answer.
|
| I've been in other types of very stressful situations.
| Simple things like getting hold of the right person or
| the right tool can be astonishingly time consuming in a
| tight spot.
| Kranar wrote:
| Loyalty had nothing to do with this. The company lost $400
| million in about 45 minutes. That's roughly 9 million dollars a
| minute or 150 thousand dollars a second.
|
| I work as a quant and remember that morning vividly. It was as
| if trucks of free money were falling from the sky to the point
| that many of us were skeptical that these were actual trades;
| we were convinced that this was some kind of bug at the
| exchange and these trades would get broken.
|
| Mistakes like this do happen, in fact it's not thaaaat
| uncommon. What made this situation so unusual is that it just
| kept going and going.
| mhh__ wrote:
| > Maybe there's two problems here: insufficient culture of
| diligence
|
| I know nothing about Knight but I will say that traditional
| finance often has no ability to think about systems
| particularly well so the local optimum is usually a rickety
| shack that has a lot of smart people working on it, and some
| less smart people making sure problems are being _seen_ to be
| fixed, but the overall architecture can be extremely brittle or
| non-existent. It's not like (good) tech companies.
| kunwon1 wrote:
| This bit at the bottom was gratifying to read:
|
| > As of 2016, the engineer who did the update still worked at KCG.
| His entire management chain had been replaced, all the way up to
| the CTO.
| blantonl wrote:
| The amount of experience this engineer has is invaluable.
| jameshart wrote:
| It cost $440m. Unclear if it was valued as much.
| brazzledazzle wrote:
| That's a refreshing reversal from what you usually hear about.
| You can bet this engineer will never do anything like that
| again and will likely lead the way toward implementing
| comprehensive and effective safety mechanisms.
| paxys wrote:
| That sentence isn't as impactful as it sounds. There aren't
| very many engineers out there whose reporting chain remains
| intact over the better part of a decade, disaster or no
| disaster.
| pclmulqdq wrote:
| Author here, sorry - it should read that his management chain
| was replaced (resigned or fired) within the week.
| CobrastanJorji wrote:
| I see the "nanoseconds counted so they didn't use protobufs"
| note, but in case you do use protobufs and want to make sure this
| never happens to you, I heartily recommend using the "reserved"
| keyword in your protos whenever you remove a field. Reserving a
| number is a note to the proto compiler that says "I will not use
| this number again, and please generate an error if I foolishly
| later try."
|
| It's a useful and probably under-used feature.
| scottshamus wrote:
| If anyone reading this hasn't heard this before, I encourage
| you to read the do's/dont's of protobufs for some other good
| best practices.
|
| https://protobuf.dev/programming-guides/dos-donts/
| GauntletWizard wrote:
| There are several protobuf linters out there, but what I
| haven't seen is a protobuf linter that integrates with Git to
| look behind and verify that you haven't accidentally changed
| protobuf numbers, or reused fields, etc. Would be handy.
| orbz wrote:
| I believe buf's breaking change detection will pick up on
| this. It also allows you to specify a git ref to target.
|
| https://buf.build/docs/reference/cli/buf/breaking#against
| blowski wrote:
| HN conversation from the time.
|
| https://news.ycombinator.com/item?id=4329101
| _boffin_ wrote:
| Posted in a prior thread, but here's a trade-by-trade recap
|
| Original Thread: https://news.ycombinator.com/item?id=37459495
| ---
|
| If you want to see what it looked like at the tick scale, take a
| look here: http://www.nanex.net/aqck2/3522.html P.S. Does anyone
| know of any other sites or places that do a comparable level of
| research that's open to the public?
| upbeat_general wrote:
| In my opinion the root cause is pretty clear: they had a network
| protocol update that was not backwards compatible and didn't
| verify the runtime versions or have versioning.
|
| Everything else isn't really core to the issue [even that the
| update script silently failed].
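One way to read the point above as code: a receiver that rejects any
message whose protocol version it does not recognise, rather than
interpreting it under its own rules. The field name
`protocol_version` and the version numbers are hypothetical.

```python
class VersionMismatch(Exception):
    """Raised when a message's protocol version is not supported."""


def accept_message(message, supported_versions):
    """Return the message if its version is supported, else raise."""
    version = message.get("protocol_version")
    if version not in supported_versions:
        raise VersionMismatch(
            "got version %r, supported: %s"
            % (version, sorted(supported_versions))
        )
    return message


def try_accept(message, supported_versions):
    """Convenience wrapper: True if accepted, False if rejected."""
    try:
        accept_message(message, supported_versions)
    except VersionMismatch:
        return False
    return True
```

Under this sketch, a stale server that only knows version 1 refuses a
version 2 order outright instead of treating it as an old-style order.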
| lmm wrote:
| Mistakes happen. Bad processes and bad governance are what turn
| mistakes into disasters.
| rwmj wrote:
| I suspect the main effect of "modern practices" is to make it
| much easier to make mistakes at scale. Maybe not the exact same
| mistake as here, but some other one.
|
| By the way, have we heard from Google about how they managed to
| roll back 6 months of customer data yesterday?
| ShakataGaNai wrote:
| They say that in aviation, safety regulations are written in blood.
| While no one died in this event, it's clear that the metaphorical
| corporate blood spilled probably did wonders to help a lot of
| other groups.
| MichaelRo wrote:
| Reminds me of this blunder: a trader accidentally switches places
| for price and quantity and, instead of selling 1 contract at
| ¥610,000, manages to send an order to sell 610,000 contracts at
| ¥1. The order passes through the GUI, limits checker and several
| dozen systems like a knife through cheese and is placed on the
| exchange. The exchange happily accepts the order and mayhem
| ensues.
|
| If all were executed that would be more than a $3B (billion!)
| loss, heck almost 4 billion. Eventually the company settles for
| about $300M (millions).
|
| So Knight Capital isn't alone in this "hall of fame" :)
|
| Here's the story: https://www.cbsnews.com/news/stock-trade-typo-
| costs-firm-225...
| artursapek wrote:
| I implemented a slippage warning system in a trading GUI I was
| in charge of after exactly this scenario happened once: a
| trader switching price and quantity and temporarily cratering a
| market. It would show a second order confirmation screen if
| your order was going to fill with high slippage, and it made
| you type the words "SHOOT ME" into a text field to send the
| order. After we had built it, it seemed so obvious to have that
| kind of sanity checking.
|
| It makes even more sense for the matching engine to disallow
| this on the back end, though.
| yellowstuff wrote:
| I believe that Mizuho had a similar check in place. This
| stock was an IPO so there was no last price to compare
| against.
| artursapek wrote:
| It shouldn't require price history. You just need the order
| book and you can simulate the execution of any order, and
| figure out its average price.
|
| If the average price is X basis points worse than the
| current top of book, that's slippage. So eg if highest bid
| in book is $100 and you are entering a sell order that
| would eat so much of the book that it would fill with an
| average price of $70, that's 30% slippage and probably not
| what you meant to do.
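The calculation described above can be sketched like this for a sell
order walking down the bids. The book layout (a list of
`(price, size)` pairs, best bid first) and the function names are
assumptions made for illustration.

```python
def average_fill_price(bids, quantity):
    """Simulate selling `quantity` into `bids` (best bid first).

    Returns the average fill price, or None if the book is too thin
    to absorb the whole order.
    """
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    remaining = quantity
    proceeds = 0.0
    for price, size in bids:
        take = min(remaining, size)
        proceeds += take * price
        remaining -= take
        if remaining == 0:
            return proceeds / quantity
    return None


def sell_slippage(bids, quantity):
    """Fraction by which the average fill is worse than the top of book."""
    average = average_fill_price(bids, quantity)
    if average is None:
        return None
    best_bid = bids[0][0]
    return (best_bid - average) / best_bid


# Best bid is $100; selling 30 eats three levels for an average of $70.
book = [(100.0, 10), (70.0, 10), (40.0, 10)]
slippage = sell_slippage(book, 30)
```

With this book the slippage comes out at 30%, matching the example in
the comment. A GUI could show its second confirmation screen whenever
the value exceeds some threshold, or whenever it is `None` because the
book cannot absorb the order at all.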
| gpderetta wrote:
| You need to have a book in the first place though. If the
| instrument is highly illiquid the spread might be huge
| and the price have little real world relevance.
|
| For liquid books, yes definitely I would expect these
| sort of checks (typically against historical prices) to
| be in place.
| kstrauser wrote:
| On the opposite end, sometimes the traders don't _want_ that
| kind of reminder: https://thedailywtf.com/articles/special-
| delivery
|
| Relevant quote:
|
| > As the senior trader at AExecor, Brad made it very clear
| that no one -- "not even His Holiness, the Pope" -- shall
| question his trades. After all, Brad makes complex trading
| decisions that no one else could possibly comprehend.
| gosub100 wrote:
| I worked on a compliance add-on that blocked institutional
| traders based on rules they set for themselves. All day long
| was spent fielding their urgent support requests complaining
| that the rule was wrong. Massive amounts of time were spent
| finding the market data or computing their account value _at
| the specific time_ the trade was blocked, 90% of the time
| arriving at the conclusion that the product worked as
| intended. I complained that I wasn't getting to code enough,
| I was told to "code on the train [while riding to work]".
|
| One time, a client wanted a rule to block a trade if the
| price exceeded the daily high/low. Well, when you define the
| limit this way, you pretty much can't trade right at the
| opening because many/most price ticks _are_ the
| highest/lowest of the day SO FAR, because the day is only a few
| seconds old! Customer had a meltdown, we traced the market
| data back, and realized yeah, it worked exactly as designed.
| sigh
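To see why that rule bites at the open, here is a small sketch. It
assumes the rule compares each tick against the day's high and low so
far, and that the first tick of the day merely establishes the range;
the actual product's behaviour is not specified in the comment.

```python
def blocked_by_daily_range(ticks):
    """For each tick, report whether it falls outside the day's range so far."""
    results = []
    high = low = None
    for price in ticks:
        if high is None:
            blocked = False          # first tick only sets the range
            high = low = price
        else:
            blocked = price > high or price < low
            high = max(high, price)
            low = min(low, price)
        results.append(blocked)
    return results


# Hypothetical opening ticks: half of them set a new high or low.
opening = [100.0, 100.2, 100.1, 99.9, 100.3, 100.0]
flags = blocked_by_daily_range(opening)
```

Right after the open the range is only a few ticks wide, so almost any
movement lands outside it and gets blocked, exactly as designed.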
| dang wrote:
| Related. Others?
|
| _Counterfactual Thinking, Rules, and the Knight Capital Accident
| (2013)_ - https://news.ycombinator.com/item?id=37472422 - Sept
| 2023 (1 comment)
|
| _Knightmare: A DevOps Cautionary Tale (2014)_ -
| https://news.ycombinator.com/item?id=37459495 - Sept 2023 (275
| comments)
|
| _The $440M software error at Knight Capital (2019)_ -
| https://news.ycombinator.com/item?id=31239033 - May 2022 (172
| comments)
|
| _Knightmare: A DevOps Cautionary Tale (2014)_ -
| https://news.ycombinator.com/item?id=22250847 - Feb 2020 (33
| comments)
|
| _Knightmare: A DevOps Cautionary Tale (2014)_ -
| https://news.ycombinator.com/item?id=8994701 - Feb 2015 (85
| comments)
|
| _Knightmare: A DevOps Cautionary Tale_ -
| https://news.ycombinator.com/item?id=7652036 - April 2014 (60
| comments)
|
| _Knight Capital Says Trading Glitch Cost It $440 Million_ -
| https://news.ycombinator.com/item?id=4329101 - Aug 2012 (91
| comments)
| fintechie wrote:
| This story now looks like a minor incident if you compare it to
| FTX.
| KMag wrote:
| I was working on trading systems at Goldman in NYC at the time.
| After hours on that day, I got a call from my manager to come in
| early the next day to ensure our kill switches worked properly,
| that our release and review processes were sufficient, and that
| our monitoring systems were sufficient.
|
| A few years later, I was working on trading systems at Goldman in
| Hong Kong. I sent a change out for review, went out for dinner
| and drinks with a colleague visiting from Tokyo, and swung by the
| office on my way home. My change had been approved by my NYC
| colleagues, so I merged it and went to bed. The next morning, I
| woke up to news that Goldman had a 100 million USD trading loss
| due to a software bug.
|
| Edit: This was in Goldman's Slang language, where source code is
| loaded from a globally-distributed eventually-consistent NoSQL
| DB. Most applications execute from read-only DB snapshots after
| extensive release testing. However, as soon as you merge your
| change, it's potentially instantly running in production
| somewhere in the world by some team you might not even know
| exists. It was possible my merge, maybe 45 minutes before the NYC
| market open, had gotten picked up by the errant system.
|
| I spent a while convincing myself that there was no way my change
| was the cause, and realized my phone would be ringing off the
| hook had my change been the cause.
|
| The guy who made the software change (let's call him Zaphod
| Beeblebrox since that's clearly not his name), and the guy who
| approved it, were both put on leave before I woke up. I found out
| who made the change only because I had an open chat window with
| Zaphod, and through several rounds of "fifth quartile" annual
| layoffs, knew how the chat and email systems responded when
| accounts got locked out. The chat system showed Zaphod's location
| as unknown, and a test email to him came back with the "mailbox
| full" message for a locked account. I walked over to the desk of
| one of the senior Equity Options Flow Strats in Hong Kong, and
| whispered "So... Zaphod Beeblebrox", and the Strat's face lit up
| and he whispered back "How did you know?", to which I responded
| "I didn't until I saw your reaction".
|
| The guy who made the 100 million mistake was actually very very
| good at his job. He caught quite a few subtle bugs in other
| people's code that he wasn't even asked to review, but was
| reviewing out of curiosity. But, he was working late under time
| pressure, didn't test his change properly, and you only have to
| slip up once.
|
| As I remember, many of the trades were broken by the exchange,
| and the total loss came out to about 28 million USD.
|
| On the one hand, the guy didn't deserve to get fired, and I'd
| totally hire him for my team. On the other hand, if someone cuts
| corners and that results in tens of millions of USD in losses and
| doesn't get fired, that's very demotivating for everyone else at
| the firm. They did a very good job about not naming and shaming.
|
| After waking up to being momentarily scared I had made a 100
| million USD error, I don't merge changes after-hours any more,
| and certainly never after having consumed any alcohol. If a guy
| like Zaphod can lose 28 million USD from a tired merge, so can I.
|
| Zaphod, if you're reading this and ever looking for a job, give
| me a ring.
| yellowstuff wrote:
| Great story. The part that sticks out to me is 72% (the
| discount that GS got due to busted trades) and 0% (the discount
| Knight got due to busted trades.) If you're going to eff up,
| first make sure you're a big player!
| KMag wrote:
| In Goldman's case, Goldman was literally sending out options
| orders with an ask price of $0. I don't recall if it was the
| exchange or regulators that decided "If you bought below $x,
| you knew you were trading against a broken algorithm and
| shouldn't have expected the trade to last".
|
| I'm not sure if any of the orders Knight was sending out were
| clearly so erroneous. It's also possible that it was only after
| the Knight incident that it was made clear to market
| participants that they should expect clearly erroneous trades
| to be broken.
|
| In any case, Knight was a major liquidity provider, and it
| wasn't in the market's best interest for them to go bankrupt,
| but it also sets a very bad precedent if plausible orders get
| broken.
| pclmulqdq wrote:
| As far as I know, Knight was unique in that its orders were
| obviously stupid, but not obviously mispriced or mis-sized.
| It's not a case of a clear fat-finger error that would be
| visible to other market participants.
|
| I know that in many markets, the exchange will reverse your
| trades if the counterparty made an obvious, visible error
| (eg $0 ask price on a limit order).
| esotericimpl wrote:
| The best part of these stories is that nothing of value was ever
| created or lost.
|
| Just a shell game in a casino. Previous generations built
| rockets with their best and brightest; now we build ad exchanges
| and high-frequency trading bots.
___________________________________________________________________
(page generated 2023-11-28 23:00 UTC)