[HN Gopher] The Knight Capital Disaster
       ___________________________________________________________________
        
       The Knight Capital Disaster
        
       Author : bo0tzz
       Score  : 142 points
       Date   : 2023-11-28 12:55 UTC (10 hours ago)
        
 (HTM) web link (specbranch.com)
 (TXT) w3m dump (specbranch.com)
        
       | jjoonathan wrote:
       | > the flag word was out of new bits for flags, so an engineer
       | reused a bit from a deprecated flag
       | 
       | Plus some silent update failures meant that new-feature orders
       | sent to out-of-date servers transformed into old-feature orders.
       | Boom!
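       | 
       | A minimal sketch of that failure mode, assuming a made-up bit
       | layout (the constant and function names are hypothetical, not
       | Knight's actual code): the same bit means different things to a
       | server that got the update and one that did not.

```python
# Hypothetical bit layout; names and positions are illustrative only.
FLAG_OLD_FEATURE = 1 << 3   # original meaning of bit 3 (deprecated feature)
FLAG_NEW_FEATURE = 1 << 3   # meaning of bit 3 after the bit was reused


def old_server_handle(flags: int) -> str:
    """Server that missed the update: bit 3 still selects the old feature."""
    return "old-feature order" if flags & FLAG_OLD_FEATURE else "plain order"


def new_server_handle(flags: int) -> str:
    """Updated server: bit 3 now selects the new feature."""
    return "new-feature order" if flags & FLAG_NEW_FEATURE else "plain order"


order_flags = FLAG_NEW_FEATURE          # the sender intends the new feature
print(new_server_handle(order_flags))   # new-feature order
print(old_server_handle(order_flags))   # old-feature order
```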
       | 
       | > Adding risk checks to the last stage of an order's life became
       | universal in the industry
       | 
       | I wonder what these "fast-twitch" sanity checks / circuit
       | breakers look like. Whenever I try to model risk, things get
       | complicated quickly -- but presumably simple heuristics must
       | exist if they became universal in the industry.
        
         | kmeisthax wrote:
         | It could be as simple as...
         | 
         | - Do we have accounting for what trading strategy generated
         | this order?
         | 
         | - Will this order immediately lose us money? (e.g. are we
         | buying out-of-the-money options)
         | 
         | - Did we accidentally set the PhysicallyDeliver flag?
         | 
         | - Have we hit our organization's margin limits?
         | 
         | Any situation that no reasonable trading strategy would put you
         | in, or that would otherwise be outright illegal, is a good
         | thing to put in risk checks for.
        
         | twic wrote:
         | What became universal in the industry is an item on the
         | checklist saying that you have safety checks to prevent this.
         | How seriously that item is taken, i suspect, varies quite a
         | bit.
         | 
         | In my experience, you aim for multiple extremely simple checks,
         | with minimal logic and minimal calculation, so you can have
         | confidence they won't have surprising behaviour in an unusual
         | situation like this.
         | 
         | The classic example is an order count limit - initialise a
         | counter to some value at startup, and every time the machine
         | sends an order, try to decrement it. When it hits zero, it
         | can't send orders any more. Just throw an exception or return
         | early or something. You display the value of the counter to
         | human operators, and give them a button which resets it to the
         | initial value. In normal operation, you are sending orders at a
         | steady trickle, and humans will have to press the button every
         | now and then. If something goes insanely wrong, as here, the
         | counter will run down quickly, and then the humans hopefully
         | won't push the button, because something is obviously wrong.
         | It's a very crude safety, but it is a simple one.
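         | 
         | A minimal sketch of that counter, assuming a single-threaded
         | sender (the class and method names are made up for
         | illustration):

```python
class OrderCountLimit:
    """Crude last-ditch safety: a budget of orders only a human can refill."""

    def __init__(self, initial: int) -> None:
        self.initial = initial
        self.remaining = initial  # the value displayed to human operators

    def try_send(self) -> bool:
        """Consume one unit of budget; return False once the budget is spent."""
        if self.remaining <= 0:
            return False
        self.remaining -= 1
        return True

    def reset(self) -> None:
        """The operator's button: restore the counter to its initial value."""
        self.remaining = self.initial


limit = OrderCountLimit(initial=3)
results = [limit.try_send() for _ in range(5)]
print(results)           # [True, True, True, False, False]
limit.reset()
print(limit.try_send())  # True
```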
         | 
         | Another is a limit on message rate. You could use a token
         | bucket filter. Does not affect normal operation, but stops a
         | machine which is spraying out excessive orders. You could have
         | it so that if the bucket runs out, it turns off until a human
         | explicitly turns it back on.
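         | 
         | A sketch of a token bucket that latches off when it runs dry.
         | The names are hypothetical, and the clock is passed in
         | explicitly (an assumption made to keep the example
         | deterministic):

```python
class LatchingTokenBucket:
    """Token bucket that trips off when empty and stays off until re-armed."""

    def __init__(self, rate_per_sec: float, capacity: float) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.last = 0.0         # timestamp of the previous call, in seconds
        self.tripped = False

    def allow(self, now: float) -> bool:
        """Return True if one message may be sent at time `now`."""
        if self.tripped:
            return False
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < 1.0:
            self.tripped = True  # latch off until a human intervenes
            return False
        self.tokens -= 1.0
        return True

    def rearm(self, now: float) -> None:
        """Explicit human action: refill the bucket and clear the latch."""
        self.tokens = self.capacity
        self.last = now
        self.tripped = False


bucket = LatchingTokenBucket(rate_per_sec=10.0, capacity=5.0)
burst = [bucket.allow(now=0.0) for _ in range(7)]
print(burst)                   # [True, True, True, True, True, False, False]
print(bucket.allow(now=60.0))  # False: still latched off a minute later
bucket.rearm(now=60.0)
print(bucket.allow(now=60.0))  # True
```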
         | 
         | You have limits on net position too, to stop you running up
         | huge positions in anything, but those are higher-level, and not
         | quite the same kind of last-ditch safety check.
         | 
         | I don't really know that either of these would have helped in
         | Knight Capital's situation, because the precise mechanics of
         | the "power peg" aren't clear. It sounds like a kind of
         | explicitly-managed iceberg order, which these safeties would
         | have caught. But another writeup [1] says it was a testing
         | tool, not intended to be used on a real exchange at all, in
         | which case who knows.
         | 
         | [1] https://www.henricodolfing.com/2019/06/project-failure-
         | case-...
        
           | pclmulqdq wrote:
           | (Author) As far as I know, Power Peg was indeed intended to
           | essentially be a manual iceberg order from the time before
           | that was an order type on the exchange (with slightly
           | different semantics).
           | 
           | Rereading the source you quoted, it definitely wasn't a "buy
           | high sell low" system, even if it was never used in prod.
        
         | pgwhalen wrote:
         | > I wonder what these "fast-twitch" sanity checks / circuit
         | breakers look like. Whenever I try to model risk, things get
         | complicated quickly -- but presumably simple heuristics must
         | exist if they became universal in the industry.
         | 
         | You're right, they are very simple. Think things like orders
         | per second, quantity of order, price of order, notional ordered
         | over time, etc. You basically want to ensure things aren't "too
         | big" or "too fast" as simply as possible.
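         | 
         | A rough sketch of per-order "too big" checks, assuming a
         | reference price is available. All thresholds and names are
         | placeholders, and rate-based checks are left out:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    # All thresholds are made-up placeholders.
    max_quantity: int = 10_000
    max_notional: float = 1_000_000.0
    max_price_deviation: float = 0.10  # fraction away from a reference price


def check_order(quantity: int, price: float, reference_price: float,
                limits: Limits = Limits()) -> list[str]:
    """Return the names of violated checks; an empty list means it passes."""
    violations = []
    if quantity <= 0 or quantity > limits.max_quantity:
        violations.append("quantity")
    band = limits.max_price_deviation * reference_price
    if price <= 0 or abs(price - reference_price) > band:
        violations.append("price")
    if quantity * price > limits.max_notional:
        violations.append("notional")
    return violations


print(check_order(100, 50.0, reference_price=50.5))     # []
# An order with price and quantity swapped trips two checks at once.
print(check_order(610_000, 1.0, reference_price=610_000.0))
# ['quantity', 'price']
```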
         | 
         | Other types of risk (portfolio risk, greek risk, etc.) are
         | handled in different ways, upstream of these final checks.
        
       | reese_john wrote:
       | Related discussions:
       | 
       | Knightmare: A DevOps Cautionary Tale
       | 
       | https://news.ycombinator.com/item?id=37459495
       | 
       | The $440M software error at Knight Capital (2019)
       | 
       | https://news.ycombinator.com/item?id=31239033
        
       | neilv wrote:
       | Interesting analysis. You can easily imagine each bit of
       | sloppiness and oops happening in many shops. Even many instances
       | of sloppiness combining for emergent worse effect isn't unusual.
       | Much less common is for the company to be wiped out by it.
       | 
       | > _At 10:15, the kill switch was flipped, stopping the company's
       | trading operations for the day. By early afternoon, many of
       | Knight Capital's employees had already sent out resumes,_
       | 
       | Was the patient obviously dead in those first few hours, or had
       | people written it off prematurely when they might've been called
       | to help perform CPR?
       | 
       | Maybe there's two problems here: insufficient culture of
       | diligence, and insufficient culture of loyalty.
        
         | randmeerkat wrote:
         | > insufficient culture of loyalty
         | 
         | Loyalty in tech..? Have you not seen the layoffs? Employees are
         | just a line item in someone's cost center.
        
           | tekla wrote:
           | So many developers are fungible, it makes sense for everyone
           | involved.
        
           | golergka wrote:
           | Quants and HFT developers are among the best compensated
           | people on Earth. I don't think that someone who's being paid
           | over a million dollars a year to sit in a cubicle and write
           | code should use the same rhetoric as 19th century factory
           | workers.
        
             | willdr wrote:
             | Then you don't understand the relationship between capital,
             | labour and being an employee vs employer. Yes, they are
             | well-compensated, but they can still have their lives
             | turned upside down by bosses. Granted, their life getting
             | turned upside down probably means less front-row Knicks
             | tickets and not choosing between paying a water bill and a
             | power bill.
        
             | nly wrote:
             | Best paid on Earth in the US. In other countries we're paid
             | well but about average by ordinary Valley tech standards.
        
             | mhh__ wrote:
             | A lot of quants are objectively paid very well but not
             | _that_ well (at least pre layoffs) compared to some people
             | who do comparatively little work in tech.
        
               | golergka wrote:
               | Good. Compensation should never be about the amount of
               | work you do. It should be about your impact and market
               | value.
        
           | neilv wrote:
           | I'm not necessarily blaming labor.
           | 
           | For example: a company that needs loyalty, but then does
           | things like the layoffs we've seen recently (or earlier
           | behavior consistent with that thinking), is creating
           | insufficient culture of loyalty.
        
         | pclmulqdq wrote:
         | I am the author, and everyone I knew who was at Knight at the
         | time had no power to fix anything. After all the trades
         | happened, the quants and tech people couldn't really do
         | anything to unwind them. That was up to the "business" side of
         | the org, who still had no power because they were essentially
         | left looking for a deal to rescue a bankrupt company. Those
         | people actually knew that this was a company-destroying event,
         | and apparently it was a late night for them at work and they
         | were also quietly talking to recruiters.
         | 
         | At least one manager actually gave his whole team the afternoon
         | off while senior management worked this out, anticipating that
         | there would be no Knight by the end of the week.
        
           | fl7305 wrote:
           | In the movie WarGames, someone suggested "just turn the damn
           | power off", and got a good answer on why that was "a bad
           | idea"(tm).
           | 
           | I'm kinda wondering the same thing here, and can't think of
           | the reason why not?
        
             | yellowstuff wrote:
             | That's what always struck me about this story. I like to
             | think that if I was in the room we would've turned the
             | system off at 9:31 when we knew it was working abnormally
             | but didn't know why. Instead they let it run for over an
             | hour while they tried to QA it. I've heard that Knight had
             | no kill switch to stop all systems from trading, which
             | seems like by far the biggest mistake they made.
        
               | pclmulqdq wrote:
               | As far as I know, Knight did not have a "clean" kill
               | switch, but they did have the ability to shut everything
               | down. If they killed trading, that meant killing some
               | processes that would screw up their accounting, which
               | would mean losing the rest of the trading day.
               | 
               | Firms I worked at after the fact tried to make sure the
               | kill switch was non-disruptive, and actually did push it
               | once or twice.
        
             | KMag wrote:
             | The power switches were in locked cages inside colos miles
             | from the nearest Knight employee. The sysadmins could
             | probably power off the boxes remotely, but the devs
             | probably didn't have that access.
             | 
             | Yes, a "big red button" kill switch was the right answer,
             | but they didn't have that, and it took time to work through
             | several layers of corporate bureaucracy while losing
             | $150,000 per second, time they didn't have.
             | 
             | Anyone with experience in the industry knows that,
             | especially the day of/following a software release, "if in
             | doubt, kill and roll back". Being out of the market for
             | even a few hours is fairly easy to come back from, even
             | in places where you are legally required to be in the
             | market for a given percentage of the time in order to
             | qualify for transaction tax breaks on your market making
             | trades.
             | 
             | That's why on that day my manager called me after hours to
             | come in early the next morning and test our "big red
             | button" kill switch before the market open.
        
               | JumpCrisscross wrote:
               | > _Being out of the market for even a few hours is fairly
               | easy to come back from_
               | 
               | Not if you're dynamically hedging an options book.
        
               | KMag wrote:
               | Fair enough. At Goldman, I don't recall our options and
               | equities trading systems simultaneously having outages,
               | so I think we could always shed risk by reducing our
               | options exposure if the auto-hedger was unable to delta-
               | hedge in the equities market. I'm not actually sure if
               | the relative independence of the options and equity
               | execution systems was intentional.
               | 
               | I did some work with connecting the options auto-hedger
               | to the equity execution system, and certainly failures on
               | the delta-one side prevented increasing exposure on the
               | options side. "How long can we be out of the equities
               | market and still be certain of meeting our options
               | market-making obligations with the Hong Kong Exchange?"
               | did come up a couple of times.
               | 
               | Depending on exactly where the outage was, there was
               | potentially also the option of manually hedging the
               | options book like the bad-old days. (Execution engines
               | failed, but order management system and exchange
               | connectivity still intact would be one such scenario.)
        
               | fl7305 wrote:
               | Thanks, good answer.
               | 
               | I've been in other types of very stressful situations.
               | Simple things like getting hold of the right person or
               | the right tool can be astonishingly time consuming in a
               | tight spot.
        
         | Kranar wrote:
         | Loyalty had nothing to do with this. The company lost $400
         | million in about 45 minutes. That's roughly 9 million dollars
         | a minute, or 150 thousand dollars a second.
         | 
         | I work as a quant and remember that morning vividly. It was as
         | if trucks of free money were falling from the sky to the point
         | that many of us were skeptical that these were actual trades;
         | we were convinced that this was some kind of bug at the
         | exchange and these trades would get broken.
         | 
         | Mistakes like this do happen, in fact it's not thaaaat
         | uncommon. What made this situation so unusual is that it just
         | kept going and going.
        
         | mhh__ wrote:
         | > Maybe there's two problems here: insufficient culture of
         | diligence
         | 
         | I know nothing about Knight but I will say that traditional
         | finance often has no ability to think about systems
         | particularly well so the local optimum is usually a rickety
         | shack that has a lot of smart people working on it, and some
         | less smart people making sure problems are being _seen_ to be
         | fixed, but the overall architecture can be extremely brittle or
         | non-existent. It's not like (good) tech companies.
        
       | kunwon1 wrote:
       | This bit at the bottom was gratifying to read:
       | 
       | > As of 2016, the engineer who did the update still worked at KCG.
       | His entire management chain had been replaced, all the way up to
       | the CTO.
        
         | blantonl wrote:
         | The amount of experience this engineer has is invaluable.
        
           | jameshart wrote:
           | It cost $440m. Unclear if it was valued as much.
        
         | brazzledazzle wrote:
         | That's a refreshing reversal from what you usually hear about.
         | You can bet this engineer will never do anything like that
         | again and will likely lead the way toward implementing
         | comprehensive and effective safety mechanisms.
        
         | paxys wrote:
         | That sentence isn't as impactful as it sounds. There aren't
         | very many engineers out there whose reporting chain remains
         | intact over the better part of a decade, disaster or no
         | disaster.
        
           | pclmulqdq wrote:
           | Author here, sorry - it should read that his management chain
           | was replaced (resigned or fired) within the week.
        
       | CobrastanJorji wrote:
       | I see the "nanoseconds counted so they didn't use protobufs"
       | note, but in case you do use protobufs and want to make sure this
       | never happens to you, I heartily recommend using the "reserved"
       | keyword in your protos whenever you remove a field. Reserving a
       | number is a note to the proto compiler that says "I will not use
       | this number again, and please generate an error if I foolishly
       | later try."
       | 
       | It's a useful and probably under-used feature.
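       | 
       | For illustration, a sketch of what that looks like in a .proto
       | file. The message and field names are hypothetical (Knight did
       | not use protobufs); only the `reserved` syntax is real:

```proto
syntax = "proto3";

// Hypothetical message; field names and numbers are illustrative only.
message Order {
  // Field 3 used to be `bool old_feature = 3;` and has been removed.
  // Reserving the number and the name makes the compiler reject any reuse.
  reserved 3;
  reserved "old_feature";

  string symbol = 1;
  int64 quantity = 2;
  // bool new_feature = 3;  // would not compile: field number 3 is reserved
  bool new_feature = 4;
}
```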
        
         | scottshamus wrote:
         | If anyone reading this hasn't heard this before, I encourage
         | you to read the do's/dont's of protobufs for some other good
         | best practices.
         | 
         | https://protobuf.dev/programming-guides/dos-donts/
        
           | GauntletWizard wrote:
           | There are several protobuf linters out there, but what I
           | haven't seen is a protobuf linter that integrates with Git to
           | look behind and verify that you haven't accidentally changed
           | protobuf numbers, or reused fields, or etc. Would be handy.
        
             | orbz wrote:
             | I believe buf's breaking change detection will pick up on
             | this. It also allows you to specify a git ref to target.
             | 
             | https://buf.build/docs/reference/cli/buf/breaking#against
        
       | blowski wrote:
       | HN conversation from the time.
       | 
       | https://news.ycombinator.com/item?id=4329101
        
       | _boffin_ wrote:
       | Posted in a prior thread, but here's a trade-by-trade recap
       | 
       | Original Thread: https://news.ycombinator.com/item?id=37459495
       | ---
       | 
       | If you want to see what it looked like at the tick scale, take a
       | look here: http://www.nanex.net/aqck2/3522.html
       | 
       | P.S. Does anyone know of any other sites / places that do a
       | comparable level of research that's open to the public?
        
       | upbeat_general wrote:
       | In my opinion the root cause is pretty clear: they had a network
       | protocol update that was not backwards compatible and didn't
       | verify the runtime versions or have versioning.
       | 
       | Everything else isn't really core to the issue [even that the
       | update script silently failed].
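       | 
       | A minimal sketch of the kind of version check meant here,
       | assuming a made-up wire format with a version field ahead of
       | the flag word (the header layout and all names are
       | hypothetical):

```python
import struct

PROTOCOL_VERSION = 2            # hypothetical version of the new format
HEADER = struct.Struct("!HI")   # assumed layout: 16-bit version, 32-bit flags


class VersionMismatch(Exception):
    """Raised when a message was encoded for a different protocol version."""


def encode(flags: int, version: int = PROTOCOL_VERSION) -> bytes:
    return HEADER.pack(version, flags)


def decode(message: bytes, supported_version: int) -> int:
    """Return the flag word, refusing any message from another version."""
    version, flags = HEADER.unpack(message[:HEADER.size])
    if version != supported_version:
        raise VersionMismatch(
            f"got v{version}, this server speaks v{supported_version}")
    return flags


msg = encode(flags=0b1000)
print(decode(msg, supported_version=2))  # 8
try:
    decode(msg, supported_version=1)     # stale server that missed the update
except VersionMismatch as exc:
    print("rejected:", exc)  # rejected: got v2, this server speaks v1
```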
        
         | lmm wrote:
         | Mistakes happen. Bad processes and bad governance are what turn
         | mistakes into disasters.
        
       | rwmj wrote:
       | I suspect the main effect of "modern practices" is to make it
       | much easier to make mistakes at scale. Maybe not the exact same
       | mistake as here, but some other one.
       | 
       | By the way, have we heard from Google about how they managed to
       | roll back 6 months of customer data yesterday?
        
       | ShakataGaNai wrote:
       | They say, in Aircraft, safety regulations are written in blood.
       | While no one died from this event, it's clear that the metaphorical
       | corporate blood spilled probably did wonders to help a lot of
       | other groups.
        
       | MichaelRo wrote:
       | Reminds me of this blunder: trader accidentally switches places
       | for price and quantity and instead of selling 1 contract at
       | Y=610,000 manages to send an order to sell 610,000 contracts at
       | Y=1. Order passes through the GUI, limits checker and several
       | dozen systems like knife through cheese and is placed on the
       | exchange. Exchange happily accepts the order and mayhem ensues.
       | 
       | If all were executed, that would be a loss of more than $3B
       | (billions!), heck, almost $4 billion. Eventually the company
       | settles for about $300M (millions).
       | 
       | So Knight Capital isn't alone in this "hall of fame" :)
       | 
       | Here's the story: https://www.cbsnews.com/news/stock-trade-typo-
       | costs-firm-225...
        
         | artursapek wrote:
         | I implemented a slippage warning system in a trading GUI I was
         | in charge of after exactly this scenario happened once: a
         | trader switching price and quantity and temporarily cratering a
         | market. It would show a second order confirmation screen if
         | your order was going to fill with high slippage, and it made
         | you type the words "SHOOT ME" into a text field to send the
         | order. After we had built it, it seemed so obvious to have that
         | kind of sanity checking.
         | 
         | It makes even more sense for the matching engine to disallow
         | this on the back end, though.
        
           | yellowstuff wrote:
           | I believe that Mizuho had a similar check in place. This
           | stock was an IPO so there was no last price to compare
           | against.
        
             | artursapek wrote:
             | It shouldn't require price history. You just need the order
             | book and you can simulate the execution of any order, and
             | figure out its average price.
             | 
             | If the average price is X basis points worse than the
             | current top of book, that's slippage. So eg if highest bid
             | in book is $100 and you are entering a sell order that
             | would eat so much of the book that it would fill with an
             | average price of $70, that's 30% slippage and probably not
             | what you meant to do.
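             | 
             | A sketch of that simulation for a sell order, assuming a
             | non-empty bid side. The function names and the book are
             | made up, with levels chosen to reproduce the $100 to $70
             | example:

```python
def simulate_sell(bids: list[tuple[float, int]], quantity: int) -> float:
    """Average fill price for a sell order walked down the bid side.

    `bids` holds (price, size) levels sorted best (highest) price first.
    Raises ValueError if the book cannot absorb the whole order.
    """
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    remaining = quantity
    proceeds = 0.0
    for price, size in bids:
        take = min(remaining, size)
        proceeds += take * price
        remaining -= take
        if remaining == 0:
            return proceeds / quantity
    raise ValueError("order is larger than the visible book")


def slippage(bids: list[tuple[float, int]], quantity: int) -> float:
    """Fractional shortfall of the average fill versus the top of book."""
    top = bids[0][0]
    return (top - simulate_sell(bids, quantity)) / top


# Made-up book: 10 @ $100, 10 @ $80, 20 @ $50.
book = [(100.0, 10), (80.0, 10), (50.0, 20)]
print(simulate_sell(book, 40))  # 70.0
print(slippage(book, 40))       # 0.3
```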
        
               | gpderetta wrote:
               | You need to have a book in the first place though. If the
               | instrument is highly illiquid the spread might be huge
               | and the price have little real world relevance.
               | 
               | For liquid books, yes definitely I would expect these
               | sort of checks (typically against historical prices) to
               | be in place.
        
           | kstrauser wrote:
           | On the opposite end, sometimes the traders don't _want_ that
           | kind of reminder:
           | https://thedailywtf.com/articles/special-delivery
           | 
           | Relevant quote:
           | 
           | > As the senior trader at AExecor, Brad made it very clear
           | that no one -- "not even His Holiness, the Pope" -- shall
           | question his trades. After all, Brad makes complex trading
           | decisions that no one else could possibly comprehend.
        
           | gosub100 wrote:
           | I worked on a compliance add-on that blocked institutional
           | traders based on rules they set for themselves. All day long
           | was spent fielding their urgent support requests complaining
           | that the rule was wrong. Massive amounts of time were spent
           | finding the market data or computing their account value _at
           | the specific time_ the trade was blocked, 90% of the time
           | arriving at the conclusion that the product worked as
           | intended. I complained that I wasn't getting to code enough,
           | I was told to "code on the train [while riding to work]".
           | 
           | One time, a client wanted a rule to block a trade if the
           | price exceeded the daily high/low. Well, when you define the
           | limit this way, you pretty much can't trade right at the
           | opening because many/most price ticks _are_ the highest/lowest
           | of the day SO FAR, because the day is only a few
           | seconds old! Customer had a meltdown, we traced the market
           | data back, and realized yeah, it worked exactly as designed.
           | sigh
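           | 
           | A small sketch of why that rule bites at the open, using
           | made-up prices and a hypothetical helper: each tick is
           | compared against the day's range seen so far.

```python
def blocked_ticks(prices: list[float]) -> list[bool]:
    """For each tick, True if the price is outside the day's range so far."""
    blocked = []
    high = low = None
    for p in prices:
        if high is None:
            blocked.append(False)  # the first tick defines the range
            high = low = p
            continue
        blocked.append(p > high or p < low)
        high, low = max(high, p), min(low, p)
    return blocked


opening = [100.0, 100.2, 99.9, 100.3, 100.1, 99.8]
print(blocked_ticks(opening))  # [False, True, True, True, False, True]
```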
        
       | dang wrote:
       | Related. Others?
       | 
       |  _Counterfactual Thinking, Rules, and the Knight Capital Accident
       | (2013)_ - https://news.ycombinator.com/item?id=37472422 - Sept
       | 2023 (1 comment)
       | 
       |  _Knightmare: A DevOps Cautionary Tale (2014)_ -
       | https://news.ycombinator.com/item?id=37459495 - Sept 2023 (275
       | comments)
       | 
       |  _The $440M software error at Knight Capital (2019)_ -
       | https://news.ycombinator.com/item?id=31239033 - May 2022 (172
       | comments)
       | 
       |  _Knightmare: A DevOps Cautionary Tale (2014)_ -
       | https://news.ycombinator.com/item?id=22250847 - Feb 2020 (33
       | comments)
       | 
       |  _Knightmare: A DevOps Cautionary Tale (2014)_ -
       | https://news.ycombinator.com/item?id=8994701 - Feb 2015 (85
       | comments)
       | 
       |  _Knightmare: A DevOps Cautionary Tale_ -
       | https://news.ycombinator.com/item?id=7652036 - April 2014 (60
       | comments)
       | 
       |  _Knight Capital Says Trading Glitch Cost It $440 Million_ -
       | https://news.ycombinator.com/item?id=4329101 - Aug 2012 (91
       | comments)
        
       | fintechie wrote:
       | This story now looks like a minor incident if you compare it to
       | FTX.
        
       | KMag wrote:
       | I was working on trading systems at Goldman in NYC at the time.
       | After hours on that day, I got a call from my manager to come in
       | early the next day to ensure our kill switches worked properly,
       | that our release and review processes were sufficient, and that
       | our monitoring systems were sufficient.
       | 
       | A few years later, I was working on trading systems at Goldman in
       | Hong Kong. I sent a change out for review, went out for dinner
       | and drinks with a colleague visiting from Tokyo, and swung by the
       | office on my way home. My change had been approved by my NYC
       | colleagues, so I merged it and went to bed. The next morning, I
       | woke up to news that Goldman had a 100 million USD trading loss
       | due to a software bug.
       | 
       | Edit: This was in Goldman's Slang language, where source code is
       | loaded from a globally-distributed eventually-consistent NoSQL
       | DB. Most applications execute from read-only DB snapshots after
       | extensive release testing. However, as soon as you merge your
       | change, it's potentially instantly running in production
       | somewhere in the world by some team you might not even know
       | exists. It was possible my merge, maybe 45 minutes before the NYC
       | market open, had gotten picked up by the errant system.
       | 
       | I spent a while convincing myself that there was no way my change
       | was the cause, and realized my phone would be ringing off the
       | hook had my change been the cause.
       | 
       | The guy who made the software change (let's call him Zaphod
       | Beeblebrox since that's clearly not his name), and the guy who
       | approved it, were both put on leave before I woke up. I found out
       | who made the change only because I had an open chat window with
       | Zaphod, and through several rounds of "fifth quartile" annual
       | layoffs, knew how the chat and email systems responded when
       | accounts got locked out. The chat system showed Zaphod's location
       | as unknown, and a test email to him came back with the "mailbox
       | full" message for a locked account. I walked over to the desk of
       | one of the senior Equity Options Flow Strats in Hong Kong, and
       | whispered "So... Zaphod Beeblebrox", and the Strat's face lit up
       | and he whispered back "How did you know?", to which I responded
       | "I didn't until I saw your reaction".
       | 
       | The guy who made the 100 million mistake was actually very very
       | good at his job. He caught quite a few subtle bugs in other
       | people's code that he wasn't even asked to review, but was
       | reviewing out of curiosity. But, he was working late under time
       | pressure, didn't test his change properly, and you only have to
       | slip up once.
       | 
       | As I remember, many of the trades were broken by the exchange,
       | and the total loss came out to about 28 million USD.
       | 
       | On the one hand, the guy didn't deserve to get fired, and I'd
       | totally hire him for my team. On the other hand, if someone cuts
       | corners and that results in tens of millions of USD in losses and
       | doesn't get fired, that's very demotivating for everyone else at
       | the firm. They did a very good job about not naming and shaming.
       | 
       | After waking up to being momentarily scared I had made a 100
       | million USD error, I don't merge changes after-hours any more,
       | and certainly never after having consumed any alcohol. If a guy
       | like Zaphod can lose 28 million USD from a tired merge, so can I.
       | 
       | Zaphod, if you're reading this and ever looking for a job, give
       | me a ring.
        
         | yellowstuff wrote:
         | Great story. The part that sticks out to me is 72% (the
         | discount that GS got due to busted trades) and 0% (the discount
         | Knight got due to busted trades.) If you're going to eff up,
         | first make sure you're a big player!
        
           | KMag wrote:
           | In Goldman's case, Goldman was literally sending out options
           | orders with an ask price of $0. I don't recall if it was the
           | exchange or regulators that decided "If you bought below $x,
           | you knew you were trading against a broken algorithm and
           | shouldn't have expected the trade to last".
           | 
           | I'm not sure if any of the orders Knight was sending out were
           | clearly so erroneous. It's also possible that it was only
           | after the Knight incident that it was made clear to market
           | participants that they should expect clearly erroneous trades
           | to be broken.
           | 
           | In any case, Knight was a major liquidity provider, and it
           | wasn't in the market's best interest for them to go bankrupt,
           | but it also sets a very bad precedent if plausible orders get
           | broken.
        
             | pclmulqdq wrote:
             | As far as I know, Knight was unique in that its orders were
             | obviously stupid, but not obviously mispriced or mis-sized.
             | It's not a case of a clear fat-finger error that would be
             | visible to other market participants.
             | 
             | I know that in many markets, the exchange will reverse your
             | trades if the counterparty made an obvious, visible error
             | (eg $0 ask price on a limit order).
        
       | esotericimpl wrote:
       | The best part of these stories is that nothing of value was ever
       | created or lost.
       | 
       | Just a shell game in a casino, our previous generations built
       | rockets with their best and brightest, now we build Ad Exchanges
       | and High frequency trading bots.
        
       ___________________________________________________________________
       (page generated 2023-11-28 23:00 UTC)