[HN Gopher] Post-mortem for last week's incident at Kagi
       ___________________________________________________________________
        
       Post-mortem for last week's incident at Kagi
        
       Author : leetrout
       Score  : 37 points
       Date   : 2024-01-16 21:04 UTC (1 hour ago)
        
 (HTM) web link (status.kagi.com)
 (TXT) w3m dump (status.kagi.com)
        
       | renewiltord wrote:
       | Interesting. The classic problem. You offer to not meter
       | something and then someone will use it to max capacity. Then
       | you're forced to place a limit so that one user won't hose
       | everyone else.
        
         | fotta wrote:
         | > We were later in contact with an account that we blocked who
         | claimed they were using their account to perform automated
         | scraping of our results, which is not something our terms allow
         | for.
         | 
         | I mean beyond that it was a user that was violating the TOS.
         | This isn't really a bait and switch scenario (although it could
         | be reasonably construed as such).
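
A minimal sketch of the per-user limit described at the top of this subthread, assuming a token-bucket scheme. This is not Kagi's actual mitigation; the class name, parameters, and user IDs are hypothetical and chosen for illustration only. Each user gets their own bucket, so one heavy account runs dry without affecting anyone else.

```python
import time


class PerUserTokenBucket:
    """Hypothetical limiter: every user has a bucket of `capacity` tokens
    that refills at `rate` tokens per second. One request costs one token."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second (assumed value)
        self.capacity = capacity  # maximum burst size per user (assumed value)
        self.clock = clock        # injectable clock, handy for testing
        self._buckets = {}        # user_id -> (tokens, time of last update)

    def allow(self, user_id):
        now = self.clock()
        # A user seen for the first time starts with a full bucket.
        tokens, last = self._buckets.get(user_id, (self.capacity, now))
        # Refill according to elapsed time, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self._buckets[user_id] = (tokens - 1, now)
            return True
        self._buckets[user_id] = (tokens, now)
        return False
```

With `rate=1.0` and `capacity=2`, a user can burst two requests and then sustain about one per second; requests from other users are unaffected because buckets are keyed by user ID.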
        
       | fwsgonzo wrote:
       | It wasn't that long ago that I had heard about Kagi the first
       | time. Now I use it every day, and the fact that I can pin
       | cppreference.com to the top is just such a boon.
        
       | boomboomsubban wrote:
       | One user running a scraper took the service down for seven hours?
       | I know it's easy to sit on the outside and say they should have
       | seen this coming, but how does nobody in testing go "what happens
       | if a ton of searches happen?"
        
         | z64 wrote:
         | Hi there, this is Zac from Kagi. I just posted some other
         | details here that might be of interest:
         | 
         | https://news.ycombinator.com/item?id=39019936
         | 
         | TL;DR - we are a tiny, young team at the center, and everyone
         | has a closet full of hats they wear. No dedicated SRE team yet.
         | 
         | > "what happens if a ton of searches happen?"
         | 
         | In fairness, you can check out https://kagi.com/stats - "a lot
         | of searches" is already happening, approaching 400k per day,
         | and systems still operate with plenty of capacity day-to-day,
         | in addition to some auto-scaling measures.
         | 
         | The devil is in the details of some users exploiting a
         | pathological case. What we lacked (and have now rightfully
         | gained) was the experience to know which organic or
         | pathological traffic we could have predicted and simulated
         | ahead of time.
         | 
         | Load-simulating 20,000 users searching concurrently sounds like
         | it would have been a sound experiment early on, and we did do
         | some things resembling this. But considering this incident, it
         | still would not have caught this issue. We have also had maybe
         | 10 people run security scanners on our production services at
         | this point that generated more traffic than this incident.
         | 
         | It is extremely difficult to balance this kind of development
         | when we also have features to build, and clearly we could do
         | with more of it! As mentioned in my other post, we are looking
         | to expand the team in the near term so that we are not spread
         | so thin on these sorts of efforts.
         | 
         | There is a lot that could be said in hindsight, but I hope that
         | is a bit more transparent WRT how we ended up here.
        
           | smcleod wrote:
           | Zac, I think you're doing great handling and communicating
           | this. Keep up the great work and have fun learning while
           | you're at it!
        
       | fragmede wrote:
       | That speaks volumes about the observability they have of their
       | internal systems. It's easy for me to say they should have seen
       | it sooner, but the right datadog dashboards and splunk queries
       | should have made that clear as day much faster. Hopefully they
       | take it as a learning experience and invest in better monitoring.
        
         | z64 wrote:
         | Hi there, I'm Zac, Kagi's tech lead / author of the post-mortem
         | etc.
         | 
         | This has 100% been a learning experience for us, but I can
         | provide some finer context re: observability.
         | 
         | Kagi is a small team. The number of staff we have capable of
         | responding to an event like this is essentially 3 people,
         | seated across 3 timezones. For myself and my right-hand dev,
         | this is actually our very first step in our web careers - this
         | is to say that we are not some SV vets who have seen it all
         | already. That we have a lot to learn is a given; having built
         | Kagi from nothing, though, I am proud of how far we've come &
         | where we're going.
         | 
         | Observability is something we started taking more seriously in
         | the past 6 months or so. We have tons of dashboards now, and
         | alerts that go right to our company chat channels and ping
         | relevant people. And as the primary owner of our DB, GCP's
         | query insights are a godsend. During the incident, our monitoring
         | went off and query insights showed the "culprit" query - but we
         | could have all the monitoring in the world and still lack the
         | experience to interpret it and understand what the root cause
         | is or what the most efficient mitigation is.
         | 
         | In other words, we don't have the wisdom yet to not be "gaslit"
         | by our own systems if we're not careful. Only in hindsight can
         | I say that GCP's query insights was 100% on the money, and not
         | some bug in application space.
         | 
         | All said, our growth has enabled us to expand our team quite a
         | bit now. We have had SRE consultations before, and intend to
         | bring on more full or part-time support to help keep things
         | moving forward.
        
           | jjtheblunt wrote:
           | I bet a silent majority are thinking "well done, Zac, all the
           | same".
        
           | nanocat wrote:
           | Sounds like you're doing great to me. Thank you for being so
           | open!
        
           | primitivesuave wrote:
           | I really appreciate you sharing these candid insights. Let me
           | tell you (after over a decade of deploying cloud services),
           | some rogue user will always figure out how to throw an
           | unforeseen wrench into your system as the service gets more
           | popular. Even worse than an outage is when someone figures
           | out how to explode your cloud computing costs :)
        
           | timwis wrote:
           | Thank you for sharing! I'm surprised to hear that, given how
           | impressive your product is, but I'm an even bigger fan now.
        
         | mathverse wrote:
         | Kagi is a startup with low margins and high operational costs.
        
         | hacker_newz wrote:
         | What are "the right datadog dashboards and splunk queries"?
        
           | blantonl wrote:
           | Lots and lots of money to catch what you don't know, which
           | means "oh crap, now we need to log this also"
        
       | layoric wrote:
       | "This didn't exactly come as a surprise to us, as for the
       | entirety of Kagi's life so far we have actually used the
       | cheapest, single-core database available to us on GCP!"
       | 
       | Outages suck, but I love the fact that they are building such a
       | lean product. Been paying for Kagi as a part of de-Google-ifying
       | my use of online services and the experience so far (I wasn't
       | impacted by this outage) has been great.
       | 
       | A few years ago I built a global SaaS (first employee and SWE) in
       | the weather space which was backed by a single DB, and while it
       | had more than just 1 core (8 when I left from memory), I think a
       | lot of developers reach for distributed DBs far too early. Modern
       | hardware can do a lot, products like AWS Aurora are impressive,
       | but they come with their own complexities (and MUCH higher
       | costs).
        
       | jacob019 wrote:
       | If you're listening, Kagi, please add an a la carte plan for
       | search. Maybe hide it behind the API options so as not to disrupt
       | your normal plans. I love the search and I'm happy to pay, but
       | I'm cost sensitive now and it's the only way that I'm going to
       | feel comfortable using it long-term.
        
         | spiderice wrote:
         | They have a $5 for 300 searches option. Is that not what you're
         | referring to?
        
       | zilti wrote:
       | GCP? Well, that is one way to waste a lot of money and sanity.
        
         | HaZeust wrote:
         | I'll bite; how do you figure?
        
       | smcleod wrote:
       | Good write up. I always appreciate Kagi's honesty and
       | transparency like this. Great product, great service.
        
       | muhammadusman wrote:
       | I was one of the users that went and reported this issue on
       | Discord. I love Kagi but I was a bit disappointed to see that
       | their status page showed everything was up and running. I think
       | that made me a bit uneasy and it shows their status pages are not
       | given priority during incidents that are affecting real users. I
       | hope in the future the status page is accurately updated.
       | 
       | In the past, services I heavily rely on (e.g. GitHub) have
       | updated their status pages immediately and this allows me to rest
       | assured that people are aware of the issue and it's not an issue
       | with my devices. When this happened with Kagi, I was looking up
       | the nearest grocery stores open since we were getting snow later
       | that day so it was almost like I got let down b/c I had to go to
       | Google for this.
       | 
       | I will continue using Kagi b/c the other 99.9% of the time I've
       | used it, it has been better than Google, but I hope the authors of
       | the post-mortem do mean it when they say they'll be moving their
       | status page code to a different service/platform.
       | 
       | And thanks again Zac for being transparent and writing this up.
       | This is part of good engineering!
        
       | blantonl wrote:
       | _At first, by what turned out to be a complete coincidence, the
       | incident occurred at precisely the same time that we were
       | performing an infrastructure upgrade to our VMs with additional
       | RAM resources_
       | 
       | I can assure you that these "coincidences" happen all the time,
       | and will cause you to question your very existence when you are
       | troubleshooting them. And if you panic while questioning your
       | very existence, you'll invariably push a hotfix that breaks
       | something else and then you are in a world of hurt.
       | 
       | Murphy's law is a cruel thing to sysadmins and developers.
        
       ___________________________________________________________________
       (page generated 2024-01-16 23:00 UTC)