[HN Gopher] Post-mortem for last week's incident at Kagi
___________________________________________________________________
Post-mortem for last week's incident at Kagi
Author : leetrout
Score : 37 points
Date : 2024-01-16 21:04 UTC (1 hour ago)
(HTM) web link (status.kagi.com)
(TXT) w3m dump (status.kagi.com)
| renewiltord wrote:
| Interesting. The classic problem. You offer not to meter
| something, and then someone will use it to max capacity. Then
| you're forced to place a limit so that one user won't hose
| everyone else.
| fotta wrote:
| > We were later in contact with an account that we blocked who
| claimed they were using their account to perform automated
| scraping of our results, which is not something our terms allow
| for.
|
| I mean, beyond that, it was a user who was violating the ToS.
| This isn't really a bait-and-switch scenario (although it could
| be reasonably construed as such).
| fwsgonzo wrote:
| It wasn't that long ago that I heard about Kagi for the first
| time. Now I use it every day, and the fact that I can pin
| cppreference.com to the top is just such a boon.
| boomboomsubban wrote:
| One user running a scraper took the service down for seven hours?
| I know it's easy to sit on the outside and say they should have
| seen this coming, but how does nobody in testing go "what happens
| if a ton of searches happen?"
| z64 wrote:
| Hi there, this is Zac from Kagi. I just posted some other
| details here that might be of interest:
|
| https://news.ycombinator.com/item?id=39019936
|
| TL;DR - we are a tiny, young team at the center, and everyone
| has a closet full of hats they wear. No dedicated SRE team yet.
|
| > "what happens if a ton of searches happen?"
|
| In fairness, you can check out https://kagi.com/stats - "a lot
| of searches" is already happening, approaching 400k per day,
| and systems still operate with plenty of capacity day-to-day,
| in addition to some auto-scaling measures.
|
| The devil is in the details of some users exploiting a
| pathological case. Our lack of experience (now rightfully
| gained) was in knowing what organic or pathological traffic we
| could have predicted and simulated ahead of time.
|
| Load-simulating 20,000 users searching concurrently sounds
| like it would have been a reasonable experiment early on, and
| we did do
| some things resembling this. But considering this incident, it
| still would not have caught this issue. We have also had maybe
| 10 people run security scanners on our production services at
| this point that generated more traffic than this incident.
|
| It is extremely difficult to balance this kind of development
| when we also have features to build, and clearly we could do
| with more of it! As mentioned in my other post, we are looking
| to expand the team in the near term so that we are not spread
| so thin on these sorts of efforts.
|
| There is a lot that could be said in hindsight, but I hope that
| is a bit more transparent WRT how we ended up here.
| smcleod wrote:
| Zac, I think you're doing great handling and communicating
| this. Keep up the great work and have fun learning while
| you're at it!
| fragmede wrote:
| That speaks volumes about the observability they have of their
| internal systems. It's easy for me to say they should have seen
| it sooner, but the right datadog dashboards and splunk queries
| should have made that clear as day much faster. Hopefully they
| take it as a learning experience and invest in better monitoring.
| z64 wrote:
| Hi there, I'm Zac, Kagi's tech lead / author of the post-mortem
| etc.
|
| This has 100% been a learning experience for us, but I can
| provide some finer context re: observability.
|
| Kagi is a small team. The number of staff we have capable of
| responding to an event like this is essentially 3 people,
| seated across 3 timezones. For myself and my right-hand dev,
| this is actually our very first step in our web careers - which
| is to say that we are not some SV vets who have seen it all
| already. That we have a lot to learn is a given, but having
| built Kagi from nothing, I am proud of how far we've come &
| where we're going.
|
| Observability is something we started taking more seriously in
| the past 6 months or so. We have tons of dashboards now, and
| alerts that go right to our company chat channels and ping
| relevant people. And as the primary owner of our DB, GCP's
| query insights are a godsend. During the incident our
| monitoring went off and query insights surfaced the "culprit"
| query - but we could have all the monitoring in the world and
| still lack the experience to interpret it, understand the root
| cause, or choose the most efficient action to mitigate.
|
| In other words, we don't yet have the wisdom to avoid being
| "gaslit" by our own systems if we're not careful. Only in
| hindsight can I say that GCP's query insights were 100% on the
| money, and not some bug in application space.
|
| All said, our growth has enabled us to expand our team quite a
| bit now. We have had SRE consultations before, and intend to
| bring on more full or part-time support to help keep things
| moving forward.
| jjtheblunt wrote:
| I bet a silent majority are thinking "well done, Zac, all the
| same".
| nanocat wrote:
| Sounds like you're doing great to me. Thank you for being so
| open!
| primitivesuave wrote:
| I really appreciate you sharing these candid insights. Let me
| tell you (after over a decade of deploying cloud services),
| some rogue user will always figure out how to throw an
| unforeseen wrench into your system as the service gets more
| popular. Even worse than an outage is when someone figures
| out how to explode your cloud computing costs :)
| timwis wrote:
| Thank you for sharing! I'm surprised to hear that, given how
| impressive your product is, but I'm an even bigger fan now.
| mathverse wrote:
| Kagi is a startup with low margins and high operational costs.
| hacker_newz wrote:
| What are "the right datadog dashboards and splunk queries"?
| blantonl wrote:
| Lots and lots of money to catch what you don't know, which
| means "oh crap, now we need to log this also"
| layoric wrote:
| "This didn't exactly come as a surprise to us, as for the
| entirety of Kagi's life so far we have actually used the
| cheapest, single-core database available to us on GCP!"
|
| Outages suck, but I love the fact that they are building such a
| lean product. Been paying for Kagi as a part of de-Google-ifying
| my use of online services and the experience so far (I wasn't
| impacted by this outage) has been great.
|
| A few years ago I built a global SaaS (first employee and SWE) in
| the weather space that was backed by a single DB, and while it
| had more than just 1 core (8 when I left, from memory), I think
| a lot of developers reach for distributed DBs far too early.
| Modern
| hardware can do a lot, products like AWS Aurora are impressive,
| but they come with their own complexities (and MUCH higher
| costs).
| jacob019 wrote:
| If you're listening, Kagi, please add an a la carte plan for
| search. Maybe hide it behind the API options so as not to
| disrupt your normal plans. I love the search and I'm happy to
| pay, but
| I'm cost sensitive now and it's the only way that I'm going to
| feel comfortable using it long-term.
| spiderice wrote:
| They have a $5 for 300 searches option. Is that not what you're
| referring to?
| zilti wrote:
| GCP? Well, that is one way to waste a lot of money and sanity.
| HaZeust wrote:
| I'll bite; how do you figure?
| smcleod wrote:
| Good write up. I always appreciate Kagi's honesty and
| transparency like this. Great product, great service.
| muhammadusman wrote:
| I was one of the users who went and reported this issue on
| Discord. I love Kagi, but I was a bit disappointed to see that
| their status page showed everything was up and running. That
| made me a bit uneasy, and it shows their status pages are not
| given priority during incidents that affect real users. I hope
| in the future the status page is accurately updated.
|
| In the past, services I heavily rely on (e.g. Github), have
| updated their status pages immediately and this allows me to rest
| assured that people are aware of the issue and it's not an issue
| with my devices. When this happened with Kagi, I was looking up
| which nearby grocery stores were open, since we were getting
| snow later that day, so it felt like a letdown because I had to
| go to Google for this.
|
| I will continue using Kagi because 99.9% of the time I've used
| it, it has been better than Google, but I hope the authors of
| the
| post-mortem do mean it when they say they'll be moving their
| status page code to a different service/platform.
|
| And thanks again Zac for being transparent and writing this up.
| This is part of good engineering!
| blantonl wrote:
| _At first, by what turned out to be a complete coincidence, the
| incident occurred at precisely the same time that we were
| performing an infrastructure upgrade to our VMs with additional
| RAM resources_
|
| I can assure you that these "coincidences" happen all the time,
| and will cause you to question your very existence when you are
| troubleshooting them. And if you panic while questioning your
| very existence, you'll invariably push a hotfix that breaks
| something else and then you are in a world of hurt.
|
| Murphy's law is a cruel thing to sysadmins and developers.
___________________________________________________________________
(page generated 2024-01-16 23:00 UTC)