[HN Gopher] Quick takes on the recent OpenAI public incident wri...
___________________________________________________________________
Quick takes on the recent OpenAI public incident write-up
Author : azhenley
Score : 36 points
Date : 2024-12-15 06:01 UTC (1 day ago)
(HTM) web link (surfingcomplexity.blog)
(TXT) w3m dump (surfingcomplexity.blog)
| dang wrote:
| Recent and related:
|
| _ChatGPT Down_ - https://news.ycombinator.com/item?id=42394391 -
| Dec 2024 (30 comments)
| JohnMakin wrote:
| I caused an API server outage once with a monitoring tool,
| though in my case it was a monstrosity of a 20,000-line script.
| We quickly realized what we had done and turned it off. I have
| seen that in very large clusters with 1000+ nodes you need to be
| especially sensitive about monitoring API server resource usage,
| depending on what precisely you are doing. Surprised they
| hadn't learned this lesson yet, given the likely scale of their
| workloads.
| jimmyl02 wrote:
| splitting the control and data plane is a great way to improve
| resilience and prevent everything from being hard down. I wonder
| how it could be accomplished with service discovery / routing.
|
| maybe instead of relying on kubernetes DNS for discovery it can
| be closer to something like envoy: the control plane updates
| configs that are stored locally (and are eventually consistent),
| so even if the control plane dies the data plane still has
| access to the location information of other peer clusters.
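
A minimal sketch of the idea in the comment above, assuming a data plane that keeps its own copy of peer locations and only treats the control plane as a source of updates. Every name here (`LocalRoutingTable`, `sync_once`, `ControlPlaneDown`) is hypothetical and is not taken from Envoy, Kubernetes, or the incident write-up.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


class ControlPlaneDown(Exception):
    """Raised by the (hypothetical) control plane client when unreachable."""


@dataclass
class LocalRoutingTable:
    """Data-plane copy of peer locations, kept locally between updates."""

    endpoints: Dict[str, List[str]] = field(default_factory=dict)
    version: int = 0

    def apply_update(self, version: int, endpoints: Dict[str, List[str]]) -> None:
        # Ignore stale or duplicate pushes so the table only moves forward.
        if version > self.version:
            self.endpoints = dict(endpoints)
            self.version = version

    def resolve(self, service: str) -> List[str]:
        # Lookups read the local snapshot and never call the control plane.
        return self.endpoints.get(service, [])


def sync_once(
    table: LocalRoutingTable,
    fetch_snapshot: Callable[[], Tuple[int, Dict[str, List[str]]]],
) -> bool:
    """Try one refresh; on failure keep serving the last known snapshot."""
    try:
        version, endpoints = fetch_snapshot()
    except ControlPlaneDown:
        return False
    table.apply_update(version, endpoints)
    return True
```

The trade-off is the one the comment names: lookups keep working during a control plane outage, but the answers may be stale until the next successful sync.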
| dilyevsky wrote:
| Something doesn't add up - CoreDNS's kubernetes plugin should be
| serving Service RRs from its internal cache even if APIServer is
| down because it's using cache.Indexer. The records would be stale,
| but unless their application pods all restarted (which they could
| not, since APIServer was down) or all CoreDNS pods got restarted
| (which, again, they could not), records merely expiring from the
| cache shouldn't have caused a full discovery outage.
| jimmyl02 wrote:
| wouldn't it be that coredns caches the information and records
| from the API server for X amount of time (it seems like this
| might be 20 minutes?), and once the 20 minutes expire coredns
| queries the api server, receives no response, and then fails?
|
| I think what you're describing is serving cached responses
| indefinitely when the api server is unreachable, but I'm not
| sure if this is the default. (and it probably has other
| tradeoffs that I'm not sure about too)
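
The two behaviors contrasted in the comment above can be sketched as a small TTL cache with a switch for serving stale entries. This is an illustration only, not CoreDNS code: `TTLCache`, `UpstreamDown`, and the `serve_stale` parameter are hypothetical names, and the 20-minute figure is the commenter's guess.

```python
import time
from typing import Callable, Dict, Tuple


class UpstreamDown(Exception):
    """Raised by the (hypothetical) upstream lookup when it cannot answer."""


class TTLCache:
    def __init__(
        self,
        lookup: Callable[[str], str],
        ttl: float,
        serve_stale: bool = False,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._lookup = lookup
        self._ttl = ttl
        self._serve_stale = serve_stale
        self._clock = clock
        # name -> (value, time the value was stored)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def get(self, name: str) -> str:
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]  # fresh hit, upstream is not contacted
        try:
            value = self._lookup(name)
        except UpstreamDown:
            # Expired or missing entry and upstream is down: either fall
            # back to the old value or surface the failure to the caller.
            if self._serve_stale and entry is not None:
                return entry[0]
            raise
        self._entries[name] = (value, now)
        return value
```

With `serve_stale=False` the cache behaves as in the first paragraph of the comment: it answers until the TTL runs out, then fails once upstream is gone. With `serve_stale=True` it keeps returning the last known value.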
| dilyevsky wrote:
| Based on my understanding of the plugin code it _is_ the
| default. The way cache.Indexer works is that it's continuously
| streaming resources from APIServer using the Watch API and
| updating an internal map. I think if the Watch API is down it
| just sits there and doesn't purge anything, but I haven't
| tested that. The 20 min expiry is probably referring to the
| CoreDNS _cache_ stanza, which is a separate plugin[0].
|
| [0] - https://coredns.io/plugins/cache
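
A toy model of the watch-fed map described in the comment above, assuming (as the commenter does, untested) that a broken watch stream leaves existing records in place. This is not client-go or CoreDNS code; `WatchStore` and the `(kind, name, address)` event tuple are hypothetical simplifications, with event kinds named after the Kubernetes watch event types.

```python
from typing import Dict, Iterable, Optional, Tuple


class WatchStore:
    """In-memory map updated only by watch events, with no expiry."""

    def __init__(self) -> None:
        self._records: Dict[str, str] = {}

    def consume(self, events: Iterable[Tuple[str, str, str]]) -> None:
        """Apply (kind, name, address) events until the stream ends or breaks."""
        try:
            for kind, name, address in events:
                if kind in ("ADDED", "MODIFIED"):
                    self._records[name] = address
                elif kind == "DELETED":
                    self._records.pop(name, None)
        except ConnectionError:
            # The stream broke mid-watch. Nothing is purged: the map keeps
            # whatever it had, and a caller could retry the watch later.
            pass

    def lookup(self, name: str) -> Optional[str]:
        # Reads come from the local map only, so they may be stale.
        return self._records.get(name)
```

In this model there is no timer at all, so records only disappear on an explicit `DELETED` event or when the process holding the map restarts.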
| ilaksh wrote:
| Wow, sounds like a nightmare. Operations staff definitely have
| real jobs.
___________________________________________________________________
(page generated 2024-12-16 23:00 UTC)