[HN Gopher] Quick takes on the recent OpenAI public incident write-up
       ___________________________________________________________________
        
       Quick takes on the recent OpenAI public incident write-up
        
       Author : azhenley
       Score  : 36 points
       Date   : 2024-12-15 06:01 UTC (1 day ago)
        
 (HTM) web link (surfingcomplexity.blog)
 (TXT) w3m dump (surfingcomplexity.blog)
        
       | dang wrote:
       | Recent and related:
       | 
       |  _ChatGPT Down_ - https://news.ycombinator.com/item?id=42394391 -
       | Dec 2024 (30 comments)
        
       | JohnMakin wrote:
        | I caused an API server outage once with a monitoring tool,
        | though in my case it was a monstrosity of a 20,000-line
        | script. We quickly realized what we had done and turned it
        | off. In very large clusters (1000+ nodes) I have seen that
        | you need to be especially careful about how much API server
        | resource your monitoring consumes, depending on what
        | precisely you are doing. Surprised they hadn't learned this
        | lesson yet, given the likely scale of their workloads.
        
       | jimmyl02 wrote:
       | splitting the control and data plane is a great way to improve
       | resilience and prevent everything from being hard down. I wonder
       | how it could be accomplished with service discovery / routing.
       | 
        | maybe instead of relying on kubernetes DNS for discovery it
        | could be closer to something like envoy: the control plane
        | updates configs that are stored locally (and are eventually
        | consistent), so even if the control plane dies the data
        | plane still has the location information of its peer
        | clusters.
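        | 
        | something like this, roughly (all names and URLs made up,
        | just to sketch the shape of "serve lookups from a local
        | snapshot, let the control plane refresh it when it can"):
        | 
        |   // Not a real system, just the pattern: the data plane
        |   // answers lookups from a local, eventually consistent
        |   // snapshot; if the control plane is unreachable, the old
        |   // snapshot keeps serving instead of discovery going dark.
        |   package main
        | 
        |   import (
        |     "encoding/json"
        |     "net/http"
        |     "os"
        |     "sync"
        |     "time"
        |   )
        | 
        |   type Endpoints map[string][]string // service -> addresses
        | 
        |   type Resolver struct {
        |     mu   sync.RWMutex
        |     snap Endpoints
        |   }
        | 
        |   // Lookup never talks to the control plane.
        |   func (r *Resolver) Lookup(service string) []string {
        |     r.mu.RLock()
        |     defer r.mu.RUnlock()
        |     return r.snap[service]
        |   }
        | 
        |   // refresh polls the control plane; on success it swaps and
        |   // persists the snapshot, on failure it simply returns and
        |   // the stale snapshot keeps serving.
        |   func (r *Resolver) refresh(url, path string) {
        |     resp, err := http.Get(url)
        |     if err != nil {
        |       return // control plane unreachable: keep stale data
        |     }
        |     defer resp.Body.Close()
        |     var eps Endpoints
        |     if json.NewDecoder(resp.Body).Decode(&eps) != nil {
        |       return
        |     }
        |     r.mu.Lock()
        |     r.snap = eps
        |     r.mu.Unlock()
        |     if b, err := json.Marshal(eps); err == nil {
        |       os.WriteFile(path, b, 0o644) // survive restarts too
        |     }
        |   }
        | 
        |   func main() {
        |     r := &Resolver{snap: Endpoints{}}
        |     // Start from the last snapshot we managed to write.
        |     if b, err := os.ReadFile("/var/run/peers.json"); err == nil {
        |       json.Unmarshal(b, &r.snap)
        |     }
        |     for {
        |       r.refresh("http://control-plane.internal/endpoints",
        |         "/var/run/peers.json")
        |       _ = r.Lookup("payments") // answered from local state
        |       time.Sleep(30 * time.Second)
        |     }
        |   }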
        
       | dilyevsky wrote:
        | Something doesn't add up - CoreDNS's kubernetes plugin
        | should be serving Service RRs from its internal cache even
        | if the APIServer is down, because it uses cache.Indexer. The
        | records would be stale, but unless all their application
        | pods restarted (which they could not, since the APIServer
        | was down) or all the CoreDNS pods restarted (which, again,
        | they could not), records merely expiring from the cache
        | shouldn't have caused a full discovery outage.
        
         | jimmyl02 wrote:
          | wouldn't it be that coredns caches the records from the
          | API server for X amount of time (it seems like this might
          | be 20 minutes?), and then once the 20 minutes expired
          | coredns would query the API server, receive no response,
          | and fail?
          | 
          | I think serving cached responses indefinitely when the API
          | server is unreachable is what you're describing, but I'm
          | not sure if that's the default (and it probably has other
          | tradeoffs that I'm not sure about either).
        
           | dilyevsky wrote:
            | Based on my understanding of the plugin code it _is_ the
            | default. The way cache.Indexer works is that it
            | continuously streams resources from the APIServer using
            | the Watch API and updates an internal map. I think if
            | the Watch API is down it just sits there and doesn't
            | purge anything, but I haven't tested that. The 20 min
            | expiry is probably referring to the CoreDNS _cache_
            | stanza, which is a separate plugin[0].
           | 
           | [0] - https://coredns.io/plugins/cache
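            | 
            | Roughly this pattern, if anyone wants to poke at it
            | (kubeconfig path and service name made up; the point is
            | that reads come from the in-process store the Watch
            | feeds, never from the APIServer directly):
            | 
            |   // Simplified sketch of the informer pattern, not
            |   // CoreDNS's actual code: a Watch on the APIServer
            |   // feeds a local store; reads hit only the store, and
            |   // nothing purges it if the Watch connection dies, so
            |   // lookups keep returning the last known (stale) data.
            |   package main
            | 
            |   import (
            |     "fmt"
            |     "time"
            | 
            |     v1 "k8s.io/api/core/v1"
            |     "k8s.io/client-go/informers"
            |     "k8s.io/client-go/kubernetes"
            |     "k8s.io/client-go/tools/cache"
            |     "k8s.io/client-go/tools/clientcmd"
            |   )
            | 
            |   func main() {
            |     cfg, err := clientcmd.BuildConfigFromFlags(
            |       "", "/path/to/kubeconfig") // hypothetical path
            |     if err != nil {
            |       panic(err)
            |     }
            |     client := kubernetes.NewForConfigOrDie(cfg)
            | 
            |     factory := informers.NewSharedInformerFactory(client, 0)
            |     svcInformer := factory.Core().V1().Services().Informer()
            | 
            |     stop := make(chan struct{})
            |     defer close(stop)
            |     factory.Start(stop)
            |     cache.WaitForCacheSync(stop, svcInformer.HasSynced)
            | 
            |     for {
            |       // Entries are only removed by delete events from
            |       // the Watch, not by any TTL.
            |       obj, exists, _ := svcInformer.GetStore().
            |         GetByKey("default/my-service")
            |       if exists {
            |         svc := obj.(*v1.Service)
            |         fmt.Println("cluster IP (maybe stale):",
            |           svc.Spec.ClusterIP)
            |       }
            |       time.Sleep(20 * time.Second)
            |     }
            |   }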
        
       | ilaksh wrote:
       | Wow, sounds like a nightmare. Operations staff definitely have
       | real jobs.
        
       ___________________________________________________________________
       (page generated 2024-12-16 23:00 UTC)