[HN Gopher] How We Found 7 TiB of Memory Just Sitting Around
___________________________________________________________________
How We Found 7 TiB of Memory Just Sitting Around
Author : anurag
Score : 194 points
Date : 2025-10-30 18:25 UTC (2 days ago)
(HTM) web link (render.com)
(TXT) w3m dump (render.com)
| shanemhansen wrote:
| The unreasonable effectiveness of profiling and digging deep
| strikes again.
| hinkley wrote:
| The biggest tool in the performance toolbox is stubbornness.
| Without it all the mechanical sympathy in the world will go
| unexploited.
|
| There's about a factor of 3 improvement that can be made to
| most code after the profiler has given up. That probably means
| there are better profilers that could be written, but in 20
| years of having them I've only seen 2 that tried. Sadly I think
| flame graphs made profiling more accessible to the unmotivated
| but didn't actually improve overall results.
| zahlman wrote:
| > The biggest tool in the performance toolbox is
| stubbornness. Without it all the mechanical sympathy in the
| world will go unexploited.
|
| The sympathy is also needed. Problems aren't found when
| people don't care, or consider the current performance
| acceptable.
|
| > There's about a factor of 3 improvement that can be made to
| most code after the profiler has given up. That probably
| means there are better profilers that could be written, but
| in 20 years of having them I've only seen 2 that tried.
|
| It's hard for profilers to identify slowdowns that are due to
| the architecture. Making the function do less work to get its
| result feels different from determining that the function's
| result is unnecessary.
| hinkley wrote:
| Architecture, cache eviction, memory bandwidth, thermal
| throttling.
|
| All of which have gotten perhaps an order of magnitude
| worse in the time since I started on this theory.
| hinkley wrote:
| And Amdahl's Law. Perf charts will complain about how
| much CPU you're burning in the parallel parts of code and
| ignore that the bottleneck is down in 8% of the code that
| can't be made concurrent.
| zahlman wrote:
| I meant architecture _of the codebase_, to be clear.
| (I'm sure that the increasing complexity of hardware
| architecture makes it harder to figure out how to write
| optimal code, but it isn't really degrading the
| performance of naive attempts, is it?)
| hinkley wrote:
| The problem Windows had during its time of fame is that the
| developers always had the fastest machines money could buy.
| That shortened the code-build-test cycle for them, but it
| also made it difficult for the developers to visualize how
| their code would run on normal hardware. Add the general
| lack of empathy inspired by a toxic corporate culture of "we
| are the best in the world" and it's small wonder that Windows
| 95 and 98 ran more and more like dogshit on older hardware.
|
| My first job out of college, I got handed the slowest
| machine they had. The app was already half done and was
| dogshit slow even with small data sets. I was embarrassed
| to think my name would be associated with it. The UI
| painted so slowly I could watch the individual lines
| paint on my screen.
|
| My friend and I in college had made homework into a game
| of seeing who could make their homework assignment run
| faster or use less memory, such as calculating the
| Fibonacci of 100, or 1000. So I just started applying
| those skills and learning new ones.
|
| For weeks I evaluated improvements to the code by saying
| "one Mississippi, two Mississippi". Then by how many
| syllables I got through. Then with the stopwatch function
| on my watch. No profilers, no benchmarking tools, just code
| review.
|
| And that's how my first specialization became
| optimization.
| Negitivefrags wrote:
| I think the biggest tool is higher expectations. Most
| programmers really haven't come to grips with the idea that
| computers are fast.
|
| If you see a database query that takes 1 hour to run, and
| only touches a few GB of data, you should be thinking "Well,
| NVMe bandwidth is multiple gigabytes per second, why can't it
| run in 1 second or less?"
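|
| A rough sketch of that arithmetic in Go (the data size and
| NVMe rate here are assumed round numbers, not measurements):
|
|     package main
|
|     import "fmt"
|
|     func main() {
|         const (
|             dataGB       = 4.0 // "a few GB of data"
|             nvmeGBPerSec = 3.5 // typical NVMe sequential read rate
|         )
|         // Even a full sequential scan finishes in about a second.
|         fmt.Printf("full scan: %.1f s\n", dataGB/nvmeGBPerSec)
|     }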
|
| The idea that anyone would accept a request to a website
| taking longer than 30ms (the time it takes for a game to
| render its entire world, including both the CPU and GPU
| parts, at 60fps) is insane, and nobody should really accept
| it, but we commonly do.
| javier2 wrote:
| It's also about cost. My gaming computer has 8 cores + 1
| expensive GPU + 32GB RAM for me alone. We don't have that
| per customer.
| avidiax wrote:
| It's also about revenue.
|
| Uber could run the complete global rider/driver flow from
| a single server.
|
| It doesn't, in part because all of those individual trips
| earn $1 or more each, so it's perfectly acceptable to the
| business to be far more inefficient and use hundreds of
| servers for this task.
|
| Similarly, a small website taking 150ms to render the
| page only matters if the lost productivity costs more
| than the engineering time to fix it, and even then, only
| makes sense if that engineering time isn't more
| productively used to add features or reliability.
| onethumb wrote:
| Uber could not run the complete global rider/driver flow
| from a single server.
| exe34 wrote:
| I believe the argument was that somebody competent could
| do it.
| avidiax wrote:
| I'm saying you can keep track of all the riders and
| drivers, matchmake, start/progress/complete trips, with a
| single server, for the entire world.
|
| Billing, serving assets like map tiles, etc. not
| included.
|
| Some key things to understand:
|
| * The scale of Uber is not that high. A big city surely
| has < 10,000 drivers simultaneously, probably less than
| 1,000.
|
| * The driver and rider phones participate in the state
| keeping. They send updates every 4 seconds, but they only
| have to be online to start a trip. Both mobiles cache a
| trip log that gets uploaded when network is available.
|
| * Since driver/rider send updates every 4 seconds, and
| since you don't need to be online to continue or end a
| trip, you don't even need an active spare for the server.
| A hot spare can rebuild the world state in 4 seconds.
| State for a rider and driver is just a few bytes each for
| id, position and status.
|
| * Since you'll have the rider and driver trip logs from
| their phones, you don't necessarily have to log the ride
| server side either. It's also OK to lose a little data on
| the server. You can use UDP.
|
| Don't forget that in the olden times, all the taxis in a
| city like New York were dispatched by humans. All the
| police in the city were dispatched by humans. You can
| replace a building of dispatchers with a good server and
| mobile hardware working together.
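|
| A back-of-envelope sketch in Go (the fleet size and per-entity
| state are assumptions, just to show the orders of magnitude):
|
|     package main
|
|     import "fmt"
|
|     func main() {
|         const (
|             online     = 5_000_000 // assumed concurrent riders+drivers, worldwide
|             stateBytes = 32        // id, position, status
|             period     = 4.0       // seconds between phone updates
|         )
|         // Hot state fits easily in one machine's RAM, and the
|         // update stream is small fixed-size packets.
|         fmt.Printf("hot state: %d MB\n", online*stateBytes/1_000_000)
|         fmt.Printf("updates/sec: %.0f\n", online/period)
|     }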
| hinkley wrote:
| You could envision a system that used one server per
| county, and that's ~3k servers. Combine rural counties to
| get that down to 1,000, and that's probably fewer servers
| than Uber runs.
|
| What the internet tells me is that Uber has 4,500
| distinct services, which is more services than there are
| counties in the US.
| hinkley wrote:
| Practically, you have to parcel out points of contention
| to a larger and larger team to stop them from spending 30
| hours a week just coordinating changes to the servers. So
| the servers divide to follow Conway's Law, or the company
| goes bankrupt (why not both?).
|
| Microservices try to fix that. But then you need bin
| packing, so microservices beget Kubernetes.
| oivey wrote:
| This is again a problem of understanding that computers are
| fast. A toaster can run an old 3D game like Quake at
| hundreds of FPS. A website primarily displaying text
| should be way faster. The reasons websites often aren't
| have nothing to do with the user's computer.
| paulryanrogers wrote:
| That's a dedicated toaster serving only one client.
| Websites usually aren't backed by bare metal per visitor.
| oivey wrote:
| Right. I'm replying to someone talking about their
| personal computer.
| Aeolun wrote:
| If your website takes less than 16ms to serve, you can
| serve 60 customers per second with that. So you sorta do
| have it per customer?
| vlovich123 wrote:
| That's per core assuming the 16ms is CPU bound activity
| (so 100 cores would serve 100 customers). If it's I/O you
| can overlap a lot of customers since a single core could
| easily keep track of thousands of in flight requests.
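|
| A sketch of that distinction via Little's law (throughput =
| concurrency / service time; the numbers are assumptions):
|
|     package main
|
|     import "fmt"
|
|     func main() {
|         const serviceTime = 0.016 // 16 ms per request
|         // CPU-bound: concurrency is pinned to the core count.
|         fmt.Printf("8 cores, CPU-bound: %.0f req/s\n", 8/serviceTime)
|         // I/O-bound: one core can park thousands of waiting
|         // requests, so concurrency is bounded by memory instead.
|         fmt.Printf("1 core, 5k in flight: %.0f req/s\n", 5000/serviceTime)
|     }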
| OJFord wrote:
| With a latency of up to 984ms
| javier2 wrote:
| I'm just saying that we don't have gaming PC specs per
| customer to chug through that 7GB of data for every request
| in 30ms
| azornathogron wrote:
| Pedantic nit: at 60 fps the per-frame time is 16.66... ms,
| not 30 ms. Having said that, a lot of games run at 30 fps,
| or run different parts of their logic at different
| frequencies, or do other tricks that mean there isn't
| exactly one FPS rate that the thing is running at.
| Negitivefrags wrote:
| The CPU part happens on one frame, the GPU part happens
| on the next frame. If you want to talk about the total
| time for a game to render a frame, it needs to count two
| frames.
| wizzwizz4 wrote:
| Computers are fast. Why do you accept a frame of lag? The
| average game for a PC from the 1980s ran with less lag
| than that. Super Mario Bros had _less_ than a frame
| between controller input and character movement on the
| screen. (Technically, it _could_ be more than a frame,
| but only if there were enough objects in play that the
| processor couldn't handle all the physics updates in
| time and missed the v-blank interval.)
| Negitivefrags wrote:
| If Vsync is on (which was my assumption in my previous
| comment), then if your computer is fast enough you might
| be able to run the CPU and GPU work entirely in a single
| frame by using Reflex to delay when simulation starts to
| lower latency. But regardless, you still have a total
| time budget of 1/30th of a second to do all your combined
| CPU and GPU work to get to 60fps.
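|
| The frame-budget arithmetic from this subthread, sketched in
| Go (assuming a pipelined CPU->GPU renderer with vsync on):
|
|     package main
|
|     import "fmt"
|
|     func main() {
|         const fps = 60.0
|         frame := 1000.0 / fps // 16.67 ms per displayed frame
|         // The CPU simulates frame N while the GPU renders frame
|         // N-1, so each stage gets a full frame period and the
|         // combined work per frame spans two periods.
|         fmt.Printf("per-stage budget: %.2f ms\n", frame)
|         fmt.Printf("combined budget:  %.2f ms\n", 2*frame)
|     }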
| azornathogron wrote:
| If latency of input->visible effect is what you're
| talking about, then yes, that's a great point!
| hinkley wrote:
| Lowered expectations come in part from people giving up
| on theirs. Accepting versus pushing back.
| antonymoose wrote:
| I have high hopes and expectations; unfortunately my
| chain of command does not, and is often an immovable
| object.
| hinkley wrote:
| This is a terrible time to tell someone to find a movable
| object in another part of the org or elsewhere. :/
|
| I always liked Shaw's "The reasonable man adapts himself
| to the world: the unreasonable one persists in trying to
| adapt the world to himself. Therefore all progress
| depends on the unreasonable man."
| mjevans wrote:
| 30ms for a website is a tough bar to clear considering the
| speed of light (or rather electrons in copper / light in
| fiber).
|
| https://en.wikipedia.org/wiki/Speed_of_light
|
| Just as an example, round-trip delay from where I rent to
| the local backbone is about 14ms alone, and to the average
| webserver it's 53ms, just for a simple echo reply. (I picked
| this one because I'd hoped it was in Redmond or some nearby
| datacenter, but it looks more likely to be in a cheaper
| labor area.)
|
| However it's only the bloated ECMAScript (javascript) trash
| web of today that makes a website take longer than ~1
| second to load on a modern PC. Plain old HTML, images on a
| reasonable diet, and some script elements only for
| interactive things can scream.
|
|     mtr -bzw microsoft.com
|      6. AS7922 be-36131-cs03.seattle.wa.ibone.comcast.net (2001:558:3:942::1)  0.0%  10  12.9  13.9  11.5  18.7   2.6
|      7. AS7922 be-2311-pe11.seattle.wa.ibone.comcast.net (2001:558:3:3a::2)    0.0%  10  11.8  13.3  10.6  17.2   2.4
|      8. AS7922 2001:559:0:80::101e                                             0.0%  10  15.2  20.7  10.7  60.0  17.3
|      9. AS8075 ae25-0.icr02.mwh01.ntwk.msn.net (2a01:111:2000:2:8000::b9a)     0.0%  10  41.1  23.7  14.8  41.9  10.4
|     10. AS8075 be140.ibr03.mwh01.ntwk.msn.net (2603:1060:0:12::f18e)           0.0%  10  53.1  53.1  50.2  57.4   2.1
|     11. AS8075 2603:1060:0:10::f536                                            0.0%  10  82.1  55.7  50.5  82.1   9.7
|     12. AS8075 2603:1060:0:10::f3b1                                            0.0%  10  54.4  96.6  50.4 147.4  32.5
|     13. AS8075 2603:1060:0:10::f51a                                            0.0%  10  49.7  55.3  49.7  78.4   8.3
|     14. AS8075 2a01:111:201:f200::d9d                                          0.0%  10  52.7  53.2  50.2  58.1   2.7
|     15. AS8075 2a01:111:2000:6::4a51                                           0.0%  10  49.4  51.6  49.4  54.1   1.7
|     20. AS8075 2603:1030:b:3::152                                              0.0%  10  50.7  53.4  49.2  60.7   4.2
| hinkley wrote:
| In the cloud era this gets a bit better, but at my last job
| I removed a single service that was adding 30ms to response
| time and replaced it with a Consul lookup with a watch on
| it. It wasn't even a big service. Same DC, very simple
| graph query with a very small response. You can burn
| through 30ms without half trying.
| jesse__ wrote:
| Broadly agree.
|
| I'm curious, what're the profilers you know of that tried to
| be better? I have a little homebrew game engine with an
| integrated profiler that I'm always looking for ideas to make
| more effective.
| hinkley wrote:
| Clinic.js tried and lost steam. I have a recollection of a
| profiler called JProfiler that represented space and time
| as a graph, but also a recollection they went under. And
| there is a company selling a product of that name that has
| been around since that time, but doesn't quite look how I
| recalled and so I don't know if I was mistaken about their
| demise or I've swapped product names in my brain. It was 20
| years ago which is a long time for mush to happen.
|
| The common element between attempts is new visualizations.
| And like drawing a projection of an object in a mechanical
| engineering drawing, there is no one projection that
| contains the entire description of the problem. You need to
| present several and let the brain synthesize the data missing
| in each individual projection into an accurate model.
| never_inline wrote:
| what do you think about speedscope's sandwich view?
| hinkley wrote:
| More of the same. JetBrains has an equivalent, though it
| seems to be broken at present. The sandwich keeps
| dragging you back to the flame graph. Call stack depth
| has value but width is harder for people to judge and
| it's the wrong yardstick for many of the concerns I've
| mentioned in the rest of this thread.
|
| The sandwich view hides invocation count, which is one of
| the biggest things you need to look at for that remaining
| 3x.
|
| Also you need to think about budgets. Which is something
| game designers do and the rest of us ignore. Do I want
| 10% of overall processing time to be spent accessing
| reloadable config? Reporting stats? If the answer is no
| then we need to look at that, even if data retrieval is
| currently 40% of overall response time and we are trying
| to get from 2 seconds to 200 ms.
|
| That means config and stats have a budget of 20ms each
| and you will never hit 200ms if someone doesn't look at
| them. So you can pretend like they don't exist until you
| get all the other tent poles chopped and then surprise
| pikachu face when you've already painted them into a
| corner with your other changes.
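|
| The budget arithmetic above, made explicit in Go (the split
| is an assumed example, not a prescription):
|
|     package main
|
|     import "fmt"
|
|     func main() {
|         const target = 200.0 // ms, the end-to-end goal
|         shares := map[string]float64{ // fraction of target allowed
|             "reloadable config": 0.10,
|             "stats reporting":   0.10,
|         }
|         for name, share := range shares {
|             fmt.Printf("%s budget: %.0f ms\n", name, share*target)
|         }
|     }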
|
| When we have a lot of shit that all needs to get done,
| you want to get to transparency, look at the pile and
| figure out how to do it all effectively. Combine errands
| and spread the stressful bits out over time. None of the
| tools and none of the literature supports this exercise,
| and in fact most of the literature is actively hostile to
| this exercise. Which is why you should read a certain
| level of reproval or even contempt in my writing about
| optimization. It's very much intended.
|
| Most advice on writing fast code has not materially
| changed for a time period where the number of
| calculations we do has increased by 5 orders of
| magnitude. In every other domain, we re-evaluate our
| solutions at each order of magnitude. We have marched
| past ignorant and into insane at this point. We are
| broken and we have been broken for twenty years.
| nitinreddy88 wrote:
| The other way to look at it is why adding the NS label causes
| so much memory footprint in Kubernetes. Shouldn't we be fixing
| that (could be a much bigger design change)? It would benefit
| the whole Kube community.
| bstack wrote:
| Author here: yeah, that's a good point. tbh I was mostly
| unfamiliar with Vector so I took the shortest path to the goal,
| but that could be an interesting follow-up. It does seem like
| there's a lot of bytes per namespace!
| stackskipton wrote:
| You mentioned in the blog article that it's doing a list-watch.
| A list-watch registers with the Kubernetes API to get a list of
| all objects AND a notification when any object of the kind you
| registered for changes. A bunch of Vector pods saying "Hey,
| send me a notification when anything with namespaces changes"
| and _poof_ goes your memory keeping track of who needs to know
| what.
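|
| A minimal sketch of that list-watch pattern using client-go
| (hypothetical agent code, not Vector's actual implementation):
|
|     package main
|
|     import (
|         "time"
|
|         "k8s.io/client-go/informers"
|         "k8s.io/client-go/kubernetes"
|         "k8s.io/client-go/rest"
|     )
|
|     func main() {
|         cfg, err := rest.InClusterConfig()
|         if err != nil {
|             panic(err)
|         }
|         client := kubernetes.NewForConfigOrDie(cfg)
|
|         // An initial LIST of every namespace, then a WATCH for
|         // changes. The full object set is cached in this
|         // process's memory, once per pod running this agent.
|         factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
|         nsInformer := factory.Core().V1().Namespaces().Informer()
|
|         stop := make(chan struct{})
|         factory.Start(stop)
|         factory.WaitForCacheSync(stop)
|         _ = nsInformer.GetStore().List() // every Namespace, held locally
|         <-stop
|     }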
|
| At this point, I wonder if, instead of relying on daemonsets,
| you just gave every namespace a Vector instance responsible
| for that namespace and the pods within it. ElasticSearch
| or whatever you pipe logging data to might not be happy with
| all those TCP connections.
|
| Just my SRE brain thoughts.
| fells wrote:
| > you just gave every namespace a Vector instance responsible
| for that namespace and the pods within it.
|
| Vector is a daemonset because it needs to tail the log
| files on each node. A single Vector per namespace might not
| reside on the nodes that each pod is on.
| stackskipton wrote:
| I think the DaemonSet is to reduce network load, so Vector is
| not pulling log files over the network.
|
| We run Vector as a DaemonSet as well, but we don't have a
| ton of namespaces. Render sounds like they have a ton of
| namespaces running maybe one or two pods each, since their
| customers are much smaller. This is probably a much more
| niche setup than for many users of Kubernetes.
| ahoka wrote:
| That's where the design is wrong.
| hinkley wrote:
| Keys require O(log n) space per key, or O(n log n) for the
| entire data set, simply to avoid key collisions. But
| human-friendly key spaces grow much, much faster, and I don't
| think many people have looked too hard at that.
|
| There were recent changes to the NodeJS Prometheus client that
| eliminate tag names from the keys used for storing the tag
| cardinality for metrics. The memory savings wasn't reported,
| but the CPU savings for recording data points was over 1/3,
| and about twice that when applied to the aggregation logic.
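|
| The general idea, sketched in Go (hypothetical key formats,
| not the actual prom-client change): with a fixed label order
| stored once per metric, a series key only needs the values.
|
|     package main
|
|     import (
|         "fmt"
|         "strings"
|     )
|
|     func main() {
|         // One series: http_requests_total{method="GET",status="200"}
|         values := []string{"GET", "200"} // label names live elsewhere
|
|         verbose := `http_requests_total{method="GET",status="200"}`
|         compact := "http_requests_total;" + strings.Join(values, ";")
|
|         // Shorter keys mean less memory and cheaper hashing on
|         // every data point recorded.
|         fmt.Println(len(verbose), verbose)
|         fmt.Println(len(compact), compact)
|     }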
|
| Lookups are rarely O(1), even in hash tables.
|
| I wonder if there's a general solution for keeping names concise
| without triggering transposition or reading comprehension errors.
| And what the space complexity is of such an algorithm.
| vlovich123 wrote:
| Why aren't they just 128-bit UUIDs? Those are guaranteed to be
| globally unique and don't require so much space.
| hinkley wrote:
| Why aren't what 128-bit UUIDs?
|
| > keeping names concise without triggering transposition or
| reading comprehension errors.
|
| Code that doesn't work for developers first will soon cease
| to work for anyone. Plus, how do you look up a UUID for a set
| of tags? What's your perfect-hash plan to make sure you don't
| misattribute stats to the wrong place?
|
| UUIDs are entirely opaque and difficult to tell apart
| consistently.
| Aeolun wrote:
| I read this and I have to wonder: did anyone ever think it was
| reasonable that a cluster that apparently needed only 120GB of
| memory was consuming 1.2TB just for logging (or whatever Vector
| does)?
| bstack wrote:
| Author here: You'd be surprised what you don't notice given
| enough nodes and slow enough resource growth over time! Even
| at its high-water mark, this daemonset was still a small
| portion of the total resource usage in these clusters.
| fock wrote:
| how large are the clusters then?
| Aeolun wrote:
| I'm not sure if that makes it better or worse.
| embedding-shape wrote:
| I didn't know what Render was when I skimmed the article at
| first, but after reading these comments, I had to check out
| what they do.
|
| And they're a "Cloud Application Platform", meaning they
| manage deploys and infrastructure for _other people_. Their
| website says "Click, click, done." which is cool and quick
| and all, but to me it's kind of crazy that an organization
| that should be really engineering-focused and mature doesn't
| immediately notice 1.2TB being used and try to figure out
| why, when 120GB ended up being sufficient.
|
| It gives much more of a "We're a startup, we're learning as
| we're running" vibe, which again, cool and all, but is hardly
| what people should use for hosting their own stuff.
| antoniojtorres wrote:
| It seems realistic to me, commonplace even. Lots to do in a
| company like this one.
| devjab wrote:
| We're a much smaller-scale company, and the cost we lose on
| these things is insignificant compared to what's in this story.
| Yesterday I was improving the process for creating databases in
| our Azure and I stumbled upon a subscription which was running
| 7 MSSQL servers for 12 databases. These weren't elastic, and
| each was paying a license that we don't have to pay because we
| qualify for the base cost through our contract with our
| Microsoft partner. This company has some of the tightest
| control over their cloud infrastructure of any organisation
| I've worked with.
|
| This is anecdotal, but if my experiences aren't unique, then
| there is a lot of unreasonableness in DevOps.
| ffsm8 wrote:
| Isn't that mostly down to the fact that the vast majority of
| devs explicitly don't want to do anything wrt Ops?
|
| DevOps has - ever since its originally well-meaning
| inception (by Netflix iirc?) - been implemented across our
| industry as an effective cost-cutting measure, forcing devs
| that didn't see it as their job to _also_ handle it.
|
| Which consequently means they're not interfacing with it
| whatsoever. They do as little as they can get away with,
| which inevitably means things are being done with borderline
| malicious compliance... or just complete incompetence.
|
| I'm not even sure I'd blame these devs in particular. The
| devs just saw it as a quick bonus generator for the MBA in
| charge of this rebranding, while it offloaded more
| responsibilities onto their shoulders.
|
| DevOps made total sense in the work culture where this
| concept was conceived - Netflix was well known at that point
| to only ever employ senior devs. However, in the context of
| the average 9-5 dev, who often knows a lot less than even
| some enthusiastic juniors... let's just say that it's
| incredibly dicey whether it's successful in practice.
| mustyoshi wrote:
| I politely disagree. I spent maybe 8 hours over a week
| rightsizing a handful of heavy deployments from a previous
| team and reduced their peak resource usage by implementing
| better scaling policies. Before the new scaling policy, the
| service would quite frequently scale out, and the new pods
| would remain idle and ultimately get terminated without ever
| responding to a request.
|
| The service dashboards already existed; all I had to do was
| a bit of load testing and read the graphs.
|
| It's not too much extra work to make sure you're scaling
| efficiently.
| ffsm8 wrote:
| You disagree, but then cite another example of low-hanging
| fruit that nobody took action on until you came along?
|
| Did you accidentally respond to the wrong comment?
| Because if anything you're giving another example of
| "most devs not wanting to interface with ops, hence
| letting it slide until someone bothers to pick up their
| slack"...
| FroshKiller wrote:
| The first time my director asked me if I'd ever heard of
| DevOps, I said, "Sure, doing two jobs for one paycheck."
| I'm a software developer, buddy. I write the programs.
| Leave me out of running them.
| fock wrote:
| We have on-prem with heavy spikes (our batch workload can
| easily utilize the 20TB of memory in the cluster) and we just
| don't care much and add 10% every year to the hardware
| requested. Compared to employing people or paying other vendors
| (relational databases with many TB-sized tables...) this is
| just irrelevant.
|
| Sadly, devs are incentivized by that, and going toward the
| cloud might be a fun story. Given the environment, I hope they
| scrap the effort sooner rather than later, buy some Oxide
| systems for the people who need to iterate faster than the
| usual process of getting a VM, and replace/reuse the 10% of
| the company occupied with the cloud (mind you: no real
| workload runs there yet...) to actually improve local
| processes...
| g-mork wrote:
| Somewhat unrelated, but you just tied wasteful software
| design to high IT salaries, and also suggested a reason why
| Russian programmers might, on the whole, seem far more
| effective than we are.
|
| I wonder: if MSFT had simply cut dev salaries by 50% in the
| 90s, would it have had any measurable effect on Windows
| quality by today?
| formerly_proven wrote:
| It probably doesn't help that the first line of treatment for
| any error is to blindly increase memory request/limit and claim
| it's fixed (preferably without looking at the logs once).
| liampulles wrote:
| I'm a little surprised that it got to the point where pods which
| should consume a couple MB of RAM were consuming 4GB before
| action was taken. But I can also kind of understand it, because
| the way k8s operators (apps running in k8s that manipulate k8s
| resources) are meant to run is essentially a loop of listing
| resources, comparing them to spec, and making moves to bring
| the state of the cluster closer to spec (sketched below). This
| reconciliation loop is simple to understand (and I think this
| benefit has led to the creation of a wide array of excellent
| open-source and proprietary operators that can be added to
| clusters). But it's also a recipe for cascading explosions in
| resource usage.
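|
| A minimal sketch of that reconciliation loop in Go
| (hypothetical types, not a real operator):
|
|     package main
|
|     import "time"
|
|     type state struct{ replicas int }
|
|     // reconcile compares spec to reality and returns the moves
|     // needed to converge.
|     func reconcile(desired, actual state) (moves []string) {
|         for ; actual.replicas < desired.replicas; actual.replicas++ {
|             moves = append(moves, "create pod")
|         }
|         for ; actual.replicas > desired.replicas; actual.replicas-- {
|             moves = append(moves, "delete pod")
|         }
|         return moves
|     }
|
|     func main() {
|         for {
|             desired := state{replicas: 3} // from the spec
|             actual := state{replicas: 1}  // from listing live resources
|             _ = reconcile(desired, actual)
|             // Each iteration re-lists: cheap for one operator, a
|             // cascading load when dozens poll the API server.
|             time.Sleep(30 * time.Second)
|         }
|     }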
|
| These kinds of resource explosions are something I see all the
| time in k8s clusters. The general advice is to always try to
| keep pressure off the k8s API, and the consequence is that one
| must be very minimal and tactical with the operators one
| installs, and then engage in many hours of work fine-tuning
| each operator to run efficiently (e.g. Grafana, whose default
| Helm settings do not use the recommended log indexing
| algorithm, and which needs to be tweaked to get an appropriate
| set of read vs. write pods for your situation).
|
| Again, I recognize there is a tradeoff here - the simplicity and
| openness of the k8s API is what has led to a flourishing of new
| operators, which really has allowed one to run "their own
| cloud". But there is definitely a cost. I don't know what the
| solution is, and I'm curious to hear from people who have other
| views of it, or who use alternatives to k8s that offer a
| different set of tradeoffs.
| never_inline wrote:
| > are meant to run is essentially a loop of listing resources,
| comparing to spec, and making moves to try and bring the state
| of the cluster closer to spec.
|
| Aren't they supposed to use watch/long polling?
___________________________________________________________________
(page generated 2025-11-01 23:01 UTC)