[HN Gopher] How We Found 7 TiB of Memory Just Sitting Around
___________________________________________________________________
How We Found 7 TiB of Memory Just Sitting Around
Author : anurag
Score : 194 points
Date : 2025-10-30 18:25 UTC (2 days ago)
(HTM) web link (render.com)
(TXT) w3m dump (render.com)
| shanemhansen wrote:
| The unreasonable effectiveness of profiling and digging deep
| strikes again.
| hinkley wrote:
| The biggest tool in the performance toolbox is stubbornness.
| Without it all the mechanical sympathy in the world will go
| unexploited.
|
| There's about a factor of 3 improvement that can be made to
| most code after the profiler has given up. That probably means
| there are better profilers that could be written, but in 20
| years of having them I've only seen 2 that tried. Sadly I think
| flame graphs made profiling more accessible to the unmotivated
| but didn't actually improve overall results.
| zahlman wrote:
| > The biggest tool in the performance toolbox is
| stubbornness. Without it all the mechanical sympathy in the
| world will go unexploited.
|
| The sympathy is also needed. Problems aren't found when
| people don't care, or consider the current performance
| acceptable.
|
| > There's about a factor of 3 improvement that can be made to
| most code after the profiler has given up. That probably
| means there are better profilers that could be written, but
| in 20 years of having them I've only seen 2 that tried.
|
| It's hard for profilers to identify slowdowns that are due to
| the architecture. Making the function do less work to get its
| result feels different from determining that the function's
| result is unnecessary.
| hinkley wrote:
| Architecture, cache eviction, memory bandwidth, thermal
| throttling.
|
| All of which have gotten perhaps an order of magnitude
| worse in the time since I started on this theory.
| hinkley wrote:
| And Amdahl's Law. Perf charts will complain about how
| much CPU you're burning in the parallel parts of code and
| ignore that the bottleneck is down in 8% of the code that
| can't be made concurrent.
| zahlman wrote:
| I meant architecture _of the codebase_, to be clear.
| (I'm sure that the increasing complexity of hardware
| architecture makes it harder to figure out how to write
| optimal code, but it isn't really degrading the
| performance of naive attempts, is it?)
| hinkley wrote:
| The problem Windows had during its time of fame is that the
| developers always had the fastest machines money could buy.
| That shortened the code-build-test cycle for them, but it
| also made it difficult for the developers to visualize how
| their code would run on normal hardware. Add the general
| lack of empathy inspired by a toxic corporate culture of "we
| are the best in the world" and it's small wonder that Windows
| 95 and 98 ran more and more like dogshit on older hardware.
|
| My first job out of college, I got handed the slowest
| machine they had. The app was already half done and was
| dogshit slow even with small data sets. I was embarrassed
| to think my name would be associated with it. The UI
| painted so slowly I could watch the individual lines
| paint on my screen.
|
| My friend and I in college had made homework into a game
| of seeing who could make their homework assignment run
| faster or use less memory, such as calculating the
| Fibonacci of 100, or 1000. So I just started applying
| those skills and learning new ones.
|
| For weeks I evaluated improvements to the code by saying
| "one Mississippi, two Mississippi". Then by how many
| syllables I got through. Then with the stopwatch function
| on my watch. No profilers, no benchmarking tools, just code
| review.
|
| And that's how my first specialization became
| optimization.
| Negitivefrags wrote:
| I think the biggest tool is higher expectations. Most
| programmers really haven't come to grips with the idea that
| computers are fast.
|
| If you see a database query that takes 1 hour to run, and
| only touches a few GB of data, you should be thinking "Well,
| NVMe bandwidth is multiple gigabytes per second, why can't it
| run in 1 second or less?"
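|
| A rough sketch of that arithmetic in Go (the data size and
| NVMe rate here are assumed round numbers, not measurements):
|
|     package main
|
|     import "fmt"
|
|     func main() {
|         const (
|             dataGB       = 4.0 // "a few GB of data"
|             nvmeGBPerSec = 3.5 // typical NVMe sequential read rate
|         )
|         // Even a full sequential scan finishes in about a second.
|         fmt.Printf("full scan: %.1f s\n", dataGB/nvmeGBPerSec)
|     }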
|
| The idea that anyone would accept a request to a website
| taking longer than 30ms (the time it takes for a game to
| render its entire world, including both the CPU and GPU
| parts, at 60fps) is insane, and nobody should really accept
| it, but we commonly do.
| javier2 wrote:
| It's also about cost. My gaming computer has 8 cores + 1
| expensive GPU + 32GB RAM for me alone. We don't have that
| per customer.
| avidiax wrote:
| It's also about revenue.
|
| Uber could run the complete global rider/driver flow from
| a single server.
|
| It doesn't, in part because all of those individual trips
| earn $1 or more each, so it's perfectly acceptable to the
| business to be far more inefficient and use hundreds of
| servers for this task.
|
| Similarly, a small website taking 150ms to render the
| page only matters if the lost productivity costs more
| than the engineering time to fix it, and even then, only
| makes sense if that engineering time isn't more
| productively used to add features or reliability.
| onethumb wrote:
| Uber could not run the complete global rider/driver flow
| from a single server.
| exe34 wrote:
| I believe the argument was that somebody competent could
| do it.
| avidiax wrote:
| I'm saying you can keep track of all the riders and
| drivers, matchmake, start/progress/complete trips, with a
| single server, for the entire world.
|
| Billing, serving assets like map tiles, etc. not
| included.
|
| Some key things to understand:
|
| * The scale of Uber is not that high. A big city surely
| has < 10,000 drivers simultaneously, probably less than
| 1,000.
|
| * The driver and rider phones participate in the state
| keeping. They send updates every 4 seconds, but they only
| have to be online to start a trip. Both mobiles cache a
| trip log that gets uploaded when network is available.
|
| * Since driver/rider send updates every 4 seconds, and
| since you don't need to be online to continue or end a
| trip, you don't even need an active spare for the server.
| A hot spare can rebuild the world state in 4 seconds.
| State for a rider and driver is just a few bytes each for
| id, position and status.
|
| * Since you'll have the rider and driver trip logs from
| their phones, you don't necessarily have to log the ride
| server side either. It's also OK to lose a little data on
| the server. You can use UDP.
|
| Don't forget that in the olden times, all the taxis in a
| city like New York were dispatched by humans. All the
| police in the city were dispatched by humans. You can
| replace a building of dispatchers with a good server and
| mobile hardware working together.
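|
| A back-of-envelope sketch in Go (the fleet size and per-entity
| state are assumptions, just to show the orders of magnitude):
|
|     package main
|
|     import "fmt"
|
|     func main() {
|         const (
|             online     = 5_000_000 // assumed concurrent riders+drivers, worldwide
|             stateBytes = 32        // id, position, status
|             period     = 4.0       // seconds between phone updates
|         )
|         // Hot state fits easily in one machine's RAM, and the
|         // update stream is small fixed-size packets.
|         fmt.Printf("hot state: %d MB\n", online*stateBytes/1_000_000)
|         fmt.Printf("updates/sec: %.0f\n", online/period)
|     }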
| hinkley wrote:
| You could envision a system that used one server per
| county, and that's ~3k servers. Combine rural counties to
| get that down to 1,000, and that's probably fewer servers
| than Uber runs.
|
| What the internet tells me is that Uber has 4,500
| distinct services, which is more services than there are
| counties in the US.
| hinkley wrote:
| Practically, you have to parcel out points of contention
| to a larger and larger team to stop them from spending 30
| hours a week just coordinating changes to the servers. So
| the servers divide to follow Conway's Law, or the company
| goes bankrupt (why not both?).
|
| Microservices try to fix that. But then you need bin
| packing, so microservices beget Kubernetes.
| oivey wrote:
| This is again a problem of understanding that computers are
| fast. A toaster can run an old 3D game like Quake at
| hundreds of FPS. A website primarily displaying text
| should be way faster. The reasons websites often aren't
| have nothing to do with the user's computer.
| paulryanrogers wrote:
| That's a dedicated toaster serving only one client.
| Websites usually aren't backed by bare metal per visitor.
| oivey wrote:
| Right. I'm replying to someone talking about their
| personal computer.
| Aeolun wrote:
| If your website takes less than 16ms to serve, you can
| serve 60 customers per second with that. So you sorta do
| have it per customer?
| vlovich123 wrote:
| That's per core assuming the 16ms is CPU bound activity
| (so 100 cores would serve 100 customers). If it's I/O you
| can overlap a lot of customers since a single core could
| easily keep track of thousands of in flight requests.
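|
| A sketch of that distinction via Little's law (throughput =
| concurrency / service time; the numbers are assumptions):
|
|     package main
|
|     import "fmt"
|
|     func main() {
|         const serviceTime = 0.016 // 16 ms per request
|         // CPU-bound: concurrency is pinned to the core count.
|         fmt.Printf("8 cores, CPU-bound: %.0f req/s\n", 8/serviceTime)
|         // I/O-bound: one core can park thousands of waiting
|         // requests, so concurrency is bounded by memory instead.
|         fmt.Printf("1 core, 5k in flight: %.0f req/s\n", 5000/serviceTime)
|     }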
| OJFord wrote:
| With a latency of up to 984ms
| javier2 wrote:
| I'm just saying that we don't have gaming PC specs per
| customer to chug through that 7GB of data for every request
| in 30ms
| azornathogron wrote:
| Pedantic nit: at 60 fps the per-frame time is 16.66... ms,
| not 30 ms. Having said that, a lot of games run at 30 fps,
| or run different parts of their logic at different
| frequencies, or do other tricks that mean there isn't
| exactly one FPS rate that the thing is running at.
| Negitivefrags wrote:
| The CPU part happens on one frame, the GPU part happens
| on the next frame. If you want to talk about the total
| time for a game to render a frame, it needs to count two
| frames.
| wizzwizz4 wrote:
| Computers are fast. Why do you accept a frame of lag? The
| average game for a PC from the 1980s ran with less lag
| than that. Super Mario Bros had _less_ than a frame
| between controller input and character movement on the
| screen. (Technically, it _could_ be more than a frame,
| but only if there were enough objects in play that the
| processor couldn't handle all the physics updates in
| time and missed the v-blank interval.)
| Negitivefrags wrote:
| If Vsync is on (which was my assumption in my previous
| comment), then if your computer is fast enough you might
| be able to run the CPU and GPU work entirely in a single
| frame by using Reflex to delay when simulation starts to
| lower latency. But regardless, you still have a total
| time budget of 1/30th of a second to do all your combined
| CPU and GPU work to get to 60fps.
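|
| The frame-budget arithmetic from this subthread, sketched in
| Go (assuming a pipelined CPU->GPU renderer with vsync on):
|
|     package main
|
|     import "fmt"
|
|     func main() {
|         const fps = 60.0
|         frame := 1000.0 / fps // 16.67 ms per displayed frame
|         // The CPU simulates frame N while the GPU renders frame
|         // N-1, so each stage gets a full frame period and the
|         // combined work per frame spans two periods.
|         fmt.Printf("per-stage budget: %.2f ms\n", frame)
|         fmt.Printf("combined budget:  %.2f ms\n", 2*frame)
|     }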
| azornathogron wrote:
| If latency of input->visible effect is what you're
| talking about, then yes, that's a great point!
| hinkley wrote:
| Lowered expectations come in part from people giving up
| on theirs. Accepting versus pushing back.
| antonymoose wrote:
| I have high hopes and expectations; unfortunately my
| chain of command does not, and is often an immovable
| object.
| hinkley wrote:
| This is a terrible time to tell someone to find a movable
| object in another part of the org or elsewhere. :/
|
| I always liked Shaw's "The reasonable man adapts himself
| to the world: the unreasonable one persists in trying to
| adapt the world to himself. Therefore all progress
| depends on the unreasonable man."
| mjevans wrote:
| 30ms for a website is a tough bar to clear considering the
| speed of light (or rather electrons in copper / light in
| fiber).
|
| https://en.wikipedia.org/wiki/Speed_of_light
|
| Just as an example, round-trip delay from where I rent to
| the local backbone is about 14ms alone, and to the average
| webserver it's 53ms, just for a simple echo reply. (I picked
| this one because I'd hoped it was in Redmond or some nearby
| datacenter, but it looks more likely to be in a cheaper
| labor area.)
|
| However it's only the bloated ECMAScript (javascript) trash
| web of today that makes a website take longer than ~1
| second to load on a modern PC. Plain old HTML, images on a
| reasonable diet, and some script elements only for
| interactive things can scream.
|
|     mtr -bzw microsoft.com
|      6. AS7922 be-36131-cs03.seattle.wa.ibone.comcast.net (2001:558:3:942::1)  0.0%  10  12.9  13.9  11.5  18.7   2.6
|      7. AS7922 be-2311-pe11.seattle.wa.ibone.comcast.net (2001:558:3:3a::2)    0.0%  10  11.8  13.3  10.6  17.2   2.4
|      8. AS7922 2001:559:0:80::101e                                             0.0%  10  15.2  20.7  10.7  60.0  17.3
|      9. AS8075 ae25-0.icr02.mwh01.ntwk.msn.net (2a01:111:2000:2:8000::b9a)     0.0%  10  41.1  23.7  14.8  41.9  10.4
|     10. AS8075 be140.ibr03.mwh01.ntwk.msn.net (2603:1060:0:12::f18e)           0.0%  10  53.1  53.1  50.2  57.4   2.1
|     11. AS8075 2603:1060:0:10::f536                                            0.0%  10  82.1  55.7  50.5  82.1   9.7
|     12. AS8075 2603:1060:0:10::f3b1                                            0.0%  10  54.4  96.6  50.4 147.4  32.5
|     13. AS8075 2603:1060:0:10::f51a                                            0.0%  10  49.7  55.3  49.7  78.4   8.3
|     14. AS8075 2a01:111:201:f200::d9d                                          0.0%  10  52.7  53.2  50.2  58.1   2.7
|     15. AS8075 2a01:111:2000:6::4a51                                           0.0%  10  49.4  51.6  49.4  54.1   1.7
|     20. AS8075 2603:1030:b:3::152                                              0.0%  10  50.7  53.4  49.2  60.7   4.2
| hinkley wrote:
| In the cloud era this gets a bit better, but at my last job
| I removed a single service that was adding 30ms to response
| time and replaced it with a Consul lookup with a watch on
| it. It wasn't even a big service. Same DC, very simple
| graph query with a very small response. You can burn
| through 30ms without half trying.
| jesse__ wrote:
| Broadly agree.
|
| I'm curious, what're the profilers you know of that tried to
| be better? I have a little homebrew game engine with an
| integrated profiler that I'm always looking for ideas to make
| more effective.
| hinkley wrote:
| Clinic.js tried and lost steam. I have a recollection of a
| profiler called JProfiler that represented space and time
| as a graph, but also a recollection they went under. And
| there is a company selling a product of that name that has
| been around since that time, but doesn't quite look how I
| recalled and so I don't know if I was mistaken about their
| demise or I've swapped product names in my brain. It was 20
| years ago which is a long time for mush to happen.
|
| The common element between attempts is new visualizations.
| And like drawing a projection of an object in a mechanical
| engineering drawing, there is no one projection that
| contains the entire description of the problem. You need to
| present several and let the brain synthesize the data missing
| in each individual projection into an accurate model.
| never_inline wrote:
| what do you think about speedscope's sandwich view?
| hinkley wrote:
| More of the same. JetBrains has an equivalent, though it
| seems to be broken at present. The sandwich keeps
| dragging you back to the flame graph. Call stack depth
| has value but width is harder for people to judge and
| it's the wrong yardstick for many of the concerns I've
| mentioned in the rest of this thread.
|
| The sandwich view hides invocation count, which is one of
| the biggest things you need to look at for that remaining
| 3x.
|
| Also you need to think about budgets. Which is something
| game designers do and the rest of us ignore. Do I want
| 10% of overall processing time to be spent accessing
| reloadable config? Reporting stats? If the answer is no
| then we need to look at that, even if data retrieval is
| currently 40% of overall response time and we are trying
| to get from 2 seconds to 200 ms.
|
| That means config and stats have a budget of 20ms each
| and you will never hit 200ms if someone doesn't look at
| them. So you can pretend like they don't exist until you
| get all the other tent poles chopped and then surprise
| pikachu face when you've already painted them into a
| corner with your other changes.
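|
| The budget arithmetic above, made explicit in Go (the split
| is an assumed example, not a prescription):
|
|     package main
|
|     import "fmt"
|
|     func main() {
|         const target = 200.0 // ms, the end-to-end goal
|         shares := map[string]float64{ // fraction of target allowed
|             "reloadable config": 0.10,
|             "stats reporting":   0.10,
|         }
|         for name, share := range shares {
|             fmt.Printf("%s budget: %.0f ms\n", name, share*target)
|         }
|     }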
|
| When we have a lot of shit that all needs to get done,
| you want to get to transparency, look at the pile and
| figure out how to do it all effectively. Combine errands
| and spread the stressful bits out over time. None of the
| tools and none of the literature supports this exercise,
| and in fact most of the literature is actively hostile to
| this exercise. Which is why you should read a certain
| level of reproval or even contempt in my writing about
| optimization. It's very much intended.
|
| Most advice on writing fast code has not materially
| changed for a time period where the number of
| calculations we do has increased by 5 orders of
| magnitude. In every other domain, we re-evaluate our
| solutions at each order of magnitude. We have marched
| past ignorant and into insane at this point. We are
| broken and we have been broken for twenty years.
| nitinreddy88 wrote:
| The other way to look at it is why adding the NS label causes
| so much memory footprint in Kubernetes. Shouldn't we be fixing
| that (could be a much bigger design change)? It would benefit
| the whole Kube community.
| bstack wrote:
| Author here: yeah, that's a good point. tbh I was mostly
| unfamiliar with Vector so I took the shortest path to the goal,
| but that could be an interesting follow-up. It does seem like
| there's a lot of bytes per namespace!
| stackskipton wrote:
| You mentioned in the blog article that it's doing a list-watch.
| A list-watch registers with the Kubernetes API to get a list of
| all objects AND a notification when any object of the kind you
| registered for changes. A bunch of Vector pods saying "Hey,
| send me a notification when anything with namespaces changes"
| and _poof_ goes your memory keeping track of who needs to know
| what.
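|
| A minimal sketch of that list-watch pattern using client-go
| (hypothetical agent code, not Vector's actual implementation):
|
|     package main
|
|     import (
|         "time"
|
|         "k8s.io/client-go/informers"
|         "k8s.io/client-go/kubernetes"
|         "k8s.io/client-go/rest"
|     )
|
|     func main() {
|         cfg, err := rest.InClusterConfig()
|         if err != nil {
|             panic(err)
|         }
|         client := kubernetes.NewForConfigOrDie(cfg)
|
|         // An initial LIST of every namespace, then a WATCH for
|         // changes. The full object set is cached in this
|         // process's memory, once per pod running this agent.
|         factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
|         nsInformer := factory.Core().V1().Namespaces().Informer()
|
|         stop := make(chan struct{})
|         factory.Start(stop)
|         factory.WaitForCacheSync(stop)
|         _ = nsInformer.GetStore().List() // every Namespace, held locally
|         <-stop
|     }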
|
| At this point, I wonder if, instead of relying on daemonsets,
| you just gave every namespace a Vector instance responsible
| for that namespace and the pods within it. ElasticSearch
| or whatever you pipe logging data to might not be happy with
| all those TCP connections.
|
| Just my SRE brain thoughts.
| fells wrote:
| > you just gave every namespace a Vector instance responsible
| for that namespace and the pods within it.
|
| Vector is a daemonset because it needs to tail the log
| files on each node. A single Vector per namespace might not
| reside on the nodes that each pod is on.
| stackskipton wrote:
| I think the DaemonSet is to reduce network load, so Vector is
| not pulling log files over the network.
|
| We run Vector as a DaemonSet as well, but we don't have a
| ton of namespaces. Render sounds like they have a ton of
| namespaces running maybe one or two pods each, since their
| customers are much smaller. This is probably a much more
| niche setup than for many users of Kubernetes.
| ahoka wrote:
| That's where the design is wrong.
| hinkley wrote:
| Keys require O(log n) space per key, or O(n log n) for the
| entire data set, simply to avoid key collisions. But
| human-friendly key spaces grow much, much faster, and I don't
| think many people have looked too hard at that.
|
| There were recent changes to the NodeJS Prometheus client that
| eliminate tag names from the keys used for storing the tag
| cardinality for metrics. The memory savings wasn't reported,
| but the CPU savings for recording data points was over 1/3,
| and about twice that when applied to the aggregation logic.
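|
| The general idea, sketched in Go (hypothetical key formats,
| not the actual prom-client change): with a fixed label order
| stored once per metric, a series key only needs the values.
|
|     package main
|
|     import (
|         "fmt"
|         "strings"
|     )
|
|     func main() {
|         // One series: http_requests_total{method="GET",status="200"}
|         values := []string{"GET", "200"} // label names live elsewhere
|
|         verbose := `http_requests_total{method="GET",status="200"}`
|         compact := "http_requests_total;" + strings.Join(values, ";")
|
|         // Shorter keys mean less memory and cheaper hashing on
|         // every data point recorded.
|         fmt.Println(len(verbose), verbose)
|         fmt.Println(len(compact), compact)
|     }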
|
| Lookups are rarely O(1), even in hash tables.
|
| I wonder if there's a general solution for keeping names concise
| without triggering transposition or reading comprehension errors.
| And what the space complexity is of such an algorithm.
| vlovich123 wrote:
| Why aren't they just 128-bit UUIDs? Those are guaranteed to be
| globally unique and don't require so much space.
| hinkley wrote:
| Why aren't what 128-bit UUIDs?
|
| > keeping names concise without triggering transposition or
| reading comprehension errors.
|
| Code that doesn't work for developers first will soon cease
| to work for anyone. Plus, how do you look up a UUID for a set
| of tags? What's your perfect-hash plan to make sure you don't
| misattribute stats to the wrong place?
|
| UUIDs are entirely opaque and difficult to tell apart
| consistently.
| Aeolun wrote:
| I read this and I have to wonder: did anyone ever think it was
| reasonable that a cluster that apparently needed only 120GB of
| memory was consuming 1.2TB just for logging (or whatever Vector
| does)?
| bstack wrote:
| Author here: You'd be surprised what you don't notice given
| enough nodes and slow enough resource growth over time! Even
| at its high-water mark, this daemonset was still a small
| portion of the total resource usage in these clusters.
| fock wrote:
| how large are the clusters then?
| Aeolun wrote:
| I'm not sure if that makes it better or worse.
| embedding-shape wrote:
| I didn't know what Render was when I skimmed the article at
| first, but after reading these comments, I had to check out
| what they do.
|
| And they're a "Cloud Application Platform", meaning they
| manage deploys and infrastructure for _other people_. Their
| website says "Click, click, done." which is cool and quick
| and all, but to me it's kind of crazy that an organization
| that should be really engineering-focused and mature doesn't
| immediately notice 1.2TB being used and try to figure out
| why, when 120GB ended up being sufficient.
|
| It gives much more of a "We're a startup, we're learning as
| we're running" vibe, which again, cool and all, but is hardly
| what people should use for hosting their own stuff.
| antoniojtorres wrote:
| It seems realistic to me, commonplace even. Lots to do in a
| company like this one.
| devjab wrote:
| We're a much smaller-scale company, and the cost we lose on
| these things is insignificant compared to what's in this story.
| Yesterday I was improving the process for creating databases in
| our Azure and I stumbled upon a subscription which was running
| 7 MSSQL servers for 12 databases. These weren't elastic, and
| each was paying a license that we don't have to pay because we
| qualify for the base cost through our contract with our
| Microsoft partner. This company has some of the tightest
| control over their cloud infrastructure of any organisation
| I've worked with.
|
| This is anecdotal, but if my experiences aren't unique, then
| there is a lot of unreasonableness in DevOps.
| ffsm8 wrote:
| Isn't that mostly down to the fact that the vast majority of
| devs explicitly don't want to do anything wrt Ops?
|
| DevOps has - ever since its originally well-meaning
| inception (by Netflix iirc?) - been implemented across our
| industry as an effective cost-cutting measure, forcing devs
| that didn't see it as their job to _also_ handle it.
|
| Which consequently means they're not interfacing with it
| whatsoever. They do as little as they can get away with,
| which inevitably means things are being done with borderline
| malicious compliance... or just complete incompetence.
|
| I'm not even sure I'd blame these devs in particular. The
| devs just saw it as a quick bonus generator for the MBA in
| charge of this rebranding, while it offloaded more
| responsibilities onto their shoulders.
|
| DevOps made total sense in the work culture where this
| concept was conceived - Netflix was well known at that point
| to only ever employ senior devs. However, in the context of
| the average 9-5 dev, who often knows a lot less than even
| some enthusiastic juniors... let's just say that it's
| incredibly dicey whether it's successful in practice.
| mustyoshi wrote:
| I politely disagree. I spent maybe 8 hours over a week
| rightsizing a handful of heavy deployments from a previous
| team and reduced their peak resource usage by implementing
| better scaling policies. Before the new scaling policy, the
| service would quite frequently scale out, and the new pods
| would remain idle and ultimately get terminated without ever
| responding to a request.
|
| The service dashboards already existed; all I had to do was
| a bit of load testing and read the graphs.
|
| It's not too much extra work to make sure you're scaling
| efficiently.
| ffsm8 wrote:
| You disagree, but then cite another example of low-hanging
| fruit that nobody took action on until you came along?
|
| Did you accidentally respond to the wrong comment?
| Because if anything you're giving another example of
| "most devs not wanting to interface with ops, hence
| letting it slide until someone bothers to pick up their
| slack"...
| FroshKiller wrote:
| The first time my director asked me if I'd ever heard of
| DevOps, I said, "Sure, doing two jobs for one paycheck."
| I'm a software developer, buddy. I write the programs.
| Leave me out of running them.
| fock wrote:
| We have on-prem with heavy spikes (our batch workload can
| easily utilize the 20TB of memory in the cluster) and we just
| don't care much and add 10% every year to the hardware
| requested. Compared to employing people or paying other vendors
| (relational databases with many TB-sized tables...) this is
| just irrelevant.
|
| Sadly, devs are incentivized by that, and going toward the
| cloud might be a fun story. Given the environment, I hope they
| scrap the effort sooner rather than later, buy some Oxide
| systems for the people who need to iterate faster than the
| usual process of getting a VM, and replace/reuse the 10% of
| the company occupied with the cloud (mind you: no real
| workload runs there yet...) to actually improve local
| processes...
| g-mork wrote:
| Somewhat unrelated, but you just tied wasteful software
| design to high IT salaries, and also suggested a reason why
| Russian programmers might, on the whole, seem far more
| effective than we are.
|
| I wonder: if MSFT had simply cut dev salaries by 50% in the
| 90s, would it have had any measurable effect on Windows
| quality by today?
| formerly_proven wrote:
| It probably doesn't help that the first line of treatment for
| any error is to blindly increase memory request/limit and claim
| it's fixed (preferably without looking at the logs once).
| liampulles wrote:
| I'm a little surprised that it got to the point where pods which
| should consume a couple MB of RAM were consuming 4GB before
| action was taken. But I can also kind of understand it, because
| the way k8s operators (apps running in k8s that manipulate k8s
| resources) are meant to run is essentially a loop of listing
| resources, comparing them to spec, and making moves to bring
| the state of the cluster closer to spec (sketched below). This
| reconciliation loop is simple to understand (and I think this
| benefit has led to the creation of a wide array of excellent
| open-source and proprietary operators that can be added to
| clusters). But it's also a recipe for cascading explosions in
| resource usage.
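|
| A minimal sketch of that reconciliation loop in Go
| (hypothetical types, not a real operator):
|
|     package main
|
|     import "time"
|
|     type state struct{ replicas int }
|
|     // reconcile compares spec to reality and returns the moves
|     // needed to converge.
|     func reconcile(desired, actual state) (moves []string) {
|         for ; actual.replicas < desired.replicas; actual.replicas++ {
|             moves = append(moves, "create pod")
|         }
|         for ; actual.replicas > desired.replicas; actual.replicas-- {
|             moves = append(moves, "delete pod")
|         }
|         return moves
|     }
|
|     func main() {
|         for {
|             desired := state{replicas: 3} // from the spec
|             actual := state{replicas: 1}  // from listing live resources
|             _ = reconcile(desired, actual)
|             // Each iteration re-lists: cheap for one operator, a
|             // cascading load when dozens poll the API server.
|             time.Sleep(30 * time.Second)
|         }
|     }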
|
| These kinds of resource explosions are something I see all the
| time in k8s clusters. The general advice is to always try to
| keep pressure off the k8s API, and the consequence is that one
| must be very minimal and tactical with the operators one
| installs, and then engage in many hours of work fine-tuning
| each operator to run efficiently (e.g. Grafana, whose default
| Helm settings do not use the recommended log indexing
| algorithm, and which needs to be tweaked to get an appropriate
| set of read vs. write pods for your situation).
|
| Again, I recognize there is a tradeoff here - the simplicity and
| openness of the k8s API is what has led to a flourishing of new
| operators, which really has allowed one to run "their own
| cloud". But there is definitely a cost. I don't know what the
| solution is, and I'm curious to hear from people who have other
| views of it, or who use alternatives to k8s that offer a
| different set of tradeoffs.
| never_inline wrote:
| > are meant to run is essentially a loop of listing resources,
| comparing to spec, and making moves to try and bring the state
| of the cluster closer to spec.
|
| Aren't they supposed to use watch/long polling?
___________________________________________________________________
(page generated 2025-11-01 23:01 UTC)