How We Found 7 TiB of Memory Just Sitting Around

Engineering · October 30, 2025 · Brian Stack

Debugging infrastructure at scale is rarely about one big aha moment. It's often the result of many small questions, small changes, and small wins stacked up until something clicks.

Inside the hypercube of bad vibes: the namespace dimension

[Image: Getting ready to dissect what I like to call: the Kubernetes hypercube of bad vibes. Credits: Hyperkube from gregegan.net, diagram (modified) from Kubernetes community repo]

Plenty of teams run Kubernetes clusters bigger than ours. More nodes, more pods, more ingresses, you name it. In most dimensions, someone out there has us beat. But there's one dimension where I suspect we might be near the very top: namespaces.

I say that because we keep running into odd behavior in any process that has to keep track of them. In particular, anything that listwatches namespaces ends up using a surprising amount of memory and puts real pressure on the apiserver. This has become one of those scaling quirks you only really notice once you hit a certain threshold. As this memory overhead adds up, efficiency decreases: each byte we have to use for management is a byte we can't put towards user services.

The problem gets significantly worse when a daemonset needs to listwatch namespaces or network policies (netpols, which we define per namespace). Since daemonsets run a pod on every node, each of those pods independently performs a listwatch on the same resources, so memory usage increases with the number of nodes. Even worse, these listwatch calls can put significant load on the apiserver. If many daemonset pods restart at once, such as during a rollout, they can overwhelm the server with requests and cause real disruption.
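To make that cost concrete, here is a minimal, stand-alone sketch of the pattern in question, written against the Rust kube crate. This is purely illustrative (it is not Vector's or Calico's actual code): a watcher plus reflector list-watches Namespaces and mirrors every object into an in-memory store. That store grows with the number of namespaces in the cluster, and because a daemonset runs one of these per node, the cluster pays for it once per node.

```rust
// Minimal namespace list-watch sketch using the kube crate (illustrative only).
// Rough dependencies: kube (features: client, runtime), k8s-openapi, tokio, futures.
use futures::TryStreamExt;
use k8s_openapi::api::core::v1::Namespace;
use kube::{
    runtime::{reflector, watcher, WatchStreamExt},
    Api, Client,
};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::try_default().await?;
    let namespaces: Api<Namespace> = Api::all(client);

    // The reader half of the reflector is a full in-memory copy of every
    // Namespace object the watch has seen -- this is what scales with
    // namespace count, in every pod of the daemonset.
    let (reader, writer) = reflector::store::<Namespace>();
    let stream = reflector(writer, watcher(namespaces, watcher::Config::default()));

    // Drive the watch in the background.
    tokio::spawn(async move {
        stream
            .applied_objects()
            .try_for_each(|_ns| async { Ok(()) })
            .await
    });

    // Periodically report how many objects the local cache is holding.
    loop {
        tokio::time::sleep(std::time::Duration::from_secs(30)).await;
        println!("namespaces cached locally: {}", reader.state().len());
    }
}
```

The initial LIST each pod issues on startup is also where the apiserver pressure during rollouts comes from: every pod in the daemonset asks for every namespace at roughly the same time.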
Following the memory trail

A few months ago, if you looked at our nodes, the largest memory consumers were often daemonsets, in particular Calico and Vector, which handle configuring networking and log collection respectively. We had already done some work to reduce Calico's memory usage, working closely with the project's maintainers to make it scale more efficiently. That optimization effort was a big win for us, and it gave us useful insight into how memory behaves when namespaces scale up.

[Image: Memory profiling results]
[Image: Time-series graph of memory usage per pod for calico-node instances]

To support that work, we set up a staging cluster with several hundred thousand namespaces. We knew that per-namespace network policies were the scaling factor that stressed Calico, so we reproduced those conditions to validate our changes. While running those tests, we noticed something strange: Vector, another daemonset, also started consuming large amounts of memory.

[Image: Memory usage per pod graph showing Vector pods]

The pattern looked familiar, and we knew we had another problem to dig into. Vector obviously wasn't looking at netpols, but after poking around a bit we found it was listwatching namespaces from every node in order to allow referencing namespace labels per-pod in the kubernetes_logs source.

Do we really need these labels?

That gave us an idea: what if Vector didn't need to use namespaces at all? Was that even possible? As it turns out, yes, the namespace labels were in use in our configuration, but only to check whether a pod belonged to a user namespace. Conveniently, we realized we could hackily describe that condition in another way (the before-and-after conditions are in the config snippets at the end of this post), and the memory savings were absolutely worth it.

Building the fix (and breaking the logs)

At that point, we were feeling a bit too lucky. We reached out to the Vector maintainers to ask whether disabling this behavior would actually work, and whether they would be open to accepting a contribution if we made it happen.

[Image: GitHub comment proposing to make namespace list/watching in Vector an opt-in setting]

From there, all that was left was to try it. The code change was straightforward: we added a new config option and threaded it through the relevant parts of the codebase. After a few hours of flailing at rustc, a Docker image finally built and we were ready to test the theory.

The container ran cleanly with no errors in the logs, which seemed promising. But then we hit a snag. Nothing was being emitted. No logs at all. I couldn't figure out why. Thankfully, our pal Claude came to the rescue:

Claude: Looking at the code, I can see the issue. When add_namespace_fields is set to false, the namespace watcher/reflector is not created (lines 722-741). However, there's still a dependency on the namespace state in the K8sPathsProvider (line 768) and NamespaceMetadataAnnotator (line 774)

I rebuilt it (which took like 73 hours because Rust), generated a new image, updated staging, and watched nervously. This time, logs were flowing like normal and...

[Image: Memory usage per pod graph]

The numbers don't add up

The change saved 50 percent of memory. A huge win. We were ready to wrap it up and ship to production. But then Hieu, one of our teammates, asked a very good question:

Hieu: that sounds good. still concerning that it uses 1Gi in staging without namespace data, though.
Me: we can profile it more if we want. now that I can build images, I think we can do the Rust profiling the same way we get from Go for free.
Hieu: cool! yeah, I think it'd be worth understanding where the RAM is going

He was right, something didn't add up. A few hours later, after repeatedly running my head into a wall, I still hadn't found anything.
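For anyone wondering what "profiling Rust the way we get it from Go for free" looks like in practice: it isn't quite free. When you can rebuild the binary, one low-friction approach is to bake an instrumented allocator into a test image. Here is a minimal sketch using the dhat crate; that is an illustrative choice of tool, not necessarily what our test build used.

```rust
// Heap-profiling sketch with the dhat crate (illustrative; add `dhat = "0.3"` to Cargo.toml).
// Swapping in dhat's allocator lets it attribute every heap allocation to a call site.
#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

fn main() {
    // While this guard is alive, allocations are tracked; when it drops at the
    // end of main, dhat writes dhat-heap.json, which the DHAT viewer can break
    // down by allocation site.
    let _profiler = dhat::Profiler::new_heap();

    // Stand-in for the real workload: run the service here and reproduce the
    // memory growth you are chasing.
    let mut buffers: Vec<Vec<u8>> = Vec::new();
    for _ in 0..1_000 {
        buffers.push(vec![0u8; 64 * 1024]);
    }
    println!("allocated {} buffers", buffers.len());
}
```

Tooling aside, the picture didn't get any clearer.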
There was still a full gibibyte of memory unaccounted for. My whole theory about how this worked was starting to fall apart. I even dropped into the channel to see if anyone had Valgrind experience:

Me (later in channel): anybody got a background in valgrind? seems pretty straightforward to get working so far but it won't end up interfacing with pyroscope. we'll have to exec in and gdb manually.

The answer was no. In a last-ditch effort to profile it again, I finally saw the answer. It had been staring me in the face the whole time: we actually had two kubernetes_logs sources on user nodes (see the sources snippet at the end of this post), and I had only set the flag on one of them. Once I applied it to both, memory usage dropped to the level we had seen in staging before the extra namespaces were added.

Shipping it

I put together a full pull request, and after waiting a little while, it shipped!

[Image: PR merged!]
[Image: Changelog noting the new insert_namespace_fields option in the Vector docs]

Around the same time, our colleague Mark happened to be on-call. He did his usual magic: pulled everything together, tested the rollout in staging, and got it shipped to production. I'll let the results speak for themselves.

[Image: Memory usage per pod dropped from nearly 4 GiB down to just a few tens of MiB, not to mention the reduction in CPU and network IO]
[Image: The total memory usage across one of our larger clusters dropped by 1 TiB]

Our largest cluster saw a 1 TiB memory drop, with savings across our other clusters adding up to a total of just over 7 TiB.

7 TiB later

Debugging infrastructure at scale is rarely about one big "aha" moment. It's often the result of many small questions, small changes, and small wins stacked up until something clicks. In this case, it started with a memory chart that didn't look quite right, a teammate asking the right question at the right time, and a bit of persistence. When applied to our whole infrastructure, that simple fix freed up 7 TiB of memory, reduced risk during rollouts, and made the system easier to reason about.

Huge thanks to Hieu for pushing the investigation forward, Mark for shipping it smoothly, and the Vector maintainers for being responsive and open to the change.

If you're running daemonsets at scale and seeing unexplained memory pressure, it might be worth asking: Do you really need those namespace labels?

Render takes your infrastructure problems away and gives you a battle-tested, powerful, and cost-effective cloud with an outstanding developer experience. Focus on building your apps, shipping fast, and delighting your customers, and leave your cloud infrastructure to us.
Config snippets referenced in this post

The filter where we used namespace labels, only to check whether a pod belonged to a user namespace:

```yaml
type: filter
inputs: [add_log_levels]
condition:
  type: vrl
  source: .kubernetes.namespace_labels.userNS == "true"
```

The replacement condition, expressed without namespace labels:

```vrl
.kubernetes.namespace_name != null && (
  starts_with(string!(.kubernetes.namespace_name), "user-")
)
```

The documentation entry for the new insert_namespace_fields option:

```cue
insert_namespace_fields: {
    description: """
        Specifies whether or not to enrich logs with namespace fields.
        Setting to `false` prevents Vector from pulling in namespaces and thus
        namespace label fields will not be available. This helps reduce load on
        the `kube-apiserver` and lowers daemonset memory usage in clusters with
        many namespaces.
        """
    required: false
    type: bool: default: true
}
```

The two kubernetes_logs sources on our user nodes, which is why the flag had to be set in both places:

```yaml
sources:
  kubernetes_system_logs:
    type: kubernetes_logs
    glob_minimum_cooldown_ms: 1000
    ...
  kubernetes_logs:
    type: kubernetes_logs
    glob_minimum_cooldown_ms: 1000
    ...
```
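Finally, for the curious, here is a rough, self-contained sketch of the shape the change had to take; this is a hypothetical illustration, not Vector's actual code. The namespace store only exists when the option is on, and everything downstream has to tolerate its absence. Wiring up only the first half of that is exactly how I ended up with a container that ran cleanly but emitted no logs.

```rust
use std::collections::HashMap;

// Stand-in for a kube reflector store of Namespace objects: name -> labels.
type NamespaceStore = HashMap<String, HashMap<String, String>>;

struct KubernetesLogsConfig {
    insert_namespace_fields: bool, // the new option; defaults to true upstream
}

struct Annotator {
    namespaces: Option<NamespaceStore>,
}

impl Annotator {
    // With the flag off there is no store at all, so lookups must degrade to
    // None instead of assuming the store exists. Gating only the watcher and
    // not its consumers is the mistake that stopped logs from flowing.
    fn namespace_labels(&self, ns: &str) -> Option<&HashMap<String, String>> {
        self.namespaces.as_ref()?.get(ns)
    }
}

fn build_annotator(cfg: &KubernetesLogsConfig) -> Annotator {
    let namespaces = if cfg.insert_namespace_fields {
        // In the real source this would be a kube watcher/reflector kept in
        // sync with the apiserver; an empty map stands in for it here.
        Some(NamespaceStore::new())
    } else {
        None
    };
    Annotator { namespaces }
}

fn main() {
    let annotator = build_annotator(&KubernetesLogsConfig { insert_namespace_fields: false });
    // No namespace store, but the pipeline keeps working: enrichment is simply skipped.
    assert!(annotator.namespace_labels("user-1234").is_none());
    println!("log enrichment skipped cleanly without namespace metadata");
}
```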