[HN Gopher] Ask HN: Scheduling stateful nodes when MMAP makes me...
___________________________________________________________________
Ask HN: Scheduling stateful nodes when MMAP makes memory accounting
a lie
We're hitting a classic distributed systems wall and I'm looking
for war stories or "least worst" practices.

The Context: We maintain a distributed stateful engine (think
search/analytics). The architecture is standard: a Control Plane
(Coordinator) assigns data segments to Worker Nodes. The workload
involves heavy use of mmap and lazy loading for large datasets.

The Incident: We had a cascading failure where the Coordinator got
stuck in a loop, DDOS-ing a specific node.

The Signal: Coordinator sees Node A has significantly fewer rows
(logical count) than the cluster average. It flags Node A as
"underutilized."

The Action: Coordinator attempts to rebalance/load new segments
onto Node A.

The Reality: Node A is actually sitting at 197GB RAM usage (near
OOM). The data on it happens to be extremely wide (fat rows, huge
blobs), so its logical row count is low, but physical footprint is
massive.

The Loop: Node A rejects the load (or times out). The Coordinator
ignores the backpressure, sees the low row count again, and retries
immediately.

The Core Problem: We are trying to write a "God Equation" for our
load balancer. We started with row_count, which failed. We looked
at disk usage, but that doesn't correlate with RAM because of lazy
loading. Now we are staring at mmap. Because the OS manages the
page cache, the application-level RSS is noisy and doesn't strictly
reflect "required" memory vs "reclaimable" cache.

The Question: Attempting to enumerate every resource variable (CPU,
IOPS, RSS, Disk, logical count) into a single scoring function
feels like an NP-hard trap. How do you handle placement in systems
where memory usage is opaque/dynamic?

Dumb Coordinator, Smart Nodes: Should we just let the Coordinator
blind-fire based on disk space, and rely 100% on the Node to return
hard 429 Too Many Requests based on local pressure?

Cost Estimation: Do we try to build a synthetic "cost model" per
segment (e.g., predicted memory footprint) and schedule based on
credits, ignoring actual OS metrics?

Control Plane Decoupling: Separate storage balancing (disk) from
query balancing (mem)?

Feels like we are reinventing the wheel. References to papers or
similar architecture post-mortems appreciated.
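
To make the "Cost Estimation" option concrete, here is a minimal
sketch of credit-based placement. It is not the poster's system:
the Node class, the credit_budget field, the place() helper, and
all the numbers are hypothetical, and the per-segment cost is
assumed to come from some offline estimate of memory footprint.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    credit_budget: int   # hypothetical: GiB of memory the node may commit
    committed: int = 0   # credits already handed out to placed segments
    segments: list = field(default_factory=list)

    @property
    def remaining(self) -> int:
        return self.credit_budget - self.committed


def place(segment_id, predicted_cost, nodes):
    """Place a segment on the node with the most remaining credits.

    Returns the chosen Node, or None when no node can afford the
    segment. The coordinator never looks at row counts or OS metrics.
    """
    candidates = [n for n in nodes if n.remaining >= predicted_cost]
    if not candidates:
        return None
    target = max(candidates, key=lambda n: n.remaining)
    target.committed += predicted_cost
    target.segments.append(segment_id)
    return target


# Node A holds few rows but has already committed 197 of 200 credits.
node_a = Node("A", credit_budget=200, committed=197)
node_b = Node("B", credit_budget=200, committed=120)
first = place("seg-1", 8, [node_a, node_b])
second = place("seg-2", 100, [node_a, node_b])
```

With these numbers the first segment goes to node B, and the second
is refused outright instead of being retried against node A. The
quality of the result depends entirely on how good the predicted
cost is.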
Author : leo_e
Score : 13 points
Date : 2025-11-24 17:30 UTC (5 hours ago)
| otterley wrote:
| It's not clear whether you're using Kubernetes, but the
| Kubernetes way of dealing with this problem is to declare a
| memory reservation (i.e., a request) along with the container
| specification. The amount of the reservation will be deducted
| from the host's available memory for scheduling purposes,
| regardless of whether the container actually consumes the
| reserved amount. It's also a best practice to configure the
| memory limit to be identical to the reservation, so if the
| container exceeds the reserved amount, the kernel will terminate
| it via the OOM killer.
|
| Of course, for this to work, you have to figure out what that
| reserved amount should be. That is an exercise for the
| implementer (i.e., you).
|
| See https://kubernetes.io/docs/concepts/configuration/manage-
| res...
|
| > Attempting to enumerate every resource variable (CPU, IOPS,
| RSS, Disk, logical count) into a single scoring function feels
| like an NP-hard trap.
|
| Yeah, don't do that. Figure out what resources your applications
| need and then declare them, and let the scheduler find the best
| node based on the requirements you've specified.
|
| > We are trying to write a "God Equation" for our load balancer.
| We started with row_count, which failed. We looked at disk usage,
| but that doesn't correlate with RAM because of lazy loading.
|
| A few things come to mind...
|
| First, you're talking about a load balancer, but it's not clear
| that you're trying to balance load! A good metric to use for load
| balancing is one whose value is proportional to response latency.
|
| It smells like you're trying to provision resources based on an
| optimistic prediction of your working set size. Perhaps you need
| a more pessimistic prediction. It might also be that you're
| relying too heavily on the kernel to handle paging, when what you
| really need is a cache tuned for your application that is scan-
| resistant, coupled with O_DIRECT for I/O.
| majke wrote:
| > Coordinator sees Node A has significantly fewer rows (logical
| count) than the cluster average. It flags Node A as
| "underutilized."
|
| Ok, so you are dealing with a classic - you measure A, but what
| matters is B. For "load" balancing a decent metric is, well,
| response time (and jitter).
|
| For data partitioning - I guess number of rows is not the right
| metric? Change it to number*avg_size or something?
|
| If you can't measure the thing directly, then take a look at
| stuff like "PID controller". This can be approached as a typical
| control loop problem, although in 99% of cases doing PID for
| software systems is overkill.
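
A small sketch of the "number*avg_size" suggestion above, showing
how the choice of "least loaded" node flips once rows are weighted
by size. NodeStats, avg_row_bytes, and the figures are hypothetical.
Note that this estimates physical footprint only; with mmap and
lazy loading it is still just a proxy for resident memory.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    name: str
    row_count: int
    avg_row_bytes: int  # hypothetical per-node average row size

    @property
    def estimated_bytes(self) -> int:
        # Size-weighted load: rows multiplied by average row size.
        return self.row_count * self.avg_row_bytes


def least_loaded(nodes):
    """Pick the node with the smallest estimated physical footprint."""
    return min(nodes, key=lambda n: n.estimated_bytes)


nodes = [
    NodeStats("A", row_count=1_000_000, avg_row_bytes=200_000),  # few fat rows
    NodeStats("B", row_count=50_000_000, avg_row_bytes=1_000),
    NodeStats("C", row_count=40_000_000, avg_row_bytes=1_500),
]

by_rows = min(nodes, key=lambda n: n.row_count).name
by_bytes = least_loaded(nodes).name
```

By row count node A looks nearly empty; by estimated bytes it is the
heaviest node and B becomes the placement target.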
| bcoates wrote:
| Memory pressure (and a lot of other overload conditions) usually
| makes latency worse--does that show up in your system? Latency
| backpressure is a pretty conventional thing to do. You're going
| to want some way to close the loop back to your load balancer. If
| you're doing open-loop control (sending a "fair share" of traffic
| to each node and assuming it can handle it), issues like you
| describe will keep coming up.
|
| This is a Hard Problem and you might be trying to get away with
| an unrealistically small amount of overprovisioning.
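
One way to close the loop described above is for the coordinator to
remember rejections and back off exponentially instead of retrying
immediately. This is a sketch under assumptions: BackpressureTracker
and its parameters are hypothetical, and "rejection" stands for
whatever signal the node gives (a 429, a timeout, or a latency
breach). Time is passed in explicitly so the logic is easy to test.

```python
class BackpressureTracker:
    """Per-node exponential backoff driven by rejection signals."""

    def __init__(self, base_delay=1.0, max_delay=300.0):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self._failures = {}   # node -> consecutive rejections
        self._retry_at = {}   # node -> earliest time of the next attempt

    def record_rejection(self, node, now):
        """Register a rejection and return the cooldown that was applied."""
        count = self._failures.get(node, 0) + 1
        self._failures[node] = count
        delay = min(self.max_delay, self.base_delay * 2 ** (count - 1))
        self._retry_at[node] = now + delay
        return delay

    def record_success(self, node):
        """A successful load resets the node's backoff state."""
        self._failures.pop(node, None)
        self._retry_at.pop(node, None)

    def eligible(self, node, now):
        """True if the coordinator may try this node again."""
        return now >= self._retry_at.get(node, 0.0)
```

The coordinator would filter candidate nodes through eligible()
before scoring them, so a node that keeps refusing work drops out of
consideration for progressively longer periods.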
| wmf wrote:
| Have you measured Pressure Stall Information or active pages from
| /proc/meminfo?
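
For reference, a minimal sketch of reading the signals mentioned
above on Linux. The file formats are the kernel's; the function
names and the thresholds in under_memory_pressure() are hypothetical
and would need tuning per workload. PSI requires a kernel built with
pressure stall information enabled.

```python
def parse_psi(text):
    """Parse /proc/pressure/memory into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        kind, fields = parts[0], parts[1:]
        result[kind] = {k: float(v) for k, v in (f.split("=", 1) for f in fields)}
    return result


def parse_meminfo(text):
    """Parse /proc/meminfo into {name: value}; most values are in kB."""
    values = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        tokens = rest.split()
        if tokens:
            values[key.strip()] = int(tokens[0])
    return values


def under_memory_pressure(psi, some_limit=10.0, full_limit=1.0):
    """Hypothetical rule: compare 10-second stall averages to limits."""
    some = psi.get("some", {}).get("avg10", 0.0)
    full = psi.get("full", {}).get("avg10", 0.0)
    return some > some_limit or full > full_limit


def read_file(path):
    with open(path) as f:
        return f.read()


SAMPLE_PSI = (
    "some avg10=12.50 avg60=3.10 avg300=0.80 total=123456\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
)
SAMPLE_MEMINFO = (
    "MemTotal:       205000000 kB\n"
    "Active(anon):       1000 kB\n"
    "Active(file):       2000 kB\n"
)
```

On a real node, parse_psi(read_file("/proc/pressure/memory")) and
parse_meminfo(read_file("/proc/meminfo")) would supply the inputs;
the samples above only illustrate the expected format.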
|
| > Attempting to enumerate every resource variable (CPU, IOPS,
| RSS, Disk, logical count) into a single scoring function feels
| like an NP-hard trap.
|
| That's perfect for machine learning.
| shanemhansen wrote:
| This actually seems like a simple example of memory request vs
| limit.
|
| Request the amount of memory needed to be healthy, you can
| potentially set the limit higher to account for "reclaimable
| cache".
|
| Another way to approach it, if you find that there are too many
| limiting metrics to accurately model things, is to let the
| workers grab more segments until you determine that they are
| overloaded. Ideally for this to work though you have some idea
| that the node is approaching saturation. So for example: keep
| adding segments as long as the nth percentile response time is
| under some threshold.
|
| The advantage of this approach is you don't necessarily have to
| know which resource (memory, filehandles, etc) is at capacity.
| You don't even necessarily have to have deep knowledge of Linux
| memory management. You _just_ have to be able to probe the system
| to determine if it's healthy.
|
| It can even go backwards with a binary split mechanism. You sort
| of bring up a node that owns [A-H] (8 segments in this case). If
| that fails bring up 2 nodes that own [A-D],[E-H], if that fails,
| all the way down to one segment per node.
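
The two ideas in the comment above can be sketched roughly as
follows. All names and thresholds are hypothetical, and is_healthy
stands in for whatever probe the system offers. One deviation from
the comment: this version only re-splits the half that fails the
probe, rather than splitting every group at each level.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def can_take_more(latencies_ms, pct=99, threshold_ms=250.0):
    """Keep adding segments only while the chosen percentile is healthy."""
    if not latencies_ms:
        return False  # no signal yet, so stay conservative
    return percentile(latencies_ms, pct) <= threshold_ms


def plan_nodes(segments, is_healthy):
    """Binary split: halve a group of segments until each group is healthy.

    Returns a list of groups, one group per node. A single segment is
    never split further, even if it fails the probe.
    """
    if len(segments) <= 1 or is_healthy(segments):
        return [segments]
    mid = len(segments) // 2
    return plan_nodes(segments[:mid], is_healthy) + plan_nodes(segments[mid:], is_healthy)


# Hypothetical footprints (GiB) for segments A-H and a 50 GiB node budget.
sizes = {"A": 10, "B": 10, "C": 10, "D": 10, "E": 40, "F": 40, "G": 5, "H": 5}


def fits(group):
    return sum(sizes[s] for s in group) <= 50


plan = plan_nodes(list("ABCDEFGH"), fits)
```

Here the plan ends up as four nodes: [A-D] together, E and F each on
their own node, and [G-H] together.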
___________________________________________________________________
(page generated 2025-11-24 23:01 UTC)