[HN Gopher] Heartbeats in Distributed Systems
___________________________________________________________________
Heartbeats in Distributed Systems
Author : sebg
Score : 90 points
Date : 2025-11-13 13:43 UTC (9 hours ago)
(HTM) web link (arpitbhayani.me)
(TXT) w3m dump (arpitbhayani.me)
| paulsutter wrote:
| > When a system uses very short intervals, such as sending
| heartbeats every 500 milliseconds
|
| 500 milliseconds is a very long interval, on a CPU timescale.
| Funny how we all tend to judge intervals based on human
| timescales
|
| Of course the best way to choose heartbeat intervals is based on
| metrics like transaction failure rate or latency
| blipvert wrote:
| Well, it is called a heartbeat after all, not an oscillator
| beat :-)
| hinkley wrote:
| Top shelf would be noticing an anomaly in behavior for a node
| and then interrogating it to see what's wrong.
|
| Automatic load balancing always gets weird, because it can end
| up sending more traffic to the sick server instead of less,
| because the results come back faster. So you have to be careful
| with status codes.
| just_mc wrote:
| You have to consider the tail latencies of the system
| responding plus the network in between. The p99 is typically
| much higher than the average. Also, you may have to account for
| GC, as mentioned in the article. 500 ms gets used up pretty
| fast.
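|
| In rough Python, a back-of-the-envelope budget for that (every
| number below is a hypothetical example, not a measurement):
|
|     # Rough detection-timeout budget for a 500 ms heartbeat
|     # interval; the pieces are the tail latency of the node
|     # answering, the network round trip, and a GC pause.
|     p99_handler_ms = 120     # tail latency of the probed node
|     p99_network_rtt_ms = 80  # tail round trip to the monitor
|     gc_pause_ms = 200        # worst observed stop-the-world pause
|     safety_margin_ms = 50
|     budget_ms = (p99_handler_ms + p99_network_rtt_ms
|                  + gc_pause_ms + safety_margin_ms)
|     print(budget_ms)         # 450 -- most of the 500 ms is gone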
| macintux wrote:
| Related advice based on my days working at Basho: find a way to
| recognize, and terminate, slow-running (or erratically-behaving)
| servers.
|
| A dead server is _much_ better for a distributed system than a
| misbehaving one. The latter can bring down your entire
| application.
| rcxdude wrote:
| Indeed, which is why I've heard of failover setups where the
| backup has a means to make very sure that the main system is
| off before it takes over (often by cutting the power).
| owl_vision wrote:
| I did this systematically: at the first sign of an outlier in
| performance, one system would move itself to another platform
| and shut itself down. The shutdown meant turning all services
| off and letting someone log in to investigate and rearrange it
| again. This setup allowed different roles to be assigned to
| different platforms; a platform was either bare metal or a bhyve
| VM. It worked perfectly.
| xyzzy_plugh wrote:
| Usually we call this STONITH (Shoot The Other Node In The
| Head).
| westurner wrote:
| Docker and Kubernetes have health check mechanisms to help
| solve for this;
|
| Docker docs > Dockerfile HEALTHCHECK instruction:
| https://docs.docker.com/reference/dockerfile/#healthcheck
|
| Podman docs > podman-healthcheck-run, docker-healthcheck-run:
| https://docs.podman.io/en/v5.4.0/markdown/podman-healthcheck...
|
| Kubernetes docs > "Configure Liveness, Readiness and Startup
| Probes" https://kubernetes.io/docs/tasks/configure-pod-
| container/con...
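|
| For illustration, a minimal endpoint such a check could poll
| (the path and port here are arbitrary, not anything these tools
| require):
|
|     # Minimal /healthz endpoint; a Docker HEALTHCHECK or a
|     # Kubernetes liveness probe would hit this and treat a
|     # non-200 response or a timeout as unhealthy.
|     from http.server import BaseHTTPRequestHandler, HTTPServer
|
|     class Health(BaseHTTPRequestHandler):
|         def do_GET(self):
|             if self.path == "/healthz":
|                 self.send_response(200)
|                 self.end_headers()
|                 self.wfile.write(b"ok")
|             else:
|                 self.send_response(404)
|                 self.end_headers()
|
|     if __name__ == "__main__":
|         HTTPServer(("0.0.0.0", 8080), Health).serve_forever()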
| toast0 wrote:
| > Consider a system with 1000 nodes where each node sends
| heartbeats to a central monitor every 500 milliseconds. This
| results in 2000 heartbeat messages per second just for health
| monitoring. In a busy production environment, this overhead can
| interfere with actual application traffic.
|
| If your 1000-node busy production environment is run so close to
| the edge that 2000 heartbeat messages per second push it into
| overload, that's impressive resource scheduling.
|
| Really, setting the interval balances speed of detection/cost of
| slow detection vs cost of reacting to a momentary interruption.
| If the node actually dies, you'd like to react as soon as
| possible; but if it's something like a link flap or system pause
| (GC or otherwise), most applications would prefer to wait and not
| transition state; some applications like live broadcast are
| better served by moving very rapidly and 500 ms might be too
| long.
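|
| A toy version of that knob pair, where detection delay is
| roughly interval times the miss threshold (names here are made
| up):
|
|     # Miss-count failure detector: a node is suspected dead only
|     # after MISSES consecutive intervals with no heartbeat.
|     import time
|
|     INTERVAL = 0.5   # seconds between expected heartbeats
|     MISSES = 4       # missed intervals before declaring dead
|
|     last_seen = {}   # node id -> time of last heartbeat
|
|     def record_heartbeat(node):
|         last_seen[node] = time.monotonic()
|
|     def dead_nodes():
|         now = time.monotonic()
|         return [n for n, t in last_seen.items()
|                 if now - t > INTERVAL * MISSES]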
|
| Re: network partitioning, the author left out the really fun
| splits. Say you have servers in DC, TX, and CA. If there's a
| damaged (but not severed) link between TX and CA, there's a good
| chance that DC can talk to everyone, but TX and CA can't
| communicate. You can have that inside a datacenter too: maybe
| each node can only reach 75% of the other nodes, and A reaching
| B and B reaching C does not mean A can reach C. Lots of fun
| times there.
| macintux wrote:
| Kingsbury & Bailis's paper on the topic of network partitions:
| https://github.com/aphyr/partitions-post
| karmakaze wrote:
| I've dealt with exactly this. We had a couple thousand webapp
| server instances that held connections to a MySQL database. Each
| one only polled its connection for liveness once per second, but
| those little checks were still poking at the servers and showed
| up at the top of the time-consuming-request charts.
| hinkley wrote:
| When systems were smaller I tried to push for the realization
| that I don't need a heartbeat from a machine that is currently
| returning status 200 responses at 60 req/s. The evidence of work
| is already there, and it's more meaningful than the status
| check.
|
| We end up adding real work to the status checks often enough
| anyway, to make sure the database and other services are still
| reachable. So inference has a lot of power that a heartbeat does
| not.
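|
| A sketch of that inference (names hypothetical): treat a recent
| successful request as an implicit heartbeat and only actively
| probe nodes that have gone quiet.
|
|     # Implicit liveness: a node serving real traffic doesn't
|     # need a separate probe; only quiet nodes get checked.
|     import time
|
|     QUIET_AFTER = 5.0   # seconds with no successful request
|     last_success = {}   # node id -> time of last 2xx response
|
|     def note_success(node):
|         last_success[node] = time.monotonic()
|
|     def nodes_needing_probe(nodes):
|         now = time.monotonic()
|         return [n for n in nodes
|                 if now - last_success.get(n, 0.0) > QUIET_AFTER]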
| taeric wrote:
| I'm amused that you got pushback. Just going on the metaphor,
| the very way that we check heartbeats in people is by a proof
| of work in some other part of the body. It isn't like we are
| directly hooked into the heart, or like the heart sends out an
| otherwise useless signal.
| zbentley wrote:
| > If your 1000-node busy production environment is run so close
| to the edge that 2000 heartbeat messages per second push it
| into overload, that's impressive resource scheduling.
|
| Eeeh I'm not so sure. The overhead of _handling_ a hello-world
| heartbeat request is negligible, sure, but what about the
| overhead of having the connections open (file descriptors,
| maybe >1 per request), client tracking metadata for what are
| necessarily all different client location identifiers, and so
| on?
|
| That's still cheap stuff, but at 2krps there are totally
| realistic scenarios where a system with decent capacity
| budgeting could still be adversely affected by heartbeats.
|
| And what if a heartbeat client's network link is degraded and
| there's a super long time between first byte and last? Whether
| or not that client gets evicted from the cluster, if it's
| basically slowloris-ing the server that can cause issues too.
| toast0 wrote:
| File descriptors are a limited resource, but the limits are
| huge. My little 2GB instances on GCP claim a limit of 1M;
| FreeBSD autotunes my 16GB servers to 0.5M (but I could
| increase it if I needed).
|
| I just don't know how you have a 1000 node system and you
| can't manage to heartbeat everything 2x a second; I don't
| think you need that many heartbeats in most systems, but it's
| just not that much work. The only way I can see it being a
| lot of work is if your nodes are _very_ small; but do you
| really need a 1000 node esp8266 cluster and you can't get
| anything bigger for the management node?
|
| > And what if a heartbeat client's network link is degraded
| and there's a super long time between first byte and last?
| Whether or not that client gets evicted from the cluster, if
| it's basically slowloris-ing the server that can cause issues
| too.
|
| How big is your heartbeat? Is the response really going to be
| in multiple packets? With a 500ms heartbeat and a typical
| 3-5x no response => dead, you're going to hold onto the
| partial heartbeat for like 2 seconds.
| schmichael wrote:
| > Really, setting the interval balances speed of detection/cost
| of slow detection vs cost of reacting to a momentary
| interruption.
|
| Another option is dynamically adjusting the heartbeat interval
| based on cluster size so that processing heartbeats has a fixed
| cost. That's what Nomad does, and in my fuzzy ten-year memory
| heartbeating has never caused resource constraints on the
| schedulers:
| https://developer.hashicorp.com/nomad/docs/configuration/ser...
| For reference, clusters are commonly over 10k nodes and to my
| knowledge peak between 20k and 30k. If anyone is running Nomad
| larger than that, I'd love to hear from them!
|
| That being said, the default of 50/s is probably too low, and
| the liveness tradeoff we force on users is probably not
| articulated clearly enough.
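|
| The rough shape of that scaling, in Python (this is not Nomad's
| actual code, just the idea; the 50/s cap mirrors the default
| mentioned above, while the 10 s floor is an assumed minimum):
|
|     # Scale the per-node heartbeat interval with cluster size so
|     # the servers process a bounded number of heartbeats/second.
|     def heartbeat_interval(num_nodes, max_hb_per_sec=50.0,
|                            min_interval=10.0):
|         return max(min_interval, num_nodes / max_hb_per_sec)
|
|     print(heartbeat_interval(1_000))   # 20.0 s
|     print(heartbeat_interval(20_000))  # 400.0 s -- the tradeoff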
|
| As an off-the-shelf scheduler we can't encode liveness costs
| for our users unfortunately, but we try to offer the right
| knobs to adjust it including per-workload parameters for what
| to do when heartbeats fail:
| https://developer.hashicorp.com/nomad/docs/job-specification...
|
| (Disclaimer: I'm on the Nomad team)
| tstack wrote:
| > ... push it into overload ...
|
| Oh, oh, I get to talk about my favorite bug!
|
| I was working on network-booting servers with iPXE and we got a
| bug saying that things were working fine until the cluster size
| went over 4/5 machines. In a larger cluster, machines would not
| come up from a reboot. I thought QA was just being silly, why
| would the size of the cluster matter? I took a closer look and,
| sure enough, was able to reproduce the bug. Basically, the
| machine would sit there stuck trying to download the boot image
| over TCP from the server.
|
| After some investigation, it turned out to be related to the
| heartbeats sent between machines (they were ICMP pings). Since
| iPXE is a very nice and fancy bootloader, it will happily
| respond to ICMP pings. Note that, in order to do this, it would
| do an ARP lookup to find the address to send the response to.
| Unfortunately, the size of the ARP cache was pretty small since
| this was "embedded" software (take a guess how big the cache
| was...). Essentially, while iPXE was downloading the image, the
| address of the image server would get pushed out of the ARP
| cache by all these heartbeats. Thus, the download would suffer
| since it had to constantly pause to redo the ARP request. So,
| things would work with a smaller cluster size since the ARP
| cache was big enough to keep track of the download server and
| the peers in the cluster.
|
| I think I "fixed" it by responding to the ICMP using the source
| MAC address (making sure it wasn't broadcast) rather than doing
| an ARP.
| ignoramous wrote:
| > _You can have that inside a datacenter too: maybe each node
| can only reach 75% of the other nodes, and A reaching B and B
| reaching C does not mean A can reach C. Lots of fun times
| there_
|
| At _BigCloud_ in the early days, things went berserk with a
| gossip system when A could reach B but B couldn't reach A.
|
| Cloudflare hit something similar though they misclassified the
| failure mode: https://blog.cloudflare.com/a-byzantine-failure-
| in-the-real-...
| __turbobrew__ wrote:
| Does anyone have recommendations on books/papers/articles which
| cover gossip protocols?
|
| I have been more interested in learning about gossip protocols
| and how they are used, different tradeoffs, etc.
| rishabhaiover wrote:
| https://thesecretlivesofdata.com/raft/
| rdtsc wrote:
| > https://thesecretlivesofdata.com/raft/
|
| Are you suggesting using Raft as a gossip protocol? Running a
| replicated state machine with leader election, replicated logs,
| and stable storage?
| the_duke wrote:
| Raft is a consensus protocol, which is very different from a
| gossip protocol.
| rishabhaiover wrote:
| I'm sorry, I got confused.
| the_duke wrote:
| Two interesting papers:
|
| * Epidemic broadcast trees:
| https://asc.di.fct.unl.pt/~jleitao/pdf/srds07-leitao.pdf
|
| * HyParView:
| https://asc.di.fct.unl.pt/~jleitao/pdf/dsn07-leitao.pdf
|
| The iroh-gossip implementation is based on those:
| https://docs.rs/iroh-gossip/latest/iroh_gossip/
| __turbobrew__ wrote:
| Thank you
| John23832 wrote:
| While not a book/paper/article, this is good implementation
| practice: https://fly.io/dist-sys/
| just_mc wrote:
| Check out SWIM and its extensions:
| https://en.wikipedia.org/wiki/SWIM_Protocol
| https://arxiv.org/abs/1707.00788
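|
| The core of a SWIM-style probe is small; a toy sketch with the
| transport abstracted away (ping is any callable returning True
| if the second node answered the first within the probe timeout):
|
|     # Direct probe first; on failure, ask up to k other members
|     # to probe the target for us (ping-req), which routes around
|     # a bad link between us and the target before we mark it
|     # suspect.
|     import random
|
|     def probe(target, members, ping, k=3):
|         if ping("me", target):
|             return "alive"
|         others = [m for m in members if m != target]
|         helpers = random.sample(others, min(k, len(others)))
|         if any(ping(h, target) for h in helpers):
|             return "alive"
|         return "suspect"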
| candiddevmike wrote:
| I've been noodling a lot on how IP/ARP works as a "distributed
| system". Are there any reference distributed systems that have a
| similar setup of "optimistic"/best effort delivery? IPv6 and NDP
| seem like they could scale a lot; what would be the negatives
| of using a similar design for RPC?
| jeffbee wrote:
| Some fuzzy thinking in here. "A heartbeat sent from a node in
| California to a monitor in Virginia might take 80 milliseconds
| under normal conditions, but could spike to 200 milliseconds
| during periods of congestion." This is not really the effect of
| congestion, or at best this sentence misleads the reader. The
| mechanism that causes high latency during congestion is dropped
| frames, which are retried at the protocol level based on timers.
| You can get a 200ms delay between two nodes even if they are
| adjacent, because the TCP minimum RTO is 200ms.
| toast0 wrote:
| Congestion manifests as packet queueing as well as packet
| dropping. 120 ms would be a lot of queuing, especially if we
| assume the 1000 node cluster is servers on high bandwidth
| networks, but some network elements are happy to buffer that
| much without dropping packets.
|
| You could also get a jump to 200 ms round trip if a link in the
| path goes down and a significantly less optimal route is
| chosen. Again, 120 ms is a large change, but routing policies
| don't always result in the best real world results; and while
| the link status propagates, packets may take longer paths or
| loop.
| westurner wrote:
| Why can't network time synchronization services like SPTP and
| WhiteRabbit also solve for heartbeats in distributed systems?
| QuiCasseRien wrote:
| Nice article, I will use the concept for my own network node bots
| ^^
___________________________________________________________________
(page generated 2025-11-13 23:00 UTC)