[HN Gopher] Heartbeats in Distributed Systems
       ___________________________________________________________________
        
       Heartbeats in Distributed Systems
        
       Author : sebg
       Score  : 90 points
       Date   : 2025-11-13 13:43 UTC (9 hours ago)
        
 (HTM) web link (arpitbhayani.me)
 (TXT) w3m dump (arpitbhayani.me)
        
       | paulsutter wrote:
       | > When a system uses very short intervals, such as sending
       | heartbeats every 500 milliseconds
       | 
       | 500 milliseconds is a very long interval, on a CPU timescale.
       | Funny how we all tend to judge intervals based on human
       | timescales
       | 
       | Of course the best way to choose heartbeat intervals is based on
       | metrics like transaction failure rate or latency
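        | 
        | A minimal sketch of that idea (hypothetical helper, not from
        | the article): derive the heartbeat timeout from recent
        | round-trip samples instead of hardcoding it.
        | 
        |     package heartbeat
        | 
        |     import (
        |         "math"
        |         "time"
        |     )
        | 
        |     // timeoutFromRTTs picks a heartbeat timeout from observed
        |     // round trips: mean plus k standard deviations, floored so
        |     // a quiet network can't produce an absurdly tight value.
        |     func timeoutFromRTTs(rtts []time.Duration, k float64,
        |         floor time.Duration) time.Duration {
        |         if len(rtts) == 0 {
        |             return floor
        |         }
        |         var sum float64
        |         for _, r := range rtts {
        |             sum += float64(r)
        |         }
        |         mean := sum / float64(len(rtts))
        |         var sq float64
        |         for _, r := range rtts {
        |             d := float64(r) - mean
        |             sq += d * d
        |         }
        |         std := math.Sqrt(sq / float64(len(rtts)))
        |         if t := time.Duration(mean + k*std); t > floor {
        |             return t
        |         }
        |         return floor
        |     }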
        
         | blipvert wrote:
          | Well, it is called a heartbeat after all, not an oscillator
          | beat :-)
        
         | hinkley wrote:
         | Top shelf would be noticing an anomaly in behavior for a node
         | and then interrogating it to see what's wrong.
         | 
          | Automatic load balancing always gets weird because it can end
          | up sending more traffic to the sick server instead of less,
          | since failed responses come back faster than real work. So you
          | have to be careful with status codes.
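          | 
          | A rough sketch of that guard (hypothetical types, just to
          | illustrate the point): let response time drive routing only
          | for backends whose recent responses actually succeeded.
          | 
          |     package lb
          | 
          |     // Backend tracks a rolling window of results for one
          |     // server.
          |     type Backend struct {
          |         Name      string
          |         AvgMillis float64 // recent average response time
          |         OKRate    float64 // fraction of recent 2xx responses
          |     }
          | 
          |     // pick prefers the fastest backend, but only among those
          |     // actually answering successfully, so a sick server
          |     // returning fast errors never wins on latency alone.
          |     func pick(backends []Backend,
          |         minOKRate float64) *Backend {
          |         var best *Backend
          |         for i := range backends {
          |             b := &backends[i]
          |             if b.OKRate < minOKRate {
          |                 continue // fast but failing: skip it
          |             }
          |             if best == nil || b.AvgMillis < best.AvgMillis {
          |                 best = b
          |             }
          |         }
          |         return best // nil: nothing healthy, caller decides
          |     }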
        
         | just_mc wrote:
         | You have to consider the tail latencies of the system
         | responding plus the network in between. The p99 is typically
          | much higher than the average. Also, you may have to account
          | for GC, as mentioned in the article. 500ms gets used up pretty
          | fast.
        
       | macintux wrote:
       | Related advice based on my days working at Basho: find a way to
       | recognize, and terminate, slow-running (or erratically-behaving)
       | servers.
       | 
       | A dead server is _much_ better for a distributed system than a
       | misbehaving one. The latter can bring down your entire
       | application.
        
         | rcxdude wrote:
         | Indeed, which is why I've heard of failover setups where the
         | backup has a means to make very sure that the main system is
         | off before it takes over (often by cutting the power).
        
           | owl_vision wrote:
            | I did this systematically: at the first sign of an outlier
            | in performance, one system would move itself to another
            | platform and shut itself down. The shutdown meant turning
            | all services off and letting someone log in to investigate
            | and rearrange things again. This setup allowed different
            | roles to be assigned to different platforms. A platform was
            | either bare metal or a bhyve VM. It worked perfectly.
        
           | xyzzy_plugh wrote:
            | Usually we call this STONITH (Shoot The Other Node In The
            | Head).
        
         | westurner wrote:
         | Docker and Kubernetes have health check mechanisms to help
         | solve for this;
         | 
         | Docker docs > Dockerfile HEALTHCHECK instruction:
         | https://docs.docker.com/reference/dockerfile/#healthcheck
         | 
         | Podman docs > podman-healthcheck-run, docker-healthcheck-run:
         | https://docs.podman.io/en/v5.4.0/markdown/podman-healthcheck...
         | 
         | Kubernetes docs > "Configure Liveness, Readiness and Startup
         | Probes" https://kubernetes.io/docs/tasks/configure-pod-
         | container/con...
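          | 
          | The container-side half of that is just a cheap endpoint for
          | the probe to hit; a minimal sketch (path and port are
          | arbitrary):
          | 
          |     package main
          | 
          |     import (
          |         "log"
          |         "net/http"
          |     )
          | 
          |     // A minimal liveness endpoint for a Dockerfile
          |     // HEALTHCHECK or a Kubernetes livenessProbe to poll.
          |     func main() {
          |         http.HandleFunc("/healthz",
          |             func(w http.ResponseWriter, r *http.Request) {
          |                 // A real service would check critical
          |                 // dependencies here and return 500 if one
          |                 // is down.
          |                 w.WriteHeader(http.StatusOK)
          |             })
          |         log.Fatal(http.ListenAndServe(":8080", nil))
          |     }
          | 
          | Paired with something like HEALTHCHECK CMD curl -f
          | http://localhost:8080/healthz || exit 1 in the Dockerfile, or
          | an httpGet livenessProbe in the pod spec.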
        
       | toast0 wrote:
       | > Consider a system with 1000 nodes where each node sends
       | heartbeats to a central monitor every 500 milliseconds. This
       | results in 2000 heartbeat messages per second just for health
       | monitoring. In a busy production environment, this overhead can
       | interfere with actual application traffic.
       | 
        | If your 1000-node busy production environment is run so close
        | to the edge that 2000 heartbeat messages per second push it
        | into overload, that's impressive resource scheduling.
       | 
        | Really, setting the interval balances speed of detection (the
        | cost of slow detection) against the cost of reacting to a
        | momentary interruption. If the node actually dies, you'd like
        | to react as soon as possible; but if it's something like a link
        | flap or a system pause (GC or otherwise), most applications
        | would prefer to wait and not transition state. Some
        | applications, like live broadcast, are better served by moving
        | very rapidly, and 500 ms might be too long.
       | 
       | Re: network partitioning, the author left out the really fun
       | splits. Say you have servers in DC, TX, and CA. If there's a
       | damaged (but not severed) link between TX and CA, there's a good
       | chance that DC can talk to everyone, but TX and CA can't
        | communicate. You can have that inside a datacenter too: maybe
        | each node can only reach 75% of the other nodes, and the fact
        | that A can reach B and B can reach C does not mean A can reach
        | C. Lots of fun times there.
        
         | macintux wrote:
         | Kingsbury & Bailis's paper on the topic of network partitions:
         | https://github.com/aphyr/partitions-post
        
         | karmakaze wrote:
          | I've dealt with exactly this. We had a couple thousand webapp
          | server instances that held connections to a MySQL database.
          | Each one only polled its connection for liveness once per
          | second, but those little interruptions kept poking at the
          | servers and showed up on our charts of top time-consuming
          | requests.
        
         | hinkley wrote:
          | When systems were smaller I tried to push for the realization
          | that I don't need a heartbeat from a machine that is currently
          | returning status 200 responses at 60 req/s. The evidence of
          | work is already there, and it is more meaningful than the
          | status check.
          | 
          | We end up adding real work to the status checks often enough
          | anyway, to make sure the database and other services are still
          | visible. So inference has a lot of power that a heartbeat does
          | not.
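          | 
          | Sketch of that inference (hypothetical names): record the last
          | time real work succeeded, and only fall back to an explicit
          | probe when the node has gone quiet.
          | 
          |     package health
          | 
          |     import (
          |         "sync/atomic"
          |         "time"
          |     )
          | 
          |     // lastOK holds the unix-nano time of the most recent
          |     // successful request; every 200 response updates it.
          |     var lastOK atomic.Int64
          | 
          |     // RecordSuccess is called from the normal request path.
          |     func RecordSuccess() {
          |         lastOK.Store(time.Now().UnixNano())
          |     }
          | 
          |     // NeedsProbe reports whether the node has been idle long
          |     // enough that real traffic no longer proves it is alive
          |     // and an explicit status check is worth sending.
          |     func NeedsProbe(idle time.Duration) bool {
          |         last := time.Unix(0, lastOK.Load())
          |         return time.Since(last) > idle
          |     }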
        
           | taeric wrote:
           | I'm amused that you got pushback. Just going on the metaphor,
           | the very way that we check heartbeats in people is by a proof
           | of work in some other part of the body. Isn't like we are
           | directly hooked into the heart. It isn't like the heart sends
           | out an otherwise useless signal.
        
         | zbentley wrote:
         | > If your 1000-node busy production environment is run so close
          | to the edge that 2000 heartbeat messages per second push it
         | into overload, that's impressive resource scheduling.
         | 
         | Eeeh I'm not so sure. The overhead of _handling_ a hello-world
         | heartbeat request is negligible, sure, but what about the
         | overhead of having the connections open (file descriptors,
          | maybe >1 per request), client tracking metadata for what are
         | necessarily all different client location identifiers, and so
         | on?
         | 
         | That's still cheap stuff, but at 2krps there are totally
         | realistic scenarios where a system with decent capacity
         | budgeting could still be adversely affected by heartbeats.
         | 
         | And what if a heartbeat client's network link is degraded and
         | there's a super long time between first byte and last? Whether
         | or not that client gets evicted from the cluster, if it's
         | basically slowloris-ing the server that can cause issues too.
        
           | toast0 wrote:
           | File descriptors are a limited resource, but the limits are
           | huge. My little 2GB instances on GCP claim a limit of 1M;
           | FreeBSD autotunes my 16GB servers to 0.5M (but I could
           | increase it if I needed).
           | 
           | I just don't know how you have a 1000 node system and you
           | can't manage to heartbeat everything 2x a second; I don't
           | think you need that many heartbeats in most systems, but it's
           | just not that much work. The only way I can see it being a
           | lot of work is if your nodes are _very_ small; but do you
            | really need a 1000 node esp8266 cluster and you can't get
           | anything bigger for the management node?
           | 
           | > And what if a heartbeat client's network link is degraded
           | and there's a super long time between first byte and last?
           | Whether or not that client gets evicted from the cluster, if
           | it's basically slowloris-ing the server that can cause issues
           | too.
           | 
            | How big is your heartbeat? Is the response really going to
            | span multiple packets? With a 500ms heartbeat and a typical
            | 3-5x no response => dead rule, you're going to hold onto the
            | partial heartbeat for like 2 seconds.
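            | 
            | For what it's worth, that "3-5x no response => dead" rule is
            | tiny to implement; a sketch (made-up names):
            | 
            |     package detector
            | 
            |     import "time"
            | 
            |     // Detector declares a peer dead only after several
            |     // consecutive heartbeat intervals with no response, so
            |     // one lost packet or a short pause doesn't trigger a
            |     // failover. Call Beat once when the peer joins.
            |     type Detector struct {
            |         Interval time.Duration // e.g. 500ms
            |         Misses   int           // e.g. 4
            |         lastSeen time.Time
            |     }
            | 
            |     // Beat records a heartbeat (or any response).
            |     func (d *Detector) Beat() { d.lastSeen = time.Now() }
            | 
            |     // Dead reports whether the allowed number of intervals
            |     // has elapsed since we last heard from the peer.
            |     func (d *Detector) Dead() bool {
            |         deadline := d.lastSeen.Add(
            |             time.Duration(d.Misses) * d.Interval)
            |         return time.Now().After(deadline)
            |     }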
        
         | schmichael wrote:
         | > Really, setting the interval balances speed of detection/cost
         | of slow detection vs cost of reacting to a momentary
         | interruption.
         | 
          | Another option is dynamically adjusting the heartbeat interval
          | based on cluster size so that processing heartbeats has a
          | fixed cost. That's what Nomad does, and in my fuzzy 10-year
          | memory heartbeating has never caused resource constraints on
          | the schedulers:
          | https://developer.hashicorp.com/nomad/docs/configuration/ser...
          | For reference, clusters are commonly over 10k nodes and to my
          | knowledge peak between 20k and 30k. If anyone is running Nomad
          | larger than that, I'd love to hear from them!
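          | 
          | The arithmetic behind that is simple; a sketch (not Nomad's
          | actual implementation):
          | 
          |     package heartbeat
          | 
          |     import "time"
          | 
          |     // interval spreads heartbeats so the server processes at
          |     // most maxPerSecond of them regardless of cluster size,
          |     // clamped between min and max.
          |     func interval(nodes, maxPerSecond int,
          |         min, max time.Duration) time.Duration {
          |         if nodes <= 0 || maxPerSecond <= 0 {
          |             return min
          |         }
          |         iv := time.Duration(nodes) * time.Second /
          |             time.Duration(maxPerSecond)
          |         if iv < min {
          |             return min
          |         }
          |         if iv > max {
          |             return max
          |         }
          |         return iv
          |     }
          | 
          | At 10k nodes and 50 heartbeats/s that works out to one
          | heartbeat per node roughly every 200 seconds, which is the
          | liveness tradeoff mentioned below.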
         | 
          | That being said, the default of 50/s is probably too low, and
          | the liveness tradeoff we force on users is probably not
          | articulated clearly enough.
         | 
         | As an off-the-shelf scheduler we can't encode liveness costs
         | for our users unfortunately, but we try to offer the right
         | knobs to adjust it including per-workload parameters for what
         | to do when heartbeats fail:
         | https://developer.hashicorp.com/nomad/docs/job-specification...
         | 
         | (Disclaimer: I'm on the Nomad team)
        
         | tstack wrote:
         | > ... push it into overload ...
         | 
         | Oh, oh, I get to talk about my favorite bug!
         | 
         | I was working on network-booting servers with iPXE and we got a
         | bug saying that things were working fine until the cluster size
         | went over 4/5 machines. In a larger cluster, machines would not
          | come up from a reboot. I thought QA was just being silly; why
         | would the size of the cluster matter? I took a closer look and,
         | sure enough, was able to reproduce the bug. Basically, the
         | machine would sit there stuck trying to download the boot image
         | over TCP from the server.
         | 
         | After some investigation, it turned out to be related to the
         | heartbeats sent between machines (they were ICMP pings). Since
         | iPXE is a very nice and fancy bootloader, it will happily
         | respond to ICMP pings. Note that, in order to do this, it would
          | do an ARP to find the address to send the response to.
         | Unfortunately, the size of the ARP cache was pretty small since
         | this was "embedded" software (take a guess how big the cache
         | was...). Essentially, while iPXE was downloading the image, the
         | address of the image server would get pushed out of the ARP
         | cache by all these heartbeats. Thus, the download would suffer
         | since it had to constantly pause to redo the ARP request. So,
         | things would work with a smaller cluster size since the ARP
         | cache was big enough to keep track of the download server and
         | the peers in the cluster.
         | 
         | I think I "fixed" it by responding to the ICMP using the source
         | MAC address (making sure it wasn't broadcast) rather than doing
         | an ARP.
        
         | ignoramous wrote:
         | > _You can have that inside a datacenter too, maybe each node
         | can only reach 75% of the other nodes, but A can reach B and B
         | can reach C does not indicate A can reach C. Lots of fun times
         | there_
         | 
         | At _BigCloud_ in the early days, things went berserk with a
          | gossip system when A could reach B but B couldn't reach A.
         | 
         | Cloudflare hit something similar though they misclassified the
         | failure mode: https://blog.cloudflare.com/a-byzantine-failure-
         | in-the-real-...
        
       | __turbobrew__ wrote:
       | Does anyone have recommendations on books/papers/articles which
       | cover gossip protocols?
       | 
       | I have been more interested in learning about gossip protocols
       | and how they are used, different tradeoffs, etc.
        
         | rishabhaiover wrote:
         | https://thesecretlivesofdata.com/raft/
        
           | rdtsc wrote:
           | > https://thesecretlivesofdata.com/raft/
           | 
           | Are you suggesting to use raft as a gossip protocol? Run a
           | replicated state machine with leader election, replicated
           | logs and stable storage?
        
           | the_duke wrote:
           | Raft is a consensus protocol, which is very different from a
           | gossip protocol.
        
             | rishabhaiover wrote:
             | I'm sorry, I got confused.
        
         | the_duke wrote:
         | Two interesting papers:
         | 
         | * Epidemic broadcast trees:
         | https://asc.di.fct.unl.pt/~jleitao/pdf/srds07-leitao.pdf
         | 
         | * HyParView:
         | https://asc.di.fct.unl.pt/~jleitao/pdf/dsn07-leitao.pdf
         | 
         | The iroh-gossip implementation is based on those:
         | https://docs.rs/iroh-gossip/latest/iroh_gossip/
        
           | __turbobrew__ wrote:
           | Thank you
        
         | John23832 wrote:
         | While not a book/paper/article, this is good implementation
         | practice: https://fly.io/dist-sys/
        
         | just_mc wrote:
         | Check out SWIM and its extensions:
         | https://en.wikipedia.org/wiki/SWIM_Protocol
         | https://arxiv.org/abs/1707.00788
        
       | candiddevmike wrote:
       | I've been noodling a lot on how IP/ARP works as a "distributed
       | system". Are there any reference distributed systems that have a
       | similar setup of "optimistic"/best effort delivery? IPv6 and NDP
        | seem like they could scale a lot; what would be the negatives
        | of using a similar design for RPC?
        
       | jeffbee wrote:
       | Some fuzzy thinking in here. "A heartbeat sent from a node in
       | California to a monitor in Virginia might take 80 milliseconds
       | under normal conditions, but could spike to 200 milliseconds
       | during periods of congestion." This is not really the effect of
       | congestion, or at best this sentence misleads the reader. The
       | mechanism that causes high latency during congestion is dropped
       | frames, which are retried at the protocol level based on timers.
       | You can get a 200ms delay between two nodes even if they are
       | adjacent, because the TCP minimum RTO is 200ms.
        
         | toast0 wrote:
         | Congestion manifests as packet queueing as well as packet
         | dropping. 120 ms would be a lot of queuing, especially if we
         | assume the 1000 node cluster is servers on high bandwidth
         | networks, but some network elements are happy to buffer that
         | much without dropping packets.
         | 
         | You could also get a jump to 200 ms round trip if a link in the
         | path goes down and a significantly less optimal route is
         | chosen. Again, 120 ms is a large change, but routing policies
         | don't always result in the best real world results; and while
         | the link status propagates, packets may take longer paths or
         | loop.
        
       | westurner wrote:
       | Why can't network time synchronization services like SPTP and
       | WhiteRabbit also solve for heartbeats in distributed systems?
        
       | QuiCasseRien wrote:
       | Nice article, I will use the concept for my own network node bots
       | ^^
        
       ___________________________________________________________________
       (page generated 2025-11-13 23:00 UTC)