Cluster Outage Postmortem: Missing Keepalived

       
       
       Published on: 2026-01-14 10:04
       
        🧹 Incident Trigger
       
       This outage was not caused by a power failure, kernel panic, or 
       misconfiguration. It was caused by my wife shutting down my main 
       server (Tiny) while vacuuming the living room.
       
       Tiny was not just another node. Tiny happened to be the entry point
       of the entire cluster. All external port forwarding rules pointed 
       at Tiny:
       
        - HTTPS
        - SSH
        - Service endpoints
        - Gemini
        - Gopher
        - Finger
       
       This meant that when Tiny went down, NOTHING could be reached from 
       outside my home network, even though other nodes were still up 
       and healthy.
       
        ☠️ Root Cause
       
       The cluster had no high-availability IP management. No:
       
        - Virtual IP (VIP)
        - Automatic failover
        - Master election
        - Redundancy at the network layer
       
       All service ports were implicitly tied to a single machine.
       
       This meant: One shutdown = total service loss.
       
        📡 Detection
       
       The failure was immediately visible:
       
        - All services unreachable
        - External access dead
        - Internal routing broken
        - No automatic recovery
       
       This confirmed the presence of a hard single point of failure at 
       the network ingress layer.
       
        🛠️ Resolution: Keepalived Deployment
       
       Keepalived is a Linux daemon that provides high availability (HA) 
       and load balancing for server clusters. It implements the Virtual 
       Router Redundancy Protocol (VRRP), which lets a Virtual IP (VIP) 
       fail over between servers so that service continues even if the 
       master node fails.
       
 (HTM)  🔖 KeepAlived Official Website
       
       I installed keepalived on all 5 machines in the cluster. 
       
           sudo apt install keepalived
       
       
       Each node now participates in a VRRP group. The new setup looks 
       like this:
       
        - 5 nodes
        - 1 shared Virtual IP (VIP)
        - Automatic master election
        - Automatic failover
        - Health-based priority handling
       
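       If the active node goes down, another node takes over the VIP 
       within a few seconds: with advert_int 1, a backup declares the 
       master dead after about three missed advertisements. No manual 
       intervention required.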
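       
       The "health-based priority handling" item can be expressed with a 
       vrrp_script block. What follows is a minimal sketch rather than 
       the exact config running here: the script path and the name 
       chk_ingress are hypothetical, and the weight assumes the 200/100 
       priorities used in the configuration examples.
       
           # Hypothetical health check: any command that exits non-zero
           # on failure will do.
           vrrp_script chk_ingress {
               script "/usr/local/bin/check_ingress.sh"
               interval 2      # run the check every 2 seconds
               fall 2          # 2 consecutive failures = unhealthy
               rise 2          # 2 consecutive successes = healthy again
               weight -150     # subtracted from priority while unhealthy
           }
       
           vrrp_instance VI_1 {
               # ... existing settings unchanged ...
               track_script {
                   chk_ingress
               }
           }
       
       While the check fails on a priority-200 node, its effective 
       priority drops to 50, below a priority-100 backup, so the VIP 
       moves even though the machine itself is still powered on.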
       
        🚪 What Is VRRP?
       
       VRRP (Virtual Router Redundancy Protocol) allows multiple machines 
       to share a single IP address.
       
       At any given time:
       
        - One node is MASTER
        - The others are BACKUP
        - The MASTER sends periodic advertisements via multicast; BACKUPs listen
        - Priority determines who becomes master
        - If the master disappears, the highest-priority backup takes over
       
       The VIP is moved automatically between machines. To the network, 
       nothing changes. To clients, nothing breaks.
       
        🌐 Network Architecture Update
       
       All port forwardings now target the shared virtual IP, not a 
       specific machine. This removes the dependency on any single node 
       acting as the gateway.
       
       Benefits:
       
        - No more node-specific NAT rules
        - No more hardcoded ingress points
        - Transparent failover
        - Stable external endpoint
        - Services survive node shutdowns
       
       From the outside, the cluster now appears as one resilient system.
       
       📄 Keepalived Configuration Examples
       
       MASTER Node Example
       
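           vrrp_instance VI_1 {
               state MASTER              # preferred owner of the VIP
               interface eth0            # NIC that carries the VIP
               virtual_router_id 41      # identical on every node
               priority 200              # highest priority wins
               advert_int 1              # advertisement interval, seconds
               virtual_ipaddress {
                   # Example value: use an unused host address, not the
                   # subnet broadcast address (.255 on a /24).
                   192.168.1.250/24
               }
           }
       
       BACKUP Node Example
       
           vrrp_instance VI_1 {
               state BACKUP              # takes over if the master vanishes
               interface eth0            # NIC that carries the VIP
               virtual_router_id 41      # identical on every node
               priority 100              # lower than the master
               advert_int 1              # advertisement interval, seconds
               virtual_ipaddress {
                   # Same example VIP as on the master.
                   192.168.1.250/24
               }
           }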
       
        📘 Notes
       
        - virtual_router_id must match on all nodes
        - Highest priority wins
        - The interface name must match the actual NIC on each machine
        - VIP (shared IP address) should not be statically assigned anywhere
       
       With 5 nodes, I simply stagger priorities.
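       
       One way to stagger them, as a sketch: the exact numbers are 
       assumptions, only the ordering matters. Shown here for the third 
       of five nodes.
       
           # Node 3 of 5 (assumed ladder: 200, 190, 180, 170, 160)
           vrrp_instance VI_1 {
               state BACKUP      # only the preferred node starts as MASTER
               priority 180      # unique per node, lower = later in line
               # interface (per machine), virtual_router_id, advert_int
               # and virtual_ipaddress as in the BACKUP example above
           }
       
       If the priority-200 node disappears, the 190 node wins the 
       election. When the 200 node comes back it reclaims the VIP, 
       because keepalived preempts by default.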
       
        🧠 Lessons Learned
       
       High availability is not about having multiple machines. 
       It's about:
       
        - eliminating single points of failure
        - automating failover
        - designing for accidental shutdowns
        - assuming humans will unplug things
       
       
        🪦 Epilogue: The Sacrifice of Tiny
       
       Tiny did not crash. Tiny was chosen: chosen by fate and dust. 
       Chosen by a power button. And from its untimely shutdown, a 
       highly available cluster was born.
       
       Your sacrifice will be remembered, Tiny.
       