Published on: 2026-01-14 10:04
Incident Trigger
This outage was not caused by a power failure, kernel panic, or
misconfiguration. It was caused by my wife shutting down my main
server (Tiny) while vacuuming the living room.
Tiny was not just another node. Tiny happened to be the entry point
of the entire cluster. All external port forwarding rules pointed
at Tiny:
- HTTPS
- SSH
- Service endpoints
- Gemini
- Gopher
- Finger
This meant that when Tiny went down, NOTHING could be reached from
outside my home network, even though other nodes were still up
and healthy.
Root Cause
The cluster had no high-availability IP management. No:
- Virtual IP (VIP)
- Automatic failover
- Master election
- Redundancy at the network layer
All service ports were implicitly tied to a single machine.
This meant: One shutdown = total service loss.
Detection
The failure was immediately visible:
- All services unreachable
- External access dead
- Internal routing broken
- No automatic recovery
This confirmed the presence of a hard single point of failure at
the network ingress layer.
Resolution: Keepalived Deployment
Keepalived is a Linux daemon that provides high availability (HA)
and load balancing for server clusters. It implements the Virtual
Router Redundancy Protocol (VRRP), which lets a Virtual IP (VIP)
fail over between servers so that services stay reachable even if
the master node fails.
(HTM) KeepAlived Official Website
I installed keepalived on all 5 machines in the cluster.
    sudo apt install keepalived
Each node now participates in a VRRP group. The new setup looks
like this:
- 5 nodes
- 1 shared Virtual IP (VIP)
- Automatic master election
- Automatic failover
- Health-based priority handling
If the active node goes down, another node takes over the VIP
within seconds. No manual intervention required.
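Health-based priority handling depends on a check that keepalived
runs periodically (a vrrp_script block referenced from
track_script). Depending on the script's exit status, keepalived
adjusts the node's priority or marks it as faulty. Below is a
minimal sketch of such a check in Python, assuming the service to
watch is a TCP listener. The function names, host, and port are
hypothetical and not taken from the setup described here.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat all as unhealthy.
        return False


def health_exit_code(host: str, port: int) -> int:
    """Map the probe to a process exit status: 0 healthy, 1 unhealthy."""
    return 0 if port_is_open(host, port) else 1
```

To use it as a check script, one would end the file with something
like sys.exit(health_exit_code("127.0.0.1", 443)) and point the
vrrp_script at it. The port here is only an example.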
What Is VRRP?
VRRP (Virtual Router Redundancy Protocol) allows multiple machines
to share a single IP address.
At any given time:
- One node is MASTER
- The others are BACKUP
- The master advertises its state via multicast; backups listen
- Priority determines who becomes master
- If the master's advertisements stop, the highest-priority backup
  takes over
The VIP is moved automatically between machines. To the network,
nothing changes. To clients, nothing breaks.
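The election rule can be sketched in a few lines of Python. This
is a simplified model, not keepalived's code: it ignores timers
and only answers "who should hold the VIP right now?". The node
names (other than Tiny), priorities, and addresses are made up for
illustration. Ties on priority are broken by the higher interface
address, as the VRRP specification describes.

```python
from dataclasses import dataclass, replace
from ipaddress import IPv4Address


@dataclass(frozen=True)
class Node:
    name: str
    priority: int       # higher wins
    address: str        # primary IP of the VRRP interface
    alive: bool = True


def elect_master(nodes):
    """Return the live node that should hold the VIP, or None."""
    candidates = [n for n in nodes if n.alive]
    if not candidates:
        return None
    # Highest priority wins; equal priorities fall back to higher IP.
    return max(candidates,
               key=lambda n: (n.priority, IPv4Address(n.address)))


# Hypothetical three-node cluster.
cluster = [
    Node("tiny", 200, "192.168.1.10"),
    Node("node2", 150, "192.168.1.11"),
    Node("node3", 100, "192.168.1.12"),
]
print(elect_master(cluster).name)        # tiny

# Someone vacuums near Tiny.
after_vacuum = [replace(n, alive=False) if n.name == "tiny" else n
                for n in cluster]
print(elect_master(after_vacuum).name)   # node2
```

When Tiny comes back, the same rule hands the VIP back to it,
because it still has the highest priority.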
Network Architecture Update
All port forwardings now target the shared virtual IP, not a
specific machine. This removes the dependency on any single node
acting as the gateway.
Benefits:
- No more node-specific NAT rules
- No more hardcoded ingress points
- Transparent failover
- Stable external endpoint
- Services survive node shutdowns
From the outside, the cluster now appears as one resilient system.
Keepalived Configuration Examples
MASTER Node Example
    vrrp_instance VI_1 {
        state MASTER
        interface eth0
        virtual_router_id 41
        priority 200
        advert_int 1
        virtual_ipaddress {
            # Example VIP: must be an unused host address,
            # not the subnet's broadcast address (.255 on a /24)
            192.168.1.250/24
        }
    }
BACKUP Node Example
    vrrp_instance VI_1 {
        state BACKUP
        interface eth0
        virtual_router_id 41
        priority 100
        advert_int 1
        virtual_ipaddress {
            # Same example VIP as on the MASTER node
            192.168.1.250/24
        }
    }
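How long "within seconds" is can be estimated from advert_int and
priority. The sketch below assumes VRRP version 2 timers as defined
in RFC 3768, where a backup declares the master dead after three
missed advertisement intervals plus a small skew that shrinks as
priority grows. Real takeover time also includes ARP updates on the
network, which this does not model.

```python
def master_down_interval(priority: int, advert_int: float = 1.0) -> float:
    """Seconds a backup waits without advertisements before taking over.

    RFC 3768: Skew_Time = (256 - Priority) / 256
              Master_Down_Interval = 3 * Advertisement_Interval + Skew_Time
    """
    skew_time = (256 - priority) / 256
    return 3 * advert_int + skew_time


print(master_down_interval(100))   # 3.609375
print(master_down_interval(200))   # 3.21875
```

The skew is why the highest-priority backup times out first and
wins the takeover without a fight.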
Notes
- virtual_router_id must match on all nodes
- Highest priority wins
- The interface name must be correct for each machine
- The VIP (shared IP address) should not be statically assigned
  anywhere
With 5 nodes, I simply stagger priorities.
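Staggering priorities by hand gets tedious, so here is a hedged
sketch of generating one config per node. The host names (apart
from Tiny), the step of 10, the interface default, and the VIP
192.168.1.250/24 are placeholders, not the real values of this
cluster. The helper also rejects a VIP that is the network or
broadcast address of its subnet, since neither can be used as a
host address.

```python
from ipaddress import ip_interface

TEMPLATE = """\
vrrp_instance VI_1 {{
    state {state}
    interface {interface}
    virtual_router_id {vrid}
    priority {priority}
    advert_int 1
    virtual_ipaddress {{
        {vip}
    }}
}}
"""


def check_vip(vip: str) -> str:
    """Reject a VIP that is the network or broadcast address."""
    iface = ip_interface(vip)
    net = iface.network
    if net.num_addresses > 2 and iface.ip in (net.network_address,
                                              net.broadcast_address):
        raise ValueError(f"{vip} is not a usable host address in {net}")
    return vip


def render_configs(hosts, vip, vrid=41, top=200, step=10,
                   interface="eth0"):
    """Return {host: config text}; the first host gets top priority."""
    check_vip(vip)
    if top - (len(hosts) - 1) * step < 1:
        raise ValueError("priorities would drop below 1")
    configs = {}
    for index, host in enumerate(hosts):
        configs[host] = TEMPLATE.format(
            state="MASTER" if index == 0 else "BACKUP",
            interface=interface,
            vrid=vrid,
            priority=top - index * step,   # 200, 190, 180, ...
            vip=vip,
        )
    return configs


# Hypothetical host names, in the desired failover order.
hosts = ["tiny", "node2", "node3", "node4", "node5"]
configs = render_configs(hosts, "192.168.1.250/24")
print(configs["node3"])
```

The order of the hosts list is the failover order: if the first
node disappears, the second one takes the VIP, and so on.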
Lessons Learned
High availability is not about having multiple machines.
It's about:
- eliminating single points of failure
- automating failover
- designing for accidental shutdowns
- assuming humans will unplug things
Epilogue: The Sacrifice of Tiny
Tiny did not crash. Tiny was chosen: chosen by fate and dust.
Chosen by a power button. And from its untimely shutdown, a
highly available cluster was born.
Your sacrifice will be remembered, Tiny.