⏱️ Estimated Reading Time: 11-12 minutes
Patroni Failover Test Scripts (Part 2 of the Patroni Series)
Your Patroni cluster looks healthy. Primary is up, replicas are streaming, and applications are connected. But the real question is simple: have you ever tested failover?
In production, untested failover is worse than no failover. A broken promotion, stale replication slot, or delayed leader election can turn a simple node crash into minutes of complete outage.
This guide provides real Patroni failover test scripts used by production DBAs to validate leader election, replica promotion, and client recovery — before incidents happen.
Table of Contents
- Why You Must Test Patroni Failover Regularly
- Production-Ready Patroni Failover Test Scripts
- Failover Output & Analysis Explained
- Critical Components: Patroni Failover Mechanics
- Troubleshooting Common Failover Issues
- How to Automate Failover Testing
- Interview Questions: Patroni Failover Scenarios
- Final Summary
- FAQ
- About the Author
1. Why You Must Test Patroni Failover Regularly
Skipping regular failover drills exposes you to four classes of risk:
- False Confidence: The cluster looks healthy, but the promotion logic is broken.
- Extended Downtime: Failover takes >60 seconds instead of <10.
- Split Brain Risk: Two writable primaries during network issues.
- Application Impact: P99 latency jumps from 30ms to 6+ seconds.
2. Production-Ready Patroni Failover Test Scripts
Prerequisites:
- A running Patroni cluster (Part 1 of this series completed)
- patronictl installed
- SSH access to all nodes
#!/bin/bash
# Script: Primary Crash Simulation
# Author: Chetan Yadav
# Usage: Run on the current primary node
# Target only the postmaster (the oldest postgres process), not every backend.
# Note: if Patroni simply restarts PostgreSQL before the leader key expires, no failover
# happens; stop the patroni service as well if you want to simulate a full node crash.
PRIMARY_PID=$(pgrep -o -x postgres)
if [ -z "$PRIMARY_PID" ]; then
    echo "PostgreSQL not running"
    exit 1
fi
echo "Killing PostgreSQL postmaster PID: $PRIMARY_PID"
kill -9 "$PRIMARY_PID"
#!/bin/bash
# Script: Patroni Cluster Status Check
# Author: Chetan Yadav
# Usage: Run from any node; shows each member's role, state, timeline (TL), and lag
patronictl -c /etc/patroni.yml list
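The two scripts above crash the primary and show cluster state, but they do not tell you how long leader election actually took. The sketch below is one way to measure it: start it on a surviving node right before you trigger the crash. It assumes the config path /etc/patroni.yml and the default table layout of patronictl list (member name in the second column); adjust the parsing if your Patroni version prints a different format.
#!/bin/bash
# Sketch: measure leader election time. Start on a surviving node just before the crash test.
# Assumes the default `patronictl list` table layout; adjust the grep/awk parsing if needed.
CONF=/etc/patroni.yml
TIMEOUT=60
OLD_LEADER=$(patronictl -c "$CONF" list | grep -w Leader | awk '{print $2}')
echo "Current leader: ${OLD_LEADER:-unknown}"
START=$(date +%s)
while [ $(( $(date +%s) - START )) -lt $TIMEOUT ]; do
    NEW_LEADER=$(patronictl -c "$CONF" list 2>/dev/null | grep -w Leader | awk '{print $2}')
    if [ -n "$NEW_LEADER" ] && [ "$NEW_LEADER" != "$OLD_LEADER" ]; then
        echo "New leader $NEW_LEADER elected after $(( $(date +%s) - START )) seconds"
        exit 0
    fi
    sleep 1
done
echo "No new leader within ${TIMEOUT}s - investigate before relying on automatic failover"
exit 1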
3. Failover Output & Analysis Explained
| Check Component | Healthy Failover | Red Flags |
|---|---|---|
| Leader Election | New primary in <10 sec | No leader after 30 sec |
| Replica Promotion | Clean promotion | Timeline mismatch |
| Client Recovery | <5 sec reconnect | Connection storms |
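To check the "Timeline mismatch" red flag concretely, compare the TL column of patronictl list across members, or query each node directly. A minimal sketch, assuming local psql access as the postgres user and PostgreSQL 9.6 or later:
#!/bin/bash
# Sketch: verify timeline and recovery state on a node after failover.
echo "Timeline on this node:"
psql -U postgres -Atc "SELECT timeline_id FROM pg_control_checkpoint();"
echo "In recovery (i.e. running as a replica)?"
psql -U postgres -Atc "SELECT pg_is_in_recovery();"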
4. Critical Components: Patroni Failover Mechanics
DCS TTL (Time To Live)
TTL controls how long the leader key in the DCS stays valid without renewal. Set it too high and failover detection is slow; set it too low and transient DCS or network latency can trigger unnecessary leader elections.
Replication Lag Threshold (maximum_lag_on_failover)
Defines how many bytes a replica may lag behind and still take part in leader election. Set it too high and a lagging replica can be promoted, losing recently committed transactions; set it too low and no replica may qualify at all.
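Both settings above live in the DCS-level Patroni configuration. A minimal sketch for inspecting and changing them with patronictl, assuming the config file is at /etc/patroni.yml; the values shown are illustrative, not recommendations, and flag behavior can differ slightly between Patroni versions:
#!/bin/bash
# Sketch: inspect and tune DCS-level failover settings (illustrative values only).
patronictl -c /etc/patroni.yml show-config
patronictl -c /etc/patroni.yml edit-config \
    --set "ttl=30" \
    --set "maximum_lag_on_failover=1048576" \
    --force    # apply without the interactive confirmation prompt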
Fencing
Prevents the old primary from accepting writes after a new leader is elected. Patroni can additionally use a software or hardware watchdog so that a node which loses the leader lock cannot keep running as a writable primary.
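A quick post-failover fencing check is simply to ask the old primary whether it still accepts writes. A sketch, where OLD_PRIMARY is a placeholder for the node you just crashed and psql access as the postgres user is assumed:
#!/bin/bash
# Sketch: verify the old primary is fenced after failover. OLD_PRIMARY is a placeholder host.
OLD_PRIMARY=10.0.0.1
STATE=$(psql -h "$OLD_PRIMARY" -U postgres -Atc "SELECT pg_is_in_recovery();" 2>/dev/null)
if [ "$STATE" = "f" ]; then
    echo "DANGER: old primary still accepts writes - possible split brain"
else
    echo "OK: old primary is down or running as a read-only replica"
fi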
5. Troubleshooting Common Failover Issues
Issue: No Replica Promoted
Symptom: Cluster stuck without leader.
Root Cause: All replicas exceeded lag threshold.
Resolution:
- Check replication lag on each replica (a quick check is sketched below)
- If every replica exceeds the threshold, temporarily raise maximum_lag_on_failover (accepting possible data loss) or resolve the lag before forcing a promotion
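A quick way to see the actual lag is the "Lag in MB" column of patronictl list plus pg_stat_replication on the current primary. A sketch, assuming PostgreSQL 10+ column names and local psql access on the primary:
#!/bin/bash
# Sketch: check replication lag before touching maximum_lag_on_failover (run on the primary).
patronictl -c /etc/patroni.yml list
psql -U postgres -c "SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;"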
6. How to Automate Failover Testing
Method 1: Cron-Based Chaos Test
# Every Sunday at 03:00, simulate a primary crash on the designated test node
0 3 * * 0 /scripts/failover_primary_kill.sh
Method 2: Cloud Monitoring Integration
Track failover duration using Prometheus or CloudWatch metrics.
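One way to wire this up, assuming the default Patroni REST API port 8008 and Patroni 2.1+ for the /metrics endpoint (NODE is a placeholder):
#!/bin/bash
# Sketch: pull leader status and Prometheus-format metrics from Patroni's REST API.
NODE=10.0.0.1
curl -s -o /dev/null -w "leader endpoint: HTTP %{http_code}\n" "http://${NODE}:8008/leader"   # 200 only on the current leader
curl -s "http://${NODE}:8008/metrics" | grep -i '^patroni' | head    # scrape these into Prometheus/CloudWatch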
Method 3: CI/CD Validation
Run failover tests before production releases.
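A sketch of such a gate, assuming a staging cluster, SSH access from the CI runner to the primary (as a user allowed to kill the postmaster), the crash script path from the cron example above, and a 15-second RTO budget; PRIMARY_HOST and the budget are placeholders:
#!/bin/bash
# Sketch: CI gate for a staging cluster - block the release if failover is too slow.
set -eu
CONF=/etc/patroni.yml
PRIMARY_HOST=staging-pg1    # placeholder hostname
RTO=15
OLD_LEADER=$(patronictl -c "$CONF" list | grep -w Leader | awk '{print $2}')
ssh "$PRIMARY_HOST" /scripts/failover_primary_kill.sh   # crash script from section 2
for i in $(seq 1 "$RTO"); do
    sleep 1
    NEW_LEADER=$(patronictl -c "$CONF" list 2>/dev/null | grep -w Leader | awk '{print $2}' || true)
    if [ -n "${NEW_LEADER}" ] && [ "${NEW_LEADER}" != "${OLD_LEADER}" ]; then
        echo "PASS: ${NEW_LEADER} promoted in ${i}s"
        exit 0
    fi
done
echo "FAIL: no new leader within ${RTO}s - blocking release"
exit 1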
7. Interview Questions: Patroni Failover Scenarios
Q: How does Patroni choose which replica to promote?
A: Based on replication lag, timeline history, and DCS consensus.
Q: What can still cause a split-brain scenario?
A: Broken fencing or DCS network partitions.
Q: How long should a failover take?
A: Ideally under 10 seconds.
Q: How do you test failover safely in production?
A: Use scheduled chaos tests in non-peak hours.
Q: Can a failover be triggered manually?
A: Yes, using patronictl switchover.
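For reference, a planned switchover sketch; exact flag names vary between Patroni versions (older releases use --master, newer ones --leader), and node2 is a placeholder candidate:
#!/bin/bash
# Sketch: planned switchover to a chosen replica, skipping the interactive confirmation.
patronictl -c /etc/patroni.yml switchover --candidate node2 --force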
8. Final Summary
Failover that is not tested is a liability. Patroni provides the tooling — validation is your responsibility.
These scripts give you confidence that PostgreSQL HA will behave correctly when it matters most.
- Test failover regularly
- Measure failover time
- Validate fencing
- Automate chaos testing
9. FAQ
Q: Does an automatic failover lose data?
A: Not if replicas are caught up.
Q: How often should failover be tested?
A: Monthly in staging, quarterly in production.
Q: Can these tests be run on a production cluster?
A: Only during approved maintenance windows.
Q: Are failover events logged?
A: Yes, in Patroni and PostgreSQL logs.
Q: How does Patroni compare to manual failover?
A: Patroni is more automated and safer.