Monday, December 29, 2025

Patroni Failover Testing: The Part Everyone Skips (Until Production Breaks)

Patroni Failover Test Scripts – PostgreSQL HA Validation (Part 2)

⏱️ Estimated Reading Time: 11-12 minutes


Your Patroni cluster looks healthy. Primary is up, replicas are streaming, and applications are connected. But the real question is simple: have you ever tested failover?

In production, untested failover is worse than no failover. A broken promotion, stale replication slot, or delayed leader election can turn a simple node crash into minutes of complete outage.

This guide provides real Patroni failover test scripts used by production DBAs to validate leader election, replica promotion, and client recovery — before incidents happen.

[Image: PostgreSQL Patroni failover testing dashboard showing primary crash simulation, replica promotion timeline, leader election status, and application reconnection metrics]

Table of Contents

  1. Why You Must Test Patroni Failover Regularly
  2. Production-Ready Patroni Failover Test Scripts
  3. Failover Output & Analysis Explained
  4. Critical Components: Patroni Failover Mechanics
  5. Troubleshooting Common Failover Issues
  6. How to Automate Failover Testing
  7. Interview Questions: Patroni Failover Scenarios
  8. Final Summary
  9. FAQ
  10. About the Author

1. Why You Must Test Patroni Failover Regularly

  • False Confidence: Healthy cluster but broken promotion logic.
  • Extended Downtime: Failover takes >60 seconds instead of <10.
  • Split Brain Risk: Two writable primaries during network issues.
  • Application Impact: P99 latency jumps from 30ms to 6+ seconds.

2. Production-Ready Patroni Failover Test Scripts

Prerequisites:
  • Patroni cluster running (Part 1 completed)
  • patronictl installed
  • SSH access to all nodes
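Before running any test, a quick pre-flight check helps confirm the prerequisites above. The sketch below is illustrative (the script name is an assumption; the /etc/patroni.yml path matches the rest of this series):

📋 preflight_check.sh (illustrative)
#!/bin/bash
# Pre-flight check: verify patronictl is available and the cluster is reachable
set -euo pipefail

if ! command -v patronictl >/dev/null 2>&1; then
    echo "patronictl not found in PATH" >&2
    exit 1
fi

# Exits non-zero if the cluster or its DCS cannot be reached
patronictl -c /etc/patroni.yml list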
📋 failover_primary_kill.sh
#!/bin/bash
# Script: Primary Crash Simulation
# Author: Chetan Yadav
# Usage: Run on the current primary node

# The oldest "postgres" process is the postmaster
PRIMARY_PID=$(pgrep -o -x postgres)

if [ -z "$PRIMARY_PID" ]; then
    echo "PostgreSQL not running"
    exit 1
fi

echo "Killing PostgreSQL postmaster PID: $PRIMARY_PID"
kill -9 "$PRIMARY_PID"
📋 check_failover_status.sh
#!/bin/bash
# Script: Patroni Cluster Status Check
# Author: Chetan Yadav

patronictl -c /etc/patroni.yml list
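During a test it helps to keep this view refreshing in a second terminal so you can watch the Role and State columns change in real time, for example:

watch -n1 'patronictl -c /etc/patroni.yml list'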

3. Failover Output & Analysis Explained

Check Component     | Healthy Failover        | Red Flags
--------------------|-------------------------|------------------------
Leader Election     | New primary in <10 sec  | No leader after 30 sec
Replica Promotion   | Clean promotion         | Timeline mismatch
Client Recovery     | Reconnect in <5 sec     | Connection storms
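To put numbers on the first row, the sketch below polls the Patroni REST API until some node reports itself as leader and prints how long that took. The node names and the default API port 8008 are assumptions; note also that if only the postgres process is killed, Patroni may restart it in place rather than fail over, in which case this measures recovery time instead.

📋 measure_failover_time.sh (illustrative)
#!/bin/bash
# Measure how long it takes until some node holds the Patroni leader lock.
# NODES and port 8008 are assumptions - adjust for your cluster.
NODES="node1 node2 node3"
START=$(date +%s)

while true; do
    for NODE in $NODES; do
        # GET /leader returns HTTP 200 only on the node holding the leader lock
        CODE=$(curl -s --max-time 2 -o /dev/null -w '%{http_code}' "http://${NODE}:8008/leader")
        if [ "$CODE" = "200" ]; then
            echo "New leader on ${NODE} after $(( $(date +%s) - START )) seconds"
            exit 0
        fi
    done
    sleep 1
done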

4. Critical Components: Patroni Failover Mechanics

DCS TTL (Time To Live)

The TTL controls how long the leader key in the DCS stays valid without being renewed. Set it too high and the cluster is slow to notice a dead primary; set it too low and brief network hiccups can trigger unnecessary demotions.
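In Patroni these timings live in the dynamic configuration as ttl, loop_wait and retry_timeout. A hedged sketch of inspecting and adjusting them with patronictl (the values shown are common defaults, not recommendations, and option spelling can vary slightly between versions):

# Show the current dynamic configuration, including ttl, loop_wait and retry_timeout
patronictl -c /etc/patroni.yml show-config

# Adjust the leader key TTL and loop interval (applies cluster-wide via the DCS)
patronictl -c /etc/patroni.yml edit-config --set ttl=30 --set loop_wait=10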

Replication Lag Threshold

Controls how far behind a replica may be (in bytes) and still take part in leader election. Set it too high and a badly lagging replica can be promoted, losing committed transactions; set it too low and no replica may qualify at all.
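The setting involved is maximum_lag_on_failover, measured in bytes (the default is 1 MB). One way to see how each replica compares against it, assuming streaming replication and PostgreSQL 10 or newer, is to check lag from the primary:

# Per-replica replay lag in bytes, as seen from the primary
sudo -u postgres psql -c "SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
  FROM pg_stat_replication;"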

Fencing

Ensures the old primary cannot keep accepting writes after a new leader is elected. Patroni demotes a node as soon as it loses the leader lock, and an optional watchdog can hard-reset the host if the demotion hangs.
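A simple post-test check, run on the old primary once it rejoins, confirms fencing worked and the node came back as a read-only replica:

# Should print 't': the former primary is now in recovery (read-only replica)
sudo -u postgres psql -tAc "SELECT pg_is_in_recovery();"

# Patroni's view: the former primary should now show a Replica role
patronictl -c /etc/patroni.yml list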

5. Troubleshooting Common Failover Issues

Issue: No Replica Promoted

Symptom: Cluster stuck without leader.

Root Cause: All replicas exceeded lag threshold.

Resolution:

  1. Check replication lag on each replica
  2. Raise maximum_lag_on_failover (accepting more potential data loss) or bring the lagging replicas back within the threshold (a sketch of both steps follows)
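A sketch of both steps (paths and the example value are illustrative):

# Step 1: how far behind is this replica? Run on each replica node.
sudo -u postgres psql -tAc "SELECT now() - pg_last_xact_replay_timestamp();"

# Step 2: allow more lag during leader election (2 MB here, purely as an example)
patronictl -c /etc/patroni.yml edit-config --set maximum_lag_on_failover=2097152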
[Diagram: Patroni failover workflow covering primary failure detection, distributed consensus via the DCS, replica promotion decision, and the application reconnection sequence]

6. How to Automate Failover Testing

Method 1: Cron-Based Chaos Test

📋 Crontab entry for scheduled chaos testing
# Every Sunday at 03:00, kill the primary and let Patroni fail over
0 3 * * 0 /scripts/failover_primary_kill.sh

Method 2: Cloud Monitoring Integration

Track failover duration using Prometheus or CloudWatch metrics.
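As a sketch, if a test run records the failover duration in seconds, shipping it to CloudWatch takes one AWS CLI call (the namespace and metric name below are examples):

# $DURATION holds the measured failover time in seconds
aws cloudwatch put-metric-data \
    --namespace "PostgreSQL/Patroni" \
    --metric-name "FailoverDurationSeconds" \
    --value "$DURATION"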

Method 3: CI/CD Validation

Run failover tests before production releases.
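A minimal gate for a pipeline step might look like this, assuming a hypothetical helper (the measurement sketch from section 3, adapted to print just the number of seconds) and the 10-second target used earlier:

# Fail the release pipeline if staging failover exceeds the target
MAX_SECONDS=10
DURATION=$(/scripts/measure_failover_time.sh)   # hypothetical helper printing seconds
if [ "$DURATION" -gt "$MAX_SECONDS" ]; then
    echo "Failover took ${DURATION}s, limit is ${MAX_SECONDS}s" >&2
    exit 1
fi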

7. Interview Questions: Patroni Failover Scenarios

Q: How does Patroni decide which replica to promote?

A: Based on replication lag, timeline, and DCS consensus.

Q: What causes split brain?

A: Broken fencing or DCS network partitions.

Q: How fast should failover be?

A: Ideally under 10 seconds.

Q: How do you test Patroni safely?

A: Use scheduled chaos tests in non-peak hours.

Q: Can failover be manual?

A: Yes, using patronictl switchover.
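For reference, a planned switchover looks roughly like this; the node names are examples and the flag is spelled --master on older Patroni releases:

# Move leadership from the current leader (node1) to node2 without prompting
patronictl -c /etc/patroni.yml switchover --leader node1 --candidate node2 --force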

8. Final Summary

Failover that is not tested is a liability. Patroni provides the tooling — validation is your responsibility.

These scripts give you confidence that PostgreSQL HA will behave correctly when it matters most.

Key Takeaways:
  • Test failover regularly
  • Measure failover time
  • Validate fencing
  • Automate chaos testing

9. FAQ

Does killing postgres risk data loss?

A: Not if the promoted replica is fully caught up. With asynchronous replication, any transactions committed on the primary but not yet streamed to the replica are lost when it is promoted.

How often should failover be tested?

A: Monthly in staging, quarterly in production.

Is this safe for production?

A: Only during approved maintenance windows.

Does Patroni log failovers?

A: Yes, in Patroni and PostgreSQL logs.

Better than repmgr?

A: Patroni automates failover end to end and relies on a distributed consensus store (etcd, Consul or ZooKeeper) for leader election, which makes split brain less likely than with repmgr's daemon-driven approach.

10. About the Author

Chetan Yadav is a Senior Oracle, PostgreSQL, MySQL and Cloud DBA with 14+ years of experience supporting high-traffic production environments across AWS, Azure and on-premise systems. His expertise includes Oracle RAC, ASM, Data Guard, performance tuning, HA/DR design, monitoring frameworks and real-world troubleshooting.

He trains DBAs globally through deep-dive technical content, hands-on sessions and automation workflows. His mission is to help DBAs solve real production problems and advance into high-paying remote roles worldwide.

Connect & Learn More:
📊 LinkedIn Profile
🎥 YouTube Channel

