Monday, December 29, 2025

Patroni Failover Testing: The Part Everyone Skips (Until Production Breaks)

Patroni Failover Test Scripts – PostgreSQL HA Validation (Part 2)

⏱️ Estimated Reading Time: 11-12 minutes


Your Patroni cluster looks healthy. Primary is up, replicas are streaming, and applications are connected. But the real question is simple: have you ever tested failover?

In production, untested failover is worse than no failover. A broken promotion, stale replication slot, or delayed leader election can turn a simple node crash into minutes of complete outage.

This guide provides real Patroni failover test scripts used by production DBAs to validate leader election, replica promotion, and client recovery — before incidents happen.

[Image: PostgreSQL Patroni failover testing dashboard showing primary crash simulation, replica promotion timeline, leader election status, and application reconnection metrics]

Table of Contents

  1. Why You Must Test Patroni Failover Regularly
  2. Production-Ready Patroni Failover Test Scripts
  3. Failover Output & Analysis Explained
  4. Critical Components: Patroni Failover Mechanics
  5. Troubleshooting Common Failover Issues
  6. How to Automate Failover Testing
  7. Interview Questions: Patroni Failover Scenarios
  8. Final Summary
  9. FAQ
  10. About the Author

1. Why You Must Test Patroni Failover Regularly

  • False Confidence: Healthy cluster but broken promotion logic.
  • Extended Downtime: Failover takes >60 seconds instead of <10.
  • Split Brain Risk: Two writable primaries during network issues.
  • Application Impact: P99 latency jumps from 30ms to 6+ seconds.

2. Production-Ready Patroni Failover Test Scripts

Prerequisites:
  • Patroni cluster running (Part 1 completed)
  • patronictl installed
  • SSH access to all nodes
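Before running any test, a quick pre-flight check helps confirm the prerequisites above. The sketch below is illustrative (the script name is an assumption; the /etc/patroni.yml path matches the rest of this series):

📋 preflight_check.sh (illustrative)
#!/bin/bash
# Pre-flight check: verify patronictl is available and the cluster is reachable
set -euo pipefail

if ! command -v patronictl >/dev/null 2>&1; then
    echo "patronictl not found in PATH" >&2
    exit 1
fi

# Exits non-zero if the cluster or its DCS cannot be reached
patronictl -c /etc/patroni.yml list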
📋 failover_primary_kill.sh
#!/bin/bash
# Script: Primary Crash Simulation
# Author: Chetan Yadav
# Usage: Run on the current primary node

# The oldest "postgres" process is the postmaster
PRIMARY_PID=$(pgrep -o -x postgres)

if [ -z "$PRIMARY_PID" ]; then
    echo "PostgreSQL not running"
    exit 1
fi

echo "Killing PostgreSQL postmaster PID: $PRIMARY_PID"
kill -9 "$PRIMARY_PID"
📋 check_failover_status.sh
#!/bin/bash
# Script: Patroni Cluster Status Check
# Author: Chetan Yadav

patronictl -c /etc/patroni.yml list
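During a test it helps to keep this view refreshing in a second terminal so you can watch the Role and State columns change in real time, for example:

watch -n1 'patronictl -c /etc/patroni.yml list'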

3. Failover Output & Analysis Explained

Check Component     | Healthy Failover        | Red Flags
--------------------|-------------------------|------------------------
Leader Election     | New primary in <10 sec  | No leader after 30 sec
Replica Promotion   | Clean promotion         | Timeline mismatch
Client Recovery     | Reconnect in <5 sec     | Connection storms
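To put numbers on the first row, the sketch below polls the Patroni REST API until some node reports itself as leader and prints how long that took. The node names and the default API port 8008 are assumptions; note also that if only the postgres process is killed, Patroni may restart it in place rather than fail over, in which case this measures recovery time instead.

📋 measure_failover_time.sh (illustrative)
#!/bin/bash
# Measure how long it takes until some node holds the Patroni leader lock.
# NODES and port 8008 are assumptions - adjust for your cluster.
NODES="node1 node2 node3"
START=$(date +%s)

while true; do
    for NODE in $NODES; do
        # GET /leader returns HTTP 200 only on the node holding the leader lock
        CODE=$(curl -s --max-time 2 -o /dev/null -w '%{http_code}' "http://${NODE}:8008/leader")
        if [ "$CODE" = "200" ]; then
            echo "New leader on ${NODE} after $(( $(date +%s) - START )) seconds"
            exit 0
        fi
    done
    sleep 1
done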

4. Critical Components: Patroni Failover Mechanics

DCS TTL (Time To Live)

The TTL controls how long the leader key in the DCS stays valid without being renewed. Set it too high and the cluster is slow to notice a dead primary; set it too low and brief network hiccups can trigger unnecessary demotions.
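In Patroni these timings live in the dynamic configuration as ttl, loop_wait and retry_timeout. A hedged sketch of inspecting and adjusting them with patronictl (the values shown are common defaults, not recommendations, and option spelling can vary slightly between versions):

# Show the current dynamic configuration, including ttl, loop_wait and retry_timeout
patronictl -c /etc/patroni.yml show-config

# Adjust the leader key TTL and loop interval (applies cluster-wide via the DCS)
patronictl -c /etc/patroni.yml edit-config --set ttl=30 --set loop_wait=10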

Replication Lag Threshold

Controls how far behind a replica may be (in bytes) and still take part in leader election. Set it too high and a badly lagging replica can be promoted, losing committed transactions; set it too low and no replica may qualify at all.
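The setting involved is maximum_lag_on_failover, measured in bytes (the default is 1 MB). One way to see how each replica compares against it, assuming streaming replication and PostgreSQL 10 or newer, is to check lag from the primary:

# Per-replica replay lag in bytes, as seen from the primary
sudo -u postgres psql -c "SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
  FROM pg_stat_replication;"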

Fencing

Ensures the old primary cannot keep accepting writes after a new leader is elected. Patroni demotes a node as soon as it loses the leader lock, and an optional watchdog can hard-reset the host if the demotion hangs.
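A simple post-test check, run on the old primary once it rejoins, confirms fencing worked and the node came back as a read-only replica:

# Should print 't': the former primary is now in recovery (read-only replica)
sudo -u postgres psql -tAc "SELECT pg_is_in_recovery();"

# Patroni's view: the former primary should now show a Replica role
patronictl -c /etc/patroni.yml list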

5. Troubleshooting Common Failover Issues

Issue: No Replica Promoted

Symptom: Cluster stuck without leader.

Root Cause: All replicas exceeded lag threshold.

Resolution:

  1. Check replication lag on each replica
  2. Raise maximum_lag_on_failover (accepting more potential data loss) or bring the lagging replicas back within the threshold (a sketch of both steps follows)
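A sketch of both steps (paths and the example value are illustrative):

# Step 1: how far behind is this replica? Run on each replica node.
sudo -u postgres psql -tAc "SELECT now() - pg_last_xact_replay_timestamp();"

# Step 2: allow more lag during leader election (2 MB here, purely as an example)
patronictl -c /etc/patroni.yml edit-config --set maximum_lag_on_failover=2097152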
[Diagram: Patroni failover workflow covering primary failure detection, distributed consensus via the DCS, replica promotion decision, and the application reconnection sequence]

6. How to Automate Failover Testing

Method 1: Cron-Based Chaos Test

📋 Crontab entry for scheduled chaos testing
# Every Sunday at 03:00, kill the primary and let Patroni fail over
0 3 * * 0 /scripts/failover_primary_kill.sh

Method 2: Cloud Monitoring Integration

Track failover duration using Prometheus or CloudWatch metrics.
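As a sketch, if a test run records the failover duration in seconds, shipping it to CloudWatch takes one AWS CLI call (the namespace and metric name below are examples):

# $DURATION holds the measured failover time in seconds
aws cloudwatch put-metric-data \
    --namespace "PostgreSQL/Patroni" \
    --metric-name "FailoverDurationSeconds" \
    --value "$DURATION"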

Method 3: CI/CD Validation

Run failover tests before production releases.
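A minimal gate for a pipeline step might look like this, assuming a hypothetical helper (the measurement sketch from section 3, adapted to print just the number of seconds) and the 10-second target used earlier:

# Fail the release pipeline if staging failover exceeds the target
MAX_SECONDS=10
DURATION=$(/scripts/measure_failover_time.sh)   # hypothetical helper printing seconds
if [ "$DURATION" -gt "$MAX_SECONDS" ]; then
    echo "Failover took ${DURATION}s, limit is ${MAX_SECONDS}s" >&2
    exit 1
fi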

7. Interview Questions: Patroni Failover Scenarios

Q: How does Patroni decide which replica to promote?

A: Based on replication lag, timeline, and DCS consensus.

Q: What causes split brain?

A: Broken fencing or DCS network partitions.

Q: How fast should failover be?

A: Ideally under 10 seconds.

Q: How do you test Patroni safely?

A: Use scheduled chaos tests in non-peak hours.

Q: Can failover be manual?

A: Yes, using patronictl switchover.
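For reference, a planned switchover looks roughly like this; the node names are examples and the flag is spelled --master on older Patroni releases:

# Move leadership from the current leader (node1) to node2 without prompting
patronictl -c /etc/patroni.yml switchover --leader node1 --candidate node2 --force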

8. Final Summary

Failover that is not tested is a liability. Patroni provides the tooling — validation is your responsibility.

These scripts give you confidence that PostgreSQL HA will behave correctly when it matters most.

Key Takeaways:
  • Test failover regularly
  • Measure failover time
  • Validate fencing
  • Automate chaos testing

9. FAQ

Does killing postgres risk data loss?

A: Not if the promoted replica is fully caught up. With asynchronous replication, any transactions committed on the primary but not yet streamed to the replica are lost when it is promoted.

How often should failover be tested?

A: Monthly in staging, quarterly in production.

Is this safe for production?

A: Only during approved maintenance windows.

Does Patroni log failovers?

A: Yes, in Patroni and PostgreSQL logs.

Better than repmgr?

A: Patroni automates failover end to end and relies on a distributed consensus store (etcd, Consul or ZooKeeper) for leader election, which makes split brain less likely than with repmgr's daemon-driven approach.

10. About the Author

Chetan Yadav is a Senior Oracle, PostgreSQL, MySQL and Cloud DBA with 14+ years of experience supporting high-traffic production environments across AWS, Azure and on-premise systems. His expertise includes Oracle RAC, ASM, Data Guard, performance tuning, HA/DR design, monitoring frameworks and real-world troubleshooting.

He trains DBAs globally through deep-dive technical content, hands-on sessions and automation workflows. His mission is to help DBAs solve real production problems and advance into high-paying remote roles worldwide.

Connect & Learn More:
📊 LinkedIn Profile
🎥 YouTube Channel

