Thursday, December 25, 2025


Patroni Test Lab Setup Guide – PostgreSQL HA for Production DBAs

⏱️ Estimated Reading Time: 11-12 minutes


At 1:20 AM, your primary PostgreSQL node goes down. Applications freeze, connection pools exhaust, and failover doesn’t happen. The problem is not PostgreSQL — it’s the lack of a tested HA setup.

In production, PostgreSQL without a proven failover mechanism becomes a single point of failure. Downtime leads to transaction loss, SLA breaches, and emergency firefighting during peak hours.

This guide walks you through building a Patroni-based PostgreSQL HA test lab that behaves like production — allowing you to test leader election, failover, and recovery safely before going live.

(Image: PostgreSQL high availability test lab architecture showing Patroni cluster nodes, distributed configuration store, leader election process, and client connections simulating production failover scenarios)

Table of Contents

  1. Why You Must Monitor Patroni Clusters Daily
  2. Production-Ready Patroni Test Lab Setup
  3. Script Output & Analysis Explained
  4. Critical Components: Patroni Architecture Concepts
  5. Troubleshooting Common Patroni Issues
  6. How to Automate This Monitoring
  7. Interview Questions: Patroni Troubleshooting
  8. Final Summary
  9. FAQ
  10. About the Author

1. Why You Must Monitor Patroni Clusters Daily

  • Leader Election Failure: No primary available, writes blocked.
  • Replication Lag: Standby lag exceeds 5–10 seconds under load.
  • Split Brain Risk: Two primaries due to DCS inconsistency.
  • Application Impact: P99 latency spikes from 40ms to 5+ seconds.

2. Production-Ready Patroni Test Lab Setup

Lab Requirements:
  • 3 Linux VMs (Primary + 2 Replicas)
  • PostgreSQL 14 or higher
  • etcd or Consul as DCS
  • Passwordless SSH between nodes
📋 patroni.yml
scope: pg-ha-lab
name: node1

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.0.1:8008

etcd:
  host: 10.0.0.10:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
  initdb:
    - encoding: UTF8
    - data-checksums

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.0.1:5432
  data_dir: /var/lib/postgresql/data
  bin_dir: /usr/pgsql-14/bin
  authentication:
    replication:
      username: replicator
      password: repl_pass
    superuser:
      username: postgres
      password: pg_pass
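
With patroni.yml in place on each node, a minimal way to bring the lab up and confirm leader election is sketched below. It assumes Patroni runs as a systemd service named patroni and that the configuration lives at /etc/patroni.yml; adjust paths and service names to your environment.

# Start Patroni on each node (service name "patroni" is an assumption)
sudo systemctl enable --now patroni

# Verify cluster membership, roles, and replication lag
patronictl -c /etc/patroni.yml list

# Illustrative output layout (values will differ in your lab):
# | Member | Host     | Role    | State   | TL | Lag in MB |
# | node1  | 10.0.0.1 | Leader  | running |  1 |           |
# | node2  | 10.0.0.2 | Replica | running |  1 |         0 |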

3. Script Output & Analysis Explained

Check Component | Healthy State  | Red Flags
Leader Status   | Single primary | No leader / multiple leaders
Replication Lag | < 1 second     | > 10 seconds
Failover Time   | < 10 seconds   | > 30 seconds
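
The same signals can be pulled from Patroni's REST API, which is what load balancers and monitoring systems usually query. A minimal sketch, assuming the node IPs and port 8008 from the sample patroni.yml above; the /cluster endpoint is available on recent Patroni 2.x releases.

# Role checks: HTTP 200 means this node currently holds that role
curl -s -o /dev/null -w "%{http_code}\n" http://10.0.0.1:8008/leader    # is node1 the primary?
curl -s -o /dev/null -w "%{http_code}\n" http://10.0.0.2:8008/replica   # is node2 a healthy replica?

# Cluster-wide view including member state and lag
curl -s http://10.0.0.1:8008/cluster | jq .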

4. Critical Components: Patroni Architecture Concepts

Distributed Configuration Store (DCS)

DCS (etcd/Consul) stores cluster state. If DCS is unhealthy, leader election fails.

Leader Election

Patroni ensures only one writable primary. Broken fencing leads to split-brain scenarios.

Replication Slots

Prevent WAL loss but can cause disk bloat if lag grows.

5. Troubleshooting Common Patroni Issues

Issue: No Primary After Restart

Symptom: All nodes in replica mode.

Root Cause: DCS unreachable.

Resolution:

  1. Check etcd health: etcdctl endpoint health
  2. Restart Patroni service
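
A hedged walkthrough of that resolution, assuming the etcd endpoint from the sample configuration and a systemd unit named patroni; substitute your own endpoints and service names:

# 1. Confirm the etcd endpoint is healthy (etcd v3 API)
ETCDCTL_API=3 etcdctl --endpoints=10.0.0.10:2379 endpoint health

# 2. Once the DCS responds, restart Patroni on each node
sudo systemctl restart patroni

# 3. Confirm a leader was elected again
patronictl -c /etc/patroni.yml list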
(Image: Technical workflow diagram showing Patroni leader election, health checks, failover decision process, promotion of replica to primary, and client reconnection during PostgreSQL high availability events)

6. How to Automate This Monitoring

Method 1: Cron-Based Health Check

📋 patroni_health.sh
#!/bin/bash
curl -s http://localhost:8008/health | jq .
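
To schedule it, a crontab entry along these lines can be used; the script path and log location are placeholders for your own layout:

*/5 * * * * /home/postgres/scripts/patroni_health.sh >> /var/log/patroni_health.log 2>&1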

Method 2: Cloud Monitoring

Export Patroni metrics to Prometheus or CloudWatch.

Method 3: Third-Party Tools

Use Grafana dashboards for replication and failover visibility.

7. Interview Questions: Patroni Troubleshooting

Q: How does Patroni prevent split brain?

A: By using a distributed configuration store and strict leader locks.

Q: What happens if DCS is down?

A: Patroni can no longer update the leader lock, so it stops automatic failovers and demotes the primary to read-only until the DCS recovers, preventing split brain and data corruption.

Q: How do you test failover?

A: Stop PostgreSQL on primary and observe leader promotion.
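
For repeatable drills it is worth scripting the test. A minimal sketch, assuming the lab layout above; note that patronictl also offers controlled failover/switchover commands that do not require killing any service.

# Simulate a crash: stop Patroni on the current leader so the leader key expires
sudo systemctl stop patroni          # run on the leader node

# Watch the cluster until a replica is promoted
watch -n 2 "patronictl -c /etc/patroni.yml list"

# Alternatively, trigger a controlled promotion from any node (interactive prompts follow)
patronictl -c /etc/patroni.yml failover pg-ha-lab
patronictl -c /etc/patroni.yml switchover pg-ha-lab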

Q: Can Patroni work with RDS?

A: No. Patroni requires OS-level PostgreSQL access.

Q: How do you monitor Patroni?

A: REST API, Prometheus exporters, and logs.

8. Final Summary

A Patroni test lab is mandatory before production rollout. It exposes real-world failure modes safely.

With proper monitoring and automation, Patroni delivers predictable PostgreSQL high availability.

Key Takeaways:
  • Always test failover
  • Monitor DCS health
  • Track replication lag
  • Automate health checks

9. FAQ

Does Patroni impact performance?

A: Minimal overhead, mostly control-plane traffic.

Is Patroni production-ready?

A: Yes, widely used at scale.

Can it run on Kubernetes?

A: Yes, with StatefulSets.

Common mistakes?

A: Weak fencing and no DCS monitoring.

Is it better than repmgr?

A: Patroni is more automation-focused.

10. About the Author

Chetan Yadav is a Senior Oracle, PostgreSQL, MySQL and Cloud DBA with 14+ years of experience supporting high-traffic production environments across AWS, Azure and on-premise systems. His expertise includes Oracle RAC, ASM, Data Guard, performance tuning, HA/DR design, monitoring frameworks and real-world troubleshooting.

He trains DBAs globally through deep-dive technical content, hands-on sessions and automation workflows. His mission is to help DBAs solve real production problems and advance into high-paying remote roles worldwide.

Connect & Learn More:
📊 LinkedIn Profile
🎥 YouTube Channel



Thursday, December 18, 2025


n8n Workflow: Auto Email Summary for Production Teams

⏱️ Estimated Reading Time: 13 minutes


In production environments, inboxes become operational bottlenecks. Critical alerts, customer emails, job opportunities, and vendor notifications get buried under long email threads.

The business impact is real — delayed responses, missed actions, and engineers spending hours reading emails instead of fixing systems. For on-call DBAs and SREs, this directly increases MTTR.

This guide shows how to build a production-ready n8n workflow that automatically summarizes incoming emails using AI, so teams get concise, actionable information in seconds.

(Image: n8n workflow dashboard displaying automated email ingestion, AI-based summarization, conditional routing, and delivery of concise email summaries for production engineering teams)

Table of Contents

  1. Why You Must Monitor Auto Email Summaries Daily
  2. Production-Ready Auto Email Summary Workflow
  3. Script Output & Analysis Explained
  4. Critical Components: Email Automation Concepts
  5. Troubleshooting Common Issues
  6. How to Automate This Monitoring
  7. Interview Questions: Email Automation Troubleshooting
  8. Final Summary
  9. FAQ
  10. About the Author

1. Why You Must Monitor Auto Email Summaries Daily

  • Missed Critical Alerts: Incident emails unread for 30+ minutes.
  • Operational Delay: Human parsing adds 5–10 minutes per email.
  • Cascading Failures: Delayed action increases blast radius.
  • Productivity Loss: Engineers spend hours triaging inbox noise.

2. Production-Ready Auto Email Summary Workflow

Execution Requirements:
  • n8n self-hosted or cloud
  • Email trigger (IMAP or Gmail)
  • OpenAI / LLM credentials as environment variables
📋 email_summary_prompt.txt
Summarize the following email.

Rules:
- Use bullet points
- Highlight action items
- Mention deadlines clearly
- Max 120 words
- No assumptions

Email Subject: {{subject}}
Email Sender: {{from}}
Email Content: {{body}}
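
Inside n8n this prompt is normally fed to an OpenAI (or other LLM) node, but the underlying call is a plain HTTPS request. A minimal sketch of the equivalent call, assuming an OPENAI_API_KEY environment variable; the model name, file name, and sample values are illustrative only, not requirements of the workflow.

#!/bin/bash
# Summarize one email body with the prompt above (sketch, not the n8n node itself)
SUBJECT="Production alert: replication lag"
FROM="alerts@example.com"
BODY=$(cat email_body.txt)

PAYLOAD=$(jq -n --arg s "$SUBJECT" --arg f "$FROM" --arg b "$BODY" '{
  model: "gpt-4o-mini",
  messages: [
    {role: "user",
     content: ("Summarize the following email.\nRules:\n- Use bullet points\n- Highlight action items\n- Mention deadlines clearly\n- Max 120 words\n- No assumptions\n\nEmail Subject: " + $s + "\nEmail Sender: " + $f + "\nEmail Content: " + $b)}
  ]
}')

curl -s https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" | jq -r '.choices[0].message.content'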

3. Script Output & Analysis Explained

Component      | Healthy Output   | Red Flags
Summary Length | < 120 words      | > 300 words
Action Items   | Explicit bullets | Missing actions
Latency        | < 3 seconds      | > 10 seconds

4. Critical Components: Email Automation Concepts

IMAP (Internet Message Access Protocol)

IMAP allows real-time inbox monitoring. Polling delays directly affect response time.

LLM Token Control

Unbounded email bodies increase cost and latency. Always truncate or sanitize input.
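
A minimal shell sketch of that idea, usable as a pre-processing step (in n8n the same logic usually lives in a Code node); the 4000-character cap and the masking patterns are arbitrary examples, not defaults:

# Mask obvious identifiers and cap the email body before it reaches the LLM
sed -E 's/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+/<email>/g; s/[0-9]{6,}/<number>/g' email_body.txt \
  | head -c 4000 > email_body_safe.txt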

Idempotency

Prevents duplicate summaries during retries or failures.

5. Troubleshooting Common Issues

Issue: Duplicate Summaries

Symptom: Same email summarized multiple times.

Root Cause: Missing message-ID tracking.

Resolution:

  1. Store processed message IDs
  2. Skip if ID already exists
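
A minimal sketch of that check, assuming the workflow can call a shell step and keep a small state file; in n8n the same logic is often a Code node backed by a database or workflow static data, but the idea is identical. The file path and variable names are illustrative.

#!/bin/bash
# Skip emails whose Message-ID has already been summarized
PROCESSED_FILE=/var/lib/n8n/processed_message_ids.txt
MSG_ID="$1"                      # Message-ID header passed in by the workflow

touch "$PROCESSED_FILE"
if grep -qxF "$MSG_ID" "$PROCESSED_FILE"; then
  echo "duplicate"               # workflow branch: stop here
  exit 0
fi

echo "$MSG_ID" >> "$PROCESSED_FILE"
echo "new"                       # workflow branch: continue to summarization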
(Image: Technical workflow diagram showing email ingestion, filtering, AI summarization, conditional routing, and delivery to messaging platforms for automated email processing)

6. How to Automate This Monitoring

Method 1: Cron-Based Trigger

📋 cron_schedule.txt
*/2 * * * * Trigger email summary workflow
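
The entry above is conceptual. In practice the cron job usually calls the workflow's webhook (or the n8n CLI) directly. A hedged example, where the webhook URL is a placeholder for your own n8n instance:

# m h dom mon dow   command
*/2 * * * * curl -s -X POST https://n8n.example.com/webhook/email-summary >> /var/log/email_summary_trigger.log 2>&1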

Method 2: Cloud Monitoring

Use CloudWatch or Azure Monitor to track execution failures.

Method 3: Telegram Integration

Send summarized emails to Telegram for instant visibility.

7. Interview Questions: Email Automation Troubleshooting

Q: How do you avoid summarizing sensitive data?

A: By masking patterns, truncating content, and filtering attachments before sending data to the LLM.

Q: What causes high latency in summaries?

A: Large email bodies, token overflow, or slow LLM endpoints.

Q: How do you ensure reliability?

A: Retries, idempotency keys, and failure logging.

Q: Is this suitable for incident alerts?

A: Yes, especially when combined with priority tagging.

Q: Can this replace ticketing systems?

A: No, it complements them by improving signal clarity.

8. Final Summary

Auto email summaries reduce noise and speed up decisions. For production teams, this directly improves response times.

When integrated with monitoring and messaging tools, this workflow becomes a reliability multiplier.

Key Takeaways:
  • Summaries reduce cognitive load
  • Automation improves MTTR
  • Token control is critical
  • Integrate with existing tools

9. FAQ

Does this impact email server performance?

A: No, it only reads messages.

What permissions are required?

A: Read-only mailbox access.

Is this cloud-agnostic?

A: Yes, works across Gmail, Outlook, IMAP.

How does this compare to manual triage?

A: Saves 70–80% reading time.

Common pitfalls?

A: Missing truncation and retry handling.

10. About the Author

Chetan Yadav is a Senior Oracle, PostgreSQL, MySQL and Cloud DBA with 14+ years of experience supporting high-traffic production environments across AWS, Azure and on-premise systems. His expertise includes Oracle RAC, ASM, Data Guard, performance tuning, HA/DR design, monitoring frameworks and real-world troubleshooting.

He trains DBAs globally through deep-dive technical content, hands-on sessions and automation workflows. His mission is to help DBAs solve real production problems and advance into high-paying remote roles worldwide.

Connect & Learn More:
📊 LinkedIn Profile
🎥 YouTube Channel


Monday, December 15, 2025


Real Outage RCA Template – Standardizing Incident Reports for Production DBAs

⏱️ Estimated Reading Time: 14 minutes


At 3:40 AM, production went down. Databases were slow, APIs timed out, and customer transactions failed. By morning, management asked a simple question: “What exactly happened?”

What followed was chaos — multiple Slack threads, partial logs, conflicting timelines, and a postmortem that raised more questions than answers. This is not a tooling problem. This is an RCA standardization problem.

This article provides a real outage RCA template used by production DBAs and SREs to create clear, actionable, and audit-ready incident reports that engineering and business leaders can trust.

(Image: Production monitoring dashboard showing real-time service health metrics, incident status indicators, performance trends, and operational KPIs used during outage analysis and RCA preparation)

Table of Contents

  1. Why You Must Standardize RCA Reports
  2. Production-Ready RCA Template
  3. RCA Output & Analysis Explained
  4. Critical Components: RCA Concepts
  5. Troubleshooting Common RCA Failures
  6. How to Automate RCA Creation
  7. Interview Questions: RCA & Incident Analysis
  8. Final Summary
  9. FAQ
  10. About the Author

1. Why You Must Standardize RCA Reports

  • Incomplete Timelines: Missing 10–15 minute gaps during peak impact window.
  • Blame-Driven Culture: Teams focus on people instead of systems.
  • Recurring Incidents: Same outage repeats every 30–60 days.
  • Business Risk: P99 latency jumps from 80ms to 6–8 seconds without explanation.

2. Production-Ready RCA Template

Prerequisites:
  • Incident ID or ticket reference
  • Central log access (Splunk, CloudWatch, ELK)
  • Database metrics (AWR, Performance Insights)
📋 outage_rca_template.md
# Incident Title
Short descriptive summary of the outage

## Incident Metadata
- Incident ID:
- Date & Time (UTC):
- Duration:
- Severity:
- Affected Systems:

## Impact Summary
- Customer impact:
- Business impact:
- SLA breach (Yes/No):

## Timeline (UTC)
| Time  | Event                      |
|-------|----------------------------|
| 03:40 | Alert triggered            |
| 03:45 | DBA investigation started  |

## Root Cause
Clear technical explanation of the failure.

## Contributing Factors
- Missing alert
- Capacity limit
- Configuration drift

## Resolution & Recovery
Steps taken to restore service.

## Preventive Actions
- Short-term fixes
- Long-term fixes

## Lessons Learned
What will be done differently next time.

3. RCA Output & Analysis Explained

Component  | Healthy RCA            | Red Flags
Timeline   | Minute-level accuracy  | Vague time ranges
Root Cause | Single technical cause | Multiple vague reasons
Actions    | Measurable fixes       | Generic statements

4. Critical Components: RCA Concepts

Single Root Cause (SRC)

SRC ensures accountability at the system level. Multiple causes usually indicate incomplete analysis.

Blast Radius

Defines which services, regions, and customers were affected and helps prioritize future mitigations.

MTTR (Mean Time to Recovery)

Lower MTTR directly correlates with better monitoring and runbooks.

5. Troubleshooting Common RCA Failures

Issue: RCA Lacks Technical Depth

Symptom: Management rejects RCA.

Root Cause: Metrics and logs missing.

Resolution:

  1. Attach AWR / Performance Insights screenshots
  2. Include query wait events and CPU graphs
(Image: Root cause analysis flow chart illustrating structured incident investigation steps including event detection, data collection, root cause identification, corrective actions, validation, and incident closure process)

6. How to Automate RCA Creation

Method 1: Cron-Based Log Collection

📋 collect_incident_logs.sh
#!/bin/bash
TIMESTAMP=$(date +%F_%H%M)

aws logs filter-log-events \
  --log-group-name prod-db \
  --start-time $(date -d '1 hour ago' +%s000) \
  > rca_logs_$TIMESTAMP.json

Method 2: CloudWatch Integration

Use CloudWatch alarms to auto-create incident timelines.
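For example, alarm state changes can be exported and pasted straight into the RCA timeline. A sketch using the AWS CLI; the alarm name and time window are placeholders:

# Pull alarm state transitions for the incident window (UTC)
aws cloudwatch describe-alarm-history \
  --alarm-name "prod-db-high-cpu" \
  --history-item-type StateUpdate \
  --start-date 2025-12-15T03:00:00Z \
  --end-date 2025-12-15T06:00:00Z \
  --query 'AlarmHistoryItems[].[Timestamp,HistorySummary]' \
  --output table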

Method 3: Wiki-Based RCA Templates

Confluence or Git-based markdown templates enforce consistency.

7. Interview Questions: RCA & Incident Analysis

Q: What makes an RCA effective?

A: A clear timeline, single root cause, measurable impact, and actionable preventive steps backed by metrics and logs.

Q: How do you avoid blame in RCAs?

A: Focus on system failures, not individuals, and document process gaps instead of mistakes.

Q: How detailed should an RCA be?

A: Detailed enough that another engineer can prevent the same outage without additional context.

Q: How do you measure RCA quality?

A: Reduced recurrence rate and faster MTTR over the next 2–3 incidents.

Q: Should DBAs own RCAs?

A: DBAs should co-own RCAs for database-related incidents with SRE and application teams.

8. Final Summary

A well-written RCA is not documentation — it is a reliability tool. Standardization eliminates confusion, speeds recovery, and prevents repeat incidents.

When RCAs are consistent, technical, and measurable, organizations move from reactive firefighting to proactive reliability.

Key Takeaways:
  • Standardize RCA structure
  • Use metrics, not opinions
  • Track recurrence and MTTR
  • Automate data collection

9. FAQ

Does writing RCAs impact performance?

A: No. RCAs use historical data and logs only.

Who should write the RCA?

A: The on-call engineer with inputs from DBAs and SREs.

Are RCAs required for minor incidents?

A: Yes, lightweight RCAs help prevent escalation.

Can RCAs be automated?

A: Data collection can be automated, analysis remains human.

How long should an RCA take?

A: Ideally completed within 48 hours of incident resolution.

10. About the Author

Chetan Yadav is a Senior Oracle, PostgreSQL, MySQL and Cloud DBA with 14+ years of experience supporting high-traffic production environments across AWS, Azure and on-premise systems. His expertise includes Oracle RAC, ASM, Data Guard, performance tuning, HA/DR design, monitoring frameworks and real-world troubleshooting.

He trains DBAs globally through deep-dive technical content, hands-on sessions and automation workflows. His mission is to help DBAs solve real production problems and advance into high-paying remote roles worldwide.

Connect & Learn More:
📊 LinkedIn Profile
🎥 YouTube Channel


Thursday, December 11, 2025


Aurora MySQL Lock Detection Script - Complete Guide 2026
⏱️ Estimated Reading Time: 6–7 minutes


In a production Aurora MySQL environment, undetected locks can silently degrade application performance, cause connection pool exhaustion, and lead to cascading timeouts across microservices. A single long-running transaction holding row locks can block hundreds of queries, turning a minor issue into a critical incident.

This article provides a comprehensive Shell Script for Aurora MySQL Lock Detection and Analysis. It covers blocking sessions, InnoDB lock waits, metadata locks, and transaction isolation issues—perfect for daily monitoring, incident response, or pre-deployment validation.

Table of Contents

  1. Why You Must Monitor Locks in Aurora MySQL Daily
  2. Production-Ready Lock Detection Script (Shell + SQL)
  3. Script Output & Analysis Explained
  4. Critical Components: InnoDB Locks, Metadata Locks & Deadlocks
  5. Troubleshooting Common Lock Issues
  6. How to Automate This Monitoring (CloudWatch + Cron)
  7. Interview Questions: MySQL Lock Troubleshooting
  8. Final Summary
  9. FAQ
  10. About the Author

1. Why You Must Monitor Locks in Aurora MySQL Daily

Aurora MySQL's high-performance architecture doesn't eliminate locking issues—it amplifies their impact across reader endpoints and application tiers. Neglecting lock monitoring leads to:

  • Connection Pool Exhaustion: Blocked sessions accumulate, consuming max_connections limits
  • Cascading Query Timeouts: Applications experience widespread 30-second timeouts
  • Metadata Lock Deadlocks: DDL operations (ALTER TABLE) stuck behind long transactions
  • Replication Lag on Readers: Lock waits on the writer propagate to read replicas
  • Split-Second SLA Breaches: P99 latency spikes from 50ms to 5+ seconds

Running a unified lock detection script ensures you catch blocking chains, identify victim queries, and resolve issues before they trigger PagerDuty alerts.

2. Production-Ready Lock Detection Script

This shell script combines Performance Schema queries, InnoDB lock analysis, and metadata lock detection to provide a complete locking overview.

Note: Execute this script with a MySQL user having PROCESS and SELECT privileges on performance_schema and information_schema.
📋 aurora_lock_detection.sh
#!/bin/bash
# ====================================================
# Aurora MySQL Lock Detection & Analysis Script
# Author: Chetan Yadav
# Usage: ./aurora_lock_detection.sh
# ====================================================

# MySQL Connection Parameters
MYSQL_HOST="your-aurora-cluster.cluster-xxxxx.us-east-1.rds.amazonaws.com"
MYSQL_PORT="3306"
MYSQL_USER="monitor_user"
MYSQL_PASS="your_secure_password"
MYSQL_DB="information_schema"

# Output file for detailed logging
OUTPUT_LOG="/tmp/aurora_lock_detection_$(date +%Y%m%d_%H%M%S).log"

echo "==================================================" | tee -a $OUTPUT_LOG
echo " AURORA MYSQL LOCK DETECTION - $(date) " | tee -a $OUTPUT_LOG
echo "==================================================" | tee -a $OUTPUT_LOG

# 1. Check for Blocking Sessions (InnoDB Lock Waits)
echo -e "\n[1] Detecting InnoDB Lock Waits..." | tee -a $OUTPUT_LOG
mysql -h $MYSQL_HOST -P $MYSQL_PORT -u $MYSQL_USER -p$MYSQL_PASS \
  -D $MYSQL_DB -sN <<EOF | tee -a $OUTPUT_LOG
SELECT
  r.trx_id AS waiting_trx_id,
  r.trx_mysql_thread_id AS waiting_thread,
  r.trx_query AS waiting_query,
  b.trx_id AS blocking_trx_id,
  b.trx_mysql_thread_id AS blocking_thread,
  b.trx_query AS blocking_query,
  TIMESTAMPDIFF(SECOND, r.trx_wait_started, NOW()) AS wait_seconds
FROM information_schema.innodb_lock_waits w
INNER JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
INNER JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id
ORDER BY wait_seconds DESC;
EOF

# 2. Check for Long-Running Transactions
echo -e "\n[2] Long-Running Transactions (>30 sec)..." | tee -a $OUTPUT_LOG
mysql -h $MYSQL_HOST -P $MYSQL_PORT -u $MYSQL_USER -p$MYSQL_PASS \
  -D $MYSQL_DB -sN <<EOF | tee -a $OUTPUT_LOG
SELECT
  trx_id,
  trx_mysql_thread_id AS thread_id,
  trx_state,
  TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS runtime_sec,
  trx_rows_locked,
  trx_rows_modified,
  SUBSTRING(trx_query, 1, 80) AS query_snippet
FROM information_schema.innodb_trx
WHERE TIMESTAMPDIFF(SECOND, trx_started, NOW()) > 30
ORDER BY runtime_sec DESC;
EOF

# 3. Check for Metadata Locks
echo -e "\n[3] Detecting Metadata Locks..." | tee -a $OUTPUT_LOG
mysql -h $MYSQL_HOST -P $MYSQL_PORT -u $MYSQL_USER -p$MYSQL_PASS \
  -D performance_schema -sN <<EOF | tee -a $OUTPUT_LOG
SELECT
  object_schema,
  object_name,
  lock_type,
  lock_duration,
  lock_status,
  owner_thread_id
FROM metadata_locks
WHERE lock_status = 'PENDING'
  AND object_schema NOT IN ('performance_schema', 'mysql');
EOF

# 4. Check Active Processlist
echo -e "\n[4] Active Processlist..." | tee -a $OUTPUT_LOG
mysql -h $MYSQL_HOST -P $MYSQL_PORT -u $MYSQL_USER -p$MYSQL_PASS \
  -e "SHOW FULL PROCESSLIST;" | grep -v "Sleep" | tee -a $OUTPUT_LOG

# 5. Check Last Deadlock
echo -e "\n[5] Last Detected Deadlock..." | tee -a $OUTPUT_LOG
mysql -h $MYSQL_HOST -P $MYSQL_PORT -u $MYSQL_USER -p$MYSQL_PASS \
  -e "SHOW ENGINE INNODB STATUS\G" | \
  grep -A 50 "LATEST DETECTED DEADLOCK" | tee -a $OUTPUT_LOG

echo -e "\n==================================================" | tee -a $OUTPUT_LOG
echo " LOCK DETECTION COMPLETE. Review: $OUTPUT_LOG " | tee -a $OUTPUT_LOG
echo "==================================================" | tee -a $OUTPUT_LOG

This script consolidates five critical lock detection queries into a single diagnostic report, providing immediate visibility into blocking sessions and lock contention hotspots.

3. Script Output & Analysis Explained

Check Component   | What "Healthy" Looks Like                     | Red Flags
InnoDB Lock Waits | Empty result set (no blocking chains)         | Any rows indicate active blocking; wait_time > 5 seconds is critical
Long Transactions | Transactions < 5 seconds                      | Transactions > 60 seconds with high trx_rows_locked indicate forgotten transactions
Metadata Locks    | No PENDING locks                              | PENDING metadata locks block DDL; check for unclosed transactions on that table
Processlist       | Queries in "Sending data" or "Sorting result" | Multiple queries stuck in "Waiting for table metadata lock"
Deadlock Log      | No recent deadlocks                           | Review victim query patterns; consider application-level retry logic

4. Critical Components: InnoDB Locks, Metadata Locks & Deadlocks

Understanding MySQL's locking mechanisms is vital for Aurora DBAs:

InnoDB Row Locks

Acquired automatically during DML operations (UPDATE, DELETE). Uses MVCC (Multi-Version Concurrency Control) to allow non-blocking reads while writes are in progress. Lock waits occur when two transactions try to modify the same row.

Metadata Locks (MDL)

Protect table structure during DDL operations (ALTER TABLE, DROP TABLE). A long-running SELECT can hold a metadata lock that blocks an ALTER TABLE, even though no row locks exist.

Deadlocks

Occur when two transactions acquire locks in opposite orders. InnoDB automatically detects deadlocks and rolls back the smaller transaction (the "victim"). Frequent deadlocks indicate poor transaction design or missing indexes.

Gap Locks

Used in REPEATABLE READ isolation level to prevent phantom reads. Can cause unexpected blocking when queries scan ranges without proper indexes.

5. Troubleshooting Common Lock Issues

If the script reports blocking or long lock waits, follow this workflow:

Issue: InnoDB Lock Wait Timeout (Error 1205)

Symptom: Application logs show Lock wait timeout exceeded; try restarting transaction

Root Cause: A blocking transaction is holding locks longer than innodb_lock_wait_timeout (default 50 seconds)

Resolution:

  1. Identify blocking thread from script output: blocking_thread column
  2. Kill the blocking session: KILL <blocking_thread>;
  3. Review the blocking query for missing indexes or inefficient WHERE clauses
  4. Check application code for forgotten transactions (missing COMMIT/ROLLBACK)
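
On MySQL 5.7+ the sys schema wraps the same information in one view and even generates the kill statement for you. A minimal sketch using the same connection variables as the main script:

# Show blocking chains with ready-made kill statements (sys schema, installed by default on 5.7+)
mysql -h $MYSQL_HOST -P $MYSQL_PORT -u $MYSQL_USER -p$MYSQL_PASS -e "
SELECT wait_age_secs,
       locked_table,
       blocking_pid,
       sql_kill_blocking_query
FROM   sys.innodb_lock_waits
ORDER  BY wait_age_secs DESC;"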

Issue: Metadata Lock Contention

Symptom: ALTER TABLE hangs indefinitely; other queries stuck in "Waiting for table metadata lock"

Root Cause: Long-running transaction or unclosed connection holding metadata lock

Resolution:

  1. Find the owner: Check owner_thread_id from metadata_locks query
  2. Review processlist for that thread: SELECT * FROM performance_schema.threads WHERE thread_id = <owner_thread_id>;
  3. Kill the offending session: KILL <processlist_id>;
  4. For production DDL, use pt-online-schema-change to avoid metadata locks
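
A hedged example of the pt-online-schema-change approach mentioned in step 4; the schema, table, and column names are placeholders, and a dry run should always precede --execute:

# Dry run first (no changes), then perform the online ALTER
pt-online-schema-change --alter "ADD COLUMN last_login_at DATETIME NULL" \
  h=$MYSQL_HOST,D=appdb,t=users,u=$MYSQL_USER,p=$MYSQL_PASS --dry-run

pt-online-schema-change --alter "ADD COLUMN last_login_at DATETIME NULL" \
  h=$MYSQL_HOST,D=appdb,t=users,u=$MYSQL_USER,p=$MYSQL_PASS --execute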

Issue: Frequent Deadlocks

Symptom: Application retries constantly; SHOW ENGINE INNODB STATUS shows many deadlocks

Root Cause: Transactions accessing tables in different orders or using REPEATABLE READ with range scans

Resolution:

  1. Review deadlock victim queries in InnoDB status output
  2. Ensure all transactions access tables in the same order
  3. Add indexes to reduce gap locks
  4. Consider changing isolation level to READ COMMITTED (if phantom reads are acceptable)

6. How to Automate This Monitoring (CloudWatch + Cron)

Method 1: Cron-Based Monitoring

Schedule this script to run every 5 minutes during peak hours:

📋 crontab_entry.sh
*/5 * * * * /home/ec2-user/scripts/aurora_lock_detection.sh >> /var/log/aurora_locks.log 2>&1

Method 2: CloudWatch Custom Metrics

Enhance the script to publish lock counts to CloudWatch:

📋 cloudwatch_metrics.sh
# Add after each check
LOCK_COUNT=$(mysql -h $MYSQL_HOST ... | wc -l)

aws cloudwatch put-metric-data \
  --namespace "Aurora/Locks" \
  --metric-name "InnoDB_Lock_Waits" \
  --value $LOCK_COUNT \
  --region us-east-1

Create CloudWatch alarms:

  • Trigger if InnoDB_Lock_Waits > 5 for 2 consecutive periods
  • SNS notification to on-call engineer
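
A sketch of that alarm with the AWS CLI, assuming the Aurora/Locks namespace from the snippet above; the alarm name and SNS topic ARN are placeholders:

aws cloudwatch put-metric-alarm \
  --alarm-name "aurora-innodb-lock-waits" \
  --namespace "Aurora/Locks" \
  --metric-name "InnoDB_Lock_Waits" \
  --statistic Maximum \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 5 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:dba-oncall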

Method 3: Performance Insights Integration

Aurora's Performance Insights automatically tracks lock waits. Use this script as a supplementary deep-dive tool when Performance Insights shows spikes in wait/io/table/sql/handler or wait/lock/table/sql/handler.

7. Interview Questions: MySQL Lock Troubleshooting

Prepare for these questions in Aurora/MySQL DBA interviews:

Q: What's the difference between InnoDB row locks and table locks?

A: InnoDB uses row-level locking for DML operations, allowing high concurrency. Table locks (LOCK TABLES) lock the entire table and block all other operations. MyISAM uses table locks by default; InnoDB uses row locks with MVCC.

Q: How does MySQL's REPEATABLE READ isolation level cause deadlocks?

A: REPEATABLE READ uses gap locks to prevent phantom reads. If two transactions scan overlapping ranges without proper indexes, they can acquire gap locks in opposite orders, causing deadlocks. READ COMMITTED avoids gap locks but allows phantom reads.

Q: How do you identify the blocking query in a lock wait scenario?

A: Query information_schema.innodb_lock_waits joined with innodb_trx to map blocking_trx_id to the actual query. Use SHOW ENGINE INNODB STATUS for detailed lock information including locked record details.

Q: What causes metadata lock timeouts in production?

A: Long-running queries or unclosed transactions holding shared metadata locks. Even a simple SELECT with an open transaction prevents DDL operations. Use lock_wait_timeout and ensure applications properly close connections.

Q: How do you prevent deadlocks at the application level?

A: (1) Access tables in consistent order across all transactions, (2) Keep transactions short, (3) Use appropriate indexes to reduce gap locks, (4) Consider READ COMMITTED isolation if acceptable, (5) Implement exponential backoff retry logic.

8. Final Summary

A healthy Aurora MySQL cluster requires proactive lock monitoring, not just reactive troubleshooting. The script provided above delivers instant visibility into blocking sessions, long transactions, and metadata lock contention.

Use this script as part of your Daily Health Check routine and integrate it with CloudWatch alarms for real-time alerting. Combine it with Performance Insights for comprehensive lock analysis during incidents.

Key Takeaways:
  • InnoDB lock waits > 5 seconds require immediate investigation
  • Metadata locks can block DDL even without row lock contention
  • Deadlocks indicate transaction design issues or missing indexes
  • Automate monitoring with CloudWatch custom metrics

9. FAQ

Q1: Can this script impact production performance?

A: The queries access information_schema and performance_schema, which are lightweight metadata operations. Running every 5 minutes has negligible impact. Avoid running every 10 seconds on large clusters.

Q2: What if the blocking query shows NULL?

A: The blocking transaction has finished its last statement but has not committed, so trx_query is NULL while it still holds locks (its trx_state is usually "RUNNING"). Check trx_started and trx_rows_locked in innodb_trx, and kill the session if it has been idle for more than a few minutes.

Q3: How do I grant minimum privileges for the monitoring user?

A: GRANT PROCESS ON *.* TO 'monitor_user'@'%'; GRANT SELECT ON performance_schema.* TO 'monitor_user'@'%'; (MySQL does not allow explicit grants on information_schema; the PROCESS privilege is what exposes the full innodb_trx and lock views there.)

Q4: Does this work with Aurora MySQL 2.x and 3.x?

A: Mostly. Aurora MySQL 2.x (MySQL 5.7) supports every query as written. In Aurora MySQL 3.x (MySQL 8.0), information_schema.innodb_lock_waits and innodb_locks were removed, so replace the blocking-session check with performance_schema.data_lock_waits or the sys.innodb_lock_waits view; the remaining checks work unchanged.
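
For Aurora MySQL 3.x, a hedged sketch of the replacement blocking-session query, using the MySQL 8.0 performance_schema.data_lock_waits columns:

mysql -h $MYSQL_HOST -P $MYSQL_PORT -u $MYSQL_USER -p$MYSQL_PASS -e "
SELECT r.trx_mysql_thread_id AS waiting_thread,
       r.trx_query           AS waiting_query,
       b.trx_mysql_thread_id AS blocking_thread,
       b.trx_query           AS blocking_query
FROM   performance_schema.data_lock_waits w
JOIN   information_schema.innodb_trx b
       ON b.trx_id = w.BLOCKING_ENGINE_TRANSACTION_ID
JOIN   information_schema.innodb_trx r
       ON r.trx_id = w.REQUESTING_ENGINE_TRANSACTION_ID;"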

Q5: What's the difference between this and Performance Insights?

A: Performance Insights provides visual dashboards and historical trends. This script gives real-time CLI output with specific blocking chains and kill commands—ideal for incident response and automation.

10. About the Author

Chetan Yadav is a Senior Oracle, PostgreSQL, MySQL and Cloud DBA with 14+ years of experience supporting high-traffic production environments across AWS, Azure and on-premise systems. His expertise includes Oracle RAC, ASM, Data Guard, performance tuning, HA/DR design, monitoring frameworks and real-world troubleshooting.

He trains DBAs globally through deep-dive technical content, hands-on sessions and automation workflows. His mission is to help DBAs solve real production problems and advance into high-paying remote roles worldwide.

Saturday, November 29, 2025

Oracle RAC Cluster Health Audit: The 2026 Production DBA Guide

⏱️ Estimated Reading Time: 5–6 minutes


In a production Oracle Real Application Clusters (RAC) environment, stability is everything. A single corrupt OCR, a missing Voting Disk, or an unstable CRS stack can lead to node evictions (split-brain scenarios) and unplanned downtime.

This article provides a comprehensive Shell Script for RAC Cluster Health Audits. It covers CRS status, OCR integrity, Voting Disk validation, and resource stability checks—perfect for daily monitoring or pre-patching validation.


(Image: Oracle RAC Cluster Health Audit 2026 Guide for Production DBAs showing high availability database architecture and performance metrics)


Table of Contents

  1. Why You Must Audit RAC Cluster Health Daily
  2. Production-Ready RAC Health Check Script (Shell)
  3. Script Output & Analysis Explained
  4. Critical Components: OCR, Voting Disk & CRS
  5. Troubleshooting Common RAC Issues
  6. How to Automate This Audit (Cron)
  7. Interview Questions: RAC Troubleshooting
  8. Final Summary
  9. FAQ
  10. About the Author

1. Why You Must Audit RAC Cluster Health Daily

Oracle RAC relies on a complex stack of clusterware services. Neglecting these checks leads to:

  • Node Evictions: Caused by heartbeat failures or voting disk I/O timeouts.
  • OCR Corruption: Resulting in the inability to start the clusterware stack.
  • Resource Regressions: Services or VIPs flapping between nodes.
  • Split-Brain Syndrome: Where nodes lose communication and fight for control.

Running a unified audit script ensures you catch "INTERMEDIATE" or "OFFLINE" states before they become outages.


2. Production-Ready RAC Health Check Script

This shell script checks the core pillars of RAC stability: CRS Stack, OCR, Voting Disks, and Resource Status.

Note: Execute this script as the grid (or root) user.

#!/bin/bash
# ====================================================
# Oracle RAC Cluster Health Audit Script
# Author: Chetan Yadav
# Usage: ./rac_health_check.sh
# ====================================================

# Set Grid Environment (Adjust ORACLE_HOME as needed)
export ORACLE_HOME=/u01/app/19.0.0/grid
export PATH=$ORACLE_HOME/bin:$PATH

echo "=================================================="
echo " ORACLE RAC CLUSTER HEALTH AUDIT - $(date) "
echo "=================================================="

# 1. Check High Availability Services (OHAS)
echo -e "\n[1] Checking CRS/OHAS Stack Status..."
crsctl check crs

# 2. Check Voting Disk Status (Quorum)
echo -e "\n[2] Checking Voting Disk Configuration..."
crsctl query css votedisk

# 3. Check OCR Integrity (Registry)
echo -e "\n[3] Checking Oracle Cluster Registry (OCR) Integrity..."
# Note: Requires root or grid privileges
ocrcheck

# 4. Check Cluster Resources (Highlighting Issues)
echo -e "\n[4] Scanning for OFFLINE or UNSTABLE Resources..."
crsctl stat res -t | grep -E "OFFLINE|INTERMEDIATE|UNKNOWN"

# 5. Check Cluster Interconnect (Private Network)
echo -e "\n[5] Checking Cluster Interconnects..."
oifcfg getif

echo -e "\n=================================================="
echo " AUDIT COMPLETE. CHECK LOGS FOR ANY ERRORS. "
echo "=================================================="

This script consolidates five manual commands into a single health report, saving valuable time during incidents or daily checks.


3. Script Output & Analysis Explained

Check Component  | What "Healthy" Looks Like
crsctl check crs | CSS, CRS, and EVM should all show "Online". If any are offline, the node is not part of the cluster.
Voting Disk      | Must show "successful discovery" and list valid disk paths (e.g., ASM disk groups).
ocrcheck         | Look for "Cluster registry integrity check succeeded". Ensure enough free space is available.
Resource Scan    | Any resource in "INTERMEDIATE" state implies it is struggling to start or stop. "OFFLINE" is only okay for idle instances.

4. Critical Components: OCR, Voting Disk & CRS

Understanding these acronyms is vital for any RAC DBA:

  • OCR (Oracle Cluster Registry): Stores configuration info (resources, nodes, instances). If this is corrupt, the cluster cannot start.
  • Voting Disk: The "heartbeat" file. Nodes write to this to prove they are alive. Loss of voting disk = immediate node eviction (reboot).
  • CRS (Cluster Ready Services): The main daemon managing high availability.

5. Troubleshooting Common RAC Issues

If the script reports errors, follow this workflow:

  1. CRS Fails to Start: Check the clusterware alert log at $ORACLE_HOME/log/<hostname>/alert<hostname>.log (or $ORACLE_BASE/diag/crs/<hostname>/crs/trace/alert.log on 12c and later). It is often a permission issue or network failure.
  2. Voting Disk Missing: Verify ASM disk group mounting status. Run kfod disks=all to check disk visibility at OS level.
  3. Intermittent Evictions: Check network latency on the private interconnect. High latency leads to "Missed Heartbeats".
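
A few hedged commands that support this workflow; the interconnect IP is a placeholder, and ocrconfig/kfod usually require grid or root privileges:

# List automatic and manual OCR backups before attempting any restore
ocrconfig -showbackup

# Check ASM disk visibility at the OS level (step 2)
kfod disks=all

# Spot-check private interconnect reachability and latency from one node to another (step 3)
ping -c 5 10.0.0.2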



(Image: Technical diagram of 2-Node Oracle RAC Cluster Architecture verifying Private Interconnect status, Voting Disk integrity, and OCR Registry consistency during a production health audit)


6. How to Automate This Audit (Cron)

You can schedule this script to run daily at 7 AM before business hours. Add this line to the Grid user's crontab:

00 07 * * * /home/grid/scripts/rac_health_check.sh > /tmp/rac_health_$(date +\%F).log 2>&1

7. Interview Questions: RAC Troubleshooting

Prepare for these common questions during senior DBA interviews:

  • Q: What is a split-brain scenario in RAC?
    A: When nodes lose private network communication and both try to write to the database. Voting disk prevents this by fencing off one node.
  • Q: How do you backup OCR?
    A: Oracle automatically backs up OCR every 4 hours. You can also manually backup using `ocrconfig -manualbackup`.
  • Q: What command checks the private interconnect IPs?
    A: `oifcfg getif`.

8. Final Summary

A healthy RAC cluster requires vigilant monitoring of the clusterware stack, not just the database instances. The script provided above is a fundamental tool for checking CRS, OCR, and Voting Disk health instantly.

Use this script as part of your Weekly Health Check routine (as suggested in the Nov 2025 schedule) to ensure 99.999% availability.


9. FAQ

Q1: Can I run this script as the 'oracle' user?
A: Most `crsctl` check commands work, but `ocrcheck` and deep diagnostics usually require `grid` or `root` privileges.

Q2: What should I do if OCR check fails?
A: Restore from the latest automatic backup using `ocrconfig -restore`. Do not restart the stack until resolved.

Q3: Does this cause performance impact?
A: No, these are lightweight metadata queries.


About the Author

Chetan Yadav is a Senior Oracle, PostgreSQL, MySQL and Cloud DBA with 14+ years of experience supporting high-traffic production environments across AWS, Azure and on-premise systems. His expertise includes Oracle RAC, ASM, Data Guard, performance tuning, HA/DR design, monitoring frameworks and real-world troubleshooting.

He trains DBAs globally through deep-dive technical content, hands-on sessions and automation workflows. His mission is to help DBAs solve real production problems and advance into high-paying remote roles worldwide.

Explore More Technical Work

Call to Action
If you found this helpful, follow my blog and LinkedIn for deep Oracle, MySQL, and RAC content. I publish real production issues, scripts, and monitoring guides to help you level up your DBA career.



Thursday, November 27, 2025

Stop Guessing: The "One-Shot" Script to Check Oracle Data Guard Health (Lags, Gaps & MRP)

(Image: Oracle Data Guard Health Check Script Dashboard, Primary and Standby)



If you are a DBA, you know the panic of a "Quiet Standby." The alerts are silent. The phone isn't ringing. But deep down, you wonder: Is my Disaster Recovery (DR) site actually in sync, or has it been stuck on Sequence #10452 since last Tuesday?

Too many monitoring tools (like OEM or Zabbix) only trigger an alert when the lag hits a threshold (e.g., "Lag > 30 Mins"). By then, it’s often too late. You don't just want to know if there is a lag; you need to know where the lag is.

Is it the Network (Transport Lag)? Or is it the Disk/CPU (Apply Lag)?

Below is the exact script I use in my daily health checks. It consolidates the key dynamic performance views (v$database, v$dataguard_stats, v$managed_standby, v$archive_gap, and v$dataguard_status) into one single "Truth" report.

The Script (dg_health_check.sql)

Save this as dg_health_check.sql and run it on your Standby Database.

SQL
SET LINESIZE 200 PAGESIZE 1000 FEEDBACK OFF ECHO OFF VERIFY OFF
COL name FORMAT a30
COL value FORMAT a20
COL unit FORMAT a30
COL time_computed FORMAT a25
COL process FORMAT a10
COL status FORMAT a15
COL sequence# FORMAT 99999999
COL block# FORMAT 999999
COL error_message FORMAT a50

PROMPT ========================================================
PROMPT  ORACLE DATA GUARD HEALTH CHECK (Run on Standby)
PROMPT ========================================================

PROMPT
PROMPT 1. DATABASE ROLE & PROTECTION MODE
PROMPT ----------------------------------------
SELECT name, db_unique_name, database_role, open_mode, protection_mode 
FROM v$database;

PROMPT
PROMPT 2. REAL-TIME LAG STATISTICS (The Source of Truth)
PROMPT ----------------------------------------
-- Transport Lag = Delay in receiving data (Network Issue)
-- Apply Lag     = Delay in writing data (IO/CPU Issue)
SELECT name, value, unit, time_computed 
FROM v$dataguard_stats 
WHERE name IN ('transport lag', 'apply lag', 'estimated startup time');

PROMPT
PROMPT 3. MRP (MANAGED RECOVERY PROCESS) STATUS
PROMPT ----------------------------------------
-- IF NO ROWS SELECTED: Your recovery is STOPPED.
-- Look for 'APPLYING_LOG' or 'WAIT_FOR_LOG'
SELECT process, status, thread#, sequence#, block# 
FROM v$managed_standby 
WHERE process LIKE 'MRP%';

PROMPT
PROMPT 4. GAP DETECTION
PROMPT ----------------------------------------
-- If rows appear here, you have a missing archive log that FAL_SERVER could not fetch.
SELECT * FROM v$archive_gap;

PROMPT
PROMPT 5. RECENT ERRORS (Last 10 Events)
PROMPT ----------------------------------------
SELECT TO_CHAR(timestamp, 'DD-MON-RR HH24:MI:SS') as err_time, message 
FROM v$dataguard_status 
WHERE severity IN ('Error','Fatal') 
AND timestamp > sysdate-1
ORDER BY timestamp DESC
FETCH FIRST 10 ROWS ONLY;

PROMPT ========================================================
PROMPT  END OF REPORT
PROMPT ========================================================
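
To run it, connect to the standby as SYSDBA and execute the file; a typical invocation is shown below (the spool file name is just an example):

sqlplus -s / as sysdba <<EOF
SPOOL /tmp/dg_health_$(date +%F).log
@dg_health_check.sql
SPOOL OFF
EOF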

(Image: Oracle Data Guard Transport Lag vs Apply Lag Architecture Diagram)


How to Analyze the Output (Like a Senior DBA)

Scenario A: High Transport Lag

  • What you see: Transport Lag is high (e.g., +00 01:20:00), but Apply Lag is low.

  • What it means: Your Primary database is generating Redo faster than your network can ship it.

  • The Fix: Check your network bandwidth. If you are using Oracle 19c or 23ai, consider enabling Redo Compression in your Data Guard broker configuration (in DGMGRL: EDIT DATABASE <db_unique_name> SET PROPERTY RedoCompression='ENABLE';).

Scenario B: High Apply Lag

  • What you see: Transport Lag is near 0, but Apply Lag is climbing (e.g., +00 00:45:00).

  • What it means: The data is there (on the standby server), but the database can't write it to disk fast enough. This often happens during batch loads or index rebuilds on the Primary.

  • The Fix: Check I/O stats on the Standby. Ensure you are using Real-Time Apply so the MRP (Managed Recovery Process) reads directly from Standby Redo Logs (SRLs) rather than waiting for archive logs to be finalized.

Scenario C: MRP Status is "WAIT_FOR_GAP"

  • What you see: In Section 3, the status is WAIT_FOR_GAP.

  • What it means: A severe gap has occurred. The Standby is missing a specific sequence number and cannot proceed until you manually register that file.

  • The Fix: Run the query in Section 4 (v$archive_gap) to identify the missing sequence, restore it from backup, and register it.
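
A hedged sketch of that recovery, run on the standby after the missing archive log has been restored from backup; the file path and sequence number are placeholders:

sqlplus -s / as sysdba <<EOF
-- Register the restored archive log so MRP can see and apply it
ALTER DATABASE REGISTER LOGFILE '/u01/arch/1_10452_1122334455.arc';

-- Restart managed recovery if it was cancelled while resolving the gap
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION;
EOF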

Why this works in 2026

Old school scripts relied on v$archived_log, which only tells you history. In modern Oracle Cloud (OCI) and Hybrid environments, v$dataguard_stats is the only view that accurately calculates the time difference between the Primary commit and the Standby visibility.

Chetan Yadav is a Senior Oracle, PostgreSQL, MySQL & Cloud DBA with 14+ years of experience supporting high-traffic production environments across AWS, Azure, and on-premise systems. His core expertise includes Oracle RAC, ASM, Data Guard, performance tuning, HA/DR design, monitoring frameworks, and real-world troubleshooting.

He also trains DBAs globally through deep-dive technical content, hands-on sessions, and automation workflows using n8n, AI tools, and modern monitoring stacks. His mission is to help DBAs solve real production problems and grow into high-paying remote roles worldwide.

Chetan regularly publishes expert content across Oracle, PostgreSQL, MySQL, and Cloud DBA technologies—including performance tuning guides, DR architectures, monitoring tools, scripts, and real incident-based case studies.


Explore More Technical Work

LinkedIn (Professional Profile & Articles)
https://www.linkedin.com/in/chetanyadavvds/

YouTube – Oracle Foundations Playlist
https://www.youtube.com/playlist?list=PL5TN6ECUWGROHQGXep_5hff-2ageWTp4b

Telegram – LevelUp_Careers DBA Tips
https://t.me/LevelUp_Careers

Instagram – Oracle/Cloud Learning Reels
https://www.instagram.com/levelup_careers/

Facebook Page – OracleDBAInfo
https://www.facebook.com/OracleDBAInfo

These platforms feature guides, scripts, diagrams, troubleshooting workflows, and real-world DBA case studies designed for database professionals worldwide.