Saturday, November 29, 2025

Oracle RAC Cluster Health Audit: The 2026 Production DBA Guide

⏱️ Estimated Reading Time: 5–6 minutes


In a production Oracle Real Application Clusters (RAC) environment, stability is everything. A single corrupt OCR, a missing Voting Disk, or an unstable CRS stack can lead to node evictions (split-brain scenarios) and unplanned downtime.

This article provides a comprehensive Shell Script for RAC Cluster Health Audits. It covers CRS status, OCR integrity, Voting Disk validation, and resource stability checks—perfect for daily monitoring or pre-patching validation.


Oracle RAC Cluster Health Audit 2026 Guide for Production DBAs showing high availability database architecture and performance metrics


Table of Contents

  1. Why You Must Audit RAC Cluster Health Daily
  2. Production-Ready RAC Health Check Script (Shell)
  3. Script Output & Analysis Explained
  4. Critical Components: OCR, Voting Disk & CRS
  5. Troubleshooting Common RAC Issues
  6. How to Automate This Audit (Cron)
  7. Interview Questions: RAC Troubleshooting
  8. Final Summary
  9. FAQ
  10. About the Author

1. Why You Must Audit RAC Cluster Health Daily

Oracle RAC relies on a complex stack of clusterware services. Neglecting these checks leads to:

  • Node Evictions: Caused by heartbeat failures or voting disk I/O timeouts.
  • OCR Corruption: Resulting in the inability to start the clusterware stack.
  • Resource Regressions: Services or VIPs flapping between nodes.
  • Split-Brain Syndrome: Where nodes lose communication and fight for control.

Running a unified audit script ensures you catch "INTERMEDIATE" or "OFFLINE" states before they become outages.


2. Production-Ready RAC Health Check Script

This shell script checks the core pillars of RAC stability: CRS Stack, OCR, Voting Disks, and Resource Status.

Note: Execute this script as the grid (or root) user.

#!/bin/bash # ==================================================== # Oracle RAC Cluster Health Audit Script # Author: Chetan Yadav # Usage: ./rac_health_check.sh # ==================================================== # Set Grid Environment (Adjust ORACLE_HOME as needed) export ORACLE_HOME=/u01/app/19.0.0/grid export PATH=$ORACLE_HOME/bin:$PATH echo "==================================================" echo " ORACLE RAC CLUSTER HEALTH AUDIT - $(date) " echo "==================================================" # 1. Check High Availability Services (OHAS) echo -e "\n[1] Checking CRS/OHAS Stack Status..." crsctl check crs # 2. Check Voting Disk Status (Quorum) echo -e "\n[2] Checking Voting Disk Configuration..." crsctl query css votedisk # 3. Check OCR Integrity (Registry) echo -e "\n[3] Checking Oracle Cluster Registry (OCR) Integrity..." # Note: Requires root or grid privileges ocrcheck # 4. Check Cluster Resources (Highlighting Issues) echo -e "\n[4] Scanning for OFFLINE or UNSTABLE Resources..." crsctl stat res -t | grep -E "OFFLINE|INTERMEDIATE|UNKNOWN" # 5. Check Cluster Interconnect (Private Network) echo -e "\n[5] Checking Cluster Interconnects..." oifcfg getif echo -e "\n==================================================" echo " AUDIT COMPLETE. CHECK LOGS FOR ANY ERRORS. " echo "=================================================="

This script consolidates five manual commands into a single health report, saving valuable time during incidents or daily checks.


3. Script Output & Analysis Explained

Check ComponentWhat "Healthy" Looks Like
crsctl check crsCSS, CRS, and EVM should all show "Online". If any are offline, the node is not part of the cluster.
Voting DiskMust show "successful discovery" and list valid disk paths (e.g., ASM disk groups).
ocrcheckLook for "Cluster registry integrity check succeeded". Ensure enough free space is available.
Resource ScanAny resource in "INTERMEDIATE" state implies it is struggling to start or stop. "OFFLINE" is only okay for idle instances.

4. Critical Components: OCR, Voting Disk & CRS

Understanding these acronyms is vital for any RAC DBA:

  • OCR (Oracle Cluster Registry): Stores configuration info (resources, nodes, instances). If this is corrupt, the cluster cannot start.
  • Voting Disk: The "heartbeat" file. Nodes write to this to prove they are alive. Loss of voting disk = immediate node eviction (reboot).
  • CRS (Cluster Ready Services): The main daemon managing high availability.

5. Troubleshooting Common RAC Issues

If the script reports errors, follow this workflow:

  1. CRS Fails to Start: Check $ORACLE_HOME/log/hostname/alerthostname.log. It is often a permission issue or network failure.
  2. Voting Disk Missing: Verify ASM disk group mounting status. Run kfod disks=all to check disk visibility at OS level.
  3. Intermittent Evictions: Check network latency on the private interconnect. High latency leads to "Missed Heartbeats".



Technical diagram of 2-Node Oracle RAC Cluster Architecture verifying Private Interconnect status, Voting Disk integrity, and OCR Registry consistency during a production health audit


6. How to Automate This Audit (Cron)

You can schedule this script to run daily at 7 AM before business hours. Add this line to the Grid user's crontab:

00 07 * * * /home/grid/scripts/rac_health_check.sh > /tmp/rac_health_$(date +\%F).log 2>&1

7. Interview Questions: RAC Troubleshooting

Prepare for these common questions during senior DBA interviews:

  • Q: What is a split-brain scenario in RAC?
    A: When nodes lose private network communication and both try to write to the database. Voting disk prevents this by fencing off one node.
  • Q: How do you backup OCR?
    A: Oracle automatically backs up OCR every 4 hours. You can also manually backup using `ocrconfig -manualbackup`.
  • Q: What command checks the private interconnect IPs?
    A: `oifcfg getif`.

8. Final Summary

A healthy RAC cluster requires vigilant monitoring of the clusterware stack, not just the database instances. The script provided above is a fundamental tool for checking CRS, OCR, and Voting Disk health instantly.

Use this script as part of your Weekly Health Check routine (as suggested in the Nov 2025 schedule) to ensure 99.999% availability.


9. FAQ

Q1: Can I run this script as the 'oracle' user?
A: Most `crsctl` check commands work, but `ocrcheck` and deep diagnostics usually require `grid` or `root` privileges.

Q2: What should I do if OCR check fails?
A: Restore from the latest automatic backup using `ocrconfig -restore`. Do not restart the stack until resolved.

Q3: Does this cause performance impact?
A: No, these are lightweight metadata queries.


About the Author

Chetan Yadav is a Senior Oracle, PostgreSQL, MySQL and Cloud DBA with 14+ years of experience supporting high-traffic production environments across AWS, Azure and on-premise systems. His expertise includes Oracle RAC, ASM, Data Guard, performance tuning, HA/DR design, monitoring frameworks and real-world troubleshooting.

He trains DBAs globally through deep-dive technical content, hands-on sessions and automation workflows. His mission is to help DBAs solve real production problems and advance into high-paying remote roles worldwide.

Explore More Technical Work

Call to Action
If you found this helpful, follow my blog and LinkedIn for deep Oracle, MySQL, and RAC content. I publish real production issues, scripts, and monitoring guides to help you level up your DBA career.



Thursday, November 27, 2025

Stop Guessing: The "One-Shot" Script to Check Oracle Data Guard Health (Lags, Gaps & MRP)

Oracle Data Guard Health Check Script Dashboard Primary Standby



If you are a DBA, you know the panic of a "Quiet Standby." The alerts are silent. The phone isn't ringing. But deep down, you wonder: Is my Disaster Recovery (DR) site actually in sync, or has it been stuck on Sequence #10452 since last Tuesday?

Too many monitoring tools (like OEM or Zabbix) only trigger an alert when the lag hits a threshold (e.g., "Lag > 30 Mins"). By then, it’s often too late. You don't just want to know if there is a lag; you need to know where the lag is.

Is it the Network (Transport Lag)? Or is it the Disk/CPU (Apply Lag)?

Below is the exact script I use in my daily health checks. It consolidates 4 different dynamic performance views (v$dataguard_statsv$managed_standbyv$archive_gapv$database) into one single "Truth" report.

The Script (dg_health_check.sql)

Save this as dg_health_check.sql and run it on your Standby Database.

SQL
SET LINESIZE 200 PAGESIZE 1000 CHECK OFF FEEDBACK OFF ECHO OFF VERIFY OFF
COL name FORMAT a30
COL value FORMAT a20
COL unit FORMAT a30
COL time_computed FORMAT a25
COL process FORMAT a10
COL status FORMAT a15
COL sequence# FORMAT 99999999
COL block# FORMAT 999999
COL error_message FORMAT a50

PROMPT ========================================================
PROMPT  ORACLE DATA GUARD HEALTH CHECK (Run on Standby)
PROMPT ========================================================

PROMPT
PROMPT 1. DATABASE ROLE & PROTECTION MODE
PROMPT ----------------------------------------
SELECT name, db_unique_name, database_role, open_mode, protection_mode 
FROM v$database;

PROMPT
PROMPT 2. REAL-TIME LAG STATISTICS (The Source of Truth)
PROMPT ----------------------------------------
-- Transport Lag = Delay in receiving data (Network Issue)
-- Apply Lag     = Delay in writing data (IO/CPU Issue)
SELECT name, value, unit, time_computed 
FROM v$dataguard_stats 
WHERE name IN ('transport lag', 'apply lag', 'estimated startup time');

PROMPT
PROMPT 3. MRP (MANAGED RECOVERY PROCESS) STATUS
PROMPT ----------------------------------------
-- IF NO ROWS SELECTED: Your recovery is STOPPED.
-- Look for 'APPLYING_LOG' or 'WAIT_FOR_LOG'
SELECT process, status, thread#, sequence#, block# 
FROM v$managed_standby 
WHERE process LIKE 'MRP%';

PROMPT
PROMPT 4. GAP DETECTION
PROMPT ----------------------------------------
-- If rows appear here, you have a missing archive log that FAL_SERVER could not fetch.
SELECT * FROM v$archive_gap;

PROMPT
PROMPT 5. RECENT ERRORS (Last 10 Events)
PROMPT ----------------------------------------
SELECT TO_CHAR(timestamp, 'DD-MON-RR HH24:MI:SS') as err_time, message 
FROM v$dataguard_status 
WHERE severity IN ('Error','Fatal') 
AND timestamp > sysdate-1
ORDER BY timestamp DESC
FETCH FIRST 10 ROWS ONLY;

PROMPT ========================================================
PROMPT  END OF REPORT
PROMPT ========================================================

Oracle Data Guard Transport Lag vs Apply Lag Architecture Diagram


How to Analyze the Output (Like a Senior DBA)

Scenario A: High Transport Lag

  • What you see: Transport Lag is high (e.g., +00 01:20:00), but Apply Lag is low.

  • What it means: Your Primary database is generating Redo faster than your network can ship it.

  • The Fix: Check your network bandwidth. If you are using Oracle 19c or 23ai, consider enabling Redo Compression in your Data Guard broker configuration (EditDatabase Set Property RedoCompression='ENABLE').

Scenario B: High Apply Lag

  • What you see: Transport Lag is near 0, but Apply Lag is climbing (e.g., +00 00:45:00).

  • What it means: The data is there (on the standby server), but the database can't write it to disk fast enough. This often happens during batch loads or index rebuilds on the Primary.

  • The Fix: Check I/O stats on the Standby. Ensure you are using Real-Time Apply so the MRP (Managed Recovery Process) reads directly from Standby Redo Logs (SRLs) rather than waiting for archive logs to be finalized.

Scenario C: MRP Status is "WAIT_FOR_GAP"

  • What you see: In Section 3, the status is WAIT_FOR_GAP.

  • What it means: A severe gap has occurred. The Standby is missing a specific sequence number and cannot proceed until you manually register that file.

  • The Fix: Run the query in Section 4 (v$archive_gap) to identify the missing sequence, restore it from backup, and register it.

Why this works in 2026

Old school scripts relied on v$archived_log, which only tells you history. In modern Oracle Cloud (OCI) and Hybrid environments, v$dataguard_stats is the only view that accurately calculates the time difference between the Primary commit and the Standby visibility.

Chetan Yadav is a Senior Oracle, PostgreSQL, MySQL & Cloud DBA with 14+ years of experience supporting high-traffic production environments across AWS, Azure, and on-premise systems. His core expertise includes Oracle RAC, ASM, Data Guard, performance tuning, HA/DR design, monitoring frameworks, and real-world troubleshooting.

He also trains DBAs globally through deep-dive technical content, hands-on sessions, and automation workflows using n8n, AI tools, and modern monitoring stacks. His mission is to help DBAs solve real production problems and grow into high-paying remote roles worldwide.

Chetan regularly publishes expert content across Oracle, PostgreSQL, MySQL, and Cloud DBA technologies—including performance tuning guides, DR architectures, monitoring tools, scripts, and real incident-based case studies.


Explore More Technical Work

LinkedIn (Professional Profile & Articles)
https://www.linkedin.com/in/chetanyadavvds/

YouTube – Oracle Foundations Playlist
https://www.youtube.com/playlist?list=PL5TN6ECUWGROHQGXep_5hff-2ageWTp4b

Telegram – LevelUp_Careers DBA Tips
https://t.me/LevelUp_Careers

Instagram – Oracle/Cloud Learning Reels
https://www.instagram.com/levelup_careers/

Facebook Page – OracleDBAInfo
https://www.facebook.com/OracleDBAInfo

These platforms feature guides, scripts, diagrams, troubleshooting workflows, and real-world DBA case studies designed for database professionals worldwide.

Monday, November 24, 2025

MySQL Slow Query Diagnostic Script (2025): A Complete Production DBA Guide


Estimated Reading Time: 6–7 minutes

Slow queries are one of the biggest reasons for performance degradation in MySQL and Aurora MySQL environments. High latency SQL can create CPU spikes, I/O pressure, row lock waits, replication lag, and application-level timeouts.

This article provides a production-ready MySQL Slow Query Diagnostic Script, explains how to interpret the results, and shows how DBAs can use this script for proactive tuning and operational monitoring.


MySQL Slow Query Diagnostic Script banner showing SQL performance diagnostics, performance schema analysis and slow query monitoring dashboard


Table of Contents

1.      What Slow Query Diagnostics Mean for MySQL DBAs

2.      Production-Ready MySQL Slow Query Diagnostic Script

3.      Script Output Explained

4.      Additional Performance Metrics to Watch

5.      Add-On Scripts (Top by Buffer Gets, Disk Reads)

6.      Real-World MySQL DBA Scenario

7.      How to Automate These Checks

8.      Interview Questions

9.      Final Summary

10.  FAQ

11.  About the Author

12.  Call to Action (CTA)


1. What Slow Query Diagnostics Mean for MySQL DBAs

Slow queries lead to:

·         High CPU utilisation

·         Increased IOPS and latency

·         Row lock waits and deadlocks

·         Replication lag in Aurora MySQL / RDS MySQL

·         Query timeout issues at the application layer

·         Poor customer experience under load

MySQL’s Performance Schema provides deep visibility into SQL patterns, allowing DBAs to identify:

·         High-latency queries

·         Full table scans

·         Missing index patterns

·         SQL causing temporary tables

·         SQL responsible for heavy disk reads

·         SQL generating high row examinations

Slow query diagnostics are essential for maintaining consistent performance in production systems.


2. Production-Ready MySQL Slow Query Diagnostic Script

This script analyses execution time, latency, row scans and query patterns using Performance Schema:

/* MySQL Slow Query Diagnostic Script
   Works on: MySQL 5.7, MySQL 8.0, Aurora MySQL
*/
 
SELECT 
      DIGEST_TEXT AS Query_Sample,
      SCHEMA_NAME AS Database_Name,
      COUNT_STAR AS Execution_Count,
      ROUND(SUM_TIMER_WAIT/1000000000000, 4) AS Total_Time_Seconds,
      ROUND((SUM_TIMER_WAIT/COUNT_STAR)/1000000000000, 6) AS Avg_Time_Per_Exec,
      SUM_ROWS_EXAMINED AS Rows_Examined,
      SUM_ROWS_SENT AS Rows_Sent,
      FIRST_SEEN,
      LAST_SEEN
FROM performance_schema.events_statements_summary_by_digest
WHERE SCHEMA_NAME NOT IN ('mysql','sys','performance_schema','information_schema')
ORDER BY Total_Time_Seconds DESC
LIMIT 20;

This is a field-tested script used in multiple production environments including AWS RDS MySQL and Amazon Aurora MySQL.


3. Script Output Explained

Column

Meaning

Query_Sample

Normalized version of SQL for pattern analysis

Database_Name

Schema on which SQL is executed

Execution_Count

How many times the SQL pattern ran

Total_Time_Seconds

Total execution time consumed

Avg_Time_Per_Exec

Average latency per execution

Rows_Examined

Total rows scanned (detects full scans)

Rows_Sent

Rows returned by the query

FIRST_SEEN / LAST_SEEN

Time window of activity

These values help DBAs identify the highest-impact SQL patterns immediately.


4. Additional Performance Metrics You Must Watch

During slow query investigations, always check:

·         High Rows_Examined → Missing index

·         High Avg_Time_Per_Exec → Expensive joins or sorting

·         High Rows_Examined vs Rows_Sent difference → Inefficient filtering

·         High Execution_Count → Inefficient query called repeatedly

·         Repeated occurrence between FIRST_SEEN and LAST_SEEN → Ongoing issue

MySQL workload analysis becomes easy when these metrics are evaluated together.


5. Add-On Script: Top SQL by Buffer Gets

Useful for identifying CPU-heavy SQL:

SELECT 
    sql_id,
    buffer_gets,
    executions,
    ROUND(buffer_gets/EXECUTIONS, 2) AS gets_per_exec,
    sql_text
FROM performance_schema.events_statements_summary_by_digest
ORDER BY buffer_gets DESC
LIMIT 10;

MySQL Performance Schema slow query capture and analysis architecture diagram showing SQL flow from application through parser, execution engine, data collection, digest normalization and summary tables for DBA monitoring.

6. Add-On Script: Top SQL by Disk Reads

Identifies IO-intensive SQL patterns:

SELECT 
    DIGEST_TEXT,
    SUM_ROWS_EXAMINED,
    SUM_ROWS_SENT,
    SUM_CREATED_TMP_TABLES,
    SUM_CREATED_TMP_DISK_TABLES
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_CREATED_TMP_DISK_TABLES DESC
LIMIT 10;

These help diagnose latency issues caused by slow storage or inefficient joins.


7. Real-World MySQL DBA Scenario

A typical incident scenario:

1.      Application complaints about slow API response

2.      CloudWatch shows high read latency

3.      Slow query log or Performance Schema shows a SQL digest consuming high execution time

4.      SQL performs a full table scan on a large table

5.      Missing index identified on a WHERE clause or JOIN condition

6.      Index added / query refactored

7.      Latency drops, performance normalises

This is the real process DBAs follow for incident resolution.


8. How to Automate These Checks

DBAs typically automate slow query monitoring using:

·         Linux cron + shell scripts

·         Python automation with scheduling

·         n8n workflows + MySQL nodes

·         AWS CloudWatch + Lambda alerts for Aurora MySQL

·         Grafana + Prometheus exporters

·         Slack / Teams notifications for high-latency SQL

Automation ensures issues are detected before users experience downtime.


9. Interview Questions – Slow Query Diagnostics

Be ready for:

·         How do you find top slow queries in MySQL?

·         What is the advantage of Performance Schema?

·         Difference between Rows_Examined and Rows_Sent?

·         What creates temporary disk tables?

·         How do you detect missing indexes from slow queries?

·         How do you reduce query execution time?

·         How does MySQL slow query log differ from Performance Schema?

Mentioning these scripts gives you a strong technical advantage.


10. Final Summary

Slow query diagnostics are essential for maintaining high performance in MySQL, Aurora MySQL, and RDS MySQL systems. The diagnostic script provided above offers deep visibility into SQL patterns, latency contributors and row scan behaviour.

This script can be used for daily health checks, tuning analysis, or fully automated monitoring workflows.


11. FAQ – MySQL Slow Query Diagnostics

Q1: What causes slow queries in MySQL?
Missing indexes, inefficient joins, large table scans, temporary table creation, outdated statistics, or poor schema design.

Q2: Does this script work in Aurora MySQL?
Yes, it works in Aurora MySQL 2.x/3.x because Performance Schema is supported.

Q3: Should I enable slow query logs as well?
Yes, slow query logs complement Performance Schema for long-running queries.

Q4: Can this script detect full table scans?
Yes—high Rows_Examined with low Rows_Sent is a clear indicator.

Q5: Does this script impact performance?
No, Performance Schema summary tables are lightweight.

 About the Author

Chetan Yadav is a Senior Oracle, PostgreSQL, MySQL and Cloud DBA with 14+ years of experience supporting high-traffic production environments across AWS, Azure and on-premise systems. His expertise includes Oracle RAC, ASM, Data Guard, performance tuning, HA/DR design, monitoring frameworks and real-world troubleshooting.

He trains DBAs globally through deep-dive technical content, hands-on sessions and automation workflows using n8n, AI tools and modern monitoring stacks. His mission is to help DBAs solve real production problems and advance into high-paying remote roles worldwide.

Chetan regularly publishes expert content across Oracle, PostgreSQL, MySQL and Cloud DBA technologies—including performance tuning guides, DR architectures, monitoring tools, scripts and real incident-based case studies.

Explore More Technical Work

LinkedIn (Professional Profile & Articles)
https://www.linkedin.com/in/chetanyadavvds/

YouTube – Oracle Foundations Playlist
https://www.youtube.com/playlist?list=PL5TN6ECUWGROHQGXep_5hff-2ageWTp4b

Telegram – LevelUp_Careers DBA Tips
https://t.me/LevelUp_Careers

Instagram – Oracle/Cloud Learning Reels
https://www.instagram.com/levelup_careers/

Facebook Page – OracleDBAInfo
https://www.facebook.com/OracleDBAInfo

These platforms feature guides, scripts, diagrams, troubleshooting workflows and real-world DBA case studies designed for database professionals worldwide.

Call to Action
If you found this helpful, follow my blog and LinkedIn for deep Oracle, MySQL, PostgreSQL and Cloud DBA content. I publish real production issues, scripts, case studies and monitoring guides that help DBAs grow in their career.