
Saturday, November 29, 2025

Oracle RAC Cluster Health Audit: The 2026 Production DBA Guide

⏱️ Estimated Reading Time: 5–6 minutes


In a production Oracle Real Application Clusters (RAC) environment, stability is everything. A corrupt OCR, a missing voting disk, or an unstable CRS stack can lead to node evictions (the clusterware's defense against split-brain) and unplanned downtime.

This article provides a shell script for RAC cluster health audits. It covers CRS status, OCR integrity, voting disk validation, and resource stability checks, which makes it suitable for daily monitoring or pre-patching validation.


[Figure: Oracle RAC Cluster Health Audit 2026 Guide for Production DBAs, showing high availability database architecture and performance metrics]


Table of Contents

  1. Why You Must Audit RAC Cluster Health Daily
  2. Production-Ready RAC Health Check Script (Shell)
  3. Script Output & Analysis Explained
  4. Critical Components: OCR, Voting Disk & CRS
  5. Troubleshooting Common RAC Issues
  6. How to Automate This Audit (Cron)
  7. Interview Questions: RAC Troubleshooting
  8. Final Summary
  9. FAQ
  10. About the Author

1. Why You Must Audit RAC Cluster Health Daily

Oracle RAC relies on a complex stack of clusterware services. Neglecting these checks leads to:

  • Node Evictions: Caused by heartbeat failures or voting disk I/O timeouts.
  • OCR Corruption: Resulting in the inability to start the clusterware stack.
  • Resource Instability: Services or VIPs flapping between nodes.
  • Split-Brain Syndrome: Where nodes lose communication and fight for control.

Running a unified audit script ensures you catch "INTERMEDIATE" or "OFFLINE" states before they become outages.


2. Production-Ready RAC Health Check Script

This shell script checks the core pillars of RAC stability: CRS Stack, OCR, Voting Disks, and Resource Status.

Note: Execute this script as the grid (or root) user.

    #!/bin/bash
    # ====================================================
    # Oracle RAC Cluster Health Audit Script
    # Author: Chetan Yadav
    # Usage: ./rac_health_check.sh
    # ====================================================

    # Set Grid Environment (Adjust ORACLE_HOME as needed)
    export ORACLE_HOME=/u01/app/19.0.0/grid
    export PATH=$ORACLE_HOME/bin:$PATH

    echo "=================================================="
    echo " ORACLE RAC CLUSTER HEALTH AUDIT - $(date) "
    echo "=================================================="

    # 1. Check High Availability Services (OHAS)
    echo -e "\n[1] Checking CRS/OHAS Stack Status..."
    crsctl check crs

    # 2. Check Voting Disk Status (Quorum)
    echo -e "\n[2] Checking Voting Disk Configuration..."
    crsctl query css votedisk

    # 3. Check OCR Integrity (Registry)
    echo -e "\n[3] Checking Oracle Cluster Registry (OCR) Integrity..."
    # Note: Requires root or grid privileges
    ocrcheck

    # 4. Check Cluster Resources (Highlighting Issues)
    echo -e "\n[4] Scanning for OFFLINE or UNSTABLE Resources..."
    crsctl stat res -t | grep -E "OFFLINE|INTERMEDIATE|UNKNOWN"

    # 5. Check Cluster Interconnect (Private Network)
    echo -e "\n[5] Checking Cluster Interconnects..."
    oifcfg getif

    echo -e "\n=================================================="
    echo " AUDIT COMPLETE. CHECK LOGS FOR ANY ERRORS. "
    echo "=================================================="

This script consolidates five manual commands into a single health report, saving valuable time during incidents or daily checks.
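
The script above prints each command's output but does not record whether a check failed. One possible extension, shown here as a minimal sketch, wraps each command so that failures are counted and can drive the script's exit status. The names `run_check` and `FAILED_CHECKS` are hypothetical, and the sketch assumes the clusterware tools return a non-zero exit status on failure, which you should confirm on your version.

```shell
#!/bin/bash
# Sketch only: run_check and FAILED_CHECKS are hypothetical names,
# not part of the original audit script.

FAILED_CHECKS=0

# run_check LABEL COMMAND [ARGS...]
# Prints a header, runs the command, and counts a non-zero exit status.
run_check() {
    local label=$1 rc
    shift
    printf '\n[%s]\n' "$label"
    "$@"
    rc=$?
    if [ "$rc" -eq 0 ]; then
        printf 'RESULT: %s OK\n' "$label"
    else
        printf 'RESULT: %s FAILED (exit %d)\n' "$label" "$rc"
        FAILED_CHECKS=$((FAILED_CHECKS + 1))
    fi
    return "$rc"
}
```

In the audit script this could be called as `run_check "CRS stack" crsctl check crs` for each step, with `exit "$FAILED_CHECKS"` as the last line so that cron or a monitoring agent sees a non-zero status when something failed.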


3. Script Output & Analysis Explained

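| Check Component | What "Healthy" Looks Like |
|---|---|
| crsctl check crs | OHAS, CRS, CSS, and EVM should all report "online". If any is offline, the clusterware stack on that node is not fully up. |
| Voting Disk | Each voting disk should show STATE "ONLINE" with a valid disk path (e.g., an ASM disk group), followed by the "Located N voting disk(s)" summary line. |
| ocrcheck | Look for "Cluster registry integrity check succeeded". Ensure enough free space is available. When run as a non-root user, the logical corruption check is bypassed. |
| Resource Scan | Any resource in "INTERMEDIATE" state implies it is struggling to start or stop. "OFFLINE" is acceptable only when the resource's TARGET is also OFFLINE (for example, an intentionally stopped instance). |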
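
A plain `grep` on the tabular output shows the state lines without the resource names they belong to. As a rough sketch, the non-tabular `crsctl stat res` output (blocks of `NAME=`, `TARGET=`, `STATE=` lines) can be filtered so that only resources whose target is ONLINE but whose state is not are reported. The function name `flag_unhealthy_resources` is hypothetical, and the sketch assumes the comma-separated TARGET and STATE entries line up position by position.

```shell
#!/bin/bash
# Sketch only: flag_unhealthy_resources is a hypothetical helper.
# Reads "crsctl stat res" (without -t) output on stdin and prints
# "<resource>: <state>" for every entry whose TARGET is ONLINE
# while its STATE is anything other than ONLINE.

flag_unhealthy_resources() {
    awk -F= '
        $1 == "NAME"   { name = $2 }
        $1 == "TARGET" { target = $2 }
        $1 == "STATE"  {
            n = split(target, t, ",")
            split($2, s, ",")
            for (i = 1; i <= n; i++) {
                if (t[i] ~ /ONLINE/ && s[i] !~ /^ *ONLINE/) {
                    gsub(/^ +| +$/, "", s[i])
                    printf "%s: %s\n", name, s[i]
                }
            }
        }
    '
}
```

On a cluster node it would be fed with `crsctl stat res | flag_unhealthy_resources`. Resources that are deliberately stopped (TARGET=OFFLINE) are skipped, which keeps the report focused on genuine mismatches.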

4. Critical Components: OCR, Voting Disk & CRS

Understanding these acronyms is vital for any RAC DBA:

  • OCR (Oracle Cluster Registry): Stores configuration info (resources, nodes, instances). If this is corrupt, the cluster cannot start.
  • Voting Disk: The "heartbeat" file. Nodes write to it to prove they are alive. A node that cannot access more than half of the configured voting disks is evicted (rebooted).
  • CRS (Cluster Ready Services): The main daemon managing high availability.
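
The voting disk rule is a strict majority: a node must be able to access more than half of the configured voting disks. The arithmetic can be sketched as below. The function names are hypothetical and only illustrate the calculation.

```shell
#!/bin/bash
# Sketch only: hypothetical helpers for the voting disk majority rule.

# Minimum number of voting disks a node must be able to access
# (strictly more than half of the configured total).
votedisk_required() {
    echo $(( $1 / 2 + 1 ))
}

# Number of voting disks that can be lost before the majority is gone.
votedisk_tolerated_failures() {
    echo $(( ($1 - 1) / 2 ))
}
```

With 3 voting disks a node needs 2 and can lose 1; with 5 it needs 3 and can lose 2. An even count adds no extra tolerance (4 disks still tolerate only 1 failure), which is why odd counts are the norm.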

5. Troubleshooting Common RAC Issues

If the script reports errors, follow this workflow:

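  1. CRS Fails to Start: Check the clusterware alert log. On 12.1.0.2 and later (including the 19c home used in the script) it is under ADR at $ORACLE_BASE/diag/crs/<hostname>/crs/trace/alert.log; on 11g it is $ORACLE_HOME/log/<hostname>/alert<hostname>.log. The cause is often a permission issue or a network failure.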
  2. Voting Disk Missing: Verify ASM disk group mounting status. Run kfod disks=all to check disk visibility at OS level.
  3. Intermittent Evictions: Check network latency on the private interconnect. High latency leads to "Missed Heartbeats".
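
For the interconnect step, it helps to know which interfaces are actually registered as private. The sketch below picks them out of `oifcfg getif` output, assuming the usual four-column layout (interface, subnet, scope, type). The function name `interconnect_interfaces` is hypothetical.

```shell
#!/bin/bash
# Sketch only: interconnect_interfaces is a hypothetical helper.
# Reads "oifcfg getif" output on stdin and prints the interface name
# and subnet of every entry whose type includes cluster_interconnect
# (this also matches combined types such as cluster_interconnect,asm).

interconnect_interfaces() {
    awk '$NF ~ /cluster_interconnect/ { print $1, $2 }'
}
```

On a node, `oifcfg getif | interconnect_interfaces` lists the private interfaces. Each name can then be handed to a latency probe such as `ping -c 5 -I <interface> <peer-private-ip>`, where the peer address is a placeholder for another node's private IP.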



[Figure: 2-node Oracle RAC cluster architecture, verifying private interconnect status, voting disk integrity, and OCR consistency during a production health audit]


6. How to Automate This Audit (Cron)

You can schedule this script to run daily at 7 AM before business hours. Add this line to the Grid user's crontab:

00 07 * * * /home/grid/scripts/rac_health_check.sh > /tmp/rac_health_$(date +\%F).log 2>&1
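
Because the cron entry writes one dated log per day to /tmp, the files accumulate over time. A small cleanup function, sketched below, removes audit logs older than a chosen number of days. The name `prune_rac_logs` and the 14-day retention in the usage note are assumptions, and `-maxdepth` and `-delete` require GNU or BSD `find`.

```shell
#!/bin/bash
# Sketch only: prune_rac_logs is a hypothetical helper.
# prune_rac_logs DIR DAYS
# Deletes rac_health_*.log files in DIR (not in subdirectories) that
# were last modified more than DAYS days ago.

prune_rac_logs() {
    local dir=$1 days=$2
    find "$dir" -maxdepth 1 -type f -name 'rac_health_*.log' \
        -mtime +"$days" -delete
}
```

Calling `prune_rac_logs /tmp 14` at the end of the audit script would keep roughly two weeks of history. Only files matching the `rac_health_*.log` pattern are touched.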

7. Interview Questions: RAC Troubleshooting

Prepare for these common questions during senior DBA interviews:

  • Q: What is a split-brain scenario in RAC?
    A: It occurs when nodes lose private network communication and each side keeps trying to write to the database independently. The voting disk prevents this by fencing off (evicting) the side that loses the vote.
  • Q: How do you back up the OCR?
    A: Oracle automatically backs up the OCR every 4 hours. You can also take a manual backup with `ocrconfig -manualbackup` and list existing backups with `ocrconfig -showbackup`.
  • Q: What command checks the private interconnect IPs?
    A: `oifcfg getif`.

8. Final Summary

A healthy RAC cluster requires vigilant monitoring of the clusterware stack, not just the database instances. The script provided above is a fundamental tool for checking CRS, OCR, and Voting Disk health instantly.

Use this script as part of your Weekly Health Check routine (as suggested in the Nov 2025 schedule), in addition to the daily run, to help protect a 99.999% availability target.


9. FAQ

Q1: Can I run this script as the 'oracle' user?
A: Most `crsctl` check commands work, but `ocrcheck` and deep diagnostics usually require `grid` or `root` privileges.

Q2: What should I do if OCR check fails?
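A: Restore from the latest automatic backup using `ocrconfig -restore`. The restore runs as root and starts with the clusterware stopped on all nodes, so follow Oracle's documented procedure for your storage type. Do not return the stack to service until the issue is resolved and `ocrcheck` succeeds.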

Q3: Does this cause performance impact?
A: No, these are lightweight metadata queries.


10. About the Author

Chetan Yadav is a Senior Oracle, PostgreSQL, MySQL and Cloud DBA with 14+ years of experience supporting high-traffic production environments across AWS, Azure and on-premise systems. His expertise includes Oracle RAC, ASM, Data Guard, performance tuning, HA/DR design, monitoring frameworks and real-world troubleshooting.

He trains DBAs globally through deep-dive technical content, hands-on sessions and automation workflows. His mission is to help DBAs solve real production problems and advance into high-paying remote roles worldwide.

Call to Action
If you found this helpful, follow my blog and LinkedIn for deep Oracle, MySQL, and RAC content. I publish real production issues, scripts, and monitoring guides to help you level up your DBA career.