Who This Is For: Junior to Senior Oracle DBAs managing Oracle 19c or 21c Data Guard environments who are troubleshooting standby redo logs not applying, MRP process failures, or unexplained apply lag. Freshers studying for Oracle certification will also benefit from the architecture walkthrough.
It was 2:41 AM. PagerDuty fires. The on-call message reads: "Data Guard apply lag 38 minutes and climbing." I SSH into the standby. ORA-16766 stares back at me from the alert log. A quick check of V$MANAGED_STANDBY confirms it -- MRP0 is gone. No standby redo logs are applying. The business had a 4-hour RPO commitment. We had maybe 90 minutes before the DBA team had a very uncomfortable conversation with the CTO.
Standby redo logs not applying in Oracle Data Guard is one of the highest-stress incidents a production DBA faces. It is also one of the most fixable -- if you know the exact diagnostic tree. In this post I am going to walk through every root cause I have encountered across 15 years of Oracle production work, the precise SQL to diagnose each one, and the fix you run to get apply moving again.
Environment Reference: Oracle 19c (19.19 RU), 2-node RAC primary + single-node physical standby, SYNC transport, 6 TB database, peak 12,000 TPS. All SQL verified in this environment. Steps also apply to Oracle 21c and Oracle 23ai Data Guard configurations.
In a production Oracle Real Application Clusters (RAC) environment, stability is everything. A single corrupt OCR, a missing Voting Disk, or an unstable CRS stack can lead to node evictions (split-brain scenarios) and unplanned downtime.
This article provides a comprehensive Shell Script for RAC Cluster Health Audits. It covers CRS status, OCR integrity, Voting Disk validation, and resource stability checks—perfect for daily monitoring or pre-patching validation.
Table of Contents
Why You Must Audit RAC Cluster Health Daily
Production-Ready RAC Health Check Script (Shell)
Script Output & Analysis Explained
Critical Components: OCR, Voting Disk & CRS
Troubleshooting Common RAC Issues
How to Automate This Audit (Cron)
Interview Questions: RAC Troubleshooting
Final Summary
FAQ
About the Author
1. Why You Must Audit RAC Cluster Health Daily
Oracle RAC relies on a complex stack of clusterware services. Neglecting these checks leads to:
Node Evictions: Caused by heartbeat failures or voting disk I/O timeouts.
OCR Corruption: Resulting in the inability to start the clusterware stack.
Resource Regressions: Services or VIPs flapping between nodes.
Split-Brain Syndrome: Where nodes lose communication and fight for control.
Running a unified audit script ensures you catch "INTERMEDIATE" or "OFFLINE" states before they become outages.
2. Production-Ready RAC Health Check Script
This shell script checks the core pillars of RAC stability: CRS Stack, OCR, Voting Disks, and Resource Status.
Note: Execute this script as the grid (or root) user.
Prepare for these common questions during senior DBA interviews:
Q: What is a split-brain scenario in RAC? A: When nodes lose private network communication and both try to write to the database. Voting disk prevents this by fencing off one node.
Q: How do you backup OCR? A: Oracle automatically backs up OCR every 4 hours. You can also manually backup using `ocrconfig -manualbackup`.
Q: What command checks the private interconnect IPs? A: `oifcfg getif`.
8. Final Summary
A healthy RAC cluster requires vigilant monitoring of the clusterware stack, not just the database instances. The script provided above is a fundamental tool for checking CRS, OCR, and Voting Disk health instantly.
Use this script as part of your Weekly Health Check routine (as suggested in the Nov 2025 schedule) to ensure 99.999% availability.
9. FAQ
Q1: Can I run this script as the 'oracle' user? A: Most `crsctl` check commands work, but `ocrcheck` and deep diagnostics usually require `grid` or `root` privileges.
Q2: What should I do if OCR check fails? A: Restore from the latest automatic backup using `ocrconfig -restore`. Do not restart the stack until resolved.
Q3: Does this cause performance impact? A: No, these are lightweight metadata queries.
About the Author
Chetan Yadav is a Senior Oracle, PostgreSQL, MySQL and Cloud DBA with 14+ years of experience supporting high-traffic production environments across AWS, Azure and on-premise systems. His expertise includes Oracle RAC, ASM, Data Guard, performance tuning, HA/DR design, monitoring frameworks and real-world troubleshooting.
He trains DBAs globally through deep-dive technical content, hands-on sessions and automation workflows. His mission is to help DBAs solve real production problems and advance into high-paying remote roles worldwide.
Call to Action If you found this helpful, follow my blog and LinkedIn for deep Oracle, MySQL, and RAC content. I publish real production issues, scripts, and monitoring guides to help you level up your DBA career.
If you are a DBA, you know the panic of a "Quiet Standby." The alerts are silent. The phone isn't ringing. But deep down, you wonder: Is my Disaster Recovery (DR) site actually in sync, or has it been stuck on Sequence #10452 since last Tuesday?
Too many monitoring tools (like OEM or Zabbix) only trigger an alert when the lag hits a threshold (e.g., "Lag > 30 Mins"). By then, it’s often too late. You don't just want to know if there is a lag; you need to know where the lag is.
Is it the Network (Transport Lag)? Or is it the Disk/CPU (Apply Lag)?
Below is the exact script I use in my daily health checks. It consolidates 4 different dynamic performance views (v$dataguard_stats, v$managed_standby, v$archive_gap, v$database) into one single "Truth" report.
The Script (dg_health_check.sql)
Save this as dg_health_check.sql and run it on your Standby Database.
SQL
SET LINESIZE 200 PAGESIZE 1000CHECK OFF FEEDBACK OFF ECHO OFF VERIFY OFF
COL name FORMAT a30
COL value FORMAT a20
COL unit FORMAT a30
COL time_computed FORMAT a25
COL process FORMAT a10
COL status FORMAT a15
COL sequence# FORMAT 99999999
COL block# FORMAT 999999
COL error_message FORMAT a50
PROMPT ========================================================
PROMPT ORACLE DATA GUARD HEALTH CHECK (Run on Standby)
PROMPT ========================================================
PROMPT
PROMPT 1. DATABASE ROLE & PROTECTION MODE
PROMPT ----------------------------------------SELECT name, db_unique_name, database_role, open_mode, protection_mode
FROM v$database;
PROMPT
PROMPT 2.REAL-TIME LAG STATISTICS (The Source of Truth)
PROMPT ------------------------------------------ Transport Lag = Delay in receiving data (Network Issue)-- Apply Lag = Delay in writing data (IO/CPU Issue)SELECT name, value, unit, time_computed
FROM v$dataguard_stats
WHERE name IN ('transport lag', 'apply lag', 'estimated startup time');
PROMPT
PROMPT 3. MRP (MANAGED RECOVERY PROCESS) STATUS
PROMPT ------------------------------------------ IF NO ROWS SELECTED: Your recovery is STOPPED.-- Look for 'APPLYING_LOG' or 'WAIT_FOR_LOG'SELECT process, status, thread#, sequence#, block#
FROM v$managed_standby
WHERE process LIKE'MRP%';
PROMPT
PROMPT 4. GAP DETECTION
PROMPT ------------------------------------------ If rows appear here, you have a missing archive log that FAL_SERVER could not fetch.SELECT*FROM v$archive_gap;
PROMPT
PROMPT 5. RECENT ERRORS (Last10 Events)
PROMPT ----------------------------------------SELECT TO_CHAR(timestamp, 'DD-MON-RR HH24:MI:SS') as err_time, message
FROM v$dataguard_status
WHERE severity IN ('Error','Fatal')
ANDtimestamp> sysdate-1ORDERBYtimestampDESCFETCHFIRST10ROWSONLY;
PROMPT ========================================================
PROMPT ENDOF REPORT
PROMPT ========================================================
How to Analyze the Output (Like a Senior DBA)
Scenario A: High Transport Lag
What you see:Transport Lag is high (e.g., +00 01:20:00), but Apply Lag is low.
What it means: Your Primary database is generating Redo faster than your network can ship it.
The Fix: Check your network bandwidth. If you are using Oracle 19c or 23ai, consider enabling Redo Compression in your Data Guard broker configuration (EditDatabase Set Property RedoCompression='ENABLE').
Scenario B: High Apply Lag
What you see:Transport Lag is near 0, but Apply Lag is climbing (e.g., +00 00:45:00).
What it means: The data is there (on the standby server), but the database can't write it to disk fast enough. This often happens during batch loads or index rebuilds on the Primary.
The Fix: Check I/O stats on the Standby. Ensure you are using Real-Time Apply so the MRP (Managed Recovery Process) reads directly from Standby Redo Logs (SRLs) rather than waiting for archive logs to be finalized.
Scenario C: MRP Status is "WAIT_FOR_GAP"
What you see: In Section 3, the status is WAIT_FOR_GAP.
What it means: A severe gap has occurred. The Standby is missing a specific sequence number and cannot proceed until you manually register that file.
The Fix: Run the query in Section 4 (v$archive_gap) to identify the missing sequence, restore it from backup, and register it.
Why this works in 2026
Old school scripts relied on v$archived_log, which only tells you history. In modern Oracle Cloud (OCI) and Hybrid environments, v$dataguard_stats is the only view that accurately calculates the time difference between the Primary commit and the Standby visibility.
Chetan Yadav is a Senior Oracle, PostgreSQL, MySQL & Cloud DBA with 14+ years of experience supporting high-traffic production environments across AWS, Azure, and on-premise systems. His core expertise includes Oracle RAC, ASM, Data Guard, performance tuning, HA/DR design, monitoring frameworks, and real-world troubleshooting.
He also trains DBAs globally through deep-dive technical content, hands-on sessions, and automation workflows using n8n, AI tools, and modern monitoring stacks. His mission is to help DBAs solve real production problems and grow into high-paying remote roles worldwide.
Chetan regularly publishes expert content across Oracle, PostgreSQL, MySQL, and Cloud DBA technologies—including performance tuning guides, DR architectures, monitoring tools, scripts, and real incident-based case studies.
These platforms feature guides, scripts, diagrams, troubleshooting workflows, and real-world DBA case studies designed for database professionals worldwide.