Why Data Guard Lag Happens in Production: Sync, I/O and Network Deep Dive
Oracle Database: 19.18.0.0.0 Enterprise Edition • Primary: 2-Node RAC, 4.8 TB OLTP • Standby: Physical Standby (Active Data Guard)
Protection Mode: Maximum Availability (SYNC/AFFIRM) • Network: Dedicated 1 GbE WAN, 120 km, RTT 1.8 ms
Peak Load: 2,800 TPS, 180 MB/sec redo generation • Application: Core banking transaction processing
The monitoring alert fires at 11:43 PM: "Data Guard apply lag exceeds 900 seconds." Transport lag is 180 seconds. Apply lag is 900 seconds. The standby is 15 minutes behind the primary. If the primary fails right now, 15 minutes of financial transactions are at risk.
This scenario happens in production Data Guard environments more often than most teams admit. The problem looks the same from the outside every time, but the root cause is completely different each time. Transport lag and apply lag each have different causes, different diagnostic queries, and different fixes. Treating them as the same problem wastes hours of investigation.
This guide covers all six real production causes of Data Guard lag, the exact SQL to identify each one, and the specific fix for each. No guesswork. Precise diagnosis first, then precise resolution.
- Transport Lag vs Apply Lag: The Critical Distinction
- Five Diagnostic Queries to Run First
- Cause 1: Primary Storage I/O Pressure
- Cause 2: Redo Volume Spike from Batch Jobs
- Cause 3: Network Latency and Bandwidth
- Cause 4: Misconfiguration (The Silent Killer)
- Cause 5: MRP Apply Bottleneck
- Cause 6: Standby Resource Starvation
- SYNC vs ASYNC: Which Mode Creates Which Lag Pattern
- Lag Under Active Data Guard (ADG Read Workload)
- Proactive Lag Monitoring
- FAQ
- Related Reading
1. Transport Lag vs Apply Lag: The Critical Distinction
Before diagnosing lag, identify which type you have. They look identical on a dashboard but live in completely different parts of the pipeline.
| Lag Type | Definition | Location | Root Cause Area |
|---|---|---|---|
| Transport Lag | Redo generated on primary but not yet received by standby | Network pipe between sites | Network, primary I/O, redo volume |
| Apply Lag | Redo received by standby but not yet applied to datafiles | Standby MRP apply process | Standby I/O, CPU, misconfiguration |
High transport lag, low apply lag: Network or primary is the problem. The standby applies everything it receives; it simply is not receiving it fast enough.
Low transport lag, high apply lag: Standby is the problem. Redo arrives quickly but MRP cannot keep up.
Both high: Multiple problems, or a redo volume spike overwhelming the entire pipeline.
2. Five Diagnostic Queries to Run First
Run these immediately when lag is reported. They tell you which cause category you are dealing with before you go deeper.
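A minimal sketch of the five checks, using standard 19c dynamic performance views. Adjust DEST_ID to match your configuration; destination 2 is assumed to be the standby:

```sql
-- 1. The headline numbers: transport lag vs apply lag (run on standby)
SELECT name, value, time_computed
FROM   v$dataguard_stats
WHERE  name IN ('transport lag', 'apply lag');

-- 2. Redo transport and apply process state (run on standby)
SELECT process, status, thread#, sequence#, block#
FROM   v$managed_standby;

-- 3. Archive destination health and last error (run on primary)
SELECT dest_id, status, error
FROM   v$archive_dest_status
WHERE  dest_id = 2;

-- 4. Current redo generation rate in MB/sec (run on primary)
SELECT ROUND(value / 1024 / 1024, 1) AS redo_mb_per_sec
FROM   v$sysmetric
WHERE  metric_name = 'Redo Generated Per Sec'
AND    group_id = 2;   -- 60-second interval metric group

-- 5. MRP apply rate (run on standby; units are KB/sec)
SELECT item, sofar, units
FROM   v$recovery_progress
WHERE  item IN ('Active Apply Rate', 'Average Apply Rate');
```

Queries 1 and 5 settle the transport-vs-apply question; 3 and 4 tell you whether the primary side is struggling or simply producing more redo than usual.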
3. Cause 1: Primary Storage I/O Pressure
LGWR cannot write redo to the local online redo logs fast enough. In SYNC mode this directly delays the network send to the standby. In ASYNC mode it slows redo generation and eventually fills the NSA send buffer.
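One way to confirm local redo write pressure is the wait-time histogram for LGWR's write event; a sketch:

```sql
-- Latency distribution of LGWR writes on the primary. A healthy
-- redo tier puts most waits under 4 ms; a long tail into the
-- 16 ms+ buckets points at redo storage I/O pressure.
SELECT event, wait_time_milli, wait_count
FROM   v$event_histogram
WHERE  event IN ('log file parallel write', 'log file sync')
ORDER  BY event, wait_time_milli;
```

If 'log file sync' is slow while 'log file parallel write' is fast, look at the SYNC network round trip rather than local storage.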
Fix: Move redo log files to dedicated fast storage (NVMe or a dedicated ASM disk group). Increase redo log size to reduce switch frequency. Schedule ASM rebalance operations outside peak hours.
4. Cause 2: Redo Volume Spike from Batch Jobs
A batch job, bulk load, or mass update pushes 10-50x the normal redo rate. The network pipe cannot keep up. Transport lag grows rapidly. This is the most common cause of sudden lag spikes in OLTP systems with nightly batch processing.
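A redo-rate spike is easy to confirm from the recent metric history; a sketch:

```sql
-- Minute-by-minute redo generation over the last hour (primary).
-- A batch-driven spike shows up as a sudden jump in redo_mb_per_sec.
SELECT TO_CHAR(begin_time, 'HH24:MI') AS minute,
       ROUND(value / 1024 / 1024, 1)  AS redo_mb_per_sec
FROM   v$sysmetric_history
WHERE  metric_name = 'Redo Generated Per Sec'
AND    group_id = 2
ORDER  BY begin_time;
```

Correlate the spike window with the job scheduler; if the jump lines up with a known batch job, this is your cause.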
Fix: Use NOLOGGING or direct-path (APPEND) loads where the data can be recreated, noting that NOLOGGING changes are not shipped to the standby and that database-level FORCE LOGGING overrides the attribute. Schedule batch jobs during off-peak hours. Enable redo transport compression on the archive destination (COMPRESSION=ENABLE in LOG_ARCHIVE_DEST_n).
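A sketch of a minimal-redo load; the table names are placeholders. Remember that in a Data Guard configuration FORCE LOGGING is usually set at the database level precisely to prevent this, so weigh the standby gap before using it:

```sql
-- Hypothetical staging table; a direct-path insert into a NOLOGGING
-- table generates minimal redo, at the cost of standby recoverability
-- for exactly these blocks.
ALTER TABLE stage_txn NOLOGGING;

INSERT /*+ APPEND */ INTO stage_txn
SELECT * FROM ext_txn_feed;
COMMIT;

-- Restore the attribute once the load completes
ALTER TABLE stage_txn LOGGING;
```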
5. Cause 3: Network Latency and Bandwidth
The most commonly blamed cause and the most commonly misdiagnosed. Network problems cause transport lag only. If apply lag is high and transport lag is low, stop looking at the network.
| Metric | Good Value | Problem Threshold | Impact |
|---|---|---|---|
| Round-trip latency (RTT) | < 2 ms | > 5 ms for SYNC | Every commit waits RTT in SYNC mode |
| Bandwidth utilisation | < 60% | > 80% sustained | Redo bursts cause queuing delay |
| Packet loss | 0% | > 0.01% | TCP retransmit multiplies latency |
| Jitter | < 1 ms | > 5 ms | Unpredictable LGWR stalls in SYNC mode |
A correctly configured Data Guard environment with 1.8 ms RTT had transport lag spike to 120+ seconds randomly, then clear, then spike again. Network team confirmed zero packet loss. Root cause: a next-gen firewall performing deep packet inspection on the redo transport TCP stream. Each 10-second idle gap caused the firewall to drop and re-establish the TCP state table entry, causing a 15-30 second reconnect delay on the NSS process.
Fix: Added a firewall bypass rule for the specific TCP port used by Data Guard redo transport. Lag spikes disappeared immediately.
Fix: Add COMPRESSION=ENABLE to LOG_ARCHIVE_DEST_2 to reduce bandwidth by 50-70%. Implement QoS tagging for redo transport traffic. Increase NET_TIMEOUT if packet loss causes premature timeouts. Use a dedicated WAN link for redo transport; never share it with application traffic.
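A sketch of the destination attribute change; the service name and DB_UNIQUE_NAME are placeholders, and COMPRESSION=ENABLE requires the Advanced Compression option:

```sql
-- SYNC AFFIRM matches the Maximum Availability setup described above;
-- 'stby_svc' and 'stbydb' are example names.
ALTER SYSTEM SET log_archive_dest_2 =
  'SERVICE=stby_svc SYNC AFFIRM COMPRESSION=ENABLE NET_TIMEOUT=30
   VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=stbydb'
  SCOPE=BOTH SID='*';
```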
6. Cause 4: Misconfiguration (The Silent Killer)
Configuration errors are silent lag causes. No errors in the alert log. No network problems. Reasonable redo rate. But lag slowly accumulates because something is set up wrong. These are the hardest to find.
A newly built Data Guard environment appeared healthy. Alert log clean. No errors. Archive destination showed SUCCESS. But during failover testing the standby was 8 hours behind the primary despite a low-latency WAN link.
Root cause: Standby Redo Logs had never been created. Without SRLs, real-time apply cannot function. MRP was waiting for archived logs to be shipped rather than receiving redo in real time, meaning the standby only caught up once per hour during log archival.
Fix: Created SRL groups matching primary online redo log size plus one extra group per thread. Apply lag dropped from 8 hours to under 60 seconds within minutes of SRL creation.
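For reference, SRL creation on the standby of a two-thread RAC primary looks roughly like this. Group numbers and the 2G size are examples; size must match the online redo logs, and you need one more group per thread than the online group count:

```sql
-- Example group numbers; repeat until each thread has
-- (online group count + 1) standby redo log groups.
ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 GROUP 11 SIZE 2G;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 GROUP 12 SIZE 2G;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 GROUP 13 SIZE 2G;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 2 GROUP 21 SIZE 2G;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 2 GROUP 22 SIZE 2G;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 2 GROUP 23 SIZE 2G;

-- Restart managed recovery so real-time apply engages
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION;
```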
7. Cause 5: MRP Apply Bottleneck
Redo is arriving at the standby on time (low transport lag) but MRP cannot apply it fast enough (high apply lag). This is purely a standby-side problem. Do not touch the primary or network until you confirm this is the cause.
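To confirm, compare the apply rate with the redo arrival rate and look at what the recovery processes are actually waiting on; a sketch (the PR0n pattern assumes the 19c parallel recovery slave naming):

```sql
-- Apply progress and rates on the standby
SELECT item, sofar, units, timestamp
FROM   v$recovery_progress;

-- What are MRP and its slaves waiting on right now?
SELECT program, event, state, seconds_in_wait
FROM   v$session
WHERE  program LIKE '%(MRP0)%' OR program LIKE '%(PR0%';
```

Waits dominated by I/O events (e.g. db file reads or log reads) point at standby storage; CPU-bound slaves point at host sizing.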
Fix: Enable parallel apply with a degree matching the standby CPU count. Give MRP dedicated I/O, separate from RMAN backup. If Active Data Guard is enabled, use I/O Resource Manager to prioritise MRP over read queries.
8. Cause 6: Standby Resource Starvation
The standby does not have enough CPU, memory, or I/O bandwidth to keep up with apply. Frequently caused by under-specifying the standby relative to the primary, or by competing workloads such as RMAN backup or ADG read queries running on the same standby host during peak apply windows.
Fix: The standby should match the primary's hardware specification. Schedule RMAN backups on the standby during off-peak apply windows. If the ADG read workload is heavy, use I/O Resource Manager to protect MRP's I/O priority.
9. SYNC vs ASYNC: Which Mode Creates Which Lag Pattern
| Scenario | SYNC Mode Behaviour | ASYNC Mode Behaviour |
|---|---|---|
| Network spike (100 ms RTT) | Every commit stalls 100 ms, performance crisis on primary | Transport lag grows silently, primary unaffected |
| Standby reboots | Primary stalls (Max Protection) or falls back to async (Max Availability) | Transport lag grows until standby returns, no primary impact |
| Redo volume burst | Network saturates, commits queue behind LGWR | NSA buffer fills, lag grows but primary performance maintained |
| Steady state (good network) | Transport lag near zero, apply lag depends on MRP | Small transport lag always present (1-5 sec typical) |
| Packet loss | LGWR stalls on every retransmit, very visible on primary | Lag spikes during loss window, primary unaffected |
10. Lag Under Active Data Guard (ADG Read Workload)
Active Data Guard introduces an additional lag risk: read queries on the standby compete with MRP for I/O bandwidth. In a busy ADG reporting workload, MRP can be starved of the I/O it needs to apply redo.
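A quick way to see whether read sessions are crowding out recovery is to rank standby sessions by physical reads; a sketch:

```sql
-- Top I/O consumers on the standby: application read sessions
-- vs recovery/background processes.
SELECT s.sid,
       NVL(s.username, 'recovery/background') AS who,
       s.program,
       io.physical_reads
FROM   v$session s
JOIN   v$sess_io io ON io.sid = s.sid
ORDER  BY io.physical_reads DESC
FETCH FIRST 10 ROWS ONLY;
```

If named application users dominate the list while apply lag grows, the ADG read workload is the likely starvation source.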
11. Proactive Lag Monitoring
Build proactive monitoring that catches lag before it becomes an incident. Run this query every 5 minutes from your monitoring system.
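A sketch of such a check, assuming the '+DD HH:MI:SS' interval format that v$dataguard_stats reports and an example 300-second alert threshold:

```sql
-- Both lags in seconds, with a simple threshold flag for the
-- monitoring system to act on.
SELECT name,
       value,
       EXTRACT(DAY    FROM TO_DSINTERVAL(value)) * 86400
     + EXTRACT(HOUR   FROM TO_DSINTERVAL(value)) * 3600
     + EXTRACT(MINUTE FROM TO_DSINTERVAL(value)) * 60
     + EXTRACT(SECOND FROM TO_DSINTERVAL(value))      AS lag_seconds,
       CASE WHEN TO_DSINTERVAL(value) > NUMTODSINTERVAL(300, 'SECOND')
            THEN 'ALERT' ELSE 'OK' END                AS state
FROM   v$dataguard_stats
WHERE  name IN ('transport lag', 'apply lag');
```

Alert on the trend, not just the absolute value: a lag that grows for three consecutive samples is an incident forming, even if it is still under threshold.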
12. FAQ
How do you reduce transport lag when SYNC transport saturates the network? Enable redo transport compression (COMPRESSION=ENABLE) to reduce bandwidth by 50-70% at the cost of CPU on both sides. Upgrade network bandwidth (1 GbE to 10 GbE resolves most SYNC saturation issues). Increase the redo log group count so LGWR can switch to a new group while NSS sends the previous one. Ensure SRLs are configured so MRP stays close to the arrival point.