Monday, April 6, 2026

Why Data Guard Lag Happens in Production: Sync, I/O and Network Deep Dive

6 Root Causes of Transport and Apply Lag, With Diagnostic SQL to Prove Each One
06 March 2026
Chetan Yadav, Senior Oracle & Cloud DBA
⏱️ 14 - 16 min read
Transport Lag, Apply Lag, SYNC vs ASYNC, Network RTT, Standby I/O, MRP Apply Bottleneck
[Figure: Oracle Data Guard lag root cause map showing 6 production causes across primary, network, and standby layers, with a diagnostic metric reference table]
⚙️ Production Environment Referenced

Oracle Database: 19.18.0.0.0 Enterprise Edition  •  Primary: 2-Node RAC, 4.8 TB OLTP  •  Standby: Physical Standby (Active Data Guard)
Protection Mode: Maximum Availability (SYNC/AFFIRM)  •  Network: Dedicated 1 GbE WAN, 120 km, RTT 1.8 ms
Peak Load: 2,800 TPS, 180 MB/sec redo generation  •  Application: Core banking transaction processing

The monitoring alert fires at 11:43 PM: "Data Guard apply lag exceeds 900 seconds." Transport lag is 180 seconds. Apply lag is 900 seconds. The standby is 15 minutes behind the primary. If the primary fails right now, 15 minutes of financial transactions are at risk.

This scenario happens in production Data Guard environments more often than most teams admit. The problem looks the same from the outside every time, but the root cause is completely different each time. Transport lag and apply lag each have different causes, different diagnostic queries, and different fixes. Treating them as the same problem wastes hours of investigation.

This guide covers all six real production causes of Data Guard lag, the exact SQL to identify each one, and the specific fix for each. No guesswork. Precise diagnosis first, then precise resolution.

1. Transport Lag vs Apply Lag: The Critical Distinction

Before diagnosing lag, identify which type you have. They look identical on a dashboard but live in completely different parts of the pipeline.

Lag Type      | Definition                                              | Location                   | Root Cause Area
Transport Lag | Redo generated on primary but not yet received by standby | Network pipe between sites | Network, primary I/O, redo volume
Apply Lag     | Redo received by standby but not yet applied to datafiles | Standby MRP apply process  | Standby I/O, CPU, misconfiguration
The Fastest Diagnosis Rule:

High transport lag, low apply lag: the network or the primary is the problem. The standby applies everything it receives; it simply is not receiving it fast enough.

Low transport lag, high apply lag: Standby is the problem. Redo arrives quickly but MRP cannot keep up.

Both high: Multiple problems, or a redo volume spike overwhelming the entire pipeline.
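The triage rule above is mechanical enough to encode directly in a monitoring script. A minimal sketch (the 60-second threshold is illustrative, not from this article; tune it to your SLA):

```python
def classify_lag(transport_lag_s: float, apply_lag_s: float,
                 threshold_s: float = 60.0) -> str:
    """Map a transport/apply lag pair to the likely problem area.

    Threshold is illustrative; tune to your environment's SLA.
    """
    high_transport = transport_lag_s > threshold_s
    high_apply = apply_lag_s > threshold_s
    if high_transport and high_apply:
        return "multiple problems or redo volume spike"
    if high_transport:
        return "network or primary"
    if high_apply:
        return "standby (MRP)"
    return "healthy"

# The opening incident: transport 180 s, apply 900 s -> both high
print(classify_lag(180, 900))
```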

2. Five Diagnostic Queries to Run First

Run these immediately when lag is reported. They tell you which cause category you are dealing with before you go deeper.

SQL, Five Immediate Diagnostic Queries for Data Guard Lag
-- 1. Transport lag vs apply lag (run on PRIMARY)
SELECT name, value, unit, time_computed
FROM   v$dataguard_stats
WHERE  name IN ('transport lag', 'apply lag',
                'apply finish time', 'estimated startup time')
ORDER  BY name;

-- 2. Cumulative redo generated per instance (run on PRIMARY)
--    This counter is cumulative since startup; sample it twice
--    and divide the delta by the interval to get MB/sec
SELECT inst_id,
       ROUND(value / 1024 / 1024, 2) AS redo_mb_total
FROM   gv$sysstat
WHERE  name = 'redo size'
ORDER  BY inst_id;

-- 3. Archive destination status and errors (run on PRIMARY)
SELECT dest_id, status, target, destination, error, gap_status
FROM   v$archive_dest_status
WHERE  status != 'INACTIVE'
ORDER  BY dest_id;

-- 4. MRP apply process status (run on STANDBY)
SELECT process, status, thread#, sequence#, block#, blocks, delay_mins
FROM   v$managed_standby
ORDER  BY process;

-- 5. Archive gap detection (run on PRIMARY)
--    Any rows returned = serious problem, act immediately
SELECT thread#, low_sequence#, high_sequence#
FROM   v$archive_gap
ORDER  BY thread#;
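One caveat on query 2: the 'redo size' statistic is a cumulative counter since instance startup, so a single reading is not a rate. To get MB/sec, sample it twice and divide the delta by the interval. A sketch of the arithmetic (sample values are hypothetical):

```python
def redo_rate_mb_per_sec(first_bytes: int, second_bytes: int,
                         interval_s: float) -> float:
    """Rate from two samples of the cumulative v$sysstat 'redo size' counter."""
    return (second_bytes - first_bytes) / interval_s / (1024 * 1024)

# Hypothetical samples taken 60 seconds apart:
rate = redo_rate_mb_per_sec(1_500_000_000_000, 1_511_324_620_800, 60)
print(round(rate, 1))  # 180.0 MB/sec, the peak rate quoted for this environment
```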

3. Cause 1: Primary Storage I/O Pressure

LGWR cannot write redo to the local online redo logs fast enough. In SYNC mode this directly delays the network send to the standby. In ASYNC mode it slows redo generation and eventually fills the NSA send buffer.

SQL, Diagnose Primary Storage I/O as Lag Root Cause
-- LGWR wait events (run on PRIMARY)
-- High 'log file parallel write' = primary disk problem
-- average_wait is in centiseconds, so multiply by 10 for milliseconds
SELECT event, total_waits,
       ROUND(average_wait * 10, 3) AS avg_wait_ms,
       wait_class
FROM   v$system_event
WHERE  event IN ('log file sync', 'log file parallel write',
                 'log buffer space')
ORDER  BY time_waited DESC;

-- Datafile write latency (DBWR pressure sharing disks with LGWR)
-- writetim is in centiseconds, so multiply by 10 for milliseconds
SELECT d.name,
       ROUND(10 * f.writetim / NULLIF(f.phywrts, 0), 2) AS avg_write_ms,
       f.phywrts
FROM   v$filestat f
       JOIN v$datafile d ON f.file# = d.file#
WHERE  f.phywrts > 1000
ORDER  BY avg_write_ms DESC
FETCH FIRST 10 ROWS ONLY;

-- Check ASM rebalance (can throttle I/O on the primary)
SELECT group_number, operation, state, power, est_minutes
FROM   v$asm_operation;

Fix: Move redo log files to dedicated fast storage (NVMe or a dedicated ASM disk group). Increase redo log size to reduce switch frequency. Schedule ASM rebalance outside peak hours.

4. Cause 2: Redo Volume Spike from Batch Jobs

A batch job, bulk load, or mass update pushes redo to 10-50x the normal rate. The network pipe cannot keep up and transport lag grows rapidly. This is the most common cause of sudden lag spikes in OLTP systems with nightly batch processing.

SQL, Identify Redo Volume Spike and Source
-- Archive log generation rate per hour (run on PRIMARY)
-- A sudden spike = redo volume surge
SELECT TO_CHAR(first_time, 'YYYY-MM-DD HH24')    AS hour,
       COUNT(*)                                  AS logs_generated,
       ROUND(SUM(blocks * block_size)/1024/1024) AS total_mb
FROM   v$archived_log
WHERE  first_time > SYSDATE - 1
AND    standby_dest = 'NO'
GROUP  BY TO_CHAR(first_time, 'YYYY-MM-DD HH24')
ORDER  BY hour DESC;

-- Busiest SQL by buffer gets (run on PRIMARY)
-- v$sql has no per-statement redo column; buffer gets are a rough
-- proxy for change activity, so verify the top candidates manually
SELECT sql_id, executions,
       ROUND(buffer_gets / 1024.0, 0) AS buffer_gets_k,
       last_active_time,
       SUBSTR(sql_text, 1, 80) AS sql_preview
FROM   v$sql
WHERE  executions > 0
ORDER  BY buffer_gets DESC
FETCH FIRST 20 ROWS ONLY;

Fix: Use direct-path (APPEND) loads and schedule batch jobs during off-peak hours. Treat NOLOGGING with care: on a Data Guard primary it is normally blocked by FORCE LOGGING, and if permitted it leaves unrecoverable blocks on the standby. Enable redo transport compression on the archive destination (COMPRESSION=ENABLE in LOG_ARCHIVE_DEST_n) to shrink the volume on the wire.

5. Cause 3: Network Latency and Bandwidth

The most commonly blamed cause, and the most commonly misdiagnosed. Network problems cause transport lag only. If apply lag is high and transport lag is low, stop looking at the network.

Metric                   | Good Value | Problem Threshold | Impact
Round-trip latency (RTT) | < 2 ms     | > 5 ms for SYNC   | Every commit waits RTT in SYNC mode
Bandwidth utilisation    | < 60%      | > 80% sustained   | Redo bursts cause queuing delay
Packet loss              | 0%         | > 0.01%           | TCP retransmit multiplies latency
Jitter                   | < 1 ms     | > 5 ms            | Unpredictable LGWR stalls in SYNC mode
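Why the RTT threshold matters so much in SYNC/AFFIRM mode: each commit pays roughly the local log write plus the network round trip plus the standby SRL write before it returns. A back-of-envelope model using the 1.8 ms RTT from this environment (the write latencies are assumed values, not measurements from the article):

```python
def sync_commit_latency_ms(local_write_ms: float, rtt_ms: float,
                           standby_write_ms: float) -> float:
    """Approximate per-commit latency in SYNC/AFFIRM mode:
    local log write + network round trip + standby SRL write."""
    return local_write_ms + rtt_ms + standby_write_ms

# Assumed 0.5 ms local write and 0.7 ms standby SRL write:
baseline = sync_commit_latency_ms(0.5, 1.8, 0.7)    # ~3.0 ms per commit
degraded = sync_commit_latency_ms(0.5, 100.0, 0.7)  # ~101.2 ms during a network spike
print(baseline, degraded)
```

At 100 ms RTT every commit waits over 100 ms, which is exactly the "performance crisis on primary" pattern described for SYNC mode later in the article.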
SQL, Diagnose Network as Data Guard Lag Cause
-- Redo transport network write statistics (run on PRIMARY)
SELECT dest_id, srl_network_writes,
       ROUND(srl_network_write_time / NULLIF(srl_network_writes, 0), 3)
         AS avg_network_write_ms
FROM   v$standby_log_stats
ORDER  BY dest_id;

-- NSS/NSA send process status (run on PRIMARY)
SELECT process, status, client_process, sequence#, delay_mins
FROM   v$managed_standby
WHERE  process LIKE 'NS%'
ORDER  BY process;

-- Archive destination errors (run on PRIMARY)
SELECT dest_id, status, error, gap_status, db_unique_name
FROM   v$archive_dest_status
WHERE  target = 'STANDBY'
AND    status != 'INACTIVE';
Incident: Firewall Stateful Inspection Causing Transport Lag Spikes

A correctly configured Data Guard environment with 1.8 ms RTT had transport lag spike to 120+ seconds randomly, then clear, then spike again. The network team confirmed zero packet loss. Root cause: a next-gen firewall performing deep packet inspection on the redo transport TCP stream. Each 10-second idle gap caused the firewall to drop and re-establish the TCP state table entry, causing a 15-30 second reconnect delay on the NSS process.

Fix: Added a firewall bypass rule for the specific TCP port used by Data Guard redo transport. Lag spikes disappeared immediately.

Fix: Add COMPRESSION=ENABLE to LOG_ARCHIVE_DEST_2 to cut bandwidth by 50-70% (this requires the Advanced Compression option). Implement QoS tagging for redo transport traffic. Increase NET_TIMEOUT if packet loss causes premature timeouts. Use a dedicated WAN link for redo transport; never share it with application traffic.
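To see what the 50-70% compression figure buys on the wire: transported bytes scale by (1 minus the compression ratio), so the question is whether the compressed redo rate fits under the link's effective bandwidth. A sketch using the 180 MB/sec peak redo rate from this environment (ratios illustrative):

```python
def wire_rate_mb_s(redo_mb_s: float, compression_ratio: float) -> float:
    """Bytes per second on the link after redo transport compression.
    compression_ratio is the fraction of volume removed (0.5 = 50%)."""
    return redo_mb_s * (1.0 - compression_ratio)

# 180 MB/sec peak redo at the 50-70% reduction quoted above:
print(wire_rate_mb_s(180, 0.5))  # 90 MB/sec on the wire
print(wire_rate_mb_s(180, 0.7))  # 54 MB/sec on the wire
```

At 50% compression the 180 MB/sec peak still does not fit a 1 GbE link (~110-120 MB/sec of payload), which is why compression alone may not rescue a saturated pipe.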

6. Cause 4: Misconfiguration (The Silent Killer)

Configuration errors are silent lag causes: no errors in the alert log, no network problems, a reasonable redo rate, yet lag slowly accumulates because something is set up wrong. These are the hardest causes to find.

SQL, Verify Data Guard Configuration for Common Misconfigurations
-- Check 1: Standby Redo Logs exist and are the correct size
--          SRLs must match the online redo log size; you need at
--          least (online_log_groups + 1) SRLs per thread
SELECT group#, thread#, sequence#,
       bytes/1024/1024 AS size_mb,
       archived, status
FROM   v$standby_log
ORDER  BY thread#, group#;

-- Check 2: AFFIRM setting matches the intended protection mode
SELECT dest_id, dest_name, affirm, net_timeout, compression, status
FROM   v$archive_dest
WHERE  target = 'STANDBY'
AND    status = 'VALID';

-- Check 3: Real-time apply is running (not archive-based)
--          Status must be WAIT_FOR_LOG, not WAIT_FOR_GAP
SELECT process, status, sequence#
FROM   v$managed_standby
WHERE  process = 'MRP0';

-- Check 4: Protection mode matches protection level
--          If they differ, the primary has silently fallen back
SELECT protection_mode,   -- configured mode
       protection_level   -- actual effective mode
FROM   v$database;
Incident: Missing Standby Redo Logs, 8-Hour Apply Gap

A newly built Data Guard environment appeared healthy. Alert log clean. No errors. Archive destination showed SUCCESS. But during failover testing the standby was 8 hours behind the primary despite a low-latency WAN link.

Root cause: Standby Redo Logs had never been created. Without SRLs, real-time apply cannot function. MRP was waiting for archived logs to be shipped rather than receiving redo in real time, meaning the standby only caught up once per hour during log archival.

Fix: Created SRL groups matching primary online redo log size plus one extra group per thread. Apply lag dropped from 8 hours to under 60 seconds within minutes of SRL creation.

7. Cause 5: MRP Apply Bottleneck

Redo is arriving at the standby on time (low transport lag) but MRP cannot apply it fast enough (high apply lag). This is purely a standby-side problem. Do not touch the primary or network until you confirm this is the cause.

SQL, Diagnose and Fix MRP Apply Bottleneck
-- Check MRP wait events (run on STANDBY)
-- average_wait is in centiseconds, so multiply by 10 for milliseconds
SELECT event, total_waits,
       ROUND(average_wait * 10, 3) AS avg_wait_ms,
       wait_class
FROM   v$system_event
WHERE  wait_class NOT IN ('Idle', 'Network')
ORDER  BY time_waited DESC
FETCH FIRST 15 ROWS ONLY;

-- Check parallel apply worker count (run on STANDBY)
-- If only MRP0 is shown, parallel apply is disabled
SELECT process, status, thread#
FROM   v$managed_standby
WHERE  process LIKE 'PR%' OR process LIKE 'MRP%'
ORDER  BY process;

-- Enable parallel apply (run on STANDBY)
-- Stop the current apply first
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;

-- Restart with parallel apply (match the degree to the standby CPU count)
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE
  PARALLEL 8
  USING CURRENT LOGFILE
  DISCONNECT FROM SESSION;

-- Verify parallel workers started
SELECT process, status
FROM   v$managed_standby
WHERE  process LIKE 'PR%' OR process LIKE 'MRP%'
ORDER  BY process;

Fix: Enable parallel apply with a degree matching the standby CPU count. Give MRP dedicated I/O separate from RMAN backup. If Active Data Guard is enabled, use I/O Resource Manager to prioritise MRP over read queries.

8. Cause 6: Standby Resource Starvation

The standby does not have enough CPU, memory, or I/O bandwidth to keep up with apply. This is frequently caused by under-specifying the standby relative to the primary, or by competing workloads such as RMAN backups or ADG read queries running on the standby host during peak apply windows.

SQL, Diagnose Standby Resource Starvation
-- OS resource pressure (run on STANDBY)
SELECT stat_name, value
FROM   v$osstat
WHERE  stat_name IN ('NUM_CPUS', 'LOAD', 'BUSY_TIME', 'IDLE_TIME');

-- RMAN backup running during peak apply (run on STANDBY)
SELECT input_type, status,
       TO_CHAR(start_time, 'YYYY-MM-DD HH24:MI') AS start_time,
       input_bytes_display, output_bytes_display
FROM   v$rman_backup_job_details
WHERE  start_time > SYSDATE - 1
ORDER  BY start_time DESC;

-- Datafile write latency on the standby
-- writetim is in centiseconds, so multiply by 10 for milliseconds
SELECT d.name,
       ROUND(10 * f.writetim / NULLIF(f.phywrts, 0), 2) AS avg_write_ms,
       f.phywrts
FROM   v$filestat f
       JOIN v$datafile d ON f.file# = d.file#
WHERE  (f.phyrds + f.phywrts) > 1000
ORDER  BY avg_write_ms DESC
FETCH FIRST 10 ROWS ONLY;

Fix: The standby should match the primary hardware specification. Schedule RMAN backups on the standby outside peak apply windows. If the ADG read workload is heavy, use I/O Resource Manager to protect MRP I/O priority.

9. SYNC vs ASYNC: Which Mode Creates Which Lag Pattern

Scenario                    | SYNC Mode Behaviour                                                    | ASYNC Mode Behaviour
Network spike (100 ms RTT)  | Every commit stalls 100 ms, performance crisis on primary              | Transport lag grows silently, primary unaffected
Standby reboots             | Primary stalls (Max Protection) or falls back to async (Max Availability) | Transport lag grows until standby returns, no primary impact
Redo volume burst           | Network saturates, commits queue behind LGWR                           | NSA buffer fills, lag grows but primary performance maintained
Steady state (good network) | Transport lag near zero, apply lag depends on MRP                      | Small transport lag always present (1-5 sec typical)
Packet loss                 | LGWR stalls on every retransmit, very visible on primary               | Lag spikes during loss window, primary unaffected
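The ASYNC rows in this table follow simple arithmetic: during a burst the backlog grows at (redo rate minus effective link rate), and once load drops it drains at the spare link capacity. A sketch with illustrative numbers (the ~110 MB/sec effective payload for a 1 GbE link is an assumption, not a figure from the article):

```python
def backlog_after_burst_mb(redo_mb_s: float, link_mb_s: float,
                           burst_s: float) -> float:
    """Redo backlog accumulated while generation outruns the link."""
    return max(0.0, (redo_mb_s - link_mb_s) * burst_s)

def drain_time_s(backlog_mb: float, link_mb_s: float,
                 steady_mb_s: float) -> float:
    """Time to ship the backlog once generation falls below link capacity."""
    spare = link_mb_s - steady_mb_s
    if spare <= 0:
        raise ValueError("link never catches up; lag grows without bound")
    return backlog_mb / spare

# A 180 MB/sec burst for 10 minutes over a ~110 MB/sec effective link:
backlog = backlog_after_burst_mb(180, 110, 600)  # 42,000 MB of queued redo
drain = drain_time_s(backlog, 110, 40)           # 600 s to drain at 40 MB/sec steady load
print(backlog, drain)
```

This is why ASYNC lag "grows silently" during a burst but recovers on its own afterwards, provided the steady-state redo rate stays below link capacity.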

10. Lag Under Active Data Guard (ADG Read Workload)

Active Data Guard introduces an additional lag risk: read queries on the standby compete with MRP for I/O bandwidth. In a busy ADG reporting workload, MRP can be starved of the I/O it needs to apply redo.

SQL, Monitor ADG Read vs MRP Apply I/O Competition
-- Check open mode on the standby (ADG = READ ONLY WITH APPLY)
SELECT open_mode, database_role, db_unique_name
FROM   v$database;

-- Active read sessions vs MRP on the standby
SELECT s.type, s.status, COUNT(*) AS session_count
FROM   v$session s
WHERE  s.type = 'USER'
OR     s.program LIKE '%MRP%'
GROUP  BY s.type, s.status;

-- Apply lag while the ADG read load is active
SELECT name, value, time_computed
FROM   v$dataguard_stats
WHERE  name IN ('transport lag', 'apply lag')
ORDER  BY name;

11. Proactive Lag Monitoring

Build proactive monitoring that catches lag before it becomes an incident. Run this query every 5 minutes from your monitoring system.

SQL, Data Guard Lag Monitoring Alert Query
-- Lag monitoring with alert thresholds (run on PRIMARY every 5 min)
SELECT db_unique_name,
       name  AS metric,
       value AS lag_value,
       CASE
         WHEN name = 'transport lag'
          AND TO_DSINTERVAL(value) > INTERVAL '0 00:00:30' DAY TO SECOND
         THEN 'ALERT: Transport lag exceeds 30 seconds'
         WHEN name = 'apply lag'
          AND TO_DSINTERVAL(value) > INTERVAL '0 00:02:00' DAY TO SECOND
         THEN 'ALERT: Apply lag exceeds 2 minutes'
         ELSE 'OK'
       END AS alert_status,
       time_computed
FROM   v$dataguard_stats
WHERE  name IN ('transport lag', 'apply lag')
ORDER  BY name;

-- Sequence gap detection (run on PRIMARY)
-- Any rows returned = immediate action required
SELECT thread#, low_sequence#, high_sequence#
FROM   v$archive_gap
ORDER  BY thread#;

-- Broker health check (run from the DGMGRL CLI)
-- dgmgrl / "show configuration"
-- dgmgrl / "validate database verbose standby_db"
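v$dataguard_stats reports lag as an interval string such as '+00 00:15:00' (days, then hh:mi:ss). If your monitoring system wants plain seconds for thresholding and graphing, the parse is simple; a sketch (verify the exact format your version emits before relying on it):

```python
def lag_to_seconds(lag: str) -> int:
    """Convert a '+DD HH:MI:SS' lag string from v$dataguard_stats to seconds."""
    days, hms = lag.lstrip("+").split()
    h, m, s = (int(x) for x in hms.split(":"))
    return int(days) * 86400 + h * 3600 + m * 60 + s

# The apply lag from the opening incident:
print(lag_to_seconds("+00 00:15:00"))  # 900 seconds
```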

12. FAQ

Transport lag shows zero but apply lag is 2 hours. What is happening?
Transport lag of zero means redo is arriving at the standby immediately. The network and primary are fine. Apply lag of 2 hours means MRP is 2 hours behind applying what it has already received. This is a pure standby-side problem. Check MRP parallel degree, standby I/O performance, RMAN backup scheduling, and whether ADG read workload is starving MRP. Do not look at the network for this pattern.
Should I increase redo log size to reduce transport lag?
Increasing redo log size reduces log switch frequency, which smooths out the burst pattern of redo transport. However it does not reduce total redo volume, so if the problem is sustained bandwidth saturation, larger logs will not help. The sweet spot for most production systems is 500 MB to 2 GB per redo log group, switching every 10-30 minutes under normal load.
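The sizing rule in this answer reduces to: log group size ≈ average redo rate × target switch interval. A sketch of the arithmetic (the 2 MB/sec average rate is illustrative):

```python
def redo_log_size_mb(redo_mb_per_s: float, target_switch_min: float) -> float:
    """Log group size needed for a target switch interval at a given redo rate."""
    return redo_mb_per_s * target_switch_min * 60

# At a 2 MB/sec average redo rate, a 10-minute switch target needs:
print(redo_log_size_mb(2.0, 10))  # 1200 MB, inside the 500 MB - 2 GB sweet spot
```

Size against the average rate, not the 180 MB/sec peak: sizing for the peak would demand logs far beyond the sweet spot, and the occasional faster switch during a burst is acceptable.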
Can I reduce lag without switching from SYNC to ASYNC?
Yes. Enable redo log compression (COMPRESSION=ENABLE) to reduce bandwidth by 50-70% at the cost of CPU on both sides. Upgrade network bandwidth (1 GbE to 10 GbE resolves most SYNC saturation issues). Increase redo log group count so LGWR can switch to a new group while NSS sends the previous one. Ensure SRLs are configured so MRP stays close to the arrival point.
Comment DG below if you want a specific lag scenario diagnosed.
Drop a comment with DG and describe your lag pattern: what are your transport lag and apply lag values, what is your protection mode, and what workload triggers the spike. I will give you a specific diagnostic path based on your exact scenario.

About the Author

Chetan Yadav

Chetan Yadav is a Senior Oracle, PostgreSQL, MySQL, and Cloud DBA with 15+ years of hands-on experience managing production databases across on-premises, hybrid, and cloud environments.

Chetan has diagnosed and resolved Data Guard lag incidents across financial, healthcare, and e-commerce production environments including the 8-hour apply gap from missing Standby Redo Logs and the firewall stateful inspection transport lag described in this article.

This blog focuses on real-world DBA problems, career growth, and practical learning, not theoretical documentation or vendor marketing.
