Why Data Guard Lag Happens in Production: Sync, I/O and Network Deep Dive
Oracle Database: 19.18.0.0.0 Enterprise Edition • Primary: 2-Node RAC, 4.8 TB OLTP • Standby: Physical Standby (Active Data Guard)
Protection Mode: Maximum Availability (SYNC/AFFIRM) • Network: Dedicated 1 GbE WAN, 120 km, RTT 1.8 ms
Peak Load: 2,800 TPS, 180 MB/sec redo generation • Application: Core banking transaction processing
The monitoring alert fires at 11:43 PM: "Data Guard apply lag exceeds 900 seconds." Transport lag is 180 seconds. Apply lag is 900 seconds. The standby is 15 minutes behind the primary. If the primary fails right now, 15 minutes of financial transactions are at risk.
This scenario happens in production Data Guard environments more often than most teams admit. The problem looks the same from the outside every time, but the root cause is completely different each time. Transport lag and apply lag each have different causes, different diagnostic queries, and different fixes. Treating them as the same problem wastes hours of investigation.
This guide covers all six real production causes of Data Guard lag, the exact SQL to identify each one, and the specific fix for each. No guesswork. Precise diagnosis first, then precise resolution.
- Transport Lag vs Apply Lag: The Critical Distinction
- Five Diagnostic Queries to Run First
- Cause 1: Primary Storage I/O Pressure
- Cause 2: Redo Volume Spike from Batch Jobs
- Cause 3: Network Latency and Bandwidth
- Cause 4: Misconfiguration (The Silent Killer)
- Cause 5: MRP Apply Bottleneck
- Cause 6: Standby Resource Starvation
- SYNC vs ASYNC: Which Mode Creates Which Lag Pattern
- Lag Under Active Data Guard (ADG Read Workload)
- Proactive Lag Monitoring
- FAQ
- Related Reading
1. Transport Lag vs Apply Lag: The Critical Distinction
Before diagnosing lag, identify which type you have. They look identical on a dashboard but live in completely different parts of the pipeline.
| Lag Type | Definition | Location | Root Cause Area |
|---|---|---|---|
| Transport Lag | Redo generated on primary but not yet received by standby | Network pipe between sites | Network, primary I/O, redo volume |
| Apply Lag | Redo received by standby but not yet applied to datafiles | Standby MRP apply process | Standby I/O, CPU, misconfiguration |
High transport lag, low apply lag: Network or primary is the problem. The standby applies everything it receives; it simply is not receiving it fast enough.
Low transport lag, high apply lag: Standby is the problem. Redo arrives quickly but MRP cannot keep up.
Both high: Multiple problems, or a redo volume spike overwhelming the entire pipeline.
2. Five Diagnostic Queries to Run First
Run these immediately when lag is reported. They tell you which cause category you are dealing with before you go deeper.
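A minimal sketch of the five checks, using standard 19c dynamic performance views. Adjust DEST_ID to match your configuration; destination 2 is assumed to be the standby:

```sql
-- 1. The headline numbers: transport lag vs apply lag (run on standby)
SELECT name, value, time_computed
FROM   v$dataguard_stats
WHERE  name IN ('transport lag', 'apply lag');

-- 2. Redo transport and apply process state (run on standby)
SELECT process, status, thread#, sequence#, block#
FROM   v$managed_standby;

-- 3. Archive destination health and last error (run on primary)
SELECT dest_id, status, error
FROM   v$archive_dest_status
WHERE  dest_id = 2;

-- 4. Current redo generation rate in MB/sec (run on primary)
SELECT ROUND(value / 1024 / 1024, 1) AS redo_mb_per_sec
FROM   v$sysmetric
WHERE  metric_name = 'Redo Generated Per Sec'
AND    group_id = 2;   -- 60-second interval metric group

-- 5. MRP apply rate (run on standby; units are KB/sec)
SELECT item, sofar, units
FROM   v$recovery_progress
WHERE  item IN ('Active Apply Rate', 'Average Apply Rate');
```

Queries 1 and 5 settle the transport-vs-apply question; 3 and 4 tell you whether the primary side is struggling or simply producing more redo than usual.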
3. Cause 1: Primary Storage I/O Pressure
LGWR cannot write redo to the local online redo logs fast enough. In SYNC mode this directly delays the network send to the standby. In ASYNC mode it slows redo generation and eventually fills the NSA send buffer.
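One way to confirm local redo write pressure is the wait-time histogram for LGWR's write event; a sketch:

```sql
-- Latency distribution of LGWR writes on the primary. A healthy
-- redo tier puts most waits under 4 ms; a long tail into the
-- 16 ms+ buckets points at redo storage I/O pressure.
SELECT event, wait_time_milli, wait_count
FROM   v$event_histogram
WHERE  event IN ('log file parallel write', 'log file sync')
ORDER  BY event, wait_time_milli;
```

If 'log file sync' is slow while 'log file parallel write' is fast, look at the SYNC network round trip rather than local storage.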
Fix: Move redo log files to dedicated fast storage (NVMe or a dedicated ASM disk group). Increase redo log size to reduce switch frequency. Schedule ASM rebalance operations outside peak hours.
4. Cause 2: Redo Volume Spike from Batch Jobs
A batch job, bulk load, or mass update pushes 10-50x the normal redo rate. The network pipe cannot keep up. Transport lag grows rapidly. This is the most common cause of sudden lag spikes in OLTP systems with nightly batch processing.
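A redo-rate spike is easy to confirm from the recent metric history; a sketch:

```sql
-- Minute-by-minute redo generation over the last hour (primary).
-- A batch-driven spike shows up as a sudden jump in redo_mb_per_sec.
SELECT TO_CHAR(begin_time, 'HH24:MI') AS minute,
       ROUND(value / 1024 / 1024, 1)  AS redo_mb_per_sec
FROM   v$sysmetric_history
WHERE  metric_name = 'Redo Generated Per Sec'
AND    group_id = 2
ORDER  BY begin_time;
```

Correlate the spike window with the job scheduler; if the jump lines up with a known batch job, this is your cause.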
Fix: Use NOLOGGING or direct-path (APPEND) loads where the data can be recreated, noting that NOLOGGING changes are not shipped to the standby and that database-level FORCE LOGGING overrides the attribute. Schedule batch jobs during off-peak hours. Enable redo transport compression on the archive destination (COMPRESSION=ENABLE in LOG_ARCHIVE_DEST_n).
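A sketch of a minimal-redo load; the table names are placeholders. Remember that in a Data Guard configuration FORCE LOGGING is usually set at the database level precisely to prevent this, so weigh the standby gap before using it:

```sql
-- Hypothetical staging table; a direct-path insert into a NOLOGGING
-- table generates minimal redo, at the cost of standby recoverability
-- for exactly these blocks.
ALTER TABLE stage_txn NOLOGGING;

INSERT /*+ APPEND */ INTO stage_txn
SELECT * FROM ext_txn_feed;
COMMIT;

-- Restore the attribute once the load completes
ALTER TABLE stage_txn LOGGING;
```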
5. Cause 3: Network Latency and Bandwidth
The most commonly blamed cause and the most commonly misdiagnosed. Network problems cause transport lag only. If apply lag is high and transport lag is low, stop looking at the network.
| Metric | Good Value | Problem Threshold | Impact |
|---|---|---|---|
| Round-trip latency (RTT) | < 2 ms | > 5 ms for SYNC | Every commit waits RTT in SYNC mode |
| Bandwidth utilisation | < 60% | > 80% sustained | Redo bursts cause queuing delay |
| Packet loss | 0% | > 0.01% | TCP retransmit multiplies latency |
| Jitter | < 1 ms | > 5 ms | Unpredictable LGWR stalls in SYNC mode |
A correctly configured Data Guard environment with 1.8 ms RTT had transport lag spike to 120+ seconds randomly, then clear, then spike again. Network team confirmed zero packet loss. Root cause: a next-gen firewall performing deep packet inspection on the redo transport TCP stream. Each 10-second idle gap caused the firewall to drop and re-establish the TCP state table entry, causing a 15-30 second reconnect delay on the NSS process.
Fix: Added a firewall bypass rule for the specific TCP port used by Data Guard redo transport. Lag spikes disappeared immediately.
Fix: Add COMPRESSION=ENABLE to LOG_ARCHIVE_DEST_2 to reduce bandwidth by 50-70%. Implement QoS tagging for redo transport traffic. Increase NET_TIMEOUT if packet loss causes premature timeouts. Use a dedicated WAN link for redo transport; never share it with application traffic.
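A sketch of the destination attribute change; the service name and DB_UNIQUE_NAME are placeholders, and COMPRESSION=ENABLE requires the Advanced Compression option:

```sql
-- SYNC AFFIRM matches the Maximum Availability setup described above;
-- 'stby_svc' and 'stbydb' are example names.
ALTER SYSTEM SET log_archive_dest_2 =
  'SERVICE=stby_svc SYNC AFFIRM COMPRESSION=ENABLE NET_TIMEOUT=30
   VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=stbydb'
  SCOPE=BOTH SID='*';
```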
6. Cause 4: Misconfiguration (The Silent Killer)
Configuration errors are silent lag causes. No errors in the alert log. No network problems. Reasonable redo rate. But lag slowly accumulates because something is set up wrong. These are the hardest to find.
A newly built Data Guard environment appeared healthy. Alert log clean. No errors. Archive destination showed SUCCESS. But during failover testing the standby was 8 hours behind the primary despite a low-latency WAN link.
Root cause: Standby Redo Logs had never been created. Without SRLs, real-time apply cannot function. MRP was waiting for archived logs to be shipped rather than receiving redo in real time, meaning the standby only caught up once per hour during log archival.
Fix: Created SRL groups matching primary online redo log size plus one extra group per thread. Apply lag dropped from 8 hours to under 60 seconds within minutes of SRL creation.
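For reference, SRL creation on the standby of a two-thread RAC primary looks roughly like this. Group numbers and the 2G size are examples; size must match the online redo logs, and you need one more group per thread than the online group count:

```sql
-- Example group numbers; repeat until each thread has
-- (online group count + 1) standby redo log groups.
ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 GROUP 11 SIZE 2G;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 GROUP 12 SIZE 2G;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 GROUP 13 SIZE 2G;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 2 GROUP 21 SIZE 2G;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 2 GROUP 22 SIZE 2G;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 2 GROUP 23 SIZE 2G;

-- Restart managed recovery so real-time apply engages
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION;
```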
7. Cause 5: MRP Apply Bottleneck
Redo is arriving at the standby on time (low transport lag) but MRP cannot apply it fast enough (high apply lag). This is purely a standby-side problem. Do not touch the primary or network until you confirm this is the cause.
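To confirm, compare the apply rate with the redo arrival rate and look at what the recovery processes are actually waiting on; a sketch (the PR0n pattern assumes the 19c parallel recovery slave naming):

```sql
-- Apply progress and rates on the standby
SELECT item, sofar, units, timestamp
FROM   v$recovery_progress;

-- What are MRP and its slaves waiting on right now?
SELECT program, event, state, seconds_in_wait
FROM   v$session
WHERE  program LIKE '%(MRP0)%' OR program LIKE '%(PR0%';
```

Waits dominated by I/O events (e.g. db file reads or log reads) point at standby storage; CPU-bound slaves point at host sizing.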
Fix: Enable parallel apply with a degree matching the standby CPU count. Give MRP dedicated I/O, separate from RMAN backup. If Active Data Guard is enabled, use I/O Resource Manager to prioritise MRP over read queries.
8. Cause 6: Standby Resource Starvation
The standby does not have enough CPU, memory, or I/O bandwidth to keep up with apply. Frequently caused by under-specifying the standby relative to the primary, or by competing workloads such as RMAN backup or ADG read queries running on the same standby host during peak apply windows.
Fix: The standby should match the primary's hardware specification. Schedule RMAN backups on the standby during off-peak apply windows. If the ADG read workload is heavy, use I/O Resource Manager to protect MRP's I/O priority.
9. SYNC vs ASYNC: Which Mode Creates Which Lag Pattern
| Scenario | SYNC Mode Behaviour | ASYNC Mode Behaviour |
|---|---|---|
| Network spike (100 ms RTT) | Every commit stalls 100 ms, performance crisis on primary | Transport lag grows silently, primary unaffected |
| Standby reboots | Primary stalls (Max Protection) or falls back to async (Max Availability) | Transport lag grows until standby returns, no primary impact |
| Redo volume burst | Network saturates, commits queue behind LGWR | NSA buffer fills, lag grows but primary performance maintained |
| Steady state (good network) | Transport lag near zero, apply lag depends on MRP | Small transport lag always present (1-5 sec typical) |
| Packet loss | LGWR stalls on every retransmit, very visible on primary | Lag spikes during loss window, primary unaffected |
10. Lag Under Active Data Guard (ADG Read Workload)
Active Data Guard introduces an additional lag risk: read queries on the standby compete with MRP for I/O bandwidth. In a busy ADG reporting workload, MRP can be starved of the I/O it needs to apply redo.
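A quick way to see whether read sessions are crowding out recovery is to rank standby sessions by physical reads; a sketch:

```sql
-- Top I/O consumers on the standby: application read sessions
-- vs recovery/background processes.
SELECT s.sid,
       NVL(s.username, 'recovery/background') AS who,
       s.program,
       io.physical_reads
FROM   v$session s
JOIN   v$sess_io io ON io.sid = s.sid
ORDER  BY io.physical_reads DESC
FETCH FIRST 10 ROWS ONLY;
```

If named application users dominate the list while apply lag grows, the ADG read workload is the likely starvation source.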
11. Proactive Lag Monitoring
Build proactive monitoring that catches lag before it becomes an incident. Run this query every 5 minutes from your monitoring system.
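A sketch of such a check, assuming the '+DD HH:MI:SS' interval format that v$dataguard_stats reports and an example 300-second alert threshold:

```sql
-- Both lags in seconds, with a simple threshold flag for the
-- monitoring system to act on.
SELECT name,
       value,
       EXTRACT(DAY    FROM TO_DSINTERVAL(value)) * 86400
     + EXTRACT(HOUR   FROM TO_DSINTERVAL(value)) * 3600
     + EXTRACT(MINUTE FROM TO_DSINTERVAL(value)) * 60
     + EXTRACT(SECOND FROM TO_DSINTERVAL(value))      AS lag_seconds,
       CASE WHEN TO_DSINTERVAL(value) > NUMTODSINTERVAL(300, 'SECOND')
            THEN 'ALERT' ELSE 'OK' END                AS state
FROM   v$dataguard_stats
WHERE  name IN ('transport lag', 'apply lag');
```

Alert on the trend, not just the absolute value: a lag that grows for three consecutive samples is an incident forming, even if it is still under threshold.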
12. FAQ
How do you reduce transport lag when SYNC transport saturates the network? Enable redo transport compression (COMPRESSION=ENABLE) to reduce bandwidth by 50-70% at the cost of CPU on both sides. Upgrade network bandwidth (1 GbE to 10 GbE resolves most SYNC saturation issues). Increase the redo log group count so LGWR can switch to a new group while NSS sends the previous one. Ensure SRLs are configured so MRP stays close to the arrival point.