⏱️ Estimated Reading Time: 14–16 minutes
Transport Lag • Apply Lag • SYNC vs ASYNC • Network RTT • Standby I/O • MRP Apply Bottleneck
⚙️ Production Environment Referenced in This Article
Oracle Database: 19.18.0.0.0 Enterprise Edition •
Primary: 2-Node RAC, 4.8 TB OLTP •
Standby: Physical Standby (Active Data Guard)
Protection Mode: Maximum Availability (SYNC/AFFIRM) •
Network: Dedicated 1 GbE WAN (120 km distance, RTT 1.8 ms)
Peak Load: 2,800 TPS, 180 MB/sec redo generation •
Application: Core banking transaction processing
The alert arrives at 11:43 PM: "Data Guard apply lag exceeds 900 seconds." The DBA on call opens the monitoring dashboard. Transport lag is 180 seconds. Apply lag is 900 seconds. The standby is 15 minutes behind the primary. If the primary fails right now, 15 minutes of financial transactions could be at risk.
This scenario plays out in production Data Guard environments more often than most teams admit. Lag is not a single problem; it is six different problems that look identical from the outside. Transport lag and apply lag have different root causes, different diagnostic queries, and different fixes. Treating them the same wastes hours of investigation time.
This guide covers every real cause of Data Guard lag I have diagnosed in production, the exact SQL to prove which one you are dealing with, and the specific fix for each. No guesswork. No generic advice about "check your network." Precise diagnosis first, then precise resolution.
1. Transport Lag vs Apply Lag: Understanding the Difference
Before diagnosing lag, you must understand which type you have. They look the same on a monitoring dashboard but live in completely different parts of the Data Guard pipeline.
| Lag Type | Definition | Lives In | Primary Cause |
| Transport Lag | Redo generated on primary but not yet received by standby | Network pipe between primary and standby | Network, primary I/O, redo volume |
| Apply Lag | Redo received by standby but not yet applied to datafiles | Standby MRP apply process | Standby I/O, CPU, misconfiguration |
Critical Distinction:
If transport lag is high but apply lag is low, your network or primary is the problem. The standby is applying redo as fast as it receives it; it just isn't receiving it fast enough.
If transport lag is low but apply lag is high, your standby is the problem. Redo is arriving quickly, but MRP cannot apply it fast enough.
If both are high, you have multiple problems or a severe redo volume spike that is overwhelming the entire pipeline.
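The three patterns above amount to a simple decision table. Here is an illustrative triage sketch (the 60-second threshold is an assumption for the example, not an Oracle default):

```python
# Triage helper for the three lag patterns described above.
# The threshold is an illustrative assumption, not an Oracle default.

def classify_lag(transport_lag_sec: int, apply_lag_sec: int,
                 threshold_sec: int = 60) -> str:
    """Map a (transport, apply) lag pair to the problem domain to investigate."""
    high_transport = transport_lag_sec > threshold_sec
    high_apply = apply_lag_sec > threshold_sec
    if high_transport and high_apply:
        return "both: multiple problems or severe redo volume spike"
    if high_transport:
        return "transport: investigate network or primary"
    if high_apply:
        return "apply: investigate standby MRP, I/O, CPU"
    return "ok: lag within threshold"

print(classify_lag(180, 900))   # the 11:43 PM scenario: both high
print(classify_lag(5, 900))     # pure standby-side problem
```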
2. The Diagnostic Starting Point: Five Queries to Run First
Run these five queries immediately when lag is reported. They tell you which cause category you are dealing with before you go deeper.
-- Query 1: Transport lag vs apply lag (run on PRIMARY)
SELECT name,
value,
unit,
time_computed
FROM v$dataguard_stats
WHERE name IN (
'transport lag',
'apply lag',
'apply finish time',
'estimated startup time'
)
ORDER BY name;
-- Query 2: Current redo generation rate (run on PRIMARY)
-- Note: v$sysstat 'redo size' is cumulative since startup; use
-- v$sysmetric for a current per-second rate. High rate = redo volume problem
SELECT inst_id,
       ROUND(value / 1024 / 1024, 2) AS redo_mb_per_sec
FROM gv$sysmetric
WHERE metric_name = 'Redo Generated Per Sec'
  AND group_id = 2  -- 60-second interval metrics
ORDER BY inst_id;
-- Query 3: Archive dest status and error (run on PRIMARY)
SELECT dest_id,
dest_name,
status,
target,
archiver,
schedule,
destination,
error,
applied_scn,
gap_status
FROM v$archive_dest_status
WHERE status != 'INACTIVE'
ORDER BY dest_id;
-- Query 4: MRP apply process status (run on STANDBY)
SELECT process,
status,
thread#,
sequence#,
block#,
blocks,
delay_mins
FROM v$managed_standby
ORDER BY process;
-- Query 5: Last redo received vs last applied per thread (run on STANDBY)
-- (low_sequence#/high_sequence# belong to v$archive_gap, not v$archived_log)
SELECT thread#,
       MAX(sequence#) AS last_received,
       MAX(CASE WHEN applied = 'YES' THEN sequence# END) AS last_applied
FROM v$archived_log
GROUP BY thread#
ORDER BY thread#;
3. Cause 1: Primary Storage I/O Pressure
LGWR cannot write redo to the local online redo logs fast enough, so every redo write is slower than normal. In SYNC mode this directly delays the network send to the standby. In ASYNC mode it slows redo generation, which eventually causes the async transport buffer to fill.
How to Confirm It
-- Check LGWR wait events (run on PRIMARY)
-- High log file sync or log file parallel write = LGWR I/O problem
SELECT event,
total_waits,
time_waited,
ROUND(average_wait * 10, 3) AS avg_wait_ms,  -- average_wait is in centiseconds
wait_class
FROM v$system_event
WHERE event IN (
'log file sync',
'log file parallel write',
'log buffer space'
)
ORDER BY time_waited DESC;
-- Check datafile I/O latency (high write latency = DBWR pressure on LGWR)
SELECT name,
phyrds,
phywrts,
ROUND(10 * readtim / NULLIF(phyrds, 0), 2) AS avg_read_ms,    -- readtim is centiseconds
ROUND(10 * writetim / NULLIF(phywrts, 0), 2) AS avg_write_ms  -- writetim is centiseconds
FROM v$filestat f
JOIN v$datafile d ON f.file# = d.file#
WHERE phywrts > 1000
ORDER BY avg_write_ms DESC
FETCH FIRST 10 ROWS ONLY;
-- Check ASM rebalance activity (can throttle I/O)
SELECT group_number,
operation,
state,
power,
actual,
sofar,
est_minutes
FROM v$asm_operation;
Fix: Place online redo logs on dedicated fast storage (NVMe or a dedicated ASM disk group). Increase redo log size to reduce switch frequency. Schedule ASM rebalances outside peak hours. On Exadata, verify that Smart Flash Log is absorbing redo writes.
4. Cause 2: Redo Volume Spike
The primary is generating far more redo than normal: a batch job, bulk load, or mass update suddenly pushes 10–50x the normal redo rate. The network pipe cannot keep up, and transport lag grows rapidly. This is the most common cause of sudden lag spikes in production OLTP systems with nightly batch processing.
-- Check redo generation rate per instance (run on PRIMARY)
SELECT inst_id,
ROUND(value / 1024 / 1024, 2) AS redo_mb_total
FROM gv$sysstat
WHERE name = 'redo size'
ORDER BY inst_id;
-- Find top sessions generating the most redo (run on PRIMARY)
-- v$sql has no per-statement redo column, so attribute redo per session
SELECT s.sid,
       s.username,
       s.sql_id,
       s.program,
       ROUND(st.value / 1024 / 1024, 2) AS redo_mb
FROM v$sesstat st
JOIN v$statname n ON st.statistic# = n.statistic#
JOIN v$session s ON st.sid = s.sid
WHERE n.name = 'redo size'
  AND st.value > 0
ORDER BY st.value DESC
FETCH FIRST 20 ROWS ONLY;
-- Check archive log generation rate (run on PRIMARY)
-- Sudden spike in log count = redo volume surge
SELECT TO_CHAR(first_time, 'YYYY-MM-DD HH24') AS hour,
COUNT(*) AS logs_generated,
ROUND(SUM(blocks * block_size)/1024/1024) AS total_mb
FROM v$archived_log
WHERE first_time > SYSDATE - 1
AND standby_dest = 'NO'
GROUP BY TO_CHAR(first_time, 'YYYY-MM-DD HH24')
ORDER BY hour DESC;
Fix: Use direct-path (APPEND) loads, and NOLOGGING only where the re-sync cost is tolerable; note that FORCE LOGGING, standard on Data Guard primaries, overrides NOLOGGING. Schedule heavy batch jobs during off-peak hours. Enable redo transport compression (COMPRESSION=ENABLE on the LOG_ARCHIVE_DEST_n parameter). For ASYNC mode, size LOG_BUFFER generously so the async transport process can read redo from memory rather than from disk.
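A burst does not just delay redo while it lasts: any generation rate above the link's effective throughput accumulates as backlog, which then drains only at the difference between the link rate and the post-burst redo rate. A back-of-envelope sketch (all rates are illustrative; ~110 MB/s is an assumed effective rate for a 1 GbE link):

```python
def burst_drain_minutes(burst_mb_s: float, burst_minutes: float,
                        link_mb_s: float, normal_mb_s: float) -> float:
    """Minutes to drain the transport backlog created by a redo burst.

    Backlog builds at (burst - link) during the burst, then drains at
    (link - normal) once the burst ends.
    """
    backlog_mb = max(burst_mb_s - link_mb_s, 0) * burst_minutes * 60
    drain_mb_s = link_mb_s - normal_mb_s
    if drain_mb_s <= 0:
        return float("inf")   # link saturated even at normal load
    return backlog_mb / drain_mb_s / 60

# A 20-minute batch at 180 MB/s over a ~110 MB/s effective 1 GbE link,
# with 40 MB/s normal redo afterwards, takes another ~20 minutes to clear.
print(round(burst_drain_minutes(180, 20, 110, 40), 1))   # 20.0
```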
5. Cause 3: Network Latency and Bandwidth
This is the most commonly blamed cause, and the most commonly misdiagnosed. Network problems cause transport lag; they do not cause apply lag. If apply lag is high and transport lag is low, stop looking at the network.
Network Metrics That Matter for Data Guard
| Metric | Good Value | Problem Threshold | Impact |
| Round-trip latency (RTT) | < 2 ms | > 5 ms for SYNC | Every commit waits RTT in SYNC mode |
| Bandwidth utilisation | < 60% | > 80% sustained | Redo bursts cause queuing delay |
| Packet loss | 0% | > 0.01% | TCP retransmit multiplies latency |
| Jitter (latency variance) | < 1 ms | > 5 ms | Unpredictable LGWR stalls in SYNC |
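To see why RTT dominates SYNC mode, note that each commit must wait for the standby's acknowledgement, so a single session can commit at most once per round trip plus the local redo write. A rough model (the 0.5 ms local write time is an assumption):

```python
def max_serial_commits_per_sec(rtt_ms: float, local_write_ms: float = 0.5) -> float:
    """Upper bound on commits/sec for ONE session in SYNC transport mode.

    Each commit pays roughly one network round trip plus the local redo
    write. Concurrent sessions raise aggregate throughput, but per-session
    commit latency is floored by the RTT.
    """
    return 1000.0 / (rtt_ms + local_write_ms)

print(round(max_serial_commits_per_sec(1.8)))   # ~435/s at this environment's 1.8 ms RTT
print(round(max_serial_commits_per_sec(5.0)))   # ~182/s at the 5 ms SYNC threshold
```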
-- Check SYNC transport response time distribution (run on PRIMARY)
-- Each row is a response-time bucket in seconds; rising frequency in
-- higher buckets means standby acknowledgements are getting slower
SELECT dest_id,
       duration AS resp_time_bucket_sec,
       frequency,
       time AS last_occurrence
FROM v$redo_dest_resp_histogram
WHERE frequency > 0
ORDER BY dest_id, duration;
-- Check the redo transport send process (run on PRIMARY)
-- Shown as LNS, NSS (sync), or NSA (async) depending on version
SELECT process,
       status,
       client_process,
       sequence#,
       block#,
       blocks,
       delay_mins
FROM v$managed_standby
WHERE process LIKE 'NS%'
   OR process = 'LNS'
ORDER BY process;
-- Check archive destination error status
SELECT dest_id,
status,
error,
gap_status,
db_unique_name
FROM v$archive_dest_status
WHERE target = 'STANDBY'
AND status != 'INACTIVE';
-- OS-level network check (run from primary server shell)
-- ping -c 100 standby_host | tail -5
-- traceroute -n standby_host
-- iperf3 -c standby_host -t 30 -P 4 (bandwidth test)
Fix: Enable redo transport compression: add COMPRESSION=ENABLE to LOG_ARCHIVE_DEST_2. Implement QoS tagging for redo transport traffic on your WAN appliance. Increase NET_TIMEOUT if packet loss causes premature timeouts. Switch from SYNC to ASYNC if RTT exceeds 5 ms and strict zero data loss is not mandatory. Use a dedicated WAN link for redo transport; never share it with application traffic.
Incident: Firewall Stateful Inspection Causing Transport Lag
A Data Guard environment was configured correctly with 1.8 ms RTT on the dedicated WAN link. Transport lag would spike to 120+ seconds randomly, then clear, then spike again. Network team confirmed zero packet loss. Root cause: a next-gen firewall performing deep packet inspection on the redo transport TCP stream. Each 10-second idle gap caused the firewall to drop and re-establish the TCP state table entry, causing a 15–30 second reconnect delay on the NSS process.
Fix: Added a firewall bypass rule for the specific TCP port used by Data Guard redo transport (LOG_ARCHIVE_DEST SERVICE port). Lag spikes disappeared immediately.
6. Cause 4: Misconfiguration
Configuration errors are silent lag causes. Everything looks operationally normal: no errors in the alert log, no network problems, a reasonable redo rate. But lag slowly accumulates because something is configured wrong. These are the hardest causes to find because they require knowing what a correct configuration looks like.
-- Check 1: Standby Redo Logs exist and are correct size
-- SRL must be same size as online redo logs
-- Must have at least (online_log_groups + 1) SRL per thread
SELECT group#,
thread#,
sequence#,
bytes/1024/1024 AS size_mb,
archived,
status
FROM v$standby_log
ORDER BY thread#, group#;
-- Compare SRL size vs online redo log size (run on PRIMARY)
-- These must match
SELECT 'online' AS log_type,
group#,
bytes/1024/1024 AS size_mb
FROM v$log
UNION ALL
SELECT 'standby',
group#,
bytes/1024/1024
FROM v$standby_log
ORDER BY log_type, group#;
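The SRL rules in the comments above reduce to simple arithmetic: SRL size must equal the online redo log size, and each thread needs one more SRL group than it has online groups. A trivial helper, assuming a symmetric layout across RAC threads:

```python
def required_srl_groups(online_groups_per_thread: int, threads: int) -> int:
    """Minimum standby redo log groups: (online groups + 1) per thread."""
    return (online_groups_per_thread + 1) * threads

# A 2-node RAC primary with 4 online log groups per thread
# needs at least 10 SRL groups on the standby.
print(required_srl_groups(4, 2))   # 10
```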
-- Check 2: AFFIRM setting matches protection mode intention
SELECT dest_id,
dest_name,
affirm,
async_blocks,
net_timeout,
delay,
compression,
status
FROM v$archive_dest
WHERE target = 'STANDBY'
AND status = 'VALID';
-- Check 3: Real-time apply is running (not archive-based apply)
-- APPLYING_LOG = real-time apply; WAIT_FOR_GAP = sequence gap to resolve
SELECT process,
status,
sequence#
FROM v$managed_standby
WHERE process = 'MRP0';
-- Check 4: Protection mode vs protection level match
-- If they differ, primary has fallen back silently
SELECT protection_mode, -- configured
protection_level -- actual effective
FROM v$database;
Incident: Missing Standby Redo Logs Causing an 8-Hour Apply Gap
A newly built Data Guard environment appeared healthy. Alert log clean. No errors. Archive destination showed SUCCESS. But during failover testing, the standby was 8 hours behind the primary despite a low-latency WAN link.
Root cause: Standby Redo Logs had never been created on the standby. Without SRLs, real-time apply cannot function; the MRP process waited for complete archived logs to be shipped rather than receiving redo in real time, so the standby only caught up when a log switch forced an archive to ship.
Fix: Created SRL groups matching primary online redo log size and count + 1. Real-time apply started immediately. Apply lag dropped from 8 hours to under 60 seconds within minutes.
7. Cause 5: Standby Apply Bottleneck
Redo is arriving at the standby on time (low transport lag) but MRP cannot apply it fast enough, so apply lag grows. This is purely a standby-side problem; once you have confirmed it, leave the primary and the network alone.
-- Check MRP apply rate and wait events (run on STANDBY)
SELECT process,
status,
thread#,
sequence#,
block#,
blocks
FROM v$managed_standby
WHERE process LIKE 'MRP%'
OR process = 'RFS';
-- Check what MRP is waiting for (run on STANDBY)
SELECT event,
total_waits,
ROUND(average_wait * 10, 3) AS avg_wait_ms,  -- average_wait is in centiseconds
wait_class
FROM v$system_event
WHERE wait_class NOT IN ('Idle', 'Network')
ORDER BY time_waited DESC
FETCH FIRST 15 ROWS ONLY;
-- Check parallel apply worker count (run on STANDBY)
-- If only MRP0 shown, parallel apply is disabled
SELECT process,
status,
thread#
FROM v$managed_standby
WHERE process LIKE 'PR%' -- parallel recovery slaves
OR process LIKE 'MRP%'
ORDER BY process;
-- Enable parallel apply (run on STANDBY; requires a brief MRP restart)
-- Stop current apply
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
-- Restart with parallel apply (match to standby CPU count)
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE
PARALLEL 8
USING CURRENT LOGFILE
DISCONNECT FROM SESSION;
-- Verify parallel workers started
SELECT process, status
FROM v$managed_standby
WHERE process LIKE 'PR%' OR process LIKE 'MRP%'
ORDER BY process;
Fix: Enable parallel apply with a degree matching the standby CPU count. Ensure the standby has dedicated I/O for datafile writes, separate from the RMAN backup disks. If Active Data Guard is enabled, read queries compete with MRP for I/O; use I/O Resource Manager to prioritise MRP. Check the standby UNDO tablespace for contention if apply waits show undo-related events.
8. Cause 6: Standby Resource Starvation
The standby server does not have enough CPU, memory, or I/O bandwidth to keep up with apply. This is frequently caused by under-specifying the standby relative to the primary, or by running competing workloads (RMAN backups, ADG reads, OS processes) on the same standby host.
-- Check CPU and I/O pressure on standby (run on STANDBY)
SELECT stat_name,
value
FROM v$osstat
WHERE stat_name IN (
'NUM_CPUS',
'LOAD',
'BUSY_TIME',
'IDLE_TIME',
'PHYSICAL_MEMORY_BYTES'
);
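BUSY_TIME and IDLE_TIME in v$osstat are cumulative centiseconds since instance start, so utilisation must be computed from the delta between two samples. A sketch of that calculation (the sample values are invented):

```python
def cpu_utilisation_pct(busy1: int, idle1: int, busy2: int, idle2: int) -> float:
    """CPU utilisation between two v$osstat samples.

    BUSY_TIME / IDLE_TIME are cumulative centiseconds since instance
    start; only the delta between two samples is meaningful.
    """
    d_busy = busy2 - busy1
    d_idle = idle2 - idle1
    total = d_busy + d_idle
    return 100.0 * d_busy / total if total else 0.0

# Invented deltas: 540 s busy vs 60 s idle over the interval -> 90% busy,
# a standby with little headroom left for MRP.
print(cpu_utilisation_pct(1_000_000, 500_000, 1_054_000, 506_000))   # 90.0
```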
-- Check if RMAN backup is running during peak apply (run on STANDBY)
SELECT session_key,
input_type,
status,
TO_CHAR(start_time, 'YYYY-MM-DD HH24:MI') AS start_time,
input_bytes_display,
output_bytes_display
FROM v$rman_backup_job_details
WHERE start_time > SYSDATE - 1
ORDER BY start_time DESC;
-- Check I/O contention on standby datafiles (run on STANDBY)
SELECT name,
ROUND(10 * readtim / NULLIF(phyrds, 0), 2) AS avg_read_ms,    -- readtim is centiseconds
ROUND(10 * writetim / NULLIF(phywrts, 0), 2) AS avg_write_ms, -- writetim is centiseconds
phyrds,
phywrts
FROM v$filestat f
JOIN v$datafile d ON f.file# = d.file#
WHERE (phyrds + phywrts) > 1000
ORDER BY avg_write_ms DESC
FETCH FIRST 10 ROWS ONLY;
-- Use Resource Manager to cap the ADG read workload (run on STANDBY)
-- Note: the ADG_REPORTING group name is illustrative; map your reporting
-- sessions to it. MRP is a background process and stays outside the
-- constrained group. Every plan must include an OTHER_GROUPS directive.
BEGIN
  DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA();
  DBMS_RESOURCE_MANAGER.CREATE_PLAN(
    plan    => 'STANDBY_MGMT_PLAN',
    comment => 'Protect MRP apply from ADG read workload'
  );
  DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP(
    consumer_group => 'ADG_REPORTING',
    comment        => 'Low-priority ADG read sessions'
  );
  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
    plan             => 'STANDBY_MGMT_PLAN',
    group_or_subplan => 'ADG_REPORTING',
    comment          => 'Cap reporting workload',
    mgmt_p1          => 20
  );
  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
    plan             => 'STANDBY_MGMT_PLAN',
    group_or_subplan => 'OTHER_GROUPS',
    comment          => 'Everything else',
    mgmt_p1          => 80
  );
  DBMS_RESOURCE_MANAGER.VALIDATE_PENDING_AREA();
  DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA();
END;
/
9. SYNC vs ASYNC: Which Mode Causes Which Lag Pattern
| Scenario | SYNC Mode Behaviour | ASYNC Mode Behaviour |
| Network spike (100 ms RTT) | Every commit stalls 100 ms; performance crisis on primary | Transport lag grows silently; primary unaffected |
| Standby reboots | Primary stalls (Max Protection) or falls back to async (Max Availability) | Transport lag grows until standby returns; no primary impact |
| Redo volume burst | Network saturates; commits queue behind LGWR | Async buffer fills; lag grows but primary performance maintained |
| Steady state (good network) | Transport lag near zero; apply lag depends on MRP | Small transport lag always present (1–5 sec typical) |
| Network intermittent loss | LGWR stalls on each retransmit; very visible on primary | Lag spikes during loss window; primary unaffected |
10. Lag Under Active Data Guard (Read-Only Standby)
Active Data Guard (ADG) introduces an additional lag risk: read queries on the standby compete with MRP for I/O bandwidth. In a busy ADG reporting workload, MRP can be starved of the I/O it needs to apply redo quickly.
-- Check open mode on standby (ADG = READ ONLY WITH APPLY)
SELECT open_mode,
database_role,
db_unique_name
FROM v$database;
-- Monitor read sessions vs MRP on standby
SELECT s.type,
s.status,
COUNT(*) AS session_count
FROM v$session s
WHERE s.type = 'USER'
OR s.program LIKE '%MRP%'
GROUP BY s.type, s.status;
-- Check I/O split between read sessions and MRP (run on STANDBY)
SELECT n.name,
s.value
FROM v$sesstat s
JOIN v$statname n ON s.statistic# = n.statistic#
JOIN v$session ss ON s.sid = ss.sid
WHERE ss.program LIKE '%MRP%'
AND n.name IN (
'physical reads',
'physical writes',
'redo size'
);
-- Emergency: stop ADG reads so MRP gets all the I/O (disconnects readers)
-- A standby open READ ONLY cannot simply be closed while sessions are
-- active; restart it in MOUNT mode, then resume recovery:
-- SHUTDOWN IMMEDIATE;
-- STARTUP MOUNT;
-- ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION;
-- When lag clears: cancel recovery, ALTER DATABASE OPEN READ ONLY,
-- then restart managed recovery to return to ADG mode.
11. Building a Lag Monitoring Alert
Do not wait for a pager alert at 11 PM. Build proactive lag monitoring that catches problems before they become incidents.
-- Comprehensive lag monitoring query (run on PRIMARY every 5 minutes)
-- Alerts when transport lag > 30s or apply lag > 120s
SELECT db_unique_name,
name AS metric,
value AS lag_value,
CASE
WHEN name = 'transport lag'
AND TO_DSINTERVAL(value) > INTERVAL '0 00:00:30' DAY TO SECOND
THEN 'ALERT: Transport lag exceeds 30 seconds'
WHEN name = 'apply lag'
AND TO_DSINTERVAL(value) > INTERVAL '0 00:02:00' DAY TO SECOND
THEN 'ALERT: Apply lag exceeds 2 minutes'
ELSE 'OK'
END AS alert_status,
time_computed
FROM v$dataguard_stats
WHERE name IN ('transport lag', 'apply lag')
ORDER BY name;
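If this check feeds an external monitoring script, note that v$dataguard_stats returns lag as an interval string like '+00 00:15:00' (days, then hh:mm:ss), which must be parsed before thresholds can be compared. A minimal parser mirroring the thresholds in the query above:

```python
def lag_to_seconds(lag: str) -> int:
    """Convert a v$dataguard_stats interval string '+DD HH:MM:SS' to seconds."""
    days, hms = lag.lstrip("+").split(" ")
    h, m, s = (int(x) for x in hms.split(":"))
    return int(days) * 86400 + h * 3600 + m * 60 + s

def lag_alert(name: str, lag: str) -> str:
    """Same thresholds as the SQL above: 30 s transport, 120 s apply."""
    thresholds = {"transport lag": 30, "apply lag": 120}
    sec = lag_to_seconds(lag)
    limit = thresholds.get(name)
    if limit is not None and sec > limit:
        return f"ALERT: {name} is {sec}s (limit {limit}s)"
    return "OK"

print(lag_alert("apply lag", "+00 00:15:00"))      # ALERT: apply lag is 900s (limit 120s)
print(lag_alert("transport lag", "+00 00:00:10"))  # OK
```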
-- Archive gap detection (run on STANDBY; v$archive_gap is populated there)
-- Any rows = sequence gap = serious problem
SELECT thread#,
low_sequence#,
high_sequence#
FROM v$archive_gap
ORDER BY thread#;
-- Full Data Guard health check (run on PRIMARY via DGMGRL)
-- dgmgrl / "show configuration"
-- dgmgrl / "validate database verbose standby_db"
12. FAQ
My transport lag shows +00 00:00:00 but apply lag is 2 hours. What is happening?
Transport lag of zero means redo is arriving at the standby immediately; the network and primary are fine. Apply lag of 2 hours means MRP is 2 hours behind in applying what it has already received. This is a pure standby-side problem: check the MRP parallel degree, standby I/O performance, RMAN backup scheduling, and whether the ADG read workload is starving MRP. Do not waste time on the network or primary for this pattern.
Should I increase redo log size to reduce transport lag?
Increasing redo log size reduces log switch frequency, which smooths redo transport by reducing the burst pattern of frequent switches. It does not reduce the total volume of redo generated, so if the problem is sustained bandwidth saturation, larger logs will not help. Larger logs also mean longer checkpoint intervals, which can increase crash recovery time. The sweet spot for most production systems is 500 MB to 2 GB per redo log group, switching every 10–30 minutes under normal load.
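The sizing trade-off in this answer is linear: expected switch interval is simply log size divided by redo rate. A quick calculator (the rates are examples, not measurements from this system):

```python
def switch_interval_minutes(log_size_mb: float, redo_mb_per_sec: float) -> float:
    """Expected log switch interval for a given steady redo rate."""
    return log_size_mb / redo_mb_per_sec / 60

# A 2 GB log at a steady 2 MB/s of redo switches roughly every 17 minutes,
# inside the 10-30 minute sweet spot; at a 180 MB/s peak the same log
# switches every ~11 seconds, which is why bursts flood the archiver.
print(round(switch_interval_minutes(2048, 2), 1))      # ~17.1 minutes
print(round(switch_interval_minutes(2048, 180) * 60))  # ~11 seconds
```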
Can Data Guard lag cause data corruption?
Lag itself does not cause corruption; it creates data loss risk if the primary fails while the standby is behind. Corruption is a different failure mode (block corruption on the primary or standby) that Data Guard can detect and prevent via the DB_BLOCK_CHECKING and DB_LOST_WRITE_PROTECT parameters. Enable DB_LOST_WRITE_PROTECT=TYPICAL on both primary and standby to detect lost writes that could cause divergence between the two databases.
How do I reduce lag without changing from SYNC to ASYNC mode?
Several options preserve SYNC transport: (1) Enable redo transport compression (COMPRESSION=ENABLE) to reduce bandwidth by 50–70% at the cost of CPU. (2) Upgrade network bandwidth; moving from 1 GbE to 10 GbE typically resolves SYNC lag caused by saturation. (3) Increase the redo log group count so LGWR can switch to a new group while the previous one is still being sent. (4) Ensure the standby uses real-time apply (SRLs configured) so MRP stays close to arrival. (5) Schedule redo-heavy batch jobs off-peak to avoid bursting the SYNC transport pipe.
About the Author
Chetan Yadav
Chetan Yadav is a Senior Oracle, PostgreSQL, MySQL, and Cloud DBA with 15+ years of hands-on experience managing production databases across on-premises, hybrid, and cloud environments. He specialises in high availability architecture, performance tuning, disaster recovery, and database migrations.
Chetan has diagnosed and resolved Data Guard lag incidents across financial, healthcare, and e-commerce production environments, including the 8-hour apply gap from missing Standby Redo Logs and the firewall stateful-inspection transport lag described in this article. His writing focuses on precise diagnosis and precise resolution, not generic advice.
This blog focuses on real-world DBA problems, career growth, and practical learning, not theoretical documentation or vendor marketing.