⏱️ Estimated Reading Time: 12 - 14 minutes
Missing SRLs, Parallel Apply, Network Compression, RMAN Conflict, Protection Mode Mismatch
⚙️ Environment Referenced
Oracle Database: 19.18.0.0.0 Enterprise Edition •
Standby Type: Physical Standby (Active Data Guard) •
Protection Mode: Maximum Availability (SYNC/AFFIRM)
Primary: 2-Node RAC, 4.8 TB OLTP •
Network: Dedicated 1 GbE WAN, RTT 1.8 ms •
Peak Load: 2,800 TPS
Data Guard lag is one of the most stressful production alerts a DBA receives. The standby is falling behind the primary. Every second of lag is a second of potential data loss if the primary fails right now. The pressure to fix it quickly is real.
The problem is that "Data Guard lag" is not one problem. It is five different problems that all show the same symptom. Applying the wrong fix wastes time and can make things worse. This guide gives you the exact decision path, the exact diagnostic queries, and the exact fix commands for each root cause, in the order you should check them.
Follow the steps in order. Each step either identifies your problem and gives you the fix, or clears that cause and moves you to the next. Most Data Guard lag issues are resolved within Steps 1 to 3.
1. Before You Start: Run These 4 Queries First
Do not start applying fixes before running diagnostics. Two minutes of diagnosis saves two hours of incorrect troubleshooting.
-- QUERY 1: Identify lag type (run on PRIMARY)
-- This tells you which direction to go
SELECT name,
value,
time_computed
FROM v$dataguard_stats
WHERE name IN (
'transport lag',
'apply lag',
'apply finish time'
)
ORDER BY name;
-- QUERY 2: Check Standby Redo Logs (run on STANDBY)
-- No rows = missing SRLs = most common cause of apply lag
SELECT group#,
thread#,
bytes/1024/1024 AS size_mb,
archived,
status
FROM v$standby_log
ORDER BY thread#, group#;
-- QUERY 3: Check MRP parallel workers (run on STANDBY)
-- Only MRP0 with no PR* processes = parallel apply disabled
SELECT process,
status,
thread#,
sequence#
FROM v$managed_standby
WHERE process LIKE 'MRP%'
OR process LIKE 'PR%'
ORDER BY process;
-- QUERY 4: Check for archive gap (run on PRIMARY)
-- Any rows = URGENT, fix this before everything else
SELECT thread#,
low_sequence#,
high_sequence#
FROM v$archive_gap
ORDER BY thread#;
Read Query 4 Result First:
If Query 4 returns any rows, you have a sequence gap. This means the standby is missing archived redo logs and cannot apply continuously. Fix the gap by manually copying the missing archives from the primary before doing anything else. A sequence gap takes priority over every other fix.
2. Step 1: Check Transport Lag vs Apply Lag
1
Identify Which Type of Lag You Have
From Query 1 output:
- Transport lag high Redo is not reaching the standby fast enough. Go to Step 4 (network) or check primary I/O.
- Apply lag high, transport lag low Redo is arriving fine but MRP cannot apply it. Go to Step 2 (SRLs) then Step 3 (parallel apply).
- Both high Check SRLs first (Step 2), then network (Step 4).
| Query 1 Result | What It Means | Go To |
| transport lag: +00 05:00:00, apply lag: +00 00:10:00 | Network or primary I/O problem | Step 4 |
| transport lag: +00 00:00:00, apply lag: +02 00:00:00 | MRP apply bottleneck | Step 2 then Step 3 |
| transport lag: +00 01:00:00, apply lag: +01 00:00:00 | Multiple issues | Step 2, then Step 3, then Step 4 |
| Both show +00 00:00:00 | Lag cleared or intermittent | Step 9 (verify and monitor) |
3. Step 2: Fix Missing Standby Redo Logs
Check This First Missing Standby Redo Logs (SRLs) is the single most common cause of unexpectedly high apply lag in new or recently rebuilt Data Guard environments. Without SRLs, real-time apply cannot function. The standby only applies redo when archive logs are shipped, typically once per hour, creating a massive lag gap.
2
Diagnose and Fix Missing SRLs
-- Step 2a: Check primary online redo log size (run on PRIMARY)
-- SRLs must match this exact size
SELECT group#,
members,
bytes/1024/1024 AS size_mb,
status
FROM v$log
ORDER BY group#;
-- Step 2b: Check SRLs on standby (run on STANDBY)
-- If no rows returned = SRLs are missing
SELECT group#,
thread#,
bytes/1024/1024 AS size_mb,
archived,
status
FROM v$standby_log
ORDER BY thread#, group#;
-- Step 2c: Create SRLs if missing (run on STANDBY)
-- Rules: same size as online redo logs
-- Need (online_log_groups_per_thread + 1) SRL groups per thread
-- Example: Primary has 3 redo groups per thread, 2 threads (RAC)
-- Need 4 SRL groups for thread 1, 4 for thread 2
-- For ASM storage (recommended):
ALTER DATABASE ADD STANDBY LOGFILE THREAD 1
GROUP 11 ('+DATA') SIZE 500M;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 1
GROUP 12 ('+DATA') SIZE 500M;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 1
GROUP 13 ('+DATA') SIZE 500M;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 1
GROUP 14 ('+DATA') SIZE 500M;
-- For single-instance primary (only thread 1 needed):
ALTER DATABASE ADD STANDBY LOGFILE THREAD 1
GROUP 11 ('+DATA') SIZE 500M;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 1
GROUP 12 ('+DATA') SIZE 500M;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 1
GROUP 13 ('+DATA') SIZE 500M;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 1
GROUP 14 ('+DATA') SIZE 500M;
-- Step 2d: Verify SRLs created and MRP switches to real-time apply
SELECT group#,
thread#,
bytes/1024/1024 AS size_mb,
status
FROM v$standby_log
ORDER BY thread#, group#;
-- Check MRP switched to WAIT_FOR_LOG (real-time) not WAIT_FOR_GAP
SELECT process,
status,
sequence#
FROM v$managed_standby
WHERE process = 'MRP0';
Expected Result After Creating SRLs:
Within 2,5 minutes of creating SRLs, MRP status should change from WAIT_FOR_GAP to WAIT_FOR_LOG. Apply lag should begin dropping immediately and typically reaches under 60 seconds within 10,15 minutes if the only issue was missing SRLs. In one production environment, apply lag dropped from 8 hours to 42 seconds within 12 minutes of SRL creation.
4. Step 3: Enable Parallel Apply
If SRLs are present and correctly sized but apply lag is still growing, the MRP process is the bottleneck. By default, MRP is single-threaded. Enabling parallel apply allows MRP to use multiple CPU cores to apply redo simultaneously, increasing apply throughput by 3,8x on most workloads.
3
Enable Parallel Apply on the Standby
-- Step 3a: Check current apply mode (run on STANDBY)
-- If only MRP0 shown and no PR* processes = single-threaded
SELECT process,
status,
thread#,
sequence#
FROM v$managed_standby
WHERE process LIKE 'MRP%'
OR process LIKE 'PR%'
ORDER BY process;
-- Step 3b: Stop current MRP (run on STANDBY)
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
-- Verify MRP stopped
SELECT process, status
FROM v$managed_standby
WHERE process LIKE 'MRP%';
-- Step 3c: Restart with parallel apply
-- Set PARALLEL to match standby CPU count (start with CPU_count/2)
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE
PARALLEL 8
USING CURRENT LOGFILE
DISCONNECT FROM SESSION;
-- Step 3d: Verify parallel workers started
-- Should see MRP0 + PR00, PR01, PR02... workers
SELECT process,
status,
thread#,
sequence#
FROM v$managed_standby
WHERE process LIKE 'MRP%'
OR process LIKE 'PR%'
ORDER BY process;
-- Step 3e: Monitor apply lag after enabling parallel apply
SELECT name,
value,
time_computed
FROM v$dataguard_stats
WHERE name IN ('transport lag', 'apply lag')
ORDER BY name;
How Many Parallel Apply Workers to Set?
Start with half your standby CPU count. For an 8-CPU standby, start with PARALLEL 4. Monitor apply lag and CPU usage. If apply lag is still growing and CPU has headroom, increase to PARALLEL 8. Do not set parallel degree higher than CPU count. If the standby runs Active Data Guard (read workload), leave at least 25% CPU for read sessions.
5. Step 4: Fix Network Latency and Bandwidth
If transport lag is high and apply lag is low, the problem is between the primary and the standby. The redo is being generated but not arriving fast enough.
4
Reduce Redo Transport Bandwidth Usage
-- Step 4a: Check current transport configuration (run on PRIMARY)
SELECT dest_id,
dest_name,
status,
affirm,
compression,
net_timeout,
async_blocks,
error
FROM v$archive_dest
WHERE target = 'STANDBY'
AND status = 'VALID';
-- Step 4b: Enable compression on redo transport (run on PRIMARY)
-- Reduces bandwidth by 30-70% at cost of CPU on both sides
ALTER SYSTEM SET log_archive_dest_2 =
'SERVICE=standby_db
SYNC
AFFIRM
COMPRESSION=ENABLE
NET_TIMEOUT=30
VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE)
DB_UNIQUE_NAME=standby_db'
SCOPE=BOTH SID='*';
-- Step 4c: Verify compression is now enabled
SELECT dest_id,
compression,
status,
error
FROM v$archive_dest
WHERE target = 'STANDBY'
AND status = 'VALID';
-- Step 4d: Check current network write latency
SELECT dest_id,
srl_network_writes,
ROUND(srl_network_write_time /
NULLIF(srl_network_writes,0), 3) AS avg_network_write_ms
FROM v$standby_log_stats
ORDER BY dest_id;
-- OS level network test (run from primary server shell)
-- ping -c 50 standby_host_ip
-- iperf3 -c standby_host_ip -t 30 -P 4
6. Step 5: Resolve RMAN vs MRP I/O Conflict
A very common but often overlooked cause of apply lag spikes: RMAN backup running on the standby at the same time as peak MRP apply activity. Both compete for the same disk I/O. RMAN is aggressive by default and can starve MRP of the I/O it needs.
5
Fix RMAN and MRP I/O Conflict
-- Step 5a: Check if RMAN is running during lag spike (run on STANDBY)
SELECT session_key,
input_type,
status,
TO_CHAR(start_time, 'YYYY-MM-DD HH24:MI') AS start_time,
TO_CHAR(end_time, 'YYYY-MM-DD HH24:MI') AS end_time,
input_bytes_display,
output_bytes_display
FROM v$rman_backup_job_details
WHERE start_time > SYSDATE - 2
ORDER BY start_time DESC;
-- Step 5b: Check datafile I/O latency on standby
-- High write time during backup = RMAN starving MRP
SELECT d.name,
ROUND(f.writetim / NULLIF(f.phywrts,0), 2) AS avg_write_ms,
f.phyrds,
f.phywrts
FROM v$filestat f
JOIN v$datafile d ON f.file# = d.file#
WHERE f.phywrts > 1000
ORDER BY avg_write_ms DESC
FETCH FIRST 10 ROWS ONLY;
-- Step 5c: Throttle RMAN to protect MRP I/O (run on STANDBY)
-- Connect RMAN and set channel throttle:
-- RMAN> RUN {
-- ALLOCATE CHANNEL c1 DEVICE TYPE DISK
-- RATE 100M; -- limit to 100MB/sec
-- BACKUP DATABASE;
-- }
-- Step 5d: Use DBMS_SCHEDULER to move backup to off-peak window
-- Check current backup schedule
SELECT job_name,
enabled,
next_run_date,
last_start_date
FROM dba_scheduler_jobs
WHERE job_name LIKE '%RMAN%'
OR job_name LIKE '%BACKUP%'
ORDER BY next_run_date;
7. Step 6: Fix Protection Mode Mismatch
A silent lag builder. The protection mode shown in v$database has two columns: protection_mode (what you configured) and protection_level (what is actually in effect). If the standby was temporarily unreachable, the primary may have silently fallen back to a lower level. This can cause steady, unexplained lag accumulation.
-- Check effective protection mode (run on PRIMARY)
-- If these two values differ, primary fell back silently
SELECT name,
db_unique_name,
protection_mode, -- what you configured
protection_level -- what is actually active
FROM v$database;
-- Check archive destination status for errors
SELECT dest_id,
status,
error,
gap_status,
db_unique_name
FROM v$archive_dest_status
WHERE target = 'STANDBY'
AND status != 'INACTIVE';
-- Restore protection mode via DGMGRL (after standby reconnects)
-- dgmgrl /
-- DGMGRL> SHOW CONFIGURATION;
-- DGMGRL> EDIT CONFIGURATION SET PROTECTION MODE AS MAXAVAILABILITY;
-- DGMGRL> SHOW CONFIGURATION;
-- Verify protection_level now matches protection_mode
SELECT protection_mode,
protection_level
FROM v$database;
8. Fix for Active Data Guard (ADG) Apply Lag
If your standby runs Active Data Guard (open read-only while applying), read queries from reporting users compete with MRP for I/O. During heavy reporting workloads, MRP can be starved of I/O and apply lag grows even when all other configuration is correct.
-- Check open mode on standby
SELECT open_mode,
database_role
FROM v$database;
-- Monitor MRP vs read session I/O competition
SELECT s.program,
s.status,
COUNT(*) AS sessions
FROM v$session s
WHERE s.type = 'USER'
OR s.program LIKE '%MRP%'
GROUP BY s.program, s.status
ORDER BY sessions DESC
FETCH FIRST 10 ROWS ONLY;
-- Apply lag while ADG read load active
SELECT name, value
FROM v$dataguard_stats
WHERE name IN ('transport lag', 'apply lag');
-- Create I/O Resource Manager plan to protect MRP (run on STANDBY)
BEGIN
DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA();
DBMS_RESOURCE_MANAGER.CREATE_PLAN(
plan => 'STANDBY_IO_PLAN',
comment => 'Protect MRP apply from ADG read workload'
);
DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
plan => 'STANDBY_IO_PLAN',
group_or_subplan => '_ORACLE_BACKGROUND_GROUP_',
comment => 'MRP gets priority I/O',
mgmt_p1 => 80
);
DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA();
END;
/
ALTER SYSTEM SET resource_manager_plan = 'STANDBY_IO_PLAN'
SCOPE = BOTH SID = '*';
9. After the Fix: Verify and Monitor
After applying any fix, do not assume the problem is resolved. Monitor lag for at least 30 minutes under production load before declaring the incident closed.
-- Run this every 5 minutes after the fix (run on PRIMARY)
SELECT name,
value,
time_computed
FROM v$dataguard_stats
WHERE name IN (
'transport lag',
'apply lag',
'apply finish time'
)
ORDER BY name;
-- Confirm no sequence gaps (run on PRIMARY)
SELECT thread#,
low_sequence#,
high_sequence#
FROM v$archive_gap;
-- Confirm MRP running in real-time mode (run on STANDBY)
-- Status must be WAIT_FOR_LOG not WAIT_FOR_GAP
SELECT process,
status,
sequence#,
delay_mins
FROM v$managed_standby
WHERE process LIKE 'MRP%'
OR process LIKE 'PR%'
ORDER BY process;
-- Full DGMGRL health check
-- dgmgrl /
-- DGMGRL> SHOW CONFIGURATION;
-- DGMGRL> VALIDATE DATABASE VERBOSE standby_db;
10. Prevent Lag From Returning: Configuration Checklist
SRL and Transport Configuration
- Standby Redo Logs created with same size as online redo logs
- SRL count = online redo log groups per thread + 1 per thread
- MRP status shows WAIT_FOR_LOG (real-time apply active)
- COMPRESSION=ENABLE set in LOG_ARCHIVE_DEST if WAN bandwidth is a concern
- NET_TIMEOUT set appropriately for WAN latency (30,60 seconds)
Apply Performance
- Parallel apply enabled with degree matching standby CPU count
- RMAN backup scheduled outside peak MRP apply window
- Standby hardware sized equal to or greater than primary
- Dedicated I/O path for standby datafiles (not shared with backup)
- I/O Resource Manager plan active if ADG read workload is present
Monitoring
- Automated alert when transport lag exceeds 30 seconds
- Automated alert when apply lag exceeds 2 minutes
- Archive gap check runs every 15 minutes
- DGMGRL VALIDATE DATABASE runs weekly
- Failover tested under production load every quarter
FAQ
Can I create SRLs without stopping the standby database?
Yes. You can add Standby Redo Log groups to a mounted or open standby database without any downtime. The ALTER DATABASE ADD STANDBY LOGFILE command runs online. MRP does not need to be stopped. After creating SRLs, MRP automatically detects them and switches to real-time apply mode. You should verify the switch by checking that MRP status changes from WAIT_FOR_GAP to WAIT_FOR_LOG within a few minutes.
Will enabling parallel apply cause any data consistency risk?
No. Oracle's parallel apply mechanism uses the same redo-based recovery infrastructure as single-threaded MRP. Oracle guarantees transactional consistency regardless of the parallel degree. Redo records within the same transaction are always applied in the correct order. The only consideration is that enabling parallel apply requires briefly cancelling and restarting MRP, which causes a few seconds of apply pause. During that pause, apply lag may increase slightly before recovering.
How do I fix a sequence gap in Data Guard?
A sequence gap means the standby is missing one or more archived redo log files and cannot apply continuously. First run SELECT * FROM v$archive_gap on the primary to identify the missing sequence numbers. Then manually copy the corresponding archive log files from the primary to the standby using RMAN or OS copy, register them on the standby with ALTER DATABASE REGISTER LOGFILE, and verify MRP resumes applying. If the gap is very old and the archive logs no longer exist on the primary, you may need to rebuild the standby using RMAN active duplication.
Transport lag is zero but apply lag keeps growing slowly every day. Why?
Slow, steady apply lag growth with zero transport lag almost always points to one of three causes: (1) parallel apply is not enabled and MRP is falling behind a high-volume OLTP workload, (2) RMAN backup is running daily and competing with MRP during peak hours, or (3) Active Data Guard read workload is starving MRP I/O. Run the MRP parallel worker check (Query 3) and the RMAN schedule check (Step 5) to identify which one is active in your environment.
About the Author
Chetan Yadav
Chetan Yadav is a Senior Oracle, PostgreSQL, MySQL, and Cloud DBA with 15+ years of hands-on experience managing production databases across on-premises, hybrid, and cloud environments.
Chetan has resolved Data Guard lag incidents across financial and healthcare production environments, including the 8-hour apply lag from missing SRLs and the firewall-induced transport spike described in related guides. His writing focuses on precise diagnosis and exact resolution commands, not generic advice.
This blog focuses on real-world DBA problems, career growth, and practical learning, not theoretical documentation or vendor marketing.
No comments:
Post a Comment