Oracle Data Guard Protection Modes: Zero Data Loss Design Strategy
Oracle Database: 19.18.0.0.0 & 23ai Enterprise Edition • Data Guard Broker: Enabled with Fast-Start Failover
Primary: 2-Node RAC, 6 TB OLTP • Standby: Physical Standby, separate data center (120 km)
Network: Dedicated 10 GbE WAN link • RTO Target: < 30 seconds • RPO Target: Zero data loss
Application: Core banking transaction processing system
2:14 AM. A storage array at the primary data center suffered a catastrophic controller failure. The entire primary RAC cluster went offline. No warning. No graceful shutdown. Just gone.
Within 23 seconds, Oracle Data Guard Fast-Start Failover had detected the outage, confirmed quorum with the observer, and automatically activated the physical standby database. The application connection pool reconnected. Transactions resumed. The on-call team received the alert after the failover had already completed.
That outcome — 23 seconds, zero data loss, no manual intervention — was not luck. It was the result of choosing the right Data Guard protection mode, designing the redo transport correctly, and validating failover under load before the incident happened. Most organizations running Data Guard never validate their failover until disaster strikes. By then it is too late to discover the configuration was wrong.
This guide explains Oracle Data Guard protection modes in depth: what each mode actually guarantees, how redo transport works under the hood, when to use each mode, and the production design decisions that determine whether your standby database saves you or fails you at the worst possible moment.
- Data Guard Architecture: How It Actually Works
- The Three Protection Modes Explained
- Redo Transport Deep Dive: SYNC vs ASYNC
- Standby Database Types: Physical vs Logical vs Snapshot
- Fast-Start Failover: Automatic DR Activation
- Active Data Guard: Read-Only Standby While Applying Redo
- Data Guard Broker: Centralized Configuration Management
- Real Production Failover: What Actually Happens
- Protection Mode Decision Framework
- FAQ
- Related Reading from Real Production Systems
1. Data Guard Architecture: How It Actually Works
Oracle Data Guard maintains one or more synchronized copies of a primary database called standby databases. When the primary fails, a standby can take over as the new primary — either automatically via Fast-Start Failover or manually via a DBA-initiated failover command.
The core mechanism is redo log shipping. Every change made to the primary database is recorded in redo logs. Data Guard ships those redo records to the standby in real time, where they are applied to keep the standby synchronized.
Key Data Guard Components
| Component | Location | Role |
|---|---|---|
| LGWR | Primary | Writes redo to the online redo logs, the source of all redo shipped to the standby |
| NSS / NSA Process | Primary | Network Server processes: NSS sends redo synchronously, NSA sends it asynchronously |
| RFS Process | Standby | Remote File Server — receives redo from primary |
| MRP Process | Standby | Managed Recovery Process — applies redo to standby |
| LSP Process | Standby (logical) | Logical Standby Process: coordinates SQL Apply on a logical standby |
| DMON Process | Both | Data Guard Monitor — manages broker configuration |
| Observer | Third site | Monitors primary for Fast-Start Failover quorum |
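One way to confirm that the standby-side processes in the table are actually running is to query the standby directly. This is a minimal sketch, assuming Oracle 19c, where V$MANAGED_STANDBY is still available (it has been deprecated in favor of V$DATAGUARD_PROCESS since 12.2, so check what your release offers).

```sql
-- Run on the physical standby.
-- Lists the Data Guard processes (RFS receiving redo, MRP applying it)
-- together with the thread and log sequence each one is working on.
SELECT process,
       status,
       thread#,
       sequence#,
       block#
FROM   v$managed_standby
WHERE  process LIKE 'RFS%'
   OR  process LIKE 'MRP%'
ORDER  BY process, thread#;
```

An MRP0 row with a status of APPLYING_LOG suggests redo apply is active. If no MRP row is returned, managed recovery is not running on that instance.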
2. The Three Protection Modes Explained
Oracle Data Guard offers three protection modes. Each makes a different trade-off between data loss guarantee, performance impact, and availability. Choosing the wrong mode for your workload is one of the most common — and most dangerous — Data Guard mistakes.
Maximum Protection
Every redo record must be written to at least one standby redo log on the standby database before the primary acknowledges the transaction commit. If the standby becomes unreachable, the primary database shuts itself down rather than risk any data divergence.
- Transport: SYNC (synchronous redo shipping)
- Affirm: AFFIRM (standby must confirm the disk write)
- RPO: Zero. No committed transaction is lost when the primary fails, because every commit is already on disk at the standby
- RTO: Fast — standby is fully synchronized at all times
- Risk: Primary shuts down if standby is unreachable — availability depends on standby health
- Use case: Regulatory compliance, core banking, financial settlement systems
A core banking client ran Maximum Protection mode. During a routine standby server reboot for OS patching, the primary database shut itself down — taking the entire production system offline for 18 minutes. The team had forgotten that Maximum Protection means the primary cannot run without the standby. Always maintain a second standby or use Maximum Availability instead for most production workloads.
Maximum Availability
The mode generally recommended for most production systems. Like Maximum Protection, it uses synchronous redo shipping. However, if the standby becomes unreachable, the primary stops waiting for the standby's acknowledgment once the NET_TIMEOUT interval expires and continues running without synchronous protection. When the standby reconnects, it resynchronizes and the primary returns to synchronous mode automatically.
- Transport: SYNC (synchronous when the standby is available)
- Affirm: AFFIRM (confirms the disk write on the standby)
- RPO: Zero under normal conditions; seconds of exposure during fallback
- RTO: Fast — standby synchronized within seconds of primary failure
- Risk: Brief data loss window during the fallback period
- Use case: Enterprise OLTP, e-commerce, healthcare systems
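The Maximum Availability settings above can be sketched as follows for a configuration that is not managed by the Broker. With the Broker enabled, the same change is normally made through DGMGRL instead of setting parameters by hand. The name stby is a hypothetical TNS alias and DB_UNIQUE_NAME, the destination number 2 and the 30-second NET_TIMEOUT are illustrative, and LOG_ARCHIVE_CONFIG is assumed to already list both databases.

```sql
-- Run on the primary. 'stby' is a hypothetical alias / DB_UNIQUE_NAME.

-- 1. Ship redo synchronously and require a disk-write acknowledgment.
--    NET_TIMEOUT (seconds) bounds how long a commit waits for the standby
--    before the primary stops waiting and carries on.
ALTER SYSTEM SET log_archive_dest_2 = 'SERVICE=stby SYNC AFFIRM NET_TIMEOUT=30 VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=stby' SCOPE=BOTH SID='*';

-- 2. Raise the protection mode from the default (Maximum Performance).
ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE AVAILABILITY;

-- 3. Confirm the configured mode and the level currently in effect.
SELECT protection_mode, protection_level FROM v$database;
```

A shorter NET_TIMEOUT limits how long commits stall when the standby disappears, at the cost of dropping out of synchronous protection more readily on a brief network glitch.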
Maximum Performance
The default mode when Data Guard is first configured. Redo is shipped asynchronously — the primary does not wait for the standby to confirm receipt before acknowledging the commit. This provides the best primary performance but at the cost of a potential data loss window.
- Transport: ASYNC (asynchronous redo shipping)
- Affirm: NOAFFIRM (no standby disk write confirmation required)
- RPO: Seconds to minutes depending on network latency and redo volume
- RTO: Fast but standby may lag behind primary
- Risk: Data loss equal to the transport lag at the time of failure (redo the standby had not yet received)
- Use case: Reporting standbys, development/test environments, geographically distant DR sites
3. Redo Transport Deep Dive: SYNC vs ASYNC
The protection mode you choose determines the redo transport mechanism. Understanding the difference between SYNC and ASYNC transport is essential for making the right design choice.
Synchronous Transport (SYNC)
With synchronous transport, the primary ships each redo record to the standby in parallel with the Log Writer (LGWR) write to the local online redo log. The commit does not complete until both the local write and the standby acknowledgment have been received. Each commit therefore waits for roughly one network round trip between primary and standby, plus the standby's redo log write when AFFIRM is set.
For synchronous redo transport, the round-trip network latency between primary and standby should be under 5 ms for OLTP workloads. Beyond 5 ms, commit latency becomes noticeable to applications. Beyond 20 ms, SYNC transport typically causes unacceptable performance degradation. In our production environment with a 120 km dark fiber link, round-trip latency is 1.8 ms — well within acceptable range for Maximum Availability mode.
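To see what synchronous transport costs in practice, the cumulative wait statistics on the primary give a rough picture. This is a sketch under assumptions: the event name 'SYNC Remote Write' is the one used in 12c and later releases (confirm it in V$EVENT_NAME for your version), and the figures are averages since instance startup, so they hide short spikes.

```sql
-- Run on the primary (use gv$system_event to cover all RAC instances).
-- Average wait per event in milliseconds since instance startup.
--   'log file sync'           : total time a committing session waits
--   'log file parallel write' : local online redo log write
--   'SYNC Remote Write'       : wait on the synchronous standby
SELECT event,
       total_waits,
       ROUND(time_waited_micro / NULLIF(total_waits, 0) / 1000, 2) AS avg_wait_ms
FROM   v$system_event
WHERE  event IN ('log file sync',
                 'log file parallel write',
                 'SYNC Remote Write')
ORDER  BY event;
```

If the average for the remote write sits well above the measured network round trip, the extra time is probably being spent on the standby's redo log write rather than on the link.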
Asynchronous Transport (ASYNC)
With asynchronous transport, the primary LGWR writes to the local redo log and acknowledges the commit immediately. A separate background process (NSA — Network Server Async) ships redo to the standby independently. The standby typically lags behind the primary by the amount of redo generated during the network transfer time.
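The lag described above can be observed on the standby. A minimal sketch using V$DATAGUARD_STATS:

```sql
-- Run on the standby.
-- 'transport lag' : redo generated on the primary but not yet received
--                   (this is the potential data loss on failover)
-- 'apply lag'     : redo received but not yet applied
--                   (this affects failover time, not data loss)
SELECT name,
       value,          -- interval, formatted as '+DD HH:MI:SS'
       time_computed,  -- when the metric was last calculated
       datum_time      -- when the standby last received data from the primary
FROM   v$dataguard_stats
WHERE  name IN ('transport lag', 'apply lag');
```

A DATUM_TIME that stops advancing suggests the standby is no longer hearing from the primary, in which case the reported lag values are stale.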
Standby Redo Logs (SRL) — Critical Requirement
Standby Redo Logs (SRLs) are required for real-time apply and for synchronous transport. They must be the same size as the primary online redo logs and there must be at least one more SRL group per thread than the number of online redo log groups on the primary.
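The sizing rule above can be checked with two queries and fixed with one statement. The group number 21 and the 1G size below are hypothetical, and the ADD STANDBY LOGFILE form shown assumes Oracle Managed Files is configured so that no file name is needed. Depending on the release, redo apply may have to be stopped before adding logs on the standby.

```sql
-- Run on the primary: online redo log groups and size per thread.
SELECT thread#,
       COUNT(*)                  AS online_groups,
       MAX(bytes) / 1024 / 1024  AS size_mb
FROM   v$log
GROUP  BY thread#
ORDER  BY thread#;

-- Run on the standby: standby redo log groups and size per thread.
-- Expect online_groups + 1 groups per thread, at the same size.
SELECT thread#,
       COUNT(*)                  AS srl_groups,
       MAX(bytes) / 1024 / 1024  AS size_mb
FROM   v$standby_log
GROUP  BY thread#
ORDER  BY thread#;

-- Run on the standby: add one SRL group (hypothetical group number and
-- size; repeat for each missing group on each thread).
ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 GROUP 21 SIZE 1G;
```

It is also worth creating SRLs on the primary, so that it is ready to receive redo after a role change.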
4. Standby Database Types: Physical vs Logical vs Snapshot
| Type | Apply Method | Open for Reads? | Best Use Case |
|---|---|---|---|
| Physical Standby | Block-for-block redo apply (MRP) | Yes (Active Data Guard, extra license) | Primary DR target, zero data loss |
| Logical Standby | SQL Apply via LogMiner (LSP) | Yes (read/write, some restrictions) | Reporting, DDL testing, heterogeneous environments |
| Snapshot Standby | Redo received and archived, but not applied until converted back to a physical standby | Yes (full read/write) | Testing, QA, patch validation against production data |
5. Fast-Start Failover: Automatic DR Activation
Fast-Start Failover (FSFO) enables Oracle Data Guard to automatically fail over to the standby database without any DBA intervention when the primary becomes unavailable. This is what enabled the 23-second failover described in the introduction.
FSFO Requirements
- Data Guard Broker must be enabled and configured
- Observer process running on a third independent host
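- Protection mode should be Maximum Availability or Maximum Protection for zero data loss failover. The Broker also supports FSFO in Maximum Performance mode, with potential data loss bounded by the FastStartFailoverLagLimit property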
- Standby Redo Logs must be configured on the standby
- Flashback Database recommended for reinstating the old primary
Always run the FSFO observer on a third independent host — not on the primary server and not on the standby server. If both the primary and the observer are on the same network segment and that segment fails, the standby may not receive quorum to activate FSFO. In our production setup, the observer runs on a small VM in a separate availability zone with network access to both primary and standby. Test FSFO every quarter by deliberately shutting down the primary during a low-traffic window.
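Whether these requirements are met at any given moment can be read from V$DATABASE on either database. A minimal sketch:

```sql
-- Run on the primary or the standby.
SELECT fs_failover_status,            -- e.g. SYNCHRONIZED, UNSYNCHRONIZED, DISABLED
       fs_failover_current_target,    -- DB_UNIQUE_NAME of the failover target
       fs_failover_threshold,         -- seconds before failover is initiated
       fs_failover_observer_present,  -- YES if the observer is connected
       fs_failover_observer_host,     -- host the observer runs on
       flashback_on                   -- needed to reinstate the old primary
FROM   v$database;
```

Checking FS_FAILOVER_OBSERVER_HOST is a quick way to confirm the observer really is on the third host and has not been restarted somewhere else. A status other than SYNCHRONIZED in a zero data loss configuration means an automatic failover may not be possible right now.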
6. Active Data Guard: Read-Only Standby While Applying Redo
Active Data Guard (ADG) allows the physical standby database to be open in read-only mode while redo apply continues in the background. This means the standby can serve reporting queries, offload read traffic from the primary, and still be ready to fail over at any moment.
Active Data Guard requires the Oracle Active Data Guard option, a separately licensed Enterprise Edition add-on. Opening a physical standby read-only with redo apply stopped does not require the ADG license. Running redo apply while the standby is open read-only does. Always verify licensing before enabling ADG in production.
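Because the licensing boundary depends on how the standby is opened, it helps to check the current state before assuming which side of it you are on. A minimal sketch:

```sql
-- Run on the standby.
-- open_mode values of interest:
--   MOUNTED              : not open, redo apply only
--   READ ONLY            : open read-only, redo apply stopped
--   READ ONLY WITH APPLY : open read-only with redo apply running
--                          (this is Active Data Guard usage)
SELECT database_role,
       open_mode
FROM   v$database;
```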
7. Data Guard Broker: Centralized Configuration Management
Data Guard Broker provides a unified interface for managing the entire Data Guard configuration. Without Broker, you manage each database individually through the LOG_ARCHIVE_DEST_n parameters and manual commands. With Broker, you manage the entire configuration through a single DGMGRL command-line interface.
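Before using DGMGRL, the Broker has to be started on every database in the configuration. The following sketch checks and sets the relevant initialization parameters. On RAC the two configuration files are assumed to sit on shared storage, and their locations are site-specific, so they are only queried here and not set.

```sql
-- Run on each database in the configuration.
-- Shows whether the Broker (DMON) is started and where it keeps
-- its configuration files.
SELECT name,
       value
FROM   v$parameter
WHERE  name IN ('dg_broker_start',
                'dg_broker_config_file1',
                'dg_broker_config_file2')
ORDER  BY name;

-- Start the Broker on all instances if it is not already running.
ALTER SYSTEM SET dg_broker_start = TRUE SCOPE=BOTH SID='*';
```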
8. Real Production Failover: What Actually Happens
These are real Data Guard incidents from production environments and the lessons they produced.
A financial services client had Data Guard configured but in Maximum Performance mode (the default). They believed they had zero data loss protection. During a primary storage failure, the standby was 47 seconds behind the primary due to ASYNC transport lag. 47 seconds of financial transactions were permanently lost.
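Lesson: Always verify the protection mode actually in effect with SELECT protection_mode, protection_level FROM v$database. The protection_mode column shows the configured mode; protection_level shows the level currently in effect, and it differs from protection_mode (for example, RESYNCHRONIZATION) when a synchronous standby has fallen out of sync. Transport attributes (SYNC/ASYNC, AFFIRM) are visible separately in v$archive_dest.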
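A slightly fuller version of that check, suitable for a monitoring script, might look like the sketch below. The wording of the CASE labels is illustrative.

```sql
-- Run on the primary.
-- 1. Configured mode versus the level currently in effect.
SELECT protection_mode,
       protection_level,
       CASE
         WHEN protection_mode = protection_level THEN 'OK'
         ELSE 'DEGRADED: effective level differs from configured mode'
       END AS protection_check
FROM   v$database;

-- 2. Transport attributes actually set on each standby destination.
SELECT dest_id,
       db_unique_name,
       transmit_mode,   -- e.g. SYNCHRONOUS, PARALLELSYNC, ASYNCHRONOUS
       affirm,          -- YES or NO
       status
FROM   v$archive_dest
WHERE  target = 'STANDBY'
AND    status <> 'INACTIVE';
```

A Maximum Performance configuration passes the first check trivially, since mode and level match. The second query is what reveals that transport is asynchronous.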
A team set up Data Guard correctly — or so they thought. Failover testing revealed the standby was 8 hours behind the primary. Investigation showed Standby Redo Logs had never been created. Without SRLs, real-time apply cannot function and the standby only catches up during log archive shipping, leaving a massive gap.
Lesson: Always verify SRLs exist and are active: SELECT * FROM v$standby_log. If this view is empty, real-time apply is not running.
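The same condition can also be spotted from the primary, without logging in to the standby. This is a sketch using V$ARCHIVE_DEST_STATUS. The exact RECOVERY_MODE strings vary somewhat between releases, so treat the values named in the comments as examples.

```sql
-- Run on the primary.
SELECT dest_id,
       db_unique_name,
       status,
       recovery_mode,          -- e.g. MANAGED REAL TIME APPLY when SRLs are in use
       standby_logfile_count,  -- 0 means no SRLs exist on that standby
       gap_status              -- e.g. NO GAP
FROM   v$archive_dest_status
WHERE  type = 'PHYSICAL';
```

A STANDBY_LOGFILE_COUNT of 0, or a RECOVERY_MODE that does not mention real-time apply, points to the misconfiguration described in this incident.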
Configuration that made it work: Maximum Availability mode, SYNC/AFFIRM transport on 1.8 ms latency link, Standby Redo Logs sized identically to primary, FSFO enabled with threshold of 15 seconds, observer on independent host, Flashback Database enabled on both primary and standby, and quarterly failover testing under full production load.
Key lesson: Fast-Start Failover only works reliably when every component is validated together under real load — not just configured and assumed to work.
9. Protection Mode Decision Framework
Use this framework to choose the right protection mode for your workload.
Choose Maximum Protection if ALL of these are true:
- Regulatory zero data loss is a hard legal requirement (banking, finance, government)
- You have a second standby or can tolerate primary shutdown if standby fails
- Network latency between primary and standby is under 2 ms
- Your application can tolerate slightly higher commit latency
Choose Maximum Availability if ANY of these are true:
- You need near-zero data loss but cannot accept primary shutdown risk
- Network latency is under 5 ms and workload is standard OLTP
- You want Fast-Start Failover with automatic activation
- This is your primary DR configuration for an enterprise system
Choose Maximum Performance if ANY of these are true:
- This standby is for reporting or dev/test only — not primary DR
- The standby is geographically distant (WAN latency > 10 ms)
- Some data loss is acceptable and primary performance is the priority
- You are running a second standby in addition to a sync standby
10. FAQ
Failover is an unplanned activation of the standby when the primary is unavailable. Data loss depends on the protection mode in effect at the time. The old primary must be reinstated (using Flashback Database) before it can rejoin as a standby. Fast-Start Failover automates this process.
Data Guard replicates every change made on the primary, including logical errors such as an accidental DROP TABLE or DELETE without WHERE. The corruption is applied to the standby within seconds. For protection against logical corruption, use Flashback Database (which can rewind both primary and standby to a point before the error) or maintain a time-delayed standby using the DELAY attribute of LOG_ARCHIVE_DEST_n.
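A time-delayed standby of the kind mentioned above could be set up roughly as follows in a configuration that is not managed by the Broker. The name stby_delay, the destination number 3 and the 240-minute delay are hypothetical. Because real-time apply ignores the DELAY attribute, redo apply on the delayed standby is started from archived logs.

```sql
-- Run on the primary. 'stby_delay' is a hypothetical alias / DB_UNIQUE_NAME.
-- DELAY is expressed in minutes: redo is shipped immediately but is not
-- applied until it is at least that old.
ALTER SYSTEM SET log_archive_dest_3 = 'SERVICE=stby_delay ASYNC NOAFFIRM DELAY=240 VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=stby_delay' SCOPE=BOTH SID='*';

-- Run on the delayed standby: apply from archived logs so that the
-- delay is honored.
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING ARCHIVED LOGFILE DISCONNECT;

-- Run on any database: confirm Flashback Database is enabled.
SELECT flashback_on FROM v$database;
```

The delayed standby still receives redo immediately, so it does not add to the data loss exposure. It only postpones apply, leaving a window in which a logical error can be caught before it reaches that copy.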