Showing posts with label Oracle 23ai. Show all posts

Monday, March 16, 2026

Oracle Database 23ai Architecture: AI-Native Internals for DBAs

What Actually Changed Inside the Engine — Beyond the Marketing
16 March 2026
Chetan Yadav — Senior Oracle & Cloud DBA
⏱️ 11–12 min read
AI Vector Search • Select AI • JSON Duality • True Cache • SQL Domains • Lock-Free Reservations
Oracle Database 23ai complete AI-native architecture diagram showing AI layer, SQL engine, security, core engine, HA/DR and connectivity
⚙️ Environment Used in This Article

Oracle Database: 23ai (23.4) Enterprise Edition  •  Platform: Oracle Linux 8.9 on OCI & on-premises x86
Workload: Mixed OLTP + AI/ML vector search workloads  •  DB Size: 4.2 TB
New Features Tested: AI Vector Search, Select AI, JSON Duality Views, True Cache, SQL Domains, Lock-Free Reservations, Boolean datatype, Schema Privileges

Oracle called it "23ai" for a reason. This is not a routine point release with incremental improvements. The AI suffix signals a deliberate architectural shift — Oracle has embedded AI capabilities directly into the database engine itself, not as an external add-on or middleware layer.

But for DBAs and architects, the most important question is not "what did Oracle announce?" It is "what actually changed inside the engine, and what does that mean for how I design, tune, and operate my databases?" The marketing materials will tell you Oracle 23ai is revolutionary. This guide will tell you which features are genuinely production-ready, which still need to mature, and what the internal architecture changes mean for your workload.

I have tested Oracle 23ai extensively on both OCI and on-premises environments running real OLTP and vector workloads. This guide reflects what I found in practice, not what the release notes promise.

Thursday, March 12, 2026

Oracle Data Guard Protection Modes: Zero Data Loss Design Strategy

Production HA/DR Architecture for DBAs and Architects
12 March 2026
Chetan Yadav — Senior Oracle & Cloud DBA
⏱️ 12–13 min read
Maximum Protection • Maximum Availability • Maximum Performance — Choose Wisely
Oracle Data Guard Protection Modes architecture diagram showing Primary, Standby, GCS, GES, redo transport and RPO RTO comparison
⚙️ Production Environment Referenced in This Article

Oracle Database: 19.18.0.0.0 & 23ai Enterprise Edition  •  Data Guard Broker: Enabled with Fast-Start Failover
Primary: 2-Node RAC, 6 TB OLTP  •  Standby: Physical Standby, separate data center (120 km)
Network: Dedicated 10 GbE WAN link  •  RTO Target: < 30 seconds  •  RPO Target: Zero data loss
Application: Core banking transaction processing system

2:14 AM. A storage array at the primary data center suffered a catastrophic controller failure. The entire primary RAC cluster went offline. No warning. No graceful shutdown. Just gone.

Within 23 seconds, Oracle Data Guard Fast-Start Failover had detected the outage, confirmed quorum with the observer, and automatically activated the physical standby database. The application connection pool reconnected. Transactions resumed. The on-call team received the alert after the failover had already completed.

That outcome — 23 seconds, zero data loss, no manual intervention — was not luck. It was the result of choosing the right Data Guard protection mode, designing the redo transport correctly, and validating failover under load before the incident happened. Most organizations running Data Guard never validate their failover until disaster strikes. By then it is too late to discover the configuration was wrong.

This guide explains Oracle Data Guard protection modes in depth: what each mode actually guarantees, how redo transport works under the hood, when to use each mode, and the production design decisions that determine whether your standby database saves you or fails you at the worst possible moment.

1. Data Guard Architecture: How It Actually Works

Oracle Data Guard maintains one or more synchronized copies of a primary database called standby databases. When the primary fails, a standby can take over as the new primary — either automatically via Fast-Start Failover or manually via a DBA-initiated failover command.

The core mechanism is redo log shipping. Every change made to the primary database is recorded in redo logs. Data Guard ships those redo records to the standby in real time, where they are applied to keep the standby synchronized.

Key Data Guard Components

Component | Location | Role
LGWR | Primary | Writes redo locally and triggers shipping to the standby
NSA / NSS Process | Primary | Network Server process, sends redo over the network
RFS Process | Standby | Remote File Server, receives redo from primary
MRP Process | Standby | Managed Recovery Process, applies redo to standby
LSP Process | Standby (logical) | LogMiner Server Process, applies SQL for logical standby
DMON Process | Both | Data Guard Monitor, manages broker configuration
Observer | Third site | Monitors primary for Fast-Start Failover quorum
SQL — Verify Data Guard Configuration Status
-- Check Data Guard status on primary
SELECT name, db_unique_name, database_role, protection_mode,
       protection_level, switchover_status, dataguard_broker
FROM   v$database;

-- Check redo transport status to all standbys
SELECT dest_id, dest_name, status, target, archiver, schedule,
       destination, applied_scn, error
FROM   v$archive_dest
WHERE  status != 'INACTIVE'
ORDER  BY dest_id;

-- Check standby apply lag
SELECT name, value, datum_time
FROM   v$dataguard_stats
WHERE  name IN ('transport lag', 'apply lag', 'apply finish time')
ORDER  BY name;

2. The Three Protection Modes Explained

Oracle Data Guard offers three protection modes. Each makes a different trade-off between data loss guarantee, performance impact, and availability. Choosing the wrong mode for your workload is one of the most common — and most dangerous — Data Guard mistakes.

Maximum Protection

Every redo record must be written to at least one standby redo log on the standby database before the primary acknowledges the transaction commit. If the standby becomes unreachable, the primary database shuts itself down rather than risk any data divergence.

  • Transport: SYNC — synchronous redo shipping
  • Affirm: AFFIRM — standby must confirm disk write
  • RPO: Zero — absolutely no data loss under any failure scenario
  • RTO: Fast — standby is fully synchronized at all times
  • Risk: Primary shuts down if standby is unreachable — availability depends on standby health
  • Use case: Regulatory compliance, core banking, financial settlement systems
Production Reality — Maximum Protection:

A core banking client ran Maximum Protection mode. During a routine standby server reboot for OS patching, the primary database shut itself down — taking the entire production system offline for 18 minutes. The team had forgotten that Maximum Protection means the primary cannot run without the standby. Always maintain a second standby or use Maximum Availability instead for most production workloads.

Maximum Availability

The mode Oracle recommends for most production systems. Like Maximum Protection, it uses synchronous redo shipping. However, if the standby becomes unreachable, the primary automatically falls back to asynchronous mode and continues running. When the standby reconnects, it resynchronizes and the primary returns to synchronous mode automatically.

  • Transport: SYNC — synchronous when standby is available
  • Affirm: AFFIRM — confirms disk write on standby
  • RPO: Zero under normal conditions; seconds of exposure during fallback
  • RTO: Fast — standby synchronized within seconds of primary failure
  • Risk: Brief data loss window during the fallback period
  • Use case: Enterprise OLTP, e-commerce, healthcare systems

Maximum Performance

The default mode when Data Guard is first configured. Redo is shipped asynchronously — the primary does not wait for the standby to confirm receipt before acknowledging the commit. This provides the best primary performance but at the cost of a potential data loss window.

  • Transport: ASYNC — asynchronous redo shipping
  • Affirm: NOAFFIRM — no standby disk write confirmation required
  • RPO: Seconds to minutes depending on network latency and redo volume
  • RTO: Fast but standby may lag behind primary
  • Risk: Data loss equal to the apply lag at time of failure
  • Use case: Reporting standbys, development/test environments, geographically distant DR sites
SQL — Change Protection Mode (via Data Guard Broker)
-- Connect to DGMGRL as SYSDG or SYSDBA
-- dgmgrl sys/password@primary

-- Check current protection mode
SHOW CONFIGURATION;
SHOW DATABASE VERBOSE primary_db;

-- Change to Maximum Availability (recommended for most production)
EDIT CONFIGURATION SET PROTECTION MODE AS MAXAVAILABILITY;

-- Change to Maximum Protection (zero data loss, primary shuts down without standby)
EDIT CONFIGURATION SET PROTECTION MODE AS MAXPROTECTION;

-- Change to Maximum Performance (async, best performance)
EDIT CONFIGURATION SET PROTECTION MODE AS MAXPERFORMANCE;

-- Verify after change
SHOW CONFIGURATION;

3. Redo Transport Deep Dive: SYNC vs ASYNC

The protection mode you choose determines the redo transport mechanism. Understanding the difference between SYNC and ASYNC transport is essential for making the right design choice.

Synchronous Transport (SYNC)

With synchronous transport, redo is sent to the standby in parallel with the local online redo log write: the Log Writer (LGWR) writes locally while the NSS network server process ships the same redo to the standby. The commit does not complete until both the local write and the standby acknowledgment have been received, so commit latency increases by roughly the round-trip time between primary and standby.

Network Latency Rule for SYNC Transport

For synchronous redo transport, the round-trip network latency between primary and standby should be under 5 ms for OLTP workloads. Beyond 5 ms, commit latency becomes noticeable to applications. Beyond 20 ms, SYNC transport typically causes unacceptable performance degradation. In our production environment with a 120 km dark fiber link, round-trip latency is 1.8 ms — well within acceptable range for Maximum Availability mode.
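To make the latency rule concrete, here is a back-of-envelope sketch of SYNC commit latency. The function name and all input figures (local redo write time, standby write time) are illustrative assumptions, not measured Oracle internals; the point is only that once the remote path exceeds the local one, commit latency is dominated by network round-trip time.

```python
# Rough model: under SYNC transport the local redo write and the network
# send happen in parallel, so a commit waits for the slower of the two
# paths. The remote path is the round trip plus the standby's redo-log
# write (required by AFFIRM). All millisecond figures are assumptions.

def sync_commit_latency_ms(local_write_ms, rtt_ms, standby_write_ms):
    """Approximate commit latency: the slower of local vs remote path."""
    remote_path_ms = rtt_ms + standby_write_ms
    return max(local_write_ms, remote_path_ms)

# The article's 120 km dark fiber link: ~1.8 ms round trip
print(sync_commit_latency_ms(local_write_ms=1.0, rtt_ms=1.8, standby_write_ms=0.5))

# A 20 ms WAN link: the network now dominates every commit
print(sync_commit_latency_ms(local_write_ms=1.0, rtt_ms=20.0, standby_write_ms=0.5))
```

This is why the 5 ms guideline exists: past that point, every single-row OLTP commit pays a network tax larger than its own I/O.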

Asynchronous Transport (ASYNC)

With asynchronous transport, the primary LGWR writes to the local redo log and acknowledges the commit immediately. A separate background process (NSA — Network Server Async) ships redo to the standby independently. The standby typically lags behind the primary by the amount of redo generated during the network transfer time.
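The ASYNC data-loss window can be sketched the same way: exposure at failure time is roughly the redo generated during the transport lag. The helper name and redo-rate figures below are illustrative assumptions, not Oracle-documented values.

```python
# Rough RPO estimate for ASYNC transport: the redo that has not yet
# reached the standby at the moment of primary failure is lost.
# Redo rates and lag values are illustrative assumptions.

def async_exposure_mb(redo_rate_mb_per_s, transport_lag_s):
    """Unshipped redo (MB) at the moment of primary failure."""
    return redo_rate_mb_per_s * transport_lag_s

# 10 MB/s redo rate with a 5-second transport lag
print(async_exposure_mb(10, 5))   # 50

# Same rate with a 47-second lag, the scale of the incident in section 8
print(async_exposure_mb(10, 47))  # 470
```

Monitoring 'transport lag' in v$dataguard_stats tells you, in effect, the first input to this calculation at any moment.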

SQL — Configure Redo Transport (LOG_ARCHIVE_DEST)
-- Maximum Availability: SYNC transport with AFFIRM
ALTER SYSTEM SET log_archive_dest_2 =
  'SERVICE=standby_db SYNC AFFIRM NET_TIMEOUT=30
   VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE)
   DB_UNIQUE_NAME=standby_db'
  SCOPE=BOTH SID='*';

-- Maximum Performance: ASYNC transport with NOAFFIRM
ALTER SYSTEM SET log_archive_dest_2 =
  'SERVICE=standby_db ASYNC NOAFFIRM NET_TIMEOUT=30
   VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE)
   DB_UNIQUE_NAME=standby_db'
  SCOPE=BOTH SID='*';

-- Verify redo transport configuration
SELECT dest_id, dest_name, target, archiver, net_timeout,
       affirm, async_blocks, status
FROM   v$archive_dest
WHERE  dest_id = 2;

Standby Redo Logs (SRL) — Critical Requirement

Standby Redo Logs (SRLs) are required for real-time apply and for synchronous transport. They must be the same size as the primary online redo logs and there must be at least one more SRL group per thread than the number of online redo log groups on the primary.
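The counting rule above is simple enough to encode. A minimal sketch, with illustrative function and parameter names, for working out how many SRL groups a RAC primary needs:

```python
# SRL sizing rule from the text: per redo thread, the standby needs
# (online redo log groups + 1) standby redo log groups, each the same
# size as the primary's online redo logs. Names here are illustrative.

def srl_groups_needed(online_groups_per_thread, threads):
    """Return (SRL groups per thread, total SRL groups)."""
    per_thread = online_groups_per_thread + 1
    return per_thread, per_thread * threads

# The article's 2-node RAC primary with 3 online redo groups per thread
per_thread, total = srl_groups_needed(online_groups_per_thread=3, threads=2)
print(per_thread, total)  # 4 8
```

So the ADD STANDBY LOGFILE block below creates 4 groups for thread 1; a matching set of 4 is needed for thread 2 on a 2-node RAC primary.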

SQL — Add Standby Redo Logs on Standby Database
-- Check existing online redo log sizes on primary
SELECT group#, members, bytes/1024/1024 AS size_mb, status
FROM   v$log
ORDER  BY group#;

-- Add standby redo logs on standby (same size, one extra group per thread)
-- For a 2-node RAC primary with 3 redo groups per thread:
-- Add 4 SRL groups per thread (3 + 1 = 4)
ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 GROUP 11 ('+DATA/stdby/srl_t1_g11.log') SIZE 200M;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 GROUP 12 ('+DATA/stdby/srl_t1_g12.log') SIZE 200M;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 GROUP 13 ('+DATA/stdby/srl_t1_g13.log') SIZE 200M;
ALTER DATABASE ADD STANDBY LOGFILE THREAD 1 GROUP 14 ('+DATA/stdby/srl_t1_g14.log') SIZE 200M;

-- Verify SRL configuration
SELECT group#, thread#, sequence#, bytes/1024/1024 AS size_mb, archived, status
FROM   v$standby_log
ORDER  BY thread#, group#;

4. Standby Database Types: Physical vs Logical vs Snapshot

Type | Apply Method | Open for Reads? | Best Use Case
Physical Standby | Block-for-block redo apply (MRP) | Yes (Active Data Guard, extra license) | Primary DR target, zero data loss
Logical Standby | SQL Apply via LogMiner (LSP) | Yes (read/write, some restrictions) | Reporting, DDL testing, heterogeneous environments
Snapshot Standby | Redo buffered but not applied | Yes (full read/write) | Testing, QA, patch validation against production data
DGMGRL — Convert Physical Standby to Snapshot Standby
-- Convert to snapshot standby (opens for read/write, buffers redo)
CONVERT DATABASE standby_db TO SNAPSHOT STANDBY;

-- Verify conversion
SHOW DATABASE standby_db;

-- Convert back to physical standby (discards all changes, resumes apply)
CONVERT DATABASE standby_db TO PHYSICAL STANDBY;

5. Fast-Start Failover: Automatic DR Activation

Fast-Start Failover (FSFO) enables Oracle Data Guard to automatically fail over to the standby database without any DBA intervention when the primary becomes unavailable. This is what enabled the 23-second failover described in the introduction.

FSFO Requirements

  • Data Guard Broker must be enabled and configured
  • Observer process running on a third independent host
  • Protection mode must be Maximum Availability or Maximum Protection
  • Standby Redo Logs must be configured on the standby
  • Flashback Database recommended for reinstating the old primary
DGMGRL — Configure Fast-Start Failover
-- Enable Fast-Start Failover
ENABLE FAST_START FAILOVER;

-- Set failover threshold (seconds primary must be unreachable before failover)
EDIT CONFIGURATION SET PROPERTY FastStartFailoverThreshold = 30;

-- Set lag limit (max apply lag allowed when FSFO is enabled)
EDIT CONFIGURATION SET PROPERTY FastStartFailoverLagLimit = 30;

-- Start the observer (run on a third independent host)
-- dgmgrl sys/password@primary
START OBSERVER;

-- Verify FSFO configuration
SHOW FAST_START FAILOVER;
SHOW CONFIGURATION VERBOSE;

-- Manual failover for validation testing. Note: this is a real role
-- transition, not a simulation; reinstate the old primary afterward
-- (Flashback Database makes reinstatement possible).
FAILOVER TO standby_db;
Production FSFO Design Tip:

Always run the FSFO observer on a third independent host — not on the primary server and not on the standby server. If both the primary and the observer are on the same network segment and that segment fails, the standby may not receive quorum to activate FSFO. In our production setup, the observer runs on a small VM in a separate availability zone with network access to both primary and standby. Test FSFO every quarter by deliberately shutting down the primary during a low-traffic window.

6. Active Data Guard: Read-Only Standby While Applying Redo

Active Data Guard (ADG) allows the physical standby database to be open in read-only mode while redo apply continues in the background. This means the standby can serve reporting queries, offload read traffic from the primary, and still be ready to fail over at any moment.

Licensing Note

Active Data Guard requires the Oracle Active Data Guard option, which is a separately licensed Enterprise Edition add-on. A physical standby open in read-only mode without redo apply is available without ADG license. Real-time redo apply while open read-only requires the ADG license. Always verify licensing before enabling ADG in production.
SQL — Enable Active Data Guard (Real-Time Apply + Read-Only)
-- On standby: stop redo apply before opening read-only
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;

-- Open read-only, then restart real-time apply (requires ADG license)
ALTER DATABASE OPEN READ ONLY;
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE
  USING CURRENT LOGFILE DISCONNECT FROM SESSION;

-- Verify ADG status
SELECT open_mode, database_role, db_unique_name FROM v$database;

-- Monitor apply lag on ADG standby
SELECT name, value, unit, time_computed
FROM   v$dataguard_stats
WHERE  name IN ('transport lag', 'apply lag',
                'apply finish time', 'estimated startup time')
ORDER  BY name;

-- Check active sessions using standby for reads
SELECT COUNT(*) AS read_sessions, module, action
FROM   gv$session
WHERE  status = 'ACTIVE' AND type = 'USER'
GROUP  BY module, action
ORDER  BY read_sessions DESC
FETCH  FIRST 10 ROWS ONLY;

7. Data Guard Broker: Centralized Configuration Management

Data Guard Broker provides a unified interface for managing the entire Data Guard configuration. Without Broker, you manage each database individually through LOG_ARCHIVE_DEST parameters and manual commands. With Broker, you manage the entire configuration through a single DGMGRL command-line interface.

SQL & DGMGRL — Data Guard Broker Setup and Health Check
-- Enable Data Guard Broker on both primary and standby
ALTER SYSTEM SET dg_broker_start = TRUE SCOPE=BOTH SID='*';

-- Connect to broker (run on primary or standby)
-- dgmgrl /

-- Create broker configuration
CREATE CONFIGURATION dg_config AS
  PRIMARY DATABASE IS primary_db
  CONNECT IDENTIFIER IS primary_db;

-- Add standby database to configuration
ADD DATABASE standby_db AS
  CONNECT IDENTIFIER IS standby_db
  MAINTAINED AS PHYSICAL;

-- Enable the configuration
ENABLE CONFIGURATION;

-- Full health check
SHOW CONFIGURATION;
SHOW DATABASE VERBOSE primary_db;
SHOW DATABASE VERBOSE standby_db;

-- Validate configuration (checks for issues proactively)
VALIDATE DATABASE primary_db;
VALIDATE DATABASE standby_db;
DGMGRL — Switchover (Planned Role Transition)
-- Switchover: planned, zero data loss, reversible
-- Use for maintenance, patching, or DR testing

-- Verify configuration is ready for switchover
VALIDATE DATABASE primary_db;
SHOW CONFIGURATION;

-- Initiate switchover to standby
SWITCHOVER TO standby_db;

-- After switchover: verify new primary is active
SHOW CONFIGURATION;
SHOW DATABASE VERBOSE standby_db;  -- now primary
SHOW DATABASE VERBOSE primary_db;  -- now standby

-- Switchover back when maintenance is complete
SWITCHOVER TO primary_db;

8. Real Production Failover: What Actually Happens

These are real Data Guard incidents from production environments and the lessons they produced.

Incident 1: Protection Mode Mismatch Caused Silent Data Loss

A financial services client had Data Guard configured but in Maximum Performance mode (the default). They believed they had zero data loss protection. During a primary storage failure, the standby was 47 seconds behind the primary due to ASYNC transport lag. 47 seconds of financial transactions were permanently lost.

Lesson: Always verify protection mode AND transport mode with SELECT protection_mode, protection_level FROM v$database. The protection_level column shows the actual effective mode — it differs from protection_mode when the standby has fallen back to async.
Incident 2: Standby Redo Logs Not Configured — Real-Time Apply Silently Disabled

A team set up Data Guard correctly — or so they thought. Failover testing revealed the standby was 8 hours behind the primary. Investigation showed Standby Redo Logs had never been created. Without SRLs, real-time apply cannot function and the standby only catches up during log archive shipping, leaving a massive gap.

Lesson: Always verify SRLs exist and are active: SELECT * FROM v$standby_log. If this view is empty, real-time apply is not running.
Success: 23-Second Automatic Failover (The Introduction Story)

Configuration that made it work: Maximum Availability mode, SYNC/AFFIRM transport on 1.8 ms latency link, Standby Redo Logs sized identically to primary, FSFO enabled with threshold of 15 seconds, observer on independent host, Flashback Database enabled on both primary and standby, and quarterly failover testing under full production load.

Key lesson: Fast-Start Failover only works reliably when every component is validated together under real load — not just configured and assumed to work.
SQL — Data Guard Health Check Queries
-- 1. Verify effective protection mode (must match intended mode)
SELECT name, db_unique_name, database_role,
       protection_mode,   -- configured mode
       protection_level   -- actual effective mode (may differ)
FROM   v$database;

-- 2. Check transport lag and apply lag
SELECT name, value, datum_time
FROM   v$dataguard_stats
WHERE  name IN ('transport lag', 'apply lag')
ORDER  BY name;

-- 3. Verify standby redo logs exist and are active
SELECT group#, thread#, sequence#, bytes/1024/1024 AS size_mb, archived, status
FROM   v$standby_log
ORDER  BY thread#, group#;

-- 4. Check MRP (apply) and RFS processes are running on standby
SELECT process, status, thread#, sequence#, block#, blocks
FROM   v$managed_standby
WHERE  process LIKE 'MRP%' OR process LIKE 'RFS%'
ORDER  BY process;

9. Protection Mode Decision Framework

Use this framework to choose the right protection mode for your workload.

Choose Maximum Protection if ALL of these are true:

  • Regulatory zero data loss is a hard legal requirement (banking, finance, government)
  • You have a second standby or can tolerate primary shutdown if standby fails
  • Network latency between primary and standby is under 2 ms
  • Your application can tolerate slightly higher commit latency

Choose Maximum Availability if ANY of these are true:

  • You need near-zero data loss but cannot accept primary shutdown risk
  • Network latency is under 5 ms and workload is standard OLTP
  • You want Fast-Start Failover with automatic activation
  • This is your primary DR configuration for an enterprise system

Choose Maximum Performance if ANY of these are true:

  • This standby is for reporting or dev/test only — not primary DR
  • The standby is geographically distant (WAN latency > 10 ms)
  • Some data loss is acceptable and primary performance is the priority
  • You are running a second standby in addition to a sync standby

10. FAQ

What is the difference between Switchover and Failover?
Switchover is a planned, graceful role transition. Both primary and standby participate cooperatively. There is zero data loss and the operation is fully reversible. Use switchover for planned maintenance, patching, or DR testing.

Failover is an unplanned activation of the standby when the primary is unavailable. Data loss depends on the protection mode in effect at the time. The old primary must be reinstated (using Flashback Database) before it can rejoin as a standby. Fast-Start Failover automates this process.
Does Data Guard protect against logical corruption?
No. Data Guard replicates every redo record including those caused by user errors like accidental DROP TABLE or DELETE without WHERE. The corruption is applied to the standby within seconds. For protection against logical corruption, use Flashback Database (which can rewind both primary and standby to a point before the error) or maintain a time-delayed standby using the DELAY parameter in LOG_ARCHIVE_DEST.
Can Data Guard work with Oracle RAC on the primary?
Yes. RAC primary with a single-instance physical standby is a very common and fully supported architecture. The standby receives redo from all RAC threads and applies them in order. The standby can also be a RAC database for maximum availability on both sides. This RAC + Data Guard combination is called MAA (Maximum Availability Architecture) — Oracle's reference architecture for mission-critical systems.
How often should I test failover?
At minimum, quarterly — and ideally monthly for critical systems. Test under realistic load, not during off-peak hours with zero activity. A failover that works at 2 AM with no users connected may fail at 2 PM under full load due to session cleanup issues, long-running transactions, or application reconnection pool exhaustion. Document RTO measurements from every test and track them over time.
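The "document RTO measurements from every test and track them over time" advice can be sketched as a trivial check over recorded test results. Function names and the data values are illustrative assumptions; the target comes from the article's sub-30-second RTO goal.

```python
# Track measured RTO from each failover test and flag any breach of
# the target before a real disaster reveals it. Data is illustrative.

RTO_TARGET_S = 30  # the article's < 30 second RTO target

def rto_breaches(test_results, target_s=RTO_TARGET_S):
    """Return (test label, measured RTO) pairs that exceeded the target."""
    return [(when, rto) for when, rto in test_results if rto > target_s]

tests = [("2025-Q1", 23), ("2025-Q2", 21), ("2025-Q3", 41), ("2025-Q4", 24)]
print(rto_breaches(tests))  # [('2025-Q3', 41)]
```

A single breached quarter, caught here, is a prompt to investigate session cleanup, long-running transactions, or connection pool behavior before the next unplanned failover.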
Should I mention Data Guard on my resume?
Absolutely — with specifics. Instead of "Oracle Data Guard experience," write: "Designed and maintained Oracle Data Guard Maximum Availability configuration with Fast-Start Failover for a 6 TB core banking database. Achieved RTO of 23 seconds and RPO of zero through SYNC/AFFIRM transport on dedicated 10 GbE WAN link. Conducted quarterly failover validation under production load." Metrics and architecture decisions demonstrate real expertise.

About the Author

Chetan Yadav

Chetan Yadav is a Senior Oracle, PostgreSQL, MySQL, and Cloud DBA with 15+ years of hands-on experience managing production databases across on-premises, hybrid, and cloud environments. He specializes in high availability architecture, performance tuning, disaster recovery, and database migrations.

Throughout his career, Chetan has designed and implemented Oracle Data Guard configurations for mission-critical systems in finance, healthcare, and e-commerce sectors. He has architected zero data loss DR solutions with sub-30-second RTO targets and validated them through real failover testing under production load.

This blog focuses on real-world DBA problems, career growth, and practical learning — not theoretical documentation or vendor marketing.

Monday, March 2, 2026

Oracle Performance Engineering Guide: AWR, ASH and SQL Monitor in 19c and 23ai

Master the essential tools for diagnosing and resolving real-world performance issues

Date: Tuesday, March 10, 2026
Author: Chetan Yadav
Read Time: 20-24 minutes
Oracle Performance Diagnostics workflow showing data flow from Applications through Oracle Database Instance to Active Session History, AWR Repository, SQL Monitor and finally to DBA Performance Analysis
Real-World Performance Diagnostics - From Baseline Metrics to SQL Execution Plans

Production Environment Context

Oracle Database 19.18.21 with Oracle Grid Infrastructure 19.18 | 3-Node RAC Cluster | 8.5 TB OLTP Database | 3,200+ concurrent sessions | Peak TPS: 2,100 | 24/7 mission-critical

It's 3 AM on a Tuesday. The monitoring dashboard lights up red. Response times have jumped from 200ms to 5+ seconds. Users are reporting timeouts on critical batch jobs. Your manager's Slack message is already waiting: "Database issue?" You log into the database, check CPU utilization (45%), memory (78% used), disk I/O latency (120ms). Everything looks elevated but not catastrophically bad. Where do you even start investigating?

This is where AWR (Automatic Workload Repository), ASH (Active Session History), and SQL Monitor become your diagnostic lifeline. Over the past 15+ years managing large-scale Oracle databases in production, I have debugged thousands of performance incidents—from runaway SQL queries consuming 800GB of I/O in 10 minutes, to massive lock contention blocking 400+ sessions, to redo log I/O stalls freezing the entire database. These three tools have never failed me. They transform performance troubleshooting from educated guessing into data-driven root cause analysis.

In this guide, you will learn the exact methodology I use in production: how to leverage AWR, ASH, and SQL Monitor to identify root causes in under 15 minutes, pinpoint the exact problematic SQL statements, analyze their execution plans, and implement fixes. Whether you're managing a 3-node RAC cluster running Oracle 19c or a cloud-native 23ai environment, the diagnostic principles remain constant.

Let's dig into real production scenarios and techniques.

Monday, February 16, 2026

Automating Backup and Restore in Oracle 19c and 23ai: Complete DBA Guide

Production-Tested RMAN Automation Scripts and Recovery Strategies
📅 February 05, 2026
👤 Chetan Yadav - Senior Oracle & Cloud DBA
⏱️ 18-20 min read
💾 Oracle RMAN Automation - From Manual Backups to Fully Automated Recovery

At 3 AM on a Tuesday, our production Oracle 19c database crashed. Corrupted datafile. The application team was screaming. The CTO was on the call. Everyone looked at me.

I typed one command: ./restore_prod.sh PRODDB 2026-02-04_23:00. Twenty-three minutes later, the database was back online with zero data loss. The automated backup and restore framework I'd built six months earlier just saved our jobs.

Automated backup systems and data storage visualization representing Oracle RMAN backup automation and disaster recovery infrastructure

Manual backup and restore processes are where Oracle DBAs lose the most time. Automating RMAN backups isn't about convenience—it's about reliability, consistency, and being able to restore in minutes instead of hours when production is down.

This guide covers production-tested automation frameworks for Oracle 19c and 23ai. If you're still running manual RMAN scripts or struggling with backup consistency, these patterns will save you hours every week and make disaster recovery predictable.

Friday, January 23, 2026

Oracle Database 23ai: Revolutionizing Data Distribution Across the Globe

A Journey Through Distributed Database Innovation with François Pons

📅 January 23, 2026 👤 Chetan Yadav - Oracle ACE Apprentice ⏱️ 10-15 min read

🌍 Oracle Globally Distributed Database - Global Scale, Local Performance


🎯 My Journey as an Oracle ACE Apprentice: Uncovering Database Innovation

When I first received my acceptance into the Oracle ACE Apprentice program, I knew I'd be diving deep into Oracle technologies. One of my initial tasks was to review and showcase product releases through demonstrations and write-ups. I chose to explore Oracle Database 23ai's Globally Distributed Database feature, and what I discovered genuinely surprised me.

This wasn't just another database update—this was a complete reimagining of how we think about data distribution, scalability, and geographic compliance. The presentation by François Pons, Senior Principal Product Manager at Oracle, opened my eyes to capabilities I didn't even know were possible in enterprise databases.

💡 Why This Matters: As part of my Oracle ACE Apprentice journey, I'm required to demonstrate Oracle product usage by submitting three demonstrations within the first 60 days. This deep dive into globally distributed databases represents one of those demonstrations, and it turned out to be far more inspiring than I initially expected.

🎤 What Makes This Presentation Stand Out

François Pons doesn't just walk through technical specifications; he tells a story about solving real business problems. From the moment he begins explaining distributed databases, you realize this technology addresses challenges that keep CTOs awake at night: how to scale infinitely, how to survive disasters, and how to comply with data sovereignty laws across multiple countries.

What struck me most was the elegance of the solution. Oracle hasn't just bolted on distributed capabilities to their existing database—they've fundamentally rethought how data can be spread across the globe while maintaining the full power of SQL and ACID transactions.

"All the benefits of a distributed database, without the compromises. Why settle for less?" - François Pons
Distributed Database Concept

Basic Distributed Database Architecture: Application connects to multiple shards

🧩 Understanding Distributed Databases: Breaking It Down

Let me share what I learned from this presentation in a way that makes sense, even if you're new to distributed database concepts.

The Core Concept

A distributed database stores data across multiple physical locations instead of keeping everything in one place. Think of it like having multiple bank branches instead of one central vault. Each location (called a "shard") stores a subset of your data, but applications interact with it as if it were a single, unified database.

The beauty? Your applications don't need to know where the data physically resides. Oracle handles all the complexity behind the scenes.
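To make the "applications don't need to know" point concrete, here is a minimal sketch using a hypothetical CUSTOMERS table sharded by CUST_ID (the table and column names are mine, not from the presentation). When the application supplies the sharding key, the driver routes the statement directly to the one shard that owns that key; without the key, the shard coordinator brokers the query across all shards:

```sql
-- Single-shard query: the sharding key (CUST_ID) is supplied,
-- so the request is routed directly to the owning shard.
SELECT first_name, last_name, balance
FROM   customers
WHERE  cust_id = :cust_id;

-- Cross-shard query: no sharding key, so the shard coordinator
-- fans the query out to every shard and aggregates the results.
SELECT country, COUNT(*)
FROM   customers
GROUP  BY country;
```

Either way, the SQL is ordinary Oracle SQL; the routing decision happens in the driver and coordinator, not in your application code.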

Why This Matters in 2026

François highlighted two primary use cases that resonated with me:

1️⃣ Ultimate Scalability and Survivability

When your application grows beyond what a single database can handle—even a powerful clustered database—distributed architecture becomes essential. Oracle's approach lets you scale horizontally by adding more shards, each potentially running on commodity hardware or in different cloud providers.

2️⃣ Data Sovereignty Compliance

With regulations like GDPR in Europe, data localization laws in China, and similar requirements worldwide, companies need to ensure specific data stays in specific geographic regions. Oracle's value-based sharding makes this straightforward: European customer data stays on European servers, American data stays in America, and so on.

Value-Based Sharding

Value-Based Sharding: Data distributed by geography for sovereignty compliance

🚀 The Technical Innovations That Impressed Me

Multiple Data Distribution Methods

Oracle doesn't force you into a one-size-fits-all approach. François explains four different distribution strategies:

  • Value-Based Sharding: Distribute data by specific values like country or product category. Perfect for data sovereignty requirements where you need to guarantee data residency.
  • System-Managed (Hash-Based) Sharding: Uses consistent hashing to evenly distribute data across shards. Ideal when you need balanced performance and don't have geographic constraints.
  • Composite Sharding: Combines value-based and hash-based approaches. For example, first distribute by country, then within each country distribute evenly across multiple shards by customer ID.
  • Duplicated Tables: Small, read-mostly reference tables can be duplicated across all shards to avoid cross-shard queries.

Replication Strategies: Where Innovation Shines

🆕 Raft-Based Replication (New in 23ai)

This is the game-changer François seemed most excited about. Based on the popular Raft consensus protocol, it provides:

  • Automatic failover in under 3 seconds
  • Zero data loss through synchronous replication
  • Active-active symmetric configuration where each shard accepts both reads and writes
  • No need to configure Data Guard or GoldenGate separately

⚡ Performance Note: The Raft implementation particularly impressed me because it addresses a common distributed database challenge: achieving both high availability and data consistency without complex manual configuration.
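For context on how little configuration this takes: Raft replication is selected when the shard catalog is created, rather than being layered on with Data Guard afterwards. A sketch, with assumed hostnames and credentials (treat the exact flags as illustrative and check the GDSCTL reference for your release):

```
# Create the shard catalog with Raft-based ("native") replication
# and a replication factor of 3. Host, service, and user are
# placeholders for your environment.
gdsctl create shardcatalog -database cathost:1521/catpdb \
       -user mygdsadmin/mypassword -repl native -repfactor 3
```

With native replication selected, replication units and their Raft leaders are created and balanced automatically as shards are added; there is no standby database or broker configuration to maintain.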

🌐 Deployment Flexibility: Oracle Meets You Where You Are

One aspect François emphasized that I found particularly practical: Oracle doesn't dictate your infrastructure choices. You can deploy shards:

  • On independent commodity servers (simple, low-cost)
  • On fault-tolerant RAC clusters (combining distributed and clustered architectures)
  • Across multiple clouds (OCI, AWS, Azure)
  • In hybrid on-premises and cloud configurations

💼 Real-World Use Cases

François showcased several application types already using Oracle Globally Distributed Database:

  • 📱 Mobile messaging platforms: Require massive scale and low latency worldwide
  • 💳 Payment processing: Needs transaction consistency and regulatory compliance
  • 🔍 Credit card fraud detection: Demands real-time processing across regions
  • 🌐 IoT applications: Like smart power meters generating enormous data volumes
  • 🖥️ Internet infrastructure: Supporting critical distributed services

🤖 The Autonomous Advantage

While François covered the core distributed database technology, he also highlighted Oracle Globally Distributed Autonomous Database, which adds automated management to eliminate operational complexity.

🎬 What the Demo Revealed

The live demonstration François provided showed just how straightforward the setup process has become. Using the Oracle Cloud interface, he walked through a map-based configuration screen where you simply click regions to place shards.

💡 My Key Takeaways as an ACE Apprentice

Key Insights

  • Oracle is solving real business problems, not just adding features. Every capability François described addresses actual challenges companies face when scaling globally.
  • The convergence of distributed and clustered architectures is powerful. You don't have to choose between RAC's local performance and sharding's global scale—you can have both.
  • Raft replication represents a significant step forward. Three-second automatic failover with zero data loss is exactly what distributed applications need.

🔮 Looking Forward: The Broader Implications

Multi-cloud becomes practical

When you can seamlessly deploy across OCI, AWS, and Azure in a single distributed database, you're no longer locked into one vendor's ecosystem.

Global applications become easier

Developers can focus on application logic rather than data distribution complexity.

📚 Resources and Next Steps

If you're interested in exploring Oracle Database 23ai's Globally Distributed Database further, I recommend:

  1. Watch François Pons's complete presentation on the Oracle Developers YouTube channel
  2. Visit oracle.com/database/distributed-database for comprehensive documentation
  3. Try the free tier on Oracle Cloud to experiment hands-on
  4. Review the Oracle 23ai documentation on Raft replication

📢 Found this helpful? Share it!

#OracleDatabase #Oracle23ai #DistributedDatabases #OracleACE #CloudDatabases #RaftReplication

About the Author


Chetan Yadav

Oracle ACE Apprentice | Senior Oracle & Cloud DBA

This blog post was created as part of my Oracle ACE Apprentice journey, where I'm exploring and demonstrating Oracle product innovations. The insights shared here come from my review of François Pons's excellent presentation on Oracle Database 23ai's Globally Distributed Database capabilities.

Connect & Learn More:
📊 LinkedIn Profile | 🎥 YouTube Channel