Oracle RAC Internals Explained: Cache Fusion and Cluster Design Lessons
Oracle Database: 19.18.0.0.0 Enterprise Edition • Cluster: 4-Node Oracle RAC on Oracle Linux 8.7
Storage: Oracle ASM, 12 TB shared (Normal Redundancy) • DB Size: 8.2 TB (6.8 TB data + 1.4 TB indexes)
Workload: Mixed OLTP/Batch • Peak Load: 3,200 concurrent sessions, 2,400 TPS
Interconnect: Dual 10GbE bonded private network • Application: Financial transaction processing system
3:47 AM. Pager alert: "RAC Node 2 evicted — cluster performance degraded." I logged into a surviving node running Oracle Database 19.18.0.0.0. The cluster had automatically failed over, but performance had collapsed. What should have been 2,400 transactions per second was now limping at 900 TPS.
I checked interconnect statistics immediately. The gc cr block receive time averaged 247 milliseconds — it should be under 1 millisecond. This wasn't a failed-node problem; this was network infrastructure failure. The private interconnect switch had undergone a firmware upgrade during the maintenance window. The new firmware version had a packet forwarding bug causing random 200ms+ delays in Cache Fusion block transfers. Applications were technically connected, but every cross-node block request was timing out and retrying. We initiated emergency failover to the DR site while network engineering rolled back the switch firmware.
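When interconnect latency is suspected, the per-instance average can be computed directly from gv$sysstat, where global cache time statistics are reported in centiseconds. A minimal triage sketch, assuming DBA access on a live cluster:

```sql
-- Average gc cr block receive time per instance, in milliseconds.
-- The statistic is in centiseconds, so multiply by 10 to get ms.
SELECT t.inst_id,
       ROUND(10 * t.value / NULLIF(b.value, 0), 2) AS avg_cr_receive_ms
FROM   gv$sysstat t
JOIN   gv$sysstat b ON b.inst_id = t.inst_id
WHERE  t.name = 'gc cr block receive time'
AND    b.name = 'gc cr blocks received'
ORDER  BY t.inst_id;
```

A healthy 10 GbE interconnect should report well under 1 ms here; the 247 ms we saw pointed straight at the network layer, not at Oracle.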
Oracle RAC is not just "multiple databases sharing storage." It's a distributed cache coherency system where every node maintains its own buffer cache, but all nodes must coordinate which version of each data block is current. Cache Fusion is the mechanism that makes this work — transferring blocks between nodes over the private interconnect instead of forcing disk writes. Understanding this is the difference between an operational RAC cluster and a ticking time bomb.
This guide covers real Oracle RAC internals: how Cache Fusion actually works, why interconnect design matters more than CPU, what causes split-brain scenarios, and the production lessons learned from managing RAC clusters that can't afford downtime.
- RAC Architecture Fundamentals: Beyond the Marketing
- Cache Fusion Explained: How Blocks Move Between Nodes
- Global Cache Services (GCS) and Global Enqueue Services (GES)
- Cluster Interconnect: The Most Critical Component
- Split-Brain Scenarios and Voting Disk Protection
- RAC Performance Tuning: What Actually Matters
- Real Production Failures and Lessons Learned
- When RAC Makes Sense (And When It Doesn't)
- FAQ
- Related Reading from Real Production Systems
1. RAC Architecture Fundamentals: Beyond the Marketing
Oracle RAC is sold as "high availability and scalability." Reality is more nuanced.
What RAC Actually Provides
| Capability | Reality | Common Misconception |
|---|---|---|
| High Availability | Survives single node failure | "Zero downtime" — not true during network failures |
| Scalability | Read scaling works well | "Linear scaling" — write workloads don't scale linearly |
| Load Balancing | Distributes connections | "Automatic query routing" — application must handle |
| Maintenance | Rolling patches possible | "No downtime patches" — some still require outage |
Core RAC Components
Every RAC cluster requires:
- Shared Storage: ASM or certified cluster filesystem — all nodes access the same datafiles
- Private Interconnect: Dedicated network for Cache Fusion messages (1 Gbps minimum, 10 Gbps+ recommended)
- Voting Disks: Quorum mechanism to prevent split-brain (typically 3 or 5)
- OCR (Oracle Cluster Registry): Cluster configuration database
- Clusterware: Grid Infrastructure managing node membership and resources
The queries in this article use dynamic performance views (v$ and gv$), which are available in all Oracle Database editions without additional licensing. Historical analysis with AWR and ASH requires the Oracle Diagnostics Pack license; in unlicensed environments, use Statspack (free) or the real-time v$ views instead.
Single Instance vs RAC: Architectural Differences
Single Instance:
- One SGA, one buffer cache
- No coordination overhead
- Simple lock management
- Straightforward troubleshooting
RAC Cluster:
- Multiple SGAs — one per node
- Cache Fusion coordination required
- Global lock management via GES
- Complex distributed troubleshooting
2. Cache Fusion Explained: How Blocks Move Between Nodes
Cache Fusion is Oracle's distributed shared cache architecture used in Oracle Real Application Clusters (RAC). It was fully introduced with Oracle RAC in Oracle 9i, replacing the disk-based block pinging architecture used in earlier Oracle Parallel Server (OPS) environments.
Instead of forcing modified blocks to be written to disk before another instance reads them, RAC transfers blocks directly between instance buffer caches over the private interconnect. This memory-to-memory block transfer dramatically reduces latency compared with disk-based synchronization.
The Problem Cache Fusion Solves
Without Cache Fusion (Oracle Parallel Server 8i architecture):
- Node 1 modifies block 1234567 in its buffer cache (8 KB block size)
- Node 2 requests the same block for a SELECT query
- Node 1 must flush redo via LGWR, then write the dirty block to shared storage via DBWR
- Node 2 reads the block from disk, waiting on the db file sequential read wait event
- Result: Forced disk I/O averaging 8–15 ms latency (the ping-pong effect)
- Scalability ceiling: 2–3 nodes maximum due to I/O contention
With Cache Fusion (Oracle 19.18.0.0.0 RAC):
- Node 1 holds dirty block 1234567 in buffer cache (current mode)
- Node 2 requests the block via Global Cache Services message
- GCS coordinates transfer — Node 1 identified as master for this resource
- Node 1 ships the block directly over the private interconnect (10 GbE)
- Transfer completes in 0.5–2.0 milliseconds (10x faster than disk)
- Node 2 receives the block in its buffer cache without disk I/O
- Result: Memory-to-memory transfer; disk write deferred until checkpoint
- Scalability: Proven deployments up to 16+ nodes in production
Cache Fusion Block Transfer Modes
Current Mode Block Transfer (gc current): When a session requests the most recent version of a block for UPDATE or DELETE operations, Oracle transfers the current mode block. In our 19.18.0.0.0 production RAC environment with 10 GbE interconnect, current mode transfers average 1.2 ms during peak load. If the block is dirty, the owning instance retains a past image (PI) for instance crash recovery purposes.
Consistent Read Mode Block Transfer (gc cr): For SELECT queries requiring read consistency, Oracle may construct consistent read (CR) versions of blocks using undo data. In our testing on Oracle 19.18.0.0.0, CR block transfers show slightly higher latency (1.5–2.0 ms average) because they may require block reconstruction from multiple undo records before transfer. The gc cr block receive time metric in v$system_event directly measures this latency.
Cache Fusion Wait Events in Oracle 19.18.0.0.0
| Wait Event | Description | Typical Latency | Production Impact |
|---|---|---|---|
| gc current block 2-way | Current block transfer between 2 instances | 0.5–2.0 ms (10 GbE); 3–8 ms (1 GbE) | Most common; acceptable if under 2 ms average |
| gc current block 3-way | Block transfer requiring 3-instance coordination (requester, holder, and resource master are different nodes) | 1.5–4.0 ms (10 GbE) | Higher cost; frequency grows with cluster size |
| gc cr block 2-way | Consistent read block constructed and transferred | 1.0–2.5 ms | Read-heavy workloads; check undo contention if high |
| gc current block busy | Waiting for an in-flight block transfer to complete | Variable | Hot-block contention; redesign needed if persistent |
| gc buffer busy acquire | Multiple sessions contending for the same buffer | Variable | Severe: indicates the same block is being modified by multiple nodes simultaneously |
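To see which of these events dominate on a live cluster, pull the cumulative wait profile from gv$system_event. A sketch (the FETCH FIRST clause requires 12c or later):

```sql
-- Top global cache wait events cluster-wide, with average wait in ms.
SELECT inst_id, event, total_waits,
       ROUND(time_waited_micro / NULLIF(total_waits, 0) / 1000, 2) AS avg_wait_ms
FROM   gv$system_event
WHERE  event LIKE 'gc%'
ORDER  BY time_waited_micro DESC
FETCH FIRST 10 ROWS ONLY;
```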
During peak batch processing at 11 PM, we observed gc current block 2-way latency spike to 12 ms (baseline 1.2 ms). Analysis revealed the batch job was performing mass updates on a single table with a right-growing index (order_id sequence). All four RAC instances were contending for the rightmost leaf block of the index.

Solution: We partitioned the index by range and implemented four separate sequences with CACHE 1000 NOORDER settings. Post-change, gc current latency returned to the 1.3 ms baseline, and batch completion time dropped from 4.2 hours to 2.8 hours.
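A fix along these lines can be sketched as DDL. The table and index names below are illustrative, not the production objects:

```sql
-- Spread inserts across many leaf blocks instead of one right-growing
-- hot block: a large-cache NOORDER sequence plus a hash-partitioned index.
CREATE SEQUENCE order_id_seq CACHE 1000 NOORDER;

CREATE INDEX orders_order_id_ix ON orders (order_id)
  GLOBAL PARTITION BY HASH (order_id) PARTITIONS 8;
```

Hash partitioning trades index range-scan locality for contention relief, which is usually the right trade on a write-hot RAC index.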
3. Global Cache Services (GCS) and Global Enqueue Services (GES)
GCS and GES are the coordination layers that make RAC work.
Global Cache Services (GCS)
Responsibilities:
- Tracks which node holds which blocks
- Maintains block ownership information
- Coordinates block transfers between nodes
- Manages cache coherency across the cluster
Global Enqueue Services (GES)
Responsibilities:
- Manages global enqueues across the RAC cluster
- Coordinates locking for shared database resources
- Ensures consistent lock state across all instances
- Maintains global enqueue structures for cluster coordination
Resource Mastering
Each resource (block, lock) has a master node responsible for coordinating access.
Master node responsibilities:
- Tracks current owner of the resource
- Grants access to requesting nodes
- Maintains resource state information
Remastering occurs when:
- A node joins or leaves the cluster
- Resource access patterns change significantly
- Manual remastering is triggered by DBA
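Current mastering assignments can be inspected with a query like the one below. Note that v$gcspfmaster_info is sparsely documented and its columns vary by release, so verify them on your version before relying on this:

```sql
-- Which instance masters each object, and how often it has been remastered.
SELECT data_object_id, current_master, previous_master, remaster_cnt
FROM   v$gcspfmaster_info
ORDER  BY remaster_cnt DESC
FETCH FIRST 10 ROWS ONLY;
```

A high remaster count on one object suggests its access pattern keeps shifting between nodes.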
4. Cluster Interconnect: The Most Critical Component
The interconnect is the most important part of RAC. If the interconnect fails, the cluster fails.
Interconnect Requirements
| Metric | Minimum | Recommended | Why It Matters |
|---|---|---|---|
| Bandwidth | 1 Gbps | 10+ Gbps | Cache Fusion throughput |
| Latency | < 5 ms | < 1 ms | Block transfer speed |
| Packet Loss | < 1% | < 0.1% | Message reliability |
| Redundancy | Single path | Bonded NICs | Failover capability |
Common Interconnect Problems
- Shared switches: Interconnect traffic mixed with public traffic
- Insufficient bandwidth: 1 Gbps is not enough for high-transaction workloads
- High latency: Geographic distance between nodes (> 1 ms)
- Single point of failure: One switch, one cable
Interconnect Design Best Practices
- Dedicated network: Separate from public and backup networks
- 10 Gbps minimum: For all production workloads
- Low-latency switches: Purpose-built for interconnect traffic
- NIC bonding: Redundant paths for automatic failover
- Jumbo frames: MTU 9000 for better throughput
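To confirm which interfaces Oracle is actually using for Cache Fusion (misconfiguration here is common after network changes), query gv$cluster_interconnects:

```sql
-- Interfaces each instance uses for the private interconnect.
-- SOURCE shows where the setting came from (e.g., OCR or an init parameter).
SELECT inst_id, name, ip_address, is_public, source
FROM   gv$cluster_interconnects
ORDER  BY inst_id;
```

If IS_PUBLIC is YES for any listed interface, interconnect traffic is riding the public network and needs to be fixed immediately.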
5. Split-Brain Scenarios and Voting Disk Protection
Split-brain is the nightmare scenario where a cluster partitions and both sides believe they are primary.
What is Split-Brain?
Consider a 3-node RAC cluster running normally. If a network partition occurs (interconnect fails), Node 1 can no longer reach Nodes 2 and 3. Both sides believe the other side has failed. Both sides attempt to become primary. If both sides write to shared storage simultaneously the result is data corruption.
How Voting Disks Prevent Split-Brain
Voting disks implement a quorum mechanism:
- Typically 3 or 5 voting disks are configured
- A node must access a majority of voting disks to survive
- With 3 voting disks, a node needs access to at least 2
- With 5 voting disks, a node needs access to at least 3
- The losing side evicts itself automatically — no manual intervention required
Node Eviction Process
When a node is evicted the following sequence occurs:
- Cluster detects node unresponsiveness (missed heartbeats)
- Voting disk quorum check fails for that node
- Clusterware initiates an immediate node reboot
- The instance crashes (immediate termination — no graceful shutdown)
- Surviving nodes perform instance recovery from redo logs
- Applications reconnect automatically to surviving nodes
6. RAC Performance Tuning: What Actually Matters
RAC tuning is different from single-instance tuning. The metrics that matter most are cluster-specific.
Key RAC-Specific Metrics
| Metric | Good Value | Problem Threshold | Action |
|---|---|---|---|
| GC CR block receive time | < 1 ms | > 5 ms | Check interconnect hardware |
| GC current block busy | < 1% of waits | > 5% of waits | Reduce hot blocks |
| Blocks received (per node) | Balanced across nodes | Skewed to one node | Fix application routing |
| Cache transfers | < 10% of reads | > 30% of reads | Partition data or workload |
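The per-node balance in the table above can be checked directly; heavily skewed values usually mean connection routing, not Cache Fusion, is the problem:

```sql
-- Global cache blocks received per instance: look for skew across nodes.
SELECT inst_id, name, value
FROM   gv$sysstat
WHERE  name IN ('gc cr blocks received', 'gc current blocks received')
ORDER  BY inst_id, name;
```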
Common RAC Performance Problems
1. Hot Blocks
A single block being accessed by multiple nodes simultaneously causes excessive Cache Fusion traffic. Solution: partition data, use sequences wisely, avoid right-growing indexes.
2. Unbalanced Load
One node handling 80% of the workload while others are underutilized. Solution: fix application-level connection distribution and service definitions.
3. Interconnect Saturation
Cache Fusion messages exceeding available bandwidth causes latency to increase dramatically. Solution: upgrade interconnect to 10 GbE or 25 GbE; reduce unnecessary block transfers through workload partitioning.
7. Real Production Failures and Lessons Learned
These are actual RAC incidents from production environments.
The network team upgraded switch firmware during the maintenance window. The new firmware had a bug causing random packet drops. The cluster detected missed heartbeats, and all four nodes evicted themselves simultaneously — complete cluster failure.

Lesson: Never trust network changes without extended interconnect testing. Always run ping and traceroute across the private interconnect for at least 30 minutes post-change before closing the maintenance window.
AWR showed high gc cr block receive time. The initial assumption was an interconnect problem, but deep investigation revealed storage latency of 50 ms — nodes were waiting for disk I/O, not Cache Fusion.

Lesson: Always check storage I/O latency before blaming RAC or the interconnect. Check v$filestat and storage-level metrics first.
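That storage check reduces to one query. READTIM is cumulative read time in centiseconds, so the ratio below yields average read latency in milliseconds:

```sql
-- Average read latency per datafile: rule out storage before blaming RAC.
SELECT file#,
       ROUND(10 * readtim / NULLIF(phyrds, 0), 2) AS avg_read_ms
FROM   v$filestat
ORDER  BY avg_read_ms DESC;
```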
The application used a single global sequence for order IDs, so every insert required global coordination across all nodes. This caused enq: SQ contention cluster-wide, capping throughput at 200 TPS against a target of 2,000+ TPS.

Lesson: RAC exposes bad application design immediately. Partition sequences per node, or use local sequences with offsets to eliminate global coordination overhead.
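The offset scheme can be sketched as follows (sequence names are illustrative). Each instance draws from its own sequence, and the interleaved START WITH / INCREMENT BY values keep the generated ID ranges disjoint:

```sql
-- One sequence per instance on a 4-node cluster; generated IDs never collide.
CREATE SEQUENCE ord_seq_i1 START WITH 1 INCREMENT BY 4 CACHE 1000 NOORDER;
CREATE SEQUENCE ord_seq_i2 START WITH 2 INCREMENT BY 4 CACHE 1000 NOORDER;
CREATE SEQUENCE ord_seq_i3 START WITH 3 INCREMENT BY 4 CACHE 1000 NOORDER;
CREATE SEQUENCE ord_seq_i4 START WITH 4 INCREMENT BY 4 CACHE 1000 NOORDER;
```

Each application tier node is then pinned to one sequence via its database service, so NEXTVAL never leaves the local SGA.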
8. When RAC Makes Sense (And When It Doesn't)
RAC is not a universal solution. It has specific use cases where it excels and others where it makes things worse.
Good Use Cases for RAC
- Read-heavy workloads: Reporting, analytics, read scaling
- High availability requirements: Cannot tolerate planned downtime for patches
- Partitioned workloads: Each node handles a different data subset
- Connection scaling: Need to support 10,000+ concurrent connections
Bad Use Cases for RAC
- Write-intensive OLTP: Cache Fusion overhead degrades write performance
- Single global sequences: Become cluster-wide bottlenecks immediately
- Budget-constrained environments: RAC requires expensive hardware and licensing
- Teams without RAC expertise: Troubleshooting requires deep knowledge
RAC Alternatives to Consider
| Requirement | RAC Solution | Alternative Solution |
|---|---|---|
| High Availability | RAC cluster | Data Guard with fast failover |
| Read Scaling | RAC nodes | Active Data Guard read replicas |
| Zero Downtime Patching | RAC rolling patch | Data Guard rolling upgrade |
| Connection Pooling | RAC load balancing | Application-level connection pool |